Extract Text

class src.openCHA.tasks.extract_text.ExtractText(*, name: str = 'extract_text', chat_name: str = 'ExtractText', description: str = 'Extract all the text on the current webpage', dependencies: List[str] = [], inputs: List[str] = ['url to extract the text from. It requires links which is gathered from other tools. Never provide urls on your own.'], outputs: List[str] = ['An string containing the text of the scraped webpage.'], datapipe: DataPipe = None, output_type: bool = False, return_direct: bool = False, sync_playwright: Any = None, high_level: Any = None, bs4: Any = None)

Description:

This task extracts all the text from the current webpage.

_execute(inputs: List[Any]) → str

Execute the ExtractText task.

Parameters:

inputs (List[Any]) – The task inputs. Per the class signature, this is the URL to extract the text from.

Returns:

The extracted text from the current webpage.

Return type:

str

Raises:

ValueError – If the synchronous browser is not provided.
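
To make the idea of "extracting all the text" concrete, here is a minimal sketch that collects the visible text from an HTML string. It is not the ExtractText implementation: the class signature suggests Playwright and BeautifulSoup are involved, whereas this stand-in uses only the standard library's html.parser. The names `_TextCollector` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text nodes of `html`, joined by single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(extract_text(page))
```

In the real task the HTML would come from the browser page rather than a string literal; that step is left out here because the source does not describe it.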

classmethod check_acheck_bs_importrgs(values: dict) → dict

Check that the arguments are valid.

Parameters:

values (Dict) – The current attribute values.

Returns:

The updated attribute values.

Return type:

Dict

Raises:

ImportError – If ‘beautifulsoup4’, ‘lxml’, or ‘pdfminer’ packages are not installed.
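
A dependency check of this kind can be sketched as below. This is an illustration under assumptions, not the library's code: the helper names `missing_modules` and `check_imports` are hypothetical, and the import names `bs4`, `lxml`, and `pdfminer` are assumed to correspond to the packages listed above.

```python
import importlib.util
from typing import Dict, Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


def check_imports(
    values: Dict,
    module_names: Iterable[str] = ("bs4", "lxml", "pdfminer"),  # assumed import names
) -> Dict:
    """Raise ImportError if any module is missing; otherwise return values unchanged."""
    missing = missing_modules(module_names)
    if missing:
        raise ImportError(
            "Missing required packages: " + ", ".join(missing)
        )
    return values
```

Using `importlib.util.find_spec` checks availability without actually importing the packages, which keeps the validation step cheap.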

explain() → str

Explain the ExtractText task.

Returns:

A brief explanation of the ExtractText task.

Return type:

str

validate_url(url)

Validate a URL by checking that its scheme is either ‘http’ or ‘https’.

Parameters:

url (str) – The URL to be validated.

Returns:

The validated URL.

Return type:

str

Raises:

ValueError – If the URL scheme is not ‘http’ or ‘https’.
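
The scheme check described above can be sketched as a standalone function. This is a minimal illustration assuming `urllib.parse.urlparse` is used for parsing; the actual method body and error message may differ.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return `url` unchanged if its scheme is http or https, else raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        # The error message wording is an assumption.
        raise ValueError("URL scheme must be 'http' or 'https'")
    return url
```

For example, `validate_url("https://example.com/page")` returns the URL as given, while `validate_url("ftp://example.com")` raises ValueError.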