Extract Text

class src.openCHA.tasks.extract_text.ExtractText(*, name: str = 'extract_text', chat_name: str = 'ExtractText', description: str = 'Extract all the text on the current webpage', dependencies: List[str] = [], inputs: List[str] = ['url to extract the text from. It requires links which is gathered from other tools. Never provide urls on your own.'], outputs: List[str] = ['An string containing the text of the scraped webpage.'], datapipe: DataPipe = None, output_type: bool = False, return_direct: bool = False, sync_playwright: Any = None, high_level: Any = None, bs4: Any = None)

Description:

This task extracts all the text from the current webpage.

_execute(inputs: List[Any]) → str

Execute the ExtractText task.

Parameters: input (str) – The input parameter for the task.
Returns: The extracted text from the current webpage.
Return type: str
Raises: ValueError – If the synchronous browser is not provided.

classmethod check_acheck_bs_importrgs(values: dict) → dict

Check that the arguments are valid.

Parameters: values (Dict) – The current attribute values.
Returns: The updated attribute values.
Return type: Dict
Raises: ImportError – If the ‘beautifulsoup4’, ‘lxml’, or ‘pdfminer’ packages are not installed.
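
The dependency check described above can be illustrated with a minimal sketch. This is not the openCHA implementation: the helper names `find_missing_packages` and `require_packages` are hypothetical, and the mapping from import names to pip package names is passed in by the caller as an assumption.

```python
import importlib.util
from typing import Dict, List


def find_missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names of packages whose top-level module cannot be found.

    `required` maps an importable module name to the pip package that
    provides it (hypothetical helper, for illustration only).
    """
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


def require_packages(required: Dict[str, str]) -> None:
    """Raise ImportError listing every missing package, if any."""
    missing = find_missing_packages(required)
    if missing:
        raise ImportError(
            "Missing required packages: "
            + ", ".join(missing)
            + ". Install them with pip."
        )


# Example call (assumed import names; raises ImportError if any is absent):
# require_packages({"bs4": "beautifulsoup4", "lxml": "lxml", "pdfminer": "pdfminer"})
```

Collecting all missing packages before raising lets the caller see every unmet dependency in a single error message.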

explain() → str

Explain the ExtractText task.

Returns: A brief explanation of the ExtractText task.
Return type: str

validate_url(url)

This method validates a given URL by checking if its scheme is either ‘http’ or ‘https’.

Parameters: url (str) – The URL to be validated.
Returns: The validated URL.
Return type: str
Raises: ValueError – If the URL scheme is not ‘http’ or ‘https’.
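
The scheme check described above can be sketched with the standard library. This is an assumption about how such a check might look, not the actual openCHA code; the standalone function name `validate_url_sketch` is hypothetical.

```python
from urllib.parse import urlparse


def validate_url_sketch(url: str) -> str:
    """Return `url` unchanged if its scheme is http or https.

    Hypothetical stand-in for ExtractText.validate_url, for illustration only.
    """
    # urlparse lowercases the scheme, so "HTTP://..." is accepted as well.
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError("URL scheme must be 'http' or 'https'")
    return url
```

Under this sketch, `validate_url_sketch("https://example.com/page")` returns the URL unchanged, while `validate_url_sketch("ftp://example.com/file")` raises `ValueError`.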