Extract Text
- class src.openCHA.tasks.extract_text.ExtractText(*, name: str = 'extract_text', chat_name: str = 'ExtractText', description: str = 'Extract all the text on the current webpage', dependencies: List[str] = [], inputs: List[str] = ['url to extract the text from. It requires links which is gathered from other tools. Never provide urls on your own.'], outputs: List[str] = ['An string containing the text of the scraped webpage.'], datapipe: DataPipe = None, output_type: bool = False, return_direct: bool = False, sync_playwright: Any = None, high_level: Any = None, bs4: Any = None)
Description:
This task extracts all the text from the current webpage.
- _execute(inputs: List[Any]) -> str
Execute the ExtractText task.
- Parameters:
inputs (List[Any]) – The input parameters for the task.
- Returns:
The extracted text from the current webpage.
- Return type:
str
- Raises:
ValueError – If the synchronous browser is not provided.
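A minimal sketch of the kind of text extraction described above, assuming the page HTML has already been fetched. It uses the standard library `html.parser` as a stand-in for the browser and BeautifulSoup stack, so it is not the openCHA implementation; `extract_visible_text` and `_TextCollector` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse internal whitespace so the output is one clean string.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Hypothetical helper: return the text of an HTML document as one string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)
```

For example, `extract_visible_text("<p>Hello <b>world</b></p>")` returns `"Hello world"`.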
- classmethod check_acheck_bs_importrgs(values: dict) -> dict
Check that the arguments are valid.
- Parameters:
values (Dict) – The current attribute values.
- Returns:
The updated attribute values.
- Return type:
Dict
- Raises:
ImportError – If ‘beautifulsoup4’, ‘lxml’, or ‘pdfminer’ packages are not installed.
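The dependency check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator's actual body: `find_missing` and `check_required_packages` are hypothetical helpers, and the mapping from package names to importable module names (`beautifulsoup4` to `bs4`, and so on) is an assumption about how those packages are installed.

```python
from importlib.util import find_spec
from typing import Dict, List

# Assumed mapping: package name as listed in the docs -> importable module name.
REQUIRED: Dict[str, str] = {
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pdfminer": "pdfminer",
}


def find_missing(required: Dict[str, str]) -> List[str]:
    """Return the package names whose module cannot be located."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


def check_required_packages(required: Dict[str, str] = REQUIRED) -> None:
    """Raise ImportError naming every package that is not installed."""
    missing = find_missing(required)
    if missing:
        raise ImportError(
            "Missing required packages: " + ", ".join(missing)
        )
```

Checking with `find_spec` avoids importing the modules just to test for their presence, and reporting all missing packages at once spares the user repeated install-and-retry cycles.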