webextractor
package name: pyf.components.producers.webextractor
“webextractor” plugin
- class pyf.components.producers.webextractor.WebExtractor(config_node, process_name)
A producer that takes URLs and outputs items selected with XPath expressions.
- Available configuration:
- “advanced” (label: Advanced): Compound key (each sub key is an individual tag)
- “separate_process” (label: Separate Process): boolean
- “name” (label: Name): Simple key/value (text-based)
unique name
- “start_urls” (label: Start_Urls): Key with repeated start_url content (default: “[‘’]”)
- Key contains repeated items “start_url”:
- “start_url” (label: Start_Url): Simple key/value (text-based)
- “item_selector” (label: Individual item XPath): Simple key/value (text-based)
ex. ‘//ul[1]/li’
- “fields” (label: Fields): Key with repeated field content (default: “[{‘xpath’: ‘’, ‘name’: ‘’}]”)
- Key contains repeated items “field”:
- “field” (label: Field): Compound key (“xpath” key is the text content of the node)
- “name” (label: Attribute): input
Target attribute
- “xpath” (label: Field XPath): input
Path to search (ex. “p/*/text()”)
- “link_selector” (label: Other pages urls xpath): Simple key/value (text-based)
ex. “p[@id='links']/a/@href” (optional)
- “url_base” (label: Base url for links): Simple key/value (text-based)
ex. “http://www.example.com/”
- “page_limit” (label: Limit to N Pages): Simple key/value (text-based) (default: “10”)
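How `item_selector`, `fields`, `link_selector`, `url_base` and `page_limit` fit together can be sketched with the standard library alone. This is an illustration under assumptions, not the plugin's implementation. The functions `extract_items` and `resolve_links`, the sample page and the configuration values are all hypothetical. `xml.etree.ElementTree` supports only a subset of XPath (no `text()` or `@href` steps), so the paths below select elements and read their text or attributes afterwards. Capping the followed links at `page_limit` is also an assumption about how the limit is applied.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page content; a real run would fetch each start_url instead.
PAGE = """
<html><body>
  <ul>
    <li><b>First</b><i>10</i></li>
    <li><b>Second</b><i>20</i></li>
  </ul>
  <p id="links"><a href="page2.html">next</a><a href="page3.html">last</a></p>
</body></html>
""".strip()

# Mirrors the configuration keys documented above (all values are made up).
CONFIG = {
    "item_selector": ".//ul[1]/li",
    "fields": [
        {"name": "title", "xpath": "b"},
        {"name": "price", "xpath": "i"},
    ],
    "link_selector": ".//p[@id='links']/a",
    "url_base": "http://www.example.com/",
    "page_limit": "10",
}


def extract_items(root, config):
    """Yield one dict per node matched by item_selector."""
    for node in root.findall(config["item_selector"]):
        item = {}
        for field in config["fields"]:
            # Each field path is evaluated relative to the item node.
            match = node.find(field["xpath"])
            item[field["name"]] = match.text if match is not None else None
        yield item


def resolve_links(root, config):
    """Return absolute URLs of further pages, capped at page_limit (assumed)."""
    limit = int(config["page_limit"])  # the key is text-based, so convert it
    hrefs = [a.get("href") for a in root.findall(config["link_selector"])]
    return [urljoin(config["url_base"], h) for h in hrefs if h][:limit]


root = ET.fromstring(PAGE)
items = list(extract_items(root, CONFIG))
links = resolve_links(root, CONFIG)
```

Each item is a dict keyed by the `name` of every configured field, and relative links are resolved against `url_base` before they would be fetched.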
- launch(progression_callback=None, message_callback=None, params=None)
Extracts the data from a file using the passed descriptor. If params contains a data item, its contents are yielded directly.
Available keys in the params dict:
- data: if provided, the lines in data are iterated over and yielded.
- descriptor: the descriptor used to read the data.
- source: a file-like object used as the data source.
- source_filename: the name of a file used as the data source.
Requires the source_encoding config key.
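The `data` shortcut described for `launch` can be sketched as a plain generator. `launch_sketch` is a hypothetical stand-in, not the real method. It models only the documented behaviour of yielding the lines in `data`, and leaves out the callbacks and the `descriptor`, `source` and `source_filename` handling.

```python
def launch_sketch(params=None):
    """Hypothetical stand-in for launch(); only the `data` shortcut is modelled."""
    params = params or {}
    data = params.get("data")
    if data is not None:
        # "data" provided: iterate over its lines and yield them unchanged.
        for line in data:
            yield line
    # Reading from "source" or "source_filename" via "descriptor"
    # is deliberately not sketched here.


lines = list(launch_sketch({"data": ["first line", "second line"]}))
empty = list(launch_sketch())
```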