New feature request #422

hktalent · 2023-04-27T08:50:14Z

Several new feature requests

Allow a setting to crawl all sites that belong to the same organization as the entry URL
Allow users to write more, and more complex, extractors to analyze and extract more useful information. For example:

Image metadata, OCR text from images, AI classification and recognition of images, facial recognition in images (including age, facial expression, gender), license plate recognition in images, etc.
Based on static output, identify potential deserialization vulnerabilities in various programming languages, as well as other potential security issues
Cache all output and automatically skip URLs that were already crawled in earlier runs
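
To make the caching request concrete, here is a minimal sketch of what "skip URLs that were already crawled" could look like. Everything in it is hypothetical and only for illustration: the `CrawlCache` class, the `crawl` helper, the `fetch` callback and the one-JSON-file-per-URL layout are assumptions, not anything the crawler provides today.

```python
import hashlib
import json
import os
import tempfile


class CrawlCache:
    """Hypothetical on-disk cache: one JSON file per crawled URL."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        # Hash the URL so it is always safe to use as a file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def seen(self, url):
        return os.path.exists(self._path(url))

    def store(self, url, status, body):
        with open(self._path(url), "w", encoding="utf-8") as fh:
            json.dump({"url": url, "status": status, "body": body}, fh)

    def load(self, url):
        with open(self._path(url), encoding="utf-8") as fh:
            return json.load(fh)


def crawl(urls, cache, fetch):
    """Fetch only URLs missing from the cache; return the URLs actually fetched."""
    fetched = []
    for url in urls:
        if cache.seen(url):
            continue  # crawled in an earlier run, so skip it
        status, body = fetch(url)
        cache.store(url, status, body)
        fetched.append(url)
    return fetched


if __name__ == "__main__":
    cache = CrawlCache(tempfile.mkdtemp())
    fake_fetch = lambda url: (200, "body of " + url)  # stand-in for a real request
    print(crawl(["https://example.com/a"], cache, fake_fetch))
    print(crawl(["https://example.com/a", "https://example.com/b"], cache, fake_fetch))
```

Because the cache lives on disk, a second run pointed at the same directory would skip everything stored by the first run.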

Allow configuring full-page screenshots of pages that match a rule, for example pages recognized as containing forms
Allow configuring a search or storage engine, usually by setting a URL. The crawler would POST each result (URL, response status, response headers, response body) to the configured URL, which makes it easier to build a big-data search engine
Allow configuring JavaScript snippets that extract more structured data, for example to extract useful information from each page of https://mvnrepository.com/
Consider adding a few "anti-crawler" bypass strategies, such as crawling only the links that are visible on the page, to avoid traps deliberately planted by anti-crawler systems. A single request to such a trap may trigger anti-crawler firewall policies and cause the crawl to fail
For the option "start headless chrome with additional options", can you provide some examples, such as disabling image loading, especially in headless mode? Disabling images and fonts can improve crawl efficiency, which is very important

# 
--disable-javascript
# https://stackoverflow.com/questions/55540694/how-to-disable-webrtc-in-chromium
# https://stackoverflow.com/questions/44599265/how-do-i-disable-webrtc-in-chrome-driver
--force-webrtc-ip-handling-policy=default_public_interface_only
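
As a rough illustration of the kind of example being asked for, the sketch below assembles a value for the additional-options setting. It rests on assumptions: that the option accepts a comma-separated list of Chrome switches, and that `--blink-settings=imagesEnabled=false` (a commonly used Chromium switch) still disables image loading in the Chrome version in use. The helper name `build_headless_options` is hypothetical. I am not aware of an equivalent switch for fonts, so blocking fonts would probably need request interception instead.

```python
def build_headless_options(disable_images=True, extra=()):
    """Return a comma-separated string of Chrome switches (assumed format)."""
    flags = []
    if disable_images:
        # Assumption: this switch disables image loading; verify it against
        # the Chrome version that the crawler actually launches.
        flags.append("--blink-settings=imagesEnabled=false")
    # Append any further switches the caller wants, in the order given.
    flags.extend(extra)
    return ",".join(flags)


if __name__ == "__main__":
    print(build_headless_options(
        extra=["--force-webrtc-ip-handling-policy=default_public_interface_only"]
    ))
```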

What are the risks to users of setting the option "system-chrome" to true?
Allow a configuration option to ignore invalid SSL certificates and continue crawling the page. Also, what are the risks to the client? Can you provide an explanation? Thank you very much
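
For context on the SSL request, this is roughly what "ignore invalid SSL" means at the client level, shown with Python's standard `ssl` module rather than the crawler's own code. The `make_context` helper and its `ignore_invalid_certs` parameter are hypothetical names. The main risk is that, with verification off, the client can no longer tell the real server from an impostor, so responses may be read or altered by a man-in-the-middle.

```python
import ssl


def make_context(ignore_invalid_certs=False):
    """Build a TLS context; optionally skip certificate verification."""
    ctx = ssl.create_default_context()
    if ignore_invalid_certs:
        # Hostname checking must be turned off before verify_mode can be
        # set to CERT_NONE, otherwise the ssl module raises ValueError.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    strict = make_context()
    relaxed = make_context(ignore_invalid_certs=True)
    print(strict.verify_mode, relaxed.verify_mode)
```

Keeping verification on by default and making the relaxed mode an explicit opt-in would limit the exposure to users who knowingly accept it.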

hktalent added the "Type: Enhancement" label on Apr 27, 2023
