New feature request #422

Open
hktalent opened this issue Apr 27, 2023 · 0 comments
Labels
Type: Enhancement Most issues will probably ask for additions or changes.

hktalent commented Apr 27, 2023

@ehsandeep

Several new feature requests

  • Allow a setting to crawl all sites that belong to the same organization as the entry URL
  • Allow users to write more complex extractors that analyze and extract more useful information. For example:
  1. Image metadata, OCR text from images, AI classification and recognition of images, facial recognition (including age, facial expression, and gender), license plate recognition, etc.

  2. Based on static output, identify potential deserialization vulnerabilities in various programming languages, as well as other potential security issues

  • Cache all outputs and automatically skip URLs that were already crawled in earlier runs

  • Allow configuring full-page screenshots for pages that match given rules, for example pages recognized as containing forms

  • Allow configuring a search or storage backend, usually by setting a URL. The crawler would POST each result (URL, response status, response headers, response body) to the configured URL, which makes it easier to build a big-data search engine

  • Allow configuring JavaScript snippets that extract more structured data, for example to pull useful information from each page of https://mvnrepository.com/

  • Consider adding a few "anti-crawler" bypass strategies, such as crawling only the links that are visible on the page, to avoid traps deliberately planted by anti-crawler systems. A single request to such a trap may trigger anti-crawler firewall policies and cause the crawl to fail

  • For the option "start headless chrome with additional options": can you provide some demonstrations, such as disabling image loading, especially in headless mode? Disabling the loading of images and fonts can improve crawling efficiency, which is very important

# 
--disable-javascript
# https://stackoverflow.com/questions/55540694/how-to-disable-webrtc-in-chromium
# https://stackoverflow.com/questions/44599265/how-do-i-disable-webrtc-in-chrome-driver
--force-webrtc-ip-handling-policy=default_public_interface_only
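
To make the request concrete, here is a minimal sketch of the kind of switch list being asked about. It assumes the Chromium switches `--blink-settings=imagesEnabled=false` and `--disable-remote-fonts`; the helper name `chrome_flags` is hypothetical, and how the crawler would accept these switches is not confirmed in this issue.

```python
def chrome_flags(block_images=True, block_fonts=True):
    """Build a list of Chromium switches for a lighter headless crawl.

    Hypothetical helper for illustration only; it is not part of any crawler.
    """
    flags = []
    if block_images:
        # Blink setting that turns off image loading.
        flags.append("--blink-settings=imagesEnabled=false")
    if block_fonts:
        # Skips downloading web fonts referenced by the page.
        flags.append("--disable-remote-fonts")
    return flags


if __name__ == "__main__":
    # Print one switch per line, in the same style as the list above.
    print("\n".join(chrome_flags()))
```

Whether these switches have the intended effect in a given Chrome version and headless mode would need to be checked by the maintainers.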

  • When the option "system-chrome" is set to true, what are the risks to users?
  • Allow configuration to ignore invalid SSL certificates and continue crawling the page. Also, what are the risks to the client in that case? Can you provide an explanation? Thank you very much
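
The request above to cache outputs and skip already-crawled URLs across runs could look roughly like the following. This is only a sketch under stated assumptions: the `SeenStore` class and its SQLite file are hypothetical, not an existing feature of the crawler, and URLs are compared after dropping the `#fragment` only.

```python
import sqlite3
from urllib.parse import urldefrag


class SeenStore:
    """Persistent set of crawled URLs (hypothetical helper for illustration)."""

    def __init__(self, path=":memory:"):
        # Pass a file path to keep the set between runs; the default lives in memory.
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def mark_if_new(self, url):
        """Record the URL and return True only the first time it is seen."""
        key = urldefrag(url).url  # drop the #fragment before comparing
        cur = self.db.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (key,)
        )
        self.db.commit()
        # rowcount is 1 when the row was inserted, 0 when it already existed.
        return cur.rowcount == 1
```

A crawl loop would call `mark_if_new(url)` before each request and skip the URL when it returns `False`. Storing the response body alongside the URL would be a natural extension for the "cache all outputs" part.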
hktalent added the Type: Enhancement label Apr 27, 2023