Logo: Craiyon AI-generated image of a ninja holding a paint scraper.
A social media scraper that attempts to be stealthy by using GUI automation. It starts a Chrome browser and actually moves the mouse and keyboard, so a VM can be used to run it in the background. Mouse movement uses Realistic-Mouse. Currently, it relies on select all + copy to get the HTML content out of the browser. Eventually, I want to use a plugin and also support undetected_chromedriver. The idea is that plugins are harder (or impossible) to detect compared to something like undetected_chromedriver, which is a cat-and-mouse game of detection.
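As a rough illustration of the select all + copy idea (a minimal sketch, not the project's actual code, which also drives the mouse via Realistic-Mouse), a pyautogui + pyperclip version might look like the following. It assumes a focused Chrome window on Windows, and opens the view-source tab (Ctrl+U) so that select-all captures raw HTML rather than rendered text:

import time
import pyautogui
import pyperclip

def grab_page_html():
    # Assumes Chrome is in the foreground with the target page loaded.
    pyautogui.hotkey('ctrl', 'u')   # open Chrome's view-source tab
    time.sleep(2)                   # wait for the source to render
    pyautogui.hotkey('ctrl', 'a')   # select all
    pyautogui.hotkey('ctrl', 'c')   # copy to clipboard
    time.sleep(0.5)                 # give the clipboard time to update
    pyautogui.hotkey('ctrl', 'w')   # close the source tab
    return pyperclip.paste()        # read the copied HTML

Because everything goes through real keyboard and clipboard events, there is no webdriver fingerprint for the page to detect.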
Supports:
- LinkedIn Employees
- LinkedIn posts
- Twitter posts
- YouTube channel video URLs + descriptions
- Facebook (post text only; text-based scraping)
Currently, only Chrome on Windows is supported.
Install:
pip install -r requirements.txt
Help:
python main.py -h
usage: main.py [-h] [-u URL] [-U URL_FILE] [-o OUTFILE] [-b BROWSER_PATH]
               [-m {linkedin_posts,linkedin_employees,youtube_vids,facebook_posts_txt,twitter_tweets}]
               [-t {txt,json}] [-usr USR] [-pwd PWD] [-w] [-l LIMIT]

Social media scraper.

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     Url to scrape.
  -U URL_FILE, --url-file URL_FILE
                        File of urls to scrape. Comment out lines with # to ignore them.
  -o OUTFILE, --out OUTFILE
                        File to dump output. If blank, output is printed.
  -b BROWSER_PATH, --browser_path BROWSER_PATH
                        Path to browser. Currently, only Chrome is supported.
  -m {linkedin_posts,linkedin_employees,youtube_vids,facebook_posts_txt,twitter_tweets}, --module {linkedin_posts,linkedin_employees,youtube_vids,facebook_posts_txt,twitter_tweets}
                        Which module to use for scraping.
  -t {txt,json}, --out-type {txt,json}
                        Output raw JSON or text.
  -usr USR, --username USR
                        Username to use for login.
  -pwd PWD, --password PWD
                        Password to use for login.
  -w, --wait-auth       Launches a browser and waits for you to log in. Press any key to continue.
  -l LIMIT, --limit LIMIT
                        Max number of scroll iterations to go through.
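A url file for -U holds one url per line, with # marking lines to skip. A hypothetical example:

# company pages to scrape
https://www.linkedin.com/company/procter-and-gamble
# https://www.linkedin.com/company/example   (commented out, skipped)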
Example: launch a browser and wait for manual login (-w), then scrape 3 pages of LinkedIn employees (30 total):
python3 main.py -u 'https://www.linkedin.com/company/procter-and-gamble' -m linkedin_employees -o out.json -l 3 -w
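Or, assuming a file named urls.txt (hypothetical) containing tweet urls, scrape them and dump raw JSON:
python3 main.py -U urls.txt -m twitter_tweets -t json -o tweets.json -l 5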