University project on analysing news media in Russia before COVID-19 and in the pandemic period

dstsimokha/covid-news

News Scraper

Code for the university project on scraping news for topic modelling.

How to use

  1. This code gathers the title, time and text of news articles without major modifications to the code itself.
  2. Put all URLs in the sitemaps folder as .csv files with , as the delimiter: the first line is a header named 'url', and every following line is a single URL.
  3. In the settings.json file map:
    • css-selectors for scraping what you want from the site
    • cleaning techniques for deleting html-/css-tags and any other garbage from the text
  4. Set the folder with this code as the working directory and run python scraper.py --test sitename url to test the css-selectors and cleaning techniques you mapped.
  5. After that, run python scraper.py sitename for basic parsing with the requests package, or python scraper.py --selenium sitename for more complex parsing with the selenium package (it silently runs a browser window and behaves more like a human, which sometimes helps).
  6. ...
  7. PROFIT
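
The sitemap format from step 2 can be produced with a few lines of Python. This is a minimal sketch rather than part of the project: the helper names write_sitemap and read_sitemap and the file name used below are hypothetical, and it assumes only what step 2 states (a 'url' header followed by one URL per line).

```python
import csv
from pathlib import Path


def write_sitemap(urls, path):
    """Write URLs to a CSV file: a 'url' header, then one URL per line."""
    path = Path(path)
    # Create the target folder (e.g. sitemaps/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        for url in urls:
            # csv.writer quotes a URL that itself contains a comma.
            writer.writerow([url])


def read_sitemap(path):
    """Read the 'url' column back, skipping rows without a URL."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [row["url"] for row in csv.DictReader(f) if row.get("url")]
```

For example, write_sitemap(["https://example.com/news/1"], "sitemaps/example.csv") creates a file that read_sitemap("sitemaps/example.csv") reads back as a list with that one URL. How the scraper maps sitename to a file in the sitemaps folder is not stated here, so check scraper.py before choosing a file name.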
