graphistry/dots

Latest commit: 644ef4b · Apr 12, 2024 (96 commits)

Current Events Scraper & Featurizer

Using OpenSearch and Google News APIs, this tool pulls news stories and extracts features from the text. The features are then stored in a CSV file.

It can gather stories from multiple sources and languages. GNews maxes out at ~3000 stories per day, while OpenSearch has no limit. OpenSearch uses scroll and slice to pull a large number of stories.
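A minimal sketch of how a sliced scroll can page through a large result set. It is not taken from dots_feat.py: the function names, the index name, and the page size are hypothetical, and `client` is assumed to be an `opensearchpy.OpenSearch` instance (or anything exposing the same `search`, `scroll`, and `clear_scroll` methods).

```python
from typing import Any, Dict, Iterator, Optional


def slice_query(query: Dict[str, Any], slice_id: int, max_slices: int) -> Dict[str, Any]:
    """Build a search body restricted to one slice (assumes max_slices >= 2)."""
    return {"slice": {"id": slice_id, "max": max_slices}, "query": query}


def scroll_slice(
    client: Any,
    index: str,
    query: Dict[str, Any],
    slice_id: int,
    max_slices: int,
    page_size: int = 500,
    keep_alive: str = "2m",
    limit: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield hits from one slice, page by page, until the slice is exhausted."""
    body = slice_query(query, slice_id, max_slices)
    resp = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = resp.get("_scroll_id")
    seen = 0
    try:
        while True:
            hits = resp["hits"]["hits"]
            if not hits:
                break  # an empty page means this slice is done
            for hit in hits:
                yield hit
                seen += 1
                if limit is not None and seen >= limit:
                    return
            # Ask for the next page of the same scroll context.
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context, even on early exit.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Each slice is independent, so slices `0 .. max_slices - 1` can be consumed by separate workers and their results concatenated afterwards.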

Clone current version & run dots_feat.py

Requirements: pytest, pyarrow, spacy, python-dotenv, bs4, pandas, scikit-learn, transformers, torch, opensearch-py, requests, nltk, numpy, graphistry[umap-learn], umap-learn, validators, pytesseract, selenium, webdriver_manager, undetected_chromedriver, gliner
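Before running the script, it can help to check which of these packages are importable. The helper below is a hypothetical convenience, not part of the repository; it only maps the few distribution names whose import name differs.

```python
from importlib.util import find_spec

# Distribution name -> import name, for packages where the two differ.
IMPORT_NAMES = {
    "python-dotenv": "dotenv",
    "scikit-learn": "sklearn",
    "opensearch-py": "opensearchpy",
    "umap-learn": "umap",
}


def missing_modules(packages):
    """Return the packages whose top-level module cannot be found."""
    missing = []
    for pkg in packages:
        module = IMPORT_NAMES.get(pkg, pkg)
        if find_spec(module) is None:
            missing.append(pkg)
    return missing


if __name__ == "__main__":
    required = [
        "pytest", "pyarrow", "spacy", "python-dotenv", "bs4", "pandas",
        "scikit-learn", "transformers", "torch", "opensearch-py", "requests",
        "nltk", "numpy", "graphistry", "umap-learn", "validators",
        "pytesseract", "selenium", "webdriver_manager",
        "undetected_chromedriver", "gliner",
    ]
    print("missing:", missing_modules(required))
```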

The example below will pull 100 OS GNews stories and write the features for each, in addition to location and date, to a file:

    git clone https://github.com/graphistry/dots
    python dots/dots_feat.py -n 100 -e 0 -d 0 -o dots_drba_feats.csv
    python dots/dots_feat.py -n 100 -e 1 -d 0 -o dots_gpy_feats.csv  
    python dots/dots_feat.py -n 100 -e 2 -d 0 -o dots_glnr_feats.csv  

Sample output (location and date, followed by the extracted features):

"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
"'Miseno, Campania, Italy', '16-01-2024', ","['disasters', 'mount vesuvius', 'ancient cataclysm', 'costruzione', 'beach']"
"'Clarendon, Clarendon, Jamaica', '16-01-2024', ","['new bowen', 'fight', 'whatsapp', 'st catherine', 'jamaica']"
"'Philadelphia, Pennsylvania, United States', '16-01-2024', ","['meteorologists', 'snow shovels', 'snowstorm', 'accuweather alerts', 'accuweather meteorologists']"
"'New Bedford, Massachusetts, United States', '16-01-2024', ","['massachusetts law', 'saturday', 'ariel dorsey', 'traffic', 'united states']"
"'Corofin, Clare, Ireland', '16-01-2024', ","['emergency services', 'breathing', 'rescue service', 'firefighters', 'afternoon']"
"'United States', '16-01-2024', ","['preparedness', 'earthquake', 'quake', 'morning', 'disaster']"
"'Syria', '16-01-2024', ","['neighboring countries', 'early recovery', 'cholera', 'symptom', 'mohamad katoub']"
"'Iceland', '16-01-2024', ","['lava flows', 'evacuation', 'eruptions', 'jóhannesson', 'lúðvík pétursson']"
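Rows like the ones above can be read back into Python structures. The sketch below is an illustration, not code from the repository: it assumes each row has exactly two quoted CSV fields as in the sample, and that dates are day-month-year (inferred from `16-01-2024`). `parse_row` is a hypothetical helper.

```python
import ast
import csv
import io
from datetime import datetime

# Two rows copied from the sample output above.
SAMPLE = '''\
"'Gaza Strip', '16-01-2024', ","['neighborhoods', 'rebels', 'widespread famine', 'egypt', 'disease']"
"'Miseno, Campania, Italy', '16-01-2024', ","['disasters', 'mount vesuvius', 'ancient cataclysm', 'costruzione', 'beach']"
'''


def parse_row(row):
    """Turn one two-field CSV row into a dict with location, date, features."""
    meta, feats = row
    # The first field looks like "'<location>', '<date>', "; wrapping it in
    # parentheses makes it a tuple literal that literal_eval can read safely.
    location, day = ast.literal_eval("(" + meta.strip() + ")")
    return {
        "location": location,
        "date": datetime.strptime(day, "%d-%m-%Y").date(),  # assumed DD-MM-YYYY
        "features": ast.literal_eval(feats),
    }


records = [parse_row(r) for r in csv.reader(io.StringIO(SAMPLE)) if r]
for rec in records:
    print(rec["location"], rec["date"].isoformat(), rec["features"])
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain Python literals.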

An example is produced every day via GitHub Actions, which parses GNews stories and extracts features: Feature Table and Full Table.
