Don't fear the scReaper
Niche market focus: Bearings
We identify the list of companies active in this field.
We will identify the list of products associated with each seller.
We will apply relation extraction, and design the platform to ask people where they sell, so that we can make similar recommendations.
- Record all contacts and companies that could be useful on HubSpot
- Record all TODOs on this Google Sheet
Treat this as an ontology learning problem
Create a database of all documents, then you can apply tf-idf for document similarity
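A minimal sketch of tf-idf document similarity, using only the standard library. The whitespace tokenizer, the smoothed idf formula and the sample documents are assumptions for illustration, not part of the project code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
docs = [
    "deep groove ball bearings",
    "ball bearings and roller bearings",
    "cloud hosting invoices",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

With the documents stored in a database, the same vectors could be computed in batches and persisted next to each document.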
Create a database of all edges between entities, entity classifications and entity links. You can then apply supervised learning (e.g. graph neural networks) to infer structure on such websites. You would need a manually crafted website for this.
Each node can have PoS, NER information, and any other NLP information
Use a node as a named entity only if it occurs multiple times and is classified with a specific NER label.
Only classify text of a certain length as an entity.
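The frequency and length rules could be sketched as a simple filter over raw NER output. The thresholds and the `(text, label)` input shape are assumptions; the notes do not fix concrete values.

```python
from collections import Counter

# Assumed thresholds, to be tuned on real data.
MIN_OCCURRENCES = 2
MIN_LENGTH = 3
MAX_LENGTH = 40


def filter_entities(mentions, min_occurrences=MIN_OCCURRENCES,
                    min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    """Keep (text, label) pairs that recur and whose text length is in range.

    mentions: iterable of (text, label) pairs produced by any NER model.
    Returns a dict mapping each kept (text, label) pair to its count.
    """
    counts = Counter((text.strip(), label) for text, label in mentions)
    return {
        (text, label): n
        for (text, label), n in counts.items()
        if n >= min_occurrences and min_length <= len(text) <= max_length
    }


# Hypothetical mentions collected from several pages.
mentions = [
    ("SKF", "ORG"), ("SKF", "ORG"),   # recurs with the same label: kept
    ("Timken", "ORG"),                # seen only once: dropped
    ("a", "ORG"), ("a", "ORG"),       # recurs but too short: dropped
]
print(filter_entities(mentions))
```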
Save all extracted entities into a database, and then do big data operations on this (do a script-first implementation)
On this edge-set apply matrix factorization techniques to induce hidden relations
index all websites around key entities (organizations, brands, products)
-> how to determine a key entity
entities, attributes, relationships, trends, events, bare analytics, predictions over time, connect them
1.3T documents, 4-5 biggest elasticsearch cluster
recommender engines, semantic searchers, contextual searchers, inference engines
closed domains around companies and related entities
knowledge graph embedding
base-url as entity (?) (because it corresponds to the organization).
then do entity-label-matching (title of html)
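One way to read the two notes above: derive a candidate entity from the base URL of a page, and take the HTML `<title>` as its label. The sketch below uses only the standard library; the function names and the sample URL and markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def base_url(url):
    """Reduce a page URL to scheme and host, used here as the entity key."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def entity_label(url, html):
    """Return (entity key, candidate label) for one scraped page."""
    parser = TitleParser()
    parser.feed(html)
    return base_url(url), parser.title.strip()


# Hypothetical page.
page = "<html><head><title> Example Bearings Ltd </title></head><body></body></html>"
print(entity_label("https://www.Example.com/products/bearings?id=1", page))
```

The title is only a candidate label; it would still need to be matched against the entities extracted from the page text.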
rule-based feature engineering
2.5k features per website
predictor on what sites to prioritize by the terms included in the data
NLP tasks:
language detection
text similarity
text classification
Noun Chunking
Dependency Parsing
Named Entity Recognition
Named Entity Disambiguation
Key Phrases
Concept Extraction ->
Sentiment Analysis
Relation Extraction
NER: CRF models, another one BERT models
populate knowledge graph with links from web
DBpedia schema
Apache Lucene triple store
ontology of 100s of tasks
- based on semantic rules on NLP tasks
topic classifier
record linking
schema mapping, normalization, blocking, pairwise similarity classification, agglomerative clustering, cluster UUID assignment
truth discovery with majority voting
human validation of feedback loop
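Truth discovery with majority voting can be sketched as follows: each source gets one vote per attribute, and the most frequent value wins. The claim format `(source, attribute, value)` and the sample claims are assumptions for illustration.

```python
from collections import Counter


def majority_vote(claims):
    """Resolve conflicting attribute values by majority vote across sources.

    claims: iterable of (source, attribute, value) triples.
    Returns {attribute: (winning value, share of sources that agree)}.
    """
    votes = {}
    for source, attribute, value in claims:
        # One vote per source and attribute; a later claim overwrites an earlier one.
        votes.setdefault(attribute, {})[source] = value

    resolved = {}
    for attribute, by_source in votes.items():
        tally = Counter(by_source.values())
        # Ties go to the value that was seen first.
        value, count = tally.most_common(1)[0]
        resolved[attribute] = (value, count / len(by_source))
    return resolved


# Hypothetical claims scraped from three sites about one company.
claims = [
    ("a.example", "headquarters", "Gothenburg"),
    ("b.example", "headquarters", "Gothenburg"),
    ("c.example", "headquarters", "Stockholm"),
    ("a.example", "founded", "1907"),
]
print(majority_vote(claims))
```

Attributes with a low agreement share are natural candidates to send to human validation.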
check out later:
-> start scraping the website already? (on bob)
- 700M documents per day ; 120 articles/s
Install all dependencies
pip install -r kubernetes/scraper_language_model/requirements.txt
sudo apt-get install python3 python-dev python3-dev build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev python-pip
pip install --upgrade pip
pip install deeppavlov
python -m deeppavlov install ner_ontonotes_bert_mult
python -c "from deeppavlov import configs, build_model; build_model(configs.ner.ner_ontonotes_bert_mult, download=True)"
Install PostgreSQL and create database
sudo apt update
sudo apt install postgresql postgresql-contrib
sudo -i -u postgres
Instantiate the tables
python -m screaper.resources.entities
-> Perhaps exclude anything that is not a company page, or just be more careful about which pages to scrape and which ones to skip
nice -n 17 python -m screaper.core.main
nice -n 17 python -m screaper.engine.core.