Merge pull request #85 from brave/move-to-pytorch
Switch from tensorflow to sentence-transformer
Showing 7 changed files with 140 additions and 83 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,32 @@ | ||
# brave-news-source-suggestion | ||
|
||
Pipeline for producing the source embedding representations and similarity matrix needed for the source suggestion feature in Brave News. | ||
Service for producing the source embedding representations and similarity matrix needed for the source suggestion feature in Brave News. | ||
|
||
## Scripts | ||
Run the scripts in the order in which they are presented. | ||
|
||
`source-feed-accumulator.py`: periodically parses Brave News's feed, creating article buckets for each source. These buckets are collected in `articles_history.csv` and catalogued by the `publisher_id` attribute. | ||
|
||
`sources-similarity-matrix.py`: takes in the source buckets and produces a 512-dimensional embedding for each source, built as the mean of the 512-dimensional embeddings of all articles belonging to the source, as generated by the Universal Sentence Encoder model (https://arxiv.org/abs/1803.11175). | ||
## Installation | ||
|
||
## Outputs | ||
|
||
`source_embeddings.csv`: [`index | publisher_id | 0 | 1 ... | ... 511`] stores all the 512-dimensional embeddings for each source under its `publisher_name`. | ||
|
||
`source_similarity_t10.json` stores the top-10 most similar sources, with similarity score, for each source. | ||
``` | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Scripts | ||
**source-feed-accumulator.py**: periodically parses the Brave News feed, collecting the articles for each source in `articles_history.csv`. For each article, we store the `publisher_id` attribute. | ||
|
||
**sources-similarity-matrix.py**: takes the article history as input and produces a 384-dimensional embedding for each source, using the `sentence-transformers` package with one of two models: | ||
- `all-MiniLM-L6-v2` for English-language sources. | ||
- `paraphrase-multilingual-MiniLM-L12-v2` for non-English-language sources. | ||
Once all source embeddings are generated, a pairwise source similarity matrix is produced. | ||
|
||
## Running locally | ||
To collect and accumulate article history: | ||
``` | ||
export NO_UPLOAD=1 | ||
export NO_DOWNLOAD=1 | ||
python source-feed-accumulator.py | ||
``` | ||
|
||
To compute source embeddings and produce the source similarity matrix: | ||
``` | ||
export NO_UPLOAD=1 | ||
export NO_DOWNLOAD=1 | ||
python sources-similarity-matrix.py | ||
``` |
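The flow the README describes (one embedding per source, then pairwise similarities) can be sketched without downloading a model. This is a minimal, dependency-free illustration, not the repository's code. `toy_encode` is a hypothetical stand-in for a sentence encoder, the 4-dimensional vectors replace the 384-dimensional ones, and the sample articles and the `top_k_similar` helper are made up for the example. The top-k ranking is modelled loosely on the `source_similarity_t10.json` output named in the earlier README text.

```python
import math
from collections import defaultdict

EMBEDDING_DIM = 4  # toy size; the README's models produce 384 dimensions


def toy_encode(title):
    """Hypothetical stand-in for a sentence encoder: folds characters into a small vector."""
    vec = [0.0] * EMBEDDING_DIM
    for i, ch in enumerate(title.lower()):
        vec[i % EMBEDDING_DIM] += ord(ch) / 100.0
    return vec


def source_embedding(titles):
    """Average the title vectors dimension by dimension."""
    vectors = [toy_encode(t) for t in titles]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k_similar(embeddings, k):
    """Map each publisher_id to its k most similar other publishers, best first."""
    result = {}
    for pid, emb in embeddings.items():
        scored = [(other, cosine(emb, other_emb))
                  for other, other_emb in embeddings.items() if other != pid]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[pid] = scored[:k]
    return result


# Made-up (publisher_id, title) rows standing in for articles_history.csv.
articles = [
    ("pub_a", "Markets rally as inflation cools"),
    ("pub_a", "Central bank holds rates steady"),
    ("pub_b", "Stocks climb after inflation report"),
    ("pub_c", "Local team wins championship final"),
]

titles_by_source = defaultdict(list)
for publisher_id, title in articles:
    titles_by_source[publisher_id].append(title)

embeddings = {pid: source_embedding(t) for pid, t in titles_by_source.items()}
similar = top_k_similar(embeddings, k=2)
for pid, neighbours in similar.items():
    print(pid, neighbours)
```

With a real encoder, only `toy_encode` would change. The grouping by `publisher_id`, the averaging, and the ranking stay the same.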
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
import numpy as np | ||
from sentence_transformers import util | ||
from structlog import get_logger | ||
|
||
import config | ||
|
||
EMBEDDING_DIMENSIONALITY = 384 | ||
|
||
logger = get_logger() | ||
|
||
|
||
def compute_source_similarity(source_1, source_2, function='cosine'): | ||
    if function == 'dot': | ||
        return util.dot_score(source_1, np.transpose(source_2)) | ||
    elif function == 'cosine': | ||
        return util.pytorch_cos_sim(source_1, source_2)[0][0] | ||
|
||
|
||
def get_source_representation_from_titles(titles, model): | ||
    if len(titles) < config.MINIMUM_ARTICLE_HISTORY_SIZE: | ||
        return np.zeros((1, EMBEDDING_DIMENSIONALITY)) | ||
|
||
    return model.encode(titles).mean(axis=0) | ||
|
||
|
||
def compute_source_representation_from_articles(articles_df, publisher_id, model): | ||
    publisher_bucket_df = articles_df[articles_df.publisher_id == publisher_id] | ||
|
||
    titles = [ | ||
        title for title in publisher_bucket_df.title.to_numpy() if title is not None] | ||
    return get_source_representation_from_titles(titles, model) |
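To make the two scoring modes and the zero-vector fallback in this module easier to reason about, here is a small dependency-free sketch. It mirrors the logic but does not reproduce the module. `MIN_HISTORY`, `DIM`, `mean_or_zero`, and `similarity` are hypothetical names, and plain Python lists stand in for NumPy arrays and tensors. Two choices are this sketch's own and are not taken from the commit: it raises on an unknown `function` value (the module as written falls through and returns `None`), and it returns 0.0 when a vector has zero length.

```python
import math

MIN_HISTORY = 3  # hypothetical stand-in for config.MINIMUM_ARTICLE_HISTORY_SIZE
DIM = 3          # toy dimensionality; the module above uses 384


def mean_or_zero(vectors):
    """Mean vector, or an all-zero fallback when there is too little history."""
    if len(vectors) < MIN_HISTORY:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def similarity(a, b, function="cosine"):
    """Score two vectors with a dot product or with cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    if function == "dot":
        return dot
    if function == "cosine":
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        # Sketch-level choice: treat a zero-length vector as similarity 0.0.
        return dot / norm if norm > 0 else 0.0
    raise ValueError(f"unknown similarity function: {function!r}")


a = mean_or_zero([[1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 0.0]])  # enough history
b = [2 * x for x in a]                    # same direction, twice the length
sparse = mean_or_zero([[5.0, 5.0, 5.0]])  # too few vectors, so the zero fallback

print(similarity(a, b, "cosine"))       # close to 1.0: cosine ignores vector length
print(similarity(a, b, "dot"))          # grows with vector length
print(similarity(a, sparse, "cosine"))  # the zero fallback matches nothing
```

One detail of the module that the sketch does not capture: the fallback there is `np.zeros((1, EMBEDDING_DIMENSIONALITY))`, which has shape `(1, 384)`, while the mean over encoded titles has shape `(384,)`, so callers see two different shapes.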
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,14 @@ | ||
feedparser==6.0.10 | ||
numpy==1.23.5 | ||
pandas==1.5.1 | ||
requests==2.28.1 | ||
requests==2.31.0 | ||
scipy==1.9.3 | ||
tensorflow==2.9.3 | ||
tensorflow_text==2.9.0 | ||
tensorflow_hub==0.12.0 | ||
tqdm==4.64.1 | ||
sentence-transformers==2.2.2 | ||
sentry-sdk==1.28.1 | ||
tqdm==4.65.0 | ||
boto3==1.26.14 | ||
botocore==1.29.14 | ||
structlog==22.2.0 | ||
structlog==22.3.0 | ||
torch==2.0.1 | ||
torchvision==0.15.2 | ||
transformers==4.31.0 |