Add option to scrape non-english reviews #28

Open: wants to merge 4 commits into base: master

bmestanov

Currently, the scraper only fetches reviews in English because the rl=en query param is hardcoded.

In my case, I need ALL reviews, no matter the language. For example, https://www.yelp.de/biz/cuccuma-berlin has 94 reviews, but only 36 of them are in English.

To achieve this, I added a new step to the scraping process called REVIEW_LANGUAGES. It fetches the URL https://www.yelp.com/biz/j1KMoWRKHnDTqKBEVM45bw/review_feed. In the JSON response, Yelp provides a histogram of the review distribution by language.

[image]

Then I call the REVIEW step once for every language in the map, with the rl= param set to that language.
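For illustration, the fan-out over languages could be sketched roughly as below. This is not the code from the PR: the histogram shape (a plain language-to-count map), the sample counts, the helper name `build_review_urls`, and the assumption that the REVIEW step reuses the review_feed endpoint with `rl=` appended are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical histogram shape and counts; the real review_feed JSON may differ.
SAMPLE_HISTOGRAM = {"de": 50, "en": 36, "it": 8}


def build_review_urls(biz_id: str, histogram: dict) -> list:
    """Return one review URL per language that has at least one review."""
    # Assumption: reviews are fetched from the same endpoint with rl=<language>.
    base = f"https://www.yelp.com/biz/{biz_id}/review_feed"
    return [
        f"{base}?{urlencode({'rl': lang})}"
        for lang, count in histogram.items()
        if count > 0  # skip languages with no reviews
    ]


urls = build_review_urls("j1KMoWRKHnDTqKBEVM45bw", SAMPLE_HISTOGRAM)
```

Each URL in `urls` would then be enqueued as its own REVIEW request.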

One thing that had to change was the way we push to the main dataset: instead of pushing an item for every language, we use a temporary key-value store called OUTPUT, which gets pushed to the result only when all languages have been processed.
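The buffering idea can be sketched as follows. This is a simplified in-memory stand-in, not the actual OUTPUT store from the PR; the class name `OutputBuffer`, its methods, and the item shape are assumptions made for the example.

```python
from typing import Optional


class OutputBuffer:
    """Collects reviews per business until every expected language has reported."""

    def __init__(self) -> None:
        self._pending = {}

    def expect(self, biz_id: str, languages: list) -> None:
        # Register which languages must finish before the item is emitted.
        self._pending[biz_id] = {"remaining": set(languages), "reviews": []}

    def add(self, biz_id: str, lang: str, reviews: list) -> Optional[dict]:
        """Store one language's reviews; return the merged item once all are in."""
        entry = self._pending[biz_id]
        entry["reviews"].extend(reviews)
        entry["remaining"].discard(lang)
        if entry["remaining"]:
            return None  # still waiting for other languages
        del self._pending[biz_id]
        return {"bizId": biz_id, "reviews": entry["reviews"]}


buffer = OutputBuffer()
buffer.expect("cuccuma-berlin", ["en", "de"])
first = buffer.add("cuccuma-berlin", "en", [{"text": "Great pizza"}])
merged = buffer.add("cuccuma-berlin", "de", [{"text": "Sehr lecker"}])
```

Here `first` is `None` because German is still pending, and `merged` is the single item that would be pushed to the dataset.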

This should bring NO change in behavior when the new input field is false, which is the default. In that case the REVIEW_LANGUAGES step is skipped, and the scraper works the same way as it does now.
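As a rough sketch of that gating, assuming a hypothetical boolean input named `scrape_all_languages` (the actual field name is not given here):

```python
def plan_steps(scrape_all_languages: bool = False) -> list:
    """Return the ordered scraping steps for the given input flag."""
    if scrape_all_languages:
        # Discover the languages first, then fetch reviews per language.
        return ["REVIEW_LANGUAGES", "REVIEW"]
    # Default: unchanged behavior, English-only reviews.
    return ["REVIEW"]
```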
