Add option to scrape non-english reviews #28
Currently, the scraper only gets reviews in English because the `rl=en` query param is hardcoded.

In my case, I need ALL reviews, no matter the language. Example: https://www.yelp.de/biz/cuccuma-berlin has 94 reviews but only 36 are in English.
To achieve this, I added a new step to the scraping process called `REVIEW_LANGUAGES`. It fetches the business's review feed URL, e.g. https://www.yelp.com/biz/j1KMoWRKHnDTqKBEVM45bw/review_feed. In the JSON response, Yelp provides a histogram of the review distribution by language.

Then I call the `REVIEW` step with the `rl=` param set to every language in the map.

One thing that had to change was the way we push to the main dataset: instead of pushing an item for every language, we use a temporary KV called `OUTPUT`, which gets pushed to the result only when all languages are processed.

This change should bring NO change in behavior if the new input field is `false`, the default. In that case the `REVIEW_LANGUAGES` step is skipped, and the scraper works the same way as it does now.
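
For reviewers who want the flow at a glance, here is a rough, language-neutral sketch in Python (not the actual scraper code). It assumes the language histogram has already been parsed into a plain `{language: count}` dict, and that the `REVIEW` step requests the same `review_feed` URL with an `rl` param. The names `review_urls`, `PendingOutput`, and `all_languages` are made up for illustration, as are the counts in the usage note.

```python
from urllib.parse import urlencode

# Review feed URL as quoted in this PR; reusing it for the REVIEW step is an assumption.
REVIEW_FEED = "https://www.yelp.com/biz/{biz_id}/review_feed"


def review_urls(biz_id, language_counts, all_languages=False):
    """Return a {language: url} map with one review request per language."""
    # With the option off (the default), keep the current behavior: English only.
    languages = sorted(language_counts) if all_languages else ["en"]
    base = REVIEW_FEED.format(biz_id=biz_id)
    return {lang: f"{base}?{urlencode({'rl': lang})}" for lang in languages}


class PendingOutput:
    """Hypothetical stand-in for the temporary OUTPUT key-value store."""

    def __init__(self, languages):
        self.remaining = set(languages)  # languages still being scraped
        self.reviews = []                # reviews collected so far

    def add(self, lang, reviews):
        """Store one language's reviews; return the item once all languages are in."""
        self.reviews.extend(reviews)
        self.remaining.discard(lang)
        if self.remaining:
            return None  # not finished yet, nothing goes to the dataset
        return {"reviews": self.reviews}
```

With a made-up histogram such as `{"en": 2, "de": 1}` and `all_languages=True`, `review_urls` yields one URL for `de` and one for `en`. `PendingOutput.add` returns `None` after the first language and the combined item after the second, so the dataset still receives a single item per business.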