This is a collection of scrapers that fetch concert listings from various venues in the Bay Area, plus a website for viewing those listings.
The scraper uses headless Selenium (Chromedriver) and is written in Ruby.
Results are converted to JSON and pushed to a Google Cloud Storage bucket.
The front end is written in React and styled with the Bulma CSS framework.
- Run `bundle install` in the root of the repo using a stable Ruby version.
- You're gonna need to install Chromedriver.
  - On Raspberry Pi: you can do this with `sudo apt-get install chromium-chromedriver`, and the scraper will pick it up automatically (it checks your system architecture and uses the right chromedriver path).
  - On OSX and Windows you can find all versions at https://chromedriver.chromium.org/ and then add the binary to your PATH. For example, add this to your `~/.bash_profile`: `export PATH=$PATH:/path/to/folder/containing/chromedriver/`
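  For the curious, here is a minimal sketch of the kind of architecture check that picks a chromedriver binary. The Raspberry Pi path below is the usual `chromium-chromedriver` install location, but the actual logic in this repo may look different:

  ```ruby
  require 'rbconfig'

  # Guess a chromedriver location from the host CPU architecture.
  def chromedriver_path
    if RbConfig::CONFIG['host_cpu'] =~ /arm|aarch64/
      '/usr/lib/chromium-browser/chromedriver' # installed by chromium-chromedriver on Raspberry Pi OS
    else
      'chromedriver' # resolved via PATH on OSX / Windows
    end
  end
  ```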
- Run `cp .env.example .env` in the root of the repo. The `.env` file sets environment variables which can be used to customize the application's behavior. This file can be edited directly.
- Make a new "project" on Google Cloud. Create a GCS bucket in the project. Add the credentials to `.env`:

  ```
  STORAGE_PROJECT = "my-project-id"
  STORAGE_CREDENTIALS = "path/to/keyfile.json"
  ```
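  If you don't already have a keyfile: the credentials are a service-account JSON key. One way to create one with the `gcloud` CLI (the `concert-scraper` account name here is just a placeholder, and you can also do all of this in the Cloud Console):

  ```
  gcloud iam service-accounts create concert-scraper --project=my-project-id
  gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:concert-scraper@my-project-id.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
  gcloud iam service-accounts keys create path/to/keyfile.json \
    --iam-account=concert-scraper@my-project-id.iam.gserviceaccount.com
  ```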
- Change the GCS bucket permissions so all files are publicly available by default.
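  You can do this in the Cloud Console, or (if the bucket uses uniform bucket-level access) with something like:

  ```
  gsutil iam ch allUsers:objectViewer gs://<BUCKET_NAME>
  ```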
- Configure `gsutil` to use your new project, then upload the CORS file which I've included in the repo:

  ```
  gsutil cors set cors-json-file.json gs://<BUCKET_NAME>
  ```
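  The CORS settings are what allow the browser-based frontend to fetch the JSON listings directly from the bucket. The repo's `cors-json-file.json` is the source of truth, but a GCS CORS config of this kind generally looks something like:

  ```json
  [
    {
      "origin": ["*"],
      "method": ["GET"],
      "responseHeader": ["Content-Type"],
      "maxAgeSeconds": 3600
    }
  ]
  ```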
- Make sure you're using a stable Node version.
- `cd frontend/react-app`
- Run `yarn install` to get dependencies.
- Run `yarn start` and then open `localhost:3000`.

To build the project for production, use `yarn build`.
Note: There is also a script which you can run from the root of the repo to start the React server: `bin/run_frontend`. This runs `nvm use 14; cd frontend/react_app && yarn install & yarn start`; note that you will probably have to change the `nvm use 14` part if you are using a different Node version.
There is a command line tool at `bin/run_scraper`. By default it will run all scrapers (each will fetch a maximum of 200 events) and then upload the results to GCS.

Options (note that most of these can also be set from `.env`):

```
# Limits each scraper to N results
--limit=10

# Just print the results, don't upload them to GCS
--skip-persist

# Don't rescue scraping errors - one broken scraper will stop the whole script
# If rescue=true (the default) then broken scrapers will just be skipped
--rescue=false

# Just update the list of venues. Don't actually scrape any events.
--no-scrape

# Limit the scrape to a set of venues. Comma-separated list.
--sources=GreyArea,Cornerstone

# Run headlessly, or not
--headless=true
--headless=false
```

For example, combining options:

```
bin/run_scraper --headless=true --limit=5 --skip-persist --rescue=false --sources=ElboRoom,Knockout
```
There are some other configuration options set through ENV; see `.env.example`.

Note that every time you run a scraper, it will completely overwrite the list of events for that venue.
There are some basic automated tests for the scrapers. Run `bundle exec rspec` from the root of the repo. Using rspec you can also isolate specific tests to run.
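For example, to run just the scraper specs and filter to a single venue's example (the spec path and the example name here are assumptions; adjust them to match the repo):

```
bundle exec rspec spec/scraper_spec.rb -e GreyArea
```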
- Add a new entry to `sources.json`. You can get `latlng` from Google Maps (right-click the marker on the map and the coords will pop up). For `desc` you can just copy the blurb from Google Maps as well.
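  A hypothetical entry might look something like this (check the existing entries in `sources.json` for the exact `latlng` format and any other required fields):

  ```json
  {
    "name": "VenueName",
    "latlng": "37.7749, -122.4194",
    "desc": "Short blurb about the venue, copied from Google Maps."
  }
  ```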
- Create a new file `scraper/lib/sources/venue_name.rb` (replacing `venue_name`, obviously).
- You can copy one of the existing scraper classes as a starting point. Note that there are a few different types of websites (calendar view, infinite scroll, all-on-one-page) so it's best to find another scraper that is similar in that regard.
- Make sure the class name is the exact same as the `name` value in `sources.json`.
- Fill out the contents of the scraper, using `binding.pry` and the `HEADLESS=false` environment variable as needed for debugging.
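  While developing, something like this runs just your new scraper with a visible browser window, without uploading anything, and without swallowing errors (assuming `VenueName` matches the `name` in `sources.json`):

  ```
  HEADLESS=false bin/run_scraper --sources=VenueName --limit=5 --skip-persist --rescue=false
  ```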
- Add a test case to `scraper_spec.rb` (you can just use `generic_run_test` like the other scrapers).
Note: there is no need to explicitly `require` the scraper class anywhere in the codebase. Autoloading is already set up based on `sources.json`.
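Putting those steps together, a hypothetical skeleton for a new scraper might look roughly like this. The module name, superclass, and method names are illustrative only; copy an existing class under `scraper/lib/sources/` for the real structure:

```ruby
# scraper/lib/sources/venue_name.rb
module Sources
  # The class name must exactly match the "name" value in sources.json.
  class VenueName
    # Hypothetical entry point: fetch the venue's listings page with the
    # shared Selenium driver and return an array of event hashes.
    def run(limit: 200)
      # driver.navigate.to "https://example-venue.com/calendar"
      # ... parse events, stopping at `limit` ...
      []
    end
  end
end
```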
These are both unused. I kept them here in case I want to have a dedicated backend at some point.
For now it suffices to go backend-less and just host the results on GCS.
- Map View
- Add more meta-scrapers (e.g. scrape other scrapers/aggregators), especially for electronic shows which aren't really captured by the current venue list or "The List"
- Add more venues (have specifically received requests for South Bay, but probably there are new SF / East Bay venues as well).
- Add Venue Events List view (accessible from Venue List View)
- Find a way to handle events that don't have an explicit year in their date
- Add Submit Event / About pages