_ _
(\\.-""-.//)
/ \ /) _____ __
\o o/ (( / ___/______________ _____ __ __/ /_ ____ __________ _
/\ /\ )) \__ \/ ___/ ___/ __ `/ __ \/ / / / __ \/ __ `/ ___/ __ `/
/==\ () /==\ // ___/ / /__/ / / /_/ / /_/ / /_/ / /_/ / /_/ / / / /_/ /
| `UU` |// /____/\___/_/ \__,_/ .___/\__, /_.___/\__,_/_/ \__,_/
| |/ /_/ /____/
.-'\ /'-.
(((` ) |----| ( `)))
(((` `)))
A modular web scraping framework based on capybara, capybara-webkit, and poltergeist
- Capybara DSL and drivers
- Modular plugins for scraping specific sites
- Additional utility methods to simplify your scraping efforts
- Ruby 1.9/2.0
- libxml2
- libxslt
- Qt (required by capybara-webkit)
- PhantomJS (required by poltergeist)
The simplest way to install Scrapybara is to use Bundler.
Add Scrapybara to your Gemfile:
gem 'scrapybara'
Or install the gem manually:
gem install scrapybara
Note: You'll need to load it manually from irb until this is packaged as a gem. For example:
cd /path/to/scrapybara
irb -Ilib -rscraper
You can now access the libraries inside IRB:
scraper = Scraper::Edgar.new
#=> #<Scraper::Edgar:0x007fbf478f73a8 @app_host="http://www.sec.gov">
# import most recent filings
scraper.import_filings
# import filings for given day
scraper.import_filings(Date.new(2012,12,21))
# query mongoid db using a named scope, see more: http://mongoid.org/en/mongoid/docs/querying.html
form_10ks = Scraper::Edgar::Filing.form_10k
# view documents for given filing:
most_recent_10k = Scraper::Edgar::Filing.form_10k.last
most_recent_10k.documents
#=> []
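To import more than one day, you can loop over a date range and call `import_filings` once per day. The sketch below is only an illustration: `FakeScraper` is a hypothetical stand-in so the snippet runs without Scrapybara loaded, `import_weekdays` is a helper name made up for this example, and skipping weekends is an assumption about when filings appear, not something the library enforces.

```ruby
require 'date'

# Hypothetical stand-in for Scraper::Edgar so this sketch runs on its own.
# In irb you would pass Scraper::Edgar.new instead.
class FakeScraper
  attr_reader :imported_days

  def initialize
    @imported_days = []
  end

  # Mirrors the one-argument call shown above: import_filings(date)
  def import_filings(date)
    @imported_days << date
  end
end

# Hypothetical helper: calls import_filings for every weekday in the range.
# Skipping Saturday and Sunday is an assumption made for this example.
def import_weekdays(scraper, from, to)
  (from..to).reject { |day| day.saturday? || day.sunday? }
            .each { |day| scraper.import_filings(day) }
end

scraper = FakeScraper.new
import_weekdays(scraper, Date.new(2012, 12, 17), Date.new(2012, 12, 23))
```

With a real scraper, each call would hit the site once per day in the range, so a short pause between calls may be worth adding.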
Generate the documentation with YARD:
yardoc 'lib/*.rb' 'lib/**/*.rb' 'lib/**/**/*.rb'
- better test coverage
- more field validations of models
- more plugins
A lot of new contributors ask, "Well, where do I start?" Below are some links to comprehensive resources that help newcomers get up to speed and dive right into fixing bugs and adding features.
We try to stick to a set of guidelines when it comes to contributing code. When you're writing a bugfix or custom code from scratch, it's good practice to ask yourself:
- Does my code have tests?
- Am I sticking to the Git Workflow the best I can?
Below are some relevant links to other parts of the wiki. We're currently restructuring everything, so these links may change.
- How to work with Pull Requests
- An Overview of Required Ruby Gems
- How to get a dev environment set up
- How to Report a Bug
- A Detailed Introduction to the Source Code
Thanks to the Diaspora project for the basic ideas on how to structure the README and wikis.
Scrapybara has been tested on the following Ruby interpreters:
- MRI 1.9.3
- MRI 2.0.0
- Source hosted on GitHub.
- Direct questions and discussions to the IRC channel
- Report issues on GitHub Issues.
- Pull requests are very welcome! Please include spec and/or feature coverage for every patch, and create a topic branch for every separate change you make.
- See the Contributing guide for instructions on running the specs and features.
- Documentation is generated with YARD (cheat sheet). To generate while developing:
yard server --reload
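YARD builds its pages from comments placed directly above classes and methods. As a rough sketch of the comment style, the example below documents one method with the `@param`, `@return`, and `@example` tags. `FilingHelpers` and `label` are hypothetical names used only to show the tag syntax; they are not part of Scrapybara.

```ruby
require 'date'

# Hypothetical helper module, used here only to show YARD tag syntax.
module FilingHelpers
  # Builds a short label for a filing.
  #
  # @param date [Date] the day the filing was made
  # @param form_type [String] the form type, e.g. "10-K"
  # @return [String] the form type followed by the ISO 8601 date
  # @example
  #   FilingHelpers.label(Date.new(2012, 12, 21), "10-K")
  def self.label(date, form_type)
    "#{form_type} filed on #{date.iso8601}"
  end
end
```

Running `yard server --reload` while editing comments like these should show the rendered result as you save.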