Reimplementing TOSBack with Ruby and using git to see TOS changes!

This is a Ruby implementation of TOSBack! It's designed to scrape the Privacy Policies and Terms of Service agreements from the sites defined in the rules folder.

The log files in "logs" should give info on when the script was last run and whether any of a rule's URLs need to be updated. Typically, tosback.rb will grab the body of a URL and try to strip away the HTML before storing the policy, but if a site comes back as modified every time the script runs (thanks to ads or related links changing), you can now add an xpath attribute to the url element in the XML rule to pinpoint the TOS content on the page.

Here's an example:

 <docname name="Privacy Policy">
   <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
     <norecurse name="arbitrary"/>
   </url>
 </docname>

Now, tosback.rb should only grab the content we want from that URL! Hooray!
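For illustration, here's a minimal sketch of how such an xpath attribute could be applied when fetching a page. It assumes a Nokogiri-based parser; the actual tosback.rb internals may differ, and the extract_policy helper name is hypothetical:

 require 'nokogiri'
 require 'open-uri'

 # Hypothetical helper: fetch a URL and, if the rule supplies an xpath,
 # keep only the text of the matching node; otherwise fall back to the body.
 def extract_policy(url, xpath = nil)
   doc = Nokogiri::HTML(URI.open(url))
   node = xpath ? doc.at_xpath(xpath) : doc.at_xpath('//body')
   node && node.text.strip
 end

 puts extract_policy('http://www.500px.com/privacy', "//div[@id='terms']")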

You can also pass a rule file as an argument to the script to get a preview of the results! For example:

rubycode$ ruby tosback.rb ../rules/abercrombie.com.xml

This will just print out whatever data it grabs from the rule, so you can add xpath data to a rule and quickly test to make sure it's correct.
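Conceptually, the preview path just parses the rule file named on the command line and prints each fetched policy instead of storing it. A rough sketch, reusing the hypothetical extract_policy helper above (not the script's actual code):

 # Hypothetical sketch of preview mode: parse the rule file named on the
 # command line, fetch each url element, and print the result to stdout.
 if ARGV[0] && ARGV[0].end_with?('.xml')
   rule = Nokogiri::XML(File.read(ARGV[0]))
   rule.xpath('//url').each do |u|
     puts extract_policy(u['name'], u['xpath'])
   end
 end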

Running with the "-empty" argument will scan the crawl directory and update the empty.log! Example:

rubycode$ ruby tosback.rb -empty
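The details of the scan aren't spelled out here, but it amounts to something like the following sketch; the crawl and logs paths are assumptions, not the script's actual layout:

 # Hypothetical sketch of the -empty scan: walk the crawl directory and
 # record any zero-length policy files in logs/empty.log.
 if ARGV[0] == '-empty'
   File.open('../logs/empty.log', 'w') do |log|
     Dir.glob('../crawl/**/*').each do |path|
       log.puts(path) if File.file?(path) && File.zero?(path)
     end
   end
 end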

Original README below:

---

This is TOSBack version 2, a clean redesign & reimplementation of EFF's
TOSBack project.

It uses Git as an inherently and efficiently versioned backend storage
database.

After cloning the git repository, you need to execute this command:

git submodule update --init --recursive

That will fetch a recent version of the GitPython code, which we depend upon.

*BUGS IN WGET*

If you want to actually run the crawler yourself (not really necessary unless
you're testing something), be aware that TOSBack2 also exposes a number of
bugs in common versions of wget.  As of December 2011, there are two bugs you
might need to patch yourself!

(FOR YOUR CONVENIENCE, a patched version of the wget source can be found in
lib/wget-1.13.4/ .  There is also a binary .deb that Debian and Ubuntu users
can try in lib/.  More hints on building from source below) 

1. Versions of wget built against
   gnutls may suffer from fatal memory leaks 
   https://lists.gnu.org/archive/html/bug-wget/2011-10/msg00050.html
   (so apply that patch, or build against openssl using ./configure --with-ssl=openssl).

2. You should also apply the following patch 
   https://savannah.gnu.org/support/download.php?file_id=24473
   to fix this bug: https://savannah.gnu.org/bugs/?21714

HINTS FOR BUILDING WGET FROM SOURCE ON DEBIAN OR UBUNTU

sudo apt-get build-dep wget
cd lib/wget-1.13.4/
fakeroot debian/rules binary

An installable .deb file *should* be written to the lib/ directory.
