Reimplementing TOSBack with Ruby and using git to see TOS changes!

This is a Ruby implementation of TOSBack! It's designed to scrape the Privacy Policies and Terms of Service agreements from the sites defined in the rules folder.

The log files in "logs" should give info on when the script was last run and whether any of a rule's URLs need to be updated. Typically, tosback.rb will grab the body of a URL and try to strip away the HTML before storing the policy, but if a site comes back as modified every time the script runs (thanks to ads or related links changing), you can now add an xpath attribute to the url element in the XML rule to pinpoint the TOS content on the page.

Here's an example:

 <docname name="Privacy Policy">
   <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
     <norecurse name="arbitrary"/>
   </url>
 </docname>

Now, tosback.rb should only grab the content we want from that URL! Hooray!
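For illustration, here's a minimal sketch of how such an xpath attribute could be applied when fetching a page. It assumes a Nokogiri-based parser; the actual tosback.rb internals may differ, and the extract_policy helper name is hypothetical:

 require 'nokogiri'
 require 'open-uri'

 # Hypothetical helper: fetch a URL and, if the rule supplies an xpath,
 # keep only the text of the matching node; otherwise fall back to the body.
 def extract_policy(url, xpath = nil)
   doc = Nokogiri::HTML(URI.open(url))
   node = xpath ? doc.at_xpath(xpath) : doc.at_xpath('//body')
   node && node.text.strip
 end

 puts extract_policy('http://www.500px.com/privacy', "//div[@id='terms']")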

You can also pass a rule file as an argument to the script to get a preview of the results! For example:

rubycode$ ruby tosback.rb ../rules/abercrombie.com.xml

This will just print out whatever data it grabs from the rule, so you can add xpath data to a rule and quickly test to make sure it's correct.
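Conceptually, the preview path just parses the rule file named on the command line and prints each fetched policy instead of storing it. A rough sketch, reusing the hypothetical extract_policy helper above (not the script's actual code):

 # Hypothetical sketch of preview mode: parse the rule file named on the
 # command line, fetch each url element, and print the result to stdout.
 if ARGV[0] && ARGV[0].end_with?('.xml')
   rule = Nokogiri::XML(File.read(ARGV[0]))
   rule.xpath('//url').each do |u|
     puts extract_policy(u['name'], u['xpath'])
   end
 end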

Running with the "-empty" argument will scan the crawl directory and update the empty.log! Example:

rubycode$ ruby tosback.rb -empty
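The details of the scan aren't spelled out here, but it amounts to something like the following sketch; the crawl and logs paths are assumptions, not the script's actual layout:

 # Hypothetical sketch of the -empty scan: walk the crawl directory and
 # record any zero-length policy files in logs/empty.log.
 if ARGV[0] == '-empty'
   File.open('../logs/empty.log', 'w') do |log|
     Dir.glob('../crawl/**/*').each do |path|
       log.puts(path) if File.file?(path) && File.zero?(path)
     end
   end
 end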

Original README below:

---

This is TOSBack version 2, a clean redesign & reimplementation of EFF's
TOSBack project.

It uses Git as an inherently and efficiently versioned backend storage
database.

After cloning the git repository, you need to execute this command:

git submodule update --init --recursive

That will fetch a recent version of the GitPython code, which we depend upon.

*BUGS IN WGET*

If you want to actually run the crawler yourself (not really necessary unless
you're testing something), be aware that TOSBack2 also exposes a number of
bugs in common versions of wget.  As of December 2011, there are two bugs you
might need to patch yourself!

(FOR YOUR CONVENIENCE, a patched version of the wget source can be found in
lib/wget-1.13.4/ .  There is also a binary .deb that Debian and Ubuntu users
can try in lib/.  More hints on building from source below) 

1. Versions of wget built against
   gnutls may suffer from fatal memory leaks 
   https://lists.gnu.org/archive/html/bug-wget/2011-10/msg00050.html
   (so apply that patch, or build against openssl using ./configure --with-ssl=openssl).

2. You should also apply the following patch 
   https://savannah.gnu.org/support/download.php?file_id=24473
   to fix this bug: https://savannah.gnu.org/bugs/?21714

HINTS FOR BUILDING WGET FROM SOURCE ON DEBIAN OR UBUNTU

sudo apt-get build-dep wget
cd lib/wget-1.13.4/
fakeroot debian/rules binary

An installable .deb file *should* be written to the lib/ directory.
