forked from tosdr/tosback2
This is a Ruby implementation of TOSBack, designed to scrape the privacy policies and terms of service agreements of the sites defined in the rules folder.
The log files in "logs" record when the script was last run and whether any rule's URL needs to be updated. By default, tosback.rb grabs the body of a URL and tries to strip away the HTML before storing the policy. If a site comes back as modified every time the script runs (because ads or related links keep changing), you can add an xpath attribute to the url element in the rule's XML to pinpoint the TOS text on the page.
Here's an example:
<docname name="Privacy Policy">
  <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
    <norecurse name="arbitrary"/>
  </url>
</docname>
Now, tosback.rb should only grab the content we want from that URL! Hooray!
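To make the rule format concrete, here is a minimal sketch of reading a rule and pulling out each URL with its optional xpath. It uses REXML from Ruby's standard distribution. The helper name rule_targets is made up for illustration, and tosback.rb itself may well parse rules differently.

```ruby
require "rexml/document"

# Sample rule, copied from the example above.
RULE_XML = <<~XML
  <docname name="Privacy Policy">
    <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
      <norecurse name="arbitrary"/>
    </url>
  </docname>
XML

# Hypothetical helper (not part of tosback.rb): returns one hash per
# <url> element found under a <docname> element in the rule.
def rule_targets(xml)
  doc = REXML::Document.new(xml)
  targets = []
  doc.elements.each("//docname/url") do |url|
    targets << {
      doc_name: url.parent.attributes["name"],
      url: url.attributes["name"],
      xpath: url.attributes["xpath"] # nil when the rule has no xpath
    }
  end
  targets
end

rule_targets(RULE_XML).each do |t|
  puts "#{t[:doc_name]}: #{t[:url]} (xpath: #{t[:xpath] || 'whole body'})"
end
```

A rule without an xpath attribute yields nil for :xpath, which a caller could treat as "use the whole page body".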
You can also pass a rule file as an argument to the script to get a preview of the results! For example:
rubycode$ ruby tosback.rb ../rules/abercrombie.com.xml
This will just print out whatever data it grabs from the rule, so you can add xpath data to a rule and quickly test to make sure it's correct.
Running with the "-empty" argument will scan the crawl directory and update the empty.log! Example:
rubycode$ ruby tosback.rb -empty
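As a rough illustration of what a scan like this could look like, the sketch below collects zero-byte files under a directory and writes their paths to a log. Everything here is an assumption: the helper names, the idea that "empty" means a zero-byte file, and the log format are not taken from tosback.rb.

```ruby
# Hypothetical helper: list zero-byte files under crawl_dir, sorted by path.
def empty_files(crawl_dir)
  Dir.glob(File.join(crawl_dir, "**", "*"))
     .select { |path| File.file?(path) && File.zero?(path) }
     .sort
end

# Hypothetical helper: overwrite log_path with one empty-file path per line
# and return the list of paths that were written.
def write_empty_log(crawl_dir, log_path)
  paths = empty_files(crawl_dir)
  File.write(log_path, paths.map { |p| "#{p}\n" }.join)
  paths
end
```

Calling write_empty_log("../crawl", "../logs/empty.log") would be one way to use it; both paths are placeholders for wherever your checkout keeps them.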
Original README below:
---
This is TOSBack version 2, a clean redesign & reimplementation of EFF's
TOSBack project.
It uses Git as an inherently and efficiently versioned backend storage
database.
After cloning the git repository, you need to execute this command:
git submodule update --init --recursive
That will fetch a recent version of the GitPython code, which we depend upon.
*BUGS IN WGET*
If you want to actually run the crawler yourself (not really necessary unless
you're testing something), be aware that TOSBack2 also exposes a number of
bugs in common versions of wget. As of December 2011, there are two bugs you
might need to patch yourself!
(FOR YOUR CONVENIENCE, a patched version of the wget source can be found in
lib/wget-1.13.4/ . There is also a binary .deb that Debian and Ubuntu users
can try in lib/. More hints on building from source below)
1. Versions of wget built against gnutls may suffer from fatal memory leaks:
https://lists.gnu.org/archive/html/bug-wget/2011-10/msg00050.html
(so apply that patch, or build against openssl using ./configure --with-ssl=openssl).
2. You should also apply the following patch
https://savannah.gnu.org/support/download.php?file_id=24473
to fix this bug: https://savannah.gnu.org/bugs/?21714
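To see whether the first bug might apply to your wget, you can check which TLS library it was built against. The sketch below guesses this from the text printed by wget --version. The helper name is made up, and the token format it looks for ("+"-prefixed feature names such as "+ssl/gnutls" or "+openssl") is an assumption that may not hold for every wget version.

```ruby
# Hypothetical helper: guess the TLS library from `wget --version` output.
# Assumes enabled features appear as "+"-prefixed tokens such as
# "+ssl/gnutls" or "+openssl". Returns :gnutls, :openssl, or nil.
def wget_tls_backend(version_output)
  enabled = version_output.split.select { |token| token.start_with?("+") }
  return :gnutls if enabled.any? { |token| token.include?("gnutls") }
  return :openssl if enabled.any? { |token| token.include?("openssl") }
  nil
end

# Uncomment to check the wget on your PATH:
# puts wget_tls_backend(`wget --version`).inspect
```

A result of :gnutls only suggests that the memory-leak patch is worth a look. It does not tell you whether your build already includes the fix.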
HINTS FOR BUILDING WGET FROM SOURCE ON DEBIAN OR UBUNTU
sudo apt-get build-dep wget
cd lib/wget-1.13.4/
fakeroot debian/rules binary
An installable .deb file *should* be written to the lib/ directory.