# Scrape

This program is to be used to easily parse out either single items or lists of items from input streams. It works by being given a Perl Compatible Regular Expression, assumedly with captures, that represents each item in the list of things you want to scrape from a stream. It then outputs, in order, all of the captures from all occurrences of that regexp found in the stream. The captures are, by default, all wrapped in curly braces, and while a capture may contain newlines, each set of wrapped captures for each item will take up only a line.

For example:

```
echo "<div>This is text one</div><div>This is text two</div><div>This text has a newline
</div><div>This is done</div>" | scrape '<div>([^ ]*) (.*?)</div>'
```

Would output:

```
{This} {is text one}
{This} {is text two}
{This} {text has a newline}
{This} {is done}
```

Also fun is:

```
curl -L google.com | scrape '<a.*?href="(.*?)".*?>\s*(.*?)\s*</a>'
```

Which returns the urls and text of all the links on google's homepage, or any page, in a list like:

```
{URL} {TEXT}
{URL} {TEXT}
{URL} {TEXT}
```

etc.

Also, it has optional delimiters to help narrow down the search. For instance:

```
scrape -s '<body[^>]*>' -e '</body>' '<p>(.*?)</p>'
```

Would go through and pull all paragraphs out of (assumedly) an html page without even considering the stuff outside of the body.

To change the output, the flags "a", "b", and "d" can be used. Each takes a string argument.

* -b is the string placed before the field
* -a is the string placed after the field
* -d is the string placed between fields

So, by default, -b'{' -a'}' -d' '.

It also supports what I call a Continuation Pattern. This is useful when trying to scrape all the data from a paginated input. For example, one could pull in an html page with curl, pass it to scrape, and scrape would also output the URL of the "next" link it found, so the loop could start over. Currently this works by outputting the first capture of the pattern to stderr. The script that's using scrape can then capture this stderr and, if it's a URL, go back to the top of the loop.

Example:

```
curl $URL | scrape -s '<body[^>]*>' -e '</body>' -c '<[^>]*href="([^"]*)"[^>]*>Next' '<p>(.*?)</p>' 2> /tmp/NEXT_URL
```

Would output all paragraphs in a page, and store the url of the next page in /tmp/NEXT_URL, allowing the script to loop on that.
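To make "loop on that" concrete, here is a minimal sketch of such a pagination loop. It only assumes the behaviour described above; the start URL and the /tmp/NEXT_URL scratch file are illustrative placeholders, not anything built into scrape.

```sh
#!/bin/sh
# Sketch of a pagination loop built around the continuation pattern.
# Assumes scrape behaves as described above; the start URL and the
# /tmp/NEXT_URL scratch file are placeholders for illustration.
URL="http://example.com/page1"
while [ -n "$URL" ]; do
	# Paragraphs go to stdout; the "next" link, if found, goes to stderr.
	curl -s "$URL" | scrape \
		-s '<body[^>]*>' -e '</body>' \
		-c '<[^>]*href="([^"]*)"[^>]*>Next' \
		'<p>(.*?)</p>' 2> /tmp/NEXT_URL
	# An empty /tmp/NEXT_URL means no "next" link was found, so the loop ends.
	URL="$(cat /tmp/NEXT_URL)"
done
```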
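As a last note on output, the -b/-a/-d flags described earlier can be used to drop the brace wrapping entirely. This is only a sketch of how they might be combined (assuming empty strings are accepted for -b and -a), emitting tab-separated fields that are easy to feed to cut or awk:

```sh
# Replace the default -b'{' -a'}' -d' ' with empty wrappers and a tab
# between fields (an assumption about how the flags compose).
curl -sL google.com | scrape -b '' -a '' -d "$(printf '\t')" \
	'<a.*?href="(.*?)".*?>\s*(.*?)\s*</a>'
```

Each link would then come out as its URL, a tab, and its text, one link per line.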