GitHub - InteractiveAdvertisingBureau/adstxtcrawler: A reference implementation in python of a simple crawler for Ads.txt

Synopsis

An example crawler for ads.txt files. Given a list of URLs or domains, it fetches each site's ads.txt file and saves the records to a SQLite DB table.

Usage Example

Usage: adstxt_crawler.py [options]

Options:
  -h, --help            show this help message and exit
  -t FILE, --targets=FILE
                        list of domains to crawler ads.txt from
  -d FILE, --database=FILE
                        Database to dump crawlered data into
  -v, --verbose         Increase verbosity (specify multiple times for more)

Targets File

The targets file can be a list of domains, URLs, etc., one per line. For each line, the crawler extracts the full hostname, validates it, and issues a request to http://HOSTNAME/ads.txt

$ cat target_domains.txt 
#https://chicagotribune.com
#http://latimes.com/sports
#washingtonpost.com
#http://nytimes.com/index.html
localhosttribune.com
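
The hostname handling described above can be sketched roughly as follows. This is an illustration written for Python 3, not the code in adstxt_crawler.py: the function name `ads_txt_url`, the validation pattern, and the rule that lines starting with `#` are skipped are all assumptions.

```python
import re
from urllib.parse import urlparse

# Hypothetical validation rule: dot-separated labels of letters, digits and
# hyphens, with no label starting or ending in a hyphen. The real crawler's
# validation may differ.
LABEL = r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
HOSTNAME_RE = re.compile(r"^%s(\.%s)+$" % (LABEL, LABEL))


def ads_txt_url(line):
    """Return http://HOSTNAME/ads.txt for one targets-file line, or None."""
    entry = line.strip()
    # Assumption: blank lines and lines starting with '#' are skipped.
    if not entry or entry.startswith("#"):
        return None
    # urlparse only recognises the host when a scheme separator is present,
    # so bare domains get a placeholder scheme first.
    if "://" not in entry:
        entry = "http://" + entry
    host = urlparse(entry).hostname  # lower-cased, without port or path
    if host is None or not HOSTNAME_RE.match(host):
        return None
    return "http://%s/ads.txt" % host


if __name__ == "__main__":
    for sample in ["http://latimes.com/sports", "washingtonpost.com"]:
        print(sample, "->", ads_txt_url(sample))
```

Under these assumptions, both a full URL such as http://latimes.com/sports and a bare domain such as washingtonpost.com reduce to the same form of request URL.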

Installation

The project depends on the following libraries and programs being installed:

Python 2 or better
sqlite3
See requirements.txt for all Python packages to install

Execute this command to create the DB table:

$ sqlite3 adstxt.db < adstxt_crawler.sql
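
If the sqlite3 command-line program is not available, the same step can probably be done from Python's standard library. The sketch below is not part of the project. The helper name `init_db` is hypothetical, and it assumes adstxt_crawler.sql contains only plain SQL statements (no sqlite3 shell dot-commands).

```python
import sqlite3


def init_db(db_path, schema_path):
    """Run every SQL statement in schema_path against the SQLite file db_path."""
    with open(schema_path) as schema_file:
        schema_sql = schema_file.read()
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    try:
        # executescript runs multiple ';'-separated statements in one call.
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()


# Example call, using the file names from this README:
# init_db("adstxt.db", "adstxt_crawler.sql")
```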

Running

Typical usage is to pass the filename of the targets file and the filename of the SQLite DB:

$ ./adstxt_crawler.py -t target_domains.txt -d adstxt.db
Wrote 3 records from 1 URLs to adstxt.db

Each run writes a sequence of entries to adstxt_crawler.log.

You can examine the DB records created as follows:

$ echo "select * from adstxt;" | sqlite3 adstxt.db
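
The same inspection can be done from Python, which is handy when the rows feed into further processing. This is a minimal sketch, assuming only that the table is named adstxt as in the query above. The helper name `dump_table` is hypothetical, and the column names are read from the database at run time rather than assumed.

```python
import sqlite3


def dump_table(db_path, table="adstxt"):
    """Return (column_names, rows) for every row in the given table."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so the name is
        # quoted by hand; only pass trusted table names here.
        quoted = '"%s"' % table.replace('"', '""')
        cur = conn.execute("SELECT * FROM " + quoted)
        columns = [col[0] for col in cur.description]
        return columns, cur.fetchall()
    finally:
        conn.close()


# Example call, using the database name from this README:
# columns, rows = dump_table("adstxt.db")
# print(columns)
# for row in rows:
#     print(row)
```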

You can clear the DB records as follows:

$ echo "delete from adstxt;" | sqlite3 adstxt.db

Warnings

This is an example prototype crawler and is suitable only for very modest production usage. It lacks many of the niceties of a production crawler, such as parallel HTTP download and parsing of the data files, stateful recovery when target servers are down, and use of a real production DB server.
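
To illustrate the first of those missing niceties, parallel HTTP download could be sketched with a thread pool from the Python 3 standard library. Nothing below exists in adstxt_crawler.py. The names `fetch` and `fetch_all`, the worker count, and the timeout are all assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch URLs concurrently.

    Returns a dict mapping each URL to its body, or to the exception
    raised while fetching it, so one dead server does not abort the run.
    """
    urls = list(urls)

    def safe(url):
        try:
            return fetcher(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(safe, urls)))
```

Passing the fetch function as a parameter keeps the concurrency logic testable without network access. Retries, rate limiting, and recovery state would still need to be added for real production use.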

Contributors

Maintainer: Neal Richter, [email protected] or [email protected]

Contributors (GitHub.com account names): iantri, jhpacker, brk212, bradlucas, nag4, AntoineJac, markparolisi, sean-mcmann, Breza, miyaichi

License

The open source license used is the 2-clause BSD license.

