
[ads.txt crawler] ads.txt is an IAB initiative to improve transparency in programmatic advertising. The files are publicly available and crawlable by exchanges, Supply-Side Platforms (SSP), and other buyers and third-party vendors.


rashanjyot/Crawler-ads.txt


Crawler ads.txt

--Implementation and assumptions

The implementation follows the ads.txt specification v1.0.1. Where the specification leaves room for interpretation, certain assumptions were made; they are described below.

The crawler parses ads.txt responses that return a success status (2xx). Redirects (3xx) are followed any number of times as long as the root domain stays the same, and at most one hop to a different domain is allowed. A few websites returned ads.txt content with a 4xx status; these responses are marked as failures and are not saved to the db.
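The redirect rule above can be sketched roughly as follows. This is not the repository's code: the class name `RedirectPolicy`, its methods, and the naive "last two labels" root-domain rule are all assumptions made for illustration (a real crawler would likely need a public suffix list to handle hosts such as `example.co.uk`).

```java
import java.net.URI;

/** Hypothetical sketch of the redirect rule; none of these names come from the project. */
final class RedirectPolicy {
    /** At most one hop to a different root domain is allowed. */
    static final int MAX_DOMAIN_CHANGES = 1;

    /**
     * Naive root domain: the last two labels of the host, lowercased.
     * Assumption: multi-part suffixes such as "co.uk" are not handled.
     */
    static String rootDomain(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return h;
        }
        return labels[n - 2] + "." + labels[n - 1];
    }

    /**
     * Decides whether a redirect from one absolute URL to another may be followed.
     * Returns the number of domain changes used so far if the redirect is allowed,
     * or -1 if following it would exceed MAX_DOMAIN_CHANGES.
     */
    static int follow(URI from, URI to, int domainChangesSoFar) {
        boolean sameRoot = rootDomain(from.getHost()).equals(rootDomain(to.getHost()));
        if (sameRoot) {
            return domainChangesSoFar; // same root domain: no limit on these hops
        }
        int used = domainChangesSoFar + 1;
        return used <= MAX_DOMAIN_CHANGES ? used : -1;
    }
}
```

A caller would start with `domainChangesSoFar = 0`, pass the returned value into the next call for each 3xx response, and stop as soon as `-1` is returned.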




ads.txt files served with a content-type other than text/plain are ignored, although a few websites were observed serving a valid ads.txt file as text/html. As for the format, only the record format (marked in green) is parsed and stored. The second format, the variable format (marked in red), was not mentioned in the requirement document and is therefore ignored.
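As a minimal sketch of the content-type check and record parsing described above, assuming the comma-separated record layout of the ads.txt specification (advertising system domain, account ID, account type, optional certification authority ID). The class `AdsTxtLine` and its members are hypothetical names, not the project's; extension fields after `;` are not handled here.

```java
import java.util.Optional;

/** Hypothetical sketch of a parsed ads.txt record; names are illustrative only. */
final class AdsTxtLine {
    final String advertiserDomain;
    final String accountId;
    final String accountType;

    AdsTxtLine(String advertiserDomain, String accountId, String accountType) {
        this.advertiserDomain = advertiserDomain;
        this.accountId = accountId;
        this.accountType = accountType;
    }

    /** Accepts only text/plain, with or without parameters such as a charset. */
    static boolean isPlainText(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase().startsWith("text/plain");
    }

    /**
     * Parses one line. Returns empty for blank lines, comment-only lines,
     * variable lines (VARIABLE=VALUE), and records with fewer than three fields.
     */
    static Optional<AdsTxtLine> parse(String rawLine) {
        // Everything after '#' is a comment.
        int hash = rawLine.indexOf('#');
        String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
        if (line.isEmpty()) {
            return Optional.empty();
        }
        String[] fields = line.split(",");
        // The first field of a record is a domain, so '=' there marks a variable line.
        if (fields[0].contains("=") || fields.length < 3) {
            return Optional.empty();
        }
        return Optional.of(new AdsTxtLine(
                fields[0].trim().toLowerCase(),
                fields[1].trim(),
                fields[2].trim().toUpperCase()));
    }
}
```

The three fields kept here correspond to the advertiser name, account_id, and account_type columns in the tables below; the optional fourth field is dropped in this sketch.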

Running the project

Requirements:

  1. Java
  2. PostgreSQL 9.6 or above
  3. Postgres driver JAR for JDBC

Process:

  1. Configure your db settings in the main.Setup.java file.
  2. Run Init.java once to create the tables and indexes.
  3. Run Crawler.java to start crawling the ads.txt files of the domains listed in the txt file configured through the DOMAIN_LIST_FILE variable in Setup.java.

--Table creation SQLs

CREATE TABLE website(
website_id SERIAL PRIMARY KEY,
name varchar(100) UNIQUE NOT NULL,
last_crawled_at timestamp
);

CREATE TABLE advertiser(
advertiser_id SERIAL PRIMARY KEY,
name varchar(100) UNIQUE NOT NULL
-- removing tagid as of now
);

CREATE TABLE publisher(
publisher_id SERIAL PRIMARY KEY,
website_id INTEGER NOT NULL REFERENCES website(website_id) ON DELETE CASCADE,
advertiser_id INTEGER NOT NULL REFERENCES advertiser(advertiser_id) ON DELETE CASCADE,
account_id varchar(100) NOT NULL,
account_type varchar(200) NOT NULL,
UNIQUE (website_id, advertiser_id, account_id)
);

--Indexes (Explicitly created ones)

CREATE INDEX ON publisher (advertiser_id);

CREATE INDEX ON publisher (account_id);

--Queries

  1. List of unique advertisers on a website.
    SELECT DISTINCT(advertiser.name) FROM website INNER JOIN publisher ON publisher.website_id = website.website_id INNER JOIN advertiser ON advertiser.advertiser_id = publisher.advertiser_id WHERE website.name = 'steadyhealth.com';

  2. List of websites that contain a given advertiser.
    SELECT DISTINCT(website.name) FROM advertiser INNER JOIN publisher ON advertiser.advertiser_id = publisher.advertiser_id INNER JOIN website ON publisher.website_id = website.website_id WHERE advertiser.name = 'brightmountainmedia.com';

  3. List of websites that contain a given advertiser id.
    SELECT DISTINCT(website.name) FROM publisher INNER JOIN website ON website.website_id = publisher.website_id WHERE publisher.account_id = 'pub-2051007210431666';

  4. List of all unique advertisers.
    SELECT name FROM advertiser;
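When running these queries from Java, the hard-coded literal can be replaced with a bound parameter. The following is only a sketch of query 1 using standard JDBC; the class `AdvertiserQueries`, its constant, and its method are hypothetical and not part of the project, and the caller is assumed to supply an open `Connection` to the database created by Init.java.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; illustrates query 1 with a bound parameter. */
final class AdvertiserQueries {
    /** Same joins as query 1, with the website name as a placeholder. */
    static final String ADVERTISERS_ON_WEBSITE =
            "SELECT DISTINCT advertiser.name FROM website "
          + "INNER JOIN publisher ON publisher.website_id = website.website_id "
          + "INNER JOIN advertiser ON advertiser.advertiser_id = publisher.advertiser_id "
          + "WHERE website.name = ?";

    /** Returns the unique advertiser names recorded for the given website. */
    static List<String> advertisersOn(Connection conn, String websiteName) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(ADVERTISERS_ON_WEBSITE)) {
            ps.setString(1, websiteName); // bound value, never concatenated into the SQL
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
        }
        return names;
    }
}
```

Binding the value keeps the statement text fixed, so domain names read from a file cannot alter the query. The other queries could be parameterized in the same way.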
