Crawler ads.txt

--Implementation and assumptions

This implementation follows the Ads.txt specification v1.0.1. Where the specification or the requirements left room for interpretation, certain assumptions were made; they are described below.

The crawler parses ads.txt responses that return a success status (2xx). It follows redirects (3xx) any number of times as long as the root domain stays the same, and allows at most one hop to a different domain (delegation). A few websites returned ads.txt content with a 4xx status; those responses are treated as failures and are not saved to the database.
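The redirect rule above can be sketched as a small check over a redirect chain. This is a minimal illustration, not the project's actual code: the class and method names are hypothetical, the root domain is guessed naively from the last two host labels (a real crawler would need a public suffix list to handle hosts such as example.co.uk), and every redirect whose target lies outside the original root domain is counted as a domain-change hop.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names are illustrative and not taken from the project.
final class RedirectPolicy {

    // Naive root-domain guess (assumption): keep the last two labels of the host.
    static String rootDomain(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String[] labels = h.split("\\.");
        if (labels.length <= 2) {
            return h;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Returns true when the chain stays on the original root domain,
    // apart from at most one redirect that targets a different root domain.
    static boolean isChainAllowed(String startUrl, List<String> redirectUrls) {
        String startHost = URI.create(startUrl).getHost();
        if (startHost == null) {
            return false; // no usable host in the starting URL
        }
        String originalRoot = rootDomain(startHost);
        int externalHops = 0;
        for (String url : redirectUrls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                return false; // redirect target without a host is treated as an error
            }
            if (!rootDomain(host).equals(originalRoot)) {
                externalHops++;
                if (externalHops > 1) {
                    return false; // more than one hop outside the original root domain
                }
            }
        }
        return true;
    }
}
```

Under these assumptions, a chain that bounces between subdomains of the original site is accepted, a single delegation to a third-party host is accepted, and a second redirect to any host outside the original root domain is rejected.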

ads.txt responses with a content-type other than text/plain are ignored, although I observed a few websites that served a valid ads.txt file as text/html. Regarding the record format, only the format marked in green is parsed and stored. The second format, the variable format (marked in red), was not mentioned in the requirement document and is therefore ignored.
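A minimal sketch of the content-type check and record parsing described above. It assumes the stored format is the comma-separated data record from the specification (advertising system domain, account ID, account type, optional certification authority ID) and that text after `#` is a comment. The class name, field names, and the strict DIRECT/RESELLER check are illustrative assumptions, not the project's actual code.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical value type for one ads.txt data record.
final class AdsTxtRecord {
    final String advertiserDomain;
    final String accountId;
    final String accountType;

    private AdsTxtRecord(String advertiserDomain, String accountId, String accountType) {
        this.advertiserDomain = advertiserDomain;
        this.accountId = accountId;
        this.accountType = accountType;
    }

    // Accepts "text/plain" with or without parameters such as "; charset=utf-8".
    static boolean isTextPlain(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.equalsIgnoreCase("text/plain");
    }

    // Parses one line; returns empty for comments, blank lines, and anything
    // that does not have three or four comma-separated fields (variable lines
    // such as NAME=VALUE typically fall into this case).
    static Optional<AdsTxtRecord> parse(String rawLine) {
        int hash = rawLine.indexOf('#');
        String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
        String[] fields = line.split(",");
        if (fields.length < 3 || fields.length > 4) {
            return Optional.empty();
        }
        String domain = fields[0].trim().toLowerCase(Locale.ROOT);
        String accountId = fields[1].trim();
        String type = fields[2].trim().toUpperCase(Locale.ROOT);
        if (domain.isEmpty() || accountId.isEmpty()) {
            return Optional.empty();
        }
        if (!type.equals("DIRECT") && !type.equals("RESELLER")) {
            return Optional.empty();
        }
        return Optional.of(new AdsTxtRecord(domain, accountId, type));
    }
}
```

The three retained fields line up with the advertiser name, account_id, and account_type columns in the tables defined further down; the optional fourth field is validated for count only and then dropped.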

Running the project

Requirements:

Java
PostgreSQL 9.6 or above
Postgres driver JAR for JDBC

Process:

Configure your database settings in the main.Setup.java file.
Run Init.java once to create the tables and indexes.
Run Crawler.java to start crawling the ads.txt files of the domains listed in the text file configured in the Setup.java variable DOMAIN_LIST_FILE.
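As a rough illustration of the last step, the sketch below turns the contents of a domain list into the ads.txt URLs to crawl. It assumes the file holds one domain per line, which the text above does not state, and the class and method names are hypothetical. The `/ads.txt` path at the root of the domain comes from the specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; assumes one domain per line in the domain list file.
final class DomainList {

    // Builds the ads.txt URL for each non-blank line, skipping empty lines.
    static List<String> adsTxtUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String raw : lines) {
            String domain = raw.trim().toLowerCase(Locale.ROOT);
            if (domain.isEmpty()) {
                continue;
            }
            urls.add("http://" + domain + "/ads.txt");
        }
        return urls;
    }
}
```

The lines could be read with `java.nio.file.Files.readAllLines(Paths.get(path))`, where `path` would be whatever DOMAIN_LIST_FILE points to.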

--Table creation SQLs

CREATE TABLE website(
website_id SERIAL PRIMARY KEY,
name varchar(100) UNIQUE NOT NULL,
last_crawled_at timestamp
);

CREATE TABLE advertiser(
advertiser_id SERIAL PRIMARY KEY,
name varchar(100) UNIQUE NOT NULL
-- removing tagid as of now
);

CREATE TABLE publisher(
publisher_id SERIAL PRIMARY KEY,
website_id INTEGER NOT NULL REFERENCES website(website_id) ON DELETE CASCADE,
advertiser_id INTEGER NOT NULL REFERENCES advertiser(advertiser_id) ON DELETE CASCADE,
account_id varchar(100) NOT NULL,
account_type varchar(200) NOT NULL,
UNIQUE (website_id, advertiser_id, account_id)
);

--Indexes (Explicitly created ones)

CREATE INDEX ON publisher (advertiser_id);

CREATE INDEX ON publisher (account_id);

--Queries

List of unique advertisers on a website:
SELECT DISTINCT advertiser.name FROM website INNER JOIN publisher ON publisher.website_id = website.website_id INNER JOIN advertiser ON advertiser.advertiser_id = publisher.advertiser_id WHERE website.name = 'steadyhealth.com';

List of websites that contain a given advertiser:
SELECT DISTINCT website.name FROM advertiser INNER JOIN publisher ON advertiser.advertiser_id = publisher.advertiser_id INNER JOIN website ON publisher.website_id = website.website_id WHERE advertiser.name = 'brightmountainmedia.com';

List of websites that contain a given advertiser account id:
SELECT DISTINCT website.name FROM publisher INNER JOIN website ON website.website_id = publisher.website_id WHERE publisher.account_id = 'pub-2051007210431666';

List of all unique advertisers:
SELECT name FROM advertiser;
