Update (30/07/2024): We are currently resolving some issues related to OpenWPM compatibility with Firefox for WebGraph crawls. We will release fixes soon. In the meantime, a trained model of WebGraph on 20k sites can be found here: https://drive.google.com/drive/folders/1IivW179R5YHYCqOMFH77LxyZ0aJixqFV?usp=drive_link . This model was trained on April 6-8 2023.
Artifact release for the paper "WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking", published at USENIX Security 2022.
This project has been run and tested on Ubuntu 18.04.
First, make sure you have `python3`, `binutils`, `pip`, `gcc`, and `g++` installed. Otherwise, run the following command:

```
apt-get install binutils python3-dev python3-pip gcc g++
```
The crawl data used for all WebGraph tasks (graph building, feature extraction, and classification) is collected with a custom version of OpenWPM. Follow the instructions here to set up OpenWPM in your environment.
After OpenWPM is installed, activate the conda environment if you haven't already:

```
conda activate openwpm
```

Then go into `<project-directory>/code` and install the Python libraries listed in `requirements.txt`:

```
pip install -r requirements.txt
```
To generate the crawl data needed for the pipeline, run a crawl using the installed OpenWPM tool. First, update the script `demo.py` to read in the list of sites that you want to visit, then run `demo.py`.
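As a rough illustration, the site list in `demo.py` can be read from a plain-text file instead of being hard-coded. The variable name `sites` and the file name `sites.txt` below are assumptions; check your copy of `demo.py` for its actual structure.

```python
# Hypothetical snippet for demo.py: read the sites to crawl from a text file
# (one URL per line) instead of using the hard-coded list.
with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]
```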
After you run the demo, a `datadir` folder will be created in your `demo` directory. Inside it, you will find the two database files used by our pipeline: `crawl-db.sqlite` and `content.ldb`.
The codebase consists of two pipelines: WebGraph and Robustness. We describe each of them below.
This pipeline runs the WebGraph system, which is a graph-based Ad and Tracking Services (ATS) detection system. WebGraph takes in crawl data, builds graph representations of sites, extracts features and labels from these representations, and trains a machine learning model.
With the WebGraph code, we present two tasks that you can run:
- Graph Preprocessing and Feature building
- Classification (training and testing)
In this task, WebGraph constructs the dataset for classification by:

- taking your sqlite and leveldb database files, building a graph representation of each crawl as explained in the paper, and exporting it in a tabular format to a `graph.csv` file and a `features.csv` file
- applying the rules from public filter lists to label the nodes in each graph and exporting the labels in a tabular format to a `labelled.csv` file
To run this task, run the following script:

```
python <project-directory>/code/run.py --input-db <location-to-datadir>/datadir/crawl-db.sqlite --ldb <location-to-datadir>/datadir/content.ldb --mode webgraph
```
All arguments accepted by this command:

- `--input-db`: the path to the `.sqlite` file generated by the crawl
- `--ldb`: the path to the `.ldb` file generated by the crawl
- `--features`: the path to the `.yaml` feature categories list. A default `features.yaml` is used if unspecified.
- `--filters`: the path to the directory to save the filter lists in. A default `filterlists` folder will be created if unspecified.
- `--out`: the path to the directory for the output `.csv` files
- `--mode`: the system to run (`webgraph` or `adgraph`)
Note: with the `--mode` argument, you can also run AdGraph (we evaluate AdGraph in Section 3 of the paper).
The classification step takes in the output from Step 1 (features and labels) and performs cross-validation on the data (a minimal sketch of what this step does is shown after the argument list below). To run this task, run the following script:

```
python <project-directory>/code/classification/classify.py --features features.csv --labels labels.csv --out results --save False --probability False --interpret False
```
Arguments of this command:

- `--features`: the path to the `features.csv` file
- `--labels`: the path to the labels file
- `--out`: the path to the directory of the output files
- `--save`: whether to save the trained model
- `--probability`: whether to save prediction probabilities
- `--interpret`: whether to run interpretation on results
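For illustration, the snippet below sketches the kind of cross-validation this step performs, assuming `features.csv` and `labelled.csv` can be joined on `visit_id` and `name` and that the remaining feature columns are numeric. This is a simplified sketch, not the repository's `classify.py` (the paper uses a random forest classifier).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Join the feature table with the filter-list labels (column names assumed).
features = pd.read_csv("features.csv")
labels = pd.read_csv("labelled.csv")
data = features.merge(labels[["visit_id", "name", "label"]], on=["visit_id", "name"])

# Drop identifier columns; everything left is treated as a numeric feature.
X = data.drop(columns=["visit_id", "name", "top_level_url", "label"], errors="ignore")
y = data["label"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="accuracy"))
```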
This pipeline runs the robustness experiments performed in the paper. There are two types of robustness experiments: content and structure mutations. All the code and READMEs associated with these experiments are in the `robustness` folder.
The output of the WebGraph pipeline is three files: `graph.csv`, `features.csv`, and `labelled.csv`.

These are the columns present in the graph output in `graph.csv`:
Column | Applies to | Description |
---|---|---|
visit_id | All | The visit ID of the crawl |
name | All | The name of the node or edge |
graph_attr | All | Whether the row is a Node or an Edge |
top_level_url | All | The top-level URL (page being visited) |
attr | All | Additional attributes of nodes and edges |
domain | All | The parent domain of nodes or edges |
top_level_domain | All | The top-level domain (domain of the page being visited) |
type | Node | The type of the node (e.g., `Document`) |
document_url | Node | Context of a script's execution |
setter | Node | The name of the node that sets a storage node |
setting_time_stamp | Node | Time stamp of the storage node setting |
setter_domain | Node | Domain of the node that sets a storage node |
party | Node | The partiness of the node: first, third, or N/A |
src | Edge | The source node name of the edge |
dst | Edge | The destination node name of the edge |
reqattr | Edge | HTTP request headers |
respattr | Edge | HTTP response headers |
response_status | Edge | HTTP response status |
content_hash | Edge | Content hash, if logged by OpenWPM |
post_body | Edge | POST request body hash |
post_body_raw | Edge | Raw POST request body |
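For a quick look at the graph output, the table above can be split into nodes and edges via the `graph_attr` column. Below is a minimal sketch with pandas, assuming `graph.csv` is in the current directory:

```python
import pandas as pd

# Hypothetical inspection of graph.csv: separate node rows from edge rows.
graph = pd.read_csv("graph.csv")
nodes = graph[graph["graph_attr"] == "Node"]
edges = graph[graph["graph_attr"] == "Edge"]
print(len(nodes), "nodes,", len(edges), "edges")
print(nodes["type"].value_counts())  # distribution of node types
```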
The features used in `features.csv` are described in `features.yaml`.
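To see which feature categories are enabled, you can load the YAML file directly. The snippet below is a minimal sketch that assumes `features.yaml` lives in the `code` folder and that `pyyaml` is installed.

```python
import yaml

# Print the feature configuration; the exact structure of features.yaml may
# vary, so we simply dump the parsed contents rather than assume any keys.
with open("code/features.yaml") as f:
    feature_config = yaml.safe_load(f)
print(feature_config)
```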
Nodes are labelled `True` or `False` depending on whether they are blocked by filter lists. These are the columns present in the `labelled.csv` file:
Column | Description |
---|---|
visit_id | The visit id of the crawl |
top_level_url | The top level URL (page being visited) |
name | The name of the node |
label | The label of the node |
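As a quick sanity check, the label distribution can be inspected with pandas; the sketch below only assumes the column names listed above.

```python
import pandas as pd

# Hypothetical check of the filter-list labels produced by the pipeline.
labels = pd.read_csv("labelled.csv")
print(labels["label"].value_counts())                            # overall True/False counts
print(labels.groupby("top_level_url")["label"].value_counts())   # per-page breakdown
```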
The WebGraph pipeline is in the `code` folder. The Robustness pipeline is in the `robustness` folder.
WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking
Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso
USENIX Security Symposium (USENIX), 2022
Abstract -- Users rely on ad and tracker blocking tools to protect their privacy. Unfortunately, existing ad and tracker blocking tools are susceptible to mutable advertising and tracking content. In this paper, we first demonstrate that a state-of-the-art ad and tracker blocker, AdGraph, is susceptible to such adversarial evasion techniques that are currently deployed on the web. Second, we introduce WebGraph, the first ML-based ad and tracker blocker that detects ads and trackers based on their action rather than their content. By featurizing the actions that are fundamental to advertising and tracking information flows – e.g., storing an identifier in the browser or sharing an identifier with another tracker – WebGraph performs nearly as well as prior approaches, but is significantly more robust to adversarial evasions. In particular, we show that WebGraph achieves comparable accuracy to AdGraph, while significantly decreasing the success rate of an adversary from near-perfect for AdGraph to around 8% for WebGraph. Finally, we show that WebGraph remains robust to sophisticated adversaries that use adversarial evasion techniques beyond those currently deployed on the web.
The full paper can be found here.
If you use the code/data in your research, please cite our work as follows:
```
@inproceedings{Siby22WebGraph,
  title = {WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking},
  author = {Sandra Siby and Umar Iqbal and Steven Englehardt and Zubair Shafiq and Carmela Troncoso},
  booktitle = {USENIX Security Symposium (USENIX)},
  year = {2022}
}
```
In case of questions, please get in touch with Sandra Siby.
Thanks to Laurent Girod and Saiid El Hajj Chehade for helping test and improve the code.