PDFs are one of the most popular document formats, so it is crucial that their security keeps pace with their usage. This project introduces two approaches to analyzing the security of PDF documents: first, creating adversarial examples of malicious PDFs (through noise injection) to evade standard PDF malware detectors such as VirusTotal, and second, fine-tuning ResNet18 to identify malicious PDFs based on their Markov plots.
Malicious PDFs that can bypass VirusTotal scanning (32/60 tools did not correctly classify the PDF as malicious)
To create a malicious PDF, use ./attacks/find_perturbable.ipynb. This notebook provides methods to find the areas in the PDF that can have noise injected, as well as methods to inject JavaScript attacks. The example in the notebook causes the Calculator app to be opened when the user opens the PDF.
Using the code in ./data/generate_dataset.py, create a .pkl file containing the Markov plot data. This is demonstrated in cells 1-53 of ./finetune.ipynb. Once the datasets have been generated, use the code in ./resnet18_classifier.ipynb to fine-tune ResNet18 to classify PDFs as malicious or safe based on their Markov plots.
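
To illustrate the dataset step, here is a minimal sketch of how a Markov plot could be computed and pickled. It is not the code in ./data/generate_dataset.py. It assumes a Markov plot is the 256x256 byte-transition matrix of a file, where entry [i][j] estimates the probability that byte j follows byte i. The function names markov_plot and save_dataset and the record layout are hypothetical.

```python
import pickle
from pathlib import Path


def markov_plot(data: bytes) -> list[list[float]]:
    """Return a 256x256 matrix; entry [i][j] estimates P(next byte = j | current byte = i).

    Assumption: this byte-transition definition may differ from the one used
    in the repository.
    """
    counts = [[0] * 256 for _ in range(256)]
    # Count every adjacent (current, following) byte pair in the file.
    for current, following in zip(data, data[1:]):
        counts[current][following] += 1

    plot = []
    for row in counts:
        total = sum(row)
        # Rows for bytes that never precede another byte stay all zero.
        plot.append([c / total if total else 0.0 for c in row])
    return plot


def save_dataset(samples: list[tuple[bytes, int]], out_path: Path) -> None:
    """Pickle a list of {"plot", "label"} records (hypothetical layout).

    Each sample is (raw file bytes, label), with label 1 for malicious
    and 0 for safe in this sketch.
    """
    records = [{"plot": markov_plot(data), "label": label} for data, label in samples]
    with open(out_path, "wb") as fh:
        pickle.dump(records, fh)
```

Each sample's bytes could come from something like Path("example.pdf").read_bytes(). The resulting matrix can be treated as a single-channel 256x256 image, which is the kind of input an image classifier such as ResNet18 can be adapted to consume.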