A semantic-based network representation of the web archive. In this project we process a JSON file of HTML data (which can be produced using warchtml, available at https://github.com/AmrSheta22/warchtml) to output a graph based on the similarity between each pair of HTML pages, then use topic modelling to group the pages into clusters that share a topic.
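To make the idea of a similarity graph concrete, here is a minimal sketch, not the project's actual implementation. It assumes each page has already been turned into a numeric vector and that two pages are linked when their cosine similarity reaches a threshold. The names `cosine_similarity`, `build_edges` and the `pages` dictionary are hypothetical, and treating 0.37 (the `-s` value in the example command) as a cosine threshold is an assumption.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold=0.37):
    """Return (source, target, weight) for every pair at or above the threshold.

    `embeddings` maps a page identifier to its vector (hypothetical layout).
    """
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        weight = cosine_similarity(vec_a, vec_b)
        if weight >= threshold:
            edges.append((id_a, id_b, weight))
    return edges


# Toy vectors standing in for real page embeddings.
pages = {
    "a.html": [1.0, 0.0, 0.0],
    "b.html": [1.0, 1.0, 0.0],
    "c.html": [0.0, 0.0, 1.0],
}
edges = build_edges(pages)
```

With these toy vectors only the pair `a.html` and `b.html` is similar enough to become an edge. Comparing every pair is quadratic in the number of pages, so this form is only meant for illustration.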
You can install the project's requirements with the following command:
pip install -r requirments.txt
To run the script on an input file, use a command like this:
python3 main.py -i ./data/ -o ./data/ -n 200 -d 10 -e ./data/embeddings.txt -p 90 -s 0.37 -t 10
The code outputs pickled lists, which are used to speed up a rerun if an error occurs partway through, and three CSV files: edges, nodes and merged_tfidf. You can use the edges and nodes files as input to Gephi to display a graph of the data directly. Note that cluster 0 consists of very small clusters lumped together.
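The pickled lists act as checkpoints. The following is a minimal sketch of that general pattern, assuming one pickle file per intermediate result; the helper `load_or_compute`, the step `expensive_step` and the file name `cleaned_pages.pkl` are hypothetical and are not taken from the project's code.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled result at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records how many times the expensive step really ran


def expensive_step():
    """Stand-in for a slow stage such as cleaning the HTML text."""
    calls.append(1)
    return ["page one text", "page two text"]


with tempfile.TemporaryDirectory() as workdir:
    cache_path = os.path.join(workdir, "cleaned_pages.pkl")
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # loads from disk
```

The second call reads the saved list instead of recomputing it, which is what lets a rerun skip stages that already finished. Only unpickle files that you created yourself, since loading an untrusted pickle can execute arbitrary code.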
Each parameter is explained in the -help argument, but some arguments may not be clear, so I will explain them here:
- -n or --nclusters: Setting this to 200 will not necessarily produce 200 clusters in the output nodes. The data is divided 200 times, but some divisions end up noisy and contain virtually no data, and these are filtered out automatically by the code. As a rule of thumb, set the number of clusters to about 1/100 of the data size to produce meaningful clusters.
- -d or --ndivisable: The default value of 100 will probably be good enough. If your data is noisy you can decrease it, but note that when decreasing it, it is advised to increase --nclusters.
- -p or --percentage_filtered: 90 by default. You can increase or decrease it if you find the keywords unsatisfactory.
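The 1/100 rule of thumb and the lumping of tiny clusters into cluster 0 can be illustrated with a short sketch. This is an assumption about the general idea, not the project's code: the functions `suggested_nclusters` and `lump_small_clusters` and the `min_size` cutoff are hypothetical.

```python
from collections import Counter


def suggested_nclusters(n_pages):
    """Rule of thumb from the text: about 1/100 of the data size, at least 1."""
    return max(1, n_pages // 100)


def lump_small_clusters(labels, min_size):
    """Relabel clusters with fewer than `min_size` members as cluster 0.

    `min_size` is a hypothetical cutoff chosen for this example.
    """
    sizes = Counter(labels)
    return [label if sizes[label] >= min_size else 0 for label in labels]


# Toy labels: clusters 2 and 4 have a single member each.
labels = [1, 1, 1, 2, 3, 3, 3, 3, 4]
merged = lump_small_clusters(labels, min_size=3)
```

Under this rule, a collection of 20,000 pages would suggest `-n 200`, and in the toy labels the two single-member clusters end up in cluster 0.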
After loading the nodes and edges into Gephi, if you want the graph to appear as it does in the following screenshot:
You can use OpenOrd layout with the parameters shown here:
Then use a preset with these details:
The following images are screenshots of the zoomed-in graph, which uses a notable color configuration where:
- Cluster 0 (the combined small clusters) is colored in black.
- Clusters with less than 2 percent of the data are colored in grey.

You can toggle the label to show the URL instead of the HTML title if you want.
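The coloring rule above can be written down as a small sketch, in case you want to precompute a color column rather than set it by hand in Gephi. The function `cluster_colors` and the `"palette"` placeholder (meaning "give this cluster its own distinct color") are hypothetical and not part of the project.

```python
from collections import Counter


def cluster_colors(labels, small_share=0.02):
    """Map each cluster id to a color category based on its share of the data."""
    sizes = Counter(labels)
    total = len(labels)
    colors = {}
    for cluster, size in sizes.items():
        if cluster == 0:
            colors[cluster] = "black"    # the combined small clusters
        elif size / total < small_share:
            colors[cluster] = "grey"     # under 2 percent of the data
        else:
            colors[cluster] = "palette"  # gets its own distinct color
    return colors


# Toy data: 100 nodes, cluster 2 holds 1 percent of them.
labels = [0] * 5 + [1] * 94 + [2] * 1
colors = cluster_colors(labels)
```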