meshwarc

A semantic network representation of a web archive. This project processes a JSON file of HTML data (which can be produced with warchtml, https://github.com/AmrSheta22/warchtml) and outputs a graph based on the similarity between each pair of HTML pages. It then uses topic modelling to group pages that share a topic into clusters.
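
To make the idea of a similarity graph concrete, here is a minimal sketch, not the project's actual code. It assumes each page is already represented by an embedding vector, that similarity is measured with cosine similarity, and that an edge is kept when the similarity reaches a threshold. The names `cosine_similarity` and `build_edges`, the toy URLs and vectors, and the use of 0.37 as a threshold are all illustrative assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold):
    """Return (source, target, weight) for every page pair at or above threshold."""
    edges = []
    for (url_a, vec_a), (url_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            edges.append((url_a, url_b, sim))
    return edges


# Toy, made-up embeddings keyed by page URL (hypothetical data).
pages = {
    "https://example.org/a": [1.0, 0.0, 0.0],
    "https://example.org/b": [0.9, 0.1, 0.0],
    "https://example.org/c": [0.0, 0.0, 1.0],
}

# 0.37 is only an illustrative threshold chosen for this sketch.
edges = build_edges(pages, threshold=0.37)
for source, target, weight in edges:
    print(source, target, round(weight, 3))
```

Comparing every pair of pages is quadratic in the number of pages, which is one reason caching intermediate results is useful on larger archives.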

Usage

You can install the project's requirements with the following command:

pip install -r requirments.txt

To run the script on an input file, use a command like this:

python3 main.py -i ./data/ -o ./data/ -n 200 -d 10 -e ./data/embeddings.txt -p 90 -s 0.37 -t 10

The code outputs pickled lists, which are used to speed up a rerun if an error occurs partway through, and three CSV files: edges, nodes and merged_tfidf. You can use the edges and nodes files directly as input to Gephi to display a graph of the data. Note that cluster 0 consists of very small clusters lumped together.
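
The pickled lists act as checkpoints. As a rough illustration of that pattern, and not the project's own implementation, the sketch below loads a pickled result if it exists and otherwise computes and saves it. The helper `load_or_compute`, the function `expensive_step` and the file name `cleaned_pages.pkl` are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled result stored at path, or compute, save and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how many times the expensive step really ran


def expensive_step():
    """Stand-in for a slow stage such as cleaning or embedding pages."""
    calls.append(1)
    return ["page one text", "page two text"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cleaned_pages.pkl")  # hypothetical file name
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # loads from disk

print(len(calls))
```

With this pattern, deleting a pickle file forces the corresponding stage to run again on the next invocation.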

Parameters

Each parameter is explained in the -help argument, but some arguments may not be clear, so I will explain them here:

-n or --nclusters : Setting this to 200 won't necessarily produce 200 clusters in the output nodes. The data is divided 200 times, but some divisions end up noisy and contain virtually no data; these are filtered out automatically by the code. It's generally good to set the number of clusters to 1/100 of the data size to produce meaningful clusters.
-d or --ndivisable : The default value of 100 will probably be good enough, but if your data is noisy you can decrease it. Note that when decreasing it, it's advised to increase --nclusters .
-p or --percentage_filtered : It's 90 by default, but you can increase or decrease it if you find the keywords unsatisfying.
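
The 1/100 rule of thumb for --nclusters can be written as a one-line calculation. The helper below is only a convenience sketch; `suggest_nclusters` is a hypothetical name and is not part of the project.

```python
def suggest_nclusters(n_pages, ratio=100):
    """Rule of thumb: roughly one cluster per `ratio` pages, and at least one."""
    if n_pages < 0:
        raise ValueError("n_pages must be non-negative")
    return max(1, n_pages // ratio)


# An archive of 20,000 pages suggests 200 clusters.
print(suggest_nclusters(20000))
```

Treat the result as a starting point: as noted above, the number of clusters that survive filtering may be lower than the value passed in.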

Output

After loading the nodes and edges into Gephi, if you want the graph to appear as it does in the following screenshot:
You can use OpenOrd layout with the parameters shown here:

Then use a preset with these details:

The following images are screenshots of the zoomed-in graph, which uses a notable color configuration where:

Cluster 0 (the combined small clusters) is colored black.
Clusters with less than 2 percent of the data are colored grey.

You can toggle the label to show the URL instead of the HTML title if you want.
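
The coloring rule above was applied in Gephi, but it can also be expressed in a few lines if you want to precompute a color column. This is a sketch under assumptions: `cluster_colors` is a hypothetical helper, the input is a plain list with one cluster id per node, and "palette" stands for whatever distinct color Gephi assigns to the larger clusters.

```python
from collections import Counter


def cluster_colors(cluster_ids, min_share=0.02):
    """Map each cluster id to 'black', 'grey' or 'palette' by its share of nodes."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    colors = {}
    for cluster_id, count in counts.items():
        if cluster_id == 0:
            colors[cluster_id] = "black"    # the combined small clusters
        elif count / total < min_share:
            colors[cluster_id] = "grey"     # under 2 percent of the data
        else:
            colors[cluster_id] = "palette"  # keeps its own distinct color
    return colors


# Toy assignment: 100 nodes spread over three clusters (made-up data).
assignments = [0] * 5 + [1] * 94 + [2] * 1
colors = cluster_colors(assignments)
print(colors)
```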
