A semantic-based network representation of the web archive

AmrSheta22/meshwarc

meshwarc

A semantic-based network representation of the web archive. This project processes a JSON file of HTML data (which can be produced with warchtml, available at https://github.com/AmrSheta22/warchtml) and outputs a graph based on the similarity between each pair of HTML pages. It then uses topic modelling to group pages on the same topic into clusters.
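
As a rough illustration of the similarity-graph step, the sketch below links two pages whenever the cosine similarity of their vectors exceeds a threshold. It is not the project's actual code: the function names, the toy vectors, and the choice of cosine similarity with a fixed threshold are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold):
    """Return (source, target, weight) for every pair scoring above `threshold`.

    `embeddings` maps a page identifier to its vector. Both are hypothetical
    stand-ins for whatever representation the real pipeline uses.
    """
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        weight = cosine_similarity(vec_a, vec_b)
        if weight > threshold:
            edges.append((id_a, id_b, weight))
    return edges


# Toy example with made-up 3-dimensional vectors.
pages = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.9, 0.1, 0.0],
    "page_c": [0.0, 0.0, 1.0],
}
edges = build_edges(pages, threshold=0.5)
```

Only page_a and page_b end up connected here, because page_c is orthogonal to both. Comparing every pair is quadratic in the number of pages, so a real run over a large archive would likely need something more economical.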

Usage

You can install the project's requirements with the following command:

pip install -r requirments.txt

To run the script on an input file, use a command like this:

python3 main.py -i ./data/ -o ./data/ -n 200 -d 10 -e ./data/embeddings.txt -p 90 -s 0.37 -t 10

The script outputs pickled lists, which are used to speed up a rerun if an error occurs midway through a run, and three CSV files: edges, nodes and merged_tfidf. The edges and nodes files can be imported directly into Gephi to display the data as a graph. Note that cluster 0 consists of very small clusters lumped together.
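
The idea of reusing pickled intermediate results can be sketched as a small load-or-compute helper. This is only an illustration of the general pattern, not the project's implementation: the helper name, the file name titles.pkl, and the example step are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled result stored at `path`, or compute and save it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_step():
    """Hypothetical stand-in for a slow pipeline stage."""
    calls.append(1)
    return ["title one", "title two"]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "titles.pkl")  # hypothetical file name
    first = load_or_compute(checkpoint, expensive_step)   # computes and saves
    second = load_or_compute(checkpoint, expensive_step)  # loads from disk
```

With this pattern, a second run skips any stage whose pickle already exists, so a stale pickle has to be deleted by hand when the input changes.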

Parameters

Each parameter is explained by the -help argument, but some may still be unclear, so they are explained here:

  1. -n or --nclusters : Setting this to 200 will not necessarily produce 200 clusters in the output nodes. The data is divided 200 times, but some divisions end up noisy and contain virtually no data; these are filtered out automatically by the code. As a rule of thumb, set the number of clusters to about 1/100 of the data size to produce meaningful clusters.
  2. -d or --ndivisable : The default value of 100 is probably good enough, but you can decrease it if your data is noisy. Note that when decreasing it, it is advisable to increase --nclusters .
  3. -p or --percentage_filtered : The default is 90, but you can increase or decrease it if the resulting keywords are unsatisfactory.
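
The 1/100 rule of thumb for --nclusters can be written as a one-line helper. This is a convenience sketch, not part of the project: the function name and the lower bound of 2 are assumptions.

```python
def suggest_nclusters(n_pages, ratio=100, minimum=2):
    """Suggest a value for --nclusters as roughly 1/`ratio` of the data size.

    `minimum` is an assumed floor so that tiny inputs still get split.
    """
    return max(minimum, n_pages // ratio)


suggested = suggest_nclusters(20000)
```

For an archive of 20,000 pages this suggests 200, which matches the value passed to -n in the example command above.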

Output

After importing the nodes and edges into Gephi, you can reproduce the graph shown in the following screenshot: image
To do so, use the OpenOrd layout with the parameters shown here:
image

Then use a preset with these details:
image

The following images are zoomed-in screenshots of the graph, which uses a notable color configuration:

  1. Cluster 0 (the combined small clusters) is colored black.
  2. Clusters containing less than 2 percent of the data are colored grey.

You can toggle the label to show the URL instead of the HTML title if you want.
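
The coloring rule for cluster 0 and for small clusters can be sketched as a function over cluster labels. This is a hedged illustration of the rule only: in the screenshots the colors were set in Gephi, and the function name, the label list, and the use of None for "leave to the palette" are assumptions.

```python
from collections import Counter


def cluster_colors(labels, small_share=0.02):
    """Map each cluster id to "black", "grey", or None (left to the default palette)."""
    counts = Counter(labels)
    total = len(labels)
    colors = {}
    for cluster, count in counts.items():
        if cluster == 0:
            colors[cluster] = "black"  # the combined small clusters
        elif count / total < small_share:
            colors[cluster] = "grey"   # under 2 percent of the data
        else:
            colors[cluster] = None     # no special color
    return colors


# Made-up labels: 10 nodes in cluster 0, 89 in cluster 1, 1 in cluster 2.
labels = [0] * 10 + [1] * 89 + [2] * 1
colors = cluster_colors(labels)
```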

image


image

image

image

image
