This is the master repo; this branch should always contain a working project. Please create a branch when updating the code.

cs172/project

Spring 2019 CS172 Project

The program is a multithreaded web crawler for .gov pages, written in Java. It uses the seed.txt file to seed the crawler with .gov URLs, follows hyperlinks on the seed pages to other sites, and downloads a page (as an HTML file) if its link corresponds to a .gov site. The jsoup library is used to parse the HTML pages. You can specify the path to the seed URL file, the maximum number of sites, and the maximum hop distance from the seed URLs on the command line.
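The crawl described above can be sketched as a breadth-first traversal with a site limit and a hop limit. This is a minimal, single-threaded illustration under stated assumptions, not the project's actual code: the class name `GovCrawlSketch`, the methods `isGovUrl` and `crawl`, and the `linksOf` function are all hypothetical. In the real crawler, `linksOf` would correspond to downloading a page and extracting its hyperlinks with jsoup, and the work would be spread over several threads.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; names here are illustrative and not taken from the project.
final class GovCrawlSketch {

    // Returns true when the URL parses and its host ends with ".gov".
    static boolean isGovUrl(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && host.toLowerCase().endsWith(".gov");
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat as not crawlable
        }
    }

    // Visits pages level by level, starting from the seeds (hop 0).
    // Stops after maxSites pages or once maxHops levels past the seeds are done.
    // linksOf is an assumed stand-in for "fetch the page and extract its links".
    static List<String> crawl(List<String> seeds, int maxSites, int maxHops,
                              Function<String, List<String>> linksOf) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        List<String> level = new ArrayList<>();
        for (String seed : seeds) {
            if (isGovUrl(seed) && seen.add(seed)) {
                level.add(seed);
            }
        }
        for (int hop = 0; hop <= maxHops && !level.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String url : level) {
                if (visited.size() >= maxSites) {
                    return visited;
                }
                visited.add(url); // a real crawler would save the HTML here
                for (String link : linksOf.apply(url)) {
                    if (isGovUrl(link) && seen.add(link)) {
                        next.add(link); // queue unseen .gov links for the next hop
                    }
                }
            }
            level = next;
        }
        return visited;
    }
}
```

The `seen` set keeps a page from being queued twice, and filtering on the host rather than on the whole URL string avoids matching ".gov" when it only appears in a path.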

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • If Java is not installed, install Java for your operating system.
  • Unzip the project.
  • Open a terminal:
Windows:
Press WindowsKey+r then type "cmd"

Linux(Ubuntu):
Press Ctrl+Alt+t
  • Change to the project directory:
Windows: (in CMD terminal)
Type cd path\to\the\project\folder 

Linux(Ubuntu):
Type cd path/to/the/project/folder
  • The project directory must contain a folder named "storage" for the program to run. This folder is already included in the project zip file. If it gets deleted, create another one in the project directory.
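As an alternative to recreating the "storage" folder by hand, a program can create it at startup if it is missing. The following is a minimal sketch of that idea, not something the project is known to do; the class `StorageDirSketch` and the method `ensureStorageDir` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of the project's code.
final class StorageDirSketch {

    // Creates <projectDir>/storage if it does not exist and returns its path.
    // Files.createDirectories does nothing when the directory is already there.
    static Path ensureStorageDir(Path projectDir) {
        try {
            return Files.createDirectories(projectDir.resolve("storage"));
        } catch (IOException e) {
            throw new UncheckedIOException("Could not create storage folder", e);
        }
    }
}
```

Calling `StorageDirSketch.ensureStorageDir(java.nio.file.Paths.get("."))` from the project directory would create the folder next to the compiled classes.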

Instructions

Easiest method:

Type in the terminal:

./run.sh [seed file path] [max sites] [max hop distance] 

If that does not work, compile and run the program manually:

Compiling the Program

Windows: (in CMD terminal)
javac -d . -cp ".;./jsoup-1.11.3.jar;commons-io-2.6.jar" *.java

Linux(Ubuntu):
javac -d . -cp ".:./jsoup-1.11.3.jar:commons-io-2.6.jar" *.java

Running the Program

In the terminal:

Argument Template: .SpiderTest [seed file path] [max sites] [max hop distance]

Windows: (in CMD terminal)
java -cp ".;./jsoup-1.11.3.jar;commons-io-2.6.jar" com.ucr.cs172.project.crawler.SpiderTest ./seeds.txt 2000 500


Linux(Ubuntu):
java -cp ".:./jsoup-1.11.3.jar:commons-io-2.6.jar" com.ucr.cs172.project.crawler.SpiderTest ./seeds.txt 2000 500 
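The three command-line arguments shown above (seed file path, max sites, max hop distance) could be read and validated along the following lines. This is a sketch under assumptions: the class `CrawlerArgsSketch`, its fields, and the validation rules (at least one site, a non-negative hop distance) are hypothetical and may differ from what SpiderTest actually does.

```java
// Hypothetical argument holder; not the project's SpiderTest class.
final class CrawlerArgsSketch {
    final String seedFilePath;
    final int maxSites;
    final int maxHops;

    private CrawlerArgsSketch(String seedFilePath, int maxSites, int maxHops) {
        this.seedFilePath = seedFilePath;
        this.maxSites = maxSites;
        this.maxHops = maxHops;
    }

    // Parses [seed file path] [max sites] [max hop distance].
    // Throws IllegalArgumentException on a wrong count or invalid numbers
    // (NumberFormatException from parseInt is a subclass of it).
    static CrawlerArgsSketch parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: [seed file path] [max sites] [max hop distance]");
        }
        int maxSites = Integer.parseInt(args[1]);
        int maxHops = Integer.parseInt(args[2]);
        if (maxSites < 1 || maxHops < 0) {
            throw new IllegalArgumentException(
                "max sites must be at least 1 and max hop distance must not be negative");
        }
        return new CrawlerArgsSketch(args[0], maxSites, maxHops);
    }
}
```

With the example command above, `parse` would receive `./seeds.txt`, `2000`, and `500`, that is, a limit of 2000 sites and at most 500 hops from the seed URLs.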

Built With

  • GitHub - Remote Repo for team collaboration
  • jsoup - Java HTML parser library

Authors

  • Raudel Blazquez Munoz
  • Ji Houn Huh
  • Juan Ceja
