Web Information Retrieval Utility

About: The Web Information Retrieval Tool searches, crawls, and analyzes web content. Its key features are:

Web Crawling with Jsoup: The tool uses Jsoup to connect to websites, download HTML content, and work with HTML elements, attributes, and text through DOM selectors. It also handles encoding when converting HTML content to plain text.
URL Validation with Regex: Links found during crawling are validated with regular expressions, so that invalid or irrelevant links are filtered out.
Creating a Dictionary of Words: The project uses a HashSet to build a set of the unique words found in the text files, with stop-words removed. The resulting dictionary is saved as a .txt file and is used by other features such as pattern search and spell check.
Search Word and Spell Check: The tool searches for words using regular expressions and checks spelling using the edit distance algorithm. Results are sorted with a priority queue and stored in a hash table for quick retrieval.
Word Suggestion: A Ternary Search Tree (TST) supplies word suggestions so that users can correct spelling errors.
Frequency Counter: This feature counts how often specific words occur across the crawled websites, which gives insight into website themes and trends.
Storing History: History is stored using priority queues and HashMaps. The priority queues order entries by priority, while the HashMaps allow fast retrieval of historical information.
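
The dictionary and spell check features above can be illustrated with a minimal sketch. It assumes a small in-memory HashSet as the dictionary and ranks candidate words by edit distance using a priority queue. The class name SpellCheckSketch, the method names, and the sample words are hypothetical and are not taken from the project's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Illustrative sketch only; names here are assumptions, not the project's API.
class SpellCheckSketch {

    // Dynamic-programming edit distance (insert, delete, substitute each cost 1).
    static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    // Returns up to `limit` dictionary words, closest to `word` first.
    // Ties are broken alphabetically so the order is deterministic.
    static List<String> suggest(String word, Set<String> dictionary, int limit) {
        PriorityQueue<String> queue = new PriorityQueue<>(
                Comparator.comparingInt((String w) -> editDistance(word, w))
                        .thenComparing(Comparator.<String>naturalOrder()));
        queue.addAll(dictionary);
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty() && result.size() < limit) {
            result.add(queue.poll());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary; the project builds its own from crawled text files.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("search", "crawl", "parse", "spell", "seat"));
        System.out.println(suggest("serch", dictionary, 2));
    }
}
```

The comparator recomputes the distance on every comparison, which keeps the sketch short. For a large dictionary, computing each distance once and queueing (word, distance) pairs would likely be a better fit.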

How to run

Import the project into Eclipse. Make sure the empty folders "HTMLFiles" and "TextFiles" exist in the project root directory.
Use Java 8 to build the project.
Run the SearchEngine.java file inside the 'main' package.
Enter the URL in the console. Example: https://walmart.com
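
A URL typed at the console is the kind of input that regex-based URL validation is meant to check. The sketch below shows one way such a check could look. The class name UrlValidationSketch, the helper isValidUrl, and the pattern itself are assumptions for illustration. The project's actual regular expression is not shown in this README and may accept a different set of URLs.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; the pattern is an assumption, not the project's regex.
class UrlValidationSketch {

    // http or https scheme, a dotted host name ending in an alphabetic label,
    // an optional port, and an optional path or query without whitespace.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "^https?://([A-Za-z0-9-]+\\.)+[A-Za-z]{2,}(:\\d{1,5})?(/\\S*)?$");

    static boolean isValidUrl(String url) {
        return url != null && URL_PATTERN.matcher(url.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("https://walmart.com")); // accepted by the pattern
        System.out.println(isValidUrl("walmart.com"));         // rejected: no scheme
    }
}
```

This pattern is deliberately strict about the scheme, so an input such as walmart.com would need https:// in front of it before being passed to the crawler.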

navjotmakkar/Web_Information_Retrieval_Tool

Folders and files

Latest commit

History

Repository files navigation

Web Information Retrieval Utility

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages