WikiSearch

[ Repository made public ]

Hello peeps! We are going to work on our project on GitHub. Search-Engine-Project is a private repository; only those who have been granted access can view or edit it.

~~Our first milestone is to create a Crawler. I will share resources over here. If you find some too, add them here.~~ Done.

~~Second milestone: Parser~~ Parses a page in O(n) :D Done.

~~Third milestone: Indexer Work in progress.~~ Done.

~~Fourth Milestone: Page Rank algorithm.~~ Done.

Download GitHub Desktop from here: https://desktop.github.com/

If you don't know GitHub yet, look at this tutorial => https://www.youtube.com/watch?v=XdhuWDdu-rk (look for other tutorials if this one doesn't make sense, or ask me if you have any problems)

Feel free to push (add) any dummy files to practice on this repository, or create your own repository and practice there. There is a tutorial in the installed GitHub application too.

CRAWLER =>

The crawler has been implemented in Python. Given a domain, it crawls all the pages without leaving that domain. It stores the source code of each web page in a txt file, with the page URL encoded at the top of the file.
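The crawling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in the `crawler` folder: the names `crawl`, `fetch_page`, `page_file_text` and `LinkExtractor`, the `max_pages` limit, and the choice to write the URL as plain text on the first line of each file are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download a page and return its source as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=100):
    """Breadth-first crawl that never leaves the domain of start_url.

    Returns a dict mapping each visited URL to its page source.
    max_pages is a hypothetical safety limit, not something the README states.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            source = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = source
        extractor = LinkExtractor()
        extractor.feed(source)
        for href in extractor.links:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def page_file_text(url, source):
    """Text of one output file: the page URL on the first line, then the source."""
    return url + "\n" + source
```

Passing `fetch` as a parameter keeps the crawl logic testable without network access; calling `crawl("https://simple.wikipedia.org/")` with the default would download real pages.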

PARSER =>

Basically, it's just a lexical analyzer, not a complete parser. We have used Flex to extract words and certain other data. For each word, we store the tag in which it occurs, its relative position within that tag, and the position of the tag itself relative to all other tags. The whole operation runs in O(n) for each page, where n is the number of characters on that page. So far, the submitted code only parses a given page. Its output for https://simple.wikipedia.org/wiki/April is stored in webpage.txt. (It's a temporary arrangement. Once the indexer is done, this data will be stored directly in MySQL.)
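To illustrate the kind of record described above, here is a rough Python sketch that makes one pass over a page and emits, for each word, the enclosing tag, the word's position within that tag, and the tag's index among all tags. It uses Python's `html.parser` rather than Flex, so it is not the logic in `lexer.l`; the class name `WordPositionParser`, the `parse_page` helper, the tuple layout, and the short `VOID_TAGS` list are all assumptions.

```python
import re
from html.parser import HTMLParser

WORD = re.compile(r"[A-Za-z0-9]+")
# Tags that never have a closing tag (a partial, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class WordPositionParser(HTMLParser):
    """Records (word, tag, position in tag, tag index) for every word."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.stack = []      # open tags as [name, tag_index, words_seen]
        self.tag_count = 0   # number of opening tags seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, self.tag_count, 0])
        self.tag_count += 1

    def handle_endtag(self, tag):
        # Close the innermost matching tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return  # text outside any tag is skipped in this sketch
        current = self.stack[-1]
        for match in WORD.finditer(data):
            word = match.group().lower()
            self.records.append((word, current[0], current[2], current[1]))
            current[2] += 1


def parse_page(html):
    """Return the list of word records for one page."""
    parser = WordPositionParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

For example, `parse_page("<h1>April</h1>")` yields one record: the word `april`, in tag `h1`, at position 0 within the tag, with tag index 0.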

The .l lexer file DOES NOT ALLOW COMMENTS

Moreover, Flex needs to be configured for Windows, which frankly is quite a pain.

INDEXER =>

The output of the parser is processed by the indexer, which creates a reverse (inverted) index.
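As a rough idea of what a reverse index looks like, the sketch below maps each word to the pages and positions where it occurs. It keeps everything in an in-memory dictionary for illustration only; the function name `build_inverted_index` and the input layout are assumptions, and this is not the logic in `Indexer.cpp` or the schema in `Database.sql`.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Build word -> [(url, position), ...] from url -> list of words.

    pages maps each page URL to the words found on it, in page order.
    """
    index = defaultdict(list)
    for url, words in pages.items():
        for position, word in enumerate(words):
            index[word].append((url, position))
    return dict(index)


pages = {
    "https://simple.wikipedia.org/wiki/April": ["april", "is", "a", "month"],
    "https://simple.wikipedia.org/wiki/May": ["may", "is", "a", "month"],
}
index = build_inverted_index(pages)
```

Looking up `index["month"]` then returns every page that contains the word, without scanning the pages again.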

PRIORITIZE =>

Sorts pages on the basis of priority for each word.
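A per-word ordering of this kind might look like the sketch below. The scoring rule is purely an assumption made for illustration (a word found in a title or heading counts for more than one found in body text); the weights in `TAG_WEIGHTS` and the function name `prioritize` are hypothetical, and the rule used by `prioritize.cpp` may well differ.

```python
from collections import defaultdict

# Hypothetical weights: a word in a title counts more than one in a paragraph.
TAG_WEIGHTS = {"title": 10, "h1": 5, "h2": 3}
DEFAULT_WEIGHT = 1


def prioritize(postings):
    """Return word -> list of page URLs, highest priority first.

    postings is an iterable of (word, url, tag) tuples.
    """
    scores = defaultdict(lambda: defaultdict(int))
    for word, url, tag in postings:
        scores[word][url] += TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
    ranked = {}
    for word, per_page in scores.items():
        # Sort by descending score, then by URL so that ties are deterministic.
        ranked[word] = sorted(per_page, key=lambda u: (-per_page[u], u))
    return ranked
```

With this rule, a page that mentions a word once in its title outranks a page that mentions it twice in ordinary paragraphs.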

SEARCH =>

PHP web page to show results.

