Given a query string q and a corpus of documents, retrieve the top k documents that are the closest match to query string using tf-idf

rajesh-bhat/tf-idf

TF-IDF TASK
Given a query string q and a corpus of documents, retrieve the top k documents that most closely match the query string, ranked by their tf-idf score.
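
The ranking idea can be sketched as follows. This is a minimal illustration and not the code in tfidf.py: the function names (`tokenize`, `build_index`, `top_k`), the whitespace tokenizer, the length-normalised term frequency, the `log(N / df)` weighting and the default value of k are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(documents):
    """Return per-document term counts and an idf value for every term."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for counts in doc_counts:
        # Count each term once per document it appears in.
        doc_freq.update(counts.keys())
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return doc_counts, idf


def top_k(query, documents, k=10):
    """Return up to k (doc_id, score) pairs, best match first."""
    doc_counts, idf = build_index(documents)
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, counts in enumerate(doc_counts):
        length = sum(counts.values())
        # Sum tf * idf over the query terms present in this document.
        score = sum(
            (counts[term] / length) * idf[term]
            for term in query_terms
            if term in counts
        )
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

Under this weighting, a term that occurs in every document gets an idf of zero and so contributes nothing to the ranking, which is the usual motivation for the idf factor.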

Dataset
The file dataset.txt contains a list of cricket commentary units. A single unit is the commentary for one ball, and each unit constitutes one document.
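
A possible way to turn the dataset into a list of documents is sketched below. It assumes that dataset.txt stores one commentary unit per line, which the description above does not state explicitly. The helper names `parse_documents` and `load_documents` are hypothetical.

```python
def parse_documents(lines):
    """Turn an iterable of raw lines into a list of documents.

    Assumption: one commentary unit (one ball) per line; blank lines are skipped.
    """
    return [line.strip() for line in lines if line.strip()]


def load_documents(path="dataset.txt"):
    """Read the dataset file and return one document per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return parse_documents(handle)
```

With this layout, a document's id is simply its position in the returned list.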

Packages to install before running the program
1. nltk (Natural Language Toolkit)
2. num2words (installation command: sudo pip install num2words)

Commands for executing the program
python tfidf.py

Input Format for Query string
If the input starts with a double quote ("), the program returns ONLY those documents that match ALL the terms in the query (a logical AND of the query terms, NOT phrase matching). Such a query has the form "q1 q2 ... qk", where q1 to qk are the query terms placed within double quotes. For example, for the query "Stuart Broad to Virat Kohli", the program returns the ranked list of documents [d1, d2, ..., d10], in which every document di contains ALL the terms used in the query. Exact phrase matching is not required: a document only needs to contain all the query terms, in any order, to be considered for ranking.
Otherwise, the program treats the query terms as a logical OR and returns the documents that match the query.
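
The two query modes described above can be sketched as a candidate filter that runs before ranking. This is an illustration under stated assumptions, not the logic in tfidf.py: the names `parse_query` and `candidate_ids` are hypothetical, tokens are compared case-insensitively after a plain whitespace split, and both straight and curly double quotes are accepted.

```python
# Straight and curly double quotes (the curly forms are an assumption).
QUOTES = '"\u201c\u201d'


def parse_query(raw):
    """Return (require_all, terms) for a raw query string."""
    text = raw.strip()
    # A query that starts with a double quote switches to AND mode.
    require_all = bool(text) and text[0] in QUOTES
    terms = text.strip(QUOTES).lower().split()
    return require_all, terms


def candidate_ids(raw, documents):
    """Return the ids of documents eligible for ranking."""
    require_all, terms = parse_query(raw)
    term_set = set(terms)
    if not term_set:
        return []
    matches = []
    for doc_id, doc in enumerate(documents):
        words = set(doc.lower().split())
        if require_all:
            # AND mode: every query term must occur, in any order.
            eligible = term_set <= words
        else:
            # OR mode: at least one query term must occur.
            eligible = bool(term_set & words)
        if eligible:
            matches.append(doc_id)
    return matches
```

Because the AND check is a set comparison, word order and adjacency play no role, which matches the statement that quoted queries are not phrase searches.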

Sample Input and Output
enter the query string: or q to quit
driven through midwicket for a couple of runs
**************************************************************
rank: 1 score 1.05838456103
Anderson to Rogers 2 runs too straight and tucked off the pads through midwicket an easy couple
-----------------------------------------------
rank: 2 score 1.01438166749
Jarvis to Jahurul Islam 2 runs ooh swing there turns out to be too full and that's driven through midwicketfor a couple of runs
-----------------------------------------------
rank: 3 score 0.962167782759
Broad to Williamson 2 runs little too straight and tucked past midwicket so they get a couple
-----------------------------------------------
rank: 4 score 0.723499546899
Southee to Cook no run driven to cover
-----------------------------------------------
rank: 5 score 0.723499546899
Southee to Trott no run driven into the covers
-----------------------------------------------

enter the query string: or q to quit
"around the wicket" (quoted query: only documents that contain all the query terms are returned)
**************************************************************
rank: 1 score 1.02036361107
Broad to Rogers no run around the wicket Rogers back and across the off stump to block up the wicket
-----------------------------------------------
rank: 2 score 0.526893945533
Swann to Rogers 1 run around the wicket to the leftie but too straight and played calmly off the back foot into midwicket for a single
-----------------------------------------------
rank: 3 score 0.491767682498
Anderson to Rogers 1 run around the wicket slants back in Rodgers uses the angle and clips out to deep square leg
-----------------------------------------------
rank: 4 score 0.461032202341
Swann to Rogers no run Rogers plays back to Swann around the wicket flicks it to square leg but can't find a run
-----------------------------------------------
rank: 5 score 0.433912661027
Anderson to Rogers FOUR around the wicket but too straight just back of a length clipped through square leg and it's well timed runs away from the fielder
-----------------------------------------------
enter the query string: or q to quit
q
