Information Search Engine

Objectives

Implement an information retrieval system using well known tf-idf measures
Practice with data structures, creating own functions and algorithms

Aim

The project implements an information retrieval engine using well known term-weighting scheme called "tf-idf". For further information about the tf-idf please visit the wikipedia article:

Tf-idf Link

tf-idf

Steps involved

Calculates the relative frequency tf(t,d) adjusted for document length (d). tf(t, d) = (frequency of term t in d) / (number of words in d).
Calculates inverse document frequency idf(t, D) by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

idf(this, D) = log(N / |{d ∈ D : t ∈ d}| )
Calculates tfidf(t, d) = tf(t,d) * idf(t,D)

Example from Wikipedia:

Part 1 - Inverted Index and BST

Implementation of functions in the file invertedIndex.c that reads data from a given collection of files in collection.txt (see simple example files in repo with .txt) and generates an "inverted index" that provides a sorted list (set) of filenames for every word in a given collection of files. The program uses a binary search Tree to implement the inverted index.

The functions in invertedIndex.h:

char * normaliseWord(char *str);
InvertedIndexBST generateInvertedIndex(char *collectionFilename);
void printInvertedIndex(InvertedIndexBST tree);

Before inserting words into the linked list, the function,

char * normaliseWord(char *str);

normalises each word in each file containing the following symbols:

'.' (dot),
',' (comma),
';' (semicolon),
? (question mark)

It also:

Removes leading and trailing spaces
Converts all characters to lowercase

Note: If multiple symbols appear in the word only the first one is removed from the word. The function will only remove symbols at the end of the word. The function modifies a given string. It does not create a copy.

Example

Word (input)	Normalised Word (output)
LukaGamulin.	lukagamulin
Luka?	luka
LukaJose??	lukajose?
Luka.Jose?	luka.jose

The function generateInvertedIndex ,

InvertedIndexBST generateInvertedIndex(char *collectionFilename);

Takes in as input a collection of filenames, reads each of these files, generates and returns the inverted index. This is where the magic happens. The tree data structure is created where every node is a word and every word has a list, containing every file where the word appeared from the collection filename.

Each node has a tf-idf number, the list is sorted from highest to lowest tf-idf.

Finally, the function printInvertedIndex,

void printInvertedIndex(InvertedIndexBST tree);

Takes in the ADT implemented in the function generateInvertedIndex and prints the tree.

Example output format

design file11.txt file21.txt
mars nasa.txt news1.txt
weather info31.txt nasa.txt news1.txt

Part 2 - Information Retrieval Search engine

In this part, the information retrieval uses a function that finds files (documents) with one or more query terms, and uses summation of tf-idf values of all matching query terms (words) for ranking such files (documents). It calculates the tf-idf values for each matching query term in a file (document), and ranks the files (documents) based on the summation of tf-idf values for all matching query terms present in that file. It uses the ADT "inverted index" created in Part-1 to locate files with one or more query terms, and to calculate the required tf-idf values for such files.

The information retrieval engine in this part implements a function in the file invertedIndex.c that given search terms (words), returns an ordered list of type TfIdfList, where each node contains filename and the corresponding summation of tf-idf values for given searchWords. The output list is in descending order of summation of tf-idf values.

See invertedIndex.h for the type definition of TfIdfList. There are other functions to calculateTfIdf.

The following two functions in the file invertedIndex.c. Total number of documents D is provided as an argument in both the functions.

TfIdfList calculateTfIdf(InvertedIndexBST tree,
                        char *searchWord , int D);

The function returns an ordered list where each node contains filename and the corresponding tf-idf value for a given searchWord. It only returns documents (files) that contain a given searchWord. The list is in descending order of tf-idf values.

Note: If the word contains multiple files with the same tf-idf value, it orders such files (documents) on their filenames using ascending order.

TfIdfList retrieve(InvertedIndexBST tree, char* searchWords[] ,
                                         int D);

The function returns an ordered list where each node contains filename and summation of tf-idf values of all matching searchWords. Only returns documents (files) that contain one or more searchWords.

The list returns in descending order of summation of tf-idf values (tfidf_sum). If multiple files with same tfidf_sum value, the files (documents) are ordered on their filenames using ascending order.

Note: the searchWords[] needs to be terminated with a NULL pointer. For example,

char *words[] = { "nasa", "mars", "earth", NULL };
TfIdfList list = retrieve(index, words, 7);

For compiling, create a new directory with all files in the repository. Then simply use the following command to compile,

% gcc -Wall -Werror -lm -std=c11  *.c  -o invertedIndex

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
DLList.c		DLList.c
DLList.h		DLList.h
Example_tf_idf.png		Example_tf_idf.png
List.c		List.c
List.h		List.h
README.md		README.md
Tree.c		Tree.c
Tree.h		Tree.h
collection.txt		collection.txt
file11.txt		file11.txt
file21.txt		file21.txt
info31.txt		info31.txt
invertedIndex.c		invertedIndex.c
invertedIndex.h		invertedIndex.h
invertedIndex_exp.txt		invertedIndex_exp.txt
mars_TfIdfList_exp.txt		mars_TfIdfList_exp.txt
mixed.txt		mixed.txt
nasa.txt		nasa.txt
nasa_mars_moon_exp.txt		nasa_mars_moon_exp.txt
news1.txt		news1.txt
planets.txt		planets.txt
sun_TfIdfList_exp.txt		sun_TfIdfList_exp.txt
test_engine.c		test_engine.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Information Search Engine

Objectives

Aim

Tf-idf Link

Steps involved

idf(this, D) = log(N / |{d ∈ D : t ∈ d}| )

Example from Wikipedia:

Part 1 - Inverted Index and BST

Example

Example output format

Part 2 - Information Retrieval Search engine

About

Releases

Packages

Contributors 2

Languages

lukajose/Information-Retrieval-engine

Folders and files

Latest commit

History

Repository files navigation

Information Search Engine

Objectives

Aim

Tf-idf Link

Steps involved

idf(this, D) = log(N / |{d ∈ D : t ∈ d}| )

Example from Wikipedia:

Part 1 - Inverted Index and BST

Example

Example output format

Part 2 - Information Retrieval Search engine

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages