ReadMe
Reads an XML corpus of Wikipedia (a sample is given in the XMLCorpus folder), generates the index structure, and supports retrieval of relevant documents within one second. Follows a Boolean retrieval model with TF-IDF weighting.
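
The TF-IDF weight mentioned above is the usual term-frequency times inverse-document-frequency product. A minimal sketch of that weighting in Java (the class and variable names here are illustrative, not taken from this repository):

```java
// Minimal TF-IDF weighting sketch (illustrative only; names are not from this repository).
public final class TfIdfWeight {
    /**
     * @param tf number of times the term occurs in the document
     * @param df number of documents that contain the term
     * @param n  total number of documents in the corpus
     */
    public static double weight(int tf, int df, int n) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        // Log-scaled term frequency multiplied by inverse document frequency.
        return (1.0 + Math.log(tf)) * Math.log((double) n / df);
    }
}
```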
How to run:
- Source files are placed in the "src" folder.
- To create the index, run index.sh in the "toRun" folder with two arguments: 1) the path of the XML corpus and 2) the output directory, e.g. bash index.sh ../XMLCorpus/sample.xml ../IndexDirectory/
- To search, run query.sh in the "toRun" folder with the output directory as the argument, e.g. bash query.sh ../IndexDirectory/
Approach:
- First, the ReadXML class reads the XML corpus through a SAX parser and passes the lines on for processing.
- The LineProcess class splits each line into words.
- These words are checked against a stop-word list by the StopWord class, and stop words are removed.
- The remaining words are then stemmed with the help of the Stemmer class.
- Finally, each word and the pageId of the page in which it occurs are stored in a TreeMap.
- Once the TreeMap holds 3000 records, its contents are dumped to disk by the dumpOnDisk function (a sketch of this pipeline is given after this list).
- The existing index file on disk is merged with the TreeMap being dumped.
- After the first-level index (indexDataFile) is built, a secondary index (secondaryIndex) is prepared that stores the starting address of each letter of the alphabet, e.g. the start addresses of 'a', 'b', ..., 'z'.
- Once a query is submitted to the Searcher class, it is parsed, stemmed, and looked up via binary search (see the lookup sketch after this list).
- Results, along with their pageIds, are displayed.
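
The indexing steps above can be summarised in code. The following is a minimal, self-contained sketch of that pipeline, not the repository's actual implementation: the XML element names ("page", "text"), the stop-word list, the toy stemmer, and the on-disk line format "term:pageIds" are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative indexing sketch; element names, stop words, and stemmer are placeholders.
public class IndexSketch extends DefaultHandler {

    private static final int DUMP_THRESHOLD = 3000;   // records held in memory before a dump
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "in", "is"));

    private final File indexDir;
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    private int currentPageId = 0;
    private boolean inText = false;

    public IndexSketch(File indexDir) { this.indexDir = indexDir; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("page".equals(qName)) currentPageId++;     // stand-in for reading the real <id> element
        if ("text".equals(qName)) inText = true;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("text".equals(qName)) inText = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inText) return;
        // Split the line into words, drop stop words, stem, and record the pageId.
        for (String token : new String(ch, start, length).toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            postings.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(currentPageId);
            if (postings.size() >= DUMP_THRESHOLD) dumpOnDisk();
        }
    }

    // Crude stand-in for a real stemmer (e.g. Porter).
    private static String stem(String word) {
        return word.length() > 3 && word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Appends the in-memory postings to indexDataFile; the real implementation
    // merges them with the existing sorted index rather than just appending.
    void dumpOnDisk() {
        File indexFile = new File(indexDir, "indexDataFile");
        try (BufferedWriter out = new BufferedWriter(new FileWriter(indexFile, true))) {
            for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                out.write(e.getKey() + ":" + e.getValue());
                out.newLine();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        postings.clear();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        IndexSketch handler = new IndexSketch(new File(args[1]));
        parser.parse(new File(args[0]), handler);
        handler.dumpOnDisk();                           // flush whatever is left in memory
    }
}
```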
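
A companion sketch of the secondary index and the binary-search lookup used at query time; again, the file name indexDataFile and the "term:pageIds" line format are assumptions carried over from the sketch above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the secondaryIndex and query lookup; not the repository's actual code.
public class SearchSketch {

    // Scan indexDataFile once and remember the byte offset at which each
    // starting letter first appears -- this plays the role of secondaryIndex.
    static TreeMap<Character, Long> buildSecondaryIndex(RandomAccessFile index) throws IOException {
        TreeMap<Character, Long> secondary = new TreeMap<>();
        index.seek(0);
        long offset = 0;
        String line;
        while ((line = index.readLine()) != null) {
            if (!line.isEmpty()) secondary.putIfAbsent(line.charAt(0), offset);
            offset = index.getFilePointer();
        }
        return secondary;
    }

    // Load only the block of lines whose terms share the query's first letter,
    // then binary-search it (the block is sorted because it was written from a TreeMap).
    static String lookup(RandomAccessFile index, TreeMap<Character, Long> secondary, String term)
            throws IOException {
        Long start = secondary.get(term.charAt(0));
        if (start == null) return null;                          // no term starts with this letter
        Map.Entry<Character, Long> next = secondary.higherEntry(term.charAt(0));
        long end = next == null ? index.length() : next.getValue();

        index.seek(start);
        List<String> block = new ArrayList<>();
        while (index.getFilePointer() < end) {
            String line = index.readLine();
            if (line == null) break;
            block.add(line);
        }
        // Compare only the "term:" prefix of each stored line against the query term.
        int pos = Collections.binarySearch(block, term + ":",
                (stored, key) -> stored.substring(0, stored.indexOf(':') + 1).compareTo(key));
        return pos >= 0 ? block.get(pos) : null;                 // "term:pageIds" or null if absent
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile(new File(args[0], "indexDataFile"), "r")) {
            TreeMap<Character, Long> secondary = buildSecondaryIndex(index);
            System.out.println(lookup(index, secondary, args[1].toLowerCase()));
        }
    }
}
```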