ReadMe
Reads an XML corpus of Wikipedia (a sample is given in the XMLCorpus folder), generates the index structure, and supports retrieval of relevant documents within one second. Follows a Boolean retrieval model with TF-IDF weighting.
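
The TF-IDF weight mentioned above is the usual term-frequency times inverse-document-frequency product. A minimal sketch of that weighting in Java (the class and variable names here are illustrative, not taken from this repository):

```java
// Minimal TF-IDF weighting sketch (illustrative only; names are not from this repository).
public final class TfIdfWeight {
    /**
     * @param tf number of times the term occurs in the document
     * @param df number of documents that contain the term
     * @param n  total number of documents in the corpus
     */
    public static double weight(int tf, int df, int n) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        // Log-scaled term frequency multiplied by inverse document frequency.
        return (1.0 + Math.log(tf)) * Math.log((double) n / df);
    }
}
```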
How to run:
- Source files are placed in the "src" folder.
- To create the index, run index.sh in the "toRun" folder with two arguments: 1) the path of the XML corpus and 2) the output directory, e.g. bash index.sh ../XMLCorpus/sample.xml ../IndexDirectory/
- To search, run query.sh in the "toRun" folder with the output directory as the argument, e.g. bash query.sh ../IndexDirectory/
Approach:
- First, the ReadXML class reads the XML corpus through a SAX parser and passes the lines on for processing.
- The LineProcess class splits each line into words.
- These words are checked against a stop-word list by the StopWord class, and stop words are removed.
- The remaining words are then stemmed with the help of the Stemmer class.
- Finally, each word and the pageId of the page in which it occurs are stored in a TreeMap.
- Once the TreeMap holds 3000 records, its contents are dumped to disk by the dumpOnDisk function (a sketch of this pipeline is given after this list).
- The existing index file on disk is merged with the TreeMap being dumped.
- After the first-level index (indexDataFile) is built, a secondary index (secondaryIndex) is prepared that stores the starting address of each letter of the alphabet, e.g. the start addresses of 'a', 'b', ..., 'z'.
- Once a query is submitted to the Searcher class, it is parsed, stemmed, and looked up via binary search (see the lookup sketch after this list).
- Results, along with their pageIds, are displayed.
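
The indexing steps above can be summarised in code. The following is a minimal, self-contained sketch of that pipeline, not the repository's actual implementation: the XML element names ("page", "text"), the stop-word list, the toy stemmer, and the on-disk line format "term:pageIds" are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative indexing sketch; element names, stop words, and stemmer are placeholders.
public class IndexSketch extends DefaultHandler {

    private static final int DUMP_THRESHOLD = 3000;   // records held in memory before a dump
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "in", "is"));

    private final File indexDir;
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    private int currentPageId = 0;
    private boolean inText = false;

    public IndexSketch(File indexDir) { this.indexDir = indexDir; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("page".equals(qName)) currentPageId++;     // stand-in for reading the real <id> element
        if ("text".equals(qName)) inText = true;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("text".equals(qName)) inText = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inText) return;
        // Split the line into words, drop stop words, stem, and record the pageId.
        for (String token : new String(ch, start, length).toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            postings.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(currentPageId);
            if (postings.size() >= DUMP_THRESHOLD) dumpOnDisk();
        }
    }

    // Crude stand-in for a real stemmer (e.g. Porter).
    private static String stem(String word) {
        return word.length() > 3 && word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Appends the in-memory postings to indexDataFile; the real implementation
    // merges them with the existing sorted index rather than just appending.
    void dumpOnDisk() {
        File indexFile = new File(indexDir, "indexDataFile");
        try (BufferedWriter out = new BufferedWriter(new FileWriter(indexFile, true))) {
            for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                out.write(e.getKey() + ":" + e.getValue());
                out.newLine();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        postings.clear();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        IndexSketch handler = new IndexSketch(new File(args[1]));
        parser.parse(new File(args[0]), handler);
        handler.dumpOnDisk();                           // flush whatever is left in memory
    }
}
```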
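
A companion sketch of the secondary index and the binary-search lookup used at query time; again, the file name indexDataFile and the "term:pageIds" line format are assumptions carried over from the sketch above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the secondaryIndex and query lookup; not the repository's actual code.
public class SearchSketch {

    // Scan indexDataFile once and remember the byte offset at which each
    // starting letter first appears -- this plays the role of secondaryIndex.
    static TreeMap<Character, Long> buildSecondaryIndex(RandomAccessFile index) throws IOException {
        TreeMap<Character, Long> secondary = new TreeMap<>();
        index.seek(0);
        long offset = 0;
        String line;
        while ((line = index.readLine()) != null) {
            if (!line.isEmpty()) secondary.putIfAbsent(line.charAt(0), offset);
            offset = index.getFilePointer();
        }
        return secondary;
    }

    // Load only the block of lines whose terms share the query's first letter,
    // then binary-search it (the block is sorted because it was written from a TreeMap).
    static String lookup(RandomAccessFile index, TreeMap<Character, Long> secondary, String term)
            throws IOException {
        Long start = secondary.get(term.charAt(0));
        if (start == null) return null;                          // no term starts with this letter
        Map.Entry<Character, Long> next = secondary.higherEntry(term.charAt(0));
        long end = next == null ? index.length() : next.getValue();

        index.seek(start);
        List<String> block = new ArrayList<>();
        while (index.getFilePointer() < end) {
            String line = index.readLine();
            if (line == null) break;
            block.add(line);
        }
        // Compare only the "term:" prefix of each stored line against the query term.
        int pos = Collections.binarySearch(block, term + ":",
                (stored, key) -> stored.substring(0, stored.indexOf(':') + 1).compareTo(key));
        return pos >= 0 ? block.get(pos) : null;                 // "term:pageIds" or null if absent
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile(new File(args[0], "indexDataFile"), "r")) {
            TreeMap<Character, Long> secondary = buildSecondaryIndex(index);
            System.out.println(lookup(index, secondary, args[1].toLowerCase()));
        }
    }
}
```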