A C++ program that creates an inverted index structure from a set of text documents and a program that processes queries over such inverted index.
📖 Report avaibile here: report.pdf
To clone the repository, use the following command:
git clone --recursive https://github.com/scarburato/searchenginepp.gitTo ensure the program works on Ubuntu, make sure you have the following dependencies installed:
sudo apt-get update
sudo apt install build-essential cmake libstemmer-dev libhyperscan-devIf you're not building on an amd64 (x86) system you may have to install vectorscan in place of hyperscan, on Ubuntu
the package to install should be named libvectorscan-dev.
Assuming you have already installed a C++ compiler and the brew packet manager; to compile on MacOS, you need to install the following dependencies:
brew install cmake snowball hyperscanIf you're not building on an amd64 (x86) system, you have to install vectorscan in place of hyperscan.
The CMakeLists.txt file assumes that brew installs its stuff in $(brew --prefix), if your version of brew differs
then modify the CMakeLists.txt file accordingly.
Navigate to the project directory and use the following commands:
mkdir build
cd build/
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_STEMMER:BOOL=ON -DFIX_MSMARCO_LATIN1:BOOL=ON -DTEXT_FULL_LATIN1_CASE:BOOL=ON ..
make -j8you should replace the -j8 flag with the number of cores in your CPU.
On MacOS, if perfidious Apple's clang gives you problems compiling, you should consider installing GNU GCC or a more recent version of clang through brew.
Those are the building options:
USE_STEMMERused to enable or disable Snowball's stemmer and stopword removalFIX_MSMARCO_LATIN1used to enable or disable heuristic and encoding fix for certain wronly encoded docs in MSMARCO. If you are not using MSMARCO you should put this flag to OFF (default option is OFF)TEXT_FULL_LATIN1_CASEreplaces the ASCII-only lower-case algorithm with a latin1 (a larger subset of utf8) lower-caseUSE_FAST_LOGreplaces the floating point version of log with a faster integer version. It doesn't improve performance by much
Go in the build/ directory and use the following command:
tests/Google_Tests_runIf all tests return RUN OK, it means all tests passed successfully.
To read the collection efficiently, use the following command:
time tar -xOzf ../data/collection.tar.gz collection.tsv | ./builderOn MacOS, you may find that GNU's implementation of tar is faster; you can install it with brew
--- brew install gnu-tar --- and then replace tar with gtar in the command above.
The use of tar alongside UNIX's pipes, allows the system to decompress the collection
in blocks, and keep in the input buffer of only the chunk that's being proccessed at the moment,
thus removing the necessity to decompress the file separately and to load it in memory all at once.
To run the query processor, run the following command:
./engine [options] [data]where [options] are:
-b|--batchto run the query processor in batch mode, that is to read queries in the formatquery_id\tquery_textfrom stdin, if omitted the programm will accept only the query text from stdin-k|--top-kto specify the number of top documents to return for each query (default is 10)-t|--threadsto specify the number of threads to use (default is 1). It is not advisable to use more than one if all chunks are on the same disk-s|--scoreto specify the scoring function to use for the query processor. The available algorithms are:bm25to use the BM25 scoring function (default)tfidfto use the TF-IDF scoring function
-a|--algorithmto specify the algorithm to use for the query processor. The available algorithms are:daat|daat-disjunctiveto use the daat in disjunctive mode (default)daat-c|daat-conjunctiveto use the daat in conjunctive modebmmto use the BMM dynamic programming algorithm
-r|--run-nameto specify the name of the run (default isMIRCV0)
and [data] is the path to the data directory that contains the files (default is data/)
For example, you can run the query processor in batch mode with the BM25 scoring function and the DAAT algorithm to produce a run file to use with trec_eval with the following command:
time ./engine -b -k20 -r MIRCV-DAAT-BM25-20 < ../../msmarco-test2020-queries.tsv > mircv-daat-bm25-20.run