|
1 |
| -# <h2 align="center"> Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions </h2> |
| 1 | +# <h1 align="center"> Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions </h1> |
2 | 2 |
|
3 |
| -This is the repository for our paper ["Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions"](https://github.com/StonyBrookNLP/ircot/blob/main/ircot.pdf). |
| 3 | +This is the repository for our ACL 2023 paper ["Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions"](https://arxiv.org/abs/2212.10509). |
4 | 4 |
|
5 |
| -The code and prompts for it will be released here soon. |
| 5 | + |
| 6 | + |
| 7 | +# Installation |
| 8 | + |
| 9 | +```bash |
| 10 | +conda create -n ircot python=3.8.0 -y && conda activate ircot |
| 11 | +pip install -r requirements.txt |
| 12 | +python -m spacy download en_core_web_sm |
| 13 | +``` |
| 14 | + |
| 15 | +# Prepare Data |
| 16 | + |
| 17 | +You can download all our processed data by running |
| 18 | + |
| 19 | +```bash |
| 20 | +./download/processed_data.sh |
| 21 | +``` |
| 22 | + |
| 23 | +The data will be downloaded to `processed_data/{dataset_name}/`. If you only need the dev/test data we used in the paper, it's in `processed_data/{dataset_name}/{dev|test}_subsampled.jsonl`. |
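If you want to peek at the subsampled files from Python, here is a minimal sketch. It only assumes that each non-empty line is one JSON object (the usual `.jsonl` convention); `read_jsonl` and `subsampled_path` are hypothetical helper names, not functions from this codebase, and no particular field names are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def subsampled_path(dataset_name, split, root="processed_data"):
    """Build {root}/{dataset_name}/{split}_subsampled.jsonl (hypothetical helper)."""
    if split not in ("dev", "test"):
        raise ValueError(f"split must be 'dev' or 'test', got {split!r}")
    return Path(root) / dataset_name / f"{split}_subsampled.jsonl"
```

For example, `read_jsonl(subsampled_path("hotpotqa", "dev"))` would load the HotpotQA dev subsample once the download script has been run.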
| 24 | + |
| 25 | +<details> |
| 26 | +<summary>Follow these steps if you want to generate all processed data from scratch again.</summary> |
| 27 | + |
| 28 | +```bash |
| 29 | +# 1. Download raw data: |
| 30 | +## raw data will be in raw_data/{dataset_name}/ |
| 31 | +./download/raw_data.sh |
| 32 | + |
| 33 | +# 2. Process raw data files in a single standard format |
| 34 | +## processed data will be in processed_data/{dataset_name}/ |
| 35 | +python processing_scripts/process_hotpotqa.py |
| 36 | +python processing_scripts/process_2wikimultihopqa.py |
| 37 | +python processing_scripts/process_musique.py |
| 38 | +python processing_scripts/process_iirc.py |
| 39 | + |
| 40 | +# 3. Subsample the processed datasets. |
| 41 | +## Note (i) dev processing has to be done before test. |
| 42 | +## (ii) because of randomness it may create different samples than what we used, |
| 43 | +## so consider using the released data if the goal is reproduction. |
| 44 | +## (iii) sampled data will be in processed_data/{dataset_name}/{dev|test}_subsampled.jsonl |
| 45 | +python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa dev |
| 46 | +python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa test |
| 47 | +python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa dev |
| 48 | +python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa test |
| 49 | +python processing_scripts/subsample_dataset_and_remap_paras.py musique dev |
| 50 | +python processing_scripts/subsample_dataset_and_remap_paras.py musique test |
| 51 | +python processing_scripts/subsample_dataset_and_remap_paras.py iirc dev |
| 52 | +python processing_scripts/subsample_dataset_and_remap_paras.py iirc test |
| 53 | + |
| 54 | +# 4. Attach reasoning steps and supporting para annotations |
| 55 | +## to the preprocessed (train) data files. |
| 56 | +## To do this, you'll need to set up an Elasticsearch server and index all dataset corpora. |
| 57 | +## See 'Prepare Retriever and LLM Servers' section in the readme. |
| 58 | +python prompt_generator/attach_data_annotations.py hotpotqa |
| 59 | +python prompt_generator/attach_data_annotations.py 2wikimultihopqa |
| 60 | +python prompt_generator/attach_data_annotations.py musique |
| 61 | +python prompt_generator/attach_data_annotations.py iirc |
| 62 | +``` |
| 63 | + |
| 64 | +</details> |
| 65 | + |
| 66 | +You'll also need `raw_data` if you want to build Elasticsearch indices and run the retriever or ODQA systems. |
| 67 | + |
| 68 | +```bash |
| 69 | +./download/raw_data.sh |
| 70 | +``` |
| 71 | + |
| 72 | +The data will be downloaded in `raw_data/{dataset_name}/`. |
| 73 | + |
| 74 | + |
| 75 | +# Prepare Prompts |
| 76 | + |
| 77 | +All our prompts are available in the `prompts/` directory. If you're using these prompts outside of this codebase, note that the `# METADATA: ...` lines in them need to be ignored at runtime. |
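If you do use the prompts elsewhere, dropping those lines could look roughly like this. It's a sketch that assumes a metadata line is any line starting with `# METADATA:`; `strip_metadata_lines` is a hypothetical helper, and the example prompt text and its metadata payload are made up for illustration.

```python
def strip_metadata_lines(prompt_text):
    """Drop every line that starts with '# METADATA:' and keep the rest unchanged."""
    kept = [
        line
        for line in prompt_text.splitlines(keepends=True)
        if not line.lstrip().startswith("# METADATA:")
    ]
    return "".join(kept)


# Made-up prompt snippet; real prompt files live in prompts/.
example = (
    '# METADATA: {"qid": "hypothetical-id"}\n'
    "Q: Who wrote the novel?\n"
    "A: An example answer.\n"
)
print(strip_metadata_lines(example), end="")
```

Filtering line by line keeps the rest of the prompt byte-for-byte identical, which matters if you want to reproduce the released prompts exactly.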
| 78 | + |
| 79 | +If you want to generate them from scratch, run |
| 80 | + |
| 81 | +```bash |
| 82 | +python prompt_generator/generate_prompts.py {dataset_name} --task_name qa # hotpotqa, 2wikimultihopqa, musique, iirc |
| 83 | +python prompt_generator/generate_prompts.py iirc --task_name no_context_open_retrieval |
| 84 | +``` |
| 85 | + |
| 86 | +Note, though, that because distractors are selected by random sampling, some of the regenerated prompts may be different. So if your goal is to reproduce the experiments, use the released ones. |
| 87 | + |
| 88 | +# Prepare Retriever and LLM Servers |
| 89 | + |
| 90 | +<details> |
| 91 | +<summary> First, install Elasticsearch 7.10. </summary> |
| 92 | + |
| 93 | +### Install on Mac (option 1) |
| 94 | +``` |
| 95 | +# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/brew.html |
| 96 | +brew tap elastic/tap |
| 97 | +brew install elastic/tap/elasticsearch-full # if it doesn't work: try 'brew untap elastic/tap' first: untap>tap>install. |
| 98 | +brew services start elastic/tap/elasticsearch-full # to start the server |
| 99 | +brew services stop elastic/tap/elasticsearch-full # to stop the server |
| 100 | +``` |
| 101 | + |
| 102 | +### Install on Mac (option 2) |
| 103 | +``` |
| 104 | +# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html |
| 105 | +wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz |
| 106 | +wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512 |
| 107 | +shasum -a 512 -c elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512 |
| 108 | +tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz |
| 109 | +cd elasticsearch-7.10.2/ |
| 110 | +./bin/elasticsearch # start the server |
| 111 | +pkill -f elasticsearch # to stop the server |
| 112 | +``` |
| 113 | + |
| 114 | +### Install on Linux |
| 115 | + |
| 116 | +``` |
| 117 | +# source: https://www.elastic.co/guide/en/elasticsearch/reference/8.1/targz.html |
| 118 | +wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz |
| 119 | +wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512 |
| 120 | +shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512 |
| 121 | +tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz |
| 122 | +cd elasticsearch-7.10.2/ |
| 123 | +./bin/elasticsearch # start the server |
| 124 | +pkill -f elasticsearch # to stop the server |
| 125 | +``` |
| 126 | + |
| 127 | +Check out the referenced sources if you run into problems installing it. |
| 128 | + |
| 129 | +</details> |
| 130 | + |
| 131 | +Start the Elasticsearch server on port 9200 (default), and then start the retriever server as shown here. You can change the Elasticsearch port in `retriever_server/serve.py` if needed. |
| 132 | + |
| 133 | +```bash |
| 134 | +uvicorn serve:app --port 8000 --app-dir retriever_server |
| 135 | +``` |
| 136 | + |
| 137 | +Next, index the Wikipedia corpora for the datasets. Make sure you've downloaded `raw_data` and `processed_data`. |
| 138 | + |
| 139 | +```bash |
| 140 | +python retriever_server/build_index.py {dataset_name} # hotpotqa, iirc, 2wikimultihopqa, musique |
| 141 | +``` |
| 142 | + |
| 143 | +After indexing, you can check the number of documents in each index by running `curl localhost:9200/_cat/indices`. You should have 4 indices, one for each dataset, called `{dataset}-wikipedia`. Make sure they match the statistics given in the paper. You should expect to see the following sizes: HotpotQA (5,233,329), 2WikiMultihopQA (430,225), MuSiQue (139,416), and IIRC (1,882,415). |
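If you'd rather not compare the numbers by eye, here is a rough sketch of the check. It assumes you've saved the output of `curl "localhost:9200/_cat/indices?h=index,docs.count"` (two columns per row) and that the index names use the lowercase dataset names from the commands above; `parse_cat_indices` and `find_mismatches` are hypothetical helpers, not part of this repository.

```python
# Expected sizes as listed above; index names assume the lowercase dataset names.
EXPECTED_DOC_COUNTS = {
    "hotpotqa-wikipedia": 5_233_329,
    "2wikimultihopqa-wikipedia": 430_225,
    "musique-wikipedia": 139_416,
    "iirc-wikipedia": 1_882_415,
}


def parse_cat_indices(text):
    """Parse 'index docs.count' rows into a {index_name: count} dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def find_mismatches(counts, expected=EXPECTED_DOC_COUNTS):
    """Return {index_name: (expected, found)} for missing or wrong-sized indices."""
    return {
        name: (want, counts.get(name))
        for name, want in expected.items()
        if counts.get(name) != want
    }
```

An empty result from `find_mismatches(parse_cat_indices(saved_output))` means all four indices have the expected sizes; a `None` in the second slot means that index wasn't found at all.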
| 144 | + |
| 145 | +Next, if you want to use flan-t5-* models, start the llm_server by running: |
| 146 | + |
| 147 | +```bash |
| 148 | +MODEL_NAME={model_name} uvicorn serve:app --port 8010 --app-dir llm_server # model_name: flan-t5-xxl, flan-t5-xl, flan-t5-large, flan-t5-base |
| 149 | +``` |
| 150 | + |
| 151 | +If you want to use OpenAI models (e.g., Codex in our experiments), you don't need to start the llm_server. In that case, you just need to set the environment variable `OPENAI_API_KEY`. |
| 152 | + |
| 153 | +If you start the retriever and/or llm_server on a different host or port, update them in `.retriever_address.jsonnet` and `.llm_server_address.jsonnet` before running the retrieval/ODQA systems. |
| 154 | + |
| 155 | + |
| 156 | +# Run Retrieval and ODQA Systems |
| 157 | + |
| 158 | +First, download dataset repositories for official evaluation: `./download/official_eval.sh`. |
| 159 | + |
| 160 | +Next, set the variables: |
| 161 | + |
| 162 | +- SYSTEM: choose from (`ircot`, `ircot_qa`, `oner`, `oner_qa`, `nor_qa`) |
| 163 | +- MODEL: choose from (`codex`, `flan-t5-xxl`, `flan-t5-xl`, `flan-t5-large`, `flan-t5-base`, `none`) |
| 164 | +- DATASET: choose from (`hotpotqa`, `2wikimultihopqa`, `musique`, `iirc`) |
| 165 | + |
| 166 | +The systems ending with `_qa` are for ODQA and the others are for retrieval. `ircot` and `ircot_qa` are the proposed systems and the others are baselines (see NoR and OneR in the paper). For `oner`, set the model to `none`; don't use `none` with any other system. |
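To catch a bad combination before launching a long run, a small check like the following may help. It's a sketch built only from the choices listed above; `check_choice` is a hypothetical helper, and the `oner`/`none` rule encodes one reading of the note above (model `none` goes with `oner` and with nothing else).

```python
SYSTEMS = ("ircot", "ircot_qa", "oner", "oner_qa", "nor_qa")
MODELS = ("codex", "flan-t5-xxl", "flan-t5-xl", "flan-t5-large", "flan-t5-base", "none")
DATASETS = ("hotpotqa", "2wikimultihopqa", "musique", "iirc")


def check_choice(system, model, dataset):
    """Return a list of problems with the chosen combination (empty if none)."""
    problems = []
    if system not in SYSTEMS:
        problems.append(f"unknown SYSTEM: {system!r}")
    if model not in MODELS:
        problems.append(f"unknown MODEL: {model!r}")
    if dataset not in DATASETS:
        problems.append(f"unknown DATASET: {dataset!r}")
    # Assumed reading of the README note: 'none' pairs with 'oner' and only with it.
    if (system == "oner") != (model == "none"):
        problems.append("MODEL 'none' goes with SYSTEM 'oner', and only with it")
    return problems
```

For instance, `check_choice("oner", "none", "iirc")` returns an empty list, while `check_choice("oner", "codex", "iirc")` reports the model mismatch.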
| 167 | + |
| 168 | +Now you can run the system using (language) model and dataset of your choice by running: |
| 169 | + |
| 170 | +```bash |
| 171 | +./reproduce.sh $SYSTEM $MODEL $DATASET |
| 172 | +``` |
| 173 | + |
| 174 | +This script runs several steps one after the other: instantiating experiment configs with HPs, running predictions for them on the dev set, picking the best HP, writing an experiment config with the best HP, running it on the test set, and summarizing the results with mean and std. |
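The selection and summary logic in that pipeline can be sketched in a few lines. This is only an illustration of the idea, not the code in `runner.py`: `pick_best_hp` and `aggregate` are hypothetical names, the metric is assumed to be "higher is better", and the sample standard deviation is an assumption (the script may compute it differently).

```python
from statistics import mean, stdev


def pick_best_hp(dev_scores):
    """Return the HP setting (dict key) with the highest dev-set score."""
    return max(dev_scores, key=dev_scores.get)


def aggregate(test_scores):
    """Return (mean, sample std) over one test score per prompt set."""
    return mean(test_scores), stdev(test_scores)
```

Here `dev_scores` maps each HP setting to its dev score for one prompt set, and `test_scores` holds the three test scores obtained with the best HP for prompt sets 1, 2, and 3.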
| 175 | + |
| 176 | +If you prefer to have more control, you can also run it step-by-step as follows: |
| 177 | + |
| 178 | + |
| 179 | +```bash |
| 180 | +# Instantiate experiment configs with different HPs and write them in files. |
| 181 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 1 |
| 182 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 2 |
| 183 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 3 |
| 184 | +## if you make a change to base_configs, the above steps need to be rerun to |
| 185 | +## regenerate instantiated experiment configs (with HPs populated) |
| 186 | + |
| 187 | +# Run experiments for different HPs on dev set |
| 188 | +python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 1 |
| 189 | +## predict command runs evaluation at the end by default. If you want to run evaluation |
| 190 | +## separately after prediction, you can replace predict with evaluate here. |
| 191 | + |
| 192 | +# Show results for experiments with different HPs |
| 193 | +python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 1 |
| 194 | +## Not necessary as such, it'll just show you the results using different HPs in a nice table. |
| 195 | + |
| 196 | +# Pick the best HP and save the config with that HP. |
| 197 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 1 --best |
| 198 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 2 --best |
| 199 | +python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 3 --best |
| 200 | + |
| 201 | +# Run the experiment with best HP on test set |
| 202 | +python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 1 --best --eval_test --official |
| 203 | +python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 2 --best --eval_test --official |
| 204 | +python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 3 --best --eval_test --official |
| 205 | +## predict command runs evaluation at the end by default. If you want to run evaluation |
| 206 | +## separately after prediction, you can replace predict with evaluate here. |
| 207 | + |
| 208 | +# Summarize best test results for individual prompts and their aggregate (mean +- std) |
| 209 | +python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 1 --best --eval_test --official |
| 210 | +python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 2 --best --eval_test --official |
| 211 | +python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 3 --best --eval_test --official |
| 212 | +python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set aggregate --best --eval_test --official |
| 213 | +## The mean and std in the final command are what we reported in the paper. |
| 214 | +``` |
| 215 | + |
| 216 | +DISCLAIMER: Please note that our Codex-based experiments were done when Codex was free; it has since been deprecated. You can run these experiments with other OpenAI completion models, or with other open/commercial models (see notes below). But keep track of the cost, as it can add up quickly. |
| 217 | + |
| 218 | +# Running IRCoT (QA) using a Different Dataset or LLM |
| 219 | + |
| 220 | +Each experiment (system, model, data combination) in this project corresponds to an experiment config in `base_configs/...jsonnet`. Find the experiment closest to your use case and change the model, dataset, and related information in it as needed. |
| 221 | + |
| 222 | +If you've changed the dataset, you'll need to ensure the Elasticsearch index of that name is available (see processing-notes and setting-up-retriever for it). |
| 223 | + |
| 224 | +If you've changed the model, you'll need to ensure a model of that name is implemented and available in the code. If you want to try out a different OpenAI completion model, it'd just involve configuring the `engine` variable and setting the `model_tokens_limit` in here. The chat-based API isn't readily supported yet, but adding it shouldn't be much work if you're interested. If you're interested in open LLMs, like Llama, MPT, etc., you can set up an OpenAI-compliant FastChat server as shown [here](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), make the necessary changes in `base_configs/`, and you should be good to go. |
| 225 | + |
| 226 | +If you're stuck anywhere in this process, open an issue with your specific choice of data/model, and I can help you to get there. |
| 227 | + |
| 228 | +# Acknowledgement |
| 229 | + |
| 230 | +This code is heavily based on [CommaQA](https://github.com/allenai/CommaQA), which provides a way to build complex/multi-step systems involving agents. All modeling-related code for the IRCoT project is in `commaqa/inference/ircot.py`, and all experiment configs (without HPs instantiated) for this project are in `base_configs/`. |
| 231 | + |
| 232 | +# Citation |
| 233 | + |
| 234 | +If you find this work useful, consider citing it: |
| 235 | + |
| 236 | +```bib |
| 237 | +@article{trivedi2022interleaving, |
| 238 | + title={Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions}, |
| 239 | + author={Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish}, |
| 240 | + journal={arXiv preprint arXiv:2212.10509}, |
| 241 | + year={2022} |
| 242 | +} |
| 243 | +``` |