Commit 68f617e

add camera-ready code.
1 parent 23e3667 commit 68f617e

240 files changed, +45,210 -3 lines changed

.gitignore (+13)

@@ -0,0 +1,13 @@
*.pyc
.DS_Store
raw_data/
processed_data/
predictions/
instantiated_configs/
official_evaluation/
.retriever_address.json
.llm_server_address.json
.retriever_address.jsonnet
.llm_server_address.jsonnet
.history
.temp

README.md (+241 -3)

@@ -1,5 +1,243 @@

-# <h2 align="center"> Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions </h2>
+# <h1 align="center"> Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions </h1>

-This is the repository for our paper ["Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions"](https://github.com/StonyBrookNLP/ircot/blob/main/ircot.pdf).
+This is the repository for our ACL 2023 paper ["Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions"](https://arxiv.org/abs/2212.10509).

-The code and prompts for it will be released here soon.
+![IRCoT Main Figure](ircot.jpg?raw=true)

# Installation

```bash
conda create -n ircot python=3.8.0 -y && conda activate ircot
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

# Prepare Data

You can download all our processed data by running

```bash
./download/processed_data.sh
```

The data will be downloaded in `processed_data/{dataset_name}/`. If you're only looking for the dev/test data we used in the paper, it's in `processed_data/{dataset_name}/{dev|test}_subsampled.jsonl`.
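
For a quick sanity check, you can peek at one record, e.g. as below (a minimal sketch; it assumes the files are standard JSONL with one JSON object per line, and the `hotpotqa` path is just one choice of dataset):

```bash
# Pretty-print the first dev example of one dataset (dataset choice is illustrative)
head -n 1 processed_data/hotpotqa/dev_subsampled.jsonl | python -m json.tool
```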

<details>
<summary>Follow these steps if you want to regenerate all processed data from scratch.</summary>

```bash
# 1. Download the raw data:
## raw data will be in raw_data/{dataset_name}/
./download/raw_data.sh

# 2. Process the raw data files into a single standard format:
## processed data will be in processed_data/{dataset_name}/
python processing_scripts/process_hotpotqa.py
python processing_scripts/process_2wikimultihopqa.py
python processing_scripts/process_musique.py
python processing_scripts/process_iirc.py

# 3. Subsample the processed datasets.
## Note (i) dev processing has to be done before test.
## (ii) because of randomness it may create different samples than the ones we used,
##      so consider using the released data if the goal is reproduction.
## (iii) sampled data will be in processed_data/{dataset_name}/{dev|test}_subsampled.jsonl
python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa dev
python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa test
python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa dev
python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa test
python processing_scripts/subsample_dataset_and_remap_paras.py musique dev
python processing_scripts/subsample_dataset_and_remap_paras.py musique test
python processing_scripts/subsample_dataset_and_remap_paras.py iirc dev
python processing_scripts/subsample_dataset_and_remap_paras.py iirc test

# 4. Attach reasoning steps and supporting-paragraph annotations
## to the preprocessed (train) data files.
## To do this, you'll need to set up an Elasticsearch server and index all dataset corpora.
## See the 'Prepare Retriever and LLM Servers' section in this readme.
python prompt_generator/attach_data_annotations.py hotpotqa
python prompt_generator/attach_data_annotations.py 2wikimultihopqa
python prompt_generator/attach_data_annotations.py musique
python prompt_generator/attach_data_annotations.py iirc
```

</details>

You'll also need `raw_data` if you want to build the Elasticsearch indices and run the retrieval or ODQA systems.

```bash
./download/raw_data.sh
```

The data will be downloaded in `raw_data/{dataset_name}/`.


# Prepare Prompts

All our prompts are available in the `prompts/` directory. If you use these prompts outside of this codebase, note that the `# METADATA: ...` lines need to be stripped out at runtime.
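
For instance, a minimal way to do that (assuming the metadata lines start at the beginning of a line; the file path below is just one example, constructed the same way as the `prompt_file` in the experiment config shown later in this commit):

```bash
# Keep everything except the '# METADATA: ...' lines
grep -v '^# METADATA:' prompts/2wikimultihopqa/gold_with_2_distractors_context_cot_qa_codex.txt > prompt_without_metadata.txt
```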

If you want to generate them from scratch, run

```bash
python prompt_generator/generate_prompts.py {dataset_name} --task_name qa # hotpotqa, 2wikimultihopqa, musique, iirc
python prompt_generator/generate_prompts.py iirc --task_name no_context_open_retrieval
```

Note, though, that because of the random sampling used to select distractors, some of the regenerated prompts may differ. So if your goal is to reproduce the experiments, use the released ones.

# Prepare Retriever and LLM Servers

<details>
<summary> First, install Elasticsearch 7.10. </summary>

### Install on Mac (option 1)
```
# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/brew.html
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full # if it doesn't work, try 'brew untap elastic/tap' first (untap > tap > install)
brew services start elastic/tap/elasticsearch-full # to start the server
brew services stop elastic/tap/elasticsearch-full # to stop the server
```

### Install on Mac (option 2)
```
# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server
```

### Install on Linux

```
# source: https://www.elastic.co/guide/en/elasticsearch/reference/8.1/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server
```

Check out the referenced sources if you run into problems installing it.
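
One quick way to confirm the server is actually up before moving on (assuming the default port 9200):

```bash
curl -s localhost:9200                          # basic node/cluster info as JSON
curl -s 'localhost:9200/_cluster/health?pretty' # cluster health status
```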

</details>

Start the Elasticsearch server on port 9200 (the default), and then start the retriever server as shown below. You can change the Elasticsearch port in `retriever_server/serve.py` if needed.

```bash
uvicorn serve:app --port 8000 --app-dir retriever_server
```

Next, index the Wikipedia corpora for the datasets. Make sure you've downloaded `raw_data` and `processed_data` first.

```bash
python retriever_server/build_index.py {dataset_name} # hotpotqa, iirc, 2wikimultihopqa, musique
```

After indexing, you can check the number of documents in each index by running `curl localhost:9200/_cat/indices`. You should have 4 indices, one for each dataset, named `{dataset}-wikipedia`. Make sure they match the statistics given in the paper; you should expect to see the following sizes: HotpotQA (5,233,329), 2WikiMultihopQA (430,225), MuSiQue (139,416), and IIRC (1,882,415).
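
For example, to list just the index names and document counts (these are standard Elasticsearch cat-API parameters):

```bash
curl -s 'localhost:9200/_cat/indices?v&h=index,docs.count'
```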

Next, if you want to use flan-t5-* models, start the llm_server by running:

```bash
MODEL_NAME={model_name} uvicorn serve:app --port 8010 --app-dir llm_server # model_name: flan-t5-xxl, flan-t5-xl, flan-t5-large, flan-t5-base
```

If you want to use OpenAI models (e.g., Codex in our experiments), you don't need to start the LLM server; you just need to set the `OPENAI_API_KEY` environment variable.
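
For example (use your own key and don't commit it anywhere):

```bash
export OPENAI_API_KEY="<your-openai-api-key>"
```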

If you start the retriever and/or llm_server on a different host or port, update `.retriever_address.jsonnet` and `.llm_server_address.jsonnet` accordingly before running the retrieval/ODQA systems.


# Run Retrieval and ODQA Systems

First, download the dataset repositories for official evaluation: `./download/official_eval.sh`.

Next, set the variables:

- SYSTEM: choose from (`ircot`, `ircot_qa`, `oner`, `oner_qa`, `nor_qa`)
- MODEL: choose from (`codex`, `flan-t5-xxl`, `flan-t5-xl`, `flan-t5-large`, `flan-t5-base`, `none`)
- DATASET: choose from (`hotpotqa`, `2wikimultihopqa`, `musique`, `iirc`)

Systems ending in `_qa` are ODQA systems; the rest are retrieval systems. `ircot` and `ircot_qa` are our proposed systems; the others are baselines (see NoR and OneR in the paper). For `oner`, set MODEL to `none`; for all other systems, pick an actual model.
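
For example, one valid combination (any choice from the lists above works, keeping in mind that `oner` requires MODEL to be `none`):

```bash
SYSTEM=ircot_qa
MODEL=flan-t5-xl
DATASET=hotpotqa
```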

Now you can run the system with the (language) model and dataset of your choice by running:

```bash
./reproduce.sh $SYSTEM $MODEL $DATASET
```

This script runs several steps one after the other: instantiating the experiment configs with different HPs (hyperparameters), running predictions for them on the dev set, picking the best HP, creating the experiment config with the best HP, running it on the test set, and summarizing the results with mean and std.

If you prefer to have more control, you can also run it step by step as follows:

```bash
# Instantiate experiment configs with different HPs and write them to files.
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 1
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 2
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 3
## If you make a change to base_configs, the above steps need to be rerun to
## regenerate the instantiated experiment configs (with HPs populated).

# Run experiments for different HPs on the dev set.
python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 1
## The predict command runs evaluation at the end by default. If you want to run evaluation
## separately after prediction, you can replace predict with evaluate here.

# Show results for experiments with different HPs.
python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 1
## Not strictly necessary; it just shows the results for the different HPs in a table.

# Pick the best HP and save the config with that HP.
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 1 --best
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 2 --best
python runner.py $SYSTEM $MODEL $DATASET write --prompt_set 3 --best

# Run the experiment with the best HP on the test set.
python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 1 --best --eval_test --official
python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 2 --best --eval_test --official
python runner.py $SYSTEM $MODEL $DATASET predict --prompt_set 3 --best --eval_test --official
## The predict command runs evaluation at the end by default. If you want to run evaluation
## separately after prediction, you can replace predict with evaluate here.

# Summarize the best test results for the individual prompt sets and their aggregate (mean +- std).
python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 1 --best --eval_test --official
python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 2 --best --eval_test --official
python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set 3 --best --eval_test --official
python runner.py $SYSTEM $MODEL $DATASET summarize --prompt_set aggregate --best --eval_test --official
## The mean and std from the final command are what we report in the paper.
```

DISCLAIMER: Please note that our Codex-based experiments were done when Codex was free; it has since been deprecated. You can run these experiments with other OpenAI completion models, or with other open or commercial models (see the notes below), but keep track of the cost, as it can add up quickly.

# Running IRCoT (QA) using a Different Dataset or LLM

Each experiment (system, model, dataset combination) in this project corresponds to an experiment config in `base_configs/...jsonnet`. Find the experiment closest to your use case and change the model, dataset, and related information in it as needed.

If you've changed the dataset, you'll need to ensure an Elasticsearch index of that name is available (see the 'Prepare Data' and 'Prepare Retriever and LLM Servers' sections above).

If you've changed the model, you'll need to ensure a model of that name is implemented and available in the code. Trying out a different OpenAI completion model just involves setting the `engine` variable and the `model_tokens_limit` in the config. The chat-based API isn't readily supported yet, but adding it shouldn't be much work if you're interested. If you're interested in open LLMs such as Llama, MPT, etc., you can set up an OpenAI-compatible FastChat server as shown [here](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md), make the necessary changes in `base_configs/`, and you should be good to go.
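
For reference, the linked FastChat document starts an OpenAI-compatible server roughly like this (the model name is just an example, and pointing the OpenAI client at the server via `OPENAI_API_BASE` is one common way to wire it up):

```bash
# Per FastChat's openai_api.md; run each of the three servers in its own terminal
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Then direct OpenAI-style requests to the local server
export OPENAI_API_BASE=http://localhost:8000/v1
```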

If you're stuck anywhere in this process, open an issue with your specific choice of data/model, and I can help you get there.

# Acknowledgement

This code is heavily based on [CommaQA](https://github.com/allenai/CommaQA), which provides a way to build complex/multi-step systems involving agents. All modeling-related code for the IRCoT project is in `commaqa/inference/ircot.py`, and all experiment configs (without HPs instantiated) for this project are in `base_configs/`.

# Citation

If you find this work useful, consider citing it:

```bib
@article{trivedi2022interleaving,
  title={Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions},
  author={Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish},
  journal={arXiv preprint arXiv:2212.10509},
  year={2022}
}
```

New file (+90)

@@ -0,0 +1,90 @@

```jsonnet
# Set dataset:
local dataset = "2wikimultihopqa";
local retrieval_corpus_name = dataset;
local add_pinned_paras = if dataset == "iirc" then true else false;
local valid_qids = {
    "hotpotqa": ["5ab92dba554299131ca422a2","5a7bbc50554299042af8f7d0","5add363c5542990dbb2f7dc8","5a835abe5542996488c2e426","5ae0185b55429942ec259c1b","5a790e7855429970f5fffe3d","5a754ab35542993748c89819","5a89c14f5542993b751ca98a","5abb14bd5542992ccd8e7f07","5a89d58755429946c8d6e9d9","5a88f9d55542995153361218","5a90620755429933b8a20508","5a77acab5542992a6e59df76","5abfb3435542990832d3a1c1","5a8f44ab5542992414482a25","5adfad0c554299603e41835a","5a7fc53555429969796c1b55","5a8ed9f355429917b4a5bddd","5ac2ada5554299657fa2900d","5a758ea55542992db9473680"],
    "2wikimultihopqa": ["5811079c0bdc11eba7f7acde48001122","97954d9408b011ebbd84ac1f6bf848b6","35bf3490096d11ebbdafac1f6bf848b6","c6805b2908a911ebbd80ac1f6bf848b6","5897ec7a086c11ebbd61ac1f6bf848b6","e5150a5a0bda11eba7f7acde48001122","a5995da508ab11ebbd82ac1f6bf848b6","cdbb82ec0baf11ebab90acde48001122","f44939100bda11eba7f7acde48001122","4724c54e08e011ebbda1ac1f6bf848b6","f86b4a28091711ebbdaeac1f6bf848b6","13cda43c09b311ebbdb0ac1f6bf848b6","228546780bdd11eba7f7acde48001122","c6f63bfb089e11ebbd78ac1f6bf848b6","1ceeab380baf11ebab90acde48001122","8727d1280bdc11eba7f7acde48001122","f1ccdfee094011ebbdaeac1f6bf848b6","79a863dc0bdc11eba7f7acde48001122","028eaef60bdb11eba7f7acde48001122","af8c6722088b11ebbd6fac1f6bf848b6"],
    "musique": ["2hop__323282_79175","2hop__292995_8796","2hop__439265_539716","4hop3__703974_789671_24078_24137","2hop__154225_727337","2hop__861128_15822","3hop1__858730_386977_851569","2hop__642271_608104","2hop__387702_20661","2hop__131516_53573","2hop__496817_701819","2hop__804754_52230","3hop1__61746_67065_43617","3hop1__753524_742157_573834","2hop__427213_79175","3hop1__443556_763924_573834","2hop__782642_52667","2hop__102217_58400","2hop__195347_20661","4hop3__463724_100414_35260_54090"],
    "iirc": ["q_10236","q_3268","q_8776","q_9499","q_389","q_8350","q_3283","q_3208","q_1672","q_9433","q_8173","q_8981","q_10227","q_2466","q_8736","q_9591","q_10344","q_10270","q_9518","q_3290"],
}[dataset];
local prompt_reader_args = {
    "filter_by_key_values": {
        "qid": valid_qids
    },
    "order_by_key": "qid",
    "estimated_generation_length": 300,
    "shuffle": false,
    "model_length_limit": 8000,
};

# (Potentially) Hyper-parameters:
# null means it's unused.
local llm_retrieval_count = null;
local llm_map_count = null;
local bm25_retrieval_count = 6;
local rc_context_type_ = "gold_with_n_distractors"; # Choices: no, gold, gold_with_n_distractors
local distractor_count = "2"; # Choices: 1, 2, 3
local rc_context_type = (
    if rc_context_type_ == "gold_with_n_distractors"
    then "gold_with_" + distractor_count + "_distractors" else rc_context_type_
);
local multi_step_show_titles = null;
local multi_step_show_paras = null;
local multi_step_show_cot = null;
local rc_qa_type = null; # Choices: direct, cot

{
    "start_state": "step_by_step_bm25_retriever",
    "end_state": "[EOQ]",
    "models": {
        "step_by_step_bm25_retriever": {
            "name": "retrieve_and_reset_paragraphs",
            "next_model": "step_by_step_cot_reasoning_gen",
            "retrieval_type": "bm25",
            "retriever_host": std.extVar("RETRIEVER_HOST"),
            "retriever_port": std.extVar("RETRIEVER_PORT"),
            "retrieval_count": bm25_retrieval_count,
            "global_max_num_paras": 15,
            "query_source": "question_or_last_generated_sentence",
            "source_corpus_name": retrieval_corpus_name,
            "document_type": "title_paragraph_text",
            "return_pids": false,
            "cumulate_titles": true,
            "end_state": "[EOQ]",
        },
        "step_by_step_cot_reasoning_gen": {
            "name": "step_by_step_cot_gen",
            "next_model": "step_by_step_exit_controller",
            "prompt_file": "prompts/"+dataset+"/"+rc_context_type+"_context_cot_qa_codex.txt",
            "prompt_reader_args": prompt_reader_args,
            "generation_type": "sentences",
            "reset_queries_as_sentences": false,
            "add_context": true,
            "shuffle_paras": false,
            "terminal_return_type": null,
            "disable_exit": true,
            "end_state": "[EOQ]",
            "gen_model": "gpt3",
            "engine": "code-davinci-002",
            "retry_after_n_seconds": 50,
        },
        "step_by_step_exit_controller": {
            "name": "step_by_step_exit_controller",
            "next_model": "step_by_step_bm25_retriever",
            "answer_extractor_regex": ".* answer is:? (.*)\\.?",
            "answer_extractor_remove_last_fullstop": true,
            "terminal_state_next_model": null,
            "terminal_return_type": "pids",
            "global_max_num_paras": 15,
            "end_state": "[EOQ]",
        },
    },
    "reader": {
        "name": "multi_para_rc",
        "add_paras": false,
        "add_gold_paras": false,
        "add_pinned_paras": add_pinned_paras,
    },
    "prediction_type": "pids",
}
```
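
The retriever host and port above come in as external variables. The repo instantiates and runs such configs through `runner.py`, but as a standalone illustration of how those variables are consumed, you could render one with the `jsonnet` CLI like this (the config file name below is hypothetical):

```bash
# --ext-str supplies values for std.extVar(...); output is plain JSON
jsonnet --ext-str RETRIEVER_HOST=localhost --ext-str RETRIEVER_PORT=8000 \
    base_configs/ircot_qa_codex_2wikimultihopqa.jsonnet > instantiated_config.json
```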
