Commit 079acd4

Initial release
1 parent 2d83a5b commit 079acd4

22 files changed

+5328
-0
lines changed

Diff for: .gitignore

+136
@@ -0,0 +1,136 @@
1+
# Byte-compiled / optimized / DLL files
2+
__pycache__/
3+
*.py[cod]
4+
*$py.class
5+
6+
# C extensions
7+
*.so
8+
9+
# Distribution / packaging
10+
.Python
11+
build/
12+
develop-eggs/
13+
dist/
14+
downloads/
15+
eggs/
16+
.eggs/
17+
lib/
18+
lib64/
19+
parts/
20+
sdist/
21+
var/
22+
wheels/
23+
pip-wheel-metadata/
24+
share/python-wheels/
25+
*.egg-info/
26+
.installed.cfg
27+
*.egg
28+
MANIFEST
29+
30+
# PyInstaller
31+
# Usually these files are written by a python script from a template
32+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
33+
*.manifest
34+
*.spec
35+
36+
# Installer logs
37+
pip-log.txt
38+
pip-delete-this-directory.txt
39+
40+
# Unit test / coverage reports
41+
htmlcov/
42+
.tox/
43+
.nox/
44+
.coverage
45+
.coverage.*
46+
.cache
47+
nosetests.xml
48+
coverage.xml
49+
*.cover
50+
*.py,cover
51+
.hypothesis/
52+
.pytest_cache/
53+
54+
# Translations
55+
*.mo
56+
*.pot
57+
58+
# Django stuff:
59+
*.log
60+
local_settings.py
61+
db.sqlite3
62+
db.sqlite3-journal
63+
64+
# Flask stuff:
65+
instance/
66+
.webassets-cache
67+
68+
# Scrapy stuff:
69+
.scrapy
70+
71+
# Sphinx documentation
72+
docs/_build/
73+
74+
# PyBuilder
75+
target/
76+
77+
# Jupyter Notebook
78+
.ipynb_checkpoints
79+
80+
# IPython
81+
profile_default/
82+
ipython_config.py
83+
84+
# pyenv
85+
.python-version
86+
87+
# pipenv
88+
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89+
# However, in case of collaboration, if having platform-specific dependencies or dependencies
90+
# having no cross-platform support, pipenv may install dependencies that don't work, or not
91+
# install all needed dependencies.
92+
#Pipfile.lock
93+
94+
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
95+
__pypackages__/
96+
97+
# Celery stuff
98+
celerybeat-schedule
99+
celerybeat.pid
100+
101+
# SageMath parsed files
102+
*.sage.py
103+
104+
# Environments
105+
.env
106+
.venv
107+
env/
108+
venv/
109+
ENV/
110+
env.bak/
111+
venv.bak/
112+
113+
# Spyder project settings
114+
.spyderproject
115+
.spyproject
116+
117+
# Rope project settings
118+
.ropeproject
119+
120+
# mkdocs documentation
121+
/site
122+
123+
# mypy
124+
.mypy_cache/
125+
.dmypy.json
126+
dmypy.json
127+
128+
# Pyre type checker
129+
.pyre/
130+
131+
# GreaseLM project specific
132+
data/
133+
logs/
134+
runs/
135+
*.zip
136+
wandb/

Diff for: README.md

+117
@@ -0,0 +1,117 @@
1+
# GreaseLM: Graph REASoning Enhanced Language Models
2+
3+
This repo provides the source code & data of our paper "GreaseLM: Graph REASoning Enhanced Language Models".
4+
5+
<p align="center">
6+
<img src="./figs/greaselm.png" width="600" title="GreaseLM model architecture" alt="">
7+
</p>
8+
9+
## Usage
10+
### 1. Dependencies
11+
12+
- [Python](<https://www.python.org/>) == 3.8
13+
- [PyTorch](<https://pytorch.org/get-started/locally/>) == 1.8.0
14+
- [transformers](<https://github.com/huggingface/transformers/tree/v3.4.0>) == 3.4.0
15+
- [torch-geometric](https://pytorch-geometric.readthedocs.io/) == 1.7.0
16+
17+
Run the following commands to create a conda environment (assuming CUDA 10.1):
18+
```bash
19+
conda create -y -n greaselm python=3.8
20+
conda activate greaselm
21+
pip install numpy==1.18.3 tqdm
22+
pip install torch==1.8.0+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html
23+
pip install transformers==3.4.0 nltk spacy
24+
pip install wandb
25+
conda install -y -c conda-forge tensorboardx
26+
conda install -y -c conda-forge tensorboard
27+
28+
# for torch-geometric
29+
pip install torch-scatter==2.0.7 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
30+
pip install torch-cluster==1.5.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
31+
pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
32+
pip install torch-spline-conv==1.2.1 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
33+
pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
34+
```
35+
36+
37+
### 2. Download data
38+
39+
Download all the raw data (ConceptNet, CommonsenseQA, OpenBookQA) by running
40+
```
41+
./download_raw_data.sh
42+
```
43+
44+
You can preprocess the raw data by running
45+
```
46+
CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes>
47+
```
48+
You can choose which GPU to use by setting `CUDA_VISIBLE_DEVICES=...` at the beginning of the command. The script will:
49+
* Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
50+
* Convert the QA datasets into .jsonl files (e.g., stored in `data/csqa/statement/`)
51+
* Identify all mentioned concepts in the questions and answers
52+
* Extract subgraphs for each q-a pair
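
The relation-merging step in the list above can be pictured as a many-to-one lookup. The sketch below is illustrative only: `MERGED_RELATIONS` is a hypothetical three-group excerpt, not the table that `preprocess.py` actually uses, and `merge_relation` is a name introduced here for the example.

```python
from typing import Dict, List, Optional

# Hypothetical excerpt: each merged type lists the raw ConceptNet relations
# it absorbs. The real 42 -> 17 table is not shown in this README.
MERGED_RELATIONS: Dict[str, List[str]] = {
    "isa": ["isa", "instanceof", "definedas"],
    "atlocation": ["atlocation", "locatednear"],
    "causes": ["causes", "causesdesire"],
}

# Invert the table once so that each lookup is a single dict access.
RAW_TO_MERGED: Dict[str, str] = {
    raw: merged
    for merged, raws in MERGED_RELATIONS.items()
    for raw in raws
}


def merge_relation(raw_relation: str) -> Optional[str]:
    """Map a raw relation (e.g. "/r/IsA" or "IsA") to its merged type.

    Returns None for relations that are not part of any merged group.
    """
    name = raw_relation.rsplit("/", 1)[-1].lower()
    return RAW_TO_MERGED.get(name)
```

Relations that return `None` would simply be dropped in this sketch; whether the actual preprocessing drops or keeps them is not stated here.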
53+
54+
**TL;DR**. The preprocessing may take a long time; for your convenience, you can download all the processed data [here](https://drive.google.com/drive/folders/1T6B4nou5P3u-6jr0z6e3IkitO8fNVM6f?usp=sharing) into the top-level directory of this repo and run
55+
```
56+
unzip data_preprocessed.zip
57+
```
58+
59+
The resulting file structure should look like this:
60+
61+
```plain
62+
.
63+
├── README.md
64+
└── data/
65+
    ├── cpnet/                 (preprocessed ConceptNet)
66+
    └── csqa/
67+
        ├── train_rand_split.jsonl
68+
        ├── dev_rand_split.jsonl
69+
        ├── test_rand_split_no_answers.jsonl
70+
        ├── statement/             (converted statements)
71+
        ├── grounded/              (grounded entities)
72+
        ├── graphs/                (extracted subgraphs)
73+
        ├── ...
74+
```
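
Before training, it may be worth confirming that the layout above is in place. The following is a minimal sketch, not part of the repo: `missing_paths` and `EXPECTED` are names introduced here, and the list covers only some of the entries shown in the tree.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(root: str, expected: Iterable[str]) -> List[str]:
    """Return the entries of `expected` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


# Paths taken from the tree above; extend the list as needed.
EXPECTED = [
    "data/cpnet",
    "data/csqa/train_rand_split.jsonl",
    "data/csqa/dev_rand_split.jsonl",
    "data/csqa/test_rand_split_no_answers.jsonl",
    "data/csqa/statement",
    "data/csqa/grounded",
]

if __name__ == "__main__":
    # Run from the top-level directory of the repo.
    missing = missing_paths(".", EXPECTED)
    print("missing:", missing if missing else "none")
```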
75+
76+
### 3. Training GreaseLM
77+
To train GreaseLM on CommonsenseQA, run
78+
```
79+
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh csqa --data_dir data/
80+
```
81+
You can use up to 2 GPUs by listing them in `CUDA_VISIBLE_DEVICES=...` at the beginning of the command.
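
As a rough illustration of how the `CUDA_VISIBLE_DEVICES` prefix can be inspected from Python, the sketch below parses the variable and checks it against the two-GPU limit stated above. `visible_gpu_ids`, `check_gpu_count`, and `MAX_GPUS` are names introduced here and are not part of the repo.

```python
import os
from typing import List, Mapping, Optional

MAX_GPUS = 2  # limit stated in this README; the constant name is hypothetical


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Returns [] when the variable is unset or empty; this sketch does not
    distinguish between those two cases.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_gpu_count(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the visible device ids, raising if more than MAX_GPUS are listed."""
    ids = visible_gpu_ids(env)
    if len(ids) > MAX_GPUS:
        raise ValueError(f"expected at most {MAX_GPUS} GPUs, got {len(ids)}: {ids}")
    return ids
```

For example, running with `CUDA_VISIBLE_DEVICES=0,1` would make `check_gpu_count()` return `["0", "1"]`.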
82+
83+
Similarly, to train GreaseLM on OpenBookQA, run
84+
```
85+
CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh obqa --data_dir data/
86+
```
87+
88+
### 4. Pretrained model checkpoints
89+
You can download a pretrained GreaseLM model on CommonsenseQA [here](https://drive.google.com/file/d/1QPwLZFA6AQ-pFfDR6TWLdBAvm3c_HOUr/view?usp=sharing), which achieves an in-house dev (IH-dev) accuracy of `79.0` and an in-house test (IH-test) accuracy of `74.0`.
90+
91+
You can also download a pretrained GreaseLM model on OpenBookQA [here](https://drive.google.com/file/d/1-QqyiQuU9xlN20vwfIaqYQ_uJMP8d7Pv/view?usp=sharing), which achieves a test accuracy of `84.8`.
92+
93+
### 5. Evaluating a pretrained model checkpoint
94+
To evaluate a pretrained GreaseLM model checkpoint on CommonsenseQA, run
95+
```
96+
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh csqa --data_dir data/ --load_model_path /path/to/checkpoint
97+
```
98+
Again, you can use up to 2 GPUs by listing them in `CUDA_VISIBLE_DEVICES=...` at the beginning of the command.
99+
100+
Similarly, to evaluate a pretrained GreaseLM model checkpoint on OpenBookQA, run
101+
```
102+
CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh obqa --data_dir data/ --load_model_path /path/to/checkpoint
103+
```
104+
105+
### 6. Use your own dataset
106+
- Convert your dataset to `{train,dev,test}.statement.jsonl` in .jsonl format (see `data/csqa/statement/train.statement.jsonl`)
107+
- Create the directory `data/{yourdataset}/` to store the .jsonl files
108+
- Modify `preprocess.py` and perform subgraph extraction for your data
109+
- Modify `utils/parser_utils.py` to support your own dataset
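
When converting a new dataset, a quick well-formedness check on the `.jsonl` files can catch problems before preprocessing. The sketch below only verifies that every non-blank line is a JSON object; it makes no assumption about which fields the statement format requires (see `data/csqa/statement/train.statement.jsonl` for those). `iter_jsonl` and `count_records` are names introduced here, not part of the repo.

```python
import json
from typing import Iterator, Tuple


def iter_jsonl(path: str) -> Iterator[Tuple[int, dict]]:
    """Yield (line_number, record) for each non-blank line of a .jsonl file.

    Raises ValueError if a line is not valid JSON or is not a JSON object.
    """
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            yield line_number, record


def count_records(path: str) -> int:
    """Return the number of valid records in the file."""
    return sum(1 for _ in iter_jsonl(path))
```

A call such as `count_records("data/yourdataset/train.statement.jsonl")` (a hypothetical path following the naming pattern above) would return the number of examples or raise at the first malformed line.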
110+
111+
## Acknowledgment
112+
This repo is built upon the following work:
113+
```
114+
QA-GNN: Question Answering using Language Models and Knowledge Graphs
115+
https://github.com/michiyasunaga/qagnn
116+
```
117+
Many thanks to the authors and developers!

Diff for: download_raw_data.sh

+35
@@ -0,0 +1,35 @@
1+
# download ConceptNet
2+
mkdir -p data/
3+
mkdir -p data/cpnet/
4+
wget -nc -P data/cpnet/ https://s3.amazonaws.com/conceptnet/downloads/2018/edges/conceptnet-assertions-5.6.0.csv.gz
5+
cd data/cpnet/
6+
yes n | gzip -d conceptnet-assertions-5.6.0.csv.gz
7+
# download ConceptNet entity embedding
8+
wget https://csr.s3-us-west-1.amazonaws.com/tzw.ent.npy
9+
cd ../../
10+
11+
12+
13+
14+
# download CommonsenseQA dataset
15+
mkdir -p data/csqa/
16+
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/train_rand_split.jsonl
17+
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl
18+
wget -nc -P data/csqa/ https://s3.amazonaws.com/commensenseqa/test_rand_split_no_answers.jsonl
19+
20+
# create output folders
21+
mkdir -p data/csqa/grounded/
22+
mkdir -p data/csqa/graph/
23+
mkdir -p data/csqa/statement/
24+
25+
26+
27+
# download OpenBookQA dataset
28+
wget -nc -P data/obqa/ https://s3-us-west-2.amazonaws.com/ai2-website/data/OpenBookQA-V1-Sep2018.zip
29+
yes n | unzip data/obqa/OpenBookQA-V1-Sep2018.zip -d data/obqa/
30+
31+
# create output folders
32+
mkdir -p data/obqa/fairseq/official/
33+
mkdir -p data/obqa/grounded/
34+
mkdir -p data/obqa/graph/
35+
mkdir -p data/obqa/statement/

Diff for: eval_greaselm.sh

+22
@@ -0,0 +1,22 @@
1+
export TOKENIZERS_PARALLELISM=true
2+
dt=$(date '+%Y%m%d_%H%M%S')
3+
4+
dataset=$1
5+
shift
6+
args=$@
7+
ebs=32
8+
9+
# Added for eval
10+
mode=eval
11+
12+
echo "***** hyperparameters *****"
13+
echo "dataset: $dataset"
14+
echo "******************************"
15+
16+
run_name=eval__greaselm__${dataset}__${dt}
17+
18+
###### Eval ######
19+
python3 -u greaselm.py \
20+
--run_name ${run_name} \
21+
--mode ${mode} -ebs ${ebs} --dataset ${dataset} \
22+
$args

Diff for: figs/greaselm.png

134 KB