**This code is for our paper: Pre-trained language models with domain knowledge for biomedical extractive summarization
Python version: This code is in Python3.6
Package Requirements: torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge
Some codes are borrowed from PreSumm (https://github.com/nlpyang/PreSumm)
Download and unzip the CORD-19
directories from here. Put all files in the directory ./raw_data
Download zip file from [here] (https://drive.google.com/file/d/1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja/view). You can also use the command below to download the files via the cli using linux. Put all files in directory ./raw
.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja" -O pubmed.zip && rm -rf /tmp/cookies.txt
Details of the dataset can be found [here] (https://github.com/allenai/s2orc). To prepare, follow instructions [here] (src/datasets/s2orc/README.md))
We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile (/.bashrc
file):
for file in `find /home/qianqian/stanford-corenlp-4.2.1 -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
replacing /path/to/
with the path to where you saved the stanford-corenlp-4.2.0
directory.
For CORD19 or S2ORC data (both from allenai), use the following command to preprocess the data, the raw data files should be in a folder ./raw_data/pmc_data/document_parses/pmc_json/ with the associated metadata csv file at ./raw_data/pmc_data/document_parses/metadata.csv
python src/preprocess.py -mode tokenize_allenai_datasets -raw_path ./raw_data/ -save_path ./token_data/ -log ./tokenize_allenai.log
For the PubMed dataset.
python src/preprocess.py -mode tokenize_pubmed_dataset -raw_path ./raw/ -save_path ./token_data/ -log ./tokenize_pubmed.log
RAW_PATH
is the directory containing story files,save_path
is the target directory to save the generated tokenized files
Using scibert (https://github.com/allenai/scibert) trained on the EBM-NLP dataset (https://github.com/bepnye/EBM-NLP):
- Preprocess the tokenized data into the pico input data on the trained scibert:
python src/preprocess_pico.py -raw_path .=/token_data/ -save_path ..output_data/pico_preprocess/
- Training pico extraction model
Install the allennlp:
git clone https://github.com/ibeltagy/allennlp.git
git checkout fp16_and_others
pip install --editable .
Download scibert model with the link (https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar)
Export scibert in the bash script:
export BERT_VOCAB=/home/qianqian/scibert/model/vocab.txt
export BERT_WEIGHTS=/home/qianqian/scibert/model/weights.tar.gz
bash scripts/train_allennlp_local.sh wotune_model/
- Predicting pico for cord-19
export CUDA_VISIBLE_DEVICES=0
export PICO_MODE='PREDICT'
cd ./scibert
python -m allennlp.run predict --output-file=/mnt/disk/jenny/pubmed-dataset/pico_preprocess/test/output.txt --include-package=scibert --predictor=sentence-tagger --use-dataset-reader --cuda-device=0 --batch-size=32 --silent /home/qianqian/scratch/scibert/wotune_model/model.tar.gz /mnt/disk/jenny/pubmed-dataset/pico_preprocess/test/cord.txt
- Format the predicted pico to Json Files
python src/pico_predict_read.py -raw_path ./data/pico/ebmnlp/cord.txt -save_path .=/token_data/ -predict_path out.txt
python src/preprocess.py -mode format_to_lines -raw_path ./token_data/ -save_path ./json_data -log ./tokenize.log
RAW_PATH
is the directory containing tokenized files,JSON_PATH
is the target directory to save the generated json files
python src/preprocess.py -mode format_to_bert -raw_path ./json_data/ -save_path ./bert_data/ -lower -n_cpus 1 -log_file ./logs/preprocess.log
JSON_PATH
is the directory containing json files,BERT_DATA_PATH
is the target directory to save the generated binary files- Note depending on model type you want to use, you can change
format_to_bert
toformat_to_pubmed_bert
orformat_to_robert
Step 7. Pico Adapter - train PICO adapter model which will be included as an adapter in model training in the next step
python src/preprocess.py -mode format_to_pico_adapter -raw_path ./json_data/ -save_path ./pico_adapter_data/ -log_file ./pico_adapter_robert.log
Note depending on model type you want to use, you can change format_to_bert
to format_to_pico_adapter_pubmed_bert
or format_to_pico_adapter_robert
CUDA_VISIBLE_DEVICES=0 python src/pico_adapter.py -model robert -path /data/xieqianqian/covid-bert/data/pico_roberta_data -output ./pico_adapter_output
- -model can be [bert, robert, pubmed]
CUDA_VISIBLE_DEVICES=0 python src/pico_adapter_ml.py -model robert -path /data/xieqianqian/covid-bert/data/pico_roberta_data -output ./pico_adapter_output
- -model can be [bert, robert, pubmed]
First run: For the first time, you should use single-GPU, so the code can download the BERT model. Use -visible_gpus -1
, after downloading, you could kill the process and rerun the code with multi-GPUs.
python src/train.py -task ext -mode train -bert_data_path /data/xieqianqian/covid-bert/data/pubmed_data/ -ext_dropout 0.4 -model_path /data/xieqianqian/covid-bert/models_2/ -lr 2e-3 -visible_gpus 2 -report_every 50 -save_checkpoint_steps 1000 -batch_size 12000 -train_steps 20000 -accum_count 2 -log_file /data/xieqianqian/covid-bert/logs/ext_bert_covid -use_interval true -warmup_steps 5000 -model pubmed -adapter_training_strategy discriminative -adapter_path_pubmed_discriminative /home/jenny/data/covid/pico_adapter_model_outputs_pubmed/adapter/final_pubmed_adapter
- -training strategy can be [generative, discriminative, both]
- depending on training strategy and model type you can set adapter_path_pubmed_discriminative/adapter_path_bert_discriminative/adapter_path_robert_discriminative and adapter_path_pubmed_generative/adapter_path_bert_generative/adapter_path_robert_generative variables to trained adapter models
python src/train.py -task ext -mode validate -batch_size 12000 -test_batch_size 12000 -bert_data_path ./bert_data/ -log_file ./logs/val_ext_bert_covid -model_path ./models/ -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -result_path ./results/ext_bert_covid -test_all True -model bert
python src/train.py -task ext -mode test -batch_size 3000 -test_batch_size 500 -bert_data_path ./bert_data/ -log_file ./logs/test_ext_bert_covid -test_from ./models/model_step_9000.pt -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -result_path ./results/ext_bert_covid -model bert
-mode
can be {validate, test
}, wherevalidate
will inspect the model directory and evaluate the model for each saved checkpoint,test
need to be used with-test_from
, indicating the checkpoint you want to use (choose the top checkpoint on the validation dataset)MODEL_PATH
is the directory of saved checkpoints- use
-mode valiadte
with-test_all
, the system will load all saved checkpoints and select the top ones to generate summaries