This document has instructions for running BERT in the following modes:
- BFloat16 training
- FP32 training
- BFloat16 inference
- FP32 inference
For all fine-tuning, the datasets (SQuAD, MultiNLI, MRPC, etc.) and checkpoints should be downloaded as described in the Google bert repo; refer to that repo's reference page for checkpoints.

When running BERT for the different tasks (SQuAD, Classifier, and Pretraining), setting environment variables can simplify the commands. An example setting is below:
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
# For Classifiers
export GLUE_DIR=/home/<user>/glue
# For SQuAD
export SQUAD_DIR=/home/<user>/SQuAD
```
## BFloat16 Training Instructions (Experimental)
- Clone the intelai/models repo. This repo has the launch script for running the model, which we will use in the next step.

```
git clone https://github.com/IntelAI/models.git
```
- Download checkpoints and data for BERT from the google bert repo. Keep all data (SQuAD, GLUE) under one directory. For training from scratch, the Wikipedia and BookCorpus datasets need to be downloaded and pre-processed. A possible layout is sketched below.
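The commands in this document assume the checkpoints and datasets sit side by side under the user's home directory. A sketch of one possible layout (the directory names follow the paths used in the commands below; adjust to your setup):

```
/home/<user>/
├── wwm_uncased_L-24_H-1024_A-16/    # BERT Large checkpoints, vocab.txt, bert_config.json
├── SQuAD/                           # train-v1.1.json, dev-v1.1.json, evaluate-v1.1.py
├── glue/                            # MRPC, MNLI, and other GLUE task data
└── pretraining/tf-record-directory/ # pre-processed Wikipedia/BookCorpus TFRecords
```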
- To run SQuAD, navigate to the `models/benchmarks` directory and run the following command. Note: add a space after `--` for the BERT-specific options.
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
export SQUAD_DIR=/home/<user>/SQuAD
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $SQUAD_DIR:$SQUAD_DIR \
-- train_option=SQuAD \
vocab_file=$BERT_LARGE_DIR/vocab.txt \
config_file=$BERT_LARGE_DIR/bert_config.json \
init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
do_train=True \
train_file=$SQUAD_DIR/train-v1.1.json \
do_predict=True \
predict_file=$SQUAD_DIR/dev-v1.1.json \
learning_rate=3e-5 \
num_train_epochs=2 \
max_seq_length=384 \
doc_stride=128 \
optimized_softmax=True \
experimental_gelu=False \
do_lower_case=True
```
The results file will be written to the `models/benchmarks/common/tensorflow/logs` directory, unless another output directory is specified by the `--output-dir` arg. The dev set predictions will be saved to a file called `predictions.json` in the output directory. To score the predictions, run the SQuAD evaluation script against them:

```
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /home/<user>/models/benchmarks/common/tensorflow/logs/predictions.json
```
An execution with these parameters produces results in line with the scores below:

```
Bf16: {"exact_match": 86.77388836329234, "f1": 92.98642358746287}
FP32: {"exact_match": 86.72658467360453, "f1": 92.98046893150796}
```
To run distributed training of SQuAD for better throughput, simply specify `--mpi_num_processes=<num_of_sockets>` [`--mpi_num_processes_per_socket=1`]. To set `--mpi_num_processes=<num_of_sockets>`, please run `lscpu` on your machine to check the available number of sockets.
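For example, one quick way to read the socket count (the grep pattern is just one way to pull the relevant field out of the `lscpu` output):

```
# Print the "Socket(s):" line from lscpu; use its value for --mpi_num_processes
lscpu | grep "Socket(s)"
```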
Note:
- The global batch size is `mpi_num_processes * train_batch_size`, and sometimes the learning rate needs to be adjusted for convergence. Square root learning rate scaling is used by default.
- For BERT fine-tuning, state-of-the-art accuracy can be achieved via parallel training without synchronizing gradients between MPI workers. The `--mpi_workers_sync_gradients=[True/False]` option controls whether the MPI workers sync gradients. By default it is set to `False`, meaning the workers train independently and the best performing training results are picked in the end.
- To enable gradient synchronization, set `--mpi_workers_sync_gradients` to `True` in the BERT-specific options.
- The option `optimized_softmax=True` can be set for better performance. A short sketch of the batch-size and learning-rate arithmetic follows this list.
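As a minimal sketch of the arithmetic above (assuming square root scaling multiplies the base rate by the square root of the worker count; the numbers are illustrative only):

```
# Illustrative only: global batch size and sqrt learning-rate scaling
NUM_PROCESSES=2        # e.g. one MPI process per socket on a 2-socket machine
TRAIN_BATCH_SIZE=24    # per-worker batch size from the command above
GLOBAL_BATCH_SIZE=$((NUM_PROCESSES * TRAIN_BATCH_SIZE))   # 2 * 24 = 48
# sqrt scaling: scaled_lr = base_lr * sqrt(num_processes)
SCALED_LR=$(python3 -c "import math; print(3e-5 * math.sqrt($NUM_PROCESSES))")
echo "global batch size: $GLOBAL_BATCH_SIZE, scaled learning rate: $SCALED_LR"
```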
Navigate to the `models/benchmarks` directory and run the following command. Note: add a space after `--` for the BERT-specific options.
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
export SQUAD_DIR=/home/<user>/SQuAD
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--mpi_num_processes=<num_of_sockets> \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $SQUAD_DIR:$SQUAD_DIR \
-- train_option=SQuAD \
vocab_file=$BERT_LARGE_DIR/vocab.txt \
config_file=$BERT_LARGE_DIR/bert_config.json \
init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
do_train=True \
train_file=$SQUAD_DIR/train-v1.1.json \
do_predict=True \
predict_file=$SQUAD_DIR/dev-v1.1.json \
learning_rate=3e-5 \
num_train_epochs=2 \
max_seq_length=384 \
doc_stride=128 \
optimized_softmax=True \
experimental_gelu=False \
do_lower_case=True
```
The results file will be written to the `models/benchmarks/common/tensorflow/logs` directory, unless another output directory is specified by the `--output-dir` arg.
- To run a Classifier, set `train-option=Classifier`. Download the GLUE data by running the download script referenced in the google bert repo. This example fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples. One possible way to fetch the data is sketched below.
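A minimal sketch of fetching the MRPC data, assuming the commonly referenced `download_glue_data.py` gist; the URL and flags below are assumptions, so verify them against the script pointer in the google bert repo:

```
# Assumed helper script; check the google bert repo for the canonical link
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/download_glue_data.py
python download_glue_data.py --data_dir $GLUE_DIR --tasks MRPC
```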
```
export BERT_BASE_DIR=/home/<user>/uncased_L-12_H-768_A-12
export GLUE_DIR=/home/<user>/glue
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=32 \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_BASE_DIR:$BERT_BASE_DIR \
--volume $GLUE_DIR:$GLUE_DIR \
-- train-option=Classifier \
task-name=MRPC \
do-train=true \
do-eval=true \
data-dir=$GLUE_DIR/MRPC \
vocab-file=$BERT_BASE_DIR/vocab.txt \
config-file=$BERT_BASE_DIR/bert_config.json \
init-checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
max-seq-length=128 \
learning-rate=2e-5 \
num-train-epochs=30 \
optimized_softmax=True \
experimental_gelu=False \
do-lower-case=True
```
The results file will be written to the `models/benchmarks/common/tensorflow/logs` directory, unless another output directory is specified by the `--output-dir` arg.
To run distributed training of the Classifier for better throughput, specify `--mpi_num_processes=<num_of_sockets>` [`--mpi_num_processes_per_socket=1`]. Please run `lscpu` on your machine to check the available number of sockets. Note that the global batch size is `mpi_num_processes * train_batch_size`, and sometimes the learning rate needs to be adjusted for convergence. By default, the script uses square root learning rate scaling.
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
export GLUE_DIR=/home/<user>/glue
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=32 \
--mpi_num_processes=<num_of_sockets> \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $GLUE_DIR:$GLUE_DIR \
-- train-option=Classifier \
task-name=MRPC \
do-train=true \
do-eval=true \
data-dir=$GLUE_DIR/MRPC \
vocab-file=$BERT_LARGE_DIR/vocab.txt \
config-file=$BERT_LARGE_DIR/bert_config.json \
init-checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
max-seq-length=128 \
learning-rate=2e-5 \
num-train-epochs=30 \
optimized_softmax=True \
experimental_gelu=True \
do-lower-case=True
```
The results file will be written to the `models/benchmarks/common/tensorflow/logs` directory, unless another output directory is specified by the `--output-dir` arg.
- To run pre-training from scratch, set `train-option=Pretraining`. Pre-training has two phases: in the first phase, the data is generated for sequence length 128, and in the second phase, sequence length 512 is used. Please follow the instructions in the google bert repo for data pre-processing; a sketch of the two-phase data generation follows below.
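A minimal sketch of the two-phase data generation, assuming the `create_pretraining_data.py` script and flags from the google bert repo; the input path and phase-1 output name are placeholders, and the variables are the ones exported in the command below:

```
# Phase 1: generate examples with sequence length 128
python create_pretraining_data.py \
--input_file=/home/<user>/pretraining/corpus.txt \
--output_file=$PRETRAINING_DATA_DIR/tf_wiki_formatted_slen128.tfrecord \
--vocab_file=$BERT_LARGE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15

# Phase 2: generate examples with sequence length 512
# (76 predictions per sequence matches max-predictions=76 in the command below)
python create_pretraining_data.py \
--input_file=/home/<user>/pretraining/corpus.txt \
--output_file=$PRETRAINING_DATA_DIR/tf_wiki_formatted_slen512.tfrecord \
--vocab_file=$BERT_LARGE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15
```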
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
export PRETRAINING_DATA_DIR=/home/<user>/pretraining/tf-record-directory
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--socket-id=0 \
--num-intra-threads=24 \
--num-inter-threads=1 \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $PRETRAINING_DATA_DIR:$PRETRAINING_DATA_DIR \
-- train-option=Pretraining \
input-file=$PRETRAINING_DATA_DIR/tf_wiki_formatted_slen512.tfrecord \
output-dir=/tmp/pretraining_output \
do-train=True \
do-eval=True \
config-file=$BERT_LARGE_DIR/bert_config.json \
init-checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
max-seq-length=512 \
max-predictions=76 \
num-train-steps=20 \
warmup-steps=10 \
learning-rate=2e-5 \
optimized_softmax=True \
experimental_gelu=False \
profile=False
```
To run distributed training of pretraining for better throughput, simply specify `--mpi_num_processes=<num_of_sockets>` [`--mpi_num_processes_per_socket=1`]. Please run `lscpu` on your machine to check the available number of sockets. Note that the global batch size is `mpi_num_processes * train_batch_size`, and sometimes the learning rate needs to be adjusted for convergence. By default, the script uses square root learning rate scaling.
```
export BERT_LARGE_DIR=/home/<user>/wwm_uncased_L-24_H-1024_A-16
export PRETRAINING_DATA_DIR=/home/<user>/pretraining/tf-record-directory
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--num-intra-threads=22 \
--num-inter-threads=1 \
--mpi_num_processes=4 \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $PRETRAINING_DATA_DIR:$PRETRAINING_DATA_DIR \
-- train-option=Pretraining \
input-file=$PRETRAINING_DATA_DIR/tf_wiki_formatted_slen512.tfrecord \
output-dir=/tmp/pretraining_output \
do-train=True \
do-eval=True \
config-file=$BERT_LARGE_DIR/bert_config.json \
init-checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
max-seq-length=512 \
max-predictions=76 \
num-train-steps=20 \
warmup-steps=10 \
learning-rate=2e-5 \
optimized_softmax=True \
experimental_gelu=False \
profile=False
```
Note: for best performance, set `num-intra-threads` as follows:
- For a single-instance run (`mpi_num_processes=1`): set it to the number of logical cores per socket.
- For a multi-instance run (`mpi_num_processes>1`): set it to (num_of_logical_cores_per_socket - 2). A sketch of deriving these values follows this list.
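One way to derive those values from `lscpu` (the awk field parsing below is an assumption about the usual `lscpu` output format; check your machine's output):

```
# Logical cores per socket = cores per socket * threads per core
CORES_PER_SOCKET=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
THREADS_PER_CORE=$(lscpu | awk -F: '/^Thread\(s\) per core/ {gsub(/ /, "", $2); print $2}')
LOGICAL_PER_SOCKET=$((CORES_PER_SOCKET * THREADS_PER_CORE))
echo "single instance: --num-intra-threads=$LOGICAL_PER_SOCKET"
echo "multi-instance:  --num-intra-threads=$((LOGICAL_PER_SOCKET - 2))"
```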
FP32 training instructions are the same as the BFloat16 training instructions above, except that `--precision=bfloat16` needs to be changed to `--precision=fp32` in the commands.
## BFloat16 Inference Instructions (Experimental)
1. Clone the Intel Model Zoo:

```
git clone https://github.com/IntelAI/models.git
```
2. Download and unzip the BERT Large uncased (whole word masking) model from the google bert repo. Then download the `dev-v1.1.json` SQuAD dev set file into the `wwm_uncased_L-24_H-1024_A-16` directory that was just unzipped.

```
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P wwm_uncased_L-24_H-1024_A-16
```

The `wwm_uncased_L-24_H-1024_A-16` directory is what will be passed as the `--data-location` when running inference in step 4.
3. Download and unzip the pretrained model:

```
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/bert_large_checkpoints.zip
unzip bert_large_checkpoints.zip
```

This directory will be passed as the `--checkpoint` location when running inference in step 4.
4. Navigate to the `benchmarks` directory in the model zoo that you cloned in step 1. This directory contains the `launch_benchmark.py` script that will be used to run inference.

```
cd benchmarks
```

BERT Large inference can be run in 3 different modes:
- Benchmark:

```
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location /home/<user>/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint /home/<user>/bert_large_checkpoints \
--output-dir /home/<user>/bert-squad-output \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
-- infer_option=SQuAD
```
- Profile:

```
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location /home/<user>/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint /home/<user>/bert_large_checkpoints \
--output-dir /home/<user>/bert-squad-output \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
-- profile=True infer_option=SQuAD
```
- Accuracy:

```
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location /home/<user>/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint /home/<user>/bert_large_checkpoints \
--output-dir /home/<user>/bert-squad-output \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--accuracy-only \
-- infer_option=SQuAD
```
Output files and logs are saved to the `models/benchmarks/common/tensorflow/logs` directory. To change the location, use the `--output-dir` flag when running the `launch_benchmark.py` script.

Note that args specific to this model are specified after `--` at the end of the command (like the `profile=True` arg in the Profile command above). Below is a list of all of the model-specific args and their default values; an example of overriding them follows the table.

| Model arg       | Default value    |
|-----------------|------------------|
| doc_stride      | 128              |
| max_seq_length  | 384              |
| profile         | False            |
| config_file     | bert_config.json |
| vocab_file      | vocab.txt        |
| predict_file    | dev-v1.1.json    |
| init_checkpoint | model.ckpt-3649  |
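For example, a benchmark run that overrides two of those defaults might look like this (the override values here are purely illustrative):

```
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location /home/<user>/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint /home/<user>/bert_large_checkpoints \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
-- infer_option=SQuAD doc_stride=64 max_seq_length=256
```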
FP32 inference instructions are the same as the BFloat16 inference instructions above, except that `--precision=bfloat16` needs to be changed to `--precision=fp32` in the commands.