Lexical Analysis Model Implementation
The model is built with LangML. The training script is as follows:
export PYTHONUNBUFFERED=1; CUDA_VISIBLE_DEVICES=0 nohup langml-cli baseline ner bert-crf \
--epoch 20 \
--batch_size 64 \
--learning_rate 0.00002 \
--dropout_rate 0.2 \
--max_len 300 \
--lowercase \
--config_path roberta-tiny/bert_config.json \
--ckpt_path roberta-tiny/bert_model.ckpt \
--vocab_path roberta-tiny/vocab.txt \
--train_path data/train.txt \
--dev_path data/dev.txt \
--save_dir bert-crf-ner \
--verbose 2 > bert-crf.log &
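
The contents of data/train.txt and data/dev.txt are not shown on this page. As a hedged illustration only, the sketch below writes a toy corpus in the common CoNLL-style one-token-per-line format with BIO labels, a typical convention for BERT-CRF NER training; check the LangML documentation for the format its ner baseline actually expects.

```python
# Hypothetical sketch of a CoNLL-style "token<TAB>label" corpus. This format
# is an assumption for illustration; LangML's ner baseline may expect a
# different layout -- consult its docs for --train_path / --dev_path.
samples = [
    [("7", "B-TIME"), ("月", "I-TIME"), ("1", "I-TIME"), ("日", "I-TIME"),
     ("北", "B-LOCATION"), ("京", "I-LOCATION")],
]

with open("data/train.txt", "w", encoding="utf-8") as f:
    for sentence in samples:
        for token, label in sentence:
            f.write(f"{token}\t{label}\n")
        f.write("\n")  # blank line marks a sentence boundary
```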
The evaluation results are as follows:
precision recall f1-score support
JOB 0.9630 0.9765 0.9697 3280
LOCATION 0.9349 0.9100 0.9223 3344
ORGANIZATION 0.8299 0.8935 0.8605 1502
PERSON 0.9263 0.9365 0.9314 3876
TIME 0.9695 0.9820 0.9757 4177
a 0.9481 0.9315 0.9397 5726
ad 0.9854 0.9827 0.9840 1097
ag 0.8657 0.7571 0.8078 247
an 0.9636 0.9529 0.9582 361
b 0.9152 0.9254 0.9203 1890
c 0.9866 0.9894 0.9880 5586
d 0.9802 0.9800 0.9801 11321
e 1.0000 0.8824 0.9375 17
f 0.9740 0.9761 0.9750 4099
g 0.9071 0.6831 0.7793 243
j 1.0000 0.9697 0.9846 33
k 0.9500 0.9141 0.9317 291
l 0.7500 0.7059 0.7273 17
m 0.9714 0.9769 0.9741 7882
n 0.9341 0.9324 0.9332 32456
ng 0.8694 0.7985 0.8324 1459
nz 0.7771 0.7905 0.7837 11760
o 0.7647 0.8125 0.7879 16
p 0.9817 0.9899 0.9858 8612
q 0.9122 0.8972 0.9046 1401
r 0.9848 0.9884 0.9866 7076
s 0.9738 0.9788 0.9763 989
tg 0.9130 0.8802 0.8963 167
u 0.9951 0.9956 0.9954 13356
v 0.9487 0.9530 0.9508 37486
vd 0.9591 0.9505 0.9548 222
vn 0.9443 0.9366 0.9404 8285
w 0.9973 0.9829 0.9900 38957
x 0.9941 0.9956 0.9948 677
y 0.9649 0.9865 0.9756 223
z 0.9170 0.9111 0.9140 1709
micro avg 0.9529 0.9515 0.9522 219840
macro avg 0.9348 0.9196 0.9264 219840
weighted avg 0.9531 0.9515 0.9523 219840
After training finishes, use tf2onnx to convert the frozen_model into ONNX. The conversion command is as follows:
python -m tf2onnx.convert --saved-model frozen_model --output lexical.onnx --opset 13
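
As a quick sanity check after conversion, the sketch below (an assumption; onnxruntime is not mentioned on this page) loads lexical.onnx, builds zero-filled dummy inputs matching the model's declared interface, and runs one forward pass. Real inference would feed token ids produced by a BERT tokenizer built from roberta-tiny/vocab.txt.

```python
# Hedged sketch: smoke-test the exported model with onnxruntime.
# Input/output names and dtypes are read from the session itself, because
# the exact node names depend on how tf2onnx exported the graph.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("lexical.onnx")

feeds = {}
for inp in sess.get_inputs():
    dtype = np.int32 if "int" in inp.type else np.float32
    # For dynamic dims, use batch size 1 and the --max_len of 300 from training.
    shape = [d if isinstance(d, int) and d > 0 else (1 if i == 0 else 300)
             for i, d in enumerate(inp.shape)]
    feeds[inp.name] = np.zeros(shape, dtype=dtype)

outputs = sess.run(None, feeds)
for out, value in zip(sess.get_outputs(), outputs):
    print(out.name, value.shape)  # e.g. CRF-decoded tag ids or tag logits
```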
Segmentation speed on a sample sentence, measured with %timeit in IPython:
In [11]: doc = '7月1日,世预赛亚洲区12强赛抽签举行,中国队分在B组。同组对手是日本、澳大利亚、沙特、阿曼、越南。体育博主潘伟力在个人微博上表示,国足应把目标定在小组第二,第三意义不大'
In [12]: %timeit xmnlp.fast_seg(doc)
385 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: %timeit xmnlp.seg(doc)
19.5 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: %timeit xmnlp.deep_seg(doc)
20.2 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
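
For reference, a minimal usage sketch of the three segmentation entry points benchmarked above. Only fast_seg, seg, and deep_seg are confirmed by the timings on this page; the xmnlp.set_model call and its path are assumptions about typical setup (the deep models need local model files).

```python
import xmnlp

# Assumption: the ONNX-backed models must be pointed at a local model
# directory first; the path here is hypothetical.
xmnlp.set_model("/path/to/xmnlp-models")

doc = "7月1日,世预赛亚洲区12强赛抽签举行,中国队分在B组。"

print(xmnlp.fast_seg(doc))  # lightweight path, ~0.4 ms per call in the benchmark above
print(xmnlp.seg(doc))       # ~20 ms per call in the benchmark above
print(xmnlp.deep_seg(doc))  # deep (ONNX) model, ~20 ms per call in the benchmark above
```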