
Lexical Analysis Model Implementation

Sean edited this page Feb 20, 2022 · 1 revision

Model Training

The model is built with LangML. The training script is as follows:

export PYTHONUNBUFFERED=1; CUDA_VISIBLE_DEVICES=0 nohup langml-cli baseline ner bert-crf \
--epoch 20 \
--batch_size 64 \
--learning_rate 0.00002 \
--dropout_rate 0.2 \
--max_len 300 \
--lowercase \
--config_path roberta-tiny/bert_config.json \
--ckpt_path roberta-tiny/bert_model.ckpt \
--vocab_path roberta-tiny/vocab.txt \
--train_path data/train.txt \
--dev_path data/dev.txt \
--save_dir bert-crf-ner \
--verbose 2 > bert-crf.log &
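The bert-crf baseline places a CRF layer on top of BERT, and tag sequences are decoded with the Viterbi algorithm at inference time. Below is a minimal, illustrative sketch of Viterbi decoding — this is not LangML's actual implementation, and the toy emission/transition scores are hypothetical:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag path.

    emissions:   (T, K) array, per-position tag scores from the encoder.
    transitions: (K, K) array, transitions[i, j] = score of tag i -> tag j.
    Returns the best tag index sequence of length T.
    """
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag at t=0
    back = np.zeros((T, K), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        # total[i, j]: best path ending in tag i at t-1, then moving to tag j
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Trace the best path backwards from the highest-scoring final tag
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return best[::-1]

# Toy example: 3 positions, 2 tags, no transition preference
emissions = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
transitions = np.zeros((2, 2))
print(viterbi_decode(emissions, transitions))  # [0, 1, 0]
```

With zero transition scores this reduces to a per-position argmax; the CRF's learned transition matrix is what lets the decoder penalize invalid tag sequences (e.g. I- without a preceding B-).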

Evaluation results:

              precision    recall  f1-score   support

         JOB     0.9630    0.9765    0.9697      3280
    LOCATION     0.9349    0.9100    0.9223      3344
ORGANIZATION     0.8299    0.8935    0.8605      1502
      PERSON     0.9263    0.9365    0.9314      3876
        TIME     0.9695    0.9820    0.9757      4177
           a     0.9481    0.9315    0.9397      5726
          ad     0.9854    0.9827    0.9840      1097
          ag     0.8657    0.7571    0.8078       247
          an     0.9636    0.9529    0.9582       361
           b     0.9152    0.9254    0.9203      1890
           c     0.9866    0.9894    0.9880      5586
           d     0.9802    0.9800    0.9801     11321
           e     1.0000    0.8824    0.9375        17
           f     0.9740    0.9761    0.9750      4099
           g     0.9071    0.6831    0.7793       243
           j     1.0000    0.9697    0.9846        33
           k     0.9500    0.9141    0.9317       291
           l     0.7500    0.7059    0.7273        17
           m     0.9714    0.9769    0.9741      7882
           n     0.9341    0.9324    0.9332     32456
          ng     0.8694    0.7985    0.8324      1459
          nz     0.7771    0.7905    0.7837     11760
           o     0.7647    0.8125    0.7879        16
           p     0.9817    0.9899    0.9858      8612
           q     0.9122    0.8972    0.9046      1401
           r     0.9848    0.9884    0.9866      7076
           s     0.9738    0.9788    0.9763       989
          tg     0.9130    0.8802    0.8963       167
           u     0.9951    0.9956    0.9954     13356
           v     0.9487    0.9530    0.9508     37486
          vd     0.9591    0.9505    0.9548       222
          vn     0.9443    0.9366    0.9404      8285
           w     0.9973    0.9829    0.9900     38957
           x     0.9941    0.9956    0.9948       677
           y     0.9649    0.9865    0.9756       223
           z     0.9170    0.9111    0.9140      1709

   micro avg     0.9529    0.9515    0.9522    219840
   macro avg     0.9348    0.9196    0.9264    219840
weighted avg     0.9531    0.9515    0.9523    219840
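The three summary rows average differently: micro pools true/false positives across all labels, macro takes the unweighted mean of per-label scores, and weighted weights each label by its support. A small sketch of the latter two, using the JOB and ORGANIZATION rows from the table above (the helper names are illustrative):

```python
def macro_f1(f1_scores):
    """Macro average: unweighted mean of per-label F1 scores."""
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    """Weighted average: per-label F1 weighted by label support."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total

# JOB: f1=0.9697, support=3280; ORGANIZATION: f1=0.8605, support=1502
print(macro_f1([0.9697, 0.8605]))                    # simple mean
print(weighted_f1([0.9697, 0.8605], [3280, 1502]))   # pulled toward JOB
```

Note the weighted average sits closer to the larger class, which is why the full report's weighted avg (0.9523) tracks the high-support labels while macro avg (0.9264) is dragged down by rare, harder labels such as `l` and `g`.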

After training finishes, the frozen_model must be converted to ONNX with tf2onnx. The conversion command is as follows:

python -m tf2onnx.convert --saved-model frozen_model --output lexical.onnx --opset 13
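The exported model can then be served with onnxruntime. The sketch below shows a padding helper matching the `--max_len 300` setting above plus the loading boilerplate; the input tensor names and the `pad_ids` helper are assumptions — inspect `sess.get_inputs()` for the model's real input signature:

```python
import numpy as np

def pad_ids(token_ids, max_len=300, pad_id=0):
    """Pad or truncate a token-id list to max_len and add a batch dimension."""
    ids = list(token_ids)[:max_len]
    ids += [pad_id] * (max_len - len(ids))
    return np.array([ids], dtype=np.int64)  # shape (1, max_len)

# Inference (requires `pip install onnxruntime` and the converted model):
# import onnxruntime as ort
# sess = ort.InferenceSession("lexical.onnx")
# print([i.name for i in sess.get_inputs()])  # discover real input names
# feeds = {name: pad_ids(ids) for name, ids in ...}  # build from tokenizer output
# logits = sess.run(None, feeds)[0]
```

Using int64 ids matters: tf2onnx typically exports BERT input placeholders as int64 tensors, and onnxruntime rejects mismatched dtypes.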

Benchmark

In [11]: doc = '7月1日,世预赛亚洲区12强赛抽签举行,中国队分在B组。同组对手是日本、澳大利亚、沙特、阿曼、越南。体育博主潘伟力在个人微博上表示,国足应把目标定在小组第二,第三意义不大'

In [12]: %timeit xmnlp.fast_seg(doc)
385 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [13]: %timeit xmnlp.seg(doc)
19.5 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit xmnlp.deep_seg(doc)
20.2 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)