update ipynb
zhanzecheng committed Jul 13, 2018
1 parent bdc092e commit bfcb587
Showing 3 changed files with 97 additions and 15 deletions.
Binary file added img/pesudo.png
63 changes: 49 additions & 14 deletions src/stacking.ipynb
@@ -1,11 +1,28 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Ensembling && Tricks\n",
"The stacking architecture we used in the competition is shown in the figure below.\n",
"\n",
" ![img](../img/stacking.png)\n",
" \n",
"### Snapshot Ensemble\n",
" In the second-level stacking model we also added a deep-model ensembling method. [Paper](https://arxiv.org/abs/1704.00109)\n",
" \n",
"### Pseudo Labeling\n",
" Another trick we used is pseudo-labeling, which can be applied in any competition where the test set is provided. [Tutorial](https://shaoanlu.wordpress.com/2017/04/10/a-simple-pseudo-labeling-function-implementation-in-keras/)\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Import the required packages\n",
"import pickle\n",
"import glob\n",
"from config import Config\n",
@@ -28,6 +45,13 @@
"config = Config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Prepare the basic features and the out-of-fold (OOF) files"
]
},
{
"cell_type": "code",
"execution_count": 23,
@@ -129,6 +153,13 @@
"train, train_y, test = data_prepare()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Here we use only a simple DNN for model stacking"
]
},
{
"cell_type": "code",
"execution_count": 25,
@@ -168,6 +199,13 @@
"BATCH_SIZE = 64\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Prepare the stacking model"
]
},
{
"cell_type": "code",
"execution_count": 31,
@@ -366,13 +404,7 @@
"40400/40400 [==============================] - 2s 56us/step - loss: 0.6402 - acc: 0.7384 - val_loss: 0.6365 - val_acc: 0.7356\n",
"Epoch 4/16\n",
"40400/40400 [==============================] - 2s 59us/step - loss: 0.6301 - acc: 0.7417 - val_loss: 0.6359 - val_acc: 0.7354\n",
"Epoch 5/16\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 5/16\n",
"40400/40400 [==============================] - 2s 56us/step - loss: 0.6367 - acc: 0.7398 - val_loss: 0.6441 - val_acc: 0.7354\n",
"Epoch 6/16\n",
"40400/40400 [==============================] - 2s 54us/step - loss: 0.6276 - acc: 0.7420 - val_loss: 0.6395 - val_acc: 0.7353\n",
@@ -485,6 +517,15 @@
"predicts = stacking_first(train, train_y, test)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applying the pseudo-labeling method\n",
"The idea is illustrated in the figure below.\n",
"![img](../img/pesudo.png)"
]
},
{
"cell_type": "code",
"execution_count": 34,
@@ -693,13 +734,7 @@
"40429/40429 [==============================] - 2s 57us/step - loss: 0.6492 - acc: 0.7361 - val_loss: 0.6458 - val_acc: 0.7334\n",
"Epoch 3/16\n",
"40429/40429 [==============================] - 2s 59us/step - loss: 0.6371 - acc: 0.7377 - val_loss: 0.6440 - val_acc: 0.7352\n",
"Epoch 4/16\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 4/16\n",
"40429/40429 [==============================] - 3s 68us/step - loss: 0.6284 - acc: 0.7412 - val_loss: 0.6424 - val_acc: 0.7361\n",
"Epoch 5/16\n",
"40429/40429 [==============================] - 2s 55us/step - loss: 0.6356 - acc: 0.7386 - val_loss: 0.6447 - val_acc: 0.7358\n",
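Note on the pseudo-labeling cell added to `src/stacking.ipynb`: the figure `img/pesudo.png` is not rendered in this diff, so the general idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the nearest-centroid model, the helper names (`fit_centroids`, `predict_proba`, `pseudo_label`), the `threshold` parameter, and the toy data are all hypothetical stand-ins for the DNN and OOF features used in the notebook.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Toy stand-in model (assumption): one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def pseudo_label(X_train, y_train, X_test, n_classes, threshold=0.9):
    """Train, label confident test rows, then retrain on the union."""
    # Step 1: fit on the labeled training data only.
    model = fit_centroids(X_train, y_train, n_classes)
    # Step 2: predict the test set and keep confident predictions.
    proba = predict_proba(model, X_test)
    confident = proba.max(axis=1) >= threshold
    # Step 3: add those rows, with predicted labels, to the training set.
    X_aug = np.concatenate([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, proba[confident].argmax(axis=1)])
    # Step 4: retrain on the augmented data.
    return fit_centroids(X_aug, y_aug, n_classes), confident


# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_train = np.concatenate([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.concatenate([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])

model, confident = pseudo_label(X_train, y_train, X_test, n_classes=2)
test_pred = predict_proba(model, X_test).argmax(axis=1)
```

The confidence threshold is one possible design choice; the linked tutorial instead mixes pseudo-labeled test samples into each training batch, so the notebook may well differ from this sketch.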
49 changes: 48 additions & 1 deletion src/train&predict.ipynb
@@ -1,5 +1,38 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Prediction & Training\n",
"This part covers model training and prediction. The models we used fall into two categories.\n",
"\n",
"## Single models\n",
"### Deep learning models\n",
"We build the model by combining the news text with the text extracted from the images. The basic framework is shown in the figure below.\n",
"![img](../img/example.png)\n",
"\n",
"### Machine learning models\n",
"\n",
"* Input features are TFIDF+SVD, Basic Features, etc.\n",
"* The OCR output and the news text are simply concatenated\n",
"* Models: xgboost, catboost, lightGBM, DNN\n",
"\n",
"\n",
"| Model or method | F1-measure score |\n",
"| ----------- | ---------------------------------------- |\n",
"| catboost | 0.611 |\n",
"| xgboost | 0.621 |\n",
"| lightgbm | 0.625 |\n",
"| dnn | 0.621 |\n",
"| textCNN | 0.617 |\n",
"| capsule | 0.625 |\n",
"| covlstm | 0.630 |\n",
"| dpcnn | 0.626 |\n",
"| lstm+gru | 0.635 |\n",
"| lstm+gru+attention | 0.640 |\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -45,7 +78,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load the training data for the traditional models (xgboost, catboost, lightgbm, etc.)"
"#### Let's load the training data for the traditional models (xgboost, catboost, lightgbm, etc.)"
]
},
{
@@ -5289,6 +5322,13 @@
"model.train_predict(data_x[:10], data_y[:10], test[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's load the training data for the deep learning models"
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -5509,6 +5549,13 @@
"init_emb = init_embedding()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Here we import the deep learning models; see config.py for the models that are used"
]
},
{
"cell_type": "code",
"execution_count": 5,
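Note on the "TFIDF+SVD" input features described in `src/train&predict.ipynb`: the feature code itself is not part of this diff, so the following is only a rough sketch of the idea, assuming whitespace tokenisation and a dense NumPy implementation. The helper names (`tfidf_matrix`, `svd_reduce`) and the toy documents are hypothetical; the repository presumably uses a proper Chinese tokenizer and a library vectorizer instead.

```python
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Build a dense TF-IDF matrix from whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenised):
        for tok, count in Counter(toks).items():
            tf[row, index[tok]] = count / len(toks)  # term frequency
    df = (tf > 0).sum(axis=0)  # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0  # smoothed IDF
    return tf * idf, vocab


def svd_reduce(matrix, k):
    """Project rows onto the top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


# Hypothetical toy inputs: news text plus OCR text from the news image.
news = ["market rises on strong earnings", "team wins the final match", "storm hits the coast"]
ocr = ["earnings report chart", "final score board", "weather warning map"]

# Simple concatenation of the two text sources, as the notebook describes.
combined = [f"{n} {o}" for n, o in zip(news, ocr)]

X, vocab = tfidf_matrix(combined)
features = svd_reduce(X, k=2)
```

The resulting low-dimensional `features` matrix is the kind of input that could then be fed to xgboost, catboost, lightGBM, or a DNN.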
