SOTA approach to modeling fine-grained sentiment expressions in financial news articles. Detached[^3] CNN-BiLSTM regression head trained on fine-tuned DeBERTa entity embeddings[^1]. Refer to the PDF for more detail.
- Base model comparison notebook
  - BERT, RoBERTa, FinBERT, DeBERTa
- Main training/experiments notebook with DeBERTa
The experiments showed that sentiment regression performance was improved by:
- Incorporating the final hidden states of both the [CLS] token and the masked target entity token into the classification model (a minimal extraction sketch follows this list)
- Detaching the classification model from the token-level fine-tuning process
  - In other words, placing a complex architecture inside the fine-tuning process (an "attached" head[^2]) performed worse than placing the same architecture after the standard (boilerplate `transformers.BertForSequenceClassification`) pooling + dense layer
  - Intuitively, the error propagation backwards through DeBERTa during training seemed to benefit from a closer/simpler signal, resulting in better inputs for the detached CNN-BiLSTM
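For illustration, here is a minimal sketch of how the two final hidden states might be extracted and concatenated. The `microsoft/deberta-base` checkpoint, the example sentence, and the variable names are assumptions for the sketch, not the repo's actual code; in practice the checkpoint would be the fine-tuned DeBERTa produced by the training notebook.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint -- swap in the fine-tuned DeBERTa weights in practice.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")
model.eval()

# Example sentence with the target entity masked out.
enc = tokenizer("Shares of [MASK] rallied after the earnings beat.",
                return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state              # (1, seq_len, hidden)

cls_vec = hidden[:, 0, :]                                # [CLS] final hidden state
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
entity_vec = hidden[:, mask_pos, :].squeeze(1)           # masked entity token state

# Concatenated features for the downstream classification/regression head.
features = torch.cat([cls_vec, entity_vec], dim=-1)      # (1, 2 * hidden)
```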
The tradeoff between inference time in production systems and model performance is an interesting area for further research.
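For concreteness, below is a minimal sketch of what a detached CNN-BiLSTM regression head could look like, operating on precomputed final hidden states. The class name, layer sizes, and pooling choice are illustrative assumptions, not the repo's actual configuration.

```python
import torch
import torch.nn as nn

class DetachedCNNBiLSTMHead(nn.Module):
    """Secondary network trained on precomputed DeBERTa final hidden states."""

    def __init__(self, hidden_size=768, conv_channels=128, lstm_hidden=64):
        super().__init__()
        # 1D convolution over the token axis captures local n-gram patterns
        self.conv = nn.Conv1d(hidden_size, conv_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        # BiLSTM models longer-range dependencies over the conv features
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        # Single scalar output for sentiment regression
        self.regressor = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, hidden_states):          # (batch, seq_len, hidden_size)
        x = self.relu(self.conv(hidden_states.transpose(1, 2)))  # (batch, C, seq_len)
        x, _ = self.lstm(x.transpose(1, 2))    # (batch, seq_len, 2 * lstm_hidden)
        pooled = x.mean(dim=1)                 # mean-pool over tokens
        return self.regressor(pooled).squeeze(-1)

# The inputs come from an already fine-tuned DeBERTa that is not part of this
# network, so no gradient flows back into the transformer ("detached" head).
head = DetachedCNNBiLSTMHead()
dummy = torch.randn(4, 32, 768)                # (batch, seq_len, hidden)
scores = head(dummy)                           # (4,) sentiment scores
```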
[^1]: For BERT-based models, the final token-level embeddings that are output by the fine-tuned model are referred to as the "final hidden states".

[^2]: "Attached" classification/regression head -- a single network is used to simultaneously fine-tune DeBERTa and perform classification/regression. The loss from the "classification" phase directly affects "representation" (i.e. the production of fine-tuned final hidden states).

[^3]: "Detached" classification/regression head -- the production of fine-tuned final hidden states is performed using a simple primary network (pooling + dense), then a (completely separate) secondary network is utilized for classification/regression, using the output of the primary network as input.