Add support for sequence labeling #2718
This PR adds support for sequence labeling tasks: chunking, through the IOB scheme, and tagging. This was mentioned in #1675 and could be useful for the community.
The main issue is that there is no widespread agreement on how to prompt language models to perform these tasks. In an attempt to standardize this, it seems that wrapping chunks/words in <>-delimited tags is a common choice in the literature (some references at the end), and that is how the code of this PR handles sequence labeling. Basically, a dataset should be prepared accordingly, outside of lm-evaluation-harness, to contain input texts and in-text annotated outputs, like this:
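(Hypothetical illustration only: the tag inventories and examples depend on the dataset being prepared; the actual conventions are spelled out in the guide mentioned below.)

```
# chunking (e.g. NER, IOB-style chunks):
input:  John Smith works at Google in California .
output: <PER>John Smith</PER> works at <ORG>Google</ORG> in <LOC>California</LOC> .

# tagging (e.g. PoS):
input:  The dog barks .
output: <DET>The</DET> <NOUN>dog</NOUN> <VERB>barks</VERB> <PUNCT>.</PUNCT>
```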
Then the language model is prompted to write an output text given the input text, with few-shot examples to elicit the expected format. From the outputs, the IOB/tagging labels are extracted and passed to seqeval to get metrics for sequence labeling evaluation (currently just `overall_f1`); a rough sketch of this extract-and-score flow is shown below.
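To make the extraction and scoring concrete, here is a minimal sketch, not the PR's actual implementation: the `<PER>`/`<ORG>` tag convention, the `annotated_text_to_iob` helper, and the whitespace tokenization are all illustrative assumptions; the real conventions live in `docs/sequence_labeling.md`.

```python
# Illustrative sketch only: tag format and tokenization are assumptions,
# not the PR's actual extraction code.
import re

from seqeval.metrics import f1_score

TAG_RE = re.compile(r"<(?P<label>\w+)>(?P<span>.*?)</(?P=label)>")


def annotated_text_to_iob(text: str) -> list[str]:
    """Map an in-text annotated output to IOB labels, one per
    whitespace-separated token."""
    labels = []
    pos = 0
    for match in TAG_RE.finditer(text):
        # Tokens outside any <...>...</...> span are labeled O.
        labels.extend("O" for _ in text[pos:match.start()].split())
        # Tokens inside a span get B-/I- prefixes plus the tag label.
        for i, _ in enumerate(match.group("span").split()):
            labels.append(("B-" if i == 0 else "I-") + match.group("label"))
        pos = match.end()
    labels.extend("O" for _ in text[pos:].split())
    return labels


reference = "<PER>John Smith</PER> works at <ORG>Google</ORG> ."
prediction = "<PER>John Smith</PER> works at Google ."

y_true = [annotated_text_to_iob(reference)]   # [["B-PER", "I-PER", "O", "O", "B-ORG", "O"]]
y_pred = [annotated_text_to_iob(prediction)]  # [["B-PER", "I-PER", "O", "O", "O", "O"]]
print(f1_score(y_true, y_pred))  # ~0.667: the PER chunk matches, the ORG chunk is missed
```

seqeval scores at the chunk level, so a predicted entity only counts as correct when both its span and its label match the reference exactly.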
I created a guide in `docs/sequence_labeling.md` to illustrate how to prepare the datasets (moving from the IOB/tagging annotation format to in-text annotated outputs) and how to create new sequence labeling tasks; all the details are there.

References about prompting for sequence labeling:
- Wang, S., Sun, X., Li, X., Ouyang, R., Wu, F., Zhang, T., Li, J., & Wang, G. (2023). GPT-NER: Named Entity Recognition via Large Language Models.
- Naguib, M., Tannier, X., & Névéol, A. (2024). Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 6829–6852). Association for Computational Linguistics.
- Hu, Y., Chen, Q., Du, J., Peng, X., Keloth, V. K., Zuo, X., Zhou, Y., Li, Z., Jiang, X., Lu, Z., Roberts, K., & Xu, H. (2024). Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering.
- Li, M., & Zhang, R. (2024). How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain.
- Yan, F., Yu, P., & Chen, X. (2024). LTNER: Large Language Model Tagging for Named Entity Recognition with Contextualized Entity Marking. In Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XIX (pp. 399–411). Springer-Verlag.
- Laskar, M., Bari, M., Rahman, M., Bhuiyan, M., Joty, S., & Huang, J. (2023). A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 431–469). Association for Computational Linguistics.
- Machado, M., & Ruiz, E. (2024). Evaluating large language models for the tasks of PoS tagging within the Universal Dependency framework. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1 (pp. 454–460). Association for Computational Linguistics.
- Stüssi, E., & Ströbel, P. (2024). Part-of-Speech Tagging of 16th-Century Latin with GPT. In Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024) (pp. 196–206). Association for Computational Linguistics.