Add Task (Financial mmlu ko) #2699

Open: 5 commits to merge into base: main


choics2623

English:

Title: Add financial-mmlu-ko Task for Korean Financial Multiple-Choice Benchmark

Description:
This pull request adds a new task, financial-mmlu-ko, to the lm-evaluation-harness repository. The task evaluates models on a Korean financial multiple-choice dataset where the number of answer candidates varies per question, so it is implemented using the generate_until output type.
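
Because the number of candidates differs per question, the prompt has to enumerate whatever options a document carries. A minimal sketch of how such a prompt could be built is shown below. The field names `question` and `options` and the trailing `Answer:` cue are assumptions for illustration only, not the dataset's confirmed schema or the prompt used in this PR.

```python
def doc_to_text(doc):
    """Build a prompt that lists however many options the question has.

    NOTE: "question" and "options" are hypothetical field names.
    """
    lines = [doc["question"]]
    # Number the candidates from 1 so the model can answer with a digit.
    for number, option in enumerate(doc["options"], start=1):
        lines.append(f"{number}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)
```

With a free-form generation target, the same template works for three, four, or five candidates without a fixed choice list.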

Key Details:

  • Dataset Sources:
    • Korean Wikipedia Finance Category
    • Bank of Korea Economic Research Reports
    • 경제배움e - 퀴즈로 배우는 시사.경제 ("Learning Current Affairs and Economics Through Quizzes")
  • Dataset Composition:
    • 104 manually curated questions
    • 315 GPT-4 generated questions (subsequently verified by experts)
  • Implementation:
    • A custom process_results function extracts numerical answers from model outputs and compares them to the gold answers.
    • Few-shot examples can be provided as system prompts or context.
  • Integration:
    • Task integration was contributed by choics2623
    • The dataset was created by allganize on Hugging Face.
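
The answer-extraction step described under Implementation can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: it takes the first integer in the generation as the prediction, and `answer` is a hypothetical name for the gold-label field. The `process_results(doc, results)` signature returning a metric dictionary follows the harness convention for custom tasks.

```python
import re


def extract_choice(text):
    """Return the first integer found in the model output, or None."""
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else None


def process_results(doc, results):
    """Score one document: 1.0 if the extracted number equals the gold answer.

    NOTE: doc["answer"] is a hypothetical field name.
    """
    prediction = extract_choice(results[0])
    gold = int(doc["answer"])
    return {"acc": 1.0 if prediction == gold else 0.0}
```

Taking the first integer is a simple heuristic; a generation that restates the question number before answering would be scored incorrectly, so a stricter pattern may be preferable in practice.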

This task is not derived from an academic paper but serves as a practical benchmark for evaluating financial domain knowledge in Korean language models. The README has been updated with all necessary details according to the standard template.

Please review and merge this PR.

Thank you!

Usage:
lm_eval --model hf --model_args pretrained=LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct,trust_remote_code=True --tasks financial_mmlu_ko --num_fewshot=3 --device cuda:7
hf (pretrained=LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| financial_mmlu_ko | 1 | none | 3 | acc | 0.7429 | ± 0.0205 |

lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks financial_mmlu_ko --num_fewshot=3 --device cuda:7
hf (pretrained=meta-llama/Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| financial_mmlu_ko | 1 | none | 3 | acc | 0.6813 | ± 0.0219 |


CLAassistant commented Feb 14, 2025

CLA assistant check
All committers have signed the CLA.
