Strip the input for the three tasks: FDA, SWDE, and SQuAD_completion. #2690

Open
Doraemonzzz opened this issue Feb 12, 2025 · 1 comment

@Doraemonzzz

Thank you for your excellent work. I trained a 410M LLaMA model on FineWeb-EDU-10B and ran evaluations; the results are as follows.

|     Tasks      |Version|Filter|n-shot|    Metric     |   | Value  |   |Stderr |
|----------------|------:|------|-----:|---------------|---|-------:|---|-------|
|arc_challenge   |      1|none  |     0|acc            |↑  |  0.2022|±  | 0.0117|
|                |       |none  |     0|acc_norm       |↑  |  0.2483|±  | 0.0126|
|arc_easy        |      1|none  |     0|acc            |↑  |  0.5038|±  | 0.0103|
|                |       |none  |     0|acc_norm       |↑  |  0.4533|±  | 0.0102|
|boolq           |      2|none  |     0|acc            |↑  |  0.5740|±  | 0.0086|
|fda             |      0|none  |     0|contains       |↑  |  0.0281|±  |   N/A |
|hellaswag       |      1|none  |     0|acc            |↑  |  0.2837|±  | 0.0045|
|                |       |none  |     0|acc_norm       |↑  |  0.3032|±  | 0.0046|
|lambada_openai  |      1|none  |     0|acc            |↑  |  0.2148|±  | 0.0057|
|                |       |none  |     0|perplexity     |↓  |210.8356|±  |10.6725|
|openbookqa      |      1|none  |     0|acc            |↑  |  0.1700|±  | 0.0168|
|                |       |none  |     0|acc_norm       |↑  |  0.3020|±  | 0.0206|
|piqa            |      1|none  |     0|acc            |↑  |  0.6251|±  | 0.0113|
|                |       |none  |     0|acc_norm       |↑  |  0.6202|±  | 0.0113|
|social_iqa      |      0|none  |     0|acc            |↑  |  0.3460|±  | 0.0108|
|squad_completion|      0|none  |     0|contains       |↑  |  0.0000|±  |   N/A |
|swde            |      0|none  |     0|contains       |↑  |  0.1548|±  |   N/A |
|wikitext        |      2|none  |     0|bits_per_byte  |↓  |  1.0339|±  |   N/A |
|                |       |none  |     0|byte_perplexity|↓  |  2.0476|±  |   N/A |
|                |       |none  |     0|word_perplexity|↓  | 46.1707|±  |   N/A |
|winogrande      |      1|none  |     0|acc            |↑  |  0.5178|±  | 0.0140|

I noticed that the metrics for the FDA, SWDE, and SQuAD_completion tasks were abnormal, while the performance on other evaluations was normal. Upon analysis, I found that a large number of spaces were being prepended to the inputs of certain tasks. To address this, I made the following modifications:

    def doc_to_text(self, doc):
        # Strip stray leading/trailing whitespace from the prompt text.
        return doc["text"].strip()

    def doc_to_target(self, doc):
        # Strip whitespace from the target string as well.
        return doc["value"].strip()

and re-evaluated the model. The updated results are as follows:

|     Tasks      |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|------|
|fda             |      0|none  |     0|contains|↑  |0.1670|±  |   N/A|
|squad_completion|      0|none  |     0|contains|↑  |0.3040|±  |   N/A|
|swde            |      0|none  |     0|contains|↑  |0.4482|±  |   N/A|

The results now look much more normal, so I’d like to know whether we should apply `.strip()` to the inputs of all tasks.
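For illustration, here is a minimal sketch of what a blanket fix could look like, written as a dataset-level preprocessing hook that strips every string field. The `process_docs` name follows the hook that lm-evaluation-harness YAML task configs support; the helper itself is hypothetical, not code from the repo:

    import datasets


    def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
        """Strip leading/trailing whitespace from every string field of each doc."""

        def _strip_strings(doc: dict) -> dict:
            return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}

        return dataset.map(_strip_strings)

That said, applying this selectively might be safer than doing it globally: some tasks deliberately rely on whitespace (e.g., a leading space on the target so the continuation tokenizes correctly after the prompt), so a global `.strip()` could change their scores.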

@baberabb (Contributor)

Hi! This seems reasonable. Would you be interested in submitting a pull request to fix it?

@baberabb added the `validation` label on Feb 14, 2025.