Strip the input for the three tasks: FDA, SWDE, and SQuAD_completion. #2690

Open
Doraemonzzz opened this issue Feb 12, 2025 · 1 comment

@Doraemonzzz

Thank you for your excellent work. I trained a 410M LLaMA model on FineWeb-EDU-10B and ran evaluations; the results are as follows.

|     Tasks      |Version|Filter|n-shot|    Metric     |   | Value  |   |Stderr |
|----------------|------:|------|-----:|---------------|---|-------:|---|-------|
|arc_challenge   |      1|none  |     0|acc            |↑  |  0.2022|±  | 0.0117|
|                |       |none  |     0|acc_norm       |↑  |  0.2483|±  | 0.0126|
|arc_easy        |      1|none  |     0|acc            |↑  |  0.5038|±  | 0.0103|
|                |       |none  |     0|acc_norm       |↑  |  0.4533|±  | 0.0102|
|boolq           |      2|none  |     0|acc            |↑  |  0.5740|±  | 0.0086|
|fda             |      0|none  |     0|contains       |↑  |  0.0281|±  |   N/A |
|hellaswag       |      1|none  |     0|acc            |↑  |  0.2837|±  | 0.0045|
|                |       |none  |     0|acc_norm       |↑  |  0.3032|±  | 0.0046|
|lambada_openai  |      1|none  |     0|acc            |↑  |  0.2148|±  | 0.0057|
|                |       |none  |     0|perplexity     |↓  |210.8356|±  |10.6725|
|openbookqa      |      1|none  |     0|acc            |↑  |  0.1700|±  | 0.0168|
|                |       |none  |     0|acc_norm       |↑  |  0.3020|±  | 0.0206|
|piqa            |      1|none  |     0|acc            |↑  |  0.6251|±  | 0.0113|
|                |       |none  |     0|acc_norm       |↑  |  0.6202|±  | 0.0113|
|social_iqa      |      0|none  |     0|acc            |↑  |  0.3460|±  | 0.0108|
|squad_completion|      0|none  |     0|contains       |↑  |  0.0000|±  |   N/A |
|swde            |      0|none  |     0|contains       |↑  |  0.1548|±  |   N/A |
|wikitext        |      2|none  |     0|bits_per_byte  |↓  |  1.0339|±  |   N/A |
|                |       |none  |     0|byte_perplexity|↓  |  2.0476|±  |   N/A |
|                |       |none  |     0|word_perplexity|↓  | 46.1707|±  |   N/A |
|winogrande      |      1|none  |     0|acc            |↑  |  0.5178|±  | 0.0140|

I noticed that the metrics for the FDA, SWDE, and SQuAD_completion tasks were abnormal, while the performance on other evaluations was normal. Upon analysis, I found that a large number of spaces were being prepended to the inputs of certain tasks. To address this, I made the following modifications:

    def doc_to_text(self, doc):
        # Strip stray leading/trailing whitespace from the prompt text.
        return doc["text"].strip()

    def doc_to_target(self, doc):
        # Strip whitespace from the target string as well.
        return doc["value"].strip()

and re-evaluated the model. The updated results are as follows:

|     Tasks      |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|------|
|fda             |      0|none  |     0|contains|↑  |0.1670|±  |   N/A|
|squad_completion|      0|none  |     0|contains|↑  |0.3040|±  |   N/A|
|swde            |      0|none  |     0|contains|↑  |0.4482|±  |   N/A|

The results now look much more normal, so I’d like to know whether we should apply `.strip()` to the inputs of all tasks.
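For illustration, here is a minimal sketch of what a blanket fix could look like, written as a dataset-level preprocessing hook that strips every string field. The `process_docs` name follows the hook that lm-evaluation-harness YAML task configs support; the helper itself is hypothetical, not code from the repo:

    import datasets


    def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
        """Strip leading/trailing whitespace from every string field of each doc."""

        def _strip_strings(doc: dict) -> dict:
            return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}

        return dataset.map(_strip_strings)

That said, applying this selectively might be safer than doing it globally: some tasks deliberately rely on whitespace (e.g., a leading space on the target so the continuation tokenizes correctly after the prompt), so a global `.strip()` could change their scores.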

@baberabb (Contributor)

Hi! This seems reasonable. Would you be interested in submitting a pull request to fix it?

@baberabb added the `validation` label on Feb 14, 2025.