
How to preprocess a document with the assistance of a tokenizer from a specific Model #2717

Open
p1nksnow opened this issue Feb 20, 2025 · 1 comment

@p1nksnow

For example, I encountered difficulties when integrating the needle-in-a-haystack benchmark into lm-evaluation-harness. During document preprocessing, I need to first tokenize the haystack document and then insert the needle at different positions, which requires a tokenizer. However, both the process_docs and doc_to_text functions take only a single doc parameter, making it impossible to pass in the tokenizer.
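To illustrate the preprocessing step described above, here is a minimal sketch of needle insertion at a given token depth. In practice you would use the model's own tokenizer (e.g. `transformers.AutoTokenizer`); a whitespace tokenizer stands in here, and the function name and parameters are illustrative, not part of the harness API.

```python
def insert_needle(haystack: str, needle: str, depth: float,
                  tokenize=str.split, detokenize=" ".join) -> str:
    """Insert `needle` after the first `depth` fraction of haystack tokens.

    `tokenize`/`detokenize` are placeholders for a real tokenizer's
    encode/decode; `depth` ranges from 0.0 (start) to 1.0 (end).
    """
    tokens = tokenize(haystack)
    pos = int(len(tokens) * depth)
    return detokenize(tokens[:pos] + tokenize(needle) + tokens[pos:])
```

For example, `insert_needle("a b c d", "X", 0.5)` places the needle after the first half of the tokens, yielding `"a b X c d"`. The difficulty raised in this issue is that `process_docs`/`doc_to_text` offer no way to supply the real tokenizer for `tokenize`/`detokenize`.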

@baberabb
Contributor

Hi! I've been working on the ruler benchmark in #2629. You can now use `download_dataset: !function utils...` in the config, and that function will have access to the tokenizer or pretrained from model_args, as well as custom arguments passed to --metadata.
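A hedged sketch of what such a utils function might look like. The exact signature is defined by PR #2629, not reproduced here; the keyword names (`tokenizer`, `metadata`) and the doc schema below are assumptions for illustration only.

```python
# Hypothetical utils.py referenced from a task config via
# `download_dataset: !function utils.build_dataset`.
# Assumes `tokenizer` exposes encode/decode, as HF tokenizers do.

def build_dataset(tokenizer=None, **metadata):
    haystack = "The quick brown fox jumps over the lazy dog. " * 4
    needle = "The secret number is 42."
    docs = []
    for depth in (0.0, 0.5, 1.0):
        ids = tokenizer.encode(haystack)
        pos = int(len(ids) * depth)
        # Insert the needle at the chosen token depth, then decode back to text.
        text = (tokenizer.decode(ids[:pos]) + " " + needle + " "
                + tokenizer.decode(ids[pos:]))
        docs.append({"input": text.strip(), "target": "42"})
    return docs
```

The point is that, unlike `process_docs`, this hook receives the tokenizer, so token-position-dependent preprocessing becomes possible.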
