
How to preprocess a document with the assistance of a tokenizer from a specific Model #2717

Open
p1nksnow opened this issue Feb 20, 2025 · 1 comment

@p1nksnow

For example, I encountered difficulties when integrating the needle-in-a-haystack benchmark into lm-evaluation-harness. During document preprocessing, I need to first tokenize the haystack document and then insert the needle at different positions, which requires a tokenizer. However, both the process_docs and doc_to_text functions take only a single doc parameter, making it impossible to pass in the tokenizer.
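To illustrate the preprocessing step described above, here is a minimal sketch of needle insertion at a given token depth. In practice you would use the model's own tokenizer (e.g. `transformers.AutoTokenizer`); a whitespace tokenizer stands in here, and the function name and parameters are illustrative, not part of the harness API.

```python
def insert_needle(haystack: str, needle: str, depth: float,
                  tokenize=str.split, detokenize=" ".join) -> str:
    """Insert `needle` after the first `depth` fraction of haystack tokens.

    `tokenize`/`detokenize` are placeholders for a real tokenizer's
    encode/decode; `depth` ranges from 0.0 (start) to 1.0 (end).
    """
    tokens = tokenize(haystack)
    pos = int(len(tokens) * depth)
    return detokenize(tokens[:pos] + tokenize(needle) + tokens[pos:])
```

For example, `insert_needle("a b c d", "X", 0.5)` places the needle after the first half of the tokens, yielding `"a b X c d"`. The difficulty raised in this issue is that `process_docs`/`doc_to_text` offer no way to supply the real tokenizer for `tokenize`/`detokenize`.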

@baberabb
Contributor

Hi! I've been working on the ruler benchmark in #2629. You can now use `download_dataset: !function utils...` in the config, and that function will have access to the tokenizer or pretrained from model_args, as well as custom arguments passed to --metadata.
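A hedged sketch of what such a utils function might look like. The exact signature is defined by PR #2629, not reproduced here; the keyword names (`tokenizer`, `metadata`) and the doc schema below are assumptions for illustration only.

```python
# Hypothetical utils.py referenced from a task config via
# `download_dataset: !function utils.build_dataset`.
# Assumes `tokenizer` exposes encode/decode, as HF tokenizers do.

def build_dataset(tokenizer=None, **metadata):
    haystack = "The quick brown fox jumps over the lazy dog. " * 4
    needle = "The secret number is 42."
    docs = []
    for depth in (0.0, 0.5, 1.0):
        ids = tokenizer.encode(haystack)
        pos = int(len(ids) * depth)
        # Insert the needle at the chosen token depth, then decode back to text.
        text = (tokenizer.decode(ids[:pos]) + " " + needle + " "
                + tokenizer.decode(ids[pos:]))
        docs.append({"input": text.strip(), "target": "42"})
    return docs
```

The point is that, unlike `process_docs`, this hook receives the tokenizer, so token-position-dependent preprocessing becomes possible.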
