The EOS and BOS token setting when continuing pretraining of Llama3.1 #648
@init27
I understand that the EOS token is used during pretraining of the base model. However, I'm unclear about the BOS token's usage, particularly in the pretraining phase. Since it's defined as "the start of the prompt," I'm wondering whether the BOS token is used during pretraining or whether it's primarily for fine-tuning and inference, and therefore which format I should use when preparing my pretraining data.
Thank you for your time again.
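For concreteness, here is a minimal sketch of the two candidate formats the question is asking about, assuming the Hugging Face `transformers` tokenizer for `meta-llama/Llama-3.1-8B` (the model id and the explicit special-token handling below are illustrative assumptions, not something confirmed in this thread):

```python
from transformers import AutoTokenizer

# Assumption: the base (non-instruct) model's tokenizer; adjust the id/path as needed.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

doc = "This is one pretraining document."

# Option A: BOS + document tokens + EOS
ids_with_bos = (
    [tokenizer.bos_token_id]
    + tokenizer.encode(doc, add_special_tokens=False)
    + [tokenizer.eos_token_id]
)

# Option B: document tokens + EOS only (no BOS)
ids_without_bos = (
    tokenizer.encode(doc, add_special_tokens=False)
    + [tokenizer.eos_token_id]
)

print(tokenizer.convert_ids_to_tokens(ids_with_bos))
print(tokenizer.convert_ids_to_tokens(ids_without_bos))
```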
@ShomyLiu Hi, I'm wondering if you figured out whether it's the former or the latter.
Hello,
Thank you for providing these valuable recipes. I appreciate your work.
I'm interested in further pre-training the Llama3.1-8B-base model rather than using the instruct version. To ensure I prepare my data correctly, I'd like some clarification on the tokenization process:
Could you please provide information about how the data should be tokenized?
Specifically, I'm wondering whether the tokenized sequences should include the EOS and BOS tokens.
Thank you in advance for your assistance.
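As an illustration of the data-preparation question above, here is a minimal sketch of one common way to tokenize and pack documents for continued pretraining: bracket each document with BOS ... EOS, concatenate everything, and split the stream into fixed-length blocks. The model id, the per-document BOS placement, and the 8192-token block size are assumptions for illustration, not the recipe's confirmed preprocessing:

```python
from itertools import chain
from transformers import AutoTokenizer

# Assumptions: model id, per-document BOS/EOS placement, and block size are
# illustrative choices, not confirmed by the maintainers of this repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
BLOCK_SIZE = 8192

def tokenize_documents(docs):
    """Tokenize each document and bracket it as BOS + tokens + EOS."""
    per_doc = []
    for doc in docs:
        ids = tokenizer.encode(doc, add_special_tokens=False)
        per_doc.append([tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id])
    return per_doc

def pack(per_doc_ids, block_size=BLOCK_SIZE):
    """Concatenate all documents and split the token stream into fixed-length blocks."""
    stream = list(chain.from_iterable(per_doc_ids))
    n = (len(stream) // block_size) * block_size  # drop the trailing partial block
    return [stream[i : i + block_size] for i in range(0, n, block_size)]

docs = ["First pretraining document.", "Second pretraining document."]
blocks = pack(tokenize_documents(docs))
print(f"{len(blocks)} block(s) of {BLOCK_SIZE} tokens")
```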