The EOS and BOS token setting when continuing pretraining of Llama3.1 #648
@init27
I understand that the EOS token is used during pretraining of the base model. However, I'm unclear about the BOS token's usage, particularly in the pretraining phase. Since it's defined as "the start of the prompt," I'm wondering whether the BOS token is used during pretraining or whether it's primarily for fine-tuning and inference, and therefore which format I should use when preparing my pretraining data.
Thank you for your time again.
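For concreteness, here is a minimal sketch of the two candidate formats the question is asking about, assuming the Hugging Face `transformers` tokenizer for `meta-llama/Llama-3.1-8B` (the model id and the explicit special-token handling below are illustrative assumptions, not something confirmed in this thread):

```python
from transformers import AutoTokenizer

# Assumption: the base (non-instruct) model's tokenizer; adjust the id/path as needed.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

doc = "This is one pretraining document."

# Option A: BOS + document tokens + EOS
ids_with_bos = (
    [tokenizer.bos_token_id]
    + tokenizer.encode(doc, add_special_tokens=False)
    + [tokenizer.eos_token_id]
)

# Option B: document tokens + EOS only (no BOS)
ids_without_bos = (
    tokenizer.encode(doc, add_special_tokens=False)
    + [tokenizer.eos_token_id]
)

print(tokenizer.convert_ids_to_tokens(ids_with_bos))
print(tokenizer.convert_ids_to_tokens(ids_without_bos))
```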
@ShomyLiu Hi, I'm wondering if you figured out whether it's the former or the latter.
Hello,
Thank you for providing these valuable recipes. I appreciate your work.
I'm interested in further pre-training the Llama3.1-8B-base model rather than using the instruct version. To ensure I prepare my data correctly, I'd like some clarification on the tokenization process:
Could you please provide information about how the data should be tokenized?
Specifically, I'm wondering whether the tokenized sequences should include the EOS and BOS tokens.
Thank you in advance for your assistance.
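As an illustration of the data-preparation question above, here is a minimal sketch of one common way to tokenize and pack documents for continued pretraining: bracket each document with BOS ... EOS, concatenate everything, and split the stream into fixed-length blocks. The model id, the per-document BOS placement, and the 8192-token block size are assumptions for illustration, not the recipe's confirmed preprocessing:

```python
from itertools import chain
from transformers import AutoTokenizer

# Assumptions: model id, per-document BOS/EOS placement, and block size are
# illustrative choices, not confirmed by the maintainers of this repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
BLOCK_SIZE = 8192

def tokenize_documents(docs):
    """Tokenize each document and bracket it as BOS + tokens + EOS."""
    per_doc = []
    for doc in docs:
        ids = tokenizer.encode(doc, add_special_tokens=False)
        per_doc.append([tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id])
    return per_doc

def pack(per_doc_ids, block_size=BLOCK_SIZE):
    """Concatenate all documents and split the token stream into fixed-length blocks."""
    stream = list(chain.from_iterable(per_doc_ids))
    n = (len(stream) // block_size) * block_size  # drop the trailing partial block
    return [stream[i : i + block_size] for i in range(0, n, block_size)]

docs = ["First pretraining document.", "Second pretraining document."]
blocks = pack(tokenize_documents(docs))
print(f"{len(blocks)} block(s) of {BLOCK_SIZE} tokens")
```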