This project generates a dataset of persona-based conversations with hidden information (needles) that need to be discovered through dialogue. The dataset is based on the google/Synthetic-Persona-Chat dataset.
The project consists of two main components:
- needle_1hop.py: Generates one-hop questions
- needle_2hop.py: Generates two-hop questions
Both scripts use the google/Synthetic-Persona-Chat dataset as a foundation and employ GPT-4 to generate and modify conversations.
needle_1hop.py generates one-hop questions by inserting a "needle" of information into one persona and creating a conversation in which the other participant must discover this information.
Process:
- Loads personas and conversations from the Synthetic-Persona-Chat dataset.
- Selects a random piece of information (needle) from a persona.
- Inserts this needle into Person A's persona.
- Generates a conversation between Person A and Person B, where B tries to discover the needle information from A.
- Creates a task prompt (question) related to the needle information.
- Saves the generated sample, including the modified conversation, task prompt, and answer.
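The one-hop steps above can be sketched roughly as follows. This is a minimal illustration, not the code in needle_1hop.py: the function names, the record keys, the fixed task prompt, and the choice to draw the needle from a different persona than Person A's are all assumptions, and the GPT-4 call is replaced by a stub so the example runs offline.

```python
import json
import random


def build_one_hop_sample(personas, rng, generate_conversation):
    """Assemble one sample from a list of personas (each a list of fact strings)."""
    # Assumption: the needle is drawn from a persona other than Person A's.
    donor_idx, a_idx = rng.sample(range(len(personas)), 2)
    needle = rng.choice(personas[donor_idx])
    # Insert the needle into Person A's persona without mutating the input.
    persona_a = personas[a_idx] + [needle]
    conversation = generate_conversation(persona_a, needle)
    return {
        "persona_a": persona_a,
        "conversation": conversation,
        # Placeholder question; the real script creates a needle-specific prompt.
        "task_prompt": "What did Person B find out about Person A?",
        "answer": needle,
    }


def stub_generate_conversation(persona_a, needle):
    # Hypothetical stand-in for the model call: a fixed two-turn exchange.
    return [
        "Person B: Tell me something about yourself.",
        f"Person A: {needle}",
    ]


personas = [
    ["I like hiking.", "I have two cats."],
    ["I work as a nurse.", "I play the cello."],
]
sample = build_one_hop_sample(personas, random.Random(0), stub_generate_conversation)
print(json.dumps(sample, indent=2))
```

Passing the generator in as a function keeps the sampling logic testable without network access; a real run would swap the stub for a call to the model.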
needle_2hop.py generates two-hop questions by inserting a common "needle" of information into two personas and creating separate conversations that reveal different aspects of this shared information.
Process:
- Loads personas and conversations from the Synthetic-Persona-Chat dataset.
- Selects random personas and conversations for four participants: Alice, Bob, Charlie, and Dave.
- Generates a common persona (needle) that is added to both Alice's and Dave's personas, either in the same way or differently.
- Modifies the conversation between Alice and Bob to include Alice's part of the new persona.
- Modifies the conversation between Charlie and Dave to include Dave's part of the new persona.
- Generates a new conversation between Bob and Charlie.
- Creates a task prompt (question) related to the common persona.
- Saves the generated sample, including all modified conversations, task prompt, and answer.
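The two-hop steps above can be outlined in the same spirit. Again this is only a sketch under stated assumptions: the row keys (`persona_1`, `persona_2`, `conversation`), the output layout, and the `rewrite` and `generate` helpers are hypothetical, and both helpers are stubs standing in for the GPT-4 calls.

```python
import random


def build_two_hop_sample(rows, needle_alice, needle_dave, rng, rewrite, generate):
    """rows: list of dicts with hypothetical keys persona_1, persona_2, conversation."""
    # One source row supplies Alice and Bob, another supplies Charlie and Dave.
    row_ab, row_cd = rng.sample(rows, 2)
    alice = row_ab["persona_1"] + [needle_alice]
    bob = list(row_ab["persona_2"])
    charlie = list(row_cd["persona_1"])
    dave = row_cd["persona_2"] + [needle_dave]
    return {
        "personas": {"Alice": alice, "Bob": bob, "Charlie": charlie, "Dave": dave},
        "conversations": {
            # Existing conversations are modified to reveal each part of the needle.
            "Alice-Bob": rewrite(row_ab["conversation"], "Alice", needle_alice),
            "Charlie-Dave": rewrite(row_cd["conversation"], "Dave", needle_dave),
            # The linking conversation is generated from scratch.
            "Bob-Charlie": generate(bob, charlie),
        },
    }


def stub_rewrite(conversation, speaker, needle):
    # Stand-in for the model edit: append one line that states the needle.
    return conversation + [f"{speaker}: {needle}"]


def stub_generate(persona_x, persona_y):
    # Stand-in for the model call: a fixed exchange between Bob and Charlie.
    return ["Bob: Hi Charlie.", "Charlie: Hi Bob."]


rows = [
    {"persona_1": ["I like tea."], "persona_2": ["I run daily."], "conversation": ["Hi!"]},
    {"persona_1": ["I paint."], "persona_2": ["I own a boat."], "conversation": ["Hello!"]},
]
sample = build_two_hop_sample(
    rows, "I grew up in Lyon.", "I went to school in Lyon.",
    random.Random(0), stub_rewrite, stub_generate,
)
print(sorted(sample["conversations"]))
```

Passing two needle strings covers both cases in the list: the same statement for Alice and Dave, or two different statements that share a common fact.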
To generate the dataset, run the following commands:
python needle_1hop.py
python needle_2hop.py
The dataset is saved in JSONL format and can be used for evaluating iAgents.
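Since the output is JSONL (one JSON object per line), it can be loaded with the standard library alone. The snippet below is a small round-trip sketch; the file name and the record keys are placeholders, not the names the scripts actually write.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    # One JSON object per line, UTF-8 encoded.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Round trip through a temporary file with a placeholder record.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.jsonl")
    write_jsonl(path, [{"task_prompt": "example question", "answer": "example answer"}])
    loaded = read_jsonl(path)

print(len(loaded), "sample(s) loaded")
```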