iAgents Needle in the Persona Dataset Pipeline

This project generates a dataset of persona-based conversations with hidden information (needles) that need to be discovered through dialogue. The dataset is based on the google/Synthetic-Persona-Chat dataset.

Overview

The project consists of two main components:

needle_1hop.py: Generates one-hop questions
needle_2hop.py: Generates two-hop questions

Both scripts use the google/Synthetic-Persona-Chat dataset as a foundation and employ GPT-4 to generate and modify conversations.

needle_1hop.py

This script generates one-hop questions by inserting a "needle" of information into one persona and creating a conversation where the other participant must discover this information.

Process:

Loads personas and conversations from the Synthetic-Persona-Chat dataset.
Selects a random piece of information (needle) from a persona.
Inserts this needle into Person A's persona.
Generates a conversation between Person A and Person B, where B tries to discover the needle information from A.
Creates a task prompt (question) related to the needle information.
Saves the generated sample, including the modified conversation, task prompt, and answer.

needle_2hop.py

This script generates two-hop questions by inserting a common "needle" of information into two personas and creating separate conversations that reveal different aspects of this shared information.

Process:

Loads personas and conversations from the Synthetic-Persona-Chat dataset.
Selects random personas and conversations for four participants: Alice, Bob, Charlie, and Dave.
Generates a common persona (needle) that is added to both Alice and Dave's personas, either in the same way or differently.
Modifies the conversation between Alice and Bob to include Alice's part of the new persona.
Modifies the conversation between Charlie and Dave to include Dave's part of the new persona.
Generates a new conversation between Bob and Charlie.
Creates a task prompt (question) related to the common persona.
Saves the generated sample, including all modified conversations, task prompt, and answer.

Usage

To generate the dataset, run the following commands:

python needle_1hop.py
python needle_2hop.py

Dataset

The dataset is saved in JSONL format and can be used for evaluating iAgents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

iAgents Needle in the Persona Dataset Pipeline

Overview

needle_1hop.py

needle_2hop.py

Usage

Dataset

Files

README.md

Latest commit

History

README.md

File metadata and controls

iAgents Needle in the Persona Dataset Pipeline

Overview

needle_1hop.py

needle_2hop.py

Usage

Dataset