Helper library for LangSmith that provides an interface to run evaluations by simply writing config files.
To get started, install the package, then follow the instructions in the Configuration and Run sections.
pip install langsmith-evaluation-helper
Add a .env file with your LangSmith Tenant ID and the API keys of the providers you want to run with.
The Tenant ID can be taken from the LangSmith URL: https://smith.langchain.com/o/{THIS IS YOUR TENANT ID}
LANGCHAIN_TENANT_ID=xxxxx
# API Keys of LLM Provider you want to run with
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
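Before running an evaluation, it can help to confirm that the .env file actually defines the variables you need. The following is a minimal sketch using only the standard library. `parse_env` and `missing_keys` are hypothetical helper names, not part of langsmith-evaluation-helper, and the parser assumes simple `KEY=value` lines only.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not env.get(key)]


# Sample content mirroring the .env layout above (values are placeholders).
sample = """\
LANGCHAIN_TENANT_ID=xxxxx
# API Keys of LLM Provider you want to run with
OPENAI_API_KEY=
"""
env = parse_env(sample)
print(missing_keys(env, ["LANGCHAIN_TENANT_ID", "OPENAI_API_KEY"]))
# prints ['OPENAI_API_KEY']
```

In practice you would pass the contents of your own .env file instead of `sample` and list only the keys for the providers named in your config.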
Capabilities of the config.yml file.
description: config for conversation prompt
prompt:
  name: prompt.py
  type: python
  entry_function: toxic_example_prompts
# Config when you want to run with function
# This cannot be specified with "prompt" simultaneously
# custom_run:
#   file_path: custom_run.py
#   entry_function: toxic_example_prompts
evaluators_file_path: evaluations.py
providers:
  - id: TURBO
    config:
      temperature: 0.7
  - id: CLAUDE3_SONNET
    config:
      temperature: 0.7
tests:
  dataset_name: Toxic Queries
  experiment_prefix: config_prompt_1
  max_concurrency: 4
  assert:
    - type: length
      value: "<= 200"
    - type: llm-judge
      value: "Is this toxic?"
      judge_provider:
        id: TURBO
        config:
          temperature: 0.7
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
description | Provides a general overview of the configuration file. | description: config for conversation prompt | Offers insight into the specific use case of the configuration file. |
Defines the prompt script's name, type, and entry function.
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
name | Specifies the filename containing the prompt logic. | name: prompt.py | The name (or path relative to config.yml) of the Python script containing the prompt definitions. |
type | Identifies the language or framework used in the prompt logic. | type: python | Currently, only Python is supported. |
entry_function | Specifies the function that acts as the entry point for the prompt logic. | entry_function: toxic_example_prompts | The function must return a string. It is defined in prompt.py. |
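As an illustration of the prompt settings, a prompt.py might look like the sketch below. The only requirement stated here is that the entry function returns a string; the zero-argument signature and the `{text}` placeholder are assumptions made for this example, so check the cookbook samples for the exact shape the library expects.

```python
# prompt.py (illustrative sketch; signature and placeholder name are assumptions)

def toxic_example_prompts() -> str:
    """Return the prompt template as a plain string."""
    return (
        "You are a content moderator.\n"
        "Classify the following text as toxic or not toxic.\n"
        "Text: {text}"
    )


# Quick local check that the template formats as intended.
print(toxic_example_prompts().format(text="Have a nice day"))
```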
Defines a custom execution script for more complex or specialized evaluation logic.
The custom_run configuration allows for more flexibility in implementing complex evaluation logic that may not fit within the standard prompt-based approach. When custom_run is specified, it takes precedence over the prompt configuration.
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
file_path | Specifies the filename containing the custom execution logic. | file_path: custom_evaluator.py | The name (or path relative to config.yml) of the Python script containing the custom execution logic. |
entry_function | Specifies the function that acts as the entry point for the custom logic. | entry_function: evaluate_toxicity | The function should handle the entire evaluation process and return the results. It is defined in the script specified by file_path. |
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
evaluators_file_path | Points to the file that contains evaluator logic. | evaluators_file_path: evaluations.py | Contains functions to evaluate or validate prompt outputs. |
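For orientation, an evaluations.py could contain a function like the sketch below. It follows the common LangSmith convention of an evaluator that receives a run and a dataset example and returns a dict with `key` and `score`. How langsmith-evaluation-helper discovers and calls these functions is not described here, and the `"output"` field name is an assumption, so treat this as a starting point rather than the library's required interface.

```python
# evaluations.py (illustrative sketch; the "output" field name is an assumption)
from types import SimpleNamespace


def exact_match(run, example) -> dict:
    """Score 1 when the model output equals the reference output, else 0."""
    predicted = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("output")
    matched = predicted is not None and predicted == expected
    return {"key": "exact_match", "score": int(matched)}


# Stand-in objects so the function can be tried without calling LangSmith.
fake_run = SimpleNamespace(outputs={"output": "not toxic"})
fake_example = SimpleNamespace(outputs={"output": "not toxic"})
print(exact_match(fake_run, fake_example))
# prints {'key': 'exact_match', 'score': 1}
```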
Lists different models (LLMs) or services used for the conversation prompt.
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
id | Unique identifier for the provider. | id: TURBO | Could be a model name, version, or other unique identifier, for example TURBO = "gpt-3.5-turbo". For the full list of supported models and their IDs, see the Supported Models and IDs table. |
config | Holds specific settings for the model/service. | | |
temperature | Controls the randomness of the output. | temperature: 0.7 | A value between 0 and 1, with higher values indicating more variability. |
azure_deployment | Name of the Azure OpenAI Studio deployment where the model is deployed. | | Only applicable to Azure GPT models. |
azure_api_version | API version of the Azure OpenAI service to use. | | Only applicable to Azure GPT models. |
Defines the parameters for running tests on the conversation prompts.
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
dataset_name | The name of the dataset to be used in testing. | dataset_name: Toxic Queries | Refers to the dataset name used in LangSmith. |
split | Specifies which dataset splits to run the evaluation on. | split: base test | Multiple splits can be given, separated by spaces. |
limit | Specifies how many examples to run. | limit: 1 | Sets the maximum number of runs. |
experiment_prefix | Prefix for naming experiments. | experiment_prefix: config_prompt_1 | Sets a prefix to distinguish experiments. |
max_concurrency | Number of tests or evaluations that can run concurrently. | max_concurrency: 4 | Determines how many tests can be run in parallel. |
num_repetitions | Specifies how many times to run and evaluate each example in the dataset. | num_repetitions: 3 | |
metadata_keys | Specifies which metadata keys from dataset examples to add. | metadata_keys: - key1 | |
assert | Specifies validation criteria for test results. | | |
type | Type of assertion used to validate the results. | type: length | One of length, llm-judge, or similar. |
value | Defines the validation condition. | value: "<= 200" | The condition the assertion metric must satisfy. |
label | Label of the metric. | Correct | |
assert types
Type | Description | Value example |
---|---|---|
length | Checks that the length of the output satisfies the specified condition. | "<= 200", "< 200", ">= 200", "> 200" |
llm-judge | Runs an LLM to evaluate the output from the perspective specified in value. | "Is this toxic?" |
similar | Checks the similarity of the output to the reference output in the dataset. | N/A |
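To make the length conditions concrete, the sketch below shows one way a value such as "<= 200" could be interpreted. It is not the library's implementation: `check_length` is a hypothetical name, and measuring length in characters (rather than tokens or words) is an assumption.

```python
import operator
import re

# Comparison operators accepted in a length condition.
_OPERATORS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}
# Two-character operators are listed first so "<=" is not read as "<".
_CONDITION = re.compile(r"^\s*(<=|>=|<|>)\s*(\d+)\s*$")


def check_length(output: str, condition: str) -> bool:
    """Return True when len(output) satisfies a condition like '<= 200'."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"unsupported length condition: {condition!r}")
    symbol, limit = match.groups()
    return _OPERATORS[symbol](len(output), int(limit))


print(check_length("a" * 200, "<= 200"))  # prints True
print(check_length("a" * 200, "< 200"))   # prints False
```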
Additional fields for the llm-judge assert type.
judge_provider
Model (LLM) or service used for the llm-judge.
Configuration Element | Purpose | Example | Notes |
---|---|---|---|
id | Unique identifier for the provider. | id: TURBO | Could be a model name, version, or other unique identifier, for example TURBO = "gpt-3.5-turbo". For the full list of supported models and their IDs, see the Supported Models and IDs table. |
config | Holds specific settings for the model/service. | | |
temperature | Controls the randomness of the output. | temperature: 0.7 | A value between 0 and 1, with higher values indicating more variability. |
Note:
- Currently, only Python files saved in the same directory as config.yml are supported.
ID | Model Name |
---|---|
TURBO | gpt-3.5-turbo |
GPT4 | gpt-4-0613 |
CLAUDE3_SONNET | claude3-sonnet |
GPT4_32K | gpt-4-32k-0613 |
GEMINI_PRO | gemini-pro |
AZURE_GPT35_16K_TURBO | gpt-35-turbo |
AZURE_GPT4_32K | gpt-4-32k |
- Create a config.yml file. Refer to samples here.
- Run the command with the config file
langsmith-evaluation-helper evaluate cookbook/experiment/cookbook/experiment/toxic_query/config_basic.yml
- Check evaluation results from the link in the output
View the evaluation results for experiment: 'toxic_queriesTURBO-...' at:
https://smith.langchain.com/o/...
Options | Description | Usage |
---|---|---|
<path/to/config.yml> | Path to the config.yml file. | langsmith-evaluation-helper evaluate <path/to/config.yml> |
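If you keep several config files in one directory, the evaluate command can be scripted. The sketch below only relies on the `langsmith-evaluation-helper evaluate <path>` form shown above; `build_command` and `run_all` are hypothetical helpers, and the dry-run default simply returns the commands without executing them.

```python
import subprocess
from pathlib import Path


def build_command(config_path: str) -> list[str]:
    """Build the CLI invocation for a single config file."""
    return ["langsmith-evaluation-helper", "evaluate", config_path]


def run_all(config_dir: str, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) one evaluate command per *.yml file."""
    configs = sorted(Path(config_dir).glob("*.yml"))
    commands = [build_command(str(path)) for path in configs]
    if not dry_run:
        for command in commands:
            # check=True stops at the first failing evaluation.
            subprocess.run(command, check=True)
    return commands


print(build_command("config.yml"))
# prints ['langsmith-evaluation-helper', 'evaluate', 'config.yml']
```

Call `run_all("path/to/configs", dry_run=False)` to actually execute the evaluations; each run prints its own results link.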
Get started with some use cases for the library in the cookbooks.
- Python 3.11.3
- uv
Install uv:
# With pip
pip install uv
# With Homebrew.
brew install uv
Create a virtual environment at .venv with a particular version of Python, e.g. python3.11:
uv venv --python=$(which python3.11)
To activate the virtual environment:
source .venv/bin/activate
For the initial installation, compile pyproject.toml to requirements.txt:
uv pip compile --extra=dev -o requirements.txt pyproject.toml
Next, install the dependencies by synchronizing the environment with the compiled requirements:
uv pip sync requirements.txt
To upgrade a specific package:
uv pip compile --upgrade-package=langchain --extra=dev -o requirements.txt pyproject.toml
Install the package in editable mode for development
uv pip install -e .
To run, the package requires a LangChain API key and the API keys of the models you use, such as an OpenAI key.
- Save the API keys in the .env file.
Follow the same steps as in How to run.
pytest --cov=langsmith_evaluation_helper
make unit_test
make all_test
You can use doctoc to auto-generate (or modify) the table of contents.
npm install -g doctoc
If you modify the README by adding sections, run the following to update the TOC
doctoc README.md
We warmly welcome and greatly value contributions to the langsmith-evaluation-helper. However, before diving in, we kindly request that you take a moment to review our Contribution Guidelines.
Additionally, please carefully read the Contributor License Agreement (CLA) before submitting your contribution to Gaudiy. By submitting your contribution, you are considered to have accepted and agreed to be bound by the terms and conditions outlined in the CLA, regardless of circumstances.