This is the official repo for the paper ClinDiag: Grounding Large Language Model in Clinical Diagnostics.
- Demo website: https://clindiag.streamlit.app/
When using pip, it is generally recommended to install packages in a virtual environment to avoid modifying the system state. We use conda as an example here.
Create and activate the environment, then install the dependencies:
$ conda create -n clindiag python==3.11.1
$ conda activate clindiag
(clindiag) pip install -r requirements.txt
To deactivate the environment later, run:
(clindiag) conda deactivate
Before running a script, open configs/OAI_Config_List.json and fill in your model and API key:
{
    "model": "gpt-4o-mini",
    "api_key": "[YOUR_API_KEY]",
    "base_url": "[YOUR_BASE_URL]",
    "tags": [
        "x_gpt4omini"
    ]
}
The tags are used to filter the selected model(s) for each stage; see parse_args() for details.
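For reference, here is a minimal sketch of how such a tag filter can be applied when loading the config list. It assumes the AutoGen-style config_list_from_json helper; the repo's actual filtering logic lives in each script's parse_args().

import autogen

# Load model configs and keep only entries tagged "x_gpt4omini".
# (Illustrative sketch, not the repo's exact code.)
config_list = autogen.config_list_from_json(
    "configs/OAI_Config_List.json",
    filter_dict={"tags": ["x_gpt4omini"]},
)
print(config_list)  # e.g. [{"model": "gpt-4o-mini", ...}]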
This script implements a human-LLM collaboration framework in which the LLM serves as an assistant that answers the physician's questions.
(clindiag) python code/test_human_llm.py --data_dir benchmark_dataset
This simulates the human-alone scenario, where a physician performs the clinical diagnostic procedure on their own within the ClinDiag framework.
(clindiag) python code/test_human_alone.py --data_dir benchmark_dataset
The following scripts were used for the ablation studies. We examined the effects on diagnostic performance of (1) multi-doctor collaboration, (2) introducing a critic agent, and (3) prompt engineering.
We tested the effect of having 2–3 doctor agents collaborate in the clinical decision-making process.
(clindiag) python code/trial_stepwise_multiagent_converse.py --data_dir benchmark_dataset --num_specialists 2
--num_specialists: number of doctor agents; defaults to 3
This framework incorporates a critic agent that suggests revisions to the doctor agent's questions.
(clindiag) python code/trial_stepwise_nochain_critic.py --data_dir benchmark_dataset --model_name_critic x_gpt4omini
--model_name_critic: model used for the critic agent; defaults to gpt-4o-mini
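As a rough illustration, the CLI flags above might be defined in parse_args() along these lines (a hedged sketch; the actual defaults and help strings are in each script):

import argparse

def parse_args():
    # Sketch of the command-line surface described above, not the repo's exact code.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default="benchmark_dataset",
                        help="directory containing the benchmark cases")
    parser.add_argument("--num_specialists", type=int, default=3,
                        help="number of doctor agents")
    parser.add_argument("--model_name_critic", type=str, default="gpt-4o-mini",
                        help="model used for the critic agent")
    return parser.parse_args()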
This script adopts expert-generated prompts.
(clindiag) python code/trial_stepwise_nochain_expert_prompt.py --data_dir benchmark_dataset
./benchmark_dataset.zip
(To uncompress, run unzip benchmark_dataset.zip in the root directory.)
A comprehensive clinical dataset of 4,421 real-world cases, covering both rare and common diseases across 32 specialties.
./human_examiner_scripts/
A set of 35 patient scripts sourced from the hospital’s Objective Structured Clinical Examination (OSCE) test dataset for standardized patient training.
./finetune_data.zip
(To uncompress, run unzip finetune_data.zip in the root directory.)
The multi-turn chat dataset used for fine-tuning a chat model. Each conversation example was constructed from a quality-checked real-world case and structured to adhere to standard clinical diagnostic practice. The data is available in both jsonl and json formats.
finetune_data_messages.jsonl:
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
finetune_data_conversations.json:
{
    "conversations": [
        [
            {"from": "system", "value": "..."},
            {"from": "user", "value": "..."},
            {"from": "assistant", "value": "..."}
        ],
        [
            {"from": "system", "value": "..."},
            {"from": "user", "value": "..."},
            {"from": "assistant", "value": "..."}
        ]
    ]
}
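If your fine-tuning stack expects the messages-style jsonl instead, the json format can be converted in a few lines (a hedged sketch assuming "from"/"value" map directly to "role"/"content"; the output filename is hypothetical):

import json

with open("finetune_data_conversations.json", encoding="utf-8") as f:
    data = json.load(f)

# Write one {"messages": [...]} object per line, mirroring finetune_data_messages.jsonl.
with open("finetune_data_messages_converted.jsonl", "w", encoding="utf-8") as out:
    for conversation in data["conversations"]:
        messages = [{"role": m["from"], "content": m["value"]} for m in conversation]
        out.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")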