cns-iu/onto-llm-mapping

Mapping ontologies using LLMs and RAG

⚠️ RESEARCH IN PROGRESS ⚠️

TODO

  • Write validation code to evaluate each mapping generated against a gold standard
  • Tweak models used and prompts to improve results
  • Write code to choose and finalize a mapping for publication, which may be finalized by hand by a subject matter expert
  • Generalize the code and workflow further for mapping between any two sets of concepts

Using onto-llm-mapping

Requirements

To run the workflows, you will need the following installed:

  1. A Unix-like environment (Linux, WSL2 / Ubuntu for Windows, or macOS (untested))
  2. Python 3.10+ with the venv module installed

A virtual environment will be installed with the following core applications:

Setup

These prerequisites are installed into a Python virtual environment (in the .venv directory by default). To set up the environment, run:

./scripts/00-setup-environment.sh

To install the models, run:

./scripts/05-setup-models.sh

Running Workflows

After setup, you can run an individual workflow by executing its shell script in the workflows directory. For example:

./workflows/desc-vec.sh

Publishing Results

To publish the final results, run this script:

./scripts/80-publish-results.sh

The data is then compiled to output-data/$DATASET/$VERSION.

End-to-End Workflow

To run the workflow end to end, starting from nothing (potentially not even a virtual environment) and finishing with all data built and published, run this command:

./logged-run.sh

By default, no extra workflows are run. To choose which workflows run during the end-to-end workflow, set the WORKFLOWS environment variable, either in your environment or inline at run time like this:

WORKFLOWS="desc-vec llama3.2-1b" ./logged-run.sh

Each workflow is named after its shell script in the workflows directory, without the .sh extension.
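The naming rule can be illustrated with a small Python sketch. It is not taken from logged-run.sh, whose actual parsing is not shown here; the helper name `resolve_workflows` and the assumption that names are separated by whitespace are illustrative only.

```python
import os
from pathlib import Path


def resolve_workflows(value: str, workflows_dir: str = "workflows") -> list[Path]:
    """Map whitespace-separated workflow names to their shell script paths.

    Hypothetical helper: it only mirrors the naming rule described above
    (script name minus the .sh extension).
    """
    return [Path(workflows_dir) / f"{name}.sh" for name in value.split()]


if __name__ == "__main__":
    # An unset or empty WORKFLOWS yields no extra workflows.
    for script in resolve_workflows(os.environ.get("WORKFLOWS", "")):
        print(script)
```

Under these assumptions, `WORKFLOWS="desc-vec llama3.2-1b"` resolves to the two scripts desc-vec.sh and llama3.2-1b.sh in the workflows directory.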

Previous results (Sept 26, 2024)

There are 6 different SSSOM mappings to evaluate in the mappings folder:

Note: The LLM-expanded descriptions are in the data folder (uberon and mesh).

Method

  1. For each Uberon and MeSH term of interest, use an LLM to expand the name, synonyms, and descriptions to a common length and quality, producing an expanded description.
  2. Store the expanded descriptions in a vector database.
  3. For each term in the Uberon ontology, compare its expanded description to the expanded descriptions in the MeSH ontology's vector database and retrieve the most similar terms.
  4. Ask the LLM to take that same term's expanded description and rank the retrieved similar terms from the MeSH ontology (currently it ranks the top 3).
  5. Output the results to a .csv file for evaluation with a subject matter expert (Ellen).
  6. Generate an SSSOM file.
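The retrieval part of this method (steps 2 and 3) can be sketched in Python as follows. This is a minimal illustration, not the project's code: a bag-of-words count stands in for the real embedding model, a plain dictionary stands in for the vector database, and the term IDs and descriptions are made-up placeholders. The LLM expansion and ranking steps are not shown.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, candidates: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k candidate IDs whose descriptions are most similar to the query."""
    query_vec = embed(query)
    scored = [(term_id, cosine(query_vec, embed(text))) for term_id, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical expanded descriptions; IDs and wording are placeholders.
target_terms = {
    "target:1": "the heart is a muscular organ that pumps blood",
    "target:2": "the liver is an organ that filters blood and produces bile",
    "target:3": "the femur is the long bone of the thigh",
}

if __name__ == "__main__":
    source_description = "muscular organ that pumps blood through the body"
    for term_id, score in top_k(source_description, target_terms, k=2):
        print(term_id, round(score, 3))
```

The candidates returned by `top_k` are what an LLM would then be asked to rank in step 4.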

Prerequisites

Commands used at the time

# Setup templates
cp templates/*.yaml `llm templates path`

# Extract terms and descriptive metadata
sparql-select.sh http://localhost:8080/blazegraph/namespace/kb/sparql queries/mesh-terms.rq > data/mesh-terms.csv
sparql-select.sh http://localhost:8080/blazegraph/namespace/kb/sparql queries/uberon-terms.rq > data/uberon-terms.csv

# WITHOUT EXPANDED CONTENT

## Convert metadata to a content format for LLM vector database
node ./src/create-descriptions.js data/mesh-terms.csv data/mesh-terms.description.csv
node ./src/create-descriptions.js data/uberon-terms.csv data/uberon-terms.description.csv

## Find similar terms using vectorized versions of original content
node ./src/find-similar.js data/uberon-terms.description.csv data/mesh-terms.description.csv data/uberon-terms.description.mesh-scores.csv

## Use an LLM to rank the similar terms from the original content comparison
node ./src/rank-similar.js data/uberon-terms.csv data/mesh-terms.csv data/uberon-terms.description.csv data/mesh-terms.description.csv data/uberon-terms.description.mesh-scores.csv data/uberon-terms.description.mesh-ranked-scores.csv


# WITH EXPANDED CONTENT

## Use an LLM to expand the description to a consistent length and format
node src/expand-descriptions.js data/mesh-terms.csv data/mesh-terms.content.csv
node src/expand-descriptions.js data/uberon-terms.csv data/uberon-terms.content.csv

## Find similar terms using vectorized versions of expanded content
node src/find-similar.js data/uberon-terms.content.csv data/mesh-terms.content.csv data/uberon-terms.mesh-scores.csv

## Use an LLM to rank the similar terms from expanded content comparison
node ./src/rank-similar.js data/uberon-terms.csv data/mesh-terms.csv data/uberon-terms.content.csv data/mesh-terms.content.csv data/uberon-terms.mesh-scores.csv data/uberon-terms.mesh-ranked-scores.csv

### Convert to SSSOM format
duckdb :memory: -no-stdin -init queries/uberon-mesh-mapping.desc-vec.sql
duckdb :memory: -no-stdin -init queries/uberon-mesh-mapping.llm-rank.sql
duckdb :memory: -no-stdin -init queries/uberon-mesh-mapping.llm-vec.sql
duckdb :memory: -no-stdin -init queries/uberon-mesh-mapping.ubkg.sql

### Validate SSSOM CSVs
sssom validate mappings/uberon-mesh-mapping.desc-vec.sssom.csv
sssom validate mappings/uberon-mesh-mapping.llm-rank.sssom.csv
sssom validate mappings/uberon-mesh-mapping.llm-vec.sssom.csv
sssom validate mappings/uberon-mesh-mapping.ubkg.sssom.csv
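For orientation, the shape of an SSSOM mapping row can be sketched in Python. This is not what the DuckDB queries above do internally; it only shows the core SSSOM columns (subject_id, predicate_id, object_id, mapping_justification, confidence). The predicate and justification values, the helper names, and the placeholder IDs are assumptions chosen for illustration.

```python
import csv
import io

SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification", "confidence"]


def to_sssom_rows(scored_pairs):
    """Turn (source_id, target_id, score) tuples into SSSOM-style row dicts."""
    return [
        {
            "subject_id": source_id,
            "predicate_id": "skos:closeMatch",  # assumed predicate
            "object_id": target_id,
            # Assumed justification for a similarity-based match.
            "mapping_justification": "semapv:SemanticSimilarityThresholdMatching",
            "confidence": f"{score:.3f}",
        }
        for source_id, target_id, score in scored_pairs
    ]


def write_sssom_csv(rows, handle) -> None:
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=SSSOM_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    # Placeholder IDs, not real Uberon or MeSH identifiers.
    write_sssom_csv(to_sssom_rows([("SOURCE:0001", "TARGET:0001", 0.9123)]), buffer)
    print(buffer.getvalue(), end="")
```

A file written this way would still need to pass `sssom validate`, which may require metadata that this sketch leaves out.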
