`sct` - Modern, fast SNOMED-CT tooling for the agentic age

Hi Open Health Hubbers

I had an interesting day today thinking about SNOMED-CT Terminology Servers and how inefficient it is to have to call a web server for SCT queries. I have always found it difficult to get started with SNOMED because of the requirement to set up a server, or to use one of the clunky web UIs.

I wanted a local-first SNOMED tool that I could understand and which I could play with on the command line. Maybe such things exist somewhere, but a decent search didn’t turn up anything that was obviously easier to get to grips with than the Ontoservers and Snowstorms and suchlike.

Using SNOMED with LLMs via a web terminology server is likely to be incredibly slow, because of the latency of the web connection, the multiple round-trip queries required, and the relatively heavy context-window overhead for LLMs of crafting REST queries compared to using jq or ripgrep.

So. I wrote my own! It’s a single Rust binary which can ingest the entirety of the UK Edition in 27 seconds on my laptop. From there, there are a ton of next steps you can take to handle the data in different ways.

sct is a local-first SNOMED CT toolchain - a single Rust binary that takes an RF2 Snapshot release (the raw tab-separated files that SNOMED CT is distributed in) and converts it into a canonical NDJSON artefact, joining 800k+ concepts with their preferred terms, synonyms, hierarchy paths, and relationships in one pass.

From that artefact you can load the data into SQLite with FTS5 full-text search, export it to Parquet for DuckDB analytics, render per-concept Markdown files for RAG/LLM ingestion, or generate Ollama vector embeddings for semantic similarity search.
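
Concretely, the chain looks something like this (a sketch: the ndjson and sqlite invocations are the ones quoted elsewhere in this thread; the other subcommand names are illustrative, so check sct --help for the exact spellings):

    # One-off: join the RF2 snapshot into the canonical NDJSON artefact
    sct ndjson --rf2 uk_snapshot_dir/

    # Then derive whatever you need from it (names below are illustrative)
    sct sqlite --input snomed.ndjson      # SQLite DB with FTS5 search
    sct parquet --input snomed.ndjson     # Parquet for DuckDB analytics
    sct markdown --input snomed.ndjson    # per-concept Markdown for RAG
    sct embed --input snomed.ndjson       # Ollama vector embeddings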

There’s also a built-in MCP server that connects directly to LLMs, giving an AI assistant live access to five SNOMED tools (free-text search, concept detail, children, ancestors, hierarchy browse) with no cloud dependency and sub-5ms startup.

The whole thing runs offline, produces standard files queryable with sqlite3, duckdb, jq, or ripgrep, and is designed around the principle that the expensive RF2 join should happen once - deterministically - and everything else should be derived from the resulting stable, versionable NDJSON file.
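
For example, once you have the NDJSON, ordinary Unix tools get you a long way (the field name below is an assumption; peek at one record first to learn the actual schema):

    # Inspect one record to see the field names
    head -n 1 snomed.ndjson | jq .

    # Filter with jq, e.g. concepts whose preferred term mentions asthma
    # ('preferredTerm' is an assumed field name; adjust to your schema)
    jq -c 'select(.preferredTerm | test("asthma"; "i"))' snomed.ndjson

    # Or just ripgrep the raw lines
    rg -i 'asthma' snomed.ndjson | head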

I’d very much appreciate feedback on the work so far. It’s likely to have some bugs, but it does work. As an example, I have been able to ask Claude for the dumbest and silliest SNOMED-CT terms, in its opinion, which it dug out after 58s of querying for silly things.

I finally have SNOMED-CT on my laptop in a form I can query any way I like.

What would you like to see in sct next?

You are a bit further down the road than I am.

I’m at a stage where I can produce output like yours using Jupyter + medspaCy + scispacy: Testing/PDFTextAnalytics.ipynb at main · nw-gmsa/Testing · GitHub

But I’m going to use this to justify not producing Genomic Reports only as PDF; our source data, like that of much of secondary care, is already coded. So I’m going to attempt improvements around engineering the process, rather than fixing the output (as PDF).

The main problem I have is:

GPs probably want the diagnostic implications in Rare Disease Genomic Reports to be SNOMED coded. That, I think, is pretty obvious.

The hard part is talking the ‘NHS’ system into making this a requirement; most of the tech around this is already done. For example, in Manchester the PDF is shared with GPs, and in Yorkshire we can in theory just plug into the main Yorkshire architecture so that Yorkshire GPs can see reports we’ve done for their patients. Again, this is mostly technically done; the hard part is the NHS system. Somehow I need to engineer an obvious user requirement.

Sorry, I’ve started to waffle. I think many elements of SNOMED coding could be solved with northern engineering rather than AI.

Hi Marcus,

This looks really nice! Thank you.

I am looking at how best to add SNOMED CT support to my server backend. Am I right in thinking that if I wanted to keep a trail of the full SNOMED change history, I would:

  1. Download the full current release and generate the canonical NDJSON.
  2. When a new release comes, download the snapshot and run sct ndjson --rf2 <snapshot-dir>.
  3. Run sct sqlite --input new-release.ndjson and hot-swap the new database file in place of the current one.
  4. Run: sct diff --old 2025-01.ndjson --new 2026-01.ndjson --format ndjson > diff_25_to_26.ndjson

So that:
a) My web app only hosts the single SQLite database generated from the latest Snapshot.
b) I store the tiny diff.ndjson files generated by sct diff to maintain a trail of the historical changes.
c) I can compress (via gzip or zstd) and archive the point-in-time .ndjson files, to recreate the database for any exact date in the past if ever needed.
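
i.e. per release, something like this (commands as quoted above; the hot-swap and archive steps are plain shell, and filenames are illustrative):

    # On each new release
    sct ndjson --rf2 snapshot-2026-01/                    # step 2
    sct sqlite --input 2026-01.ndjson                     # step 3
    mv new.db snomed.db                                   # hot-swap (names illustrative)
    sct diff --old 2025-01.ndjson --new 2026-01.ndjson \
        --format ndjson > diff_25_to_26.ndjson            # step 4
    zstd -19 2025-01.ndjson                               # archive the superseded point-in-time file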

I think that would work, yes. The entire process should be totally deterministic, so you can store any part of the ‘chain’ (apart from perhaps the embeddings, which may have ‘temperature’, i.e. small amounts of random variation) from .zip → RF2 → NDJSON → SQLite → diffs, and each stage should essentially hold the same information.
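
If you want to convince yourself of the determinism before trusting the archive, rebuild the same snapshot twice and compare digests. A minimal sketch, assuming the NDJSON lands in a fixed output file:

    # Two builds of the same snapshot should be byte-identical
    sct ndjson --rf2 snapshot-2025-01/ && sha256sum snomed.ndjson
    sct ndjson --rf2 snapshot-2025-01/ && sha256sum snomed.ndjson
    # matching digests => the chain is safe to archive and replay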

Just to flag, in the spirit of transparency, that I literally wrote this project yesterday in a single massive spec-driven agentic engineering session. While I take care to make sure this is not throwaway ‘vibe code’ (it has tests, I have clear standards, I use CI, and I will make this 100% production-ready…), you should still exercise caution if you’re incorporating this work into your platform!

I wasn’t quite clear on what you’re planning to do - you’re right that engineering and standards are the solutions, not AI.

Receiving GP clinical systems probably don’t have good ways for the sender to include ‘suggested’ SNOMED-CT codes, apart from bunging the code in a PDF, at which point it becomes an admin task for one of the practice staff. GP clinical systems should probably be better at this.

Noted, thank you!
This system is not in full production yet, and if it gets rid of the need to run a SNOMED server, that would be great.
I wonder if there is some interesting logic for working across national / international editions with your system, maybe a flag to prefer an edition if you have multiple loaded?

That could be really useful for a multi-national platform.

For multiple editions, create multiple appropriately labelled snomed-<edition-name>.db files; then you can simply use the required edition when you issue SQLite commands.
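
Something like this (output filenames are illustrative; rename however your build actually lands):

    # One database per edition
    sct ndjson --rf2 uk-snapshot/ && sct sqlite --input uk.ndjson
    mv snomed.db snomed-uk.db            # assumed default filename

    sct ndjson --rf2 intl-snapshot/ && sct sqlite --input intl.ndjson
    mv snomed.db snomed-intl.db

    # Pick the edition per query
    sqlite3 snomed-uk.db ".tables"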

At the moment it’s been designed mainly as a CLI tool, but I do plan to make it into a Rust library that can be embedded in other Rust codebases, so that my other projects can integrate SNOMED more easily.

However, once you have a local NDJSON or SQLite DB, or any of the derivative products, you can query them using any tooling you want, in any framework or language.
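
For example (table and column names here are assumptions; run .schema in sqlite3 to see the real ones):

    # SQLite: full-text search via the FTS5 index
    sqlite3 snomed.db "SELECT conceptId, term
                       FROM concepts_fts
                       WHERE concepts_fts MATCH 'myocardial infarction'
                       LIMIT 10;"

    # DuckDB: analytics straight off the Parquet export
    duckdb -c "SELECT count(*) FROM 'snomed.parquet';"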

Looking forward to the library!

They do - a few years ago five of them arrived (as Transfer of Care), but it was complex (for small NHS Trust IT teams), and GP suppliers only display them to GPs as HTML pages (which a third party can’t access). The next version to come out is likely to be this: Home - HL7 Europe Hospital Discharge Report v1.0.0-ci-build, ideally shared via NRL, hopefully without any over-engineering added on top.

But in the main most others have just been PDF - mostly Kettering XML, as this was easiest for small NHS Trust teams to support - though in most cases these reports start off with coding present (probably not SNOMED or LOINC) within NHS trusts.

Interesting - as one who has gone through the pain of setting up my own SNOMED CT SQL server ingesting the Monolith edition, and who has to go through the tedium of updating said installation at intervals…

I think some of us unfamiliar with Rust may need a bit of help to get on the first rung of the ladder. Have you got some kind of a Dummies’ guide? I am logged in to my Windows 11 notebook and can access GitHub and read the files, but I can’t easily work out where to start, what the prerequisites are, what I need to install, etc., to have some chance of checking out what you have done.

It looks very interesting, and it might be worthwhile, among other things, to compare what your sct generates with the output from SQL queries on my full server, accessible from my notebook. I am also intrigued to know how you have pulled RF2 together into a single canonical artefact; the Monolith release gets rid of some of the complexity, but not all. Anything to demystify and get to grips with the SNOMED CT beast looks good to me… Hope that with a little bit of help I can get sct working!

Important waffle! As an ex-GP I would be interested to explore what GPs need and want from Rare Disease Genomics Reports, and how that can best be conveyed in a structured manner. It would also be interesting to hear more about the difficulties with the ‘NHS system’. I am involved with some work on replacing the aged PMIP/EDIFACT pathology links message with FHIR/SNOMED CT, and a number of us keep asking why there is not more collaboration with the genomics folk about getting reports into GP systems. The ‘system’ seems to be siloed.

For the future we need a much wider spectrum of reports to reach GP systems than the very limited selection that we currently get via PMIP. We also need to close the loop so that meaningful requests go the other way, all designed to accommodate decision support and AI in its various forms. We too have our problems with the ‘NHS system’. Sounds as if this may at some point need a thread separate from sct.

(Replied in a new thread: Genomic Reports for GP and future AI use)

There is a Quick Start here: GitHub - pacharanero/sct: SNOMED-CT tooling, brought into the 21stC. RF2 -> ND-JSON, SQLite, Vector, Parquet, LLM-friendly

And the Walkthrough is also useful: sct Walkthrough — Feature Tour

Installing Rust on Windows is well documented online. I use mise to install language toolchains these days, but that is an optional route; rustup is the most common approach.
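
As a rough sketch, the shortest path from a bare Windows machine (the repo URL is the one from the Quick Start above; everything else is standard Rust tooling):

    # 1. Install the Rust toolchain: download and run rustup-init.exe
    #    from https://rustup.rs, accepting the defaults

    # 2. Build and install sct from source
    cargo install --git https://github.com/pacharanero/sct
    # (if the crate is published on crates.io, `cargo install sct-rs`
    #  may also work - check the README)

    # 3. Confirm it runs
    sct --help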

Here’s the Walkthrough section of the docs, in video form. If anyone’s struggling to get sct installed, feed back here with what you’ve tried and what’s happening when you do, and I’ll help.

sct — what’s new since the end of March

Three weeks of fast-moving work. Here’s the highlight reel.

Easier to install

Several ways to get sct now exist where there was only cargo install:

  • curl -sSL https://... | sh installer for macOS/Linux, PowerShell installer for Windows - auto-detects OS/arch, verifies SHA-256 against SHA256SUMS, drops the
    binary in the right place.
  • Homebrew tap and Scoop bucket auto-bumped on every release.
  • cargo binstall sct-rs for the Rust crowd who want prebuilt without
    compiling.
  • Prebuilt binaries shipped for Linux x86_64 + aarch64, macOS Intel + Apple
    Silicon, and Windows x86_64.

New things you can actually do

  • sct lookup — direct SCTID and CTV3 code lookup with the full concept
    page.
  • sct lexical — FTS5 keyword search, with phrase / prefix / boolean
    operators.
  • sct semantic — Ollama-backed vector similarity search over Arrow IPC
    embeddings.
  • sct refset — list, inspect, and enumerate members of any Simple refset, with end-to-end RF2 ingest support.
  • sct codelist — build, validate, diff, stat, and export clinical code
    lists in a YAML-front-matter Markdown format.
  • sct trud — download SNOMED CT releases straight from NHS TRUD, with
    SHA-256 verification.
  • sct info + sct diff — inspect any artefact, compare two NDJSONs to see
    what changed between releases.
  • sct tui — full-screen interactive terminal explorer.
  • sct gui — browser-based UI with a D3.js neighbourhood graph
    visualisation.
  • sct completions — bash/zsh/fish/PowerShell/elvish.
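
A minimal taste of a couple of these (22298006 is the well-known SCTID for myocardial infarction; exact flags may differ, see docs/commands/):

    # Direct lookup by SCTID
    sct lookup 22298006

    # FTS5 keyword search: phrase and prefix operators
    sct lexical '"heart attack"'
    sct lexical 'myocard*'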

More data linked together

  • CTV3 and Read v2 cross-maps loaded from UK Monolith RF2 — reverse-lookup
    legacy codes to SNOMED.
  • Transitive closure tables for fast subsumption queries (see the sketch after this list).
  • ZIP auto-extraction — point sct ndjson at the release zip, skip the unzip
    step.
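
The closure table turns ‘is X a descendant of Y?’ into a single indexed lookup instead of a recursive hierarchy walk. A sketch with assumed table and column names (check .schema on your DB for the real ones):

    # Is 22298006 (myocardial infarction) subsumed by
    # 404684003 (clinical finding)? A row back means yes.
    sqlite3 snomed.db "SELECT 1 FROM transitive_closure
                       WHERE ancestor_id = 404684003
                         AND descendant_id = 22298006;"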

MCP server matured

  • Codelist tools now exposed (build/edit code lists from your LLM client).
  • Newline-delimited JSON transport for the 2025-03-26 spec alongside the
    older Content-Length framing — works with both Claude Desktop and Claude
    Code 2.x.
  • Schema-version validation at startup so a stale binary against a new DB
    fails loud, not silent.
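
Wiring it into Claude Code is a one-liner; the sct mcp subcommand name and --db flag below are illustrative, so check sct --help for the real spelling:

    # Register the sct MCP server with Claude Code
    claude mcp add sct -- sct mcp --db snomed.db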

Provenance (this week!)

Every artefact now carries the SNOMED edition, release date, full release identifier, and the sct version that built it, captured in NDJSON headers, SQLite metadata tables, and Arrow schema metadata:

  • sct info displays it.
  • Query commands show a provenance footer (TTY-aware, so pipes stay clean); the --provenance flag overrides the default.
  • MCP advertises it on every handshake and embeds it in snomed_concept responses.
  • sct codelist add auto-fills the snomed_release frontmatter from the DB.
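
In practice (arguments illustrative):

    # Show which edition/release an artefact was built from
    sct info snomed.db

    # Toggle the provenance footer explicitly when piping
    sct lookup 22298006 --provenance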

Maintenance, bugfixes & security

  • Crate split into library + binary so integration tests can live properly
    under tests/.
  • Pre-commit hook running cargo fmt --check + cargo clippy --all-targets -- -D warnings.
  • Replaced unmaintained serde_yml with serde_yaml_ng; bumped ratatui to
    0.30 — clears three open RUSTSEC advisories.
  • Clean SIGPIPE exit so sct … | head no longer panics.
  • Configurable one-line concept format shared across
    lookup/lexical/refset/semantic.

Docs

  • Walkthrough split into focused per-topic pages, hosted on the GitHub
    Pages site (Zensical).

  • Per-command reference in docs/commands/.

  • Devcontainer with sqlite3, duckdb, jq, ripgrep, Python, and Ollama
    pre-installed.


From v0.3.1 to v0.3.10: ten releases in three weeks.