UK SNOMED RF - Idiot's Guide

I’m looking at terminology services and have loaded in SNOMED Concepts, Descriptions and Relationships.
The description and entity diagram on this site were very useful.

Does anyone know of anything similar for reference sets? (ideally an entity diagram on relationships between files). The official SNOMED site is useful but I’m getting a little lost in the amount of information.

Have worked it out :slight_smile:

Not strictly related to your query Kev, but I’ve been reading about graph databases, which are all about relationships between nodes, and it struck me how terminologies like SNOMED-CT (and complex related relationships such as mappings from CTV3/Read2 etc.) would work really well in a graph database such as Neo4j.


Did an in-memory server of SNOMED CT while I was at NHS Digital (alas, powered by RF1, so now out of date); it held the whole graph in around 500MB of RAM. You can run refset queries pretty fast when you have no I/O overheads to worry about - it could do queries like “descendants of finding”, with tens of thousands of concept results, in under half a second (including writing the output to disk). No mean feat when Access would take over 5 minutes to do the same thing :slight_smile:
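To make the “descendants of” idea concrete, here’s a minimal sketch (not the original NHS Digital code, just the general shape) of a transitive-closure query over an in-memory IS-A index; the concept IDs in the usage example below are made up:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal in-memory "descendants of" query over an IS-A graph.
// Concept IDs are longs; edges (child IS-A parent) would come from
// the RF relationship files.
public class IsAGraph {
    private final Map<Long, List<Long>> children = new HashMap<>();

    // Record one IS-A edge: child IS-A parent.
    public void addIsA(long child, long parent) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Breadth-first traversal collecting all transitive descendants of root.
    public Set<Long> descendantsOf(long root) {
        Set<Long> seen = new HashSet<>();
        ArrayDeque<Long> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            long current = queue.poll();
            for (long child : children.getOrDefault(current, List.of())) {
                if (seen.add(child)) {   // only visit each concept once
                    queue.add(child);
                }
            }
        }
        return seen;
    }
}
```

With everything in RAM, a query like this is pure pointer-chasing - the half-second figure quoted above is dominated by writing the results out, not the traversal itself.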

Snow Owl did the same queries faster, but it was “cheating” and using precomputed indexes of the relationships.

I remain fairly convinced that for best performance on SNOMED CT, even a graph database isn’t as fast as you’d want it to be, but would be humbly surprised to be proved wrong.

I’ve not looked at graph databases. I went straight to a SQL database (using Hibernate) because of the data structures; ideally I’d move the text searching to Elastic. Performance is around 300ms-1sec for a concept query (I’ve not optimised yet).

FHIR is pretty clunky around Terminology at the moment; the main problem is that the resource is centered around CodeSystem rather than Concept. The resources don’t follow the style/pattern of the main FHIR resources, but it works, and the overall structure was really helpful in getting started on the Terminology Server.

@adrian.wilkins which graph db did you use for that?

Ah, that was the point. No graph DB. Wrote a high-performance-but-low-overhead sorted collection class and a routine that loaded SNOMED CT objects into RAM from slightly modified RF1 files. Also wrote a library to do the traversal / query part (very much in the same mould as the official recommendations, only the query language was XML based and thus easy for poor ol’ me to write a parser for). That leaned heavily on the excellent Guava library for Set operations.
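As an illustration of the low-overhead sorted collection idea (my guess at the general shape, not the actual class described above), a sorted primitive `long[]` with binary search avoids per-entry node objects entirely:

```java
import java.util.Arrays;

// Illustration of a low-overhead lookup table: concept IDs in a sorted
// long[] with a parallel Object[] for payloads. No boxed keys and no
// per-entry node objects, so the overhead per concept is tiny compared
// with a TreeMap<Long, ...>.
public class SortedLongTable {
    private final long[] keys;
    private final Object[] values;

    // Build once from pre-sorted parallel arrays (e.g. after loading RF files).
    public SortedLongTable(long[] sortedKeys, Object[] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    public Object get(long key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }

    public boolean contains(long key) {
        return Arrays.binarySearch(keys, key) >= 0;
    }
}
```

Lookups are O(log n) over one contiguous array, which is also kind to the CPU cache.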

Just sticking everything in a TreeMap gives you about 400MB of overhead for that set of objects (making the whole shebang over 900MB), and it’s not very fast.

900MB is pushing the limits of available heap space on 32-bit Windows boxes (for virtual memory address layout reasons, you get about 1.2-1.4GB heap space, max, depending on the size of drivers you have loaded). But of course, we were limited to 32-bit OS installs, because various bits of legacy baggage didn’t want to run on 64-bit Windows and no-one wanted to work to fix this.

So for my poor crippled Windows users, I worked to trim that overhead; ended up with a collection class that was much faster and had about 20MB total overhead for all 3M or so objects in the core SNOMED CT module (which contains about 150MB just as strings). Being able to fit everything in around 500MB of RAM leaves you enough room to actually run useful apps on top of that data.

The collection class was what I think is called a hybrid bucket trie; low overhead and quite cache-optimized, so lookups and inserts are fast. It can load the core SNOMED CT module in under 20s on reasonable hardware, and can run most of the full set of UK refset queries and write the output to disk in under 80s. From comparisons we made with the official implementation (at the time, Chris Morris converting specs in Word docs to lots of PL/SQL), it’s pretty accurate.
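A very rough sketch of what a bucket-style structure can look like (an assumption about the general shape, not the actual implementation): keys are routed by a fixed-width chunk of the key into sorted primitive buckets, so each lookup is one array binary search.

```java
import java.util.Arrays;

// Rough sketch of a bucketed long-key set: route on the low 8 bits of
// the key into one of 256 buckets, each bucket a sorted long[] grown in
// place. Per-key overhead is near zero and a lookup scans one
// contiguous array, which keeps it cache-friendly.
public class LongBucketSet {
    private static final int FANOUT = 256;            // route on the low 8 bits
    private final long[][] buckets = new long[FANOUT][];
    private final int[] sizes = new int[FANOUT];

    private int route(long key) {
        return (int) (key & (FANOUT - 1));
    }

    public boolean contains(long key) {
        int b = route(key);
        long[] bucket = buckets[b];
        return bucket != null && Arrays.binarySearch(bucket, 0, sizes[b], key) >= 0;
    }

    public void insert(long key) {
        int b = route(key);
        long[] bucket = buckets[b];
        if (bucket == null) {                          // first key in this bucket
            buckets[b] = new long[8];
            buckets[b][0] = key;
            sizes[b] = 1;
            return;
        }
        int pos = Arrays.binarySearch(bucket, 0, sizes[b], key);
        if (pos >= 0) return;                          // already present
        pos = -(pos + 1);                              // insertion point
        if (sizes[b] == bucket.length) {               // grow the bucket
            bucket = Arrays.copyOf(bucket, bucket.length * 2);
            buckets[b] = bucket;
        }
        System.arraycopy(bucket, pos, bucket, pos + 1, sizes[b] - pos);
        bucket[pos] = key;
        sizes[b]++;
    }
}
```

The trade-off versus a plain sorted array is cheaper inserts (only one bucket shifts, not the whole collection) at the cost of slightly more routing logic.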

Still have the code (and it’s under a permissive license, like most efforts of NHS Digital in recent years), but it ain’t pretty or fun to read. All very, very special-purpose. But I think SNOMED CT is complex enough that it needs special-purpose treatment - working at it from the level of abstraction that e.g. the Common Terminology Services spec lays out is just nuts, like making a sandwich using chopsticks while wearing a welding mask.

Edit : And SNOMED CT is in itself quite the abstraction, proposals like the representation of numbers for e.g. pharmaceutical products compounding that considerably.

Aha, I see.

I guess that’s pushing the limits of what can be done just in RAM, and if you run out of RAM… splat?

I was just thinking about how the entirety of SNOMED-CT (in particular, the various forward mappings from READ2->CTV3->SNOMED-CT, and any UK-specific refsets) is all managed in text files, Word docs, and Excel spreadsheets. While these are ‘interoperable’ in a loose sense, because they cater to a lowest common denominator, they are very low-tech and require a lot of work on the part of any implementer.

If there was (for example) a ready-to-go Docker deployment of a graph-db server that contained the whole lot - SNOMED-CT, plus all relevant forward mappings - would this lower the bar to understanding and manipulating terminologies? (I ask that as a question because I’m pretty sure I don’t really understand terminologies myself, having never really had to use them other than as a clinician in an EHR.)

Sounds like a great idea, could also include an API layer and sample front end?

Not sure about graph; most staff using terminology/codes work in SQL (Server).

Neo4j includes a REST API out of the box; it’s primarily intended as a REST-based server DB for horizontal scalability. There are various front ends which allow browsing of the graph, and these can be themed/skinned using D3.js/HTML/CSS.