Simulating plausible dummy data for an average Type 2 Diabetic patient

One of the perennial gaps in the tooling that supports development of EHRs and other health technology is the lack of good-quality dummy data for testing. I’m involved in the INTEROPen Hackathon next week, and I thought it would be nice prep to make a very quick sketch API to randomly generate plausible blood glucose measurements and HbA1c values for the famous INTEROPen subject ‘Michael’, who has Type II Diabetes.

I’ve been Googling around this a bit and not found much of use already out there. Academic papers tend not to publish their actual algorithms, which is a shame.

I already have a Clinical Calculation API which I’ve been reviving as a project after a few years’ hiatus, so I’ve simply added a dummy-data endpoint to this existing API, which is a fairly rudimentary work in progress here:

What I’m asking from the community is:

  • does anyone have access to, or could anyone suggest, a ‘plausible’ mean and standard deviation for blood glucose and HbA1c in Type II DM? Looking at you, diabetologists or laboratory peeps @jonathan_kay 🙂 In my testing I’m getting reasonable numbers from a mean of 10 mmol/L and an SD of 2.5, but this is just a clinical ballpark figure I came up with. There will be data out there, I just can’t find it.

  • does anyone know of any better way to model this than a Normal distribution?

  • does anyone want to help out or join in? (Minimal Ruby programming required; a 1-hour tutorial should be enough to get you going.) The idea is Plausible Dummy Data for various conditions, served via an API according to SNOMED-CT code.
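
To make the idea concrete, here is a minimal sketch of the kind of sampling described above, assuming a Normal distribution with the ballpark mean of 10 mmol/L and SD of 2.5. The class, constant, and method names (`NormalSampler`, `PARAMETERS`, `dummy_glucose`) are hypothetical and are not taken from the actual API; the Box-Muller transform is simply one standard way to get Normal draws from Ruby's uniform `Random`. 44054006 is the SNOMED CT concept for type 2 diabetes mellitus.

```ruby
# Hypothetical sketch: none of these names come from the actual API.
# Draws normally distributed values using the Box-Muller transform.
class NormalSampler
  def initialize(mean:, sd:, rng: Random.new)
    @mean = mean
    @sd = sd
    @rng = rng
  end

  # Returns one draw from Normal(mean, sd).
  def sample
    # 1.0 - rand lies in (0, 1], which keeps Math.log away from log(0).
    u1 = 1.0 - @rng.rand
    u2 = @rng.rand
    z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
    @mean + @sd * z
  end
end

# Ballpark parameters from the post (a guess, not measured data),
# keyed by SNOMED CT code. 44054006 = type 2 diabetes mellitus.
PARAMETERS = {
  '44054006' => { glucose_mmol_l: { mean: 10.0, sd: 2.5 } }
}.freeze

# Returns `count` plausible glucose readings (mmol/L, 1 decimal place).
def dummy_glucose(snomed_code, count, rng: Random.new)
  params = PARAMETERS.fetch(snomed_code).fetch(:glucose_mmol_l)
  sampler = NormalSampler.new(mean: params[:mean], sd: params[:sd], rng: rng)
  Array.new(count) { sampler.sample.round(1) }
end

# Passing a seeded Random makes the output reproducible for tests.
puts dummy_glucose('44054006', 5, rng: Random.new(42)).inspect
```

Keeping the parameters in a lookup keyed by code means other conditions or analytes could be added as data rather than as new code paths.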



In principle, I could get real anonymised data for you, and/or derive an actual mean and standard deviation from a real patient or set of patients.

What would you prefer?

Let me know and I will see what I can do. However, I can make no promises because I work in Clinical Haematology and I have not yet had to identify someone with this comorbidity. That said, IIRC I do know someone who is working with diabetes data from the local Diabetes Unit and I will have a chat with them to see what can be done in the desired timeframe …

I am very sorry to miss the Hackathon. Just too busy …


Have you looked at this?

We’ve been talking about converting this for UK use


@gdvallance thanks for the offer. I would like to steer clear of anonymised real data, so a derived mean and SD from a real cohort of type 2 diabetics would be perfect. In terms of identifying the cohort without access to any other clinical information about them, I reckon the best way would be from an abnormally raised HbA1c.

Doesn’t have to be perfect, anything will be better than my guesswork, and the idea is that open source participants can refine and optimise this over time.

@mayfield.g.kev that Synthea tool looks amazing. It is certainly worth having a look at their source code, and I agree it would be very much worth developing a UK-localised version. I’ve looked through their source and it doesn’t look as though it would do glucose or HbA1c dummy data; it seems more aimed at population-level rather than individual-level dummy data.

I’ll keep plodding away in Ruby/Rails on my API. Taking the example of the rest of the tech world, I do think that wrapping complex or hard stuff like this in web-based APIs is a useful thing to do. If large-volume use were charged for, it could even sustainably fund itself (maybe).

OK. Derived mean and SD it is. I should be seeing my contact tomorrow. If he has the data I believe he has then I should be able to get you those numbers tomorrow afternoon. If not, I doubt I will have the capacity to get the numbers in the timeframe you desire. But will try …


A great ask and so glad you are getting responses. Only sorry that we cannot help directly.

OK, I have some data.

I have pulled data for patients with a diagnosis of Type II diabetes from one of our Warehouses. The diagnosis was ascertained by them having an ICD-10 code of E11.* In the database these were represented as E11x, and I searched for any that met these as per:

I chose 50 patients (the first 50 in the list) and then pulled their GLUCOSE (mmol/L) and HbA1c results. There were TWO methods for HbA1c: IFCC (mmol/mol) and DCCT (%).

For the 50 patients for GLUCOSE I got n = 192 records. Some are from the same patient and some from different patients. All mixed up.
GLUCOSE: MEAN = 12.3, SD = 7.70 (calculated according to the Excel STDEV.S function; 2 dp).
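
One thing worth noting about these glucose figures in relation to the earlier question about alternatives to a Normal distribution: with a mean of 12.3 and an SD of 7.70, a Normal model would put roughly 5% of draws below zero, which is not plausible for a blood glucose. A log-normal distribution is one possible alternative, since it only produces positive values and is right-skewed. The sketch below is an assumption on my part rather than anything the API is known to use; `LogNormalSampler` is a hypothetical name. It picks the parameters of the underlying Normal by moment matching, so that the generated values have the requested mean and SD.

```ruby
# Hypothetical sketch: a log-normal sampler matched to a target mean and SD.
# This is a swapped-in technique, not the method used by the actual API.
class LogNormalSampler
  attr_reader :mu, :sigma

  def initialize(mean:, sd:, rng: Random.new)
    mean = mean.to_f
    sd = sd.to_f
    # Moment matching: choose mu and sigma of the underlying Normal so the
    # log-normal has the requested mean and SD on the original scale.
    sigma_squared = Math.log(1.0 + (sd**2) / (mean**2))
    @sigma = Math.sqrt(sigma_squared)
    @mu = Math.log(mean) - sigma_squared / 2.0
    @rng = rng
  end

  # Returns one strictly positive draw.
  def sample
    # Box-Muller transform for a standard Normal draw z.
    u1 = 1.0 - @rng.rand
    u2 = @rng.rand
    z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
    Math.exp(@mu + @sigma * z)
  end
end

# Glucose figures reported in this thread: mean 12.3 mmol/L, SD 7.70.
sampler = LogNormalSampler.new(mean: 12.3, sd: 7.70, rng: Random.new(42))
puts Array.new(5) { sampler.sample.round(1) }.inspect
```

Whether real glucose results are actually log-normal would need checking against the raw values; the pooled figures mix repeated measurements from the same patients, so the shape may be more complicated than any single distribution.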

For the 50 patients for HbA1c I got n = 398 records. Some are from the same patient and some from different patients. All mixed up.

HbA1c (DCCT): MEAN = 8, SD = 1.38 (calculated according to the Excel STDEV.S function; 2 dp).
HbA1c (IFCC): MEAN = 59, SD = 15.14 (calculated according to the Excel STDEV.S function; 2 dp).
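
Since HbA1c turns up in two units here, a dummy-data generator could sample in one unit and convert to the other. The sketch below uses the published IFCC-NGSP master equation (DCCT % = 0.09148 × IFCC mmol/mol + 2.152); the module and method names are hypothetical. Note that the two means above were calculated from whichever records used each method, so they should not be expected to convert exactly into one another.

```ruby
# Hypothetical helper for converting HbA1c between the two reporting units,
# using the IFCC-NGSP master equation. Names are illustrative only.
module Hba1cUnits
  SLOPE = 0.09148
  INTERCEPT = 2.152

  # IFCC (mmol/mol) -> DCCT/NGSP (%)
  def self.ifcc_to_dcct(mmol_per_mol)
    SLOPE * mmol_per_mol + INTERCEPT
  end

  # DCCT/NGSP (%) -> IFCC (mmol/mol)
  def self.dcct_to_ifcc(percent)
    (percent - INTERCEPT) / SLOPE
  end
end

# 48 mmol/mol is the familiar diagnostic threshold, equivalent to 6.5%.
puts Hba1cUnits.ifcc_to_dcct(48).round(1)
puts Hba1cUnits.dcct_to_ifcc(8.0).round
```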

Hope this helps. If you want me to do anything else let me know. Unfortunately, it won’t be until Monday because I am out of the office on other projects tomorrow.


This is awesome Grant, thanks for doing this so quickly! This data is ideal for what I need, I can use it right away.

Are you happy for me to add you in the acknowledgements/collaborators in the project documentation on GitHub?

Dear Marcus,

Certainly. It would be my honour. Very happy to help.

Grant D. Vallance