OSHCA 2001 NHS Stream: Jeremy Rogers

Jeremy Rogers is one of the benevolent dictators of the OpenGalen project. His talk gave insight into the way in which the major components of intelligent medical records will come together. And in his view, such record systems will be impossible to create without widespread adoption of open source development methods for the termsets, ontologies, and knowledge bases.

Chair: OK, so who's coming next?

Jeremy Rogers: This is becoming a fixed feature of PowerPoint presentations when we all do the swapping-laptops shuffle...

[beeps and keystrokes]

Chair: I think we've actually pretty much introduced you...

Jeremy Rogers: I'm one of the benevolent dictators in OpenGalen. I trained as a GP, and I still see patients occasionally to remind me why I don't do it all the time. One of the benevolent dictators who looks after the software is in the audience as well, so if there's anything nasty, we'll get you afterwards...

OpenGalen is middleware. It's making the impossible very difficult, as we say, and we recognise that OpenGalen and that kind of area of healthcare middleware is abstract and pretty difficult for a lot of people. I'm very conscious that a lot of projects in healthcare that are open source are much nearer the coal face, have taken a lot of shortcuts, perhaps without knowing they're doing it, and get results very quickly. OpenGalen, and HL7, and all those weird and wonderful things are really trying to look beyond that level of development and saying, "There's only so far you can scale with 1970s technology, so what do you have to put in its place?" I don't know how familiar you are with this kind of area, but we'll see how we go. One of the phenomena apparent from the MedInfo conference I've just been to is a recognition that what everybody wants is intelligent systems. We don't want dumb filing cabinets: there's just too much data; there has to be some intelligence in the system.

Historically, intelligent applications have been achieved by encoding your knowledge and your heuristics in the C code or the Java code or whatever language you use. That is very successful for a while, but there's no reuse of your knowledge, and it's very difficult to maintain as it gets bigger. The response, certainly on the academic side, has been to say that we need to get this knowledge out of the code and into some sort of queryable knowledge base where we can maintain it better: where it's not written in C code, it's written in something else.
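
[A minimal sketch of that contrast, with hypothetical names and thresholds throughout (not from any real guideline): the same heuristic, first buried in application code, then held as queryable data that the application merely interprets.]

```python
# Hedged sketch of the contrast (all names and thresholds hypothetical).

# 1970s style: the heuristic is buried in an if-statement in the code.
def check_patient_hardcoded(age, systolic_bp):
    alerts = []
    if age > 65 and systolic_bp > 160:  # knowledge hard-wired into the code
        alerts.append("consider hypertension review")
    return alerts

# Knowledge-base style: the rules are data, so they can be queried,
# maintained, and reused without touching the application code.
RULES = [
    {"when": lambda p: p["age"] > 65 and p["systolic_bp"] > 160,
     "then": "consider hypertension review"},
]

def check_patient(patient):
    return [r["then"] for r in RULES if r["when"](patient)]

print(check_patient({"age": 70, "systolic_bp": 170}))
# -> ['consider hypertension review']
```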

So for example, we have these things called ontologies: concept models. OpenGalen would claim to be in that ballpark. SNOMED RT is really a completely different beast from Read Version 2, which is actually what's used on the ground, and pretty different from Read Version 3 (Clinical Terms Version 3): it's an entirely different animal. They are written in hierarchical frame-based systems and description logics, with weird and wild and wacky things that come out of the mathematics used to represent these things.

There are only a few instances of them in the world because they are extraordinarily large. SNOMED CT will have 350,000 concepts and upwards when you get it out of the box, plus the ability to make millions and millions more in your own back room. Galen is the same sort of size. There was a knowledge base talked about at MedInfo that has a million concepts in it already, and seven million facts, and it is used to do natural language processing. These things are enormous, so nobody is really building an awful lot of them.
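
[A toy illustration of where the "millions more" come from: compositional terminologies let users assemble new concepts from released primitives plus relationships. The syntax below is invented for illustration; it is not GRAIL or SNOMED CT syntax.]

```python
# Toy illustration (invented syntax, not GRAIL or SNOMED CT): a terminology
# ships a set of primitive concepts, and post-coordination composes new ones.

PRIMITIVES = {"Cancer", "Liver", "Lung", "Metastatic", "Primary"}

def compose(base, **relations):
    """Build a post-coordinated concept, e.g. Cancer with site Liver."""
    assert base in PRIMITIVES
    for value in relations.values():
        assert value in PRIMITIVES, f"unknown primitive: {value}"
    return (base, tuple(sorted(relations.items())))

# 'Metastatic liver cancer' need not exist in the release: it is composed.
concept = compose("Cancer", site="Liver", spread="Metastatic")
print(concept)
# -> ('Cancer', (('site', 'Liver'), ('spread', 'Metastatic')))

# With hundreds of thousands of primitives and a handful of relationships,
# the space of valid compositions runs to millions.
```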

Another sort of knowledge base, or knowledge structure, that people are building is the information model: typically the kind of thing that would be written as a UML object diagram and used to structure a medical database. Things like the HL7 Reference Information Model or the CEN electronic healthcare record architecture: that kind of ballpark. In terms of size they're a lot smaller than your concept model--they may have a few hundred objects or classes--and in terms of the total number of them in the world there are relatively few of them as well, mostly because people recognise these, above all else, as worth standardising, rather than because they are hard to build.
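
[A hedged sketch of the difference in kind: an information model is a handful of generic classes whose coded slots are filled from the terminology. The classes below are invented for illustration and are not the HL7 RIM or the CEN architecture.]

```python
# Hedged sketch (invented classes, not the HL7 RIM or the CEN EHCR
# architecture): a few generic classes, with coded slots that the
# terminology fills.
from dataclasses import dataclass
from datetime import date

@dataclass
class CodedValue:
    code: str    # a concept identifier drawn from the terminology
    system: str  # which terminology the code comes from

@dataclass
class Observation:
    patient_id: str
    recorded: date
    value: CodedValue  # the slot handed over to the terminologists

obs = Observation("p123", date(2001, 9, 5),
                  CodedValue("C-0001", "hypothetical-terminology"))
print(obs)
```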

So there's quite a bit of interest, even in the commercial world, in settling on one, and arguably the HL7 RIM is going to end up being the kind of default model that everyone uses, whether people want to or not. The other model that you see is the inference model, which is a little bit more difficult to describe, but it's the kind of thing that underpins things like PRODIGY, Protégé from Stanford, and Medical Logic Modules in Arden syntax. They are used for interface control terminologies and are the things you would write if/then rules in. One of the characteristics of those is that people would like there to be many, many instances of them: everybody and their dog in the hospital would be able to write their own decision support thing and customise it.

To write a protocol, for instance, to manage angina turns out to be very, very hard, as PRODIGY are finding out. They're bigger than you think they're going to be. When you see them written in the BMJ they run to a couple of sides of A4; if you try to formalise them you find all the bits that are missing, and you end up with reams and reams of code: a typical individual guideline will have a thousand classes or concepts mentioned in it. So for example, you're going to have something up there that mentions the idea of metastatic liver cancer: some term, or set of terms. Your information model will have some slots that you can put that information in, and it may choose not to have just one slot for the diagnosis: you may choose, for a very good reason, to split it into site and pathology. If you are in a cancer hospital this makes for a much more efficient SQL query for saying "how many patients do I have with metastatic cancer?" than trying to read the text of the other thing. And down at the bottom you want to write a rule that says "if metastatic cancer then... do something".
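
[A hedged sketch of the site/pathology split he describes, with the table, codes, and rule all invented: two coded slots make the population query cheap, and the guideline rule fires on the same structured fields.]

```python
# Hedged sketch (table, codes, and rule all invented): splitting the
# diagnosis into two coded slots makes the count a cheap structured query.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE diagnosis (patient_id TEXT, site TEXT, pathology TEXT)")
db.executemany("INSERT INTO diagnosis VALUES (?, ?, ?)", [
    ("p1", "liver", "metastatic-carcinoma"),
    ("p2", "lung", "primary-carcinoma"),
])

# "How many patients do I have with metastatic cancer?" as a structured
# query over the split slots, rather than a read of free text:
count = db.execute(
    "SELECT COUNT(*) FROM diagnosis WHERE pathology LIKE 'metastatic%'"
).fetchone()[0]
print(count)  # -> 1

# And the rule down at the bottom, written over the same coded slots:
def rule(site, pathology):
    if pathology.startswith("metastatic"):
        return "do something"  # "if metastatic cancer then... do something"
```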

If you can get all this lot working together, you may be able to get an intelligent system. But only if you get it working together. And there is clearly a relationship between these three sorts of components: the concept models fantastically large but not too many of them; the information models smaller and fewer still; and the decision support rules very, very numerous, and actually quite complex themselves.

So you have three types of model, millions of objects between them, and one final system. For this all to work there is an overlap and an interface between them, but at the moment the way they are developed is just to say, "Well, I've worked out what my reference information model will be; we'll just say the slots in my information model will be filled by terminology, and we'll chuck that problem over the fence to the terminologists and not worry about it." That doesn't really constitute managing the overlap; that constitutes moving the complexity into somebody else's field and hoping you can forget about it. Historically, what you see in this sort of enterprise is that the complexity that is medicine is just passed around from one area to another among those three. Structurally, because they've been developed, from a managerial point of view, by different communities, that will persist. People don't really recognise that they're building part of one thing.

It's never really been done before in any domain, certainly not in something as big and complicated as medicine, so if you pretend you know how to do it you are probably lying. My own view, being part of this, is that a single central effort, where you co-ordinate the whole process as a managed enterprise--where you know how to develop the three pieces and you know how to make them work together--is an extremely complex undertaking, and too risky for commerce to do. It's necessarily a distributed effort, because the development of decision support rules in particular, and the local customisation and fiddling that clinicians are going to want to do, are going to happen in lots of places.

So even if you could pin down your UML model and your terminology to one building, or even two buildings, the other part is going to be all over the place, yet it has to be able to feed into and affect the other two components. How are you going to manage the interplay of all those pieces with this inevitably distributed effort?

Well, one way to do it is closed source. You have everything closed source. You have control; it's attributable, which means you can get sued; and there's very clear demarcation, so you can shunt the complexity to someone else, or just say we're not going to deal with it. It is commercially exploitable, but it's inflexible. It's very easy to get into a blinkered state of mind where you don't recognise other possibilities and other applications for the thing you've built--things you're not primarily interested in, but someone else might be--so you limit the ability of the thing to become all that it could be. Particularly in the NHS, you become isolated from prototyping, because if you develop it closed source within the NHS you can't develop a preferred-supplier test bed, because that seems like supplier favouritism. I think, personally, that one of the problems of the Read codes was the inability to form a test bed as they were built.

Managing the process doesn't scale. It's too complex, particularly at the boundaries, so saying you can manage it as a single closed source enterprise is probably... well, as I say, because it's attributable, you become a target.

The particular demarcation boundaries where you say "this is a problem that we're going to deal with" may seem appropriate when you start, but in the light of experience they may need to be changed, and by that time you've kind of enshrined the whole thing.

[minidisk change]

Just a little point. The knowledge bases that are produced are not particular datasets. The particular structure of a knowledge base will be arcane and to some extent arbitrary. There are arbitrary choices: you could have done it one way, but we chose to do it another, and that inevitably has knock-on effects on the software that users actually use. You can't actually divorce the software that chews on the knowledge base from the knowledge base itself. It needs prototyping and iterative development. We don't know how to do this, and we are lying if we say that we do. It may be that something we originally designed for one purpose could, with a bit of adaptation, be made to do something entirely different that we're not interested in, but somebody else is. So prototyping is very important even for our own chosen path, but if you can allow prototyping for some different use, then so much the better.

In the context of the triad of knowledge bases that I talked about, distributed working is not just a nice idea but something that is inevitably going to happen, particularly in decision support software. And I believe personally it is too complex to centralise. It's too complex to claim that you can actually co-ordinate it with milestones along the way. You can't claim to manage all of the interfaces between all of these very large, complicated objects. You have the risk of developing blind spots that actually stifle development, or worse, of actually blocking resources--of blocking people from doing something that you don't want to do.

And users can't contribute to something unless the thing is open, and they won't generally contribute to something unless it's free as in "no cost": they may pay for training, but they're not going to pay to get their own improvements back again. Douglas mentioned bioinformatics. I went to a bioinformatics ontology conference a couple of weeks ago: they are rabidly open source; they would not contemplate going near anything that is not open source.

So, putting it into practice, trying to make it all real: there is a website you can go to with an open source terminology. You can get the very obscure and arcane source code that builds this knowledge structure, and the development software used to author that open source material is itself now open source, as of Friday last week, after a long battle.

I think in Douglas' slide he said OpenGalen was implemented in a proprietary database. The specification of how to build the database is actually open, but the only known instance of it is currently proprietary; should you choose to build your own, you are more than welcome to do so. We would dearly love to have an open source one. In terms of the community of developers--it's a sort of BSD-style license--you can do what you like with OpenGalen stuff as long as you give a nod to the two universities that claim the origins of it. The main developers are TopFin, which is an alias for the University of Manchester, and Kermegen, which is part of the University of Nijmegen.

We have a user base in France, in Sweden, and in Korea: they've done Korean natural language generation, in a research context, on a terminology of dental procedures. And there's us in the UK, working with PRODIGY.

One of the interesting points, one of the main thrusts, is that I think this is too hard to do as a single managed enterprise, and that open source encourages the free-for-all that is probably the only way this is going to happen.

The second issue, I think, is the problem of linking all these things together. At the moment there is one open source ontology. There are several other closed source or Crown Copyright ones. You might at least be able to link to historical versions, and see if there are any different areas or interesting differences. Just in that top right corner it's actually very hard, because the intellectual property around that is something of a nest of vipers, and so it's very hard to reuse, even stuff from the NHS. If I want to take OPCS-4, for example, and in some way extract from it by comparing one hierarchy with theirs, what does that mean? What can I do with the end result? Is my result quality assured?

So that's a real issue, I think. And the licensing issue is important when you consider the number of different decision support things, and consider that every time you touch one to see if your product works with it you have to get a license for it; the number of licenses you would have to deal with to operate the whole thing just doesn't make sense from the licensing point of view. And certainly we've not had a very good experience with licensing and intellectual property around this area, even as publicly funded researchers. Even in the NHS there are other communities developing knowledge bases, small and large, that are not necessarily under Crown Copyright yet but are still extremely difficult to get hold of, so there's a lesson for the Information Authority there. As a public funding body it has historically commissioned the development of very large components, such as the basic specification of Read Version 3, which, although Crown Copyright, are impossible to get hold of, and people tend to steer clear of them for fear of being sued.

So in summary: intelligent systems are desirable--that's sort of why the medical informatics community carries on, and certainly every time you go to MedInfo there'll be some clinician jumping up and down saying "My local system is appalling, please can I have something better?" And here we have the people to try to build one. But to get it all to happen at full scale, we do need these kinds of knowledge bases, and they must be able to interoperate. The relationship between them, the interfaces between them, is currently not really managed; they will be very numerous and very complicated, and therefore not centrally manageable, and the free-for-all will probably be better. And I would argue that the property of open source that makes it most suitable for this is not that it's free of charge--which appeals to sort of public-good sensibilities--and it's not that you can check to make sure it really does what it says; it's that you really require people to be stretching and bending this object to evolve it to what it can be. And the alternative, which is closed source: historically, it hasn't worked.

Chair: Am I right in thinking we should break for tea at 3.30?

Colin Smith: My suggestion is that we have the last talk, and then break for tea.

Chair: The choices are to break or to continue. What's the feeling? Break? Hands up! Carried. Could we hold it down to about 15 minutes?


Copyright Jeremy Rogers 2001. You may reproduce this transcript in any medium provided this copyright notice is also maintained. Transcription by Douglas Carnall: dougie@carnall.org