A YOUTUBE SERIES: TRANSFORMING HEALTHCARE DATA

Episode 3: Ingesting Data into FHIR with Don Rucker and Ethan Siegel

Don Rucker:

Welcome. Today, we’re talking about getting data into FHIR, data ingestion into FHIR. FHIR has captivated the healthcare IT industry, but obviously data doesn’t magically jump into anything; there’s a lot of work involved. It’s my great pleasure today to have a discussion with Ethan Siegel, who is going to talk to us about his experiences in getting data into FHIR. Ethan is a software engineer who has experience on both the provider and the payer side. As a software engineer at Harvard’s Dana-Farber Cancer Institute, he built a software program to do patient matching called MatchMiner that actually matches on genomic information, I believe. And here at 1up, Ethan has had a leadership role in designing our entire architecture for data ingestion. And maybe the same precision that he uses in the rest of his life has come into this data ingestion, because before all of this, he was, and still is, a classical violinist who has performed in many locations throughout the world. So Ethan, great to have you here.

Ethan Siegel:

Happy to be here.

Don Rucker:

Yeah. So why don’t you talk to us a little bit, just to get everybody on the same page, about what FHIR is and how you think about the type of data that might be put into FHIR?

Ethan Siegel:

Sure. So FHIR, in just a few words, is a modern data standard that enables the exchange of healthcare information. In FHIR terminology, these are referred to as resources, and there are maybe 150 different resources; you can think of them as units of information. Some examples of resources are patient, encounter, explanation of benefits, condition, procedure, really anything that might encapsulate some kind of clinical or claims information. Sometimes they’re even more specialized by use case, say for oncology or financial transactions. So that’s the output, the FHIR output that we’re talking about: creating these resources from different inputs. On the other side, the input side, what are the kinds of inputs that we’re trying to transform into FHIR? I often think of these in two categories: claims and financial information that’s transformed into FHIR, and clinical information.
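To make that concrete, here is roughly what one of those resources looks like on the wire: a minimal, hypothetical FHIR Patient in JSON, expressed here as a Python dict. Real resources carry many more elements.

```python
import json

# A minimal, hypothetical FHIR Patient resource. Production resources
# typically carry identifiers, extensions, contact info, and more.
patient = {
    "resourceType": "Patient",
    "id": "example-123",
    "name": [{"family": "Smith", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1957-03-14",
    "address": [{"line": ["123 Main St"], "postalCode": "02110"}],
}

# FHIR resources are exchanged as JSON (or XML) over REST APIs.
print(json.dumps(patient, indent=2))
```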

           So sometimes the clinical information comes already transformed from EHRs or health record systems like Epic or Cerner. They have FHIR endpoints, methods of retrieving data directly from the clinical site in the FHIR format. But other times it doesn’t, and this is where things can get really tricky and complicated, and it’s what we’ll talk more about today. On the claims side, there’s actually quite a long journey for the claims data before it arrives at a company like 1up. The life cycle of a claim often starts out in the same place as the clinical data, in a health record system, input by a nurse or a physician or a physician’s assistant, or maybe someone working in a lab system.
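As a sketch of what retrieving data from one of those FHIR endpoints looks like in practice (the base URL and token here are hypothetical; real EHR endpoints require SMART on FHIR / OAuth2 registration):

```python
import requests

# Hypothetical FHIR endpoint and token, for illustration only.
BASE_URL = "https://ehr.example.com/fhir/R4"
TOKEN = "..."  # obtained via an OAuth2 flow in a real integration

resp = requests.get(
    f"{BASE_URL}/Patient",
    params={"_count": 50},
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/fhir+json",
    },
)
resp.raise_for_status()

# FHIR searches return a Bundle resource wrapping the matches.
bundle = resp.json()
for entry in bundle.get("entry", []):
    print(entry["resource"]["resourceType"], entry["resource"]["id"])
```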

           But from there, it will get put into a claim, and the claim will then make its way to a payer’s clearinghouse, an insurance company’s clearinghouse, and from there into a claims engine of some kind (QNXT is one piece of software used for this), where it will be processed and adjudicated, along with any appeals on that claim and lots of other things that happen to it in that process. And then once it’s fully adjudicated, it makes its way back to the clearinghouse, and sometimes then gets reported to CMS, the Centers for Medicare & Medicaid Services, if it’s a Medicare patient. And after all of that, it will end up back in a payer data warehouse, stored in some database somewhere, and that’s usually the point at which a company like us at 1up will eventually get the data. So that’s the whole context that we’re talking about here: getting data from either those warehouses or the clinical systems into the FHIR format, and we can talk more about how we do that.

Don Rucker:

Yeah. So 1up, where we’re both employees, would get that data on behalf of a payer, to be clear, who’s using it for the CMS payer-to-payer requirements or some other allowed purpose under treatment, payment, and operations within the privacy protections; those are the auspices of that data. Obviously, healthcare is incredibly complicated. There’s a lot of variability. FHIR is, I think, designed to materially reduce that variability and to be a little bit of a lingua franca for data. But how do you think about the variability?

Ethan Siegel:

Yeah, variability is the core problem that we deal with. It’s really hard. There’s a ton of different file formats. There’s X12 and other EDI formats, there’s DICOM for images. There’s a format called T-MSIS, which is a Medicaid-specific file format. There’s CDA and CCDA, HL7v2. And before we even talk about genomic data, there’s VCF files for variant calls, and BAM files, and it’s just endless. So there’s an enormous variety of file types with a very long history. Sometimes there are multiple versions of files, and all kinds of problems you can run into trying to transform from one format to another. So really, the way we’ve tried to deal with it is to be a little bit more prescriptive and standardized about how we transform data. Especially when talking about payer data coming from claims systems and claims warehouses, we’ve tried to make recommendations for extract formats.

           So for example, if there is a lot of claims data living in some kind of SQL database, a relational database, we’ll ask for that data to be formatted and extracted in a specific format, which we can then transform very quickly. We’ve done these transformations and implementations a number of times, and if we receive data in this specified format, we can transform it in a matter of hours. So it’s a very, very quick turnaround at huge volume: hundreds of megabytes and sometimes gigabytes of text data. Sometimes that’s not possible, and there are many reasons why it just hasn’t been feasible to manipulate or extract the data in this way, so we can actually work together with different payers to do more customized transformations, where maybe it’s not easily possible for in-house IT teams to meet the specifications that we have. So we can meet the client where they are and do these transformations based on many different kinds of data schemas and structures.
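The gist of an extract specification is a fixed, documented set of files and columns that an in-house IT team can produce from a SQL export. Here is a purely hypothetical illustration of the idea; these file and column names are made up, not 1up’s actual specification:

```python
# Purely illustrative extract layout; the file and column names are
# hypothetical, not an actual specification.
EXPECTED_EXTRACT = {
    "patients.csv": ["member_id", "first_name", "last_name", "dob", "gender", "zip"],
    "claims.csv": ["claim_id", "member_id", "service_date", "billing_npi", "total_paid"],
    "claim_lines.csv": ["claim_id", "line_number", "procedure_code", "paid_amount"],
}

def missing_columns(filename: str, header_row: list[str]) -> list[str]:
    """Return expected columns absent from a received file's header,
    so a malformed extract can be flagged before any transform runs."""
    return [col for col in EXPECTED_EXTRACT.get(filename, [])
            if col not in header_row]
```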

Don Rucker:

So how do you actually go about doing that? What are the moving pieces to actually do that? What pieces of software, what approaches?

Ethan Siegel:

Yeah, so there’s a number of different approaches depending on the types of files that we’re transforming. Two examples that come to mind are CCDA files and HL7v2 files, two kinds of files that we encounter quite often, and we have standalone converters that receive as input the CCDA or HL7v2 file and output a FHIR resource or multiple FHIR resources. So just to be specific, maybe one CCDA comes in and then 20 FHIR resources might come out. So we’d say, “Here is your patient resource. Here is the condition that was included.”
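As a toy illustration of the standalone-converter idea, here is a hand-rolled sketch that pulls a few fields out of an HL7v2 PID segment and emits a FHIR Patient. A real converter handles escaping, repetitions, dozens of segment types, and far more edge cases:

```python
def hl7v2_pid_to_fhir_patient(message: str) -> dict:
    """Toy converter: build a FHIR Patient from an HL7v2 PID segment.
    HL7v2 segments are pipe-delimited; components use the caret."""
    for segment in message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] != "PID":
            continue
        family, _, given = fields[5].partition("^")  # PID-5: patient name
        dob = fields[7]                              # PID-7: DOB, YYYYMMDD
        gender = {"M": "male", "F": "female"}.get(fields[8], "unknown")  # PID-8
        patient = {
            "resourceType": "Patient",
            "name": [{"family": family, "given": [given] if given else []}],
            "gender": gender,
        }
        if len(dob) >= 8:
            patient["birthDate"] = f"{dob[:4]}-{dob[4:6]}-{dob[6:8]}"
        return patient
    raise ValueError("no PID segment found")

msg = "MSH|^~\\&|LAB|HOSP|||20220801||ADT^A01|1|P|2.5\rPID|1||12345||Smith^Jane||19570314|F"
print(hl7v2_pid_to_fhir_patient(msg))
```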

Don Rucker:

Yeah, because CCDA is the Consolidated Clinical Document Architecture. So for folks who are not deep in the specifics, you can think of that as the electronic medical record, or as a precursor to the US Core Data for Interoperability, the broader thing of which CCDA is a part. With HL7, classically version 2, referring to lab results. If you had to give a super, super simplistic description.

Ethan Siegel:

Right, yeah. And I guess the other salient point-

Don Rucker:

It’s not that simple.

Ethan Siegel:

Yeah, the other salient point is that because the input data for these file types is already structured in a very specific way, we can transform them in a standalone fashion. So we can have converters that just receive these file types as input and output FHIR, and that’s somewhat simpler from an engineering perspective than some of these more custom transformations. So, like you were asking earlier, how do we actually do this transformation? Say we receive 20 files from a client, and these files come directly from a data warehouse. They’re CSV files, comma-separated value files, and they’ll have headers, and each column will have a header.

           So in the simplest case, say we’ve received a file called patients.csv, and that patients file will have a first name and a last name and maybe address line one and zip code and so on. And we will take those columns from that file and map them to individual elements in a FHIR resource. So a FHIR resource may also have a corresponding concept of a first name and a last name and a zip code and so on, and we’ll do that mapping one to one, from each column in the source systems or the source files to its corresponding location in the FHIR resource. That tends to be a time-consuming process and is quite difficult to do correctly at scale. And so we actually built an internal tool, we call it DIMA internally, which stands for Data Ingestion and Mapping Administration.
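In code, that one-to-one column mapping might look something like the following sketch. The column names are hypothetical, since every source system names these fields differently:

```python
import csv

def row_to_patient(row: dict) -> dict:
    """Map one CSV row to a FHIR Patient. The source column names here
    are hypothetical; each payer's extract names them differently."""
    return {
        "resourceType": "Patient",
        "name": [{"family": row["last_name"], "given": [row["first_name"]]}],
        "address": [{
            "line": [row["address_line_1"]],
            "postalCode": row["zip_code"],
        }],
    }

with open("patients.csv", newline="") as f:
    patients = [row_to_patient(row) for row in csv.DictReader(f)]
```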

Don Rucker:

A lot of imagination in the naming there, right? Yeah, we call it precisely what it is.

Ethan Siegel:

Yeah, a very creative name, I guess.

Don Rucker:

Yeah.

Ethan Siegel:

But it’s a very useful tool that we found has cut down the implementation time by at least two or three times. So the tool allows you-

Don Rucker:

Two or three times or two or three orders of magnitude?

Ethan Siegel:

Orders of magnitude faster conversion.

Don Rucker:

Yes, okay.

Ethan Siegel:

And it allows you as an analyst to receive a file as an input and to just visually map these columns to their locations in the resource, and then validate immediately. So you can generate these FHIR previews instantaneously, which makes it very simple to do custom mapping with the visual and validation components built in. So really, it takes minutes to do this mapping instead of hours and a lot of engineering time.
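Conceptually, a tool like that turns the mapping itself into data: a declarative spec from source column to a path in the target resource, which a generic engine applies and previews. This is a minimal sketch of the general technique, not DIMA’s actual implementation:

```python
# Declarative column-to-FHIR-path mapping (a sketch of the general
# technique, not DIMA itself). An analyst edits the spec; a generic
# engine applies it to any row and previews the result.
MAPPING = {
    "first_name": "name.0.given.0",
    "last_name": "name.0.family",
    "zip_code": "address.0.postalCode",
}

def set_path(obj, path: str, value):
    """Set a dotted path like 'name.0.given.0', creating intermediate
    dicts and lists as needed."""
    keys = [int(k) if k.isdigit() else k for k in path.split(".")]
    for key, nxt in zip(keys, keys[1:]):
        child = [] if isinstance(nxt, int) else {}
        if isinstance(key, int):
            while len(obj) <= key:
                obj.append(None)
            if obj[key] is None:
                obj[key] = child
            obj = obj[key]
        else:
            obj = obj.setdefault(key, child)
    last = keys[-1]
    if isinstance(last, int):
        while len(obj) <= last:
            obj.append(None)
    obj[last] = value

def preview(row: dict) -> dict:
    """Instant FHIR preview of one source row under the current mapping."""
    resource = {"resourceType": "Patient"}
    for column, path in MAPPING.items():
        if row.get(column):
            set_path(resource, path, row[column])
    return resource

print(preview({"first_name": "Jane", "last_name": "Smith", "zip_code": "02110"}))
```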

Don Rucker:

Cool. So obviously, this type of work, the devil’s in the details. So we’ve mapped, I don’t know, many, many millions of patients’ records at this stage. What have you seen as the stuff that’s hard? What are the challenges as you operationalize this that you have to sort through?

Ethan Siegel:

So there are a lot of challenges. It’s very hard. I would categorize five big challenges when it comes to mapping. In sequence, I’d say the volume of data is a huge problem, the structure of the received data can be a big problem, the actual content in the data can present quite a few problems as well, making the data conform to the rules of the FHIR specification is difficult, and data quality, ensuring that quality is maintained or even improved. So I’ll just go through these in a little more detail, but those would be the big categories of issues.

           So right off the bat, volume. For certain payers, especially ones that deal with Medicare or Medicaid data, it’s an enormous amount of data to sift through and process. And whenever you’re dealing with data at that volume, there are many, many edge cases that only appear after having processed a certain number of records, which presents lots of issues in terms of reprocessing, scaling infrastructure, and managing the reporting needed to surface them. So that’s one whole category of issues that we’ve dealt with and built different systems to accommodate.

Don Rucker:

All of this is running in the cloud, right?

Ethan Siegel:

Yes. So all of our systems run in the cloud on scalable systems. Most of them are serverless, so they will scale as needed and are able to handle large throughputs. I believe we’re able to handle about 18 million resources per hour. We can scale up to that amount, if not further; we just haven’t had to. But it’s quite a large capacity that we have for data size. The other issue that I mentioned earlier is structure. For better or for worse, much of the data that we receive to transform comes from relational databases, which are highly normalized, which really just means they’ve been structured to reduce duplication. The FHIR standard is also structured in a way that reduces duplication, but it’s often very, very different from the way that the source systems have been structured.

           One example of this is how, in a claim, the information contained in the claim header and the information contained in the claim line items, say, the actual details of each procedure or medication that was received, are stored. In FHIR, this is expected to be stored in one resource. The resource is called ExplanationOfBenefit, and it’s expected that all the line items are included in every claim. But in most source systems, this is not the way the data’s stored. The header and lines are stored in separate tables with a link between them. And so reconciling these types of issues is something that can get quite tricky, especially in bigger FHIR resources like claims. So yeah, the FHIR spec can be quite strict. On the content side, and this is where people more traditionally talk about data quality or data cleanliness, we find lots of issues all the time. There are often inconsistent fields, where sometimes they’re filled out and sometimes they’re not, sometimes there are multiple values put in one field, and sometimes they’re just missing.
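A sketch of that denormalization step, joining hypothetical header and line-item tables back into the single nested resource that FHIR expects:

```python
from collections import defaultdict

def build_eobs(headers: list[dict], lines: list[dict]) -> list[dict]:
    """Fold separate claim-header and claim-line rows (hypothetical
    schemas) into nested ExplanationOfBenefit resources, one per claim."""
    lines_by_claim = defaultdict(list)
    for line in lines:
        lines_by_claim[line["claim_id"]].append(line)

    eobs = []
    for hdr in headers:
        items = sorted(lines_by_claim[hdr["claim_id"]],
                       key=lambda l: l["line_number"])
        eobs.append({
            "resourceType": "ExplanationOfBenefit",
            "id": str(hdr["claim_id"]),
            # FHIR nests every line item inside the claim resource, even
            # though the source keeps them in a separate, linked table.
            "item": [
                {
                    "sequence": line["line_number"],
                    "productOrService": {
                        "coding": [{"code": line["procedure_code"]}]
                    },
                }
                for line in items
            ],
        })
    return eobs
```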

           And oftentimes, the data is referentially incomplete, meaning that if the data says a patient was taken care of by this doctor, and the doctor is referenced by an NPI number, a national provider identifier, sometimes we don’t have information from that same payer about who the doctor is. Where do they practice? What facility do they practice at? And some of the missing links there are things that we try to fill in with other methods. So that’s another big problem. Conformance, too. The FHIR standard is very strict. Sometimes it will require that, say, every patient resource must have a birth date, for example, or a gender. And if it doesn’t have a birth date or gender, it’s invalid, and we cannot ingest it into the FHIR server, and it cannot be surfaced in any downstream application or analytic. And this is usually okay until we run across instances where our source data does not have this value or the source data is malformed. So that creates more problems, and that’s another category of issue that we face.

Don Rucker:

Yeah, there are a lot of pieces to the quality. Because ultimately, the quality of your analysis, whatever you do with it, whether you’re powering an app or doing analysis or doing risk stratification or underwriting, or quality on the provider side, identifying people at risk, population management, all of that is ultimately dependent on quality: quality in and quality out, or I guess in computer science terms, GIGO, which is garbage in, garbage out. But how do we deal with that?

Ethan Siegel:

Oh yeah, we spend a lot of time thinking about data quality because it is so critical for just about any application that you’d want to use the data for, whether it’s a clinical decision support system, whether it’s risk adjustment, whether it’s actuarial modeling, whether it’s anything. If your underlying data quality is not sound, then you have big problems. So we have a number of automated and manual processes set up to make sure that the data we’re transforming is of a uniform and high quality. There’s a manual review process that happens at every stage: before the data is transformed into FHIR, during the actual mapping process, and after ingestion. Beforehand, we’ll do a data profile just to eyeball the data as we receive it. Does it look right? Are the fields generally indicative of what’s in the content of the file, and so on?

           During the mapping process, we also have multiple reviews. And during ingestion, we automatically validate every FHIR resource that’s created for structure and content conformance to the FHIR specification. So does a claim have a created date, for example? Does it have a diagnosis? Is it missing a procedure code? Does every practitioner have an NPI number, and so on? There are hundreds and hundreds of these rules, which we run on every single FHIR resource that we create. So no FHIR resource that’s created ever makes it to our final server if it’s not conformant to the R4 standard. That’s a hard guarantee that we will not ingest any malformed resources. There are other initiatives that we’re currently working on to, at a more meta level, try to flag issues earlier in the pipe, so to speak. One of them is around benchmarking: setting up different rules and processes that scan the data we’re receiving to see if there are any anomalies.
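In spirit, those conformance checks look something like a table of rules run against every resource. A simplified sketch, not the actual validation engine or rule set:

```python
# Simplified sketch of per-resource conformance rules (not the actual
# engine). Each rule: (resource type, human-readable message, predicate).
RULES = [
    ("Patient", "patient must have a birth date",
     lambda r: bool(r.get("birthDate"))),
    ("Patient", "patient must have a gender",
     lambda r: r.get("gender") in {"male", "female", "other", "unknown"}),
    ("ExplanationOfBenefit", "claim must have at least one line item",
     lambda r: bool(r.get("item"))),
    ("Practitioner", "practitioner must have an NPI identifier",
     lambda r: any(i.get("system") == "http://hl7.org/fhir/sid/us-npi"
                   for i in r.get("identifier", []))),
]

def validate(resource: dict) -> list[str]:
    """Return failure messages; an empty list means the resource passes
    and may proceed to the FHIR server."""
    return [msg for rtype, msg, check in RULES
            if resource.get("resourceType") == rtype and not check(resource)]
```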

           So one example might be that we know for one payer, for example, that the average yearly spend for a Medicare patient should be about $12,000. And if we notice that a patient has $75,000 in spend, that might trigger a notification for a manual review. Similarly, we have another system that we’re actively building now to catch changes in trends. So if we expect to receive maybe 20,000 resources every day, and one day we receive 200,000 or we receive zero, that would be another event to trigger a notification. And then finally, standards around clinical data. So if we receive a piece of clinical data where a patient is shown to have a heart rate of 500 beats per minute, that’s probably something where we made a mapping mistake, or there’s an issue with the underlying source data, and we really need someone to review that. So these are just some examples of different ways we can set up rules and triggers to allow for more targeted investigation into quality issues.
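Those anomaly triggers amount to threshold rules over aggregates rather than over individual fields. A minimal sketch, with made-up thresholds standing in for real, per-payer benchmarks:

```python
def spend_anomalies(spend_by_patient: dict[str, float],
                    expected_avg: float = 12_000.0,
                    multiple: float = 5.0) -> list[str]:
    """Flag patients whose yearly spend is far above the payer's expected
    average. The thresholds here are illustrative, not real benchmarks."""
    return [pid for pid, spend in spend_by_patient.items()
            if spend > expected_avg * multiple]

def volume_anomaly(todays_count: int, expected_daily: int = 20_000,
                   tolerance: float = 0.5) -> bool:
    """True if today's resource volume deviates badly from the usual
    trend, e.g. 200,000 or zero against an expected 20,000."""
    return todays_count == 0 or \
        abs(todays_count - expected_daily) > expected_daily * tolerance

def implausible_heart_rate(bpm: float) -> bool:
    """Physiologically implausible vitals usually indicate a mapping
    mistake or a source-data issue that needs manual review."""
    return not 20 <= bpm <= 300
```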

Don Rucker:

Cool. So one of the interesting things is: why FHIR? Why has FHIR captured the attention of the healthcare IT community? And maybe I’m being defensive, but I don’t think it’s purely because ONC put out rules on this; really, it was the Cures Act, where Congress required data to be available through APIs with public data standards as opposed to proprietary data standards. But in doing that rulemaking, it wasn’t like there were 10 choices of standards. FHIR was far and away the only standard that really made sense for modern representation of clinical data. Why is that? I think a lot of people know this, but we’re guessing some folks may not, and it’s important. So why FHIR?

Ethan Siegel:

I think you said it earlier. The standardization of the data formats is huge. We talked a little bit earlier about the wide variety of formats that have come before, and having a single format allows a lot of things to work better. Just take analytics as an example. Even classical analytics, or machine learning and more advanced analytics, just work a lot better when your underlying data is structured in the same way. Take, for example, wanting to run the same analytic on one clinical dataset versus another clinical dataset from another EHR system. You can’t really do that if they’re using two different schemas, two different data formats, unless you do a lot of work to clean them up and standardize them. If you use FHIR, you dramatically reduce the amount of work you have to do to get the same insight out of the data.

           So some examples: maybe identifying high-cost claimants, or patients with high utilization, or doing risk analysis or risk adjustment. Any of these kinds of analyses are dramatically simplified if you use a common standard, and they’re reusable. What you develop for one client or one payer or one dataset is much closer to being reusable on others, if not universally. So that’s a huge one, standardization. I think the other big benefit is that there’s a very robust ecosystem of open source tools. Because the data standard is public, anyone can develop against it, and it uses modern technologies and modern approaches like JSON as a serialization format, so any framework or any programming language can be built to use FHIR. That makes it very easy to reuse different pieces, whether they’re large or small, whether it’s a FHIR server or extensions. Because it’s all open, it’s much simpler and much easier to adopt for any organization, commercial or government.
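For instance, once claims are in FHIR, an analytic like finding high-cost claimants reduces to a short, reusable function over ExplanationOfBenefit resources. This sketch reads R4’s payment.amount element with an arbitrary threshold; real claims may carry their totals in other elements, such as total:

```python
from collections import defaultdict

def high_cost_claimants(eobs: list[dict],
                        threshold: float = 50_000.0) -> dict[str, float]:
    """Sum paid amounts per patient across FHIR R4 ExplanationOfBenefit
    resources and return patients over an (arbitrary) threshold. Because
    the input is standard FHIR, the same function runs unchanged on data
    from any payer."""
    totals: dict[str, float] = defaultdict(float)
    for eob in eobs:
        patient_ref = eob["patient"]["reference"]  # e.g. "Patient/123"
        paid = eob.get("payment", {}).get("amount", {}).get("value", 0.0)
        totals[patient_ref] += paid
    return {pid: total for pid, total in totals.items() if total >= threshold}
```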

Don Rucker:

Yeah. FHIR is the healthcare version of JSON. JSON, JavaScript Object Notation, is the data format that pretty much all apps on your smartphone use to communicate back with their servers over RESTful APIs. So it is everywhere. If you have 100 apps on your phone, and probably most people have more these days, almost all of them have to communicate with a server in order to have some business model for writing the app. And most of those communications, maybe not all, but most, are done with the JSON data standard. So finally, it’s in healthcare.

Ethan Siegel:

Yeah. It’s really transformative to have a common data standard across the entire healthcare ecosystem. Of course, it’s still very much a work in progress, but as more data gets into the standard, the opportunities that are unlocked by that standardization are really enormous. So it’s a very exciting time to work in this industry.

Don Rucker:

Let’s just close with a little bit… Can you give us a little sense of the scale, for folks in their own setting, whether they’re at a provider or a payer or in some other part of the healthcare ecosystem? What kinds of scale are we talking about here in generating these FHIR resources? What have we seen?

Ethan Siegel:

So I think in terms of total lives on the 1up platform, as of today, in August 2022, there are about 33 or 34 million lives on the platform. And on the ingestion side, when we’re looking at resources, occasionally we’ve had to hit that number that I mentioned earlier, 18 million resources an hour of ingestion. And for certain clients, we have hit a billion resources ingested across multiple environments. It’s not all in production, some of it’s in our development and staging environments, but it can vary quite a bit. For populations that have lots of health issues, which tend to be a little older, there can be quite a long history of data to transform. One that comes to mind: we were doing some analysis of a dataset recently where the average number of events per patient was about 500. So there were 500 discrete events that we could track as points in time over a number of years for this large patient population. And I don’t think that’s unusual, especially for patients as they age, and especially in Medicare and Medicare Advantage populations.

Don Rucker:

Cool. Any closing thoughts as folks think about how they might get their data into FHIR? Any closing thoughts on this, or just encouragement to go and do it? Because that is the way of the modern world, and pretty much any app that you’re going to want to power your corner of the healthcare business is, at least I think, going to run more efficiently on FHIR, and in many cases would only be possible in FHIR; there’s not a plausible way of doing it otherwise. Maybe we close with your thoughts on combining claims and clinical data?

Ethan Siegel:

Sure. I think I would say it is difficult to transform data into FHIR, and I can’t really hedge that statement. It’s hard. But the benefits that you can get are really enormous. Part of it is combining the claims and clinical datasets, like you were saying, but other parts of it are around reusing analytics that may have been developed by other people or other teams at other companies or other institutions. Previously, if someone took a dataset from someplace, some open source dataset or some closed-system dataset, and developed some great analytic on it, you would say, “Great, that’s nice,” and move on.

           But now, if someone does that using the FHIR standard, it is very, very easy to adopt that analytic for internal use. And so while there is an investment that you have to make to transform data into FHIR at whatever institution or organization you’re at, the dividends that come from doing that, and from setting up processes to do it in real time or near real time or continuously, so that analytics systems and clinical decision making systems and actuarial systems can run on top of FHIR, are huge. I think the real benefit is probably two to five years out. But that is a huge, huge benefit to be had for the people who can get in early now.

Don Rucker:

Cool. As probably many folks know, at the end of this calendar year, 2022, certified electronic medical records will have a FHIR API for patients to get their own data and download it, with the US Core Data for Interoperability, that component of it, converted into FHIR. And those same electronic medical records will have the Bulk FHIR API, so that if providers want to look at their entire population, they’ll be able, natively, without conversion, without what we’ve all talked about here, to look at their data in FHIR, and maybe have much more interesting business models as the whole world of providers and payers merges, with all these interesting payvider combinations that are really a business version of integrating claims and clinical data. So Ethan, thank you very much for giving us your solid experience with FHIR, and good luck.

Ethan Siegel:

Thank you.

In this video, Dr. Don Rucker, 1upHealth’s Chief Strategy Officer and former National Coordinator for Health IT, sits down with Ethan Siegel to discuss the ingestion of data into FHIR. This is Episode 3 of Dr. Rucker’s YouTube series, Transforming Healthcare Data.

At 1upHealth, we’re the healthcare industry’s most complete FHIR data platform. Along with our managed platform, best-in-class tools, and FHIR APIs, we provide scalable, serverless, and secure solutions to support your business programs, workflows, and analytics.