Getting to Know The Ingenuity® Knowledge Base: Manual Curation (Part 1 of 4)

In this blog series over the next few weeks, we’ll take a closer look at the story behind the Ingenuity Knowledge Base. As many readers already know, the Ingenuity® Knowledge Base is the data engine that powers Ingenuity Variant Analysis, Ingenuity Pathway Analysis, and our new Ingenuity Clinical decision support tool. It is the brain that powers advanced algorithms to help researchers understand their genomics data, make connections, find resources, and rapidly advance their science.

Today, we look at the curation process through which we feed information into the Knowledge Base. Many people assume that a database like ours can be built by automatically scraping content from other data and literature repositories, but that is not the case.

When Ingenuity was first founded, the team tried every curation approach they could imagine: automated natural language processing approaches, semi-automated approaches, custom-built technology, third-party technology, etc. “We tried every method under the sun to get to scale,” remembers co-founder Ramon Felciano. But the quality of content just wasn’t there. “The only way that we were able to do it reliably was by hiring this army of curators and ontologists and biological modelers to build, structure and integrate these content resources by hand,” Ramon adds.

At the time, nobody at then-startup Ingenuity Systems had planned on pursuing manual curation, the slowest, most costly method of data acquisition out there. Realizing that it was the only reliable route was a turning point; the founders threw out earlier versions of the Ingenuity Knowledge Base and started over. “That’s when we doubled down and said if we’re going to do this we’ve got to do it right,” Ramon recalls. “It might take us five years to build this manually but we don’t see any other path, and we think everyone else who’s chasing these alternate methods is going to fail.”

What led to the manual curation choice was data quality. The founders’ goal was to build a foundation that could help scientists analyze their data and quickly make a decision about what to do next based on those results. That required high accuracy, breadth of coverage, and detailed context and annotation of information. Automated approaches for extracting data from published literature had too many errors: data that was mis-captured or mis-represented, as well as a general lack of context — a scan of title and abstract didn’t cut it. Some of the best information came from figures and tables, which automated approaches could not process at all.

Expert curation, on the other hand, could process all of the content of a paper and fully understand its importance. “You need an expert to understand what it means,” says Sara Tanenbaum, our director of content. “An expert can read the captions and the text to know what are the numbers in column A and what are the numbers in column B.” That human element provides a level of detail that isn’t possible with other methods. At the same time, we are always careful to make sure our experts are not inserting their own bias or interpretation into the captured computable knowledge. We accomplish this by providing our curators with software extraction tools that capture the exact text and findings directly from the paper. It’s their expertise in understanding, identifying and extracting the important scientific facts that powers the process in a way that purely automated algorithms cannot.

Another benefit of manual curation is the ability to target certain kinds of content. To build the capabilities of the Ingenuity tools in certain areas, we can task experts to read the literature for specific information to help keep the Knowledge Base up to date and rock-solid for users. Naturally, supplementing this work with semi-automated curation approaches where appropriate lets us add considerably more content than our expert curators could process on their own.

Be sure to check out our next Ingenuity Knowledge Base blog post, where we’ll describe the ExpertAssist Findings program.