Big Data Getting Bigger: Analysis and Sharing Critical as Genomics Grows Up

Recently, PLoS Biology published a paper about big data in genomics from lead author Zachary Stephens, senior author Gene Robinson, and their collaborators at the University of Illinois at Urbana-Champaign and Cold Spring Harbor Laboratory.

The perspective offers a bold vision of the projected growth of data generation and management in the field. The authors compare genomic data to other domains known for prodigious data production (astronomy, Twitter, and YouTube) and offer solid documentation for their projection that, 10 years from now, genomics could outpace all other big data fields. (Check it out: “Big Data: Astronomical or Genomical?”)

One of the areas they focused on was data analysis, a category that’s near and dear to the QIAGEN Bioinformatics team. Stephens et al. call out variant interpretation as one of the most computationally intensive processes for genomic data. Projecting out to the number of genomes that could be available by 2025, they write, “Variant calling on 2 billion genomes per year, with 100,000 CPUs in parallel, would require methods that process 2 genomes per CPU-hour, three-to-four orders of magnitude faster than current capabilities.”
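The throughput target quoted above can be sanity-checked with simple arithmetic. A minimal sketch, assuming the round numbers from the quote (2 billion genomes per year, 100,000 CPUs) and CPUs running around the clock all year:

```python
# Back-of-envelope check of the variant-calling throughput figure.
# Assumptions: 2 billion genomes/year, 100,000 CPUs in parallel,
# every CPU busy for all 8,760 hours of the year.
GENOMES_PER_YEAR = 2_000_000_000
CPUS = 100_000
HOURS_PER_YEAR = 365 * 24  # 8,760

cpu_hours_per_year = CPUS * HOURS_PER_YEAR
genomes_per_cpu_hour = GENOMES_PER_YEAR / cpu_hours_per_year
print(f"{genomes_per_cpu_hour:.1f} genomes per CPU-hour")  # → 2.3 genomes per CPU-hour
```

This lands right on the authors' figure of roughly 2 genomes per CPU-hour; any downtime or scheduling overhead would push the required per-CPU throughput even higher.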

We think solutions that take advantage of cloud-based computing will be critical for overcoming this hurdle. The less scientists have to move data around and rely on limited in-house clusters to perform complicated queries, the more they can focus on what’s really important: results. This is one of the reasons our applications are web-based. Users of Ingenuity® Variant Analysis™, for instance, simply upload their variant list and let us worry about the computational resources. Clearly, further improvements are needed to analyze genomes at enormous scale, and our R&D team will work closely with others in the field to make sure we’re ready to meet the challenge.

Another great point in the scientists’ perspective was their insistence on data sharing across labs and institutions. “For precision medicine and similar efforts to be most effective, genomes and related ’omics data need to be shared and compared in huge numbers,” the authors write. “If we do not commit as a scientific community to sharing now, we run the risk of establishing thousands of isolated, private data collections, each too underpowered to allow subtle signals to be extracted.”

We heartily support this statement and are proud to be co-founders of a leading initiative aimed at facilitating this kind of sharing — the Allele Frequency Community. When we and our fellow founders first conceived the community, data sharing was one of our most important goals. That’s why we adopted a share-and-share-alike approach for AFC, letting all scientists use the data as long as they contribute their own allele frequency data in exchange. This proviso has led to remarkable growth for the community in its first six months, steadily making the resource more valuable to everyone using it. We think there are opportunities to use a similar approach for other types of genomic data and hope others are inspired to try it.

Kudos to Stephens et al. for a thought-provoking commentary that has gotten the whole field talking about what the future of genomics might look like!