Tony Zeljkovic

Cloud-Driven Genomics: How a Diagnostics Firm Scaled Human Genome Sequencing with Zelytics

Executive summary

As diagnostics companies expand into human clinical genome sequencing, legacy infrastructure often becomes a bottleneck.


A leading diagnostics firm hit this bottleneck with its on-premise high-performance computing (HPC) cluster: compute performance issues, limited cloud expertise, and slow queries on large genomic datasets.


The company needed a scalable, cost-effective solution to meet strict clinical and regulatory demands.


Zelytics partnered with the client to deliver a cloud-based genome processing pipeline on AWS, leveraging hardware-accelerated technologies such as NVIDIA Parabricks and Illumina’s DRAGEN.


This pipeline reduced genome processing times to under two hours, handling thousands of genomes at 30-40X coverage.


In addition to processing improvements, Zelytics optimized secure data transfer between cloud environments, significantly accelerating transfer times.


Next, by integrating Snowflake with Nextflow, Zelytics enabled high-performance querying of semi-structured data, making it easier to analyze and derive insights.


A custom Nirvana-annotated data parsing solution further streamlined data analysis, reducing query times by 50% and cutting infrastructure costs by 60%*.


Key outcomes included:


  • Faster Genome Processing: A sub-two-hour pipeline* drastically reduced time-to-insight for large-scale genome sequencing.

  • Cost Savings: Infrastructure costs were reduced by 60% with optimized cloud architecture.

  • Improved Query Performance: Variant call analytical query performance was accelerated by 50% through efficient handling of semi-structured data.

  • Scalability: The cloud solution enabled the client to scale their genome sequencing operations by several orders of magnitude*.


*Interested in how we measured these results? Reach out to learn more about the technologies and processes we used.


This solution empowered the diagnostics company to deliver clinical-grade genome sequencing at scale, setting the stage for future growth in the rapidly evolving genomics field.


 

Context & Background

Clinical human genome sequencing for diagnostics is a cornerstone of genetics-based health care. It is also one of the trickiest capabilities for diagnostics companies to implement well, due to stringent clinical, technological, and regulatory requirements.


Legacy systems still dominate at many organizations in this space, which leads to critical bottlenecks in turnaround time, scalability, and the insights generated from the data.


In this article, we dive deeper into a case study of a diagnostics company wanting to expand into human clinical sequencing.


Pain Points & Challenges


  • The client was running into significant performance issues with their on-premise high-performance computing cluster, which were expected to be a deal breaker when expanding into human genome sequencing.

  • The client was interested in exploring cloud solutions but had limited experience with the pros and cons of cloud computing versus HPC, and was unclear on which platforms would be best suited to their use case.

  • Past experience with analytical query performance on large human genome datasets had been disappointing with previous data warehouse solutions.

  • The client had previous experience integrating semi-structured annotations into clinical diagnostics pipelines but needed support converting them to a structured database format.

  • Reproducibility and data provenance were identified as key bottlenecks in previous R&D efforts.


Scope


  • Improve time-to-insight and the volume of insights by at least an order of magnitude, at a cost profile 50% lower than current implementations.

  • Determine the best value-for-money cloud platforms for the use case.

  • Migrate on-prem infrastructure to the cloud.

  • Develop a clinical-grade ACMG-standard end-to-end bioinformatics pipeline for processing, annotation and analysis of thousands of human genomes.

  • Solutions must be HIPAA/HITECH compliant.


Solutions


Objective: Develop a highly reproducible, highly scalable, sub-two-hour human genome sequencing pipeline in the cloud


The first objective was to set up a bioinformatics pipeline and accompanying infrastructure that could process arbitrary amounts of human genome sequencing data at 30-40X coverage, with sub-two-hour processing times and right-sized computational resources.
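To give a sense of the data volume behind "30-40X coverage", here is a minimal back-of-the-envelope sketch. It uses the standard relation coverage = (sequenced bases) / (genome size). The genome size of roughly 3.1 billion bases and the 2 x 150 bp paired-end read layout are assumptions chosen for illustration; the case study does not state the client's actual sequencing configuration.

```python
import math


def read_pairs_for_coverage(coverage: float, genome_size_bp: int, read_length_bp: int) -> int:
    """Return the number of paired-end read pairs needed for a target mean coverage."""
    bases_needed = coverage * genome_size_bp   # total bases that must be sequenced
    bases_per_pair = 2 * read_length_bp        # each pair contributes two reads
    return math.ceil(bases_needed / bases_per_pair)


GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate human genome size
READ_LENGTH_BP = 150            # assumption: 2 x 150 bp paired-end reads

for cov in (30, 40):
    pairs = read_pairs_for_coverage(cov, GENOME_SIZE_BP, READ_LENGTH_BP)
    print(f"{cov}X coverage: about {pairs / 1e6:.0f} million read pairs")
```

Under these assumptions a single 30X genome already corresponds to roughly 310 million read pairs, which is why alignment and data transfer dominate the processing budget.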


After some back and forth, several limiting factors were identified in the process:


  • Computationally heavy steps such as demultiplexing and read alignment can consume a large share of computational resources and processing time.

  • Transferring large quantities of raw data, both from an on-premise Illumina sequencer and cloud to cloud from BaseSpace to the client’s cloud, takes a significant amount of time and must be properly secured due to the sensitivity of the data.

  • CPU, memory and disk requirements vary heavily with each step and require a powerful scheduler.


Zelytics identified that applying hardware acceleration at the right bottlenecks could significantly speed up overall processing. Zelytics supported the client in implementing hardware-accelerated solutions through NVIDIA Parabricks and Illumina’s DRAGEN platform.


Next, we implemented a custom bioinformatics pipeline leveraging the aforementioned platforms accompanied by flexible compute resources for subsequent annotation steps such as the Nirvana annotator.


We implemented a containerized Nextflow pipeline integrated with AWS Batch and Amazon ECS to seamlessly deploy and schedule compute jobs.


To accelerate data transfer from cloud to cloud and on-prem to cloud, we leveraged the fact that Illumina BaseSpace is built on S3 and implemented a custom network-accelerated solution to transfer data quickly.
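The case study does not describe the transfer mechanism itself. One common technique for speeding up large S3 transfers is to split each object into byte ranges and move the ranges in parallel, since S3 supports ranged reads. The sketch below only plans those ranges and formats the standard HTTP `Range` header; it is an illustration of that general technique, not the custom solution Zelytics built. The function names and the 64 MiB part size are hypothetical.

```python
from typing import List, Tuple

PART_SIZE_BYTES = 64 * 1024 * 1024  # assumption: 64 MiB per part, chosen for illustration


def plan_byte_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split an object into inclusive (start, end) byte ranges of at most part_size bytes."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1  # inclusive end offset
        ranges.append((start, end))
        start = end + 1
    return ranges


def to_range_header(byte_range: Tuple[int, int]) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    start, end = byte_range
    return f"bytes={start}-{end}"


# Example: a hypothetical 100 GiB FASTQ archive split into independent parts.
parts = plan_byte_ranges(100 * 1024**3, PART_SIZE_BYTES)
print(len(parts), "parts; first:", to_range_header(parts[0]), "last:", to_range_header(parts[-1]))
```

Each planned range can then be handed to a separate worker, so throughput scales with the number of concurrent connections rather than being limited by a single stream.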


For on-prem data, we set up an AWS PrivateLink connection, together with IAM roles for the on-prem systems and cron jobs, for highly secure, automated data transfer to a dedicated Virtual Private Cloud (VPC) on AWS.


Objective: Simplify analytical query performance and semi-structured annotation processing


Once the raw processing capacity was in place, the next step was implementing a simple, performant structure for querying large quantities of semi-structured data.


Looking more closely, it became evident that some tools and formats are easy to process on regular compute nodes, whereas others are quite difficult to work with in those environments. Specifically, the semi-structured data produced by the Nirvana annotator was difficult to analyze and use at scale within simple Python or R environments.


As a highly proficient Snowflake integration partner, Zelytics suggested a hybrid approach in which each kind of annotated data is processed where it makes the most sense.


For this, Zelytics built a custom integration between Snowflake and Nextflow. It allowed annotated files to be processed with bioinformatics tools on regular EC2 instances, while leveraging the Snowflake data warehouse to process the harder-to-parse semi-structured data and deliver highly performant, scalable online analytical processing (OLAP) queries out of the box.


Objective: Custom parsing solution for Nirvana semi-structured data


Last, Zelytics built a custom parsing solution that converts the Nirvana-annotated JSON data into a relational format, which can be readily joined with the Snowflake-ingested Variant Call Format (VCF) files containing the processed human genome data.
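As an illustration of what flattening nested annotation JSON into relational rows can look like, here is a minimal sketch. It is not the parser Zelytics built. The field names (`positions`, `variants`, `transcripts`, `hgnc`, `consequence`, and so on) follow the Nirvana JSON layout as commonly documented, but treat them as assumptions and check them against your own files. The sample record, gene name, and transcript IDs are made up.

```python
import json
from typing import Any, Dict, Iterator

# Made-up, heavily trimmed example of a Nirvana-style annotation document.
SAMPLE = """
{
  "header": {"annotator": "Nirvana"},
  "positions": [{
    "chromosome": "chr1", "position": 12345, "refAllele": "A", "altAlleles": ["G"],
    "variants": [{
      "chromosome": "chr1", "begin": 12345, "refAllele": "A", "altAllele": "G",
      "variantType": "SNV",
      "transcripts": [
        {"transcript": "ENST00000000001.1", "hgnc": "GENE1",
         "consequence": ["missense_variant"]},
        {"transcript": "NM_000001.1", "hgnc": "GENE1",
         "consequence": ["missense_variant", "splice_region_variant"]}
      ]
    }]
  }]
}
"""


def iter_transcript_rows(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (variant, transcript) pair from a parsed annotation document."""
    for position in doc.get("positions", []):
        for variant in position.get("variants", []):
            base = {
                "chromosome": variant.get("chromosome"),
                "begin": variant.get("begin"),
                "ref_allele": variant.get("refAllele"),
                "alt_allele": variant.get("altAllele"),
                "variant_type": variant.get("variantType"),
            }
            # Variants without transcript annotations still produce one row.
            for transcript in variant.get("transcripts") or [{}]:
                yield {
                    **base,
                    "transcript": transcript.get("transcript"),
                    "gene": transcript.get("hgnc"),
                    "consequence": ",".join(transcript.get("consequence", [])),
                }


rows = list(iter_transcript_rows(json.loads(SAMPLE)))
for row in rows:
    print(row)
```

Real annotation files are far too large to load with `json.loads` in one go, so a production parser would stream the file; the row shape, keyed on chromosome, position, and alleles, is the part that makes a later join against variant calls straightforward.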


This was critical in giving the client’s research and development team an astonishingly simple yet scalable interface for querying these data sources and integrating them into their Nextflow pipelines or local development environments.
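To show the kind of query such an interface enables, the sketch below joins a table of variant calls with a table of flattened annotations on chromosome, position, and alleles. Python's built-in `sqlite3` stands in for Snowflake here purely so the example runs anywhere. The table names, column names, and rows are hypothetical and do not reflect the client's schema.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up variant calls and annotations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variant_calls (
            sample_id TEXT, chromosome TEXT, position INTEGER,
            ref TEXT, alt TEXT, genotype TEXT
        );
        CREATE TABLE annotations (
            chromosome TEXT, position INTEGER, ref TEXT, alt TEXT,
            gene TEXT, consequence TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO variant_calls VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("S1", "chr1", 12345, "A", "G", "0/1"),
            ("S2", "chr1", 12345, "A", "G", "1/1"),
            ("S1", "chr2", 500, "C", "T", "0/1"),
        ],
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("chr1", 12345, "A", "G", "GENE1", "missense_variant"),
            ("chr2", 500, "C", "T", "GENE2", "synonymous_variant"),
        ],
    )
    return conn


def samples_with_consequence(conn: sqlite3.Connection, consequence: str) -> List[Tuple[str, str, str]]:
    """Return (sample_id, gene, genotype) for calls whose annotation matches the consequence."""
    query = """
        SELECT v.sample_id, a.gene, v.genotype
        FROM variant_calls AS v
        JOIN annotations AS a
          ON a.chromosome = v.chromosome
         AND a.position = v.position
         AND a.ref = v.ref
         AND a.alt = v.alt
        WHERE a.consequence = ?
        ORDER BY v.sample_id
    """
    return conn.execute(query, (consequence,)).fetchall()


conn = build_demo_db()
print(samples_with_consequence(conn, "missense_variant"))
```

Once annotations live in relational form, questions like "which samples carry a missense variant in a given gene" become a single join instead of custom JSON traversal code.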


Closing remarks

Are you facing similar compliance challenges in your organization? Healthcare, finance, you name it. At Zelytics, we have dedicated consultants to set up comprehensive data governance solutions for your company.


Zelytics offers a complimentary consultation to help you gain clarity around your main challenges and develop a data-driven strategy to overcome them.


Let’s talk and get to know each other and see what we can do for your business.



