Genomics is a major contributor to the bioinformatics market, which is expected to grow 20% per year over the next decade. Gigabytes of data are involved in each computation (DNA samples, reference genomes, banks, etc.), and large-scale deployment requires economic efficiency, and therefore computing efficiency.
The first time we met Mr. Lavenier, who leads the GenScale team at the INRIA public research lab, he identified the potential of Processing-In-Memory: the execution time of genomics algorithms is dictated by DRAM bandwidth, and offloading data-intensive processing behind the Memory Wall should solve this problem.
To validate these initial thoughts with real code and data, Mr. Lavenier conducted two evaluations, using two of the most widely used applications in genomics: Mapping and BLAST. In both cases, we see an acceleration of 25x. In other words, the application runs 25 times faster with UPMEM hardware added to the server. Awesome!
Mapping evaluation
The “Mapping” operation consists of finding the best location of DNA fragments in the full genome. For instance, a typical human genome analysis often requires “mapping” more than 100 million DNA fragments (10+ GBytes of data) onto the human reference genome, which requires 150+ GBytes of memory when stored as a fast index. With the UPMEM solution, DNA sequences are dispatched to the various co-processors depending on k-mer features, which allows the computation to be massively distributed (see chart below).
Benchmarking against state-of-the-art implementations, we see an average speedup of 25x when the UPMEM solution is used. The full report is available here, and includes detailed explanations of the implementation, the source code used, as well as the measured timings and estimated overheads.
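To make the dispatching idea concrete, here is a minimal sketch in Python. It is not the code from the report. The choice of k-mer feature (the lexicographically smallest k-mer of the read, often called a minimizer), the CRC32 hash, and the names `kmers`, `dpu_for_read` and `dispatch` are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def dpu_for_read(read, k, num_dpus):
    """Choose a co-processor index from a k-mer feature of the read.

    Assumption: the feature is the lexicographically smallest k-mer
    of the read. The real implementation may use a different feature.
    """
    if len(read) < k:
        raise ValueError("read is shorter than k")
    minimizer = min(kmers(read, k))
    # CRC32 is used as a stable hash (Python's built-in hash() of
    # strings changes from one process to the next).
    return zlib.crc32(minimizer.encode("ascii")) % num_dpus


def dispatch(reads, k=4, num_dpus=8):
    """Group reads into one bucket per co-processor index."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[dpu_for_read(read, k, num_dpus)].append(read)
    return dict(buckets)
```

With this scheme, two reads that share the same selected k-mer always land in the same bucket, so each co-processor would only need the part of the index that matches the k-mers routed to it.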
BLAST evaluation
BLAST is a molecular biology software application that scans and compares DNA sequences and/or protein banks. It is used daily by thousands of biologists. To find similarities, BLAST proceeds in three steps: (1) search for common words between the query sequence and the bank sequences; (2) evaluation of local similarity in the neighbourhood of these words; and (3) computation of the final alignment.
Steps 1 and 2 are limited by memory bandwidth and account for the majority of computing time. They are offloaded to the UPMEM co-processors inside the DRAM, so that only the final alignment is done by the server's main CPU.
The BLAST workload runs 25 times faster with the UPMEM Processing-In-Memory solution; put differently, it would require 1 server instead of 25. The full report is available here, and details the datasets used, the source code implemented, as well as the measured execution times and integration overhead.
As an anecdote, the cmpb4 instruction was suggested by Mr. Lavenier to accelerate the critical loop, and is now a first-class citizen in the DPUv1 instruction set architecture. This is a small thing compared to the validation of the PIM programming model and its efficiency, but it proves how much our partners can influence our products.
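Steps 1 and 2 can be illustrated with a short Python sketch. This is neither BLAST's actual code nor the implementation from the report. The word length `w`, the window `radius`, the +1/-1 scoring, the `threshold`, and all function names are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map each word of length w to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_seeds(query, bank_seq, w):
    """Step 1: list (query_pos, bank_pos) pairs sharing a word of length w."""
    index = index_words(bank_seq, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for b in index.get(query[q:q + w], ()):
            seeds.append((q, b))
    return seeds


def neighbourhood_score(query, bank_seq, q, b, w, radius,
                        match=1, mismatch=-1):
    """Step 2: ungapped score of the seed plus radius letters on each side."""
    score = 0
    for offset in range(-radius, w + radius):
        qi, bi = q + offset, b + offset
        # Ignore window positions that fall outside either sequence.
        if 0 <= qi < len(query) and 0 <= bi < len(bank_seq):
            score += match if query[qi] == bank_seq[bi] else mismatch
    return score


def candidate_hits(query, bank_seq, w=4, radius=4, threshold=6):
    """Keep the seeds whose neighbourhood score reaches the threshold."""
    hits = []
    for q, b in find_seeds(query, bank_seq, w):
        score = neighbourhood_score(query, bank_seq, q, b, w, radius)
        if score >= threshold:
            hits.append((q, b, score))
    return hits
```

Only the seeds returned by `candidate_hits` would be passed on to step 3, the final alignment. Both functions do little more than scan and compare letters across the whole bank, which is why this part of the work is bound by memory bandwidth rather than by arithmetic.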