
# ABUMPIMP 2023

The 1st Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform

August 29, 2023



Copyright UPMEM® 2023



## Today's agenda

| TIME          | TITLE | SPEAKER(S) |
|---------------|-------|------------|
| 09:00 – 09:05 | Session welcome and aims | UPMEM |
| 09:05 – 10:00 | Keynote: UPMEM PIM platform for Data-Intensive Applications | Yann FALEVOZ (UPMEM) /<br>Julien LEGRIEL (UPMEM) |
| 10:00 – 10:30 | Invited talk: Understanding the potential of real processing-in-memory for modern workloads | Juan GOMEZ LUNA<br>(ETHZ) |
| 10:30 – 11:00 | Coffee break | |
| 11:00 – 11:30 | Research paper: pimDB: From Main-Memory DBMS to Processing-In-Memory DBMS-Engines on Intelligent Memories | Arthur BERNHARDT<br>(Reutlingen University) |
| 11:30 – 12:00 | Invited talk: A Fast Processing-in-DIMM Join Algorithm Exploiting UPMEM DIMMs | Chaemin LIM<br>(Yonsei University) |
| 12:00 – 12:30 | Invited talk: PIM Performance and Economics for In-Memory Databases | Hanna KRUPPE<br>(SAP) |
| 12:30 – 14:00 | Lunch break | |
| 14:00 – 14:30 | Research paper: Banded Dynamic Programming Algorithms on UPMEM PIM Architecture | Meven MOGNOL<br>(Univ. Rennes, CNRS-IRISA, Inria & UPMEM) |
| 14:30 – 15:00 | Research paper: Protein Alignment on UPMEM PIM Architecture | Dominique LAVENIER<br>(Univ. Rennes, CNRS-IRISA & Inria) |
| 15:00 – 15:30 | Research paper: An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System | Juan GOMEZ LUNA (ETHZ) /<br>Sylvan BROCARD (UPMEM) |
| 15:30 – 16:00 | Coffee break | |
| 16:00 – 16:30 | Research paper: Implementation and Evaluation of Deep Neural Networks in Commercially Available Processing in Memory Hardware | Purab SUTRADHAR<br>(RIT) |
| 16:30 – 17:00 | Research paper: Privacy-Preserving Computing on UPMEM | Elaheh SADREDINI<br>(UCR) |
| 17:00 – 17:15 | Closing | UPMEM |






# Keynote: UPMEM PIM platform for Data-Intensive Applications


### Keynote's agenda

1. Technology Overview
2. High-Level Hardware Architecture
3. Programming the UPMEM PIM
4. PIM Applications
5. Hardware Roadmap



### Overcome data and energy bottleneck thanks to PIM



**Founded:** 2015



Headquarters: Grenoble, France



**Gilles Hamou**, CEO / Co-Founder

Track record:

- Co-owner @ Oscaro.com
- Scaled Oscaro.com to \$100M revenue from inception
- Founded & scaled Plantes-et-Jardins.com
- Senior Manager @ RSM
- Case Leader @ BCG

Education:

- MBA, INSEAD
- Eng., Centrale Paris



**Employee Count:** ~20



Total Patents: 11



**Fabrice Devaux**, CTO / Co-Founder

Track record:

- Senior Staff SWE @ VMware
- Co-owner, CTO @ Trango Virtual Processors, sold to VMware
- CPU Architect @ STMicroelectronics

Education:

- DEA, Microelectronics, Pierre and Marie Curie University




### Limitations of Traditional Compute-Centric Architectures for Data-Intensive Workloads



Total system (server with compute node) energy consumption

~63%





Source: SK hynix CEO Seok-Hee Lee's keynote at the GSA Memory+ 2021 conference, confirming Lawrence Berkeley Lab results.



## Overcome data and energy bottleneck thanks to processing in memory








## Taxonomy of processing in memory (PIM)



Source: Das, Reetuparna; Wang, Xiaowei; Fujiki, Daichi; Subramaniyan, Arun. In-/Near-Memory Computing. United States: Morgan & Claypool Publishers, 2021.




## Proven capacity to benefit a wide range of applications







# Technology Overview


## A standard application server populated with PIM DIMMs







## A DPU is a simple modern general-purpose processor



- Shared access with the host CPU to a DRAM bank
- Instruction and data caches are replaced by an instruction RAM (IRAM) and a working RAM (WRAM)
- Independent and asynchronous
- 24 independent threads per DPU
- No direct communication channel among DPUs



# A set of tools for smooth application porting

The host application is an x86 program written in C, C++ or Python, with C functions to call routines on the DPUs.

**UPMEM SDK contains:**

- A full-featured runtime library for the DPU
- Management and communication libraries that encapsulate all host-to-DPU operations
- An LLVM-based C compiler (LLVM 12.0)
- An LLDB-based debugger
- Programming tools: profilers, simulator...
- Server BIOS binaries
- Linux driver for x86 servers, validated on Red Hat, Ubuntu and Debian

[Figure: host-side profiling timeline (microsecond scale) showing calls such as `dpu_copy_to_mrams`, `dpu_sync_rank` and `dpu_sync`]





# High Level Hardware Architecture


### Integrating a processor into a DRAM










### The DPU architecture





## **Pipeline properties**

- 11-stage pipeline
- 24 hardware threads interleaved
- 32 registers per thread
  - 24 multi-purpose registers
  - 8 constant registers (zero, one, thread identifier, etc.)
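
One way to read these numbers: with hardware threads interleaved in an 11-stage pipeline, a single thread is unlikely to keep every stage busy on its own. The sketch below rests on a simplifying assumption that is not stated on the slide, namely that each thread has at most one instruction in the pipeline at a time, so roughly 11 ready threads are needed to issue one instruction per cycle. The function name `pipeline_utilization` is illustrative.

```c
#define PIPELINE_STAGES 11  /* from the slide */
#define MAX_HW_THREADS  24  /* from the slide */

/* Estimated fraction of cycles in which an instruction is issued,
 * ASSUMING each thread has at most one instruction in flight
 * (a simplification, not a statement about the real hardware). */
double pipeline_utilization(int active_threads)
{
    if (active_threads <= 0)
        return 0.0;
    if (active_threads > MAX_HW_THREADS)
        active_threads = MAX_HW_THREADS;
    if (active_threads >= PIPELINE_STAGES)
        return 1.0;
    return (double)active_threads / PIPELINE_STAGES;
}
```

Under this model, running more than 11 threads does not raise the issue rate further; the extra threads mainly help hide memory latency.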





### System overview







## **CPU/DPUs communication**












# Programming the UPMEM PIM


# **UPMEM SDK & Application Design Flow**

#### <u>CPU program</u>

- Compiled using the x86 toolchain
- Uses the host library of UPMEM's SDK
- Allocates DPUs, loads the DPU program, boots the DPUs, copies data, orchestrates execution asynchronously, etc.

#### <u>DPU program</u>

- Written in C; a subset of the C library is available
- Usually common to all DPUs (not mandatory)
- Compiled using dpu-clang (based on LLVM 12)
- Uses the DPU libraries for primitives such as WRAM/MRAM management, thread synchronization (mutex etc.), performance counters, etc.

#### <u>Profiling & debugging tools</u>

- DPU: dpu-lldb, dpu-trace, etc.
- Host: dpu-profiling tool based on Linux perf

#### <u>System</u>

- Linux driver for x86 servers
- Server BIOS binaries




## The DPU ISA

- Proprietary RISC triadic instruction set
- No FPU: floating point is emulated by a software library (IEEE-754)
  - ~200 cycles for a 32-bit × 32-bit float multiplication
  - Consider quantization for FP applications
- No vectorized instructions (except cmpb4)
- 8×8-bit multiplication instruction; a 32×32-bit multiplication takes up to 32 operations
- Rich set of conditions for jumps (examples on the next slide)
- Assembly code analysis and optimization can be useful
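
The quantization advice above can be illustrated with fixed-point arithmetic, which replaces float multiplications with integer ones. This is a minimal sketch assuming a Q16.16 format; the type `q16_16` and the helper names are illustrative and not part of the UPMEM SDK. It uses a 64-bit intermediate for clarity, and whether that is the best choice on a DPU is not addressed here.

```c
#include <stdint.h>

#define Q_FRAC_BITS 16  /* Q16.16: 16 integer bits, 16 fractional bits */

typedef int32_t q16_16;

/* Convert a real value to Q16.16 with rounding (typically done once,
 * e.g. on the host, before the data is sent to the accelerator). */
q16_16 q_from_double(double x)
{
    return (q16_16)(x * (1 << Q_FRAC_BITS) + (x >= 0 ? 0.5 : -0.5));
}

/* Convert back to a real value, e.g. to inspect results. */
double q_to_double(q16_16 q)
{
    return (double)q / (1 << Q_FRAC_BITS);
}

/* Multiply two Q16.16 values using integer operations only.
 * The right shift of a negative value is implementation-defined in C;
 * mainstream compilers implement it as an arithmetic shift. */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> Q_FRAC_BITS);
}
```

For example, `q_mul(q_from_double(1.5), q_from_double(2.0))` yields the Q16.16 encoding of 3.0. The price is a fixed range and precision, so the format has to be chosen per application.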

Operators (non-exhaustive):

- Arithmetic: ADD, SUB, AND, OR, XOR
- Loads/Stores: SD, SW, SH, SB, LD, LW, LH, LB
- Shift/Rotate: LSL, LSR, ROL, ROR
- Count bits: CLZ, CLO, CLS, CAO
- Multiplication/Division: MUL\_STEP, DIV\_STEP, MUL
- DMA loads/stores: SDMA, LDMA, LDMAI
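
To give an intuition for why a step-wise multiplication can need up to 32 operations for 32-bit operands, here is the textbook shift-and-add algorithm, which performs one conditional add per multiplier bit. It is a generic illustration only: it does not describe the actual semantics of `MUL_STEP` or the code the compiler emits, and the function name is hypothetical.

```c
#include <stdint.h>

/* Textbook shift-and-add multiplication: one conditional add per bit
 * of the multiplier b, so at most 32 steps for 32-bit operands.
 * Returns the low 32 bits of a * b and, if steps is a non-null
 * pointer, stores the number of loop iterations executed. */
uint32_t mul_shift_add(uint32_t a, uint32_t b, int *steps)
{
    uint32_t result = 0;
    int n = 0;

    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the shifted multiplicand */
        a <<= 1;
        b >>= 1;
        n++;
    }
    if (steps)
        *steps = n;
    return result;
}
```

The loop exits as soon as the remaining multiplier bits are zero, so small multipliers finish in fewer steps; passing the smaller operand as `b` is the usual trick.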



## **The DPU ISA**



| C source | DPU assembly |
|----------|--------------|
| <pre>int sum(const int* data, int size) {<br>  int val = 0;<br>  for(int i=size; i!= 0; i--)<br>    val += data[i];<br>  return val;<br>}</pre> | <pre>sum:<br>  move r2, r0<br>  move r0, 0<br>  jeq r1, 0, .LBB0_2<br>.LBB0_1:<br>  lsl_add r3, r2, r1, 2<br>  lw r3, r3, 0<br>  add r0, r3, r0<br>  add r1, r1, -1, nz, .LBB0_1</pre> |

<u>Reverse loop</u>: counting down to zero lets the final `add r1, r1, -1, nz, .LBB0_1` decrement the counter and branch while it is non-zero in a single instruction. Note that, as written, the loop reads `data[size]` down to `data[1]`.



# **PIM Application Development**

#### Which applications suit UPMEM PIM?

- Largely data-parallel workloads (across DPUs and hardware threads)
- Little synchronization and good load balancing
- Memory-bound workloads (aggregate bandwidth of 2 TB/s on PIM)
- Low operational intensity on CPU / cache-inefficient
- UPMEM PIM is throughput-oriented (batching)
- Balance PIM workload against data transfer (e.g. keep a large read-only database in PIM)
- Consider algorithm redesign due to the change of paradigm
- Maturing PIM applications is key to the adoption of UPMEM's PIM product
  - Customers do not want to develop the applications for us
- From algorithm benchmarking towards porting of standard libraries to PIM
- Application development is tightly coupled with SDK enhancements
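
As an illustration of the data-parallel, load-balanced decomposition described above, the following sketch splits `n` items into contiguous chunks whose sizes differ by at most one. The `range_t` type and `chunk_for` function are hypothetical helpers, not SDK calls.

```c
#include <stddef.h>

/* Half-open range [begin, end) of item indices. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split n items into nr_workers contiguous chunks whose sizes differ
 * by at most one. worker must be in 0 .. nr_workers - 1 and
 * nr_workers must be non-zero. */
range_t chunk_for(size_t n, size_t nr_workers, size_t worker)
{
    size_t base = n / nr_workers;
    size_t extra = n % nr_workers;  /* the first `extra` workers get one more item */
    range_t r;

    r.begin = worker * base + (worker < extra ? worker : extra);
    r.end = r.begin + base + (worker < extra ? 1 : 0);
    return r;
}
```

The same function can be applied twice: once to split the input across DPUs, then again inside each DPU to split its share across the hardware threads.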



## **Analytics: Index Search**

- An index search engine identifies items in a database (web pages, text documents, e-commerce products...) from keywords specified by the user
- **UPIS**: engine for exact phrase match https://github.com/upmem/usecase_UPIS
  - **> 600 queries/sec** with a full PIM server on Wikipedia
  - versus 30 queries/sec in the Apache Lucene nightly benchmark
- **PIM Lucene**: extension of Apache Lucene https://github.com/upmem/pim-lucene
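
To make "exact phrase match" concrete, here is a minimal sketch of the classic positional-index approach: each query term has a sorted list of positions in a document, and the phrase matches if term k occurs at position start + k for some start. This is a generic illustration and is not taken from the UPIS or PIM Lucene sources; `postings_t` and `phrase_match` are hypothetical names.

```c
#include <stddef.h>

/* Positions (sorted ascending) at which one term occurs in a document. */
typedef struct {
    const unsigned *pos;
    size_t count;
} postings_t;

/* Binary search for target in a sorted position list. */
static int contains(const postings_t *p, unsigned target)
{
    size_t lo = 0, hi = p->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (p->pos[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < p->count && p->pos[lo] == target;
}

/* Returns 1 if the terms occur consecutively (term k at position
 * start + k) somewhere in the document, 0 otherwise. */
int phrase_match(const postings_t *terms, size_t nr_terms)
{
    if (nr_terms == 0)
        return 0;
    for (size_t i = 0; i < terms[0].count; i++) {
        unsigned start = terms[0].pos[i];
        size_t k = 1;

        while (k < nr_terms && contains(&terms[k], start + (unsigned)k))
            k++;
        if (k == nr_terms)
            return 1;
    }
    return 0;
}
```

Each document can be checked independently of the others, which is what makes this kind of query a natural fit for a data-parallel platform.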





# **Analytics: Hash Join**

- Parallel hash-based join on DPUs
- 4G rows per table (32 GB of random data)
- PIM-based join is ~9× faster than CPU-based join (data fusion), using a server configuration of 2048 DPUs and 8 standard DIMMs



Left table:

| Date     | CountryID | Units |
|----------|-----------|-------|
| 1/1/2020 | 1         | 40    |
| 1/2/2020 | 1         | 25    |
| 1/3/2020 | 3         | 30    |
| 1/4/2020 | 2         | 35    |

Right table:

| ID | Country |
|----|---------|
| 1  | USA     |
| 2  | Canada  |
| 3  | Panama  |
| 4  | Spain   |

Merged table:

| Date     | CountryID | Units | Country |
|----------|-----------|-------|---------|
| 1/1/2020 | 1         | 40    | USA     |
| 1/2/2020 | 1         | 25    | USA     |
| 1/3/2020 | 3         | 30    | Panama  |
| 1/4/2020 | 2         | 35    | Canada  |
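
The join in the figure can be sketched as a build phase over the smaller table followed by a probe phase over the larger one. This is a single-threaded illustration of a generic hash join with linear probing, not UPMEM's DPU implementation; all type and function names are hypothetical, and it assumes unique keys in the build table and fewer build rows than buckets.

```c
#define NR_BUCKETS 8  /* power of two, larger than the build table */

typedef struct { int id; const char *country; int used; } dim_row_t;
typedef struct { const char *date; int country_id; int units; } fact_row_t;
typedef struct { const char *date; int country_id; int units; const char *country; } joined_row_t;

/* Multiplicative hash reduced to a bucket index. */
static unsigned hash_id(int id)
{
    return ((unsigned)id * 2654435761u) & (NR_BUCKETS - 1);
}

/* Build phase: insert the right table into an open-addressing hash
 * table. Requires n < NR_BUCKETS so that probing always terminates. */
void build(dim_row_t table[NR_BUCKETS], const dim_row_t *rows, int n)
{
    for (int b = 0; b < NR_BUCKETS; b++)
        table[b].used = 0;
    for (int i = 0; i < n; i++) {
        unsigned h = hash_id(rows[i].id);
        while (table[h].used)
            h = (h + 1) & (NR_BUCKETS - 1);
        table[h] = rows[i];
        table[h].used = 1;
    }
}

/* Probe phase: look up each left row and emit one joined row per
 * match. Returns the number of rows written to out. */
int probe(const dim_row_t table[NR_BUCKETS], const fact_row_t *rows, int n,
          joined_row_t *out)
{
    int m = 0;

    for (int i = 0; i < n; i++) {
        unsigned h = hash_id(rows[i].country_id);
        while (table[h].used) {
            if (table[h].id == rows[i].country_id) {
                out[m].date = rows[i].date;
                out[m].country_id = rows[i].country_id;
                out[m].units = rows[i].units;
                out[m].country = table[h].country;
                m++;
                break;  /* keys are unique in the build table */
            }
            h = (h + 1) & (NR_BUCKETS - 1);
        }
    }
    return m;
}
```

In a parallel setting, rows would typically first be partitioned by key so that each worker builds and probes its own partition independently.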

- Scatter-gather transfers (SDK 2023.1.0)
- Porting an OLAP DBMS (DuckDB) to PIM is under consideration
- https://github.com/upmem/dpu_olap


# ML: Decision Trees / K-means

- **CART** training implemented on DPU: builds a binary decision tree that represents a partitioning of the feature space
  - 36× faster than CPU (Intel Extension for Scikit-learn, Criteo dataset)
  - Next step: XGBoost on PIM, throughput implementation
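
The core operation of CART training is scoring candidate splits of the feature space. The sketch below computes the weighted Gini impurity of one threshold split for binary labels, assuming integer (quantized) feature values. It is a generic illustration, not the UPMEM implementation, and the function names are hypothetical. The counting loop uses integers only and is the part that parallelizes over the data; the final impurity arithmetic could be done wherever floating point is cheap.

```c
#include <stddef.h>

/* Gini impurity of a node holding n0 samples of class 0 and n1 of class 1. */
static double gini(size_t n0, size_t n1)
{
    size_t n = n0 + n1;

    if (n == 0)
        return 0.0;
    double p0 = (double)n0 / n;
    double p1 = (double)n1 / n;
    return 1.0 - p0 * p0 - p1 * p1;
}

/* Weighted Gini impurity after splitting on feature[i] <= threshold.
 * Any non-zero label is treated as class 1. Lower is better. */
double split_impurity(const int *feature, const int *label, size_t n,
                      int threshold)
{
    size_t left[2] = {0, 0}, right[2] = {0, 0};

    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        if (feature[i] <= threshold)
            left[label[i] != 0]++;
        else
            right[label[i] != 0]++;
    }
    size_t nl = left[0] + left[1];
    size_t nr = right[0] + right[1];
    return ((double)nl / n) * gini(left[0], left[1])
         + ((double)nr / n) * gini(right[0], right[1]);
}
```

A tree builder would evaluate this for many candidate thresholds and features, keep the split with the lowest impurity, and recurse on the two halves.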

- **K-means**: partitions the dataset into K distinct non-overlapping subgroups (clusters)
  - 1.37× faster than CPU (Intel Extension for Scikit-learn, Criteo dataset)
- **Paper:** Evaluating Machine Learning Workloads on Memory-Centric Computing Systems (ISPASS 2023)
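
The assignment step of K-means, in which every point is attached to its nearest centroid, is the data-parallel part of the algorithm. Here is a minimal sketch using integer (quantized) coordinates so that no floating point is needed; it is a generic illustration rather than the implementation evaluated in the paper, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Squared Euclidean distance between two integer (quantized) points. */
static int64_t dist2(const int32_t *a, const int32_t *b, size_t dim)
{
    int64_t sum = 0;

    for (size_t i = 0; i < dim; i++) {
        int64_t d = (int64_t)a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Assignment step: index of the nearest of k centroids, stored
 * row-major as k x dim values. Requires k >= 1. Ties go to the
 * lowest index. */
size_t nearest_centroid(const int32_t *point, const int32_t *centroids,
                        size_t k, size_t dim)
{
    size_t best = 0;
    int64_t best_d = dist2(point, centroids, dim);

    for (size_t c = 1; c < k; c++) {
        int64_t d = dist2(point, centroids + c * dim, dim);
        if (d < best_d) {
            best_d = d;
            best = c;
        }
    }
    return best;
}
```

Each worker can assign its own slice of points and accumulate per-cluster sums and counts; only those small partial results need to be combined to update the centroids.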







## Genomics

• Sequence alignment



FASTQ file (read of 120 nucleotides):

- UPVC: short-read alignment + variant calling https://github.com/upmem/usecase_UPVC
- Long-read alignment: adaptive Needleman-Wunsch (N&W) algorithm (~9× speedup) https://github.com/upmem/usecase_dpu_alignment
- BWA (FM-index)
  - memory-bound algorithm but difficult to parallelize (too many synchronizations between DPUs)
- Pair-HMM (GATK variant calling)
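
For readers unfamiliar with N&W, the sketch below computes the plain Needleman-Wunsch global alignment score with a linear gap penalty, keeping only two rows of the dynamic-programming matrix. It is deliberately the basic algorithm, not the adaptive variant mentioned above; the function name, the scoring parameters and the `NW_MAX_LEN` limit are illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

#define NW_MAX_LEN 256  /* illustrative limit on the length of b */

/* Global alignment score of a and b with a linear gap penalty.
 * Stores the score in *score and returns 0, or returns -1 if b is
 * longer than NW_MAX_LEN. */
int nw_score(const char *a, const char *b, int match, int mismatch, int gap,
             int *score)
{
    size_t la = strlen(a), lb = strlen(b);
    int prev[NW_MAX_LEN + 1], cur[NW_MAX_LEN + 1];

    if (lb > NW_MAX_LEN)
        return -1;
    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j * gap;            /* first row: gaps only */
    for (size_t i = 1; i <= la; i++) {
        cur[0] = (int)i * gap;             /* first column: gaps only */
        for (size_t j = 1; j <= lb; j++) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up = prev[j] + gap;
            int left = cur[j - 1] + gap;
            int best = diag > up ? diag : up;
            cur[j] = best > left ? best : left;
        }
        memcpy(prev, cur, (lb + 1) * sizeof prev[0]);
    }
    *score = prev[lb];
    return 0;
}
```

Each pair of sequences is scored independently, so many alignments can run side by side; banded or adaptive variants reduce the work further by restricting the computation to cells near the main diagonal.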





## **Collaborative projects**









**BioPIM** 





Co-designing algorithms and data structures commonly used in bioinformatics together with several types of PIM architectures to obtain the highest benefit in cost, energy, and time savings.

3M€ project



**SustainML** 





Sustainable, interactive ML framework development for Green AI that will comprehensively prioritize and advocate energy efficiency across the entire life cycle of an application and avoid AI-waste.

4.3M€ project



**STRATUM**





3D decision support tool for brain surgery guidance and diagnostics based on multimodal data processing through AI algorithms that will be integrated as an energy-efficient Point-of-Care computing tool.







# Hardware Roadmap


## **PIM DRAM Modules**

#### <u>Gen 1B (being released)</u>

- Silicon bring-up completed; system bring-up in progress
- Frequency increased to 466 MHz, or up to 40% lower power consumption at the same frequency
- Host access to WRAM while the DPU owns the bank
- New HW monitoring features
- DPU switch-off capability: idle consumption reduced by 90%

#### <u>Gen 2</u>

- Under development
- Enhanced control interface
- Host access to MRAM while the DPU owns the bank
- New operator integration
- Improved system integration

#### <u>Gen AI</u>

- Starting to work on a roadmap for a dedicated chip for AI applications such as LLMs





## **CXL or SoC attached integration**

#### **CXL Proof of Concept**

- Based on a PCIe FPGA prototyping Card
- PIM friendly API embedded in the FPGA
- Simplifies system integration
- At the cost of higher BOM price and power consumption

#### PIM aware system

- Starting an initiative with a major player in network devices to build an SoC with the same functionalities



BittWare XUP-P3R



### **Servers**

#### From Skylake SP to Ice Lake SP

- Partnership with a major server manufacturer
- Moving from R&D servers to production servers
- 24 → 32 DIMM slots
- Increased number of memory channels
- Increased PIM and legacy memory density
- Release planned in the coming months
- Early exploration of Sapphire Rapids SP

#### **Open firmware implementation**

- Partnership with another major server manufacturer
- Early exploration through a proof of concept





## **Cloud infrastructure**

#### In Numbers

- 10 servers
- 56 teams
- 212 active users
- Almost 31,000 hours booked



#### **Evolutions**

- First place where the new features listed above will be released
- Service storage capacity (local disk, sftp for dataset pre-loading...)





# **Useful links**

- <u>Website</u>
- <u>Resource page</u>
- <u>Github</u>
- <u>SDK</u>

# Thank you

Yann FALEVOZ, in charge of lab relationship management

<u>yfalevoz@upmem.com</u>

Julien LEGRIEL, technical leader on the SDK and applications on PIM

<u>jlegriel@upmem.com</u>
