# Offloading Embedding Lookups to Processing-In-Memory for Deep Learning Recommender Models

#### Euro-Par 2024 - ABUMPIMP

Niloofar Zarif, Justin Wong, Alexandra Fedorova

Aug 2024



The University of British Columbia



UBC Systopia Research Group



## **Recommender Models**

- Recommender Systems in our everyday life: Facebook Marketplace, Google Ads, Netflix
- Deep Learning for Recommender Models
- Different from DNN or RNN
- Features:
  - Numerical
  - Categorical
- Embedding Layers



#### **Embedding Layer**
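A minimal sketch of what an embedding layer computes, assuming sum pooling (the pooling operator is an assumption, not something the slides state): each categorical feature value is an index into a table of learned vectors, and the vectors selected by a sample's indices are gathered and reduced into one vector. The names `embedding_lookup`, `table`, and `indices` are illustrative only.

```python
def embedding_lookup(table, indices):
    """Gather the rows named by `indices` and sum them element-wise."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for row in indices:
        for col in range(dim):
            pooled[col] += table[row][col]
    return pooled


# Toy table: 4 categories, embedding dimension 2.
table = [
    [0.5, 1.0],
    [1.5, -1.0],
    [2.0, 0.25],
    [-0.5, 4.0],
]

# One sample whose categorical feature takes the values 0 and 2.
print(embedding_lookup(table, [0, 2]))  # [2.5, 1.25]
```

The rows touched depend entirely on the input indices, which is why the access pattern is data-dependent rather than sequential.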



## **DLRM Inference Workload**

- DLRM: Meta's recommender system
  - MLP
  - Embedding Layer
- Low inference latency is important -> CPU preferred
- Some models spend more than 80% of each inference cycle's execution time on embedding lookups<sup>[1]</sup>
- Embedding lookups:
  - Very Irregular memory accesses -> higher MPKI and lower IPC
  - Low computational intensity -> lower FLOPS
- PIM-Rec:
  - Use Processing-In-Memory for Embedding Lookups



## **UPMEM PIM Solution**

- Perform computations right where the data lives and avoid the memory wall (limited memory bandwidth)
- This approach has been used before, but with specialized hardware:
  - PIM-Rec instead uses the first commercially available PIM solution, a drop-in replacement for existing DRAM
- UPMEM DRAM: Delivered as standard DDR4 DIMM modules



#### **UPMEM PIM Architecture**

- Constraints:
  - No cross-DPU memory sharing
  - Cannot process floating-point values
- Huge bandwidth potential
- Each DIMM has 2 ranks, and each rank has 64 DPUs
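The DIMM, rank, and DPU counts above translate directly into hardware sizing. The small calculation below uses only the two figures from this slide; the helper name `dimms_needed` is illustrative, and 2048 is the DPU count used in the experiments later in the deck.

```python
DPUS_PER_RANK = 64   # from the slide
RANKS_PER_DIMM = 2   # from the slide


def dimms_needed(num_dpus):
    """Smallest number of DIMMs that provides at least `num_dpus` DPUs."""
    dpus_per_dimm = DPUS_PER_RANK * RANKS_PER_DIMM
    return -(-num_dpus // dpus_per_dimm)  # ceiling division


print(DPUS_PER_RANK * RANKS_PER_DIMM)  # 128 DPUs per DIMM
print(dimms_needed(2048))              # 16
```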





#### UPMEM PIM DRAM Chip

## **Design Challenges**

- Minimal implementation overhead
  - Python vs. C memory management
- No inter-DPU communication
- No floating-point operations



## **PIM-Rec Design**

- Loading embedding tables to UPMEM memory
  - Break tables into columns (16, 32, or 64)
  - Each column copied to 1 DPU
  - Convert 32-bit floating-point values to 32-bit integers
  - Pre-processing done just once

- Receiving a lookup query
  - Split the query by table
  - Copy to the corresponding DPUs
  - Aggregate results on the host side
  - Convert 32-bit integers back to 32-bit floating point
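The column layout and the integer conversion can be sketched as follows. The slides do not say how floating-point values are mapped to integers, so this sketch assumes a simple fixed-point scheme with a hypothetical scale factor `SCALE`; all function names are illustrative, and sum pooling is assumed.

```python
SCALE = 1 << 16  # hypothetical fixed-point scale; the conversion method is an assumption


def to_fixed(x):
    """Floating point -> integer, done once on the host at load time."""
    return int(round(x * SCALE))


def to_float(q):
    """Integer -> floating point, done on the host after aggregation."""
    return q / SCALE


def split_into_columns(table):
    """Return one list per column; column j is what a single DPU would hold."""
    return [[to_fixed(row[j]) for row in table] for j in range(len(table[0]))]


def dpu_lookup(column, indices):
    """Integer-only work for one DPU: sum the requested entries of its column."""
    return sum(column[i] for i in indices)


table = [[0.5, 1.0], [1.5, -1.0], [2.0, 0.25], [-0.5, 4.0]]
columns = split_into_columns(table)                  # pre-processing, done once
partial = [dpu_lookup(c, [0, 2]) for c in columns]   # one integer per DPU
print([to_float(q) for q in partial])                # [2.5, 1.25]
```

Because each DPU owns a whole column, a lookup needs no communication between DPUs: every DPU produces one element of the output vector, and the host only has to concatenate and rescale.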

#### Case 1: embedding table with 16 columns

#### Case 2: embedding table with 32 columns

#### Case 3: embedding table with 64 columns

#### Case 4: embedding table with 128 columns



## **PIM-Rec Design (cont.)**

- Loading embedding tables to UPMEM memory
  - Break tables into columns (16, 32, or 64)
  - Each column copied to 1 DPU
  - Each table copied to at least 1 rank
  - Convert 32-bit floating-point values to 32-bit integers

- Receiving a lookup query
  1. Break up the query and copy it to the DPUs (parallel transfers)
  2. Process in the DPU and store the result in MRAM
  3. Copy from MRAM (DPU) to host
  4. Convert 32-bit integers back to 32-bit floating point
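The four query steps can be traced with a host-side simulation. This is a sketch under stated assumptions, not the PIM-Rec implementation: `SimulatedDPU` is a hypothetical stand-in for a real DPU, a thread pool stands in for parallel transfers, the fixed-point `SCALE` is an assumed conversion, and sum pooling is assumed.

```python
from concurrent.futures import ThreadPoolExecutor

SCALE = 1 << 16  # hypothetical fixed-point scale (conversion method is an assumption)


class SimulatedDPU:
    """Stand-in for one DPU holding a single integer column in its MRAM."""

    def __init__(self, column):
        self.mram_column = column
        self.mram_input = []
        self.mram_result = 0

    def copy_in(self, indices):   # step 1: host -> DPU
        self.mram_input = list(indices)

    def run(self):                # step 2: integer-only lookup and sum
        self.mram_result = sum(self.mram_column[i] for i in self.mram_input)

    def copy_out(self):           # step 3: DPU -> host
        return self.mram_result


def load_table(table):
    """One simulated DPU per column, values pre-converted to integers."""
    return [SimulatedDPU([round(row[j] * SCALE) for row in table])
            for j in range(len(table[0]))]


def lookup(dpus_by_table, query):
    """`query` maps a table name to the row indices requested from it."""
    jobs = [(dpu, query[name])
            for name, dpus in dpus_by_table.items() for dpu in dpus]
    with ThreadPoolExecutor() as pool:  # mimics parallel transfers and execution
        list(pool.map(lambda job: (job[0].copy_in(job[1]), job[0].run()), jobs))
    # step 4: convert each DPU's integer result back to floating point
    return {name: [dpu.copy_out() / SCALE for dpu in dpus]
            for name, dpus in dpus_by_table.items()}


tables = {
    "user": [[0.5, 1.0], [1.5, -1.0], [2.0, 0.25]],
    "item": [[1.0, 2.0], [3.0, 4.0]],
}
dpus_by_table = {name: load_table(t) for name, t in tables.items()}
print(lookup(dpus_by_table, {"user": [0, 2], "item": [1]}))
```

Each simulated DPU is touched by exactly one job, so the threads never share state, which mirrors the constraint that DPUs cannot share memory with each other.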



#### Lookup Processing in DPUs



#### Parallelism in DPUs: Tasklets
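Tasklets are the software threads that run concurrently inside a single DPU. One plausible way to divide a batch among them is a strided assignment, sketched below in plain Python. The partitioning scheme, the value of `NUM_TASKLETS`, and every name here are assumptions for illustration; the text does not describe the scheme PIM-Rec uses.

```python
NUM_TASKLETS = 4  # hypothetical; chosen small so the example is easy to trace


def tasklet_work(tasklet_id, column, batch):
    """Process lookups tasklet_id, tasklet_id + NUM_TASKLETS, ... of the batch."""
    results = {}
    for lookup_id in range(tasklet_id, len(batch), NUM_TASKLETS):
        results[lookup_id] = sum(column[i] for i in batch[lookup_id])
    return results


def run_dpu(column, batch):
    """Merge per-tasklet results back into batch order."""
    merged = {}
    for tasklet_id in range(NUM_TASKLETS):  # sequential here; concurrent on a DPU
        merged.update(tasklet_work(tasklet_id, column, batch))
    return [merged[i] for i in range(len(batch))]


column = [10, 20, 30, 40, 50]                          # one integer column in one DPU
batch = [[0, 1], [2], [3, 4], [0, 4], [1, 2, 3], [4]]  # six lookups
print(run_dpu(column, batch))  # [30, 30, 90, 60, 90, 50]
```

Since each tasklet writes to a disjoint set of output slots, no locking is needed in this scheme.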



# **Experimental Results**

## Speedup

- 2048 DPUs
  - 32 embedding tables
  - 64 columns per table
- 0.5 to 56 MB data per DPU
  - 125K to 13.9M 32-bit integers
- 30 KB queries
  - Batch size of 64
  - ~120 lookup operations per batch
- 1 to 114 GB total embedding data
  - 32 tables
  - 0.5 to 56 MB per table
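The configuration figures above can be cross-checked with a short calculation, assuming one column per DPU, 4 bytes per value, and decimal megabytes (the unit convention is an assumption). The function name `sizing` is illustrative.

```python
TABLES = 32
COLUMNS_PER_TABLE = 64
BYTES_PER_VALUE = 4   # 32-bit integers
MB = 10**6            # decimal megabytes, an assumption about the slide's units


def sizing(mb_per_dpu):
    """Return (number of DPUs, values per DPU, total embedding data in GB)."""
    dpus = TABLES * COLUMNS_PER_TABLE            # one column per DPU
    values_per_dpu = mb_per_dpu * MB / BYTES_PER_VALUE
    total_gb = dpus * mb_per_dpu / 1000
    return dpus, values_per_dpu, total_gb


for mb_per_dpu in (0.5, 56):
    print(sizing(mb_per_dpu))
```

This gives 2048 DPUs, 125,000 values and about 1 GB at the low end, and 14 million values and about 114.7 GB at the high end, close to the rounded figures on the slide.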



Figure: speedup as a function of embedding data per DPU (MB)

## **Cache Hit Rate**

- 2048 DPUs
  - 32 embedding tables
  - 64 columns per table
- 2 MB data per DPU
  - 500K 32-bit integers
- 3.8 to 48 KB queries
  - Batch size of 8 to 100
  - ~120 lookup operations per batch
- 4 GB total embedding data
  - 32 tables
  - 2 MB per table

Figure: L1D and LLC hit rate with varying batch size



## **Processor Performance**

- 128 to 2048 DPUs
  - 2 to 32 embedding tables
  - 64 columns per table
- 2 MB data per DPU
  - 500K 32-bit integers
- 30 KB queries
  - Batch size of 64
  - ~120 lookup operations per batch
- 256 MB to 4 GB total embedding data
  - 2 to 32 tables
  - 2 MB per table



## **Favourable Workload**

- 4480 DPUs
  - 70 embedding tables
  - 64 columns per table
- 400 KB data per DPU
  - 100K 32-bit integers
- 4 KB queries (32-bit int)
  - Batch size of 16
  - ~64 lookup operations per batch
- 448 MB total embedding data
  - 64 tables
  - 6.6 MB per table



## **Latency Breakdown**

- 128 to 2048 DPUs
  - 2 to 32 embedding tables
  - 64 columns per table
- 2 MB data per DPU
  - 500K 32-bit integers
- 30 KB queries
  - Batch size of 64
  - ~120 lookup operations per batch
- 256 MB to 4 GB total embedding data
  - 2 to 32 tables
  - 2 MB per table



## Conclusion

- PIM-Rec offers up to 10.5X speedup
- CPU used more efficiently, higher IPC
- Cache used more efficiently, higher LLC and L1D hit rate
- UPMEM PIM lookups exhibit promising scalability

Further experimental results: MSc thesis, available from the UBC Library

# Thank you!

Special thanks to Prof. Alexandra (Sasha) Fedorova and Justin Wong!

**Questions?**