The PIM (Processing-In-Memory) solution developed by UPMEM is based on a RISC processor design code-named DPU (DRAM Processing Unit), optimized for intensive data processing and compatible with DRAM process constraints. The DPU SDK has been built to accelerate software development and to ease integration with large applications that may use hundreds of DPUs in parallel.
- At the micro level: the SDK provides the basic building blocks to create programs, load them onto the co-processors and drive them from host applications
- At the macro level: the SDK defines a collection of APIs and frameworks to command and control groups of processing units, simplifying activity scheduling, memory distribution and communication
On the DPU side, the SDK lets developers write programs that share memory with the host and use dedicated communication channels, called mailboxes, to post and receive small amounts of information (such as processing parameters). Programs can be designed as simple functional co-processing units or as a network of actors co-operating within a broader massively parallel environment.
In the "functional model", each DPU implements one or several operations, each seen as an individual function from the host application's standpoint. The input parameters are fetched from memory or mailboxes, and the returned value is written back into memory or posted into mailboxes.
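To make the functional model concrete, here is a minimal host-only sketch in C. It does not use the DPU SDK: `mock_dpu_t`, the mailbox structures and both functions are hypothetical stand-ins, and the "DPU" is simulated by an ordinary function call. It only illustrates the flow described above: bulk data goes through shared memory, small parameters go through a mailbox, and the result comes back through a mailbox.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; not the DPU SDK API. */
#define SHARED_MEM_WORDS 64

typedef struct {
    uint32_t offset;  /* where the input starts in shared memory */
    uint32_t count;   /* how many words to process */
} mailbox_in_t;

typedef struct {
    uint32_t result;  /* value returned to the host */
} mailbox_out_t;

typedef struct {
    uint32_t memory[SHARED_MEM_WORDS];  /* memory shared with the host */
    mailbox_in_t inbox;                 /* small parameters posted by the host */
    mailbox_out_t outbox;               /* small results posted by the DPU */
} mock_dpu_t;

/* "DPU side": one operation, which the host sees as a plain function.
 * Parameters come from the inbox, data from shared memory, and the
 * result is posted into the outbox. */
static void dpu_sum_operation(mock_dpu_t *dpu)
{
    uint32_t sum = 0;
    assert(dpu->inbox.offset + dpu->inbox.count <= SHARED_MEM_WORDS);
    for (uint32_t i = 0; i < dpu->inbox.count; i++)
        sum += dpu->memory[dpu->inbox.offset + i];
    dpu->outbox.result = sum;
}

/* "Host side": copy the data into shared memory, post the parameters,
 * trigger the operation, then read the result back. */
static uint32_t host_sum_on_dpu(mock_dpu_t *dpu, const uint32_t *data,
                                uint32_t count)
{
    assert(count <= SHARED_MEM_WORDS);
    for (uint32_t i = 0; i < count; i++)
        dpu->memory[i] = data[i];
    dpu->inbox.offset = 0;
    dpu->inbox.count = count;
    dpu_sum_operation(dpu);  /* simulated here by a direct call */
    return dpu->outbox.result;
}
```

From the caller's point of view, `host_sum_on_dpu` behaves like any other function, which is the point of the functional model: the co-processor is hidden behind a call with inputs and a return value.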
The actor model conforms to the standard definition: every DPU holds a number of computational units that receive messages from the host application, which is itself composed of actors. DPUs see their host as an actor too, and can respond to it by sending messages.
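The message exchange of the actor model can be sketched with two in-memory queues. This is a simulation under stated assumptions, not SDK code: the message layout, the queue functions, the actor identifiers and the "double this value" request are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message and queue types for illustration only. */
#define QUEUE_CAPACITY 8
#define HOST_ACTOR_ID 0
#define DPU_ACTOR_ID 1

typedef enum { MSG_DOUBLE, MSG_RESULT } msg_kind_t;

typedef struct {
    msg_kind_t kind;
    int sender;   /* id of the actor that sent the message */
    int payload;
} message_t;

typedef struct {
    message_t slots[QUEUE_CAPACITY];
    size_t head;
    size_t count;
} msg_queue_t;

/* Append a message; returns 0 if the queue is full. */
static int queue_push(msg_queue_t *q, message_t m)
{
    if (q->count == QUEUE_CAPACITY)
        return 0;
    q->slots[(q->head + q->count) % QUEUE_CAPACITY] = m;
    q->count++;
    return 1;
}

/* Remove the oldest message; returns 0 if the queue is empty. */
static int queue_pop(msg_queue_t *q, message_t *out)
{
    if (q->count == 0)
        return 0;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/* A DPU-side actor: drains its inbox and answers each request by
 * sending a message to the host, which it treats as another actor. */
static void dpu_actor_step(int self_id, msg_queue_t *inbox,
                           msg_queue_t *host_inbox)
{
    message_t m;
    while (queue_pop(inbox, &m)) {
        if (m.kind == MSG_DOUBLE) {
            message_t reply = { MSG_RESULT, self_id, m.payload * 2 };
            int ok = queue_push(host_inbox, reply);
            assert(ok);
            (void)ok;
        }
    }
}

/* Host-side actor: send one request, let the DPU actor run, then read
 * the reply. Returns -1 if no valid reply arrived. */
static int host_request_double(int value)
{
    msg_queue_t dpu_inbox = {0};
    msg_queue_t host_inbox = {0};
    message_t request = { MSG_DOUBLE, HOST_ACTOR_ID, value };
    message_t reply;

    if (!queue_push(&dpu_inbox, request))
        return -1;
    dpu_actor_step(DPU_ACTOR_ID, &dpu_inbox, &host_inbox);
    if (!queue_pop(&host_inbox, &reply) || reply.kind != MSG_RESULT)
        return -1;
    return reply.payload;
}
```

Unlike the functional sketch, neither side calls the other directly: each actor only reads its own inbox and writes to someone else's, so the same structure scales to many actors per DPU.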
These different execution models are made possible by the DPU run-time environment (RTE), which offers a framework to design the proper scheduling models. This environment is the combination of a run-time library (RTL) and a run-time configuration (RTC). The RTL is a collection of C functions to manage memory and synchronization primitives (such as mutual exclusions), plus miscellaneous utilities. To obtain the best performance and the smallest memory footprint, the run-time environment operates on the basis of a static configuration, defined before compile time (rather than dynamically created at run-time). The RTC defines which threads or actors, synchronization primitives, memory mappings, etc. come into play.
In other words, the configuration is "injected" into the program while it is being built, and is referred to at run-time by invoking the library primitives.
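The idea of a configuration fixed at build time and consulted at run-time can be illustrated in plain C. The sketch below is an assumption about what such a scheme could look like, not the actual output of the toolchain: the `tasklet_config_t` layout, the `rtc_tasklets` table and the `rtl_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shape of a build-time configuration; illustration only. */
typedef void (*tasklet_entry_t)(int *shared_counter);

typedef struct {
    const char *name;
    tasklet_entry_t entry;
    size_t stack_bytes;
} tasklet_config_t;

static void producer_entry(int *shared_counter) { *shared_counter += 1; }
static void consumer_entry(int *shared_counter) { *shared_counter *= 10; }

/* The "injected" configuration: a constant table fixed at build time.
 * Nothing is allocated or registered dynamically at run-time. */
static const tasklet_config_t rtc_tasklets[] = {
    { "producer", producer_entry, 256 },
    { "consumer", consumer_entry, 128 },
};
#define RTC_NR_TASKLETS (sizeof rtc_tasklets / sizeof rtc_tasklets[0])

/* Library primitive: look a tasklet up by name in the static table. */
static const tasklet_config_t *rtl_find_tasklet(const char *name)
{
    for (size_t i = 0; i < RTC_NR_TASKLETS; i++)
        if (strcmp(rtc_tasklets[i].name, name) == 0)
            return &rtc_tasklets[i];
    return NULL;
}

/* Library primitive: total stack space required by the configuration. */
static size_t rtl_total_stack_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < RTC_NR_TASKLETS; i++)
        total += rtc_tasklets[i].stack_bytes;
    return total;
}

/* Library primitive: run every configured tasklet once, in table order,
 * and return the final value of the counter they share. */
static int rtl_run_all_tasklets(void)
{
    int counter = 0;
    for (size_t i = 0; i < RTC_NR_TASKLETS; i++) {
        assert(rtc_tasklets[i].entry != NULL);
        rtc_tasklets[i].entry(&counter);
    }
    return counter;
}
```

Because the table is a compile-time constant, the number of tasklets and their memory needs are known before the program runs, which is what allows a small footprint compared with creating threads dynamically.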
Due to this "static run-time definition" paradigm, the process of building a DPU program is slightly different from the traditional "compile/assemble/link" sequence:
Developers first need to define a "static kernel configuration", using a command line interface tool called dpukconfig. This is a fairly basic interactive shell in which programmers describe the RTC with a directive-driven syntax and save the result into a JSON file.
For example, the syntax to declare a thread (called "tasklet" in the RTE) looks like:
Once the RTC is saved, users can compile and link their source code, along with this configuration, using dpucc, which is our "kind of gcc". Behind the scenes, dpucc uses a number of individual tools to create a program:
- A C compiler, based on LLVM (http://llvm.org) with a Clang front-end and a DPU-specific back-end
- The JCPP pre-processor (http://www.anarres.org/projects/jcpp)
- RT Studio, translating run-time configurations to a set of assembly files linked with the rest of the program
- A home-brewed assembler and linker
Testing and debugging
The next step in the development process is to exercise, debug and validate the programs to ensure that they will integrate smoothly into the final application. To achieve this, the DPU toolchain offers the dpuservice, which behaves as a kind of "hub" between high-end development applications and different back-ends, such as simulators or hardware.
The service is a network server, offering a way for external utilities to drive a given type of DPU. Such utilities can load and execute programs, inspect the DPU memories and registers, etc. The service also embeds a debug unit, offering developers an extensive set of operations, such as setting breakpoints, back-tracing and single-stepping.
Such an architectural approach allows us to plug in almost any tool we want during development. Typically, we use the service to run test suites, regularly scheduled by a Jenkins server (https://jenkins.io/). We hope in the near future to use the service infrastructure to support common IDEs, such as JetBrains' CLion (https://www.jetbrains.com/clion/) or Eclipse (https://eclipse.org/). Today, the toolchain provides a command line utility, called dpushell, offering functions that come in handy during development.
Integrating programs into applications
For the final integration, the SDK comes with the foundations of a rich framework that adapts the DPU programming model to the applications' needs and constraints. The basic building block is a lightweight C API and a library that link DPUs and applications together. The programming interface is simple and intuitive, so implementing simple use-cases is straightforward.
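The post does not show the host API itself, so the following is only a sketch of the kind of work an application does when it spreads a job over a group of DPUs: split the input into slices, let each unit process its slice, then gather the partial results. Everything here is hypothetical (`NR_DPUS`, `slice_t`, `mock_dpu_sum`, `host_sum_on_group`), and each DPU is simulated by a plain function call on the host.

```c
#include <assert.h>
#include <stddef.h>

#define NR_DPUS 4  /* hypothetical group size, for illustration only */

typedef struct {
    size_t first;  /* index of the first item in the slice */
    size_t count;  /* number of items in the slice */
} slice_t;

/* Split `total` items as evenly as possible across `nr_dpus` units:
 * the first (total % nr_dpus) units receive one extra item. */
static slice_t slice_for_dpu(size_t total, size_t nr_dpus, size_t dpu_index)
{
    slice_t s;
    size_t base, extra;

    assert(nr_dpus > 0 && dpu_index < nr_dpus);
    base = total / nr_dpus;
    extra = total % nr_dpus;
    s.first = dpu_index * base + (dpu_index < extra ? dpu_index : extra);
    s.count = base + (dpu_index < extra ? 1 : 0);
    return s;
}

/* Stand-in for the program running on one DPU: sum one slice. */
static long mock_dpu_sum(const int *data, slice_t s)
{
    long sum = 0;
    for (size_t i = 0; i < s.count; i++)
        sum += data[s.first + i];
    return sum;
}

/* Host side: distribute the slices, collect one partial result per DPU,
 * then reduce the partial results into the final value. */
static long host_sum_on_group(const int *data, size_t total)
{
    long partial[NR_DPUS];
    long result = 0;

    for (size_t d = 0; d < NR_DPUS; d++)
        partial[d] = mock_dpu_sum(data, slice_for_dpu(total, NR_DPUS, d));
    for (size_t d = 0; d < NR_DPUS; d++)
        result += partial[d];
    return result;
}
```

On real hardware the first loop would run in parallel across the DPUs; the slicing and gathering logic on the host stays the same whatever the group size.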
This concludes our quick tour of the DPU SDK. In the next posts, we will open the boxes and look at the internals of the toolchain components. We will see how the run-time environment is constructed in practice, how dpucc builds the programs, etc. Stay tuned 🙂