Graphics processors (GPUs) suck bits out of SDRAMs the way vampires do what comes naturally to them in the immensely popular Twilight book series by Stephenie Meyer. In other words, GPUs need all the memory bandwidth they can get and current SDRAMs can provide graphics subsystems with large amounts of memory at the lowest cost per bit and excellent bandwidth—but only if the memory controller has some understanding of the GPU’s specific memory-usage needs. That’s a design challenge Vivante faced when developing the ScalarMorphic Architecture for its latest line of multi-threaded, multi-core GPUs, which are designed into products ranging from smartphones to home-entertainment products.
The important thing to know about GPUs with respect to memory accesses is that the real-time nature of graphics applications requires that GPUs have a far more intense and intimate relationship with memory. There tend to be more memory clients at work within the GPU (perhaps five times more clients than for a general-purpose CPU) and these clients generate shorter bursts and more random accesses than general-purpose processors tend to generate. There are also more read/write conflicts and—specific to SDRAM accesses—more bank conflicts.
Here’s a graphic comparison of the number of memory clients in a CPU versus in a GPU, which clearly illustrates the difference:
The Vivante ScalarMorphic Architecture is a multi-core GPU design with multiple shaders, texture engines, rendering engines, and pixel engines in its 3D pipeline. There’s also a 2D pipeline and all of these elements can generate a variety of memory accesses. Here’s a block diagram of the ScalarMorphic Architecture:
Vivante’s approach to optimizing the GPU-SDRAM interface includes an “ultra-threaded” design (as many as 1000 threads per core) to increase tolerance for memory latency. Other optimizations include coalescing multiple memory requests to reduce the number of discrete memory accesses, maximizing data locality, and using compression and caching to reduce memory-bandwidth needs. But it became clear to the designers at Vivante that an optimized SDRAM controller could deliver still more performance.
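To make request coalescing a little more concrete, here is a minimal Python sketch. It is not Vivante’s implementation: the `(start, length)` request format, the 64-byte burst size, and the `coalesce` function name are all assumptions chosen for illustration. The idea is simply that several small requests landing in the same or neighboring bursts can be served by fewer, larger accesses.

```python
from typing import List, Tuple

# Hypothetical request format: (start_address, length_in_bytes).
Request = Tuple[int, int]


def coalesce(requests: List[Request], burst_bytes: int = 64) -> List[Request]:
    """Merge requests whose burst-aligned ranges overlap or touch.

    burst_bytes is an assumed burst size, not a figure from any datasheet.
    """
    # Round every request outward to burst boundaries.
    aligned = []
    for start, length in requests:
        lo = (start // burst_bytes) * burst_bytes
        hi = -(-(start + length) // burst_bytes) * burst_bytes  # ceiling
        aligned.append((lo, hi))

    # Sort by start address, then merge overlapping or adjacent ranges.
    aligned.sort()
    merged: List[Tuple[int, int]] = []
    for lo, hi in aligned:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))

    # Convert back to (start, length) form.
    return [(lo, hi - lo) for lo, hi in merged]


# Four small client requests collapse into two burst-aligned accesses.
print(coalesce([(0, 16), (16, 16), (48, 32), (256, 8)]))
# -> [(0, 128), (256, 64)]
```

A real GPU would do this in hardware, within a limited window of pending requests, but the bookkeeping is the same in spirit.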
Now if you’re not very familiar with SDRAM controllers, you might think there’s not much difference in controller designs. After all, they just take memory requests and feed them to the SDRAM. Right? Not so. SDRAMs actually have pretty complex access protocols these days and there are three ways to execute these protocols: efficiently, inefficiently, and wrong.
Exercise an SDRAM incorrectly—outside of the specified protocols—and you can lose data. (Note: That’s a bad thing.)
Exercise an SDRAM inefficiently, by scheduling accesses without regard to a few of the SDRAM’s access-timing characteristics, and you’ll lose a lot of the raw memory bandwidth you’ve paid for. (Note: That’s not so good either.)
So you want to be protocol-efficient to get your money’s worth from the SDRAM, but efficiency actually depends on the workload characteristics. Some accesses can overlap others. Some cannot. So it makes sense to optimize the memory controller for the task.
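A toy model shows why the order of accesses matters so much. The Python sketch below assumes a simplified SDRAM in which each bank keeps one row open: hitting the open row is cheap, and switching rows is expensive. The cycle counts, the `(bank, row)` request format, and the function names are illustrative assumptions, not figures from any SDRAM datasheet or from the Vivante or Cadence designs. The model also serializes every access and ignores read/write ordering hazards, both of which a real controller must handle.

```python
from typing import Dict, List, Tuple

# Hypothetical request format: (bank, row).
Access = Tuple[int, int]

ROW_HIT_CYCLES = 4    # assumed cost when the requested row is already open
ROW_MISS_CYCLES = 20  # assumed cost of precharge + activate + access


def cost(accesses: List[Access]) -> int:
    """Total cycles for a stream of accesses under the toy row-buffer model."""
    open_rows: Dict[int, int] = {}  # bank -> currently open row
    total = 0
    for bank, row in accesses:
        if open_rows.get(bank) == row:
            total += ROW_HIT_CYCLES
        else:
            total += ROW_MISS_CYCLES
            open_rows[bank] = row
    return total


def group_by_row(accesses: List[Access]) -> List[Access]:
    """Reorder so accesses to the same (bank, row) are issued back to back.

    Groups keep the order in which each (bank, row) first appeared.
    """
    groups: Dict[Access, List[Access]] = {}
    for access in accesses:
        groups.setdefault(access, []).append(access)
    return [access for group in groups.values() for access in group]


arrival_order = [(0, 1), (0, 2), (0, 1), (0, 2)]  # ping-pongs between rows
print(cost(arrival_order))                # -> 80 (every access is a row miss)
print(cost(group_by_row(arrival_order)))  # -> 48 (two misses, two hits)
```

The same four requests cost very different amounts depending only on the order in which they reach the SDRAM, which is why a controller tuned to the workload can recover bandwidth that a generic one leaves on the table.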
From Vivante’s perspective, a good DDR memory controller:
- Allows multiple clients (GPU cores, CPU cores, DSP cores, and other logic) to share one SDRAM array
- Delivers low latency for critical transactions
- Delivers high bandwidth for large transfers
- Reorders transactions to maximize SDRAM bandwidth (a gain of 30% is possible here)
- Uses the SDRAM’s low-power modes and access methods where possible to reduce SDRAM power consumption
For these reasons, Vivante collaborated with Cadence to optimize the design of the configurable Cadence DDR memory controller core for Vivante’s graphics application. Specifically, the controller was customized to allow out-of-order reads and writes to hide SDRAM latency and to make efficient use of available memory bandwidth, to break memory objects up and distribute them to multiple banks, and to perform GPU-aware memory allocation. These modifications resulted in a 34% performance improvement for a read-oriented test, an 83% performance improvement for a write-oriented test, and a 22% performance improvement for an application-oriented test.
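Breaking memory objects up across banks can be sketched with two address-to-bank mappings. The Python example below is an assumption-laden illustration, not the mapping used in the Cadence controller: the bank count, the 256-byte stripe, the 1 MiB per-bank region, and all function names are hypothetical. It compares a linear mapping, where a small object sits entirely in one bank, with an interleaved mapping, where the same object is striped across every bank so that accesses to it can proceed in parallel.

```python
from typing import Callable, Set

NUM_BANKS = 8          # assumed number of SDRAM banks
STRIPE_BYTES = 256     # assumed interleave granularity
BANK_BYTES = 1 << 20   # assumed size of each bank's region in the linear map
ACCESS_BYTES = 64      # assumed size of one access


def linear_bank(address: int) -> int:
    """Linear mapping: each bank owns one large contiguous address region."""
    return (address // BANK_BYTES) % NUM_BANKS


def interleaved_bank(address: int) -> int:
    """Interleaved mapping: consecutive stripes rotate through the banks."""
    return (address // STRIPE_BYTES) % NUM_BANKS


def banks_touched(base: int, size: int, mapper: Callable[[int], int]) -> Set[int]:
    """Set of banks hit when reading `size` bytes starting at `base`."""
    return {mapper(addr) for addr in range(base, base + size, ACCESS_BYTES)}


# A 2 KiB object: one bank under the linear map, all eight when interleaved.
print(sorted(banks_touched(0, 2048, linear_bank)))       # -> [0]
print(sorted(banks_touched(0, 2048, interleaved_bank)))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Spreading an object across banks gives the controller more independent banks to keep busy, which is one way out-of-order scheduling can hide the latency of opening a new row.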
Note that it’s the collaboration between Vivante and Cadence that produced this significant improvement in overall performance. Optimizing the GPU and memory-controller cores independently might never have achieved such a large improvement, because it is the interlocking improvements to both core designs, working with the SDRAM’s innate characteristics, that deliver the performance boost.