Multicore SoC and processor designs were our solution to the death of Dennard Scaling when IC process geometries dropped below 90nm, when processor speeds hit 3GHz, and when processor power consumption went off the charts. Since 2004, we’ve transformed Moore’s Law into a processor-core replicator, spending transistors on more processor cores rather than bigger, smarter, faster processor cores. But there’s a storm brewing once more, heralded by the dismal utilization of supercomputers that run hundreds to hundreds of thousands of processors in parallel. Currently, per-core processor utilization in supercomputers is less than 10% and falling due to memory and I/O limitations. If we don’t want the same thing to happen to our multicore SoC designs, we need to find a new path that allows processor utilization to scale along with processor core count.
That’s how Samplify CTO and founder Al Wegener opened last week’s presentation to the Santa Clara Chapter of the IEEE Computer Society. Wegener’s company specializes in data compression, which happens to be a way to get more bandwidth out of your existing memory and memory interfaces. But before he cranked up his memory-compression sales pitch, Wegener covered a lot of useful ground that every SoC designer needs to know.
The bedrock foundation of any processor-memory discussion is the memory hierarchy and Wegener provided this handy chart:
At the bottom left of the chart is the fastest memory available to processors: registers. Registers may not seem to be much like memory, but they are. Registers are simply very fast memory locations with unique instruction-access modes. Processors need and have multiport access to registers so that different parts of the processor’s execution pipeline have unimpeded access to register contents. The number of processor registers has grown over the years from seven 8-bit registers in the Intel 8008 (the second commercial microprocessor) to dozens of 32-bit general-purpose and wide specialized registers in today’s 32- and 64-bit processor cores. However, fast multiport access to all of these registers comes at a steep price in terms of silicon area and routing congestion, so the number of processor registers is inherently limited.
That’s why there’s cache. Cache-memory access is almost as fast as register access, but you can have a lot of cache memory—comparatively. L1 caches typically have 32 or 64Kbyte capacities and often there are independent caches for instructions and data. Caches take a huge load off main memory, relieving the pressure to speed up main-memory access.
To a point.
Large data sets can easily outgrow L1 caches, which are size-limited because they must be as fast as the processor core. That’s why there are L2 caches. That’s also why there are L3 caches. Each step up the cache hierarchy is bigger—and slower. Finally, you get to main memory, which seems to be missing from Wegener’s chart but I can assure you it’s there. Main memory is usually implemented with SDRAM these days and today’s DDR SDRAM memories need hundreds of processor clock cycles to respond to memory transaction requests, so you really need caches. That’s just the way processor-based system design has evolved. Everyone would really love DRAMs that ran as fast as processor cores. Not yet.
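To put rough numbers on why caches matter so much, here is a minimal Python sketch of the standard average-memory-access-time calculation. The latencies and hit rates are illustrative assumptions of mine, not figures from Wegener’s talk, and the function name is hypothetical.

```python
def average_access_cycles(levels, memory_cycles):
    """Average cycles per memory access for a cache hierarchy.

    levels: list of (hit_cycles, hit_rate) tuples ordered L1, L2, L3, ...
    memory_cycles: main-memory latency, paid only when every cache misses.
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far down
    for hit_cycles, hit_rate in levels:
        # Every access that reaches this level pays its lookup latency.
        total += reach_probability * hit_cycles
        # Only the misses continue to the next level.
        reach_probability *= 1.0 - hit_rate
    return total + reach_probability * memory_cycles


# Assumed, illustrative numbers (not measurements):
# L1: 4 cycles, 90% hits; L2: 12 cycles, 80% hits; L3: 40 cycles, 50% hits.
hierarchy = [(4, 0.9), (12, 0.8), (40, 0.5)]
dram_cycles = 200  # "hundreds of processor clock cycles"

print(round(average_access_cycles(hierarchy, dram_cycles), 2))
print(average_access_cycles([], dram_cycles))  # no caches at all
```

With these assumed values the average access costs roughly 8 cycles instead of 200, and that gap is the whole argument for the cache hierarchy. It also shows the weak spot: let the hit rates drop because the data set outgrows the caches, and the average climbs quickly back toward DRAM latency.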
Once you get beyond caches, you generally need to go off chip. That’s where the fun begins. The Intel 8008 microprocessor came in an 18-pin package and ran off a 500kHz clock. All memory was off chip (no cache) and the processor’s instructions executed in 10 to 22 clock cycles. At that time, the memory wall was very far away. The wall got closer each time processors got faster and wider.
Today, multicore processor designs have a voracious appetite for memory bandwidth and we’ve taken several steps to feed that appetite.
The first step we’ve taken is to add pins so that we can connect wider memory arrays to the processor. Here’s a chart showing the historic rise in the number of pins for Intel x86 processors.
The chart shows the rise from the few hundred pins needed by the Pentium III in 2000 to more than 2000 pins for the six-core Intel Core i7-3960X in 2011. (I think Wegener’s graph has swapped the red and blue curve colors. The red curve seems to match the bandwidth scale and the blue curve seems to match the pin-count scale.)
You can’t miss the memory-bandwidth jump from 2007 to 2008, when the effect of having four or more processor cores kicked in. Multicore architectures have a big problem with memory bandwidth.
Lest you get the impression that this problem only afflicts Intel CPUs, or even CPUs in general, take a look at this chart, which shows the drop in per-core PCIe and GDDR SDRAM bandwidth as the number of cores in Nvidia GPUs has increased:
GPUs are walled in as well.
The next steps taken to overcome the memory wall included faster memory-I/O pins (the rising clock rate of SDRAMs from SDR to DDR 2/3/4, for example), more cache to relieve the pressure on main memory, and 3D IC assembly permitting many hundreds of memory-dedicated I/O pins (as with Wide-I/O DRAM). The ultimate attempt may be optical interconnect, abandoning electron-based interconnects completely.
Wegener’s approach, through Samplify, is to compress the data so you don’t need to move as many bits between the processors and memory. Hence the Samplify slogan: “…simply the bits that matter.” Wegener’s argument is pretty simple. Assume, for example, you could compress the data stream by 2x, removing half of the bits from the memory stream. That’s the same thing as doubling the existing memory bandwidth, which is like doubling the DDR data rate once again or doubling the width of the memory interface. What does it cost to do that? About 100K gates, says Wegener. At current lithographies, that fits under a bond pad, he adds.
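The arithmetic behind that claim is easy to check. The sketch below is mine, not Samplify’s: the function names are hypothetical, and the 64-bit, 1600MT/s interface is just an assumed example. It also ignores compressor latency and assumes the full ratio is achieved on every transfer.

```python
def raw_bandwidth_mbytes(bus_bits, mtransfers_per_s):
    """Peak bandwidth in Mbytes/s for a memory interface.

    bus_bits: data-bus width in bits (assumed to be a multiple of 8).
    mtransfers_per_s: transfer rate in millions of transfers per second.
    """
    return (bus_bits // 8) * mtransfers_per_s


def effective_bandwidth_mbytes(bus_bits, mtransfers_per_s, compression_ratio):
    """Uncompressed data delivered per second when every transfer carries
    data compressed by compression_ratio (2.0 means half the bits)."""
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be at least 1.0")
    return raw_bandwidth_mbytes(bus_bits, mtransfers_per_s) * compression_ratio


# Assumed example interface: 64 bits wide at 1600 MT/s.
baseline = raw_bandwidth_mbytes(64, 1600)
compressed = effective_bandwidth_mbytes(64, 1600, 2.0)
doubled_rate = raw_bandwidth_mbytes(64, 3200)    # twice the data rate
doubled_width = raw_bandwidth_mbytes(128, 1600)  # twice the bus width

print(baseline, compressed, doubled_rate, doubled_width)
```

Under those assumptions, 2x compression on the existing interface delivers the same payload as either doubling the data rate or doubling the bus width, without adding a single pin.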
The 100K gates are used to construct a hardware data compressor with three operating modes: lossless, fixed-ratio or fixed-rate (settable in steps of 0.05), and fixed-quality (settable in steps of 0.05 dB). You crank the setting up until it starts to hurt, then back off one click.
Now before you go ballistic about lossy compression, consider some really interesting facts. A lot of data comes from real-world transducers and A/D converters that capture anywhere from 8 to 20 bits of real data. (OK, 24 bits if you’ve got a really expensive converter.) These data samples tend to get stuck in 32-bit locations with anywhere from 12 to 24 irrelevant bits per sample (likely zeroes). You don’t really want to spend precious memory bandwidth or consume power to ship around a bunch of zeroes that will vaporize upon arrival. It’s easy to see that lossy compression applied to these data samples could drop a lot of bits but lose not a single significant bit (…simply the bits that matter).
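To see how many bits are simply padding, here is a small Python sketch that packs 12-bit samples back to back instead of parking each one in a 32-bit word. This is plain bit packing, used here as a stand-in; it is not Samplify’s algorithm, and the function names and sample values are my own assumptions.

```python
def pack_samples(samples, bits):
    """Pack unsigned samples into bytes using only `bits` bits per sample."""
    acc = 0
    for s in samples:
        if not 0 <= s < (1 << bits):
            raise ValueError("sample does not fit in the given bit width")
        acc = (acc << bits) | s  # append this sample's bits to the stream
    total_bits = bits * len(samples)
    pad = (-total_bits) % 8      # zero bits needed to reach a byte boundary
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_samples(data, bits, count):
    """Recover `count` samples of `bits` bits each from packed bytes."""
    acc = int.from_bytes(data, "big")
    acc >>= len(data) * 8 - bits * count  # drop the byte-alignment padding
    mask = (1 << bits) - 1
    return [(acc >> (bits * (count - 1 - i))) & mask for i in range(count)]


# Eight made-up 12-bit A/D samples.
samples = [0, 1, 4095, 2048, 1234, 77, 3000, 9]
packed = pack_samples(samples, 12)

stored_as_words = 4 * len(samples)  # bytes if each sample sits in 32 bits
print(stored_as_words, len(packed))
print(unpack_samples(packed, 12, len(samples)) == samples)
```

Here eight samples shrink from 32 bytes to 12, and every sample comes back bit for bit. A real compressor does far more than strip padding, but even this crude step shows how much memory traffic can be nothing but filler.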
Need more proof? What do you think you’ve been listening to for the last 10 years? Compressed music and VoIP phone calls. What have you been watching for the past 5 years? Compressed video.
Does compression really work in more than just these situations? Go ask Samplify. Wegener made me a believer.