AI has a bloat problem. Some researchers think it can be solved by rearranging the way computer processors fit together.
At Hot Chips last week, the annual Silicon Valley technology symposium, researchers from the University of Wisconsin-Madison and startup SimpleMachines described what they see as a necessary change in the way computer hardware is put together. As with many things in high-end computing, the driver for this change is a familiar one: artificial intelligence.
Karu Sankaralingam, a computer science professor at the University of Wisconsin-Madison, argued in his talk on the Mozart architecture that AI models are ballooning. In language models in particular, growth now outpaces Moore’s law by a factor of ten. In that environment, it is no surprise to see wafer-scale processors like Cerebras’s show up at the same conference. There is one reason why what Stanford University’s HAI calls foundation models are so large: they work better than smaller models. At the moment there is no strong incentive to optimize them for efficiency, although it is quite clear that there is a lot of redundancy in the neural connections. Some experiments on the now rather small BERT architecture showed that you can cut huge chunks out of a trained model almost at random and it will still yield acceptable results.
However, models are also growing for a different reason, Sankaralingam says: the architectures in use today just happen to run efficiently on the graphics processing units (GPUs) that are still primarily used to train and run them. Other architectures could yield good results while using far fewer teraflops, he says. The disadvantage? They are not well suited to GPUs or the other types of accelerator used for these applications. There are optimizations that significantly reduce the number of computations, even within the convolutional layers that sparked the deep learning revolution. Depth-wise convolution, a technique that breaks the matrix multiplications into smaller chunks, is quite widely used for inference in embedded systems because it can produce comparable results with up to ten times fewer operations. Unfortunately, according to the UW-Madison researchers, it does not run any faster than a normal convolution on GPUs, because the overhead of getting data on and off the chip dominates the performance equation.
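The operation savings mentioned above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the common depthwise-separable form (a per-channel spatial filter followed by a 1x1 pointwise convolution), which the article does not spell out, and the function names and layer sizes are hypothetical.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution
    producing an h x w output with c_out channels."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for the assumed depthwise-separable form:
    one k x k filter per input channel, then a 1 x 1 pointwise mix."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


def reduction_factor(h, w, c_in, c_out, k):
    """How many times fewer multiply-accumulates the separable form needs."""
    return (standard_conv_macs(h, w, c_in, c_out, k)
            / depthwise_separable_macs(h, w, c_in, c_out, k))


# Hypothetical layer: 56 x 56 output, 128 -> 256 channels, 3 x 3 kernel.
factor = reduction_factor(56, 56, 128, 256, 3)
print(f"{factor:.2f}x fewer multiply-accumulates")
```

For a 3x3 kernel the saving approaches, but never reaches, a factor of nine as the channel count grows. The count says nothing about data movement, which is the cost the researchers say dominates on GPUs.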
The need to shuffle data is the major obstacle to efficient machine learning. A series of articles has underlined how much energy it takes to fetch, move and store data in memory. Actually performing a multiplication, even at high speed, consumes far less energy. Moving data also takes time. Researchers have been talking about the memory wall for decades, and the wall has only grown taller since it first appeared. The problem is that computing largely boils down to three things: deciding where the data comes from, deciding what to do with it, and deciding where to put the results once that is done.
Traditionally, all these functions have been combined in the dominant processor architecture: the von Neumann machine. But it was designed for an earlier era when transistors were expensive and memory was at least as fast as the logic gates in the instruction pipeline. In that setting, the idea of executing several instructions to pull data from memory into local registers, do something with it, and then write the results back with yet another instruction made a lot of sense.
Now the execution units are often starved of data. The bottleneck is getting the data in and out. This is where UW-Madison’s Mozart comes in. It splits the computer up to match these three phases and adds two more to complete the set of operations a full processor needs: synchronization, which prevents threads from corrupting each other’s data, and control, such as branches. In fact, control is handled by an ordinary microprocessor, as von Neumann machines are perfectly good at that job, as billions of embedded microcontrollers have demonstrated over the years.
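The three data phases can be pictured as separate stages that hand streams to one another. The following is a toy software sketch only, not the Mozart design: the function names are hypothetical, memory is modeled as a Python list, and the synchronization and control phases are left out.

```python
def gather(memory, indices):
    """Gather phase: decide where operands come from and stream them out."""
    return [memory[i] for i in indices]


def compute(stream_a, stream_b):
    """Compute phase: operate only on operands that have already arrived.
    Element-wise multiplication is used purely as an example operation."""
    return [a * b for a, b in zip(stream_a, stream_b)]


def scatter(memory, indices, values):
    """Write-back phase: decide where each result goes."""
    for i, v in zip(indices, values):
        memory[i] = v


def run(memory, src_a, src_b, dst):
    """Chain the three phases. In hardware they could overlap as a pipeline;
    here they simply run one after another."""
    a = gather(memory, src_a)
    b = gather(memory, src_b)
    scatter(memory, dst, compute(a, b))
    return memory


# Multiply memory[0] by memory[2] and memory[1] by memory[3],
# writing the products to slots 4 and 5.
print(run([1, 2, 3, 4, 0, 0], [0, 1], [2, 3], [4, 5]))
```

The point of the separation is that the compute stage never issues an address itself, so it has no reason to stall on a memory access.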
Mozart looks sensible, and at this point you might wonder, “Why aren’t computers designed that way already?” In fact, some already are. Mozart is an attempt to formalize something that has already evolved in machine learning and signal processing circles. The core of the machine is a coarse-grained reconfigurable architecture (CGRA): basically, a number of execution units that pass data to each other over programmable interconnect. In applications such as radar processing, field-programmable gate arrays (FPGAs) made by Intel PSG and Xilinx have been doing that kind of work for decades. Google’s Tensor Processing Unit (TPU) is based on a systolic array with a fairly fixed forwarding network.
One problem shared by the FPGA and the hardwired systolic array is that they work brilliantly on dense arrays but utilization plummets when they are faced with sparse structures, which GPUs also find problematic. This is where Mozart’s data-gather engine comes in handy. It looks ahead in the sequence of operations, reads the data, and rearranges it into a pattern that will flow neatly through the CGRA. This is not a radical departure either. Signal processing specialists such as Ceva have used scatter-gather memory controllers to feed their own highly parallel execution pipelines, although they do not yet use a CGRA structure. A major drawback of the CGRA is that it is not as hardware-efficient: programmable interconnect is often expensive. There is another obstacle for FPGA-like architectures: they are difficult to program because, although Xilinx has worked hard to make its tools more accessible to programmers, the environment remains rather hardware-oriented.
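To illustrate why rearranging sparse data helps, here is a minimal sketch of a sparse matrix-vector product in which a gather step packs the operands each row needs into a contiguous buffer before a plain dense dot product runs. The compressed sparse row (CSR) layout and all function names are assumptions chosen for illustration; the article does not describe how Mozart's gather engine is actually implemented.

```python
def pack_row(col_indices, x):
    """Gather step: copy only the entries of x that a sparse row touches
    into a contiguous buffer, in the order the row's values are stored."""
    return [x[j] for j in col_indices]


def dense_dot(a, b):
    """Compute step: a dense dot product with no indexing decisions."""
    return sum(p * q for p, q in zip(a, b))


def spmv_csr(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product over a CSR matrix (assumed layout)."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        packed_x = pack_row(col_idx[start:end], x)
        y.append(dense_dot(values[start:end], packed_x))
    return y


# CSR form of the matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1, 2, 3, 4, 5]
print(spmv_csr(row_ptr, col_idx, values, [1, 2, 3]))
```

All the irregular addressing lives in `pack_row`, so the arithmetic stage only ever sees dense, equal-length streams.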
This is where the Mozart team thinks it could have an advantage: it is developing a software stack that works directly on the model’s source code and generates not so much a file full of instructions as a list of data streams for the gather engines. “Program synthesis and automatic generation of the software stack are essential for future chips,” says Sankaralingam. “The compiler looks at the semantics of the program and divides it into the four broad classes of activities.”
The theory is that the team’s approach will make it easier to take new models written with readily accessible languages and libraries and generate efficient programs for something like Mozart, without the manual tuning that today pushes AI developers toward tried-and-tested off-the-shelf kernels. In principle, the decomposition process used by the compiler could be applied to more conventional architectures, but the team believes its hardware is better suited to this approach than most.
While it does not outperform an Nvidia A100 on any of the usual DNN structures, the 16nm Mozart implementation seems to hold its own across a variety of model types against the commercial device, which is made on a process a few generations more advanced. An upcoming implementation targeting a 7nm process should deliver higher performance.
In practice, many of the pieces that make up Mozart are already in use today. But a reformulation of the computer’s core attributes along the lines of the UW-Madison team’s, coupled with a persistent drive for better AI, may eventually break up the von Neumann machine.