IBM takes the bits out of deep learning
IBM Research has unveiled plans to build a prototype chip specialised for artificial intelligence (AI). The chip, which IBM has named Artificial Intelligence Unit (AIU), marks the first complete system on a chip from the IBM Research AI Hardware Centre.
In a blog post discussing the new chip, IBM researchers wrote: “We’re running out of computing power. AI models are growing exponentially, but the hardware to train these behemoths and run them on servers in the cloud or on edge devices like smartphones and sensors hasn’t advanced as quickly.”
IBM’s plan, based on research from 2019, is to reduce the complexity of the chips used for AI processing. The researchers said the flexibility and high precision of central processing units (CPUs) have made these chips well suited to general-purpose software applications, but this flexibility also puts them at a disadvantage when it comes to training and running deep learning models, which require massively parallel AI operations.
IBM is taking two approaches with its alternative to traditional CPUs. First, it said it is developing an application-specific integrated circuit (ASIC), which uses significantly fewer binary bits (less precision) than the 32-bit arithmetic used in general-purpose computing. The ASIC’s main task involves matrix and vector multiplication, which IBM said are the primary calculations required in AI.
In a paper published in 2019, IBM researchers presented an approach to simplifying the processing required to perform the so-called “dot product” calculations used in deep learning algorithms. Such computations involve multiplying pairs of floating-point numbers and accumulating the results into partial sums.
The researchers said much of the work involved in “reduced-precision deep learning” is achieved by approximating the data in the multiplication part of the computation. But the accumulation part is left at 32 bits.
According to IBM, the precision of the accumulation part of the computation cannot simply be reduced, because doing so can result in severe training instability and degradation in model accuracy. In the paper, the researchers suggested a theoretical approach to achieving ultra-low-precision hardware for deep neural network (DNN) training. This is among the areas of research IBM has drawn upon to develop the AIU hardware.
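The split between multiplication and accumulation can be illustrated with a minimal Python sketch. It is not IBM's implementation: the helper names `round_to` and `dot` are hypothetical, and the choice of 16-bit products is an assumption made for illustration, since the formats used in the paper are not stated here. The sketch simulates a dot product whose products are rounded to 16 bits, and compares a 32-bit running sum with a 16-bit one.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the nearest value representable in the given
    struct format: 'e' is a 16-bit float, 'f' is a 32-bit float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def dot(xs, ws, mul_fmt="e", acc_fmt="f"):
    """Simulated dot product: inputs and products are rounded to mul_fmt,
    and the running sum is rounded to acc_fmt after every addition."""
    acc = 0.0
    for x, w in zip(xs, ws):
        product = round_to(mul_fmt, round_to(mul_fmt, x) * round_to(mul_fmt, w))
        acc = round_to(acc_fmt, acc + product)
    return acc


# Hypothetical test data: 4,096 products that are each exactly 1.0.
xs = [1.0] * 4096
ws = [1.0] * 4096

full = dot(xs, ws, acc_fmt="f")  # 16-bit products, 32-bit accumulation
low = dot(xs, ws, acc_fmt="e")   # 16-bit products, 16-bit accumulation

print("32-bit accumulator:", full)
print("16-bit accumulator:", low)
```

In this example, the 32-bit accumulator returns the exact total of 4,096, while the 16-bit accumulator stops growing once it reaches 2,048, because a 16-bit float cannot represent 2,049 and each further addition is rounded away. It is a toy case, but it suggests why accumulation precision is harder to cut than multiplication precision.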
In the blog post discussing the AIU, IBM said: “An AI chip doesn’t have to be as ultra-precise as a CPU. We’re not calculating trajectories for landing a spacecraft on the moon or estimating the number of hairs on a cat. We’re making predictions and decisions that don’t require anything close to that granular resolution.”
Using a technique called “approximate computing”, IBM said it can drop from 32-bit floating-point arithmetic to bit formats that hold a quarter as much information. “This simplified format dramatically cuts the amount of number-crunching needed to train and run an AI model, without sacrificing accuracy,” IBM claimed.
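A quarter of 32 bits is eight bits, so the storage saving can be made concrete with a short Python sketch. It uses simple symmetric integer quantisation, a generic technique swapped in for illustration; this is an assumption and not the number format IBM uses in the AIU. The function names and sample weights are hypothetical.

```python
from array import array


def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale
    (symmetric linear quantisation)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = array("b", (round(v / scale) for v in values))
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 8-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights, stored once at 32 bits and once at 8 bits.
weights = [0.5, -1.25, 0.031, 2.0, -0.75, 1.1]
full = array("f", weights)
codes, scale = quantize_int8(weights)

bytes_full = full.itemsize * len(full)
bytes_small = codes.itemsize * len(codes)

restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("32-bit storage:", bytes_full, "bytes")
print("8-bit storage:", bytes_small, "bytes")
print("largest rounding error:", max_error)
```

The 8-bit copy takes a quarter of the bytes, and each restored weight differs from the original by at most half of one quantisation step. Whether that error is acceptable depends on the model, which is the trade-off approximate computing relies on.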
Second, the AIU chip is being designed so that its circuits streamline AI workflows by sending data directly from one compute engine to the next.
Specialised processing units designed for AI workloads are nothing new. Companies such as Nvidia and AMD have reaped the benefits of the specialised cores that their graphics processing units (GPUs) offer to streamline machine learning. Ultimately, though, the GPU was designed around the maths involved in manipulating graphics, using a highly parallel computing architecture with hundreds, if not thousands, of cores. For instance, the 21 billion-transistor Nvidia Titan V supercomputing GPU, launched in 2017, had 5,120 single-precision CUDA cores.
Theoretically, an ASIC can be designed to be wholly focused on optimising one type of workload. In IBM’s case, this is training deep learning networks for AI applications.
When McKinsey looked into AI acceleration for training AI models at the end of 2017, it estimated that within datacentre computing, ASICs would account for 50% of workloads by 2025, with GPUs taking 40%. At the edge, it forecast that ASICs would be used for 70% of workloads by 2025.
But the line between ASICs and GPUs is blurring. For instance, Nvidia’s DGX A100 AI training engine offers Tensor cores within its GPU architecture.
Describing the AIU, IBM said: “Our complete system-on-chip features 32 processing cores and contains 23 billion transistors – roughly the same number packed into our z16 chip. The IBM AIU is also designed to be as easy to use as a graphics card. It can be plugged into any computer or server with a PCIe slot.”
IBM has positioned the AIU as “easy to plug-in as a GPU card”, suggesting it hopes to offer a viable alternative to GPU-based AI accelerators. “By 2029, our goal is to train and run AI models 1,000 times faster than we could three years ago,” said the company.