AI accelerator

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Fmadd (talk | contribs) at 01:46, 30 November 2016. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

An AI accelerator is (as of 2016) an emerging class of microprocessor designed to accelerate artificial neural networks, machine vision and other machine learning algorithms for robotics, the internet of things and other data-intensive or sensor-driven tasks.[1] AI accelerators are frequently manycore designs (mirroring the massively parallel nature of biological neural networks). They target practical narrow AI applications rather than artificial general intelligence research. Many vendor-specific terms exist for devices in this space.

AI accelerators are distinct from GPUs (which are commonly used for the same role) in that they lack any fixed-function units for graphics and generally focus on low-precision arithmetic.

History

Computer systems have frequently complemented the CPU with special-purpose accelerators for intensive tasks, most notably graphics, but also sound and video. Over time, various accelerators have appeared that are applicable to AI workloads.

Early attempts

DSPs (such as the AT&T DSP32C) were used early on as neural network accelerators, e.g. to accelerate OCR software,[2] and there were attempts to create parallel high-throughput systems for workstations (e.g. TetraSpert in the 1990s, a parallel fixed-point vector processor[3]) aimed at various applications, including neural network simulations.[4] ANNA was a neural net CMOS accelerator developed by Yann LeCun.[5] Another attempt to build a neural net workstation was Synapse-1[6] (not to be confused with the current IBM SyNAPSE project).

Heterogeneous computing

Architectures such as the Cell microprocessor (itself inspired by the PS2 vector units, one of which was tied more closely to the CPU for general-purpose work) have exhibited features that overlap significantly with AI accelerators: support for packed low-precision arithmetic, a dataflow architecture, and prioritising throughput over latency and "branchy-int" code. This was a move toward heterogeneous computing, with a number of throughput-oriented accelerators intended to assist the CPU with a range of intensive tasks: physics simulation, AI, video encoding/decoding, and certain graphics tasks beyond the reach of contemporary GPUs.

The physics processing unit was yet another attempt to fill the gap between CPU and GPU in PC hardware. However, physics tends to require 32-bit precision and up, whilst much lower precision can be a better tradeoff for AI.[7]

CPUs themselves have gained increasingly wide SIMD units (driven by video and gaming workloads) and more cores, both to eliminate the need for a separate accelerator and to accelerate application code. These SIMD units tend to support packed low-precision data types.[8]
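
As a rough illustration of what "packed" low-precision data means, the following Python sketch stores four unsigned 8-bit lanes in one 32-bit word and adds them lane by lane in a single pass. It is a software imitation of what a SIMD unit does in hardware, not a description of any particular instruction set; the function names and the 4x8-bit layout are assumptions chosen for the example.

```python
# Illustrative sketch only: four unsigned 8-bit lanes packed into one
# 32-bit word and added lane-wise (wrapping), imitating a SIMD add.
# The helper names and the 4 x 8-bit layout are hypothetical.

LOW7 = 0x7F7F7F7F   # low 7 bits of every lane
HIGH1 = 0x80808080  # top bit of every lane

def pack_u8x4(lanes):
    """Pack four integers in [0, 255] into one 32-bit word, lane 0 lowest."""
    assert len(lanes) == 4 and all(0 <= v <= 255 for v in lanes)
    return lanes[0] | (lanes[1] << 8) | (lanes[2] << 16) | (lanes[3] << 24)

def unpack_u8x4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def add_u8x4(x, y):
    """Lane-wise wrapping add; carries never cross a lane boundary."""
    # Add the low 7 bits of each lane; any carry lands in that lane's bit 7.
    low = (x & LOW7) + (y & LOW7)
    # Fold in the top bits with XOR so nothing spills into the next lane.
    return (low ^ ((x ^ y) & HIGH1)) & 0xFFFFFFFF

a = pack_u8x4([1, 2, 3, 4])
b = pack_u8x4([10, 20, 30, 40])
total = unpack_u8x4(add_u8x4(a, b))
```

Narrower lanes mean more values processed per word, which is one reason low-precision types are attractive for throughput-oriented workloads.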

Use of GPGPU

Innovative software appeared that used vertex and pixel shaders for general-purpose computation through rendering APIs, storing non-graphical data in vertex buffers and texture maps (including implementations of convolutional neural networks for OCR[9]).[10] Vendors of graphics processing units subsequently saw the opportunity and generalised their shader pipelines with specific support for GPGPU, mostly motivated by the demands of video game physics but also targeting scientific computing.[11]

GPGPU killed off the market for a dedicated physics accelerator and superseded Cell in video game consoles,[12] and GPUs were eventually used to run convolutional neural networks such as AlexNet (which exhibited leading performance in the ImageNet Large Scale Visual Recognition Challenge).[13]

As such, as of 2016 GPUs are popular for AI work, and they continue to evolve in a direction that facilitates deep learning, both for training[14] and for inference in devices such as self-driving cars.[15] They are also gaining additional connective capability for the kind of dataflow workloads AI benefits from (e.g. Nvidia NVLink).[16]

Use of FPGA

Deep learning frameworks are still evolving, making it hard to design custom hardware. Reconfigurable devices such as field-programmable gate arrays (FPGAs) make it easier to evolve hardware, frameworks and software alongside each other.[17]

Microsoft has used FPGA chips to accelerate inference.[18][19] This has motivated Intel to purchase Altera with the aim of integrating FPGAs in server CPUs, which would be capable of accelerating AI as well as other tasks.[citation needed]

Use of ASIC

Whilst GPUs perform far better than CPUs for these tasks, a factor of 10 in efficiency[20][21] can still be gained with a more specific design, via an application-specific integrated circuit (ASIC).

Memory access pattern

The memory access pattern of AI calculations differs from that of graphics: the dataflow is more predictable but deeper, and benefits more from the ability to keep temporary variables on-chip (e.g. in scratchpad memory rather than caches). GPUs, by contrast, devote silicon to efficiently handling highly non-linear gather-scatter addressing between texture maps and frame buffers, and to texture filtering, as needed for their primary role in 3D rendering.
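
The benefit of keeping working data on-chip can be sketched with a blocked (tiled) matrix multiply, a common building block of neural network layers. The Python sketch below is a toy model, not a description of any real accelerator: it copies small tiles into local lists standing in for a scratchpad and counts reads from "main memory". The tile size, the function names and the read-counting model are all assumptions made for illustration.

```python
# Illustrative sketch only: a toy cost model comparing a naive matrix
# multiply with a tiled one that stages operands in a "scratchpad".
# Tile size, names and the read-counting model are hypothetical.

def matmul_naive(a, b):
    """Every operand is fetched from main memory each time it is used."""
    n, k, m = len(a), len(b), len(b[0])
    reads = 0
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
                reads += 2  # one element of a, one element of b
    return c, reads

def matmul_tiled(a, b, tile=2):
    """Copy one tile of each operand into local lists (the 'scratchpad'),
    then reuse each copied element up to `tile` times before moving on."""
    n, k, m = len(a), len(b), len(b[0])
    reads = 0
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Explicit, predictable loads into on-chip storage.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                reads += sum(len(r) for r in a_tile)
                reads += sum(len(r) for r in b_tile)
                # All arithmetic below touches only the local tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return c, reads

a = [[i + j for j in range(4)] for i in range(4)]
b = [[i * j for j in range(4)] for i in range(4)]
c_naive, reads_naive = matmul_naive(a, b)
c_tiled, reads_tiled = matmul_tiled(a, b, tile=2)
```

In this model the two versions produce the same result, but the tiled version issues fewer main-memory reads because each staged element is reused from local storage; larger tiles increase the reuse, up to the capacity of the on-chip memory.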

Precision

AI researchers often find minimal accuracy losses when dropping to 16 or even 8 bits,[7] suggesting that a larger volume of low-precision arithmetic is a better use of the same bandwidth. Some researchers have even tried 1-bit precision (i.e. putting the emphasis entirely on spatial information in vision tasks).[22] IBM's design is more radical, dispensing with scalar values altogether and accumulating timed pulses to represent activations stochastically, which requires conversion of traditional representations.[23]
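
To make the trade-off concrete, the Python sketch below shows two simplified reductions in precision: a symmetric linear quantization of a dot product to signed 8-bit integers, and a 1-bit variant that keeps only the sign of each value and computes the dot product with XNOR and a bit count. It is a minimal illustration under assumed conventions (one shared scale factor per vector, signs encoded as single bits) and does not reproduce the exact schemes of the cited papers; all function names are hypothetical.

```python
# Illustrative sketch only: an 8-bit quantized dot product and a 1-bit
# (sign-only) dot product. The schemes and names are simplified
# assumptions, not the methods of the cited papers.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(a, b):
    """Dot product computed on 8-bit codes, rescaled to a float at the end."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * sa * sb

def pack_signs(values):
    """Pack signs into an int: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a, b):
    """Dot product of the sign vectors (+1/-1) of a and b via XNOR + popcount."""
    n = len(a)
    mask = (1 << n) - 1
    agree = ~(pack_signs(a) ^ pack_signs(b)) & mask  # XNOR: 1 where signs match
    matches = bin(agree).count("1")
    return 2 * matches - n  # each match adds +1, each mismatch adds -1

weights = [0.5, -1.0, 0.25, 0.75]
inputs = [1.0, 0.5, -2.0, 0.25]
exact = sum(w * x for w, x in zip(weights, inputs))
approx = int8_dot(weights, inputs)
sign_only = binary_dot(weights, inputs)
```

Here `approx` stays close to `exact` while each operand needs only 8 bits, whereas `sign_only` discards magnitudes entirely and replaces multiplication with bitwise logic.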

Nomenclature

As of 2016, the field is still in flux and vendors are pushing their own marketing terms for what amounts to an "AI accelerator", in the hope that their designs and APIs will dominate. There is no consensus on the boundary between these devices, nor on the exact form they will take; however, several examples clearly aim to fill this new space, with a fair amount of overlap in capabilities.

In the past, when consumer graphics accelerators emerged, the industry eventually adopted Nvidia's self-assigned term, "the GPU",[24] as the collective noun for "graphics accelerators", which had taken many forms before settling on an overall pipeline implementing a model presented by Direct3D.

Slowing of Moore's law

As of 2016, the slowing (and possibly imminent end) of Moore's law[25] drives some to suggest refocussing industry efforts on application-led silicon design,[26] whereas in the past increasingly powerful general-purpose chips were applied to varying applications via software. In this scenario, a diversification of dedicated AI accelerators makes more sense than continuing to stretch GPUs and CPUs.

Future

It remains to be seen, however, whether the eventual shape of the AI accelerator is a radically new device like TrueNorth or a more general-purpose processor that just happens to be optimised for the right mix of precision and dataflow.[4] There are also some even more exotic approaches on the horizon, e.g. attempting to use individual memristors as synapses.

Potential Applications

Examples

References

  1. ^ "google developing AI processors". Google using its own AI accelerators.
  2. ^ "convolutional neural network demo from 1993 featuring DSP32 accelerator".
  3. ^ "design of a connectionist network supercomputer".
  4. ^ a b "The end of general purpose computers (not)". This presentation covers a past attempt at neural net accelerators, notes the similarity to the modern SLI GPGPU processor setup, and argues that general purpose vector accelerators are the way forward (in relation to the RISC-V Hwacha project). It argues that NNs are just dense and sparse matrices, one of several recurring algorithms.
  5. ^ Application of the ANNA Neural Network Chip to High-Speed Character Recognition
  6. ^ "SYNAPSE-1: a high-speed general purpose parallel neurocomputer system".
  7. ^ a b "Deep Learning with Limited Numerical Precision" (PDF).
  8. ^ "Improving the performance of video with AVX".
  9. ^ "microsoft research/pixel shaders/MNIST".
  10. ^ "how the gpu came to be used for general computation".
  11. ^ "nvidia tesla microarchitecture" (PDF).
  12. ^ "End of the line for IBM's Cell".
  13. ^ "imagenet classification with deep convolutional neural networks" (PDF).
  14. ^ "nvidia driving the development of deep learning".
  15. ^ "nvidia introduces supercomputer for self driving cars".
  16. ^ "how nvlink will enable faster easier multi GPU computing".
  17. ^ "FPGA Based Deep Learning Accelerators Take on ASICs". The Next Platform. 2016-08-23. Retrieved 2016-09-07.
  18. ^ "microsoft extends fpga reach from bing to deep learning".
  19. ^ "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware" (PDF).
  20. ^ "Google boosts machine learning with its Tensor Processing Unit". 2016-05-19. Retrieved 2016-09-13.
  21. ^ "Chip could bring deep learning to mobile devices". www.sciencedaily.com. 2016-02-03. Retrieved 2016-09-13.
  22. ^ Rastegari, Mohammad; Ordonez, Vicente; Redmon, Joseph; Farhadi, Ali (2016). "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks". arXiv:1603.05279 [cs.CV].
  23. ^ Diehl, Peter U.; Zarrella, Guido; Cassidy, Andrew; Pedroni, Bruno U.; Neftci, Emre (2016). "Conversion of Artificial Recurrent Neural Networks to Spiking Neural Networks for Low-power Neuromorphic Hardware". arXiv:1601.04187 [cs.NE].
  24. ^ "NVIDIA launches the World's First Graphics Processing Unit, the GeForce 256".
  25. ^ "intels former chief architect - moore's law will be dead within a decade".
  26. ^ "more than moore" (PDF).
  27. ^ "drive px".
  28. ^ "design of a machine vision system for weed control" (PDF).
  29. ^ "qualcomm research brings server class machine learning to every data devices".
  30. ^ "movidius powers worlds most intelligent drone".
  31. ^ "yann lecun on IBM truenorth". Argues that spiking neurons have never produced leading quality results and that 8-16 bit precision is optimal; pushes the competing 'neuflow' design.
  32. ^ "IBM cracks open new era of neuromorphic computing". TrueNorth is incredibly efficient: The chip consumes just 72 milliwatts at max load, which equates to around 400 billion synaptic operations per second per watt — or about 176,000 times more efficient than a modern CPU running the same brain-like workload, or 769 times more efficient than other state-of-the-art neuromorphic approaches
  33. ^ "kalray MPPA" (PDF).
  34. ^ "India preps RISC-V Processors - Shakti targets servers, IoT, analytics". The Shakti project now includes plans for at least six microprocessor designs as well as associated fabrics and an accelerator chip