= AI engine =

AI engine is a computing architecture created by AMD (formerly by Xilinx, which AMD acquired in 2022). It is commonly used for accelerating linear algebra operations (such as matrix multiplication), for artificial intelligence algorithms, digital signal processing, and more generally, for high-performance computing. The first products containing AI engines were the Versal adaptive compute acceleration platforms, which combine scalar, adaptable, and intelligent engines connected through a Network on Chip (NoC).

AI engines have evolved significantly as modern computing workloads have changed, including changes directed toward accelerating AI applications. The basic architecture of a single AI engine integrates vector processors and scalar processors to implement Single Instruction Multiple Data (SIMD) capabilities. AI engines are integrated with other architectures such as FPGAs, CPUs, and GPUs to form heterogeneous, high-performance platforms with applications in many domains.

==Etymology==

According to AMD, while the architecture can be used for artificial intelligence, the "AI" in AI Engine is not an acronym for artificial intelligence or any other term.

== History ==
The AMD AI engines were originally released by Xilinx, Inc., an American company known for field-programmable gate arrays (FPGAs). Their initial goal was to accelerate signal processing and, more generally, applications where data parallelism could offer significant improvements. AI engines were first released together with an FPGA layer in the Versal platforms. The initial systems, the VCK190 and VCK5000, were built around the VC1902 device and contained 400 AI engines in their AI engine layer. For connectivity, this architecture class relied on a Network on Chip, a high-performance interconnect intended to become the core connectivity of modern FPGA fabric.

In 2022, the AI engine project changed direction when Xilinx was officially acquired by AMD, an American company active in the computing architecture market. The AI engines were integrated with other computing systems to target a wider range of applications, with AI workloads benefiting in particular. Even though the Versal architecture proved powerful, it was complicated and unfamiliar to a large segment of the academic and industrial community. For this reason, AMD, along with third-party developers, began releasing improved toolsets and software stacks aimed at simplifying the programming challenges posed by the platform, targeting productivity and programmability.

Aware of the needs of AI workloads, in 2023 AMD announced the AI engine ML (AIE-ML), the second generation of the architecture. It added support for AI-specific data types such as bfloat16, a common data type for deep learning applications. This version retained the vector processing capabilities of the previous generation but enlarged the memory to support more intermediate computations. Starting with this generation, AMD has integrated AI engines with other processing units such as CPUs and GPUs in its Ryzen AI processors. In such systems, AI engines are usually referred to as Compute Tiles: self-contained processing blocks designed to efficiently execute AI and signal processing workloads. These blocks are integrated with two other types of tiles, namely Memory Tiles and Shim Tiles. The structure containing the three interconnected kinds of tiles is named XDNA, and its first generation, XDNA 1, was released on Ryzen AI Phoenix PCs. Along with this release, AMD continued its work on programmability, releasing Riallto as an open-source tool.

Following a similar path, between the end of 2023 and early 2024, AMD announced XDNA 2, along with the Strix series of Ryzen AI architectures. Unlike the first generation of XDNA, the second offers more compute units to target the massive workloads of ML systems. To continue its efforts on programmability, AMD released the open-source Ryzen AI SW toolchain, which includes the tools and runtime libraries for optimizing and deploying AI inference on Ryzen AI PCs.

Lastly, as neural processing and deep learning applications spread across different domains, researchers and industry increasingly refer to XDNA architectures as Neural Processing Units (NPUs). However, the term covers all architectures specifically meant for deep learning workloads, and several companies, such as Huawei and Tesla, offer their own alternatives.

== Hardware architecture ==

=== AI engine tile ===
A single AI engine is a 7-way VLIW processor that offers vector and scalar capabilities, enabling parallel execution of multiple operations per clock cycle. The architecture includes a 128-bit wide vector unit capable of SIMD (Single Instruction, Multiple Data) execution, a scalar unit for control and sequential logic, and a set of load/store units for memory access. The maximum vector register size is 1024 bits, so the number of elements per vector depends on the vector data type.
In the first generation, each AI engine tile has 32 KB of data memory to hold partial computations and 16 KB of program memory.
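
The relation between register width and vector length can be illustrated with a short calculation. The following Python sketch is only an arithmetic illustration: the 1024-bit figure comes from the text above, while the list of element types is an assumption chosen for the example and not a statement about which types a given AI engine generation supports.

```python
# Illustrative arithmetic only: lanes = register width / element width.
VECTOR_REGISTER_BITS = 1024  # maximum vector register size stated in the text

# Element widths in bits. The choice of types is an assumption for this example.
ELEMENT_BITS = {"int8": 8, "int16": 16, "int32": 32, "float32": 32}


def lanes(register_bits: int, element_bits: int) -> int:
    """Return how many elements of the given width fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# Map each example type to the number of elements in a full-width register.
lane_table = {
    name: lanes(VECTOR_REGISTER_BITS, bits) for name, bits in ELEMENT_BITS.items()
}
print(lane_table)
```

Narrower element types therefore yield proportionally longer vectors in the same register.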

AI engines are statically scheduled architectures. As widely studied in literature, static scheduling suffers from code explosion, requiring manual code optimizations when writing the AI engine kernel to handle this side effect.

The main programming language for a single AI engine is C++, used for both the connection declaration among multiple engines and the kernel logic executed by a specific AI engine tile. However, different toolchains can offer support for other programming languages, targeting specific applications or offering automation.

=== First generation - the AI engine layer ===

In the first generation of Versal systems, each AI engine is connected to multiple other engines through three main interfaces, namely cascade, memory and stream interfaces. Each one represents a possible communication mechanism of each AI engine with the others.

The AI engine layer of the first Versal systems combined 400 AI engines. Each AI engine has a 32 KB memory that can be extended up to 128 KB by using the memory of neighbouring engines. Doing so reduces the number of engines available as compute cores but provides a larger data memory.
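
The trade-off between memory per kernel and the number of active cores can be sketched numerically. The Python snippet below is a back-of-the-envelope model, not a description of the real placer: it assumes that every kernel needs the same buffer size and that a tile whose memory is borrowed cannot run its own kernel, and it ignores the physical requirement that only neighbouring memories can be shared. The function names are hypothetical.

```python
TILE_MEMORY_KB = 32   # data memory per tile (from the text)
MAX_MEMORY_KB = 128   # maximum reachable through neighbouring memories (from the text)
TOTAL_TILES = 400     # tiles in the first Versal AI engine layer (from the text)


def tiles_needed(buffer_kb: int) -> int:
    """Number of tile memories occupied by a kernel needing buffer_kb of data."""
    if not 0 < buffer_kb <= MAX_MEMORY_KB:
        raise ValueError("buffer must be between 1 KB and 128 KB")
    return -(-buffer_kb // TILE_MEMORY_KB)  # ceiling division


def max_active_cores(buffer_kb: int) -> int:
    """Upper bound on active cores under the simplifying assumptions above."""
    return TOTAL_TILES // tiles_needed(buffer_kb)


for size_kb in (32, 64, 128):
    print(size_kb, "KB per kernel ->", max_active_cores(size_kb), "active cores")
```

Under these assumptions, using the full 128 KB for every kernel leaves at most a quarter of the tiles available as compute cores.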

Each AI engine can execute an independent function, or multiple functions by leveraging time multiplexing. The programming structure used to describe the AI engine instantiation, placement and connection is named AIE graph. The official programming model suggested by AMD requires writing such a file in C++. However, different programming toolchains, from both companies and research, can support different alternatives to improve programmability and/or performance.
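
To make the idea of an AIE graph more concrete, the following Python sketch models kernels, their placement on tiles, and their connections. It is a hypothetical stand-in written for illustration and is not the AMD C++ graph API: the class `ToyGraph`, its methods, and the kernel names are all invented, and the adjacency rule (memory and cascade links only between tiles at most one step apart) is a simplification of the real hardware constraints.

```python
from dataclasses import dataclass, field

# The three interfaces named in the text.
INTERFACES = {"stream", "memory", "cascade"}


@dataclass
class ToyGraph:
    """Hypothetical model of an AIE graph: kernels, placement, connections."""

    placement: dict = field(default_factory=dict)    # kernel name -> (col, row)
    connections: list = field(default_factory=list)  # (src, dst, interface)

    def add_kernel(self, name: str, tile: tuple) -> None:
        self.placement[name] = tile

    def connect(self, src: str, dst: str, interface: str) -> None:
        if interface not in INTERFACES:
            raise ValueError(f"unknown interface: {interface}")
        for kernel in (src, dst):
            if kernel not in self.placement:
                raise KeyError(f"kernel not placed: {kernel}")
        self.connections.append((src, dst, interface))

    def kernels_per_tile(self) -> dict:
        """Group kernels by tile; a tile with several kernels time-multiplexes them."""
        tiles = {}
        for name, tile in self.placement.items():
            tiles.setdefault(tile, []).append(name)
        return tiles

    def non_adjacent_links(self) -> list:
        """Memory or cascade links whose endpoints are more than one tile apart."""
        bad = []
        for src, dst, interface in self.connections:
            (c0, r0), (c1, r1) = self.placement[src], self.placement[dst]
            distance = abs(c0 - c1) + abs(r0 - r1)
            if interface != "stream" and distance > 1:
                bad.append((src, dst, interface))
        return bad


graph = ToyGraph()
graph.add_kernel("fir", (0, 0))
graph.add_kernel("scale", (0, 0))  # shares tile (0, 0) with "fir"
graph.add_kernel("fft", (1, 0))
graph.add_kernel("sink", (5, 3))
graph.connect("fir", "scale", "memory")
graph.connect("scale", "fft", "cascade")
graph.connect("fft", "sink", "stream")
print(graph.kernels_per_tile())
print(graph.non_adjacent_links())
```

In the real flow, checks of this kind are performed by the AI engine compiler during placement and routing rather than by user code.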

To compile the application, the original toolchain relies on a closed-source AI engine compiler that performs placement and routing automatically, although custom constraints can be given when writing the AIE graph.

Because AI engines were initially integrated only in Versal systems, combining them with FPGA capabilities and Network on Chip connectivity, this architectural layer also offers a limited number of direct connections to both. Such connections need to be specified both in the AIE graph, to ensure a correct placement of the AI engines, and during system-level design.

=== Second generation - the AI engine ML ===
The second generation of AMD's AI engines, or AI engine ML (AIE-ML), provides some architectural modifications with respect to the first generation, focusing on performance and efficiency for machine learning workloads.

AIE-ML offers almost twice the compute density per tile, improved memory bandwidth, and native support for data types suited to AI inference workloads, such as INT8 and bfloat16. These optimizations allow the second-generation engine to deliver up to three times more TOPS per watt than the original AI engine, which was primarily built for DSP-heavy workloads and required explicit SIMD programming and hand-coded data partitioning.
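
The bfloat16 format keeps the sign bit and the 8 exponent bits of an IEEE 754 single-precision value but only the top 7 mantissa bits, so it corresponds to the upper half of a 32-bit float. The Python sketch below illustrates this layout by truncation. It is a software illustration only: hardware implementations commonly round to nearest rather than truncate, and the rounding behaviour of AIE-ML is not described here. The function names are hypothetical.

```python
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit bfloat16 pattern obtained by truncating a float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16  # keep sign, 8 exponent bits, and top 7 mantissa bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a float by zero-filling the low bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


original = 3.14159
bits = float32_to_bfloat16_bits(original)
restored = bfloat16_bits_to_float(bits)
print(hex(bits), restored)
```

The round trip shows the reduced precision: the range of float32 is preserved, but only about two to three significant decimal digits survive.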

Recent publications from researchers and institutions report that AIE-ML offers better scalability, more on-chip memory, and more computational power, making it better suited for modern edge ML inference workloads. Together, these advances address the limitations of the first generation.

According to the company's official documentation, there are some key similarities and differences between the two architectures.
**Key similarities and differences between the first-generation (AIE) and second-generation (AIE-ML) AI engines**

| Similarities between AIE-ML and AIE | Differences between AIE-ML and AIE |
|---|---|
| Same process, voltage, frequency, clock and power distribution | AIE-ML features doubled compute/memory. AIE-ML features a processor bus for direct read/write accesses to local tile memory-mapped registers. |
| One VLIW SIMD processor per tile | AIE-ML features an increased memory capacity (64 KB). |
| Same debug functionality | AIE-ML features an improved power efficiency (TOPS/W). |
| Same connectivity with PL and NoC | AIE-ML features an improved stream switch functionality, performing source to destination parity check and deterministic merge. |
| Same bandwidth for stream interconnect | AIE-ML features a grid-array architecture supporting both vertical (top to bottom) and horizontal (left to right) 512-bit cascade, versus the 384-bit horizontal cascade only of AIE. |

=== XDNA 1 ===

The XDNA is the hardware layer combining three types of tiles:

- The Compute Tile (AI engine ML) is responsible for executing vector and scalar operations.
- The Memory Tile provides 512 KB of local memory and performs pattern-specific data movements to serve fetch requests from the Compute Tiles.
- The Shim Tile handles the interaction with host memory and controls the data exchanges between Memory and Compute Tiles.
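
The division of roles among the three tile types can be pictured as a simple data pipeline. The Python sketch below is a conceptual toy and not a model of the real hardware or of any AMD API: the function names are hypothetical, host memory is represented as a byte string, and the "kernel" is a plain sum standing in for vector work. Only the 512 KB capacity is taken from the text.

```python
MEMORY_TILE_BYTES = 512 * 1024  # Memory Tile capacity stated in the text


def shim_fetch(host_memory: bytes, offset: int, size: int) -> bytes:
    """Shim Tile role: bring a region of host memory into the array."""
    return host_memory[offset:offset + size]


def memory_tile_split(buffer: bytes, chunk: int) -> list:
    """Memory Tile role: hold the data and hand it out in fixed-size chunks."""
    if len(buffer) > MEMORY_TILE_BYTES:
        raise ValueError("buffer exceeds Memory Tile capacity")
    return [buffer[i:i + chunk] for i in range(0, len(buffer), chunk)]


def compute_tile(chunk: bytes) -> int:
    """Compute Tile role: process one chunk (a sum stands in for a kernel)."""
    return sum(chunk)


host = bytes(range(256)) * 4            # 1024 bytes of example host data
staged = shim_fetch(host, 0, 1024)      # host memory -> array
chunks = memory_tile_split(staged, 256) # array buffer -> per-kernel chunks
results = [compute_tile(c) for c in chunks]
print(results)
```

Each stage touches the data in a different way: the first moves it in, the second reshapes it, and the third computes on it.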

In Ryzen AI Phoenix architectures, the XDNA layer is combined with other architectural layers such as CPUs and GPUs, forming the AMD product line for energy-efficient inference and AI workloads.

=== XDNA 2 ===
The second generation of the XDNA layer is integrated within the Ryzen AI Strix architecture, and official AMD documents describe it as specifically tailored for LLM inference workloads.

== Tools and programming model ==
The main programming environment for AI engine, officially supported by AMD, is the Vitis flow, which uses the Vitis toolchain to program the hardware accelerator.

Vitis offers support for both hardware and software developers in a unified development environment, including high-level synthesis, RTL-based flows, and domain-specific libraries. Vitis enables applications to be deployed onto heterogeneous platforms, including AI engines, FPGAs, and scalar processors.

Newer architectures are moving towards a design approach utilizing Vitis for hardware and IP design, while relying on Vivado for system integration and hardware setup. Vivado, also a part of the AMD toolchain ecosystem, is primarily utilized for RTL design and IP integration and offers a GUI-based design environment to construct block designs and manage synthesis, implementation, and bitstream generation.

For the AI engine layer, the main programming language for a single AI engine is C++, used for both the connection declaration among multiple engines and the kernel logic executed by a specific AI engine tile.

== Research toolchains ==
In parallel with the company's efforts on programming models, design flows, and tools, researchers have proposed their own toolchains targeting programmability, performance, or simplified development for a subset of applications.

Some of the main research toolchains are briefly described below:

- IRON is an open-source toolchain developed by AMD in collaboration with several researchers. The IRON toolchain uses MLIR as its intermediate representation. At the user level, IRON provides a Python API for placing and orchestrating multiple AI engines. Such Python code is then translated into MLIR and compiled using one of two possible backends: a Vitis-based backend or an open-source backend using the Peano compiler. IRON still relies on C++ for kernel development, supporting all the APIs of the standard AI engine kernel development flow.
- ARIES (An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI engines) presents a high-level, tile-based programming model and shared MLIR intermediate representation encompassing both AI engines and FPGA fabric. It represents task-level, tile-level, and instruction-level parallelism in MLIR and accommodates global and local optimization passes. ARIES generates compact C++ code for AI engine kernels and data-movement logic, allowing kernel specification through Python.
- EA4RCA targets a specialized subclass of algorithms: regular communication-avoiding algorithms. It introduces a design environment optimized for Versal heterogeneity, emphasizing AI engine performance and high-speed data streaming abstractions, and exploits the regular communication patterns of these algorithms to make the most of the parallelism and memory hierarchy of the Versal platform.
- CHARM is a framework to compose multiple diverse matrix multiplication accelerators working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling.

== See also ==

- Central processing unit
- Field programmable gate arrays
- Flynn's taxonomy
- Hardware acceleration
- Neural processing unit
- NVIDIA deep learning accelerator
- Vivado
