= Lion Cove =

Lion Cove is a 64-bit x86 CPU core architecture designed by Intel. The Lion Cove core is featured in Core Ultra Series 2 Arrow Lake and Lunar Lake processors.

== Architecture ==
Lion Cove is a performance core (P-core) architecture aimed at high computing performance, with wider integer and vector execution units, wider fetch and higher core frequencies than Intel's density-optimized E-core architectures.
Intel claims a 14% increase in instructions per cycle (IPC) for the Lion Cove P-core over Redwood Cove. Intel approached the Lion Cove design process with the intention to "remove any transistor from the design that doesn't directly contribute to productivity", stripping down the core design in order to focus on single-threaded performance and core area efficiency. Ori Lempel served as Senior Principal Engineer for the Lion Cove P-core design.

=== Front end ===
The front end of the Lion Cove core, which fetches, decodes and issues instructions, has been made wider and deeper. Instructions from the instruction queue are decoded eight at a time, up from six-way decode in Redwood Cove. Likewise, Lion Cove's out-of-order engine uses an eight-way allocation/rename queue, increased from Redwood Cove's six-way queue. The out-of-order engine splits renaming and scheduling into dedicated integer and vector domains, which allows Intel to modify each domain independently in future designs without a complete redesign of the out-of-order engine. Each domain has its own access to the micro-op queue. The larger op cache and longer op queue improve efficiency: the more micro-ops are held in the cache, the less often the decode logic has to be powered up again.

| | Redwood Cove | Lion Cove |
| Decode | 6-way | 8-way |
| Allocation/Rename | 6-way | 8-way |
| Retirement | 8-wide | 12-wide |
| Deep instruction window | 512 | 576 |
| Execution Ports | 12 | 18 |
| Op Cache | 4096 entry 8-way | 5250 entry 12-way |
| Op Queue | 144 entry | 192 entry |
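
To put the table's figures in proportion, the relative growth of each front-end resource can be computed directly. The following is a minimal sketch in Python; the dictionary and function names are illustrative choices, and the numbers are copied from the table above without any further assumptions about how the hardware uses them.

```python
# Front-end figures copied from the table above: (Redwood Cove, Lion Cove).
FRONT_END = {
    "Decode": (6, 8),
    "Allocation/Rename": (6, 8),
    "Retirement": (8, 12),
    "Deep instruction window": (512, 576),
    "Execution Ports": (12, 18),
    "Op Cache entries": (4096, 5250),
    "Op Queue entries": (144, 192),
}


def relative_increase(old, new):
    """Return the growth from old to new as a percentage of old."""
    return (new - old) / old * 100.0


def summarize(table):
    """Map each resource name to its percentage increase, rounded to one decimal."""
    return {
        name: round(relative_increase(old, new), 1)
        for name, (old, new) in table.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(FRONT_END).items():
        print(f"{name}: +{pct}%")
```

These ratios describe structure sizes and widths only; they say nothing by themselves about delivered performance, which depends on the workload.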

==== Branch Predictor ====
Branch prediction has been strengthened in Lion Cove, with the core's prediction block being 8 times wider than Redwood Cove's. A branch predictor tries to predict the outcome when code reaches a branch, where execution paths diverge. Lion Cove's L0 Branch Target Buffer (BTB) has been doubled to 256 entries so that it can store more target addresses of taken branches, which helps predict the next branch and reduces the number of misses.

Branch Target Buffer cache entries

| | Redwood Cove | Lion Cove |
| L0 BTB | 128 | 256 |
| L1 BTB | 5K | 6K |
| L2 BTB | 12K | 12K |
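
Why a larger BTB reduces misses can be illustrated with a toy model. The sketch below is not a description of Intel's predictor: it assumes a fully associative buffer with least-recently-used replacement, and the class name `ToyBTB`, the helper `run_loop` and the synthetic branch addresses are all hypothetical. It replays a loop containing a fixed number of distinct taken branches and reports how often the stored target was available.

```python
from collections import OrderedDict


class ToyBTB:
    """Fully associative, LRU-replaced branch target buffer (illustrative only)."""

    def __init__(self, entries):
        self.entries = entries
        self.table = OrderedDict()  # branch address -> last seen target
        self.hits = 0
        self.lookups = 0

    def predict_and_update(self, branch_pc, actual_target):
        """Look up a taken branch, count a hit if the stored target matches, then train."""
        self.lookups += 1
        if self.table.get(branch_pc) == actual_target:
            self.hits += 1
        else:
            self.table[branch_pc] = actual_target
        self.table.move_to_end(branch_pc)  # mark as most recently used
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


def run_loop(entries, distinct_branches, iterations):
    """Replay a loop body with `distinct_branches` taken branches, `iterations` times."""
    btb = ToyBTB(entries)
    for _ in range(iterations):
        for i in range(distinct_branches):
            pc = 0x1000 + 16 * i          # synthetic branch address
            btb.predict_and_update(pc, pc + 0x40)
    return btb.hit_rate()


if __name__ == "__main__":
    for size in (128, 256):
        print(size, "entries:", run_loop(size, distinct_branches=200, iterations=10))
```

With 200 branches in the loop, the 128-entry model evicts every entry before it is reused, while the 256-entry model misses only on the first pass. A cyclic access pattern under LRU is a deliberately extreme case, so the gap in real code would be smaller.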

=== Execution Engine ===
==== Integer Unit ====
Lion Cove increases the number of integer Arithmetic Logic Units (ALUs) to six. Redwood Cove contained five ALUs that used a 256-bit wide pipe. The number of integer multiply units has risen from one to three, which means that the core can execute more than one integer multiply operation per cycle.
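
The effect of the extra multiply units on peak throughput can be estimated with simple arithmetic. This sketch assumes the multiplies are independent of one another and that each unit accepts one new operation per cycle; neither assumption comes from the text above, and the function name is illustrative.

```python
import math


def issue_cycles(independent_ops, units):
    """Cycles needed to issue the operations, assuming one op per unit per cycle."""
    if units < 1:
        raise ValueError("at least one unit is required")
    return math.ceil(independent_ops / units)


if __name__ == "__main__":
    ops = 12  # hypothetical batch of independent integer multiplies
    print("1 multiply unit: ", issue_cycles(ops, 1), "cycles")
    print("3 multiply units:", issue_cycles(ops, 3), "cycles")
```

Dependent multiplies, where each needs the previous result, would not benefit in this way because they cannot be issued in the same cycle.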

==== Vector engine ====
Intel's vector engine design in Lion Cove now more closely resembles that used by AMD since Zen with four pipes for floating point and vector execution. Two of those pipes deal with floating-point multiplications and multiply-adds, while the two other pipes handle floating-point additions. The number of floating-point dividers has increased from one to two with improved throughput. For handling sort-vector instructions, the vector engine contains four SIMD ALUs, up from three in Redwood Cove.

Lion Cove supports AVX-512 instructions, but that support is disabled in heterogeneous processor generations like Arrow Lake and Lunar Lake. The same was true of Golden Cove, Raptor Cove and Redwood Cove, which had their AVX-512 support disabled in all heterogeneous non-server products.

=== Cache ===
Lion Cove introduces an expanded cache hierarchy with four caching tiers rather than three. Intel has shipped something resembling a fourth tier before: with select Broadwell SKUs in 2015, it added a 128 MB eDRAM that acted like a fourth-level cache. However, this eDRAM was not a traditional cache. It was placed on a separate die as a form of slower shared memory between the CPU cores and graphics, and its intended purpose was to reduce memory access requests. Broadwell's L3 cache had about a third of the eDRAM's latency in cycles and over triple its bandwidth. The last time Intel added a new level of traditional cache was in 2003, with the L3 cache on the Pentium 4 Extreme Edition.

| Cache | Attribute | Redwood Cove | Lion Cove |
| L0 data | Size | 48 KB | 48 KB |
| L0 data | Associativity | 12-way | 12-way |
| L0 data | Latency | 5-cycles | 4-cycles |
| L0 data | Bandwidth | 128 B/clk | 128 B/clk |
| L0 instruction | Size | 64 KB | 64 KB |
| L0 instruction | Associativity | 6-way | 16-way |
| L0 instruction | Latency | -cycles | -cycles |
| L0 instruction | Bandwidth | 32 B/clk | 128 B/clk |
| L1 | Size | N/A | 192 KB |
| L1 | Associativity | N/A | 12-way |
| L1 | Latency | N/A | 9-cycles |
| L1 | Bandwidth | N/A | 2×64 B/clk |
| L2 | Size | 2 MB | 2.5 MB or 3 MB |
| L2 | Associativity | 16-way | 10-way |
| L2 | Latency | 16-cycles | 17-cycles |
| L2 | Bandwidth | __ B/clk | 2×64 B/clk |
| L3 | Size | 4 MB | |
| L3 | Associativity | 12-way | 12-way |
| L3 | Latency | 75-cycles | 51-cycles |
| L3 | Bandwidth | 32 B/clk Read 32 B/clk Write | 32 B/clk Read 32 B/clk Write |

==== L0 ====
Lion Cove's L0 caches correspond to what other CPU core architectures call the L1 data and instruction caches. Although Intel has kept the larger cache sizes of its recent core architectures, it has reduced the load-to-use latency to four cycles, down from five cycles in Redwood Cove and a figure not seen since Skylake.

==== L1 ====
The new 192 KB L1 cache in the Lion Cove core acts as a mid-level buffer between the L0 data and instruction caches inside the core and the L2 cache outside the core. Its purpose is to serve L0 data cache misses at lower latency than a trip to the L2 cache would take. Accessing data in the L1 cache takes nine cycles, roughly half the latency of accessing the L2 cache.

==== L2 ====
The L2 cache is important to the Lion Cove core architecture because Intel relies on it to insulate the cores from the L3 cache's slow performance. Lion Cove was designed to accommodate L2 caches configurable from 2.5 MB up to 3 MB depending on the product. Lunar Lake's Lion Cove implementation contains a 2.5 MB L2 cache, while the Lion Cove variant in Arrow Lake contains a 3 MB L2 cache. This continues Intel's trend of enlarging the L2 cache over the last few generations of its P-cores, such as Golden Cove, Raptor Cove and Redwood Cove; the previous-generation Redwood Cove P-core featured 2 MB of L2 cache. However, increasing the cache size often brings higher latency. Lion Cove's L2 cache has a 17-cycle latency, up from Redwood Cove's 16-cycle latency. Theoretically, the L2 cache can deliver a bandwidth of 110 bytes per cycle, but this was limited to 64 bytes per cycle in Lunar Lake for power savings.
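
Per-cycle figures such as these only become absolute numbers once a clock speed is chosen. The sketch below converts bytes per cycle to GB/s and cycles to nanoseconds; the 4.0 GHz clock is a hypothetical value picked for round numbers, not a specification of any Lion Cove product, and the function names are illustrative.

```python
def bandwidth_gb_per_s(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times 1e9 cycles per second per GHz, expressed in GB/s (1e9 bytes)."""
    return bytes_per_cycle * clock_ghz


def latency_ns(cycles, clock_ghz):
    """One cycle lasts 1 / clock_ghz nanoseconds."""
    return cycles / clock_ghz


if __name__ == "__main__":
    clock = 4.0  # hypothetical core clock in GHz
    print("L2 theoretical:", bandwidth_gb_per_s(110, clock), "GB/s")
    print("L2 Lunar Lake: ", bandwidth_gb_per_s(64, clock), "GB/s")
    print("L2 latency:    ", latency_ns(17, clock), "ns")
```

At the assumed 4.0 GHz, 110 bytes per cycle corresponds to 440 GB/s, 64 bytes per cycle to 256 GB/s, and 17 cycles to 4.25 ns.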

==== L3 ====
The read bandwidth available to a single Lion Cove core accessing the L3 cache has regressed from 16 bytes per cycle with Redwood Cove to 10 bytes per cycle. Despite this lower bandwidth, the latency of L3 accesses has been reduced from 75 cycles to 51 cycles in Lunar Lake. Lion Cove in Arrow Lake, however, suffers from a much higher latency of 84 cycles because of a longer ring bus, as its L3 cache is shared by both its P-cores and E-cores. Lunar Lake's L3 cache is exclusive to its four Lion Cove P-cores, while its four E-cores sit on a separate "island" without an L3 cache.
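
How the extra tier and the changed latencies interact can be explored with a weighted-average model of load latency. The sketch below uses the Lunar Lake and Redwood Cove latencies quoted in this section; the fraction of loads served by each level is entirely made up for illustration, loads that miss every cache are ignored, and the names are hypothetical.

```python
def expected_load_latency(levels):
    """Weighted average latency; levels is a list of (latency_cycles, fraction_served)."""
    total_fraction = sum(fraction for _, fraction in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(latency * fraction for latency, fraction in levels)


# Latencies from the text; the fractions are hypothetical and not measured.
LION_COVE_LUNAR_LAKE = [
    (4, 0.90),   # L0 data
    (9, 0.05),   # L1
    (17, 0.04),  # L2
    (51, 0.01),  # L3
]
REDWOOD_COVE = [
    (5, 0.90),   # first-level data cache
    (16, 0.09),  # L2 (absorbs the loads Lion Cove serves from its new L1 and L2)
    (75, 0.01),  # L3
]

if __name__ == "__main__":
    print("Lion Cove (Lunar Lake):", expected_load_latency(LION_COVE_LUNAR_LAKE))
    print("Redwood Cove:          ", expected_load_latency(REDWOOD_COVE))
```

With these invented fractions the model gives about 5.24 cycles for Lion Cove in Lunar Lake against about 6.69 cycles for Redwood Cove. Substituting Arrow Lake's 84-cycle L3 latency, or a different hit distribution, changes the result, which is why the figures should be read as an illustration of the method rather than a performance claim.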
