Qualcomm's self-developed chip architecture revealed in depth: Apple's M series meets its strongest competitor

Wallstreetcn
2024.06.14 04:13

Qualcomm has unveiled the Snapdragon X SoC architecture, including its new custom Arm v8 "Oryon" CPU core, along with technical disclosures on the Adreno GPU and Hexagon NPU. The new architecture positions Qualcomm as a competitor in both the PC and mobile fields, breaking away from Arm's reference designs. Oryon is Qualcomm's first high-performance CPU design built from scratch in years, and it is crucial to the new generation of Windows-on-Arm SoCs. Qualcomm aims to establish a presence in the Windows PC market with this architecture, which will also serve as the foundation for its traditional mobile device SoCs.

Over the past 8 months, Qualcomm has made many intriguing claims about its high-performance Windows-on-Arm SoC, many of which will be put to the test in the coming weeks. But in the increasingly competitive PC CPU landscape, beyond all the performance claims and marketing, there is a more fundamental question about the Snapdragon X that we have been eager to answer: how does it work?

Ahead of its release next week, we finally have the answer as Qualcomm today unveiled their long-awaited Snapdragon X SoC architecture. This includes not only their new custom Arm v8 "Oryon" CPU cores, but also the technical disclosure of their Adreno GPU and the Hexagon NPU that supports their heavily promoted AI capabilities.

The company has previously stated that the Snapdragon X is a serious, top-priority project for them - not merely existing IP blocks cobbled together into a Windows SoC - and accordingly, there is a great deal of new technology in this SoC.

While we are excited to see all of this, we must admit that what excites us most is finally being able to delve into Oryon, Qualcomm's custom Arm CPU core. As the first all-new high-performance CPU design built from scratch in the past few years, the importance of Oryon cannot be overstated. Beyond providing the foundation for the new generation of Windows-on-Arm SoCs (with which Qualcomm hopes to establish a presence in the Windows PC market), Oryon will also serve as the foundation for Qualcomm's traditional Snapdragon phone and tablet SoCs.

Therefore, in the coming years, a large share of the company's hardware will be based on this CPU architecture - and if all goes according to plan, Oryon will span multiple product generations. In any case, it will set Qualcomm apart from its competitors in both the PC and mobile fields, as it means Qualcomm is moving away from Arm's reference designs - and Arm is, in essence, also a competitor to Qualcomm.

Without further ado, let's delve into Qualcomm's Snapdragon X SoC architecture.

Elite, Plus, and Currently Released SKUs

As a quick recap, Qualcomm has announced a total of 4 Snapdragon X SKUs to date, all of which have been provided to device manufacturers and will launch next week.

Three of them are "Elite" SKUs, which contain 12 CPU cores. Meanwhile, Qualcomm currently only offers one "Plus" SKU, with the CPU core count reduced to 10.

Officially, Qualcomm has not assigned a TDP rating to any of these chip SKUs because, in principle, any given SKU can be used across the entire power range. Need to put a top-tier chip in a fanless laptop? Just turn down the TDP to match your power and cooling capabilities. The flip side is that reaching the highest clock speeds and performance targets of Qualcomm's chips requires significant cooling and power delivery. So we are unlikely to see the X1E-84-100 in fanless devices, for example, because its higher clock speeds would go to waste without the thermal headroom. None of this prevents lower-performing chips from being used as budget options in larger devices, but the SKU table can still be roughly sorted by TDP.

Although nothing was disclosed today, don't be surprised to see more Snapdragon X chip SKUs in due course. It is something of an open secret that Qualcomm is developing at least one more Snapdragon X chip - likely a physically smaller die with fewer CPU and GPU cores - which could anchor a more budget-focused SKU series in the future. For now, however, Qualcomm is starting with its larger silicon, which is also its highest-performing option.

While the first batch of Snapdragon X devices won't reach consumers until next week, it is already clear from OEM adoption that this will be Qualcomm's most successful Windows-on-Arm SoC to date. The difference in adoption compared to the Snapdragon 8cx Gen 3 is night and day: Qualcomm's PC partners have developed over a dozen laptop models around the new chip, whereas the latest 8cx landed just two designs. With Microsoft, Dell, HP, Lenovo, and other companies all producing Snapdragon X laptops, the Snapdragon X ecosystem is off to a far stronger start than any previous Windows-on-Arm product.

Undoubtedly, much of this is down to Qualcomm's powerful architecture. Snapdragon X features a CPU that Qualcomm claims is far more powerful than the Cortex-X1 cores in the latest (circa 2022) 8cx chip, and it is manufactured on a highly competitive process, TSMC's N4 node. If everything lands as it should, the Snapdragon X chips should be a huge step forward for Qualcomm.

At the same time, two other pillars support this launch. The first is, of course, artificial intelligence: Snapdragon X is the first Windows SoC to support Copilot+, which requires a 40+ TOPS NPU, and the 45 TOPS Hexagon NPU makes this the first chip to deliver that level of performance for neural networks and other model inference. The second pillar is battery life. Qualcomm promises that, drawing on its years of experience building mobile SoCs, its SoC will deliver very long battery life. If it can do so while also hitting its performance targets - striking a balance between performance and battery life for users - that will provide a solid foundation for the Snapdragon X chips and the laptops built on them.

Ultimately, Qualcomm hopes to have its Apple Silicon moment - repeating the performance and battery life gains Apple achieved when it transitioned from Intel's x86 chips to its own custom Arm chips, Apple Silicon. Meanwhile, partner Microsoft is very eager to have a MacBook Air competitor in the PC ecosystem. This is a daunting task - not least because Intel and AMD have hardly been standing still these past few years - but it is not an impossible one.

That said, Qualcomm and the Windows-on-Arm ecosystem do face obstacles that mean the Snapdragon X launch can never quite mirror Apple's trajectory. Beyond the obvious lack of a single vendor controlling the hardware and software ecosystem (and driving developers to write software for it), Windows carries expectations of backward compatibility and the legacy baggage that comes with them. On its side, Microsoft continues to invest in its x86/x64 emulation layer, now called Prism, and the Snapdragon X launch will be its first real test. And although Windows has supported Arm for years, the native software ecosystem is only slowly taking shape, so Snapdragon X will lean on x86 emulation far more than Apple ever did. Windows and macOS are very different operating systems, both in their histories and in their owners' development philosophies, and this will be especially evident in Snapdragon X's first few years.

Oryon CPU Architecture: A Core Designed for All Applications

To dig into the architecture, let's start with the heart of the matter: the Oryon CPU core.

As a brief recap, Oryon came to Qualcomm by acquisition. The CPU core was originally named "Phoenix" and developed by the chip startup NUVIA. NUVIA itself was founded by several former Apple employees, and their initial plan was to develop a new server CPU core to compete with the cores in modern Xeon, EPYC, and Arm Neoverse V CPUs. However, Qualcomm seized the opportunity to acquire an excellent CPU development team, purchasing NUVIA in 2021. Phoenix was repurposed for consumer-grade hardware, reborn as the Oryon CPU core.

Although Qualcomm does not dwell on Oryon's origins, it is clear that the first-generation architecture (using Arm's v8.7-A ISA) is still deeply rooted in the original Phoenix design. Phoenix itself was designed to be scalable and energy-efficient, so this is by no means a bad thing for Qualcomm. It does mean, however, that many client-focused core design changes did not make it into the initial Oryon design, and we should expect to see such changes in future generations of the CPU architecture.

Diving in, as previously disclosed by Qualcomm, the Snapdragon X uses three clusters of Oryon CPU cores. At a high level, Oryon is designed as a single full-size CPU core capable of delivering both efficiency and performance. For this reason, it is the only CPU core Qualcomm needs; there are no separate performance-optimized and efficiency-optimized cores as in Qualcomm's previous Snapdragon 8cx chips or the latest mobile chips from Intel and AMD.

According to Qualcomm's disclosures, all clusters are equal; there is no "efficiency" cluster tuned for power efficiency rather than clock speed. However, only 2 CPU cores (in different clusters) can reach the top turbo boost speed of any given SKU; the remaining cores are limited to the chip's all-core turbo speed.

Each cluster also has its own PLL, so each cluster can be clocked and powered independently. In practice, this means that under light workloads two clusters can go to sleep, waking again when more performance is needed.

Unlike most CPU designs, Qualcomm uses a flatter cache hierarchy for the Snapdragon X and its Oryon CPU clusters. The L2 cache is not per-core but shared by each set of 4 cores (similar to how Intel shares the L2 cache across its E-core clusters). It is also a fairly large L2 cache, at 12MB. The L2 is 12-way associative, and even at that size, the latency of an L2 access following an L1 miss is only 17 cycles. This is an inclusive cache design, so it also mirrors the contents of the L1 caches. According to Qualcomm, they chose an inclusive cache for energy reasons: inclusivity makes eviction much simpler, because L1 data does not need to be moved down to L2 to be evicted (or removed from L2 when promoted to L1). Cache coherence, in turn, is maintained using the MOESI protocol.

The L2 cache itself operates at the full core frequency. L1/L2 cache operations are 64-byte operations, equivalent to hundreds of gigabytes per second of bandwidth between the cache and CPU cores. While the L2 cache is primarily used to serve its own directly connected CPU cores, Qualcomm has also implemented optimized cluster-to-cluster snooping operations to handle a situation where one cluster needs to read from another cluster.
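For readers curious where figures like the 17-cycle L2 latency come from, the standard measurement technique is a dependent pointer chase: each load's address depends on the previous load, so the core cannot overlap them and the raw load-to-use latency is exposed. Below is a minimal sketch in C; the buffer size, iteration count, and POSIX timing calls are our own illustrative choices, not Qualcomm's methodology.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (1 << 16)   /* 64K pointers x 8 B = 512 KB: larger than the
                               96 KB L1D, well inside the 12 MB shared L2 */

int main(void) {
    void **chain = malloc(ENTRIES * sizeof(void *));
    size_t *idx  = malloc(ENTRIES * sizeof(size_t));
    /* Build a shuffled cycle so each load depends on the previous one,
       defeating out-of-order overlap and the prefetchers. */
    for (size_t i = 0; i < ENTRIES; i++) idx[i] = i;
    srand(1);
    for (size_t i = ENTRIES - 1; i > 0; i--) {        /* Fisher-Yates */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < ENTRIES; i++)
        chain[idx[i]] = &chain[idx[(i + 1) % ENTRIES]];

    struct timespec t0, t1;
    void **p = &chain[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long n = 0; n < 100L * ENTRIES; n++)
        p = (void **)*p;                /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%p avg load latency: %.2f ns\n", (void *)p,
           ns / (100.0 * ENTRIES));     /* printing p keeps the loop live */
    return 0;
}
```

Dividing the measured nanoseconds by the clock period gives cycles; at a 512 KB working set the walk should land mostly in L2 on a chip with this cache layout.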

Interestingly, the 4-core cluster configuration in Snapdragon X doesn't even use the Oryon CPU cluster at its full size. According to Qualcomm's engineers, the cluster design actually has all the plumbing and bandwidth of an 8-core design - no doubt a reminder of its roots as a server processor. For consumer-grade processors, multiple smaller clusters offer finer-grained power management and serve as better building blocks for lower-end chips (such as Snapdragon mobile SoCs). There are trade-offs, however, when cores sit in different clusters (and thus must talk to one another via a bus interface unit), namely slower core-to-core communication. This is a small but noteworthy difference, as Intel's and AMD's current designs place 6 to 8 CPU cores within a single cluster/CCX/die.

A closer look at a single Oryon CPU core quickly reveals why Qualcomm went with a shared L2 cache: the L1 instruction cache within a single core is already very large. Oryon carries a 192KB L1 I-Cache, three times the size of Redwood Cove's (Meteor Lake) and even larger than Zen 4's. Overall, the 6-way associative cache lets Oryon keep a large number of instructions local to the CPU's execution units. Unfortunately, we do not have L1I latency figures on hand to compare against other chips.

In sum, Oryon's fetch/L1 unit can retrieve up to 16 instructions per cycle.

This, in turn, feeds a very wide decode frontend. Oryon can decode up to 8 instructions per clock cycle, wider than Redwood Cove (6) and Zen 4 (4). Moreover, all the decoders are identical (symmetric), so full throughput is available without special cases or scenarios. As with other contemporary processors, decoded instructions are issued as micro-operations (uOps) for further processing by the CPU core. Technically, each Arm instruction can decode into up to 7 uOps, but according to Qualcomm, the ratio of Arm v8 instructions to decoded micro-operations is usually much closer to 1:1.

Branch prediction is another major driver of CPU core performance, and it is another area where Oryon has not skimped. Oryon has all the usual predictors: direct, conditional, and indirect. The direct predictor is single-cycle; a branch misprediction, however, incurs a 13-cycle penalty. Unfortunately, Qualcomm has not disclosed the size of the branch target buffers themselves, so we do not know how large they actually are.

We do, however, know the size of the L1 Translation Lookaside Buffer (TLB), which is used for virtual-to-physical memory address mapping. The buffer holds 256 entries and supports 4KB and 64KB page sizes.

Turning to Oryon's execution backend, there is a great deal worth discussing, in part because there is a lot of hardware and a lot of buffers here. Oryon has a very sizable 650+ entry reorder buffer (ROB) for extracting instruction-level parallelism - and thus performance - through out-of-order execution. This makes Qualcomm the latest CPU designer to buck the conventional wisdom about the diminishing returns of ever-larger ROBs and go big anyway.

Instruction retirement, in turn, matches the maximum capacity of the decode block: 8 instructions in, 8 uOps out. As noted earlier, the decoder can technically emit multiple uOps for a single instruction, but in most cases this will align neatly with the instruction retirement rate.

The register renaming pool on Oryon is also quite large (do you sense a common theme here?). There are over 400 registers available for integers, with an additional 400 registers available for feeding the vector unit.

As for the execution pipelines themselves, Oryon provides 6 integer pipelines, 4 FP/vector pipelines, and a further 4 load/store pipelines. Qualcomm has not provided a complete mapping of each pipeline, so we cannot cover every capability and special case. At a high level, though, all the integer pipelines can perform basic ALU operations, 2 can handle branches, and 2 can execute complex multiply-accumulate (MLA) instructions. We are also told that the majority of integer operations have single-cycle latency - that is, they execute within a single cycle.

On the floating-point/vector side, each vector pipeline has its own NEON unit. As a reminder, this is an Arm v8.7 architecture, so there are no SVE vector or SME matrix pipelines here; the CPU core's only SIMD capability comes from classic 128-bit NEON instructions. This does limit the CPU to narrower vectors than contemporary PC CPUs (AVX2 is 256 bits wide), but it compensates in part by having NEON units on all four FP pipelines. And since we are now in the AI era, the FP/vector units support all the common data types, all the way down to INT8. The only notable omission is BF16, a common data type for AI workloads; but for serious AI workloads, that is what the NPU is for.
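To make the 128-bit constraint concrete, here is a minimal sketch using the standard AArch64 NEON intrinsics from `arm_neon.h` (a generic example, not Qualcomm-specific code): one fused multiply-add instruction processes four FP32 lanes at once, and Oryon can issue such an operation on each of its four FP/vector pipes per cycle.

```c
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, acc[4] = {0};

    float32x4_t va = vld1q_f32(a);      /* one 128-bit load = 4 FP32 lanes */
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vld1q_f32(acc);
    vc = vfmaq_f32(vc, va, vb);         /* fused multiply-add on all 4 lanes */
    vst1q_f32(acc, vc);

    printf("%f %f %f %f\n", acc[0], acc[1], acc[2], acc[3]);
    /* A 256-bit AVX2 loop does the same work in half the iterations;
       Oryon compensates by running NEON on all four FP pipes per cycle. */
    return 0;
}
```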

That brings us to Oryon's load/store units. They are very flexible: the 4 execution pipelines can execute any mix of loads and stores as needed each cycle. The load queue can hold up to 192 entries, while the store queue holds up to 26 entries. All fills are a full cache line: 64 bytes.

The L1 data cache backing the load/store units is also quite large. The fully inclusive, 6-way associative cache is 96KB, twice the size of Intel's Redwood Cove L1 (although the upcoming Lion Cove will change that significantly). It is also designed to efficiently support a variety of access sizes.

Beyond that, Qualcomm's memory prefetcher is something of a "secret sauce," with the company saying this relatively complex unit contributes significantly to performance. Qualcomm has accordingly not revealed much about how its prefetcher works, but its ability to accurately predict and prefetch data will undoubtedly have a huge impact on the CPU core's overall performance, especially given how long DRAM latencies are at modern processor clock speeds. Broadly, Qualcomm's prefetch algorithms aim to cover a range of scenarios, from simple adjacency and strides to more complex patterns, using past access history to predict future data needs.
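To make the distinction concrete, the sketch below contrasts a stride pattern that a hardware prefetcher like Oryon's can be expected to recognize with a data-dependent walk that defeats history-based prediction. This is generic illustrative C; we have no knowledge of the actual patterns Qualcomm's prefetcher covers beyond what is stated above.

```c
#include <stddef.h>

#define N (1 << 24)   /* illustrative working set, far larger than L2 */

/* A constant-stride walk exposes the "adjacency and stride" patterns the
 * prefetcher recognizes: the next lines arrive before the core asks. */
long sum_strided(const long *a, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < N; i += stride)
        s += a[i];
    return s;
}

/* A data-dependent walk (each index comes from the value just loaded,
 * array assumed pre-filled with random longs) gives history-based
 * predictors little to work with, so most loads pay something close to
 * full DRAM latency (~102-104 ns on Snapdragon X, per Qualcomm). */
long sum_pointer_chase(const long *next) {
    long s = 0;
    size_t i = 0;
    for (size_t steps = 0; steps < N; steps++) {
        s += next[i];
        i = (size_t)next[i] & (N - 1);  /* next index depends on the load */
    }
    return s;
}
```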

By comparison, Oryon's memory management unit is relatively unremarkable. It is nonetheless a fully-featured, modern MMU, supporting sophisticated capabilities such as nested virtualization - allowing guest virtual machines to host their own guest hypervisors to serve further virtual machines.

Another notable feature here is the hardware page table walker. When a translation misses in the TLBs, this unit walks the page tables in memory to fetch it, supporting up to 16 concurrent table walks. Bear in mind that this is per core, so a complete Snapdragon X chip can have up to 192 table walks in flight at once.

Finally, in addition to CPU cores and CPU clusters, we also have the highest level of the SoC: the shared memory subsystem.

The last-level cache lives here: an L3 cache shared across the chip. Given the size of the chip's L1 and L2 caches, you might assume the L3 is also quite large. You would be wrong. Qualcomm has equipped the chip with just 6MB of L3 cache - a small fraction of the 36MB of total L2 cache that feeds into it.

Since the chip already has a large amount of tightly integrated cache at the L1/L2 levels, Qualcomm uses a relatively small victim cache here as the last stop before system memory. This is a significant departure from traditional x86 CPUs, though it is very much on-brand for Qualcomm, whose Arm mobile SoCs also typically carry relatively small L3 caches. The upside, at least, is that L3 access is quite fast, with a latency of just 26-29 nanoseconds. And it has bandwidth to match DRAM (135GB/s), letting it shuttle data between the L2 caches below it and the DRAM above it.

As for memory support, as previously disclosed, the Snapdragon X has a 128-bit memory bus supporting LPDDR5X-8448, for a maximum memory bandwidth of 135GB/s. At current LPDDR5X capacities, the Snapdragon X can handle up to 64GB of RAM. I wouldn't be too surprised, however, if Qualcomm validates 128GB once higher-density LPDDR5X chips start shipping.
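The 135GB/s figure follows directly from the bus width and transfer rate; as a quick sanity check:

```c
#include <stdio.h>

int main(void) {
    double transfers_per_sec = 8448e6;   /* LPDDR5X-8448: 8448 MT/s */
    double bytes_per_transfer = 128.0 / 8.0;  /* 128-bit bus = 16 B */
    printf("%.1f GB/s\n",
           transfers_per_sec * bytes_per_transfer / 1e9);  /* 135.2 */
    return 0;
}
```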

Notably, unlike some other mobile-focused chips, the Snapdragon X does not use any kind of on-package memory. LPDDR5X chips are instead mounted on the device's motherboard, and device vendors can choose their own memory configurations.

Using LPDDR5X-8448 memory, Qualcomm tells us that DRAM latency should be slightly higher than 100ns, at 102-104ns.

Finally, a brief word on CPU security. Qualcomm supports all the security features expected of a modern chip, including Arm TrustZone, per-cluster random number generators, and security enhancements such as pointer authentication. Notably, Qualcomm claims that Oryon mitigates all known side-channel attacks, including Spectre - which it calls "the gift that keeps on giving." That claim is interesting because Spectre is not a discrete hardware flaw but an inherent consequence of speculative execution, which is exactly why it is so hard to fully defend against (the best defense is to isolate sensitive operations). Nevertheless, Qualcomm believes that by implementing various obfuscation techniques in hardware, it can thwart such side-channel attacks. It will be interesting to see how this holds up.
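For context, the canonical Spectre v1 gadget below (the standard published example, nothing Oryon-specific) shows why this is a property of speculation itself: the out-of-bounds read and the cache footprint both happen before the bounds check retires, which is why hardware mitigations focus on hiding or cleaning up microarchitectural traces rather than patching a discrete bug.

```c
#include <stdint.h>
#include <stddef.h>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t array2[256 * 4096];   /* probe array: one page per byte value */

void victim(size_t x) {
    if (x < array1_size) {            /* branch mispredicted as taken...  */
        uint8_t secret = array1[x];   /* ...speculatively reads out of bounds */
        (void)array2[secret * 4096];  /* leaves a secret-dependent cache
                                         footprint an attacker can time   */
    }
}
/* Mitigations either stop the speculation (barriers, index masking) or,
   as Qualcomm describes for Oryon, obscure the footprint it leaves. */
```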

About x86 Emulation

Finally, I would like to take some time to briefly introduce some information about x86 emulation on Oryon.

Qualcomm's x86 emulation story is rather more complex than what we are used to from Apple, because in the Windows world no single vendor controls both the hardware and the software stack. So while Qualcomm can speak to their hardware, they cannot control the software side - and they will not risk misstatements by speaking on Microsoft's behalf. x86 emulation on Snapdragon X devices is thus essentially a joint project between the two companies, with Qualcomm providing the hardware and Microsoft providing the Prism translation layer.

Yet although x86 emulation is largely a software task (with Prism doing most of the heavy lifting), an Arm CPU vendor can still make hardware provisions to improve x86 performance, and Qualcomm has done so. The Oryon CPU core has hardware assists that enhance x86 floating-point performance. And to address the elephant in the room, Oryon also provides hardware support for x86's unique memory ordering model - widely seen as a key to how Apple achieved high x86 emulation performance on its own chips.
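The ordering problem is easy to state with the classic store-buffering litmus test. Under x86's total store ordering (TSO), the outcome r1 == 0 && r2 == 0 is forbidden; under Arm's weaker model it is allowed, so a naive emulator must insert barriers around nearly every memory access unless the hardware offers a stronger ordering mode. The C11 sketch below is our own generic harness, not Qualcomm code:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int X, Y;
int r1, r2;

void *t0(void *arg) {
    atomic_store_explicit(&X, 1, memory_order_relaxed); /* plain Arm store */
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return arg;
}

void *t1(void *arg) {
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return arg;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0); atomic_store(&Y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL); pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)   /* forbidden under TSO, legal on Arm */
            printf("stores reordered at iteration %d\n", i);
    }
    return 0;
}
```

On stock Arm hardware the message can appear; an emulator targeting x86 semantics has to prevent it, either with pervasive barriers or with the kind of hardware ordering support described above.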

Nevertheless, no one would expect Qualcomm's chips to run x86 code as fast as native chips. There will still be some translation overhead (how much depends on the workload), and performance-critical applications will still benefit from being natively compiled for AArch64. But Qualcomm is not entirely at the mercy of Microsoft in this regard, as they have made hardware adjustments to enhance their x86 emulation performance.

On the compatibility front, the biggest anticipated obstacle is AVX2 support. Compared to Oryon's NEON units, the x86 vector instruction set is wider (256b vs. 128b), and the instruction sets do not overlap completely. As Qualcomm puts it, AVX-to-NEON conversion is a challenging task. We know it can be done, however - Apple quietly added AVX2 support to its Game Porting Toolkit 2 just this week - so it will be interesting to see what happens with future generations of Oryon CPU cores. Unlike in Apple's ecosystem, x86 is not going away on Windows, so the path to AVX2 (and eventually AVX-512 and AVX10) will be worth watching!
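To see the width mismatch concretely, here is one way a translator might lower a single 256-bit AVX2 operation onto NEON, modeling a YMM register as a pair of 128-bit halves. This is purely a hypothetical sketch of the technique - it is not how Prism actually implements it - and it shows only the easy case; cross-lane AVX2 shuffles have no such clean 1:1 mapping.

```c
#include <arm_neon.h>

/* Hypothetical emulator-side model of a 256-bit YMM register as two
   128-bit NEON halves. */
typedef struct { float32x4_t lo, hi; } ymm_f32;

/* Emulating x86's _mm256_add_ps(a, b): one AVX2 instruction becomes two
   NEON instructions, doubling issue slots and register pressure. */
static inline ymm_f32 emu_mm256_add_ps(ymm_f32 a, ymm_f32 b) {
    ymm_f32 r;
    r.lo = vaddq_f32(a.lo, b.lo);   /* lower 128 bits */
    r.hi = vaddq_f32(a.hi, b.hi);   /* upper 128 bits */
    return r;
}
```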

Adreno X1 GPU Architecture: A More Familiar Face

Next, let's talk about the Snapdragon X SoC's GPU architecture: the Adreno X1.

Unlike the Oryon CPU cores, the Adreno X1 is not a clean-sheet hardware architecture; it is not even new to Windows, having shipped in the three prior generations of 8cx SoCs. However, Qualcomm has kept its GPU architecture under wraps for many years, so it may well be new to AnandTech readers all the same. Suffice it to say, I have been trying to get a detailed architectural disclosure out of Qualcomm for over a decade, and with the launch of the Snapdragon X, they have finally delivered one.

At a high level, the Adreno X1 GPU architecture is the latest iteration of the Adreno architecture family that Qualcomm has long been developing, with X1 denoting the 7th generation. Adreno itself traces back to graphics assets acquired from AMD (formerly ATI) 15 years ago (Adreno is an anagram of Radeon), and over the years Qualcomm's Adreno architectures have consistently been among the strongest GPUs in the Android space.

Of course, things are a bit different in the Windows world, where discrete GPUs have relegated integrated GPUs to workloads that don't demand much GPU performance. And because game development is never fully abstracted from GPU architectures and drivers, Qualcomm's negligible presence in the Windows market over the years has left them largely overlooked by game developers. Still, Qualcomm is no newcomer to Windows games, and that experience helps as they attempt to capture a larger share of the Windows market.

From a feature perspective, the Adreno X1 GPU architecture is unfortunately a bit dated compared to contemporary x86 SoCs. Although the architecture supports ray tracing, the chip cannot support the full DirectX 12 Ultimate (feature level 12_2) feature set. That means it must report itself to DirectX applications as a feature level 12_1 GPU, which will keep most games from using those newer features.

Nevertheless, the Adreno X1 does support some advanced features that have long been in active use on Android, where DirectX feature levels do not apply. As mentioned, it supports ray tracing, which is exposed to Windows applications through the Vulkan API and its ray query calls. Given Vulkan's limited use on Windows, it is understandable that Qualcomm did not go deep on the subject, but it sounds like Qualcomm's implementation is a tier 2 design similar to AMD's RDNA2: hardware ray testing, but no dedicated hardware BVH processing.

It also supports Variable Rate Shading (VRS) Tier 2, which is vital for optimizing shader workloads on mobile GPUs. It appears, then, that what stands between the X1 and DirectX 12 Ultimate support is mesh shaders and sampler feedback - and those would indeed be significant hardware changes.

On API support, as mentioned, the Adreno X1 GPU supports DirectX and Vulkan. Qualcomm provides native drivers/paths for DirectX 12 and DirectX 11, Vulkan 1.3, and OpenCL 3.0. The one notable exception is DirectX 9 support, which - as with fellow SoC vendor Intel - is implemented via D3D9On12, translating DX9 commands into DX12. DX9 games are rare these days (the API was superseded by DX10/11 some 15 years ago), but because this is Windows, backward compatibility is a standing expectation.

At the other end of the spectrum, the GPU supports the new DirectML API that Microsoft uses for low-level machine-learning access to GPUs. Qualcomm has even written optimized meta-commands for the GPU so that software using DirectML can run more efficiently without needing to know anything else about the architecture.

Adreno X1 GPU Architecture in Detail

With the high-level features covered, let's look at the low-level architecture.

The Adreno X1 GPU is divided into 6 shader processing blocks, each block providing 256 FP32 ALUs, totaling 1536 ALUs. The peak clock speed is 1.5GHz, allowing the integrated GPU on Snapdragon X to achieve a maximum throughput of 4.6 TFLOPS (lower-end SKUs have lower throughput).
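The 4.6 TFLOPS figure checks out if each FP32 ALU retires one fused multiply-add (2 FLOPs) per clock, which is the usual convention for peak-throughput math:

```c
#include <stdio.h>

int main(void) {
    double alus  = 6 * 256;                    /* 1536 FP32 lanes        */
    double flops = alus * 2 /* FMA */ * 1.5e9; /* 2 FLOPs/clock @1.5 GHz */
    printf("%.2f TFLOPS\n", flops / 1e12);     /* 4.61                   */
    return 0;
}
```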

Like other GPUs, the Adreno X1 is organized into the traditional front-end/SP/back-end arrangement, with the front end handling triangle setup and rasterization, as well as binning for the GPU's tile-based rendering modes. Notably, the front end can set up and rasterize 2 triangles per clock, which is nothing remarkable in the PC space in 2024 but is quite good for an integrated GPU. To improve performance, the front end can also perform early depth testing to reject polygons that will never be visible, before rasterization. The back end, meanwhile, consists of 6 Render Output Units (ROPs), each able to process 8 pixels per cycle, for a total of 48 pixels per clock. The render back ends plug into the local cache, as well as an important scratch memory Qualcomm calls GMEM.

The individual shader processor blocks themselves are fairly conventional, especially if you have seen an NVIDIA GPU architecture diagram. Each SP is further subdivided into two micro-pipelines (micro shader pipe/texture pipes, or uSPTPs), each governed by its own dedicated scheduler and other resources (such as local memory, load/store units, and texture units).

Each uSPTP provides 128 FP32 ALUs. Surprisingly, there is also a separate set of 256 FP16 ALUs, which means that the Adreno X1 does not need to share resources when processing FP16 and FP32 data, unlike architectures that perform FP16 operations on FP32 ALUs. However, if the GPU scheduler determines the need, FP32 units can also be used for FP16 operations.

Finally, there are 16 Elementary Function Units (EFUs) for handling transcendental functions such as LOG, SQRT, and other rare but important mathematical functions.

Surprisingly, the wavefront size used by Adreno X1 is quite large. Depending on the mode, Qualcomm uses a wavefront with a channel width of 64 or 128. Qualcomm tells us that they typically use a 128-wide wavefront for 16-bit operations (such as fragment shaders), while a 64-wide wavefront is used for 32-bit operations (such as pixel shaders).

By comparison, AMD's RDNA architectures use 32/64-wide wavefronts, while NVIDIA's wavefronts/warps are always 32-wide. Wide designs have fallen out of favor in the PC space because of the difficulty of keeping them fed (too much divergence), so this is an interesting choice. Whatever the concerns about wavefront size, given the high GPU performance of Qualcomm's smartphone SoCs, it clearly works well for Qualcomm - no small feat given the high resolutions of smartphone screens.

Alongside the ALUs, each uSPTP contains its own texture unit, capable of outputting 8 texels per clock per uSPTP. Some limited image-processing functionality lives here too, including texture filtering and even SAD/SAS instructions for generating motion vectors.

Finally, there is a considerable amount of register space in each uSPTP. Besides the L1 texture cache, there is a total of 192KB of general-purpose registers to feed the various blocks and help hide latency bubbles between wavefronts.

As mentioned earlier, Adreno X1 supports multiple rendering modes for optimal performance, which the company refers to as FlexRender technology. This is a less common topic in PC GPU design but more important in the mobile field due to historical and efficiency reasons.

Besides the traditional direct/immediate-mode rendering method (the typical mode for most PC GPUs), Qualcomm also supports tile-based rendering, which it calls binned mode. Like other tile-based renderers, binned mode splits the screen into multiple tiles and renders each one separately. This lets the GPU work on only a portion of the data at a time, keeping most of that data in its local caches and minimizing traffic to DRAM, which is both power-hungry and performance-limiting.

Finally, Adreno X1 has a third mode that combines the advantages of tiled and direct rendering, which Qualcomm calls binned direct mode. In this mode, a binned visibility pass runs first, as a means of culling back-facing (invisible) triangles before rasterization; only after that data is discarded does the GPU switch to direct rendering, with a reduced workload.
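For readers unfamiliar with binned rendering, the core of the technique is a cheap pass that sorts triangles into per-tile lists before any shading happens. The sketch below is a generic, greatly simplified illustration of such a binning pass; the tile size, list capacity, and data layout are invented for the example and bear no relation to Adreno's actual implementation.

```c
#include <stddef.h>

#define TILE    32   /* hypothetical 32x32-pixel tiles */
#define MAX_BIN 64   /* hypothetical per-tile triangle list capacity */

typedef struct { float x[3], y[3]; } Tri;   /* screen-space vertices */

/* Assign each triangle to every tile its bounding box overlaps, so the
 * later per-tile rendering pass only touches its own short list. */
void bin_triangles(const Tri *tris, int n, int width, int height,
                   int (*bins)[MAX_BIN], int *bin_count) {
    int tiles_x = (width  + TILE - 1) / TILE;
    int tiles_y = (height + TILE - 1) / TILE;
    for (int i = 0; i < n; i++) {
        /* Conservative screen-space bounding box of the triangle. */
        float minx = tris[i].x[0], maxx = minx;
        float miny = tris[i].y[0], maxy = miny;
        for (int v = 1; v < 3; v++) {
            if (tris[i].x[v] < minx) minx = tris[i].x[v];
            if (tris[i].x[v] > maxx) maxx = tris[i].x[v];
            if (tris[i].y[v] < miny) miny = tris[i].y[v];
            if (tris[i].y[v] > maxy) maxy = tris[i].y[v];
        }
        int tx0 = (int)(minx / TILE), tx1 = (int)(maxx / TILE);
        int ty0 = (int)(miny / TILE), ty1 = (int)(maxy / TILE);
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 >= tiles_x) tx1 = tiles_x - 1;
        if (ty1 >= tiles_y) ty1 = tiles_y - 1;
        /* Append the triangle index to every overlapped tile's list. */
        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++) {
                int t = ty * tiles_x + tx;
                if (bin_count[t] < MAX_BIN)  /* sketch: drop on overflow */
                    bins[t][bin_count[t]++] = i;
            }
    }
}
```

Because each tile's working set then fits in on-chip memory, the shading pass can run entirely out of GMEM, which is the point of the next section.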

The key to making the binned rendering modes work is the GPU's GMEM, a 3MB block of SRAM that serves as a very high-bandwidth scratchpad for the GPU. Architecturally, GMEM is more than a cache, because it sits apart from the system memory hierarchy and the GPU can do almost anything it wants with that memory (including using it as a cache when necessary).

At 3MB, the GMEM block is not very large in absolute terms. But it is enough to hold a tile, sparing system memory a great deal of traffic. And it is fast, with 2.3TB/s of bandwidth - enough to keep the ROPs running at full speed without being memory-bandwidth limited.

With the GMEM block, ideally, the GPU only needs to write data to DRAM once for each piece of work when rendering that tile. Of course, in practice, DRAM traffic may be more than this, but this is one of Qualcomm's key features to avoid the GPU writing data to DRAM, consuming memory bandwidth and power.

When the Adreno X1 really needs to enter the system memory, it will pass through its remaining cache and finally reach the shared memory controller of Snapdragon X.

Above GMEM, each pair of SP has a 128KB cluster cache (for the complete Snapdragon X, a total of 384KB). Above this, there is a 1MB GPU unified L2 cache.

Finally, there is the system-level cache (L3/SLC), which serves all processing blocks on the GPU. When all other methods fail, there is still DRAM.

Lastly, it is worth noting that the Adreno X1 GPU also includes a dedicated RISC controller inside the GPU, acting as the GPU Management Unit (GMU). The GMU provides various functions, with the most important being power management within the GPU. The GMU works in coordination with power management requests from other parts of the SoC, allowing the chip to redistribute power among different blocks based on the optimal performance allocation determined by the SoC.

Performance and Initial Thoughts

Finally, before wrapping up this architecture deep dive, let's look at a few of Qualcomm's performance slides. We will see the Snapdragon X's performance firsthand when retail devices launch worldwide next week, but these give us a preview of what to expect - taken with the usual caution.

In terms of CPU, Qualcomm claims that Snapdragon X Elite can outperform all contemporary PC competitors in the GeekBench 6.2 single-threaded test. Moreover, its leading advantage in terms of energy efficiency is also quite significant.

In short, Qualcomm claims that even with the x86 cores' TDP unconstrained, the Oryon CPU cores in the Snapdragon X Elite can beat Redwood Cove (Meteor Lake) and Zen 4 (Phoenix) in absolute performance. Given that mobile x86 chips turbo as high as 5GHz, that is a bold claim - but not an impossible one.

On the GPU side, meanwhile, Qualcomm touts similar efficiency gains. However, the workload in question - 3DMark Wild Life Extreme - is unlikely to be representative of most games, as it is a mobile-focused benchmark that every mobile SoC vendor has spent years optimizing their drivers for.

Performance benchmarks from actual games are likely more useful here. While Qualcomm is surely picking favorable results, the top Snapdragon X SKU frequently trades blows with Intel's Core Ultra 7 155H. Admittedly, a mix of ties and losses is not a dazzling showing, but it is good to finally see where Qualcomm stands in real games - and even fighting one of Intel's better mobile chips to a draw is not a bad place to start.

Preliminary Thoughts

The above is our first in-depth look at Qualcomm's Snapdragon X SoC architecture. Qualcomm is investing in the Windows-on-Arm ecosystem for the long haul, hoping to be first in what it expects will become a much larger market, as the company seeks to establish itself as the third major Windows CPU/SoC supplier.

Ultimately, however, the significance of the Snapdragon X SoC and its Oryon CPU cores goes beyond laptops. Even if Qualcomm finds great success there, the PC chips it ships will be a drop in the bucket next to its real base of strength: Android SoCs. And there, Oryon points the way to major changes in Qualcomm's mobile SoCs.

As Qualcomm has said since the start of the Oryon journey, this will ultimately become the CPU core at the heart of all of Qualcomm's products. Starting with this month's PC SoCs, it will eventually extend to mobile SoCs such as the Snapdragon 8 series, and further out to offshoots such as Qualcomm's automotive and XR headset SoCs. Although I doubt we will truly see Oryon and its successors top to bottom across Qualcomm's lineup (the company needs small, cheap CPU cores to power budget lines like Snapdragon 6 and Snapdragon 4), in the long run it will undoubtedly become the cornerstone of most of their products. That is the differentiating value of building your own CPU core - you extract the maximum value from it by using it in as many places as possible.

Ultimately, Qualcomm has been heavily promoting its next-generation PC SoC and its custom CPU core over the past 8 months, and now it's time to get all the pieces in place. The prospect of having a third competitor in the PC CPU field (and an Arm-based competitor) is exciting, but slides and advertisements are not hardware and benchmarks. Therefore, we eagerly await what next week will bring and see if Qualcomm's engineering prowess can fulfill the company's grand ambitions.

Article Author: Semiconductor Industry Watch, Source: [Semiconductor Industry Watch (ID: icbank)](https://mp.weixin.qq.com/s?__biz=Mzg2NDgzNTQ4MA==&mid=2247741738&idx=3&sn=400a8c624587246fe57aefb102c8ba51&chksm=ce6e33ddf919bacbaa1c3b3029ca107ab813e0964d2fc35d35a13a1452e97a449f2c1cf047ca&mpshare=1&scene=23&srcid=06145FUWEMqWVPc0W3s07Yff&sharer_shareinfo=d0ce1f81ae2fff00f728d18c3cccc339&sharer_shareinfo_first=d0ce1f81ae2fff00f728d18c3cccc339#rd), original title: "Qualcomm self-developed chip architecture deeply revealed, Apple M series welcomes the strongest competitor"