Intel CPUs instruction set architecture Evolution
Before you examine the technical details of the core x86-32 platform, a brief history lesson might be helpful in understanding how the architecture has evolved over the years. In the review that follows, I focus on the noteworthy processors and architectural enhancements that have affected how software developers use x86 assembly language. Readers who are interested in a more comprehensive chronicle of the x86’s lineage should consult the resources listed in future posts.
The original embodiment of the x86-32 platform was the Intel 80386 microprocessor, which was introduced in 1985. The 80386 extended the architecture of its 16-bit predecessors to include 32-bit wide registers and data types, flat memory model options, a 4 GB logical address space, and paged virtual memory. The 80486 processor improved the performance of the 80386 with the inclusion of on-chip memory caches and optimized instructions. Unlike the 80386 with its separate 80387 floating-point unit (FPU), most versions of the 80486 CPU also included an integrated x87 FPU. Expansion of the x86-32 microarchitectures continued with the introduction of the first Pentium brand processor in 1993. Known as the P5 microarchitecture, performance enhancements included a dual-instruction execution pipeline, 64-bit external data bus, and separate on-chip code and data caches. (A microarchitecture defines the organization of a processor’s internal components, including its register files, execution units, instruction pipelines, data buses, and memory caches. Microarchitectures are often used by multiple processor product lines as described in this section.) Later versions of the P5 microarchitecture incorporated a new computational resource called MMX technology, which supports single-instruction multiple-data (SIMD) operations on packed integers using 64-bit wide registers (1997). The P6 microarchitecture, first used on the Pentium Pro (1995) and later on the Pentium II (1997), extended the x86-32 platform using a three-way superscalar design. This means that the processor is able (on average) to decode, dispatch, and execute three distinct instructions during each clock cycle. Other P6 augmentations included support for out-of-order instruction executions, improved branch-prediction algorithms, and speculative instruction executions. The Pentium III, also based on the P6 microarchitecture, was launched in 1999 and included a new SIMD technology called streaming SIMD extensions (SSE). SSE added eight 128-bit wide registers to the x86-32 platform and instructions that support packed single-precision (32-bit) floating-point arithmetic.
In 2000 Intel introduced a new microarchitecture called Netburst that included SSE2, which extended the floating-point capabilities of SSE to cover packed double- precision (64-bit) values. SSE2 also incorporated additional instructions that enabled the 128-bit SSE registers to be used for packed integer calculations and scalar floating-point operations. Processors based on the Netburst architecture included several variations of the Pentium 4. In 2004 the Netburst microarchitecture was upgraded to include SSE3 and hyper-threading technology. SSE3 adds packed integer and packed floating-point instructions to the x86 platform while hyper-threading technology parallelizes the processor’s front-end instruction pipelines in order to improve performance. SSE3-capable processors include 90 nm (and smaller) versions of the Pentium 4 and the server-oriented Xeon product lines.
In 2006 Intel launched a new microarchitecture called Core. The Core microarchitecture included redesigns of many Netburst front-end pipelines and execution units in order to improve performance and reduce power consumption. It also incorporated a number of x86-SSE enhancements, including SSSE3 and SSE4.1. These extensions added new packed integer and packed floating-point instructions to the platform but no new registers or data types. Processors based on the Core microarchitecture include CPUs from the Core 2 Duo and Core 2 Quad series and the Xeon 3000/5000 series. A microarchitecture called Nehalem followed Core in late 2008. The Nehalem microarchitecture re-introduced hyper-threading to the x86 platform, which had been excluded from the Core microarchitecture. It also incorporates SSE4.2. This final x86-SSE enhancement adds several application-specific accelerator instructions to the x86-SSE instruction set. SSE4.2 also includes four new instructions that facilitate text-string processing using the 128-bit wide x86-SSE registers. Processors based on the Nehalem microarchitecture include first generation Core i3, i5, and i7 CPUs. It also includes CPUs from the Xeon 3000, 5000, and 7000 series.
In 2011 Intel launched a new microarchitecture called Sandy Bridge. The Sandy Bridge microarchitecture introduced a new x86 SIMD technology called Advanced Vector Extensions (AVX). AVX adds packed floating-point operations (both single-precision and double-precision) using 256-bit wide registers. AVX also supports a new three-operand instruction syntax, which helps reduce the number of register-to-register data transfers that a function must perform. Processors based on the Sandy Bridge microarchitecture include second- and third-generation Core i3, i5, and i7 CPUs along with Xeon series E3, E5, and E7 CPUs. In 2013 Intel unveiled its Haswell microarchitecture. Haswell includes AVX2, which extends AVX to support packed-integer operations using its 256-bit wide registers. AVX2 also supports enhanced data transfer capabilities with its new set of broadcast, gather, and permute instructions. Another feature of the Haswell microarchitecture is its inclusion of fused-multiply-add (FMA) operations. FMA enables software to perform successive product-sum calculations using a single floating-point rounding operation. The Haswell microarchitecture also encompasses several new general-purpose register instructions. Processors based on the Haswell microarchitecture include fourth- generation Core i3, i5, and i7 CPUs and Xeon E3 (v3) series CPUs. X86 platform extensions over the past several years have not been limited to SIMD enhancements. In 2003 AMD introduced its Opteron processor, which extended the x86’s core architecture from 32 bits to 64 bits. Intel followed suit in 2004 by adding essentially the same 64-bit extensions to its processors, starting with certain versions of the Pentium 4. All Intel processors based on the Core, Nehalem, Sandy Bridge, and Haswell microarchitectures support the x86-64 execution environment. Intel has also introduced several specialized microarchitectures that have been optimized for specific applications. The first of these is called Bonnell and was the basis for the original Atom processor in 2008. Atom processors built on this microarchitecture included support for SSSE3. In 2013 Intel introduced its Silvermont System on a Chip (SoC) microarchitecture, which is optimized for portable devices such as smartphones and tablet PCs. The Silvermont microarchitecture is also used in processors that are tailored for small servers, storage devices, network communications equipment, and embedded systems. Processors based on the Silvermont microarchitecture include SSE4.2 but lack x86-AVX. In 2013 Intel also introduced an ultra-low power SoC microarchitecture called Quark, which targets Internet-of-Things (IoT) and wearable computing devices. Processors based on the Quark microarchitecture only support the core x86-32 and x87 FPU instruction sets; they do not include x86-64 processing capabilities or any of the SIMD resources provided by MMX, x86-SSE, and x86-AVX. Processors from AMD have also evolved over the past few years. In 2003 AMD introduced a series of processors based on its K8 microarchitecture. Original versions of the K8 included support for MMX, SSE, and SSE2, while later versions added SSE3. In 2007 the K10 microarchitecture was launched and included a SIMD enhancement called SSE4a. SSE4a contains several mask shift and streaming store instructions that are not available on processors from Intel. Following the K10, AMD introduced a new microarchitecture called Bulldozer in 2011. The Bulldozer microarchitecture includes SSSE3, SSE4.1, SSE4.2, SSE4a, and AVX. It also adds FMA4, which is a four-operand version of fused-multiply-add. Like SSE4a, FMA4 is not available on Intel from processors. A 2012 update to the Bulldozer microarchitecture called Piledriver includes support for both FMA4 and the three-operand version of FMA, which is called FMA3 by some CPU feature-detection utilities and third-party documentation sources.