The GPU acronym stands for Graphics Processing Unit (or graphics processor), a term coined by NVIDIA in 1999 when it released its GeForce 256 “GPU”. AMD has used the term VPU, which stands for Visual Processing Unit. Such a processing unit can be a discrete chip or a sub-unit of a larger chip called a “system on chip”, aka SoC. GPUs were originally designed to accelerate graphics operations, thanks to specialized hardware that operates at a much faster rate than a general-purpose processor, the Central Processing Unit, or CPU.
Why GPUs are so much faster
On the outside, GPUs look just like any other chip, but inside, there is an array of dozens, hundreds or thousands of small computing units (or “GPU Cores”) that work in parallel, and this is essentially why GPUs are so much faster than CPUs at computing images: hundreds of small computing cores get the job done faster than a handful of big cores. Note that the definition of a GPU Core varies greatly from one vendor to the next, so this is not a number that should generally be used to compare GPUs from different brands.
Computer graphics is fundamentally an “embarrassingly parallel” problem, which means the workload can easily be spread across multiple compute units: since each pixel on the screen can (mostly) be computed independently of the others, performance can be scaled by adding more compute units. This is achieved by adding more compute units to a single chip, or by adding more chips to a computer in a multi-GPU setup.
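To make the “embarrassingly parallel” idea concrete, here is a minimal sketch (not real GPU code) where each pixel of a tiny hypothetical framebuffer is shaded with no dependency on any other pixel, so the work can be handed to any number of workers:

```python
from multiprocessing import Pool

WIDTH, HEIGHT = 8, 4  # a tiny hypothetical framebuffer

def shade_pixel(coords):
    """Compute one pixel's value independently of all others.
    This toy 'shader' just encodes the pixel's position as a gray level."""
    x, y = coords
    return (x + y * WIDTH) % 256

if __name__ == "__main__":
    pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
    # Because no pixel depends on another, the list can be split across
    # any number of workers -- this is what "embarrassingly parallel" means.
    with Pool(4) as pool:
        framebuffer = pool.map(shade_pixel, pixels)
    print(framebuffer[:WIDTH])
```

A real GPU does the same thing in hardware, with thousands of cores instead of four worker processes.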
A consumer-level GPU like the 3DFX Voodoo had one compute unit and one texture unit, while today’s fastest GPUs feature thousands of units (the Titan Z card has 5760 GPU cores), and achieve performance levels that are thousands of times higher.
Consumer-level Multi-GPU became popular thanks to the 3DFX Voodoo 2, and sources inside 3DFX once said that 30% of their customers were buying a second card to double the graphics performance. 3DFX’s multi-GPU brand was called SLI (Scan Line Interleave) which described how the rendering was split across two cards.
The same SLI name was later used by NVIDIA, which owns the 3DFX intellectual property; this time, SLI stood for Scalable Link Interface. Today, it is possible to add multiple cards, each with one or more GPUs, to a PC. It is important to understand that there are inefficiencies in managing a multi-GPU setup, which is why performance is never linear, or directly proportional, to the number of GPUs.
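The non-linear scaling can be illustrated with an Amdahl's-law-style model. The 0.95 parallel fraction below is an assumption for illustration, not a measured number; real multi-GPU overhead depends on the game, the driver and the interconnect:

```python
def multi_gpu_speedup(n_gpus, parallel_fraction=0.95):
    """Amdahl's-law estimate of frame-rate speedup for n GPUs, assuming a
    fixed fraction of frame time (sync, driver work) stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_gpus)

for n in (1, 2, 4):
    print(n, "GPUs ->", round(multi_gpu_speedup(n), 2), "x")
```

Even with 95% of the work parallelized, two GPUs yield roughly 1.9x rather than 2x, and the gap widens as more GPUs are added.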
Besides computing pixel colors, GPUs also move geometry around and can define the curvature of surfaces to be rendered, thanks to tessellation capabilities. And since they can be used to calculate physics as well, the GPU can be involved in nearly every aspect of computer graphics. Since graphics are fundamentally a parallel problem, there is not yet a practical limit to how many cores can be utilized at once. As a fantasy, if we could have one or more compute units per pixel on the screen, things would be crazy fast.
Modern GPUs accelerate both 2D and 3D operations. 2D operations usually mean what users see in their windowed operating-system environment, while 3D operations mean games or CAD applications that draw images using polygons and textures.
The state of the art in GPUs uses a combination of triangles, GPU compute, and even ray tracing to calculate the final synthetic image. Modern GPUs can be programmed almost like any other processor (check gpgpu.org, which is a great resource on this topic). They can branch and address many types of memory, virtualized or not. They can also be designed to share the same memory as the CPU, in an architecture called UMA, or Unified Memory Architecture.
This avoids having to copy data back and forth, saving precious time and making the data much more straightforward to manipulate. Despite being programmable, GPUs are not CPUs, and the two should in fact coexist in a heterogeneous computing environment.
GPUs are systems designed to “hide latency” because they use a lot of data residing off-chip (geometry, textures…), which is slow to access. CPU programs can be optimized to keep data in local cache memory and are much better at branching.
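A toy model shows why having many threads in flight “hides” memory latency. The cycle counts below are made-up illustrative numbers, not figures for any real GPU; the point is only the ratio:

```python
def effective_utilization(threads_in_flight, compute_cycles, memory_latency):
    """Toy model: while one thread waits on off-chip memory, the core
    switches to another ready thread.  With enough threads in flight,
    there is always work to do and the latency is fully hidden."""
    work_available = threads_in_flight * compute_cycles
    return min(1.0, work_available / (compute_cycles + memory_latency))

# One thread leaves the core stalled almost all the time; a hundred
# threads keep it fully busy despite the same slow memory.
print(effective_utilization(1, 4, 400))
print(effective_utilization(101, 4, 400))
```

This is the architectural bet GPUs make: instead of large caches and branch predictors like a CPU, they carry enough parallel threads that slow memory rarely leaves the cores idle.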
GPU deployment in computer systems
GPUs can be found in a number of forms and shapes. Graphics cards are probably the first thing that comes to mind, but today, most GPUs are embedded into a larger chip as part of a system on chip or SoC. This is true for smartphones and tablets, but also for the Xbox One and PS4 generation of consoles, which use an SoC instead of having separate chips, both for performance (it’s faster to have everything in one chip) and cost-effectiveness, since the one chip will be optimized and shrunk in size many times over the lifespan of the console.
Intel, the largest GPU vendor by volume, also embeds GPUs into its CPUs to ensure that every Intel customer gets a minimal and predictable level of graphics functionality and performance. In the PC industry, this is called “integrated graphics”.
While Intel has been steadily improving its GPU performance, the company mostly uses GPUs as a way to ensure that its CPU customers get a good user experience. Thus far, Intel's only high-performance GPU project, Larrabee, ended up being delayed and, ultimately, shelved.
Before they were programmable, GPUs had fixed functions and could only perform a limited number of predefined tasks. To provide more control to developers, hardware vendors added multiple “stages” through which data could flow and be processed.
Each stage could be set up independently, and by chaining several stages, one could give the appearance of limited programmability. The more stages there were, the more “programmable” things looked, but in the end, this was not a model that could scale to full programmability. In general, fixed-function hardware is faster and more power-efficient than general-purpose programmable hardware.
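The stage model can be sketched like this: each hypothetical stage is not programmable, it only exposes a few configuration knobs, yet chaining differently configured stages produces many different results:

```python
# Each fixed-function "stage" runs one hard-wired operation; the
# developer can only tune its parameters, not replace its logic.

def make_scale_stage(factor):
    """Hard-wired multiply; 'factor' is the only knob."""
    return lambda value: value * factor

def make_clamp_stage(lo, hi):
    """Hard-wired clamp; the range is the only knob."""
    return lambda value: max(lo, min(hi, value))

def run_pipeline(stages, value):
    """Data flows through the configured stages in order."""
    for stage in stages:
        value = stage(value)
    return value

pipeline = [make_scale_stage(2.0), make_clamp_stage(0.0, 1.0)]
print(run_pipeline(pipeline, 0.3))  # 0.6
print(run_pipeline(pipeline, 0.8))  # clamped to 1.0
```

Configuring and reordering such stages gives an illusion of programmability, but only a real instruction set, as in modern shaders, removes the ceiling.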
Even today, a modern GPU still has a lot of fixed functions, such as texture filtering or frame-buffer blending. That is why one cannot build a great GPU just by “throwing more cores” at it. Even though developers have often expressed the desire to fully control such functions, the performance loss of making them programmable has outweighed the benefits of higher flexibility.
Bilinear filtering and texture mapping were among the first fixed functions that CPUs could not even approach in terms of performance. Up until that point, drawing flat-shaded triangles was still within reasonable reach of a CPU.
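Bilinear filtering itself is simple to state: blend the four texels nearest the sample point by their distance. A minimal software version, of the kind a mid-90s CPU had to run per pixel while a GPU did it in dedicated hardware, might look like this:

```python
def bilinear_sample(texture, u, v):
    """Sample a 2D list-of-lists texture at fractional coordinates (u, v)
    by blending the four nearest texels -- the classic bilinear filter."""
    h, w = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0          # fractional position between texels
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bot = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear_sample(tex, 0.5, 0.5))  # 0.5, the average of all four texels
```

Cheap as this looks, doing it per pixel, per frame, at interactive resolutions was exactly the workload that justified fixed-function texture units.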
Video Decoding and Encoding
Among the many graphics-related functions that GPUs can perform, video encoding and decoding are on the list. Interestingly, most consumer decoding, like playing a movie, is done through a dedicated unit that is separate from the main graphics compute units.
As we said earlier, the main reason for this is efficiency. While it may be possible to decode common H.264 video using 3D graphics hardware, it is not very power-efficient to do so, especially on a laptop or a phone. A small sub-unit in the chip can perform the same work much more efficiently.
GPU Application Programming Interface
To access the hardware of a GPU, developers generally use an Application Programming Interface, or API. The general idea of an API is to abstract the hardware details so that developers don’t have to start from scratch each time a new GPU architecture comes out. There have been many APIs over the years, including proprietary ones like 3DFX Glide, but thus far OpenGL (and OpenGL ES on mobile) has remained popular on mobile, Mac and Linux, while DirectX dominates on Windows.
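The abstraction idea can be sketched in a few lines. The class and method names below are hypothetical, not taken from any real API; the point is that the application targets one interface while each vendor supplies its own backend:

```python
from abc import ABC, abstractmethod

class GraphicsAPI(ABC):
    """The stable interface applications code against."""
    @abstractmethod
    def draw_triangles(self, count): ...

class VendorABackend(GraphicsAPI):
    def draw_triangles(self, count):
        # In reality, this would issue commands to vendor A's driver.
        return f"vendor-A drew {count} triangles"

class VendorBBackend(GraphicsAPI):
    def draw_triangles(self, count):
        return f"vendor-B drew {count} triangles"

def render_frame(api: GraphicsAPI):
    # Application code never changes, whatever hardware sits underneath.
    return api.draw_triangles(1000)

print(render_frame(VendorABackend()))
print(render_frame(VendorBBackend()))
```

Swap the backend and the application keeps working unchanged, which is exactly what an API like OpenGL or DirectX promises across GPU generations and vendors.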
Interestingly, the rise of huge smartphone hardware platforms has re-ignited the appetite for proprietary APIs. Apple released Metal on June 2, 2014, an API that promises vast CPU-side optimizations, not unlike AMD’s Mantle on PC. Unfortunately for Mantle, DirectX 12 offers pretty much all the same benefits but works across several hardware vendors, so the viability of Mantle on PC remains an open question.
On game consoles, developers used to have “to the metal” access, but in practice, they use a very thin abstraction of the hardware to protect themselves from small implementation-detail changes, without affecting performance much. If the hardware platform remains stable for a decade, it’s worth spending the time to build on this kind of foundation. The same is probably not true of fast-moving platforms like PCs.
The hardware graphics industry started small but eventually got really crowded, with more than 50 companies competing at one point, including Matrox, S3 Graphics and Rendition. Eventually, there was a lot of consolidation, and the main market movers today are NVIDIA, AMD, Qualcomm, ARM, Imagination Technologies and Intel.