
Background: How GPUs Work — Imagination’s PowerVR Rogue Architecture Explored

by Ryan Smith on February 24, 2014 3:00 AM EST


Seeing as how this is our first in-depth architecture article on a SoC GPU design (specifically as opposed to PC-derived designs like Intel and NVIDIA), we felt it best to start at the beginning. For our regular GPU readers the following should be redundant, but if you’ve ever wanted to learn a bit more about how a GPU works, this is your place to start.

GPUs, like most complex processors, are composed of a number of different functional units responsible for the different aspects of computation and rendering. We have functional units that set up geometry data, frequently called geometry engines, geometry processors, or polymorph engines. We have memory subsystems that provide caching and access to external memory. We have rendering backends (ROPs or pixel co-processors) that take computed geometry and pixels, blend them, and finalize them. We have texture mapping units (TMUs) that fetch textures and texels to place them within a scene. And of course we have shaders, the compute cores that do much of the heavy lifting in today's games.

Perhaps the most basic question even from a simple summary of the functional units in a GPU is why there are so many different functional units in the first place. While conceptually virtually all of these steps (except memory) can be done in software – and hence done in something like a shader – GPU designers don’t do that for performance and power reasons. So-called fixed function hardware (such as ROPs) exists because it’s far more efficient to do certain tasks with hardware that is tightly optimized for the job, rather than doing it with flexible hardware such as shaders. For a given task flexible hardware is bigger and consumes more power than fixed function hardware, hence the need to do as much work in power/space efficient fixed function hardware as is possible. As such the portions of the rendering process that need flexibility will take place in shaders, while other aspects that are by their nature consistent and fixed take place in fixed function units.
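
To make the "fixed vs. flexible" point concrete, here is a minimal sketch of the kind of math a ROP performs when blending: a standard "source over" alpha blend. It is written as a CUDA device function purely for illustration (the function name and types are ours, not anything from Imagination's or NVIDIA's hardware); the point is that because this operation never changes, a small fixed-function blender can do it in far less area and power than running the same code through a shader.

```cuda
// Illustrative sketch only: the kind of fixed, well-defined math a ROP performs.
// A "source over" alpha blend of a source pixel onto a destination pixel.
__device__ float4 alpha_blend(float4 src, float4 dst)
{
    float a = src.w;  // source alpha
    return make_float4(src.x * a + dst.x * (1.0f - a),
                       src.y * a + dst.y * (1.0f - a),
                       src.z * a + dst.z * (1.0f - a),
                       a       + dst.w * (1.0f - a));
}
```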

The bulk of the information Imagination is sharing with us today concerns shaders, so that's where we'll focus. On a die area and power basis the shader blocks are the biggest contributors to rendering. Though every functional unit is important for its job, it's in the shaders that most of the work takes place, and the proportion of that work that is bottlenecked by shaders increases with every year and every generation, as increasingly complex shader programs are created.

So with that in mind, let’s start with a simple question: just what is a shader?

At its most fundamental level, a shader core is a flexible mathematics pipeline; it is a single computational resource that accepts a stream of instructions (a shader program) and executes it in order to manipulate the pixels and polygon vertices within a scene. An individual shader core goes by many names depending on who makes it: AMD has Stream Processors, NVIDIA has CUDA cores, and Imagination has Pipelines. At the same time, how a shader core is built and configured depends on the architecture and its design goals, so while there are always similarities, it is rare that shader cores are identical.
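
As a rough illustration of what "a shader program" means in practice, the sketch below uses a CUDA kernel as a stand-in for a real shading language like GLSL (the kernel name and the brightness-scaling math are invented for the example): one small program that runs once per pixel and does a bit of arithmetic on that pixel's data.

```cuda
// A minimal stand-in for a pixel/fragment shader: one small program, run once
// per pixel, doing a little math on that pixel's value.
__global__ void darken(const float* in, float* out, int numPixels, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which pixel this thread owns
    if (i < numPixels)
        out[i] = in[i] * factor;                    // the "shader math" for one pixel
}
```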

On a lower technical level, a shader core itself contains several elements. It contains decoders, dispatchers, operand collectors, results collectors, and more. But the single most important element, and the element we’re typically fixated on, is the Arithmetic Logic Unit (ALU). ALUs are the most fundamental building blocks in a GPU, and are the base unit that actually performs the mathematical operations called for as part of a shader program.

An NVIDIA CUDA Core

And an Imagination PVR Rogue Series 6XT Pipeline

The number of ALUs within a shader core in turn depends on the design of the shader core. To use NVIDIA as an example again, they have 2 ALUs – an FP32 floating point ALU and an integer ALU – either of which is in operation as a shader program requires. In other designs, such as Imagination's Rogue Series 6XT, a single shader core can have up to 7 ALUs, in which case multiple ALUs can be used simultaneously. From a practical perspective we typically count shader cores when discussing architectures, but it is at times important to remember that the number of ALUs within a shader core can vary.
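
The hypothetical snippet below suggests how a single shader statement can keep more than one ALU busy: the integer address math and the floating point multiply-add are independent, so a core with separate FP32 and integer ALUs (or several FP ALUs, as in Rogue) is free, in principle, to issue them together. The function and its parameters are invented for illustration; this is not vendor code.

```cuda
// Hypothetical shader-style helper mixing integer and floating point work.
__device__ float shade(const float* tex, int base, int stride, int i, float light)
{
    int   addr  = base + i * stride;  // integer ALU: address arithmetic
    float texel = tex[addr];          // memory/texture fetch
    return texel * light + 0.1f;      // FP32 ALU: multiply-add
}
```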

When it comes to shader cores, GPU designs will implement hundreds to thousands of them. Graphics rendering is what we call an embarrassingly parallel process, as there are potentially millions of pixels in a scene, most of which can be operated on in a semi-independent or fully-independent manner. As a result a GPU will implement a large number of shader cores to work on multiple pixels in parallel. The use of a "wide" design is well suited for graphics rendering as it allows each shader core to be clocked relatively low, saving power while achieving work in bulk. A shader core may only operate at a few hundred megahertz, but because there are so many of them the aggregate throughput of a GPU can be enormous, which is just what we need for graphics rendering (and some classes of compute workloads, as it turns out).
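
A quick back-of-the-envelope calculation shows why "wide and slow" adds up. The numbers below are made up but plausible for a small GPU (192 shader cores at 600 MHz, one fused multiply-add per core per clock); they are not the specs of any particular part.

```cuda
// Hypothetical peak-throughput arithmetic: wide and slow still adds up fast.
#include <cstdio>

int main()
{
    const double cores         = 192;     // assumed shader core count
    const double clock_hz      = 600e6;   // assumed 600 MHz shader clock
    const double flops_per_clk = 2;       // one fused multiply-add = 2 FLOPs
    const double gflops = cores * clock_hz * flops_per_clk / 1e9;
    printf("Peak throughput: ~%.0f GFLOPS\n", gflops);  // ~230 GFLOPS
    return 0;
}
```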

A collection of Kepler CUDA cores, 192 in all

The final piece of the puzzle then is how these shader cores are organized. Like all processors, the shader cores in a GPU are fed by a “thread” of instructions, one instruction following another until all the necessary operations are complete for that program. In terms of shader organization there is a tradeoff between just how independent a shader core is, and how much space/power it takes up. In a perfectly ideal scenario, each and every shader core would be fully independent, potentially working on something entirely different than any of its neighbors. But in the real world we do not do that because it is space and power inefficient, and as it turns out it’s unnecessary.

Neighboring pixels may be independent – that is, their outcome doesn't depend on the outcome of their neighbors – but in rendering a scene, most of the time we're going to be applying the same operations to large groups of pixels. So rather than grant the shader cores true independence, they are grouped together for the purpose of having all of them execute threads out of the same collection of threads. This setup is power and space efficient because the grouped shader cores consume less power and take up less space than they would if each had the intelligence to operate completely independently of the others.


The flow of threads within a wavefront/warp

Not unlike the construction of a shader core, how shader cores are grouped together will depend on the design. The most common groupings are either 16 or 32 shader cores. Smaller groupings are more performance efficient (you have fewer shader cores sitting idle if you can’t fill all of them with identical threads), while larger groupings are more space/power efficient since you can group more shader cores together under the control of a single instruction scheduler.
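
The sketch below (an invented CUDA kernel, used only because warps make the behavior easy to describe) shows the cost of this sharing: all lanes in a warp/wavefront follow a single instruction stream, so when a branch splits the group, the two paths run one after the other with the non-matching lanes masked off. The wider the grouping, the more lanes can end up sitting idle.

```cuda
// Divergence inside a warp/wavefront: both sides of the branch execute in turn,
// with the lanes that didn't take that side temporarily idle ("masked off").
__global__ void branchy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.5f)        // lanes taking this path run first...
        out[i] = in[i] * 2.0f;
    else                     // ...then the remaining lanes, while the first set waits
        out[i] = 0.0f;
}
```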

Finally, these groupings of threads can go by several different names. NVIDIA uses the term warp, AMD uses the term wavefront, and the official OpenGL terminology is the workgroup. Workgroup is technically the most accurate term, however it’s also the most ambiguous; lots of things in the world are called workgroups. Imagination doesn’t have an official name for their workgroups, so our preference is to stick with the term wavefront, since its more limited use makes it easier to pick up on the context of the discussion.

Summing things up then, we have ALUs, the most basic building block in a GPU shader design. From those ALUs we build up a shader core, and then we group those shader cores into an array of (typically) 16 or 32 shader cores. Finally, those arrays are fed threads of instructions, one thread per shader core, which like the shader cores are grouped together. We call these thread groups wavefronts.

And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unified Shading Cluster.


Exploring the GPU Architecture | VMware

Overview

A Graphics Processing Unit (GPU) is mostly known as the hardware used to run graphics-heavy applications, e.g. 3D modeling software or VDI infrastructures. In the consumer market, a GPU is mostly used to accelerate gaming graphics. Today, GPGPUs (General Purpose GPUs) are the hardware of choice for accelerating computational workloads in modern High Performance Computing (HPC) landscapes.

HPC in itself is the platform serving workloads like Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI). Using a GPGPU is no longer only about ML computations for image recognition. Calculations on tabular data are also a common exercise in, for example, the healthcare, insurance, and financial industry verticals. But why do we need a GPU for all these types of workloads? This blog post goes into the GPU architecture and why it is a good fit for HPC workloads running on vSphere ESXi.

Latency vs Throughput

Let's first take a look at the main differences between a Central Processing Unit (CPU) and a GPU. A typical CPU is optimized to finish a task as quickly as possible, at as low a latency as possible, while keeping the ability to quickly switch between operations. Its nature is all about processing tasks in a serialized way. A GPU is all about throughput optimization, pushing as many tasks as possible through its internals at once. It does so by processing tasks in parallel. The following exemplary diagram shows the 'core' count of a CPU and a GPU. It emphasizes that the main contrast between the two is that a GPU has many more cores with which to process a task.
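
A small, hypothetical CUDA example of the same contrast: the CPU version walks the array one element at a time on a latency-optimized core, while the GPU version launches one lightweight thread per element and relies on sheer width for throughput. Function names and sizes are illustrative only.

```cuda
// The same element-wise operation, two ways.
__global__ void scale_gpu(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) data[i] *= k;
}

void scale_cpu(float* data, int n, float k)
{
    for (int i = 0; i < n; ++i)  // one core, one element after another
        data[i] *= k;
}

// Illustrative launch: scale_gpu<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
```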

Differences and Similarities

However, it is not only about the number of cores. When we speak of cores in an NVIDIA GPU, we refer to CUDA cores, which consist of ALUs (Arithmetic Logic Units). Terminology may vary between vendors.

Looking at the overall architecture of a CPU and a GPU, we can see a lot of similarities between the two. Both use the memory constructs of cache layers, a memory controller, and global memory. A high-level overview of modern CPU architectures shows that it is all about low-latency memory access through significant cache memory layers. Let's first take a look at a diagram of a generic, memory-focused, modern CPU package (note: the precise layout strongly depends on vendor/model).

A single CPU package consists of cores that each contain separate layer-1 data and instruction caches, backed by a layer-2 cache. The layer-3 cache, or last-level cache, is shared across multiple cores. If data is not resident in the cache layers, the CPU fetches it from global DDR4 memory. The number of cores per CPU can go up to 28 or 32, running at up to 2.5 GHz, or 3.8 GHz in Turbo mode, depending on make and model. Cache sizes range up to 2 MB of L2 cache per core.

Exploring the GPU Architecture

If we inspect the high-level architecture overview of a GPU (again, strongly dependent on make/model), it becomes clear that the nature of a GPU is all about putting the available cores to work, and that it is less focused on low-latency cache memory access.

A single GPU device consists of multiple Processor Clusters (PC) that contain multiple Streaming Multiprocessors (SM). Each SM accommodates a layer-1 instruction cache layer with its associated cores. Typically, one SM uses a dedicated layer-1 cache and a shared layer-2 cache before pulling data from global GDDR-5 (or GDDR-6 in newer GPU models) memory. Its architecture is tolerant of memory latency.

Compared to a CPU, a GPU works with fewer, and relatively small, memory cache layers. The reason is that a GPU has more transistors dedicated to computation, so it cares less how long it takes to retrieve data from memory. The potential memory access 'latency' is masked as long as the GPU has enough computations at hand to keep it busy.

A GPU is optimized for data parallel throughput computations.
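
The sketch below (a standard SAXPY-style kernel used as a generic illustration, not anything vendor-specific) shows what that masking looks like in practice: by launching far more threads than there are physical cores, the SM schedulers always have other warps ready to compute while some warps are stalled on global-memory loads.

```cuda
// Latency hiding by oversubscription: launch many more threads than cores.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // the loads of x[i] and y[i] are what get hidden
}

// Illustrative host-side launch: tens of thousands of threads for a few thousand
// physical cores is normal, and is exactly what keeps the ALUs busy.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```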

Looking at the number of cores quickly shows the degree of parallelism a GPU is capable of. Examining NVIDIA's 2019 flagship offering, the Tesla V100, one device contains 80 SMs, each containing 64 cores, for a total of 5120 cores! Tasks aren't scheduled to individual cores, but to processor clusters and SMs; that is how the device is able to process in parallel. Combine this powerful hardware with a programming framework and applications can fully utilize the computing power of a GPU.
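
A small host-side CUDA sketch to go with the arithmetic above: the runtime reports the SM count directly, while cores-per-SM is an architecture-specific figure you have to supply yourself (64 FP32 cores per SM is the commonly cited value for Volta-class parts like the V100, hence 80 x 64 = 5120). Treat this as illustrative; it assumes a CUDA-capable device 0 is present.

```cuda
// Query the SM count and estimate the FP32 core count (assumes a Volta-class SM).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found.\n");
        return 1;
    }
    const int coresPerSM = 64;  // assumption: 64 FP32 cores per SM (Volta)
    printf("%s: %d SMs -> ~%d FP32 cores\n",
           prop.name, prop.multiProcessorCount,
           prop.multiProcessorCount * coresPerSM);
    return 0;
}
```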

To Conclude

High Performance Computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly.

This is exactly why GPUs are a perfect fit for HPC workloads: they enable massive increases in throughput. An HPC platform using GPUs becomes much more versatile, flexible, and efficient when run on top of the VMware vSphere ESXi hypervisor, which allows GPU-based workloads to allocate GPU resources in a very flexible and dynamic way.


Titan Twin Turbo VGA Cooler — More Fans, More Heatpipes / Chassis, PSU & Cooling

Powerful, efficient coolers are usually fitted only to top-class video cards. Mid-range models most often make do with compact single-slot cooling or a stack of passive heatsinks with heat pipes. The advantage of passive cooling is that the video card runs silently, but even with the most advanced and sophisticated passive heatsinks the temperatures often look intimidating, even though the graphics card remains stable. We saw exactly that in our review of Gigabyte's NVIDIA GeForce 8600 series video cards.

Small but active coolers handle heat dissipation more effectively, although a good performance-to-noise ratio is still rare. That is why fans of silence and/or overclocking keep looking for alternative coolers that deliver higher performance without irritating noise.

Another newcomer in this field gets a thorough review and test today: the Titan Twin Turbo cooler for video cards. In some ways it resembles the Evercool Turbo 2 cooler we looked at earlier, but it is equipped with two fans rather than one. So let's take a closer look at it and compare it in practice with other popular video card coolers.


The sealed plastic packaging reveals only an openwork grille and two fans, behind which a heatsink with heat pipes can be seen. The sides and back of the package list the main advantages, features, and compatible video cards. We have compiled this information into a single specification table:

Specification

Model: Titan Twin Turbo
Compatibility with video cards (distance between cooler mounting holes on the board, mm): 54.8 / 79.7 / 75.4 / 61.1
Overall dimensions, mm: 158 x 115 x 36
Fan dimensions, mm: 70 x 70 x 10
Number of fans, pcs: 2
Heatsink materials: base and heat pipes are copper, fins are aluminum
Fan speed, rpm: 2000 ± 10%
Airflow: 52.6 m³/h (31 CFM)
Static pressure, mm H2O: 1.016
Noise level, dB: 26.2
Fan supply voltage, V: 12
Fan operating current, A: 0.18
Power consumption, W: 2.16
Cooler weight, g: n/a
Estimated cost, $: not on sale at the time of writing

The compatibility list includes almost every video card from the past several generations, so rather than clutter the table with it we simply give the typical distances between mounting holes. This is also more practical because the new generation of GeForce 8x00 and Radeon HD2x00 graphics adapters is not on the list, even though they use the same standard mounting distances.

For some reason the cooler's weight is not listed on the box or on the official website, but subjectively it is very light and hardly weighs more than 300 grams.


The bundle is minimal but includes everything you need: the cooler itself, a set of mounting hardware, memory heatsinks, and a syringe of thermal paste. The fan speed is fixed at 2000 rpm, so no speed controller is needed.


Four heat pipes look very serious, and together with two fans they promise good performance. The fans cover almost the entire surface of the heatsink, ensuring good airflow over the whole heat dissipation area. The only slight concern is the relatively small fin area of the heatsink itself, but there are good reasons for that.


The heatsink is not thick, only slightly more than the diameter of the heat pipes, but an important requirement for coolers of this kind is to fit into two slots together with the video card, so a heatsink with a larger fin area could not be accommodated.


The base consists of two halves: a copper one that contacts the GPU, and an aluminum one that carries the mounting cross. Four heat pipes are clamped in machined grooves. No traces of solder are visible, so some kind of thermal interface material is probably used.

A plastic film protects the surface of the base from scratches and fingerprints, and an inscription in two languages reminds you that the film must be removed before installation.


With the film removed, the base turns out not to be mirror-polished, but it is quite even and only slightly rough. The bundled thermal paste spreads onto the GPU very easily, and the supplied amount is enough for many installations.

The mounting scheme has been around for a long time, dating back to the Zalman VF900 video card cooler: a set of threaded holes matches the typical mounting-hole spacings of different video cards. To install the cooler you screw threaded standoffs into the appropriate holes and then secure them with nuts on the back of the board.


We saw an identical mounting cross with the same set of holes on the Evercool Turbo 2 cooler; both companies probably buy it from the same OEM.

In addition to the GPU cooler, the Titan Twin Turbo comes with eight heatsinks for memory chips.


As you can see, two of the eight heatsinks are shorter and are meant for the memory chips that the cooler's heat pipes pass over. Unfortunately, Titan's engineers miscalculated slightly here: when we installed the cooler on our Leadtek GeForce 8600GTS test card, the heat pipes sat slightly lower than the tops of even the shortened heatsinks, so those heatsinks had to be left off.

The rest of the installation is straightforward: apply thermal paste to the GPU and fix the heatsink in place. Note that the mounting screws have no stops and can be tightened all the way, so we recommend tightening them alternately, two or three turns at a time, without overtightening; otherwise there is a real risk of chipping the GPU die.


The assembled card turns out to be quite bulky. It fully occupies two slots, and for the cooler to work properly a third slot has to be kept free as well, otherwise the fans have nowhere to draw air from.

Testing

As is already obvious, we will test the new cooler's effectiveness on the best current mid-range video card, the GeForce 8600GTS. It can hardly be called particularly hot, but neither of the graphics giants has yet released cut-down derivatives of their top solutions, so it will have to do.

The test configuration is detailed in the following table:

Test bench configuration

Processor: LGA775 Intel Core 2 Duo E6750 (Conroe, G0) @ 3800 MHz / 1.5 V
Motherboard: ASUS P5B Deluxe rev. 1.03G (i965)
RAM: 2 x 1024 MB DDR2 PC6400 Kingmax Mars
Video card: 256 MB Leadtek NVIDIA GeForce 8600GTS
Hard disk: 250 GB Seagate SATA II, 16 MB cache (ST3250620AS)
Case: ThermalTake Xaser III (window, 5 x 80 mm case fans)
Power supply: FSP Optima 600W (OPS600-80GLN)

A notable feature of the test bench is that all five case fans run at low speed (powered from +5 V) while still providing good through ventilation. The overclocked Core 2 Duo E6750 generates a lot of heat, but it is effectively handled by a Zalman CNPS9700 LED cooler, which directs the hot air toward the exhaust case fans. All of this creates a healthy environment for the graphics card, since the air inside the case does not overheat.

The testing methodology is well established and familiar to many of our readers. The video card's temperatures are recorded in two modes: under 3D load and at idle. The load was created by running the artifact-scanning test in ATITool v0.27, which puts a maximally uniform load on the GPU and warms it up at least as much as any modern game.

The stock cooler on the Leadtek GeForce 8600GTS is identical to NVIDIA's reference cooler.

Its efficiency is enough to provide adequate cooling, but the small fan produces a distinct noise that lovers of silence may find unacceptable.

To make things more difficult for the coolers, the graphics card's clock speeds were raised slightly, from 675/2000 MHz to 720/2200 MHz.

In addition to the Titan Twin Turbo under review, several previously tested and popular video card coolers took part in the testing: the Sytrin KuFormula VF1 Plus, Zalman Fatal1ty FC-ZV9, IceHammer IH-500V, and Evercool Turbo 2. All of them are good coolers, but let's try to find out which one is best.

Chart legend: Ambient is the temperature of the board surface; GPU is the temperature of the graphics processor.

In terms of performance, Titan's new product showed excellent results, losing only to such a seasoned opponent as the Sytrin KuFormula VF1 Plus, which once again confirmed its leadership.

Although, if we take a broader view of the leadership question, we should remember that the Sytrin KuFormula VF1 Plus, while the most efficient air cooler for video cards ever to pass through our lab, is also the hardest to obtain, since it is not sold in Russia. Titan's products, on the other hand, are widely available in our country, which means that once the Titan Twin Turbo reaches the market it is quite capable of taking the lead in air cooling for video cards.

The second stage of testing, at idle, did not change the picture and only confirmed the earlier results:

Conclusions

The test results speak for themselves: two fans and four heat pipes did their job. As already noted, leaving aside the hard-to-find Sytrin KuFormula VF1 Plus, the new Titan Twin Turbo confidently outperformed its competitors and took first place in efficiency, which is a significant achievement in itself. Just as important, the noise level was very low. The cooler is not silent, but it stands out only slightly against the background of a quiet system unit.

The universal mount allows the cooler to be installed on any modern entry-level or mid-range video card, though the mounting method itself demands care during installation, since there is a danger of chipping the GPU die. The memory heatsink mounting is not entirely sorted out either: despite the reduced height of two of the heatsinks, they still ran into the heat pipes and prevented the cooler from being installed. This small engineering miscalculation should be easy to correct and will most likely be gone in the next revision of the cooler.

Given Titan's moderate pricing policy, we can hope that the Titan Twin Turbo will not cost too much, in which case it will undoubtedly take a worthy place in this market. In the meantime, while it has not yet gone on sale, competitors still have a chance to release a new product of their own and strike back.


The editors would like to thank Sunrise-Rostov-on-Don for providing the Leadtek NVIDIA GeForce 8600GTS 256Mb video card for testing.


