Texture Mappers (1995-1999)
Fixed T&L, Early Shaders (2000-2002)
Shader Model 2.0/3.0 (2003-2007)
Unified Shaders (2008+)
Featureset determines where a card will go, not its year of introduction
Unified Shader? GPGPU?
Around 2004, when SM3.0 was new and cool, framerate profiling showed that some frames, roughly one or two a second, took far more rendering time than all the others. This caused "micro-stutter", an uneven framerate. The shader profile showed that these frames were much more vertex shader intensive than the others, and GPUs had much less vertex shader hardware than pixel shader hardware, because most frames needed about a four to one mix in favour of pixel work.
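To make the profiling idea concrete, here is a minimal sketch of how slow frames can be picked out of a frame-time log. The frame times, the function name `find_slow_frames` and the "twice the median" threshold are all assumptions for illustration, not the method actually used at the time.

```python
from statistics import median

def find_slow_frames(frame_times_ms, factor=2.0):
    """Return indices of frames taking more than `factor` times the median frame time."""
    typical = median(frame_times_ms)
    return [i for i, t in enumerate(frame_times_ms) if t > factor * typical]

# Hypothetical capture: a steady ~60 fps with one heavy frame in the middle.
frame_times_ms = [16.7, 16.6, 16.8, 16.7, 48.0, 16.7, 16.6, 16.9]
slow = find_slow_frames(frame_times_ms)
print("slow frame indices:", slow)
```

The median is used rather than the mean so that the rare heavy frames do not drag the "typical" figure upwards.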
As shader units were getting more and more complex, it made sense for each shader to be able to handle both vertex and pixel programs. ATI introduced this with the Radeon HD2xxx series, though due to other "features" on the GPU, they were generally slower than the previous generation X1xxx series. Nvidia jumped on board with the Geforce 8 series, with 500 MADD GFLOPS on the top end Geforce 8800 parts. The teraflop GPU was in sight.
The first single GPU with enough shaders running fast enough to manage one trillion operations per second was the AMD Radeon HD4850. The previous generation 3870 X2 had already surpassed this, but it was two GPUs on one card. Nvidia struggled during this generation: while the Geforce GTX285 did manage to hit one TFLOPS, it was expensive, noisy, power hungry and rare.
The most complex operation an FPU historically did was the "multiply-add" (MADD) or "fused-multiply-add" (FMA). It's doing a multiply and an add at the same time. The operation is effectively an accumulate on value a, where a = a + (b x c). The distinction is that the intermediate value b x c in a MADD is rounded to final precision before it is added to a, while in an FMA the intermediate is kept at a higher precision and not rounded before the add; only the final value is rounded. The real-world distinction is minor and usually ignorable, except that an FMA is performed in a single step, so it is typically twice as fast. AMD introduced FMA support in TeraScale 2 (Evergreen, 2009) and Nvidia in Fermi (2010). A common driver optimisation is to promote MADD operations to FMA.
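The rounding difference can be shown with a small sketch. It uses 64-bit Python floats rather than the 32-bit floats a GPU would normally use, and it emulates the FMA with exact rational arithmetic instead of calling real FMA hardware; the function names and the input values are chosen purely for illustration.

```python
from fractions import Fraction

def madd(a, b, c):
    """a + (b * c), with the product rounded to a float before the add."""
    product = b * c      # first rounding happens here
    return a + product   # second rounding happens here

def fma_emulated(a, b, c):
    """a + (b * c) computed exactly, then rounded once at the end."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)  # the only rounding

# Inputs picked so the exact product (1 - 2**-54) cannot be stored in a float.
a = -1.0
b = 1.0 + 2.0**-27
c = 1.0 - 2.0**-27

print("MADD:", madd(a, b, c))
print("FMA :", fma_emulated(a, b, c))
```

With these inputs the MADD loses the small residue entirely and returns zero, while the emulated FMA keeps it. In most rendering work a difference this small is invisible, which is why drivers can usually swap one for the other.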
For consistency's sake, where we list "MADD GFLOPS" or "FMA GFLOPS", we may mean either, whichever one is fastest on the given hardware. An FMA, despite being treated as one operation, is actually two and therefore if a GPU can do one billion FMAs a second, its GFLOPS (giga-floating point operations per second, 1 GFLOPS is one billion ops per second) is 2.
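As a rough sketch of how the "MADD GFLOPS" figures on this page come about: shader ALUs, times two operations per MADD or FMA, times the clock. The function name is made up for illustration, and the inputs are the ALU counts and clocks given in the listings further down this page.

```python
def peak_gflops(alus, clock_mhz, ops_per_alu_per_clock=2):
    """Theoretical peak: each ALU retires one MADD/FMA (two operations) per clock."""
    return alus * ops_per_alu_per_clock * clock_mhz / 1000

print("Radeon HD 3450:", peak_gflops(40, 600))
print("Radeon HD 5750:", peak_gflops(720, 700))
print("Radeon HD 5770:", peak_gflops(800, 850))
```

These are peak figures only; real shader code rarely keeps every ALU busy on every clock.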
MSI Radeon HD3450 (V118 R3450-TD256H Hewlett Packard OEM variant) - 2007
All I had for this was an Underwriters Laboratories certification number, E96016 which identified it as nothing beyond the basic PCB manufactured by Topsearch of Hong Kong and the model number, V118. That's not a lot to be going on.
It came out of a HP Pavilion desktop, pretty typical consumer junk and certainly nothing intended for gaming, so it's not going to be anything spectacular (the size of the PCB says that too). All hardware tells a story; it's just a matter of listening. The silkscreening on the PCB tells us that it's got guidelines for quite a few different coolers, so this PCB is meant for a family of video cards, perhaps all based around the same GPU or similar GPUs. The VGA connector can be omitted easily and isn't part of the PCB - it's aimed at OEMs who know what they want and need no flexibility.
It's a Radeon HD3450 with 256MB manufactured by MSI using a Topsearch manufactured PCB. It's amazing what the right search terms in Google can let you infer, isn't it?
MSI's standard retail part comes with a large passive heatsink and S-video out, but OEMs can get whatever variant of a design they like if it'll seal a deal for thousands of units. These things, the V118 model, were retailing at about $20.
Under the heatsink is a 600MHz RV620 (a near-direct die shrink to 55nm of the HD2400's 65nm RV610) GPU feeding 1000MHz DDR2 memory, but the good ends there. The memory is 64 bit, giving a meagre 8GB/s bandwidth, and the core has only four ROPs - it's a single quad. Unified shader 4.1 is present, 40 units, but they're pretty slow.
Core: RV620 with 4 ROPs, 1 TMU per ROP, 600MHz (2.4 billion texels per second, 2.4 billion pixels per second)
RAM: 64 bit DDR2, 1000MHz, 8000MB/s
Shader: 1x Unified shader 4.1
MADD GFLOPS: 48
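The pixel and texel rates in these spec blocks are simply unit counts multiplied by the core clock. A minimal sketch, with a made-up function name, checked against three cards listed on this page:

```python
def fill_rates(rops, tmus, clock_mhz):
    """Return (billion pixels/s, billion texels/s) at one pixel per ROP and one texel per TMU per clock."""
    pixels = rops * clock_mhz / 1000
    texels = tmus * clock_mhz / 1000
    return pixels, texels

# (ROPs, total TMUs, core MHz) as given in the spec blocks on this page.
cards = {
    "Radeon HD 3450": (4, 4, 600),
    "Radeon HD 4870": (16, 40, 750),
    "Radeon HD 5770": (16, 40, 850),
}
for name, (rops, tmus, mhz) in cards.items():
    pixels, texels = fill_rates(rops, tmus, mhz)
    print(f"{name}: {pixels} Gpixels/s, {texels} Gtexels/s")
```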
ATI Radeon HD 3450 256 MB (AMD B629 Dell OEM variant as 'W337G') - 2007
ATI's OEM parts were never easy to identify. This one carried its FCC identification as ADVANCED MICRO DEVICES MODEL: B629, dating it to after AMD's acquisition of ATI. The PCB carries a date code showing it was made in week 19, 2010.
The rear connector carries a DMS-59 output and the standard-for-the-time S-Video output. Looking at the size and layout of the PCB shows us that it is clearly a derivative of the one above with many components sharing identical placement. Notably, this one has a half-height bracket for the Dell Optiplex SFF 360/380/755/760/780.
The performance of the Radeon HD 3450's RV620 GPU was, to a word, lacking. It carried DDR2 memory as standard, running it at 500 MHz for just 8 GB/s memory bandwidth. The memory was provided by SK Hynix as four BGA packages, two on each side. At 600 MHz the GPU itself was never going to set any records, especially with just one R600-architecture shader package (40 individual shaders).
Getting more performance out of this was an exercise in how highly the RAM would clock. Even with the down-clocked GPU, 8 GB/s plain was not enough. The GPU itself would normally hit 700-800 MHz (and in the HD 3470, the same GPU was running at 800 MHz on the same PCB, with the same cooler!) but this didn't help when the RAM was slow 64 bit DDR2.
When it arrived here, in 2019, it had been sat in a corporate stock room for years and was completely unused, not a single grain of dust on the fan.
On power on, as was common for GPUs of the day, the fan spins up to maximum before winding back. This is really quite noisy for such a small fan!
We'll talk about the GPU architecture on this one. R600 used a ring bus to connect the ROPs to the shader cores, a rather unusual architecture, and took a VLIW-5 instruction set later back-named TeraScale 1. This means each shader block of 40 was eight individual cores, which could handle a single VLIW-5 instruction to its five execution units. TeraScale 2 would increase this to sixteen cores (and 80 shaders per block). R600's ring bus also, as noted, decoupled the ROPs from the execution units, but the texture units were "off to one side" of the ring bus, so each group of execution units did not have its own samplers and combining samples with pixels to do MSAA meant first colour data exiting the shaders into the ring bus, then into the texture units, then back to the ring bus to head to the ROPs to be finally MSAA sampled. This was inefficient and, in the original R600, plain didn't work. Even in the revised silicon, which RV620 here is based on, it was best to avoid antialiasing. The RV620 also improved the UVD video block from "1.0" to "2.0", adding better video decoding to the GPU.
Physically, the GPU was 67 mm^2 and contained 181 million transistors. At 67 mm^2, it barely cost anything to make. It would be AMD's smallest GPU until Cedar/RV810, three years later in 2010, at 59 mm^2. A drop in the price of bulk silicon from TSMC after 28 nm and the general increase in very low end GPU cost meant that no subsequent GPU has been this small, though Nvidia's 77 mm^2 GP108 came close. Powerful IGPs from AMD and Intel embedded in the CPUs have put paid to the very small entry level GPU segment.
Core: RV620 with 4 ROPs, 1 TMU per ROP, 600MHz (2.4 billion texels per second, 2.4 billion pixels per second)
RAM: 64 bit DDR2, 1000MHz, 8000MB/s
Shader: 1x Unified shader 4.1
MADD GFLOPS: 48
PowerColor Radeon HD 3850 AGP 512MB - 2007
If one were to chart all the GPUs made since the early DX7 era, one would notice a near-perfect rule applying to the industry. One architecture is used for two product lines. For example, the NV40 architecture was introduced with Geforce 6 and then refined and extended for Geforce 7. There's an entire list covering the whole history of the industry. Observe:
|This||Was released as||Then changed a bit into||Which was released as|
|NV20||Geforce 3||NV25||Geforce 4|
|R300||Radeon 9700||R420||Radeon X800|
|NV40||Geforce 6||G70||Geforce 7|
|G80||Geforce 8||G90||Geforce 9|
|R600||Radeon HD 2xxx||RV670||Radeon HD 3xxx|
What's happening is that a new GPU architecture is made on an existing silicon process, one well characterised and understood. ATI's R600 was on the known 80nm half-node process, but TSMC quickly made 65 nm and 55 nm processes available. The lower end of ATI's Radeon HD 2000 series were all 65 nm, then the 3000 series was 55 nm. The die shrink gives better clocking, lower cost and better performance, even if all else is equal. It's also an opportunity for bugfixing and fitting in those last few features which weren't quite ready first time around.
In ATI's case, the Radeon HD 2 and 3 series are the same GPUs. Here's a chart of what uses what:
|GPU||Used in|
|R600||Radeon HD 2900 series|
|RV610||Radeon HD 2400|
|RV615||Radeon HD 4230, 4250|
|RV620||Radeon HD 3450, 3470|
|RV630||Radeon HD 2600|
|RV635||Radeon HD 3650, 45x0, 4730|
|RV670||Radeon HD 3850, 3870, 3870 X2|
Whew! ATI used the R600 architecture across three product lines, making it almost as venerable as the R300, which was used across three lines and in numerous mobile and IGPs. ATI used R300 so much because it was powerful and a very good baseline. So what of R600?
To a word, it sucked. The top end 2900 XTX was beaten by the previous generation X1900 XT, let alone the X1950s and the top of the line X1950 XTX. It suffered a catastrophic performance penalty from enabling antialiasing and Geforce 7 could beat it pretty much across the board. Yet here was a product meant to be taking on Geforce 8!
While this AGP card was released in 2008 (announced in January, but in the channel by December the previous year), the PCIe version was 2007, so that is listed as the year here. ATI did, however, throw the price really low. You could pick them up for £110 when new and even today (August 2009) they're still in stock for around £75.
There'd not been a really decent mid-range card since the Geforce 6600GT (though the Geforce 7900GT was fairly mid-range, it was overpriced) and, while the ageing Geforce 7 series was still available, it was not at the DirectX 10 level. The ATI Radeon HD 3850 pretty much owned the mid-range of the market. Better still, it was available in this AGP version, which wasn't a cut down or in any way crippled model as most AGP versions were. Indeed, Sapphire's AGP version was overclocked a little and both Sapphire and PowerColor's AGP cards had 512 MB of memory, up from the standard 256 MB. The PowerColor card used this very large heatsink but Sapphire used a slimline single-slot cooler.
In early 2008, these were a damned good buy, especially for the ageing Athlon64 X2 (or Opteron) or older Pentium4 system which was still on AGP.
Core: RV670 with 16 ROPs, 1 TMU per ROP, 670MHz (10.7 billion texels per second, 10.7 billion pixels per second)
RAM: 256 bit GDDR3, 1660MHz, 53,100 MB/s
Shader: 4x Unified shader 4.1
MADD GFLOPS: 429
Supplied by Doomlord
AMD Radeon HD 4870 - 2008
After losing their way somewhat with the Radeon HD 2xxx (R600) debacle, AMD were determined to set matters straight with the 3xxx and 4xxx series. The 3xxx were little more than bugfixed and refined 2xxx parts, comprising RV620, 635 and 670 (though the RV630 was actually the Radeon HD 2600).
ATI had postponed the R400/Loki project when it turned out to be far too ambitious and had simply bolted together two R360s to produce their R420, which powered the Radeon X800XT and most of the rest of that generation, sharing great commonality with the Radeon 9700 in which R300 had debuted. Indeed, a single quad of R300, evolved a little over the years, powers AMD's RS690 chipset onboard graphics as a version of RV370.
By 2006, the R400 project still wasn't ready for release (ATI had been sidetracked in building chipsets, being bought by AMD) and R500, which R400's Loki project was now being labelled as, was still not ready. Instead, R520 was produced. It was quite innovative and the RV570, as the X1950, became one of the fastest things ever to go in an AGP slot. It also provided the XBox 360's graphics.
It took until 2007 for what was the R400 to finally hit the streets as R600. All the hacks, tweaks and changes made to the chip meant that it was barely working at all. Things as basic as antialiasing (which should have been handled by the ROPs) had to be done in shaders because the ROPs were rumoured to be broken. This crippled the R600's shader throughput when antialiasing was in use and led to pathetically low benchmark scores.
R700 corrected everything. It sported unified shaders at 4.1 level (beyond DirectX 10) and eight hundred of them at that. These are raw ALUs, what AMD call 'stream processors': the RV770 contains ten shader pipelines, each with 16 cores, each core being 5 ALUs, so it's more correct to say that RV770 contains 10 shaders and 160 shader elements. It also corrected the ring memory controller's latency issues and fixed the ROPs. After two generations of being uncompetitive, AMD were back in the ring with the RV710 (4450, 4470), RV730 (4650, 4670) and RV770 (4850, 4870).
RV770's ROPs are again arranged in quads, each quad being a 4 pipeline design and having a 64 bit bus to the memory crossbar. Each pipeline has the equivalent of 2.5 texture mappers (it can apply five textures in two passes, but only two in one pass).
Worth comparing is a Radeon HD 2900XT: 742MHz core, 16 ROPs, 105.6GB/s of raw bandwidth, 320 shaders, but about a third of the performance of the 4870 - Even when shaders aren't being extensively utilized.
Core: 16 ROPs, 40 TMUs, 750MHz (30.0 billion texels per second, 12.0 billion pixels per second)
RAM: 256 bit GDDR5, 1800MHz, 115,200MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 10x Unified shader 4.1 (80 units each)
MADD GFLOPS: 1,200
It is apparently quite difficult to coax full performance out of the shaders. Most synthetic benchmarks measure between 200 and 800 GFLOPS.
Thanks to Filoni for providing the part
Nvidia Quadro FX 380 256 MB - 2008
The Quadro FX 380 used Nvidia's G96 GPU, and ran it at 450 MHz. It had 256 MB of GDDR3 RAM, running at 700 MHz. It used the Tesla architecture, had 16 shaders, 8 TMUs and 8 ROPs. It was rated for a very low 34 watt TDP. This meant it compared against the embarrassingly slow GeForce 9400 GT, as the table below shows.
The significantly more powerful 9400 GT was also about a quarter of the price. The G96 GPU had four execution cores, each with 8 CUDA cores, but the Quadro FX 380 had half the entire GPU disabled. It, to a word, stank.
|Name||GPU||GPU Clock||CUDA Cores||RAM Bandwidth|
|GeForce 9400 GT||G96||700 MHz||16||25.6 GB/s|
|Quadro FX 380||G96||450 MHz||16||22.4 GB/s|
Core: 4 ROPs, 8 TMUs, 450MHz (3.6 billion texels per second, 1.8 billion pixels per second)
RAM: 128 bit GDDR3, 700MHz, 22,400MB/s
Shader: 2x Unified shader 4.0 (16 units)
MADD GFLOPS: 28.8
AMD Radeon HD 5450 512 MB - 2009
Here's a beauty, the slowest thing that could even be a Terascale2 GPU and still work. If you just wanted to add a DVI port to an older PC or a build without any onboard video, it was worth the £40 or so you'd pay. Like all AMD's Terascale 2, they could take a DVI-HDMI adapter and actually output HDMI signals, so the audio would work.
It was up against Nvidia's entirely unattractive GeForce 210 and GT220, which it resoundingly annihilated, but nobody was buying either of these for performance.
Some of them were supplied with a low profile bracket: you could remove the VGA port (which is on a cable) and the rear backplane, and replace it with the low profile backplane. This XFX model was supplied with such a bracket.
The 'Cedar' GPU was also featured in the FirePro 2250, FirePro 2460 MV, Radeon HD 6350, Radeon HD 7350, R5 210 and R5 220. It was pretty much awful in all of these, but it was meant to be. Nobody bought one of these things expecting a powerful games machine.
I dropped it in a secondary machine with a 2.5 GHz Athlon X2, and ran an old benchmark (Aquamark 3) on it. It scored roughly the same as a Radeon 9700 from 2003: despite the shader array being three times faster, the textured pixel rate is very similar and the memory bandwidth much less. This particular unit uses RAM far below AMD's specification of 800 MHz, instead using common DDR3-1333 chips: Nanya Elixir N2CB51H80AN-CG parts rated for 667 MHz operation, which it clocked at 533 MHz (1067 MHz DDR). I got about 700 MHz (11.2 GB/s) out of them; 715 MHz was too far, and my first attempts at 800 MHz (before I knew exactly which RAM parts were in use) were an instant display corruption, Windows 10 TDR loop and eventual crash. Being passive, the GPU couldn't go very far. I had it up to 670 MHz without any issues, but I doubt it'd make 700 MHz.
I'm sure XFX would say the RAM was below spec and even underclocked for power reasons, but DDR3 uses practically no power and the RAM chips aren't touching the heatsinks anyway.
AMD's official (if confused) specs gave RAM on this card as "400 MHz DDR2 or 800 MHz DDR3". AMD also specifies the bandwidth for DDR3 as 12.8 GB/s, which is matched by 64 bit DDR3-1600. The roughly 8.5 GB/s of this card is far below that spec. Even the memory's rated DDR3-1333 is still only 10.6 GB/s. Memory hijinks aside, all 5450s ran at 650 MHz, seemingly without exception.
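The bandwidth figures for this card's 64 bit DDR3 can be worked through with a short sketch. The function name is made up for illustration, and the clocks are nominal ones (533, 667, 700 and 800 MHz), so results differ slightly from figures rounded off marketing names like DDR3-1333.

```python
def ddr_bandwidth_mb_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Bus width in bytes, times clock, times two transfers per clock for DDR."""
    return bus_bits // 8 * clock_mhz * transfers_per_clock

scenarios = [
    ("as shipped", 533),
    ("chips' rated speed", 667),
    ("my overclock", 700),
    ("AMD's DDR3 spec", 800),
]
for label, mhz in scenarios:
    print(f"{label}: {mhz} MHz -> {ddr_bandwidth_mb_s(64, mhz)} MB/s")
```

Only the 800 MHz case reaches AMD's quoted 12.8 GB/s.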
What rubs is that Cedar spends 292 million transistors on being around as fast as the 107 million in the R300. It does do more with less RAM, has Shader Model 5.0, and has three to four times the raw shader throughput, but pixel throughput is around the same, and so overall performance is comparable.
Core: 4 ROPs, 4 TMUs, 650MHz (2.6 billion texels per second, 2.6 billion pixels per second)
RAM: 64 bit DDR3, 533MHz, 8,533MB/s
Shader: 1x Unified shader 5.0 (80 units)
MADD GFLOPS: 104
AMD Radeon HD 5750 - 2009
AMD's Radeon HD 5700 series rapidly became the mid-range GPUs to rule them all. Represented by the Juniper GPU, it sported a teraflop of shader power and over 70GB/s of raw bandwidth. The 5750 promised to be the king of overclockability, being the same Juniper GPU as the 5770, with one shader pipeline disabled and clocked 150 MHz lower, but supplied with an identical heatsink on largely identical PCBs. So it'd draw less power per clock and would overclock further, right?
Wrong. AMD artificially limited overclocking on Juniper-PRO (as the 5750 was known) to 850 MHz and even then most cards just wouldn't reach it. 5770s would hit 900 MHz, sometimes even 1 GHz from a stock clock of 850, yet the very same GPUs on the 5750 would barely pass 800 from stock of 700. This one, for example, runs into trouble at 820 MHz.
Why? AMD reduced the core voltage for 5750s. Less clock means less voltage required, meaning lower power use but also lower overclocking headroom. 5750s ran very cool, very reliable, but paid the price in their headroom.
Juniper was so successful that AMD rather cheekily renamed them from 5750 and 5770 to 6750 and 6770. No, really, just a pure rename. A slight firmware upgrade enabled BluRay 3D support and HDMI1.4, but any moron could flash a 5750 with a 6750 BIOS and enjoy the "upgrade". Unfortunately there was no way of unlocking the disabled 4 TMUs and shader pipeline on the 5750 to turn it into a 5770, it seems they were physically "fused" off.
The Stream Processors were arranged very much like the previous generation, 80 stream processors per pipeline (or "compute engine"), ten pipelines (one disabled in the 5750). Each pipeline has 16 cores, and each "core" is 5 ALUs, so our 5750 has 144 VLIW-5 processor elements. With a slightly downgraded, but more efficient and slightly more highly clocked GPU and slightly more memory bandwidth, the 5750 was that touch faster than a 4850. In places it could trade blows with a 4870 (see above). The 5xxx series really was just a fairly minor update to the earlier GPUs.
This card is the Powercolor version, and the PCB is quite flexible. It is able to be configured as R84FH (Radeon HD 5770), R84FM (This 5750) and R83FM (Radeon HD 5670) - Redwood shared the same pinout as Juniper, so was compatible with the same PCBs. It also could be configured with 512 MB or 1 GB video RAM. The 512 MB versions were a touch cheaper, but much less capable.
Core: 16 ROPs, 36 TMUs, 700MHz (25.2 billion texels per second, 11.2 billion pixels per second)
RAM: 128 bit GDDR5, 1150MHz, 73,600MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 9x Unified shader 5.0 (80 units each)
MADD GFLOPS: 1,008
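The "form of QDR" note is what makes the GDDR5 bandwidth figures on this page add up: four transfers per listed memory clock. A small sketch, with a made-up function name, checked against the GDDR5 cards listed on this page:

```python
def gddr5_bandwidth_mb_s(bus_bits, clock_mhz):
    """Bus width in bytes, times the listed clock, times four transfers per clock."""
    return bus_bits // 8 * clock_mhz * 4

# (bus width in bits, listed memory clock in MHz) from the spec blocks on this page.
cards = {
    "Radeon HD 5750": (128, 1150),
    "Radeon HD 5770": (128, 1200),
    "Radeon HD 7970": (384, 1375),
    "GeForce GTX 680": (256, 1552),
}
for name, (bits, mhz) in cards.items():
    print(f"{name}: {gddr5_bandwidth_mb_s(bits, mhz)} MB/s")
```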
AMD Radeon HD 5770 - 2009
The Radeon HD 5770 was about the champion of price effectiveness in 2009 on release, and for much of 2010. A cheaper video card was usually much slower, a faster one was much more expensive. The closely related 5750 was about 5-10% cheaper and 5-10% slower.
AMD had planned a "Juniper-LE", to complete the "PRO" (5750) and "XT" (5770) line up, but the smaller, slower and much cheaper Redwood GPU overlapped, so the probable HD 5730 was either never released or very rare. A "Mobility Radeon HD 5730" was released, which was a Redwood equipped version of the Mobility 5770, which used GDDR3 memory instead of GDDR5. Redwood, in its full incarnation, was exactly half a Juniper. Observe:
It's quite clear what AMD was up to: "PRO" and "LE" had parts disabled, while the "XT" was fully enabled. Furthermore, Redwood was half of Juniper, which was half of Cypress. Cedar was the odd one. It was far below the others and the "PRO" moniker hinted it was at least partly disabled, but no 2-CU Cedar was ever released. From the die size relative to others, it does appear to only have one compute unit.
|Name||GPU||Die Area (mm^2)||Shader ALUs (Pipelines)||TMUs||ROPs||Typical Clock|
|HD 5450||Cedar PRO||59||80 (1)||8||4||650|
|HD 5550||Redwood LE||104||320 (4)||16||8||550|
|HD 5570||Redwood PRO||104||400 (5)||20||8||650|
|HD 5670||Redwood XT||104||400 (5)||20||8||775|
|HD 5750||Juniper PRO||166||720 (9)||36||16||700|
|HD 5770||Juniper XT||166||800 (10)||40||16||850|
|HD 5830||Cypress LE||334||1120 (14)||56||16||800|
|HD 5850||Cypress PRO||334||1440 (18)||72||32||725|
|HD 5870||Cypress XT||334||1600 (20)||80||32||850|
AMD's VLIW-5 architecture clustered its stream processors in groups of five (this allows it to do a dot-product 3 in one cycle). There are 16 such groups in a "SIMD Engine" or shader pipeline. Juniper has ten such engines. Each engine has four texture mappers attached.
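That organisation can be turned into a quick calculation. This is only a sketch: the function and constant names are made up, and it is checked against the Juniper and Cypress rows of the table on this page.

```python
ALUS_PER_GROUP = 5       # one VLIW-5 group
GROUPS_PER_ENGINE = 16   # groups in one SIMD engine / shader pipeline
TMUS_PER_ENGINE = 4      # texture mappers attached to each engine

def vliw5_config(engines):
    """Derive group, ALU and TMU counts from the number of enabled SIMD engines."""
    groups = engines * GROUPS_PER_ENGINE
    return {
        "groups": groups,
        "alus": groups * ALUS_PER_GROUP,
        "tmus": engines * TMUS_PER_ENGINE,
    }

for name, engines in [("Juniper PRO", 9), ("Juniper XT", 10), ("Cypress XT", 20)]:
    print(name, vliw5_config(engines))
```

Juniper PRO's 144 groups are the "144 VLIW-5 processor elements" mentioned for the 5750.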
Back to the 5770 at hand, when new it was about £130 (January 2011) and by far the apex of the price/performance curve, joined by its 5750 brother which was a tiny bit slower and a tiny bit cheaper.
Core: 16 ROPs, 40 TMUs, 850MHz (34 billion texels per second, 13.6 billion pixels per second)
RAM: 128 bit GDDR5, 1200MHz, 76,800MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 10x Unified shader 5.0 (10 VLIW-5 of 80, 800 total)
MADD GFLOPS: 1,360
AMD Radeon HD 7970 - 2011
AMD's Graphics Core Next was originally a codename for what was coming after VLIW-4 (Cayman, seen in the HD 6970). The instruction set was to change from VLIW to SIMD.
Each GCN "block" consists of four vector ALUs (SIMD-16) and a simple scalar unit. Each SIMD-16 unit can do 16 MADDs or FMAs per clock, so 128 operations per clock for the whole thing. The texture fetch, Z and sample units are unchanged from Terascale2/Evergreen, there are 16 texture fetch, load/store units and four texture filter units per each GCN "compute unit".
AMD's Radeon HD 6000 generation had been disappointing, with rehashes of previous 5000 series GPUs in Juniper (HD 5750 and 5770 were directly renamed to 6750 and 6770) while the replacement for Redwood, Evergreen's 5 CU part, was Turks, a 6 CU part. It seemed a bit pointless. The high-end was Barts, which was actually smaller and slower than Cypress. Only the very high end, Cayman, which was a different architecture (VLIW-4 vs VLIW-5), was any highlight.
On release, the HD 7970 was as much as 40% faster than Cayman. Such a generational improvement was almost unheard of, with 10-20% being more normal. Tahiti, the GPU in the 7970, was phenomenally powerful. Even Pitcairn, the mainstream performance GPU, was faster than everything but the very highest end of the previous generation.
Tahiti was one of those rare GPUs which takes everything else and plain beats it. It was big in every way, fast in every way, and extremely performant in every way. Notably, its double-precision floating point performance was 1/4 of its single-precision rate, meaning it hit almost 1 TFLOPS of DP performance. That was still at the high end of things in 2016.
The Radeon HD 7970 was the first full implementation of the "Tahiti" GPU, which had 32 GCN compute units, organised in four clusters, clocking in at 925 MHz. This put it well ahead of Nvidia's competing Kepler architecture most of the time. An enhanced "GHz Edition" was released briefly with a 1000 MHz GPU clock (not that most 7970s wouldn't hit that), which was then renamed to R9 280X. At that point, only the R9 290 and R9 290X, which used the 44 units of AMD's "Hawaii", a year later, were any faster.
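Putting the GCN numbers above together gives Tahiti's peak throughput. This is a sketch under the figures stated on this page (four SIMD-16 units per compute unit, one MADD/FMA per lane per clock, double precision at a quarter rate); the names are made up for illustration.

```python
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
OPS_PER_FMA = 2

def gcn_peak_gflops(compute_units, clock_mhz):
    """Single-precision peak for a GCN part with the given CU count and clock."""
    ops_per_clock = compute_units * SIMDS_PER_CU * LANES_PER_SIMD * OPS_PER_FMA
    return ops_per_clock * clock_mhz / 1000

shaders = 32 * SIMDS_PER_CU * LANES_PER_SIMD
single = gcn_peak_gflops(32, 925)   # Radeon HD 7970
double = single / 4                 # Tahiti's quarter-rate double precision
ghz_edition = gcn_peak_gflops(32, 1000)

print(shaders, "shaders")
print(single, "GFLOPS single precision")
print(double, "GFLOPS double precision")
print(ghz_edition, "GFLOPS single precision at 1000 MHz")
```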
This card eventually died an undignified death, beginning with hangs when under stress, then failing completely. As it was on a flaky motherboard (RAM issues), I assumed the motherboard had died, and replaced it with a spare Dell I got from the junk pile at work (Dell Optiplex 790). This video card couldn't fit that motherboard due to SATA port placement, only when a PCIe SATA controller arrived did the GPU's failure become apparent.
It was likely an issue on the video card's power converters and the Tahiti GPU remains fully working on a PCB unable to properly power it.
Core: 32 ROPs, 128 TMUs, 925 MHz
RAM: 384 bit GDDR5, 1375 MHz, 264,000MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 32x Unified shader 5.1 (32 GCN blocks - 2048 individual shaders)
MADD GFLOPS: 3,790
Zotac GeForce GT620 1GB DDR3 - 2012
This bears a sticker on the rear telling us it was manufactured in 2014, at which point the GF119 GPU was three years old. It had debuted as a low end part of the GeForce 500 series, in GeForce 510 and GeForce 520. The retail GeForce 620 normally used the GF108 GPU (even older, and first appeared in the GeForce 430), but OEM parts were GF119. This was a relabel of the GeForce 520 and used to meet OEM lifetime requirements.
The provenance of this particular card is not well known: It arrived non-functional in an off-the-shelf PC which had onboard video (GeForce 7100) as part of its nForce 630i chipset, so clearly was not part of that PC when it shipped.
It used the old Fermi architecture and only one compute unit of it, giving it just 48 CUDA cores. The GPU clock ran at 810 MHz (the CUDA cores in Tesla and Fermi were double-pumped) and DDR3 ran at 900 MHz over a 64 bit bus, all reference specs, as this card doesn't show up on the bus.
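The double-pumped shader clock matters when working out peak throughput, because the CUDA cores run at twice the listed GPU clock. A small sketch, taking this page's statement at face value that the shader clock is exactly double the core clock on these parts; the function name is made up.

```python
def double_pumped_peak_gflops(cuda_cores, core_clock_mhz):
    """Peak MADD/FMA rate when the shaders run at twice the core clock."""
    shader_clock_mhz = core_clock_mhz * 2
    return cuda_cores * 2 * shader_clock_mhz / 1000

gt620 = double_pumped_peak_gflops(48, 810)   # Fermi, GeForce GT620
fx380 = double_pumped_peak_gflops(16, 450)   # Tesla, Quadro FX 380

print("GeForce GT620:", gt620)
print("Quadro FX 380:", fx380)
```

Both results line up with the MADD GFLOPS figures given in the spec blocks for those cards.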
In a system which could keep the IGP running with a video card present, the GT620 actually appeared on the bus and could be queried. It turned out to be a Zotac card with an 810 MHz GPU clock and 700 MHz DDR3 clock. Neither the HDMI nor the DVI output gave any display. The header for a VGA output was fitted, but the actual port was not.
In later testing, the GT620 was found to be fully functional. Most likely some manner of incompatibility with BIOS or bad BIOS IGP settings caused it. The system it was in had lost its CMOS config due to a failed motherboard battery.
The GeForce GT620 was very cheap, very low end, and very slow. It would handle basic games at low resolutions, such as 1280x720, but details had to be kept in check. In tests, it was about as fast as a Core i5 3570's IGP and around 30% better than the Core i5 3470's lesser IGP. Given they were contemporaries, one wondered exactly who Nvidia was selling the GeForce GT620 to. The GeForce GT520's life did not end with the GT620. It had one more outing as the GeForce GT705, with clocks up a little to 873 MHz.
Its contemporary in the bargain basement was AMD's Radeon HD 5450 and its many relabels (6350, 7350, R5 220), which it was more or less equal to.
Core: 4 ROPs, 8 TMUs, 810 MHz
RAM: 64 bit DDR3, 700 MHz, 11.2GB/s
Shader: 1x Unified shader 5.0 (1 Fermi block - 48 individual shaders)
MADD GFLOPS: 155.5
Asus GT640 2 GB DDR3 - 2012
This large, imposing thing is actually Asus' GeForce GT640. What could possibly need such a large cooler? Not the GT640, that's for sure. The DDR3 version used Nvidia's GK107 GPU with two Kepler units, for 384 cores, but also 16 ROPs. The DDR3 held it back substantially: with the RAM clock at 854 MHz, the 128 bit bus could only deliver 27.3 GB/s. The GPU itself ran at 901 MHz on this card. Asus ran the RAM a little slower than usual, which was 892 MHz for most GT640s.
GK107 was also used in the GTX 650, which ran the GPU at 1058 MHz, 20% faster, and used GDDR5 memory to give 3.5x the memory performance of the GT640. It was around 30% faster in the real world. It, along with the surprisingly effective (but limited availability) GTX645, was the highlight of Nvidia's mid-range. The GT640, however, was not.
GT640 was among the fastest of Nvidia's entry level "GT" series and did a perfectly passable job. Rear connectors were HDMI, 2x DVI and VGA. It could use all four at once. At the entry level, performance slides off much quicker than retail price does, and while GT640 was near the top of it, it was still much less cost-effective than the GTX 650 was. The very top, GTX680, was also very cost-ineffective.
The low end and high end of any generation typically have similar cost to performance ratios, the low end because performance tanks for little savings, and the high end because performance inches up for a large extra cost.
RAM: 128 bit DDR3, 854 MHz, 27,328 MB/s
Shader: 2x Unified shader 5.1 (2 Kepler blocks - 384 individual shaders)
MADD GFLOPS: 691.2
EVGA GeForce GTX 680 SuperClocked 2 GB - 2012
The GTX 680 was the Kepler architecture's introductory flagship. It used the GK104 GPU, which had eight Kepler SMX units (each unit had 192 ALUs, or "CUDA cores"), each SMX having 16 TMUs attached and the whole thing having 32 ROPs. Memory controllers were tied to ROPs, each cluster of four ROPs having a 32-bit link to a crossbar shared between two clusters, so each crossbar memory controller, which served 8 ROPs, had a 64 bit memory channel to RAM. With 32 ROPs, GTX 680's GK104 had four such controllers and so 256 bit wide memory.
Kepler appeared to have been taken by surprise by AMD's GCN, but just about managed to keep up. As games progressed, however, GK104's performance against the neck-and-neck Radeon HD 7970 began to suffer. In more modern titles, the Tahiti GPU can be between 15 and 30% faster.
Nvidia's Kepler line-up was less rational than AMD's GCN or TeraScale 2, but still covered most of the market:
GK107 had 2 units
GK106 had 5 units
GK104 had 8 units
Nvidia disabled units to make the single unit GK107 in GT630 DDR3 and the three unit GK106 in GTX645. The second generation of Kepler, in (some) Geforce 700s, added GK110 with 15 units. Nvidia pulled out all the stops to take on GCN, and more or less succeeded.
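The Kepler line-up above can be sketched as unit counts times 192 CUDA cores per unit. Kepler dropped the double-pumped shader clock, so peak throughput is just cores, times two, times the GPU clock. The names here are made up for illustration; the clocks are the ones given in the spec blocks on this page.

```python
CORES_PER_UNIT = 192

kepler_units = {"GK107": 2, "GK106": 5, "GK104": 8, "GK110": 15}
kepler_cores = {gpu: units * CORES_PER_UNIT for gpu, units in kepler_units.items()}

def kepler_peak_gflops(units, clock_mhz):
    """Peak MADD/FMA rate: cores x 2 operations x clock, with no separate shader clock."""
    return units * CORES_PER_UNIT * 2 * clock_mhz / 1000

print(kepler_cores)
print("GT640 at 900 MHz:", kepler_peak_gflops(2, 900))
print("GTX 680 SuperClocked at 1150 MHz:", kepler_peak_gflops(8, 1150))
```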
We're getting ahead of ourselves. GTX 680 was released into a world where AMD's Tahiti, as Radeon HD 7970, was owning everything, in everything. How did GTX 680 fare? Surprisingly well. Kepler was designed as the ultimate DirectX 11 machine and it lived up to this. These days, however, it shows how badly it has aged: while the 7970 kept up with modern games, the GTX 680 tended not to maintain its place in the lineup. The newer the game, the more the 7970 beats the GTX 680 by.
Core: 32 ROPs, 128 TMUs, 1150 MHz
RAM: 256 bit GDDR5, 1552 MHz, 198,656 MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 8x Unified shader 5.1 (8 Kepler blocks - 1536 individual shaders)
MADD GFLOPS: 3,532
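The RAM and MADD GFLOPS lines in these spec blocks can be reproduced with simple arithmetic. The sketch below is our own reading of how the figures are derived, not a documented formula: bandwidth is bytes per transfer times clock times transfers per clock (2 for DDR3, 4 for GDDR5), and each shader is assumed to retire one multiply-add, counted as two operations, per clock. The function names are illustrative.

```python
def bandwidth_mb_s(bus_bits, clock_mhz, transfers_per_clock):
    """Peak memory bandwidth in MB/s.

    bus_bits // 8 is the number of bytes moved per transfer;
    transfers_per_clock is 2 for DDR3 and 4 for GDDR5 (treated as QDR).
    """
    return bus_bits // 8 * clock_mhz * transfers_per_clock


def madd_gflops(shaders, clock_mhz):
    """Peak MADD GFLOPS, counting a multiply-add as two operations per clock."""
    return shaders * 2 * clock_mhz / 1000


if __name__ == "__main__":
    # GTX 680: 256 bit GDDR5 at 1552 MHz, 1536 shaders at 1150 MHz.
    print(bandwidth_mb_s(256, 1552, 4))  # MB/s
    print(madd_gflops(1536, 1150))       # GFLOPS
```

For the GTX 680 this gives 198,656 MB/s and roughly 3,532.8 GFLOPS, agreeing with the listed figures to within rounding. The 128 bit DDR3 line further up works out the same way with two transfers per clock.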
Nvidia Quadro K2000 2 GB GDDR5 - 2013
The Quadro K2000 was Nvidia's "mainstream" professional video card at $599 on launch in 2013. It is based around the Nvidia GK107 GPU, which has two Kepler blocks on it, each of which contains 192 CUDA cores (256 functional units including the very limited-use special function unit), giving 384 shader units on this GPU.
Practically, it's a GeForce GTX 650 with double the memory and lower clocks. The memory on this professional card is not ECC protected and is commodity SK Hynix, part H5GQ2GH24AFR-R0C.
Most GTX 650s (and the GT740 / GT745 based on it) used double slot coolers, while this uses a single slot design. The GTX 645 used almost the exact same PCB layout as the Quadro K2000, but the more capable GK106 GPU. The GTX 650 used a slightly different PCB, but also needed an additional power cable. Nvidia put Kepler's enhanced power management to good use on the K2000 and, in testing, it was found to throttle back quite rapidly when running tight CUDA code, the kind of thing a Quadro is intended to do. When processing 1.2 GB of data through a CUDA FFT algorithm, the card had clocked back as far as 840 MHz, losing over 10% of its performance. It stayed within its somewhat anaemic 51 watt power budget and reached a temperature of only 74C.
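The "over 10%" figure can be sanity-checked from the clocks alone. This is a back-of-envelope sketch which assumes performance scales linearly with core clock (a simplification, since memory speed is unaffected) and takes 954 MHz, the core clock in the spec lines for this card, as the rated clock. The function name is illustrative.

```python
def clock_loss_fraction(rated_mhz, throttled_mhz):
    """Fraction of peak throughput lost if performance tracks core clock."""
    return 1 - throttled_mhz / rated_mhz


if __name__ == "__main__":
    loss = clock_loss_fraction(954, 840)
    print(f"{loss:.1%} of peak lost at 840 MHz")
```

Under that assumption the drop from 954 MHz to 840 MHz costs a little under 12% of peak throughput.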
Professionals wanting more performance than a $150 gaming GPU should probably have bought a GTX 680 a few months earlier with the money, and had enough left over to get some pizzas in for the office. Professionals wanting certified drivers for Bentley or Autodesk products should note that both AMD's and Nvidia's mainstream cards and drivers are certified.
This came out of a Dell Precision T5600 workstation, where video was handled by two Quadro K2000s (GTX 650 alike, $599) to give similar performance to a single Quadro K4000 (sub-GTX 660 $1,269). The K4000 was probably the better choice, but that's not what we're here for.
Core: 16 ROPs, 32 TMUs, 954 MHz
RAM: 128 bit GDDR5, 1000 MHz, 64,000 MB/s (GDDR5 is dual ported, so is actually a form of QDR)
Shader: 2x Unified shader 5.1 (2 Kepler blocks - 384 individual shaders)
MADD GFLOPS: 732.7
What Is A Video Card?
In theoretical and historical terms, computer video is an evolving proof of the Wheel of Reincarnation (see Sound Cards too) wherein:
1. A new type of accelerator is developed, which is faster and more efficient than the CPU alone
2. These accelerators eventually take on so much processing power that they are at least as complex as the host CPU
3. Functions done on the GPU are essentially done in software, as the GPU is so flexible
4. The main CPU becomes faster by taking in the functions that dedicated hardware used to do
5. A new type of accelerator is developed, which is faster and more efficient than the CPU alone
Modern video cards are at stage 4; Ivy Bridge and AMD's APUs represent the first generation. These integrated parts cannot match a discrete GPU's memory bandwidth, so future accelerators will be about reducing this need or supplying it, but the GPU shader core is inexorably bound for the CPU die.
The Current Cycle
In the early days, the video hardware was a framebuffer and a RAMDAC. A video frame was placed in the buffer and, sixty (or so) times a second, the RAMDAC would sequentially read the bitmap in the framebuffer and output it as an analog VGA signal. The RAMDAC itself was an evolution of the three-channel DAC which, in turn, replaced direct CPU control (e.g. the BBC/Acorn Micro's direct CPU driven digital video). Even by the time of the 80286, the dedicated framebuffer was a relatively new concept (introduced with EGA).
This would remain the layout of the video card (standardised by VESA) until quite late in VGA's life (early to middle 486 era), when it incorporated a device known as a blitter, something which could move blocks of memory around very quickly, ideal for scrolling a screen or moving a window with minimal CPU intervention. At this stage the RAMDAC was usually still external (the Tseng ET4000 on the first video page is an example) with the accelerator functions in a discrete video processor.
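To show what a blitter saved the CPU from doing, here is a software sketch of a screen scroll as one block move. It assumes a simple linear 8 bits per pixel framebuffer held in a bytearray; the function name and layout are ours for illustration and do not describe any particular chip.

```python
def scroll_up(framebuffer, width, height, rows):
    """Scroll an 8-bit linear framebuffer up by `rows` lines, in place.

    The surviving lines are moved with a single block copy, then the
    lines exposed at the bottom are cleared to colour 0.
    """
    if not 0 <= rows <= height:
        raise ValueError("rows must be between 0 and height")
    moved = (height - rows) * width
    # One block move: everything from line `rows` onwards goes to the top.
    framebuffer[0:moved] = framebuffer[rows * width:height * width]
    # Clear the lines exposed at the bottom of the screen.
    framebuffer[moved:height * width] = bytes(rows * width)


if __name__ == "__main__":
    fb = bytearray([1, 1, 2, 2, 3, 3])  # a 2x3 "screen", one colour per line
    scroll_up(fb, width=2, height=3, rows=1)
    print(list(fb))
```

Without a blitter the CPU performs that copy itself, byte by byte, over the bus. With one, the CPU only supplies the source, destination and size, and the video processor does the move.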
The next development, after more "2D" GDI functions were added, was the addition of more video memory (4MB is enough for even very high resolutions) and a texture mapping unit (TMU). Early generations, such as the S3 ViRGE on this page, were rather simple and didn't really offer much beyond what software rendering was already doing. These eventually culminated in the TNT2, Voodoo3 and Rage 128 processors.
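The 4MB claim is easy to check for a single displayed frame. The sketch below only counts the visible framebuffer and ignores textures, Z-buffers and double buffering, all of which a 3D card also has to fit into video memory. The resolutions are examples we picked and the function names are illustrative.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Memory needed to hold one frame at the given resolution and depth."""
    return width * height * bits_per_pixel // 8


def fits_in_vram(width, height, bits_per_pixel, vram_mb=4):
    """True if a single frame fits in vram_mb megabytes of video memory."""
    return framebuffer_bytes(width, height, bits_per_pixel) <= vram_mb * 1024 * 1024


if __name__ == "__main__":
    for mode in [(1024, 768, 32), (1280, 1024, 24), (1600, 1200, 16), (1600, 1200, 32)]:
        size = framebuffer_bytes(*mode)
        print(mode, size, "bytes", "fits" if fits_in_vram(*mode) else "does not fit")
```

1280x1024 in 24 bit colour and 1600x1200 in 16 bit colour both fit in 4MB. 1600x1200 in 32 bit colour does not.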
While increasing texture mapping power was important, the newly released Geforce and Radeon parts concentrated on early versions of what the engineers called 'shaders': small ALUs or register combiners able to programmatically modify specific values during drawing. These were used to offload the driver API code which set up triangles and rotated them to fit the viewing angle, and then the code which applied lighting.
These became known as "Transformation and Lighting" (T&L) engines and contained many simple ALUs for lighting (which only works on 8 bit pixel colour values) and a smaller number of more complex ALUs for vertex positions, which can be 16 bits.
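The two jobs a T&L engine took over from the driver can be sketched in a few lines. This is a software illustration only, assuming a rotation about the Y axis for the transformation and simple Lambert (N dot L) diffuse lighting; real fixed-function hardware used full 4x4 matrices and more elaborate light models, and the function names here are ours.

```python
import math


def rotate_y(vertex, angle_rad):
    """Rotate an (x, y, z) vertex about the Y axis: the 'transformation' step."""
    x, y, z = vertex
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)


def diffuse(normal, light_dir, colour):
    """Scale an 8-bit RGB colour by max(0, N.L): the 'lighting' step.

    Both vectors are assumed to be unit length.
    """
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    intensity = max(0.0, n_dot_l)
    return tuple(min(255, int(round(channel * intensity))) for channel in colour)


if __name__ == "__main__":
    print(rotate_y((1.0, 0.0, 0.0), math.pi / 2))
    print(diffuse((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (200, 100, 50)))
```

The vertex work needs more precision and range than the colour work, which only ever produces 8 bit channel values. That is the asymmetry between the two kinds of ALU described above.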
As it became obvious that GPUs had extreme levels of performance available to them (a simple Geforce 256 could do five billion floating point operations per second!), it was natural to try to expose this raw power to programmers.
The lighting part of the engine became a pixel shader, the vertex part, a vertex shader. Eventually they became standardised into Shader Models. SM2.0 and above have described a Turing Complete architecture (i.e. able to perform any computation) while SM4.0 arguably describes a parallel array of tiny CPUs.
Current SM4.0/SM5.0 (DX10 or better) shaders are very, very fast at performing small, simple operations repetitively on small amounts of data, perfect for processing vertex or pixel data. However, SIMD and multi-core CPUs are also becoming fast at performing these operations and are much more flexible. This has led many to believe that the lifespan of the GPU is nearing an end. By saving the expense of a video card and using the extra budget to tie much faster memory to the CPU and add more cores to it, future PCs could likely be more powerful machines, even when not playing games.
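The kind of "small, simple operation" meant here looks something like the sketch below: one multiply-add per pixel, with no pixel depending on any other. It is a plain Python stand-in for a pixel shader, not real shader code. The names are hypothetical, and texel and light values are assumed to be 8 bit (0 to 255), with the light treated as a fraction of 255.

```python
def shade(texels, lights, ambient):
    """One multiply-add per pixel: texel * light + ambient, clamped to 8 bits.

    Every output depends only on its own inputs, so the pixels can be
    processed in any order or all at once.
    """
    return [min(255, t * l // 255 + ambient) for t, l in zip(texels, lights)]


if __name__ == "__main__":
    print(shade([255, 128, 0], [255, 128, 255], ambient=10))
```

Because each pixel is independent, a GPU can hand them to hundreds of shader units at once, while a CPU has to work through them a SIMD register's worth at a time per core.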
A modern GPU's array of shader power is the extreme of one end of a scale which goes both ways. The far other end is a CPU, which has few but complex cores. A CPU is much faster on mixed, general instructions than a GPU is; a GPU is much faster on small repetitive workloads. The two extremes are converging: CPUs are evolving simpler cores while GPUs are evolving more complex ones. Intel's future Larrabee GPU is actually an array of 16-48 modified Pentium-MMX cores which run x86 instructions, truly emulating a hardware GPU. Larrabee has no specific video hardware and could, with some modifications, be used as a CPU, though it would be incredibly slow for the poorly threaded workloads most CPUs handle.