Best GPU for Deep Learning & AI
CPUs are more general-purpose than typical AI workloads require, which is why GPUs are often used instead. However, some manufacturers are already working on even more specialized hardware for machine learning.
- Artificial intelligence requires only a fraction of the instruction set of conventional CPUs.
- Using specialized chips for neural networks & Co. is more efficient and cheaper.
- GPUs are particularly popular because of their easy availability and programmability, but more and more manufacturers are developing special AI hardware.
Today, leading server manufacturers use NVIDIA GPUs to optimize their systems for AI and analytical applications. GPUs that are specifically optimized for AI can run thousands of calculations in parallel, delivering over 100 tera floating-point operations per second (TFLOPS).
GPUs have been instrumental in making deep learning economically viable, but they are not yet optimal for AI requirements. Some manufacturers therefore offer specially optimized AI chips. Fujitsu has taken this path with its Deep Learning Unit (DLU).
State-of-the-art (SOTA) deep learning models need a lot of memory. Regular GPUs don’t have enough VRAM to process SOTA deep learning models. In this test we present GPUs that are suitable for deep learning and AI models.
(TLDR) Test Results Overview: Best GPUs for Deep Learning
These GPUs can train all SOTA language models & image models without any issues as of May 2020:
- Nvidia Quadro RTX 8000 with 48GB VRAM (~$5,000-6,000)
- Nvidia Quadro RTX 6000 with 24GB VRAM (~$3,500-4,250)
- Titan RTX with 24GB VRAM (~$2,200-2,800)
GPUs that can train most (but not all) SOTA models:
- Nvidia Geforce RTX 2080 Ti with 11GB VRAM (~$1,000-1,500)
- Nvidia Geforce RTX 2080 with 8GB VRAM (~$650-800)
- Nvidia Geforce RTX 2070 with 8GB VRAM (~$450-600)
The Nvidia Geforce RTX 2060 Super can still be used, but the RTX 2060 and lower models are no longer suitable for training SOTA models.
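The tiers above boil down to a simple VRAM rule of thumb. The following sketch encodes it as a small lookup; the helper function and its names are purely illustrative, not part of any library, and the VRAM figures are those from this article’s list:

```python
# VRAM per card, in GB, as listed in the test results above (May 2020).
VRAM_GB = {
    "Quadro RTX 8000": 48,
    "Quadro RTX 6000": 24,
    "Titan RTX": 24,
    "GeForce RTX 2080 Ti": 11,
    "GeForce RTX 2080": 8,
    "GeForce RTX 2070": 8,
}

def sota_tier(gpu: str) -> str:
    """Classify a GPU by this article's rule of thumb:
    >= 24 GB trains all SOTA models, 8-11 GB trains most of them."""
    vram = VRAM_GB[gpu]
    if vram >= 24:
        return "all SOTA models"
    if vram >= 8:
        return "most SOTA models"
    return "not suitable"

print(sota_tier("Titan RTX"))         # all SOTA models
print(sota_tier("GeForce RTX 2070"))  # most SOTA models
```

The exact cutoffs shift as models grow, but the pattern holds: it is the memory capacity, not raw compute, that decides whether a card can train a given SOTA model at all.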
Benchmarks: Image Models & Language Models
1. Deep learning SOTA Image Models
Max batch size before hitting the memory limit:
| Model / GPU | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|
Performance Benchmarks (Images processed per second):
| Model / GPU | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|
2. Deep learning SOTA Language Models
Max batch size before hitting the memory limit:
| Model / GPU | Units | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|---|
Performance Benchmarks:
| Model / GPU | Units | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|---|
- Image models benefit less from more VRAM than language models do.
- GPUs with larger memory perform better because they can take larger batch sizes, which in turn helps to fully utilize the CUDA cores.
- The more VRAM, the larger the batch sizes that can be processed. GPUs with 48GB of VRAM can fit roughly 4x larger batches than GPUs with 11GB of VRAM.
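The VRAM-to-batch-size relationship above can be sketched with a simple back-of-the-envelope estimator: whatever memory is left after the model’s weights, gradients and optimizer state holds the activations, which grow roughly linearly with batch size. All the numbers below (the 3 GB model footprint, 80 MB per sample) are illustrative assumptions, not measurements from this test:

```python
def max_batch_size(vram_gb, model_gb, per_sample_mb):
    """Rough rule of thumb: VRAM left over after the model's fixed
    training footprint (weights + gradients + optimizer state),
    divided by activation memory per sample.
    All inputs are illustrative assumptions, not measurements."""
    free_mb = (vram_gb - model_gb) * 1024
    return int(free_mb // per_sample_mb)

# Hypothetical 3 GB training footprint, 80 MB of activations per sample:
print(max_batch_size(11, 3, 80))  # 102  (e.g. an 11 GB RTX 2080 Ti)
print(max_batch_size(48, 3, 80))  # 576  (e.g. a 48 GB Quadro RTX 8000)
```

Note that because the model’s footprint is fixed, the batch-size ratio between a 48 GB and an 11 GB card can come out even larger than the raw 48/11 ≈ 4.4x capacity ratio.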
In Detail: Winning GPUs for Deep Learning
Ranking First: Nvidia Quadro RTX 8000
Best performing GPU for Deep Learning models
So-called passively cooled GPU accelerators are nothing unusual in the data center segment. The term “passive cooling” only applies insofar as the cards themselves do not use fans; the cooler still has a constant air flow through it. The server case or rack thus ensures sufficient cooling and minimizes the risk of hardware failure caused by a faulty fan. In addition, passively cooled cards can be operated closer together, since no fan has to suck in air between the cards.
In the workstation segment, passively cooled graphics cards have so far been unheard of – at least among high-performance cards. PNY has now introduced two models, the Quadro RTX 8000 Passive and the Quadro RTX 6000 Passive. Like the active variants, these use a GPU with NVIDIA’s Turing architecture, featuring 4,608 shader units – just as many as the Titan RTX and more than a GeForce RTX 2080 Ti – plus 72 RT cores and 576 tensor cores. These are the flagship models of the Quadro RTX series, and in terms of raw performance the two models barely differ.
The difference between the Quadro RTX 8000 and the Quadro RTX 6000, whether in the actively or passively cooled version, is the memory expansion. While the Quadro RTX 6000 uses 24 GB GDDR6 with ECC support, the Quadro RTX 8000 uses a full 48 GB. The memory bandwidth of 624 GB/s is also identical for both.
The passive cooling – we’ll stick with the terminology – does, however, affect the thermal design power. While the actively cooled cards are specified for up to 295 W, the passively cooled cards are limited to 250 W. In practice this amounts to roughly a 10% performance difference in favor of the actively cooled models – the compromise that must be made when using these cards.
Systems, or rather the cooling of such systems, are designed for a TDP of 250 W per card, which is also the value the passively cooled GPU accelerators in the data center segment are designed around. As already mentioned, the case must ensure an appropriate airflow and thus sufficient cooling. Incidentally, the air can flow through the card in either direction and does not have to be directed towards the slot bracket.
Verdict: Best performing GPU for Deep Learning models
The Quadro RTX 8000 Passive and the Quadro RTX 6000 Passive are available and are supplied by PNY to OEMs for such workstations. A Quadro RTX 6000 costs $3,375, the Quadro RTX 8000 with 48 GB of memory around $5,400 – in the actively cooled version, mind you. Whether the passive models will also reach the retail channel is not known.
Ranking Second: Nvidia Quadro RTX 6000
Great performing GPU for Deep Learning models
Users can now pre-order the Nvidia Quadro RTX 5000 and Quadro 6000 graphics cards, which specialize in ray tracing. The cheaper RTX 5000 model is already sold out for a price of $2,300 plus VAT.
The Quadro RTX 6000, which Nvidia has so far announced only for the workstation segment, uses the full chip configuration. It also uses twice the number of GDDR6 chips per memory controller, for a total of 24 GiByte. Interested parties can now pre-order the Turing graphics card, at least in the USA, where Nvidia is asking $6,300. For comparison: the Quadro P6000 with the full GP102 configuration and 24 GiByte of GDDR5X RAM costs $5,000. Of course, professional support is always included in the Quadro prices. The price difference to the Geforce RTX 2080 Ti shows why Nvidia prefers to sell the fully enabled TU102 GPUs as Quadros: with a chip area of 754 mm², the yield of fully functional chips might not be the best even in the mature 12FFN process, and workstation customers bring the (significantly) higher margin.
The Quadro RTX 6000 at least shows what a hypothetical Turing Titan graphics card could look like – perhaps not with the expensive 24 GiByte of GDDR6, but fully enabled and possibly with faster 15 or 16 Gbps memory once it is readily available. Whether such a “Titan Xt” will actually appear, however, remains to be seen.
Verdict: Great performing GPU for Deep Learning models
The significantly more expensive Nvidia Quadro RTX 6000, however, can still be pre-ordered for a price of 5,600 US dollars. With this board Nvidia exhausts the possibilities of the TU102 GPU and thus offers 4,608 CUDA, 576 Tensor and 72 RT cores. Added to this are 24 GB of GDDR6 memory and 384-bit memory connectivity. For the time being, this makes it the cheapest graphics card that gets the most out of a Turing chip. The GeForce RTX 2080 Ti, however, does not rely on the full expansion of the TU102. It also only offers 11 GB memory.
Ranking Third: Nvidia Titan RTX
Best performing GPU for Deep Learning models at a cheaper price
If you’re a gamer looking for an extremely fast graphics card, you shouldn’t reach for the Titan RTX under any circumstances – not even as an enthusiast. Although it is the fastest gaming graphics card on the market, the advantage doesn’t justify the incredibly high price of $2,700. The GeForce RTX 2080 Ti is consistently the better choice.
Nevertheless, the Titan RTX is a technically very interesting graphics card in terms of its GPU and memory. With the full TU102 GPU, it is currently the fastest graphics card on the market and works quite energy-efficiently – partly because it hits its power-target limit very early. A real Titan should have considerably more leeway in power consumption. The cooler, too, is a unique selling point only optically; the cooling system is not worthy of a Titan at this purchase price.
The 24 GB memory is the (irrational) highlight of the Titan RTX
But when it comes to memory, the Titan can really set itself apart from all other GeForce models: 24 GB is by far the largest amount available on a standard graphics card. The AMD Radeon VII, already richly equipped with 16 GB, is 8 GB behind.
For gamers, however, so much memory is oversized and will probably remain so during the lifetime of the graphics card. For professional applications, however, this can be a big advantage.
In the end, this – unlike with the Titan Xp – is the real clientele of this Titan, because rendering and similar professional programs can never have enough VRAM. The prosumer alternatives, Quadro and Radeon Pro, are otherwise even more expensive.
Two hot-running cards in the Mifcom system
Verdict: Best performing GPU for Deep Learning models at a cheaper price
Together with the Skylake-X CPU, Mifcom’s system offers this user group a very fast, cleanly configured and even visually matched basis – which, however, cannot fully do justice to two Titan RTX cards. Neither the case nor the graphics cards with their standard coolers are suitable for dual operation. As a result, the heat isn’t dissipated fast enough despite the high noise level, and the upper of the two graphics cards drops by about 400 MHz to its base clock of 1,350 MHz. That corresponds to around a 25 percent loss of raw power, which is more or less noticeable depending on the application.
Apart from that, multi-GPU operation usually works without problems in professional applications, especially since the second GPU does not have to follow the first in terms of clock speed. In games, however, dual-GPU setups are genuinely problematic, because only very few titles use SLI sensibly, if at all. The editors will address this topic in a separate article in the near future.
Ranking Fourth: Nvidia Geforce RTX 2080 Ti
Solid GPU for Deep Learning models in the entry-level segment
How does the performance of the new Geforce RTX 2080 Ti and RTX 2080 compare to the GTX 1080 and GTX 1080 Ti of the previous generation? Today we’re testing Nvidia’s new Turing architecture in games – at least in the traditional way, because the heavily advertised raytracing and Deep Learning Super Sampling (DLSS) features can’t yet be tested in practice due to a lack of corresponding games.
In our test of the RTX 2080 and RTX 2080 Ti, we measure the performance in current games as usual, compare the newcomers with the GTX 1080 Ti and co., and give an outlook on the potential of the RT and Tensor cores.
Unlike at the launch of the first Pascal graphics cards, the GTX 1080 and GTX 1070 from 2016, partner cards (custom designs) of the RTX 2080 will be available on September 20th in addition to the reference cards (Founders Edition).
We’ve also already received test samples from Asus, MSI and Zotac, and more are on their way to the editorial office. You can expect further tests in the following days and weeks, but today the focus is directed towards Nvidia’s own creations.
The most obvious feature of the new Founders Edition is the completely new cooling system: the old blower-style principle with a small radial fan is now obsolete; both new models rely on two 90 millimetre fans and an aluminium radiator with vapor chamber.
The RTX 2080 Ti and RTX 2080 are thus cooler and quieter than their predecessors, even though they have a higher power consumption. However, this has one disadvantage: the waste heat is no longer transported directly out of the case but distributed inside it. Good case ventilation is therefore advisable (although that applies even without an RTX card).
The reference cards score with a simple look in silver and black, are well manufactured and feel premium – not least thanks to their relatively high weight of around 2.7 lbs. Externally, the two models differ only in the “Ti” suffix of the lettering, found on the front and on the aluminum backplate on the rear. The green “Geforce RTX” logo lights up during operation.
As usual, the new Founders Edition occupies two slots in the case and has identical dimensions to its predecessors with 26.7 x 11.4 x 4.0 cm. Three Displayport 1.4, one HDMI 2.0b and one USB-C port, among others for VR headsets, are located on the slot bracket. Nvidia’s so-called NVLink port is found on the upper side of the graphics card, over which two graphics cards can be connected in SLI via a separately available bridge (85 euro).
Only in direct comparison does it become clear how much the look and the cooling have changed: the striking finish gives way to a simpler, less playful cooler shroud. If you have a small case, you’ll probably regret the omission of the blower-style principle. But some partners have already announced cards with this cooling principle, and most of these custom designs should offer more cooling surface, including a larger radial fan, than Nvidia has realized on the Founders Edition.
The new Geforce graphics cards with Turing chips are finally here, but the last secret has not been revealed yet. Both the RTX 2080 Ti and RTX 2080 beat every currently available graphics card in the test, dethroning the GTX 1080 Ti and offering smooth gaming in 4K/UHD at high or very high details. With that, Nvidia merely meets expectations – but thanks to raytracing and DLSS, there is still untapped potential slumbering in Turing.
Verdict: Solid GPU for Deep Learning models in the entry-level segment
It will probably take a while until these features are established in games, though. But Nvidia has secured developer support: so far, over 20 games are set to use at least one of the two new features. Although the benchmark sequences used in the test seem a bit removed from practice, they at least give us an idea of Turing’s potential to render more realistic graphics or generate more performance with dedicated cores for raytracing and AI calculations.
Today, however, the conclusion is still a bit sobering. The RTX 2080 Ti and RTX 2080 are very fast but also very expensive gaming graphics cards – finally fast enough for 4K/UHD monitors, but requiring a deep dig into the wallet.
Where hardware counts: Deep Learning & AI
He who wants to harvest must sow – artificial intelligence (AI) is no exception. The race for the smartest AI applications demands high-performance hardware of a special kind. A CPU is not everything.
Learning algorithms have to handle unprecedented amounts of data in real time. They must be able to make intelligent decisions in unpredictable situations and at the same time in a highly individualized context. Learning backend technologies for AI applications such as programmatic advertising, autonomous driving or intelligent infrastructures have already reached the required level of maturity. What is missing is the appropriate hardware.
Partly due to various shortcomings of today’s hardware architectures, the practical use of AI poses enormous challenges. These challenges will be intensified by the end of Moore’s Law. In addition, there are application-specific limitations of key technical data of cyber-physical systems: space requirements, weight, energy consumption and more.
Just ensuring cyber security and data integrity in practical application scenarios is extremely difficult in the face of new types of attacks such as adversarial learning (malicious learning based on fraudulent data). Conventional system architectures simply cannot meet the new challenges.
The end of Moore’s Law and Dennard’s Law
Since the invention of integrated circuits for conventional silicon chips, Moore’s Law and Dennard’s Law have been considered the ultimate guide to technical progress in the industry. According to Moore’s Law, the number of transistors in an integrated circuit (at the same manufacturing cost, mind you) doubles approximately every two years. At the latest when the components of conventional circuits shrink down to the so-called monolayer – a single atomic layer, below which they cannot go – the end of the line is reached.
The end of Moore’s Law is drawing near. Dennard scaling already collapsed in 2005, as Professor Christian Märtin of the Augsburg University of Applied Sciences confirmed three years ago in a technical report (“Post-Dennard Scaling and the final Years of Moore’s Law. Consequences for the Evolution of Multicore Architectures”; see also the eBook “High Performance Computing”).
Dennard’s law states that the progressive miniaturization of the basic components of a circuit goes hand in hand with a lower voltage, allowing a higher clock frequency at the same power consumption. Although transistors continue to shrink for the time being, effects such as leakage current and threshold voltage mean that, for the first time, the error rate – and with it the manufacturing cost of CPUs – no longer decreases.
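Dennard’s promise can be made concrete with the classic dynamic-power model P = C · V² · f. The sketch below works through one idealized node shrink; the scaling factor s = 0.7 is the textbook value for a full node step, and the model deliberately ignores leakage current, which is exactly what broke this scaling in practice:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Classic dynamic (switching) power model P = C * V^2 * f.
    Leakage power, which ended Dennard scaling, is ignored here."""
    return capacitance * voltage**2 * frequency

# Shrink every linear dimension by s = 0.7 (one idealized node step):
# capacitance and voltage scale down by s, frequency scales up by 1/s.
s = 0.7
p_old = dynamic_power(1.0, 1.0, 1.0)
p_new = dynamic_power(s * 1.0, s * 1.0, 1.0 / s)

# Per-transistor power drops by s^2 while transistor density rises by
# 1/s^2 -- so power density stays constant, which was Dennard's promise.
print(round(p_new / p_old, 2))  # 0.49  (= s**2)
```

Once voltage could no longer be lowered with each shrink (because of threshold-voltage and leakage effects), the V² term stopped falling and power density began to climb, which is the collapse the text describes.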
Workload-specific hardware acceleration
The only way to keep improving the ratio of performance to power consumption and procurement cost in the future is to develop workload-specific hardware accelerators, according to UC Berkeley researchers in an October 2017 report, “A Berkeley View of Systems Challenges for AI”. The emergence of domain-specific hardware architectures, composable infrastructures and edge architectures (see also the eBook “Edge Computing”) could help. According to the Berkeley researchers, further improvements can now only be achieved through innovations in computer architecture, not through improvements in the semiconductor process.
Domain-specific processors can perform only a few tasks, but they perform them extremely well. Future servers will therefore be “much more heterogeneous” than ever before. As a “groundbreaking example,” the Berkeley researchers cite Google’s “Tensor Processing Unit” (TPU), an application-specific AI accelerator in ASIC (Application-Specific Integrated Circuit) architecture. Google’s TPU performs the inference phase of deep neural networks 15 to 30 times faster than CPUs and GPUs, with a performance per watt that is 30 to 80 times better.
The current second-generation TPU delivers 45 teraflops, is (for the first time) floating-point capable and supports a bandwidth of 600 GBps per ASIC. In a parallel architecture of four TPU chips, the resulting module achieves a performance of 180 TFLOPS; 64 of these modules form a so-called TPU pod with 256 chips and a total performance of 11.5 PFLOPS.
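The pod figures above follow directly from the per-chip number, as this quick check shows:

```python
# Reproducing the TPU v2 arithmetic from the text:
# 45 TFLOPS per chip, 4 chips per module, 64 modules per pod.
tflops_per_chip = 45
chips_per_module = 4
modules_per_pod = 64

module_tflops = tflops_per_chip * chips_per_module   # 180 TFLOPS per module
total_chips = chips_per_module * modules_per_pod     # 256 chips per pod
pod_pflops = module_tflops * modules_per_pod / 1000  # convert TFLOPS -> PFLOPS

print(module_tflops)         # 180
print(total_chips)           # 256
print(round(pod_pflops, 1))  # 11.5
```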
ASICs at Google and FPGAs at Microsoft
Google’s TPU, however groundbreaking, is purely proprietary and not commercially available. So Google’s competitors have to make do with alternatives for AI workloads.
Unlike Google, both Microsoft and Intel rely on FPGAs. Microsoft offers FPGA-based compute instances as an Azure service, and Intel paid a whopping $16.7 billion for FPGA vendor Altera.
ASICs (Application-Specific Integrated Circuits) and FPGAs (Field Programmable Gate Arrays) are based on two opposing concepts. ASICs are application-specific integrated circuits that are manufactured according to very tightly defined design specifications. They are characterized by very low unit costs, but cannot be modified.
Unlike ASICs, FPGA-type integrated circuits can be remotely adapted to new workloads even after they have been installed in the data center (see the eBook “The Programmable Data Center”). Chip development from idea to prototype takes only six months for FPGAs, but up to 18 months for ASICs, reveals Dr. Randy Huang, FPGA Architect of Intel’s Programmable Solutions Group. Both ASICs and FPGAs offer high energy efficiency compared to GPU acceleration; GPUs (Graphics Processing Units), on the other hand, excel at high peak performance in floating-point calculations.
GPUs: seizing the opportunity
Thanks to their massive parallelizability, GPUs beat the performance of conventional CPUs by far in AI applications. According to the manufacturer, the Nvidia GPU “Tesla”, for example, delivers up to 27 times the acceleration in the inference phase of neural networks compared to a system with only one single-socket CPU.
Nvidia dominates the market for AI accelerators in data centers. Even Google uses the “Tesla P100” and “Tesla K80” GPUs within the “Google Cloud Platform”.
Nvidia’s data center revenue has risen sharply in recent months. The GPU market leader has responded by strategically diversifying its technology portfolio away from games and towards AI – with remarkable determination, though not without making enemies.
In the latest EULA of its graphics card drivers, Nvidia prohibits the use of the more affordable “GeForce GTX” and “Titan” GPUs in data centers; only blockchain processing remains permitted there. The manufacturer cites the exceptional heat resistance required for operation in high-density systems under demanding AI workloads as the reason.
Drivers are the limiting factor
For the existing data center users of these GPUs, the driver updates are now a thing of the past. Without the proprietary drivers from Nvidia, which are constantly updated via bug fixes, the hardware can only provide a fraction of its theoretical performance. If the user breaks the EULAs, the warranty for the associated hardware is automatically voided.
The new Nvidia EULA thus has even more implications for data centers. AI system vendors will have to resort to the more than ten times more expensive Tesla GPUs such as the “V100” chips (the lower-performance “Quadro” family is optimized for visual workloads such as industrial design, not neural networks).
As Moore’s Law runs into physical limits, the performance gap between GPU and CPU is widening. If this trend continues, the GPU is likely to exceed the single-threaded performance of a CPU by 1,000 times by the year 2025, Nvidia is pleased to report on its own blog.
But even a market leader would be well advised not to count its chickens before they hatch.
The IPUs, DPUs, DLUs…
GPUs are by far not the only way to satisfy the performance hunger of AI applications. Chip giants such as Intel and Fujitsu, as well as the VC investors behind a whole range of largely unknown start-ups, are betting on alternatives.
The British-Californian Graphcore (https://www.graphcore.ai/) claims its “IPU” – the Intelligence Processing Unit – is the first chip architecture optimized from the ground up for machine learning workloads. The innovative processing unit and the associated graph programming framework “Poplar” have won the start-up the backing of VC firms such as Sequoia and Robert Bosch Venture Capital GmbH.
The Californian startup Wave Computing is in the race with the “Wave DPU” (Dataflow Processing Unit). The company has developed a technology that attempts to eliminate the bottlenecks of conventional system architectures by dispensing with the CPU/GPU co-processor scheme. With a highly scalable 3U appliance for machine learning in the data center, Wave Computing wants to demonstrate the capabilities of the new architecture.
Californian Cerebras Systems develops chips for neural networks, so far still in stealth mode.
Fujitsu has taken its own path with the “DLU” (Deep Learning Unit). The DLU owes its impressive performance characteristics to, among other things, a new data type called “Deep Learning Integer” and a new INT8/INT16 accumulator. These give the processor the ability to perform integer calculations inside deep neural networks with a variable precision of 8, 16 and 32 bits without compromising the accuracy of the overall model.
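The general idea behind such reduced-precision integer formats can be illustrated with a minimal symmetric int8 quantization sketch. To be clear, Fujitsu’s actual “Deep Learning Integer” format is not public; the helper functions below are a generic, hypothetical illustration of trading float precision for small integers plus a scale factor:

```python
def quantize_int8(values):
    """Symmetric int8 quantization sketch: map floats onto [-127, 127]
    integers via a single per-tensor scale factor. Purely illustrative;
    not Fujitsu's actual DLU format, which is not public."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    """Recover approximate floats from the integers and the scale."""
    return [i * scale for i in ints]

weights = [0.52, -1.27, 0.03, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Every quantized value fits in signed 8 bits, and the round trip
# stays within one quantization step of the original floats.
print(all(-128 <= i <= 127 for i in q))                            # True
print(max(abs(a - b) for a, b in zip(weights, restored)) < scale)  # True
```

Real systems layer per-channel scales, zero points and wider accumulators (hence INT8/INT16) on top of this basic scheme, but the core trade-off – smaller, cheaper arithmetic at a bounded precision cost – is the same one the DLU exploits.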
Conclusion and outlook
As long as Moore’s Law could be adhered to, leading chip manufacturers simply put alternative technologies such as quantum computers and neuromorphic systems on ice in order not to cannibalize their already successful processor lines. Now the development labs are bustling with activity.
The next generation of AI systems promises to significantly accelerate the rapid progress of AI development in the fierce competition between the top dogs and the challengers – in hardware.
Intel outside? Hedging its bets
Intel, too, has recognized the signs of the times. After its license agreement with Nvidia expired, the CPU giant has been sourcing GPU technology from AMD, for the time being only for notebooks. Intel apparently doesn’t want to leave anything to chance in AI applications and has several horses in the race: FPGAs from Altera, ASICs from Nervana Systems, a 49-qubit quantum computer called “Tangle Lake”, the neuromorphic “Loihi” chips and the “Movidius VPU” (Vision Processing Unit) for edge deep learning in autonomous IoT devices.
With Nervana Systems, Intel has acquired a SaaS platform provider with an AI cloud for an estimated $408 million. That cloud currently runs on Nvidia Titan X GPUs. The Nervana Engine, an application-tailored ASIC chip under development, is expected to end this dependency soon; the ASIC is also expected to perform about ten times better than the Nvidia Maxwell GPU.
Intel is also experimenting with neuromorphic chips under the code name Loihi. Manufactured in 14-nanometer technology, the technological marvel has a total of 130,000 neurons and 130 million synapses that have been implemented as circuits.
But Nvidia does not want to put all its eggs in one basket either. The next-generation “DrivePX” platform for autonomous vehicles is to have a hybrid architecture. In addition to an ARM CPU and a Volta-GPU, a DLA (Deep Learning Accelerator) in ASIC architecture is used.
Analysts from Research and Markets confirm that the AI market has an annual growth rate of 57.2 percent CAGR. At this rate, the market volume is expected to grow to 58.97 billion dollars by 2025 – there is certainly enough room for several alternative architectures.