The hardware requirements for general Data Science programming are quite low (apart from large IDEs). Machine Learning is a different story: ML needs plenty of RAM. 8 GB is the absolute minimum, while 16 GB or even 32 GB offer considerable advantages.
The CPU should also be decent. A quad-core CPU is the minimum you should look at, but six- and eight-core CPUs offer far more performance nowadays.
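Why core count matters can be sketched with Python's standard multiprocessing module. The workload below is a made-up stand-in for a CPU-bound preprocessing step, only meant to show how work spreads across cores:

```python
# Hedged sketch: spreading a CPU-bound workload across cores.
# The "preprocess" step and chunk sizes are invented for illustration.
from multiprocessing import Pool

def preprocess(chunk):
    # Stand-in for a CPU-bound step such as feature extraction.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [list(range(i, i + 10_000)) for i in range(0, 80_000, 10_000)]

    # Serial baseline: one chunk after another.
    serial = [preprocess(c) for c in chunks]

    # The same work across four worker processes; on a quad-core
    # or better CPU the chunks are processed concurrently.
    with Pool(processes=4) as pool:
        parallel = pool.map(preprocess, chunks)

    assert serial == parallel
```

On a six- or eight-core CPU you would simply raise the worker count; the results stay identical, only the wall-clock time shrinks.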
The GPU is just as important. GPUs can help compute large ML data sets; Nvidia offers CUDA for exactly this purpose. A GPU can speed up the learning process significantly: what takes several hours on the CPU alone finishes in minutes with GPU support, because the GPU is optimized for these kinds of calculations (e.g. matrix operations).
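The effect of hardware tuned for matrix operations can be illustrated even without a GPU. The minimal sketch below (assuming NumPy is installed; the matrix size is arbitrary) compares a naive Python matrix multiply with the vectorized BLAS call NumPy dispatches to. On an actual GPU, via CUDA libraries such as cuBLAS or a framework like PyTorch, the gap grows far larger still:

```python
# Illustrative only: the same matrix multiply as explicit Python
# loops vs. one vectorized BLAS call.
import time
import numpy as np

n = 120
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Naive triple loop: one scalar multiply-add at a time.
t0 = time.perf_counter()
c_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += a[i, k] * b[k, j]
        c_loop[i, j] = s
t_loop = time.perf_counter() - t0

# The same math as one vectorized call, batched for the hardware.
t0 = time.perf_counter()
c_blas = a @ b
t_blas = time.perf_counter() - t0

assert np.allclose(c_loop, c_blas)
print(f"loop: {t_loop:.3f}s  vectorized: {t_blas:.6f}s")
```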
Storage space is another important aspect. ML generates huge amounts of data; in a few days you can easily fill 1 TB. If you do not already have a large hard disk, you should get one.
ML is pretty much the most demanding thing you can do with a computer. Don’t underestimate that! In the following, we present the best CPU and GPU options for a Machine Learning and Data Science PC setup. Although you should probably get newer models, which cost a bit more, you do not have to spend a fortune on well-performing hardware for Machine Learning.
Test Results: Best CPU for Machine Learning & Data Science
Ranking First: AMD Ryzen 9 3900X
- Superb price-for-performance ratio in multithreaded scenarios
- Relatively low power consumption
- Easy overclocking tools
- Huge L3 cache
- No integrated graphics
Best performing CPU for Machine Learning & Data Science
With Zen 2, alias “Matisse”, AMD says goodbye to the old Zeppelin die structure and splits the tasks across several parts: three components sit on the silicon of the R9 3900X. Two of them are so-called chiplets. The Ryzen’s cores live here – a maximum of eight per chiplet, divided into clusters of four. The CPU-near cache is also located in the chiplets. Both chiplets communicate with the I/O die via the “Infinity Fabric” data bus (“I/O” stands for “in/out”). The I/O die in turn handles data transfer to the rest of the PC, memory management, and communication between the chiplets.
The chiplets are also the epitome of AMD’s current pride in the number 7 – which is why the processors also launched on July 7. The cores are manufactured at a structure width of 7 nanometers, whereas the Ryzen 2000 series was still produced at 12 nanometers. Miniaturization lets a CPU manufacturer either shrink a die to make it more efficient or put more processing units on the same surface area.
For the user, the new structure changes nothing at first. In fact, Matisse brings hardly anything new in terms of pure functionality – except that these AMD CPUs are the first to bring PCI Express 4.0 to consumer platforms. However, the performance advantages of the wider data bus (512 instead of 256 bits) are currently modest for ordinary users – graphics cards, for example, are not expected to exhaust the gigantic bandwidth any time soon. But extra headroom never hurts, and PCIe 4.0 SSDs can be faster than their older counterparts.
By the way, the CPU still fits into the AM4 socket and can be overclocked with most motherboard chipsets. So if you are still satisfied with your first-generation Ryzen board and the manufacturer offers updates, you can still stick with your old board.
Hardly any weaknesses
When benchmark results land in a table, there has seldom been a more exciting comparison than the current one between the R9 and Intel’s i9 – not because it is a neck-and-neck race, but because the differences are simply brutal: overall, AMD is a good 21 percent ahead of Intel’s eight-core top model. The peak values sit well above that: almost 45 percent in multi-core rendering with Cinebench, 44 percent in encryption with TrueCrypt, around 39 percent in x265 encoding. The average is dragged down by a few underperformers that “only” operate at eye level with the i9: the old PCMark 8, Cinebench with a single processor thread (Ryzen 2 percent behind), and large spreadsheets in Excel (Ryzen 8 percent behind).
The Ryzen 9 3900X will also be interesting for gaming: In combination with an Nvidia GTX 1080, it beats the competitor model in the benchmark suites Fire Strike and Time Spy by two to six percent.
All benchmark results compared with the Intel Core i9-9900K can be found in the following table. By the way, a beefed-up 16-core version, the 3950X, is also slated to follow.
| Benchmark | AMD Ryzen 9 3900X | Intel Core i9-9900K |
| --- | --- | --- |
| PCMark 8 | 4,153 points | 4,152 points |
| PCMark 10 | 4,194 points | 3,783 points |
| Excel | 0.448 seconds | 0.41 seconds |
| Cinebench R15 | 3,130 points | 2,033 points |
| Cinebench R20 | 7,111 points | 4,912 points |
| Cinebench R20 (ST) | 501 points | 511 points |
| WinRAR | 27,823 KB/s | 25,476 KB/s |
| Handbrake | 206.9 FPS | 157.51 FPS |
| x264 | 150.26 FPS | 120.31 FPS |
| x265 | 14.107 FPS | 10.156 FPS |
| POV-Ray | 6,174.2 points | 4,272.93 points |
| TrueCrypt | 979 MB/s | 697 MB/s |
| Fire Strike | 20,220 points | 19,899 points |
| Time Spy | 8,143 points | 7,681 points |
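Two of the percentage leads quoted in the text can be recomputed directly from the table’s raw scores; a quick sketch:

```python
# Recomputing the Cinebench R20 (multi-core) and x265 leads
# from the raw benchmark scores in the table above.
ryzen = {"Cinebench R20": 7111, "x265": 14.107}
core_i9 = {"Cinebench R20": 4912, "x265": 10.156}

for bench in ryzen:
    lead = (ryzen[bench] / core_i9[bench] - 1) * 100
    print(f"{bench}: AMD ahead by {lead:.0f}%")
```

The results (about 45 and 39 percent) match the figures cited in the text.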
Ryzen 9 is not a power guzzler
A typical way to get more power out of a CPU is to increase its power consumption. In our full-system power measurements, however, there is hardly any difference between the top CPUs from AMD and Intel: in PCMark 10 the AMD system draws 234 to 350 watts depending on the test scenario, while the Intel system comes in marginally lower at 233 to 348 watts. Even accounting for the different mainboards and their possibly different power consumption, the differences between the processors are negligible. So AMD has not skimped on efficiency.
The secret lies in the IPC
There is one big difference between AMD and Intel: clock frequency. While Intel has cracked the 5 GHz mark, Ryzen’s boost only just reaches 4.6 GHz. The higher performance can therefore only come from a massively improved IPC (instructions per cycle). AMD lists a number of changes that, taken together, should deliver the additional 15 percent IPC the manufacturer claims over the previous generation.
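As a rough sanity check on that claim: single-thread performance scales approximately with clock × IPC. A back-of-envelope sketch, where the IPC values are normalized assumptions for illustration, not measurements:

```python
# Back-of-envelope arithmetic only: performance ~ clock x IPC.
# IPC figures below are normalized assumptions, not measurements.
def relative_perf(clock_ghz, ipc):
    return clock_ghz * ipc

prev_gen_ipc = 1.00             # previous Ryzen generation as baseline
zen2_ipc = prev_gen_ipc * 1.15  # AMD's claimed +15% IPC for Zen 2

zen1 = relative_perf(4.6, prev_gen_ipc)  # old IPC at the 3900X's boost clock
zen2 = relative_perf(4.6, zen2_ipc)      # Zen 2 at the same 4.6 GHz

# 4.6 GHz x 1.15 "acts like" about 5.29 GHz of the old architecture,
# which is how an IPC lead can offset a clock deficit.
print(f"effective 'old-IPC' clock: {zen2:.2f} GHz")
print(f"gain at equal clock: +{(zen2 / zen1 - 1) * 100:.0f}%")
```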
Most obvious is the enlarged L3 cache: 64 MB of CPU-near memory are now available. The improved AVX2 support is also exciting – the CPU now processes such data twice as fast. Furthermore, the chip improves branch prediction, gets a larger micro-op cache and a more associative L1 cache.
The last two improvements are easier to picture. First, thread grouping: in Zen 2, processor threads – i.e. the tasks of running programs – preferably end up in the same chiplet, and there in the same compute cluster, rather than at opposite ends of the processor. This should pay off especially for the spatially separated chiplets.
Second, memory: AMD has given the Infinity Fabric, i.e. the CPU’s data bus, more freedom in its clocking. This should remove an old bottleneck – nevertheless, according to AMD there is a “sweet spot” at DDR4-3733. If you want to save a little money without significant performance losses, go for DDR4-3600 (CL16). Unfortunately, we have not yet been able to test how different data rates affect performance.
Finally, one thing should not be forgotten when judging CPU performance: AMD omits the integrated graphics unit in its higher-class desktop processors. If Intel did the same, it would have more die area available for CPU tasks. On the other hand, an integrated graphics unit can sometimes bring significant advantages of its own.
Verdict: Best performing CPU for Machine Learning & Data Science
AMD’s Ryzen 9 3900X turns out to be a wonder CPU in our test for Machine Learning & Data Science. The twelve-core processor beats the direct competition in many tests with flying colors, is efficient, and at the same time only slightly more expensive. Intel’s last stronghold, the consumer high end, has thus fallen. Whether you are a gamer or a high-end user, as in Machine Learning & Data Science, there is hardly any reason not to go for the 3900X.
Ranking Second: Intel Core i9-9900K
- Two more cores than previous top Coffee Lake CPU
- 5GHz peak one-core clock for single-threaded apps
- Unlocked multiplier
- Great for multi-threaded applications
Great performance for Machine Learning & Data Science
The Core i9-9900K is an octa-core with a 5 GHz boost clock and a soldered heat spreader for lower temperatures. The chip computes very fast and beats AMD’s Ryzen 7 2700X easily, but the price and power consumption of the 9900K are enormous.
A year and a half after AMD released octa-core CPUs for midrange systems – first the Ryzen 7 1800X, later the Ryzen 7 2700X – Intel follows suit: the Core i9-9900K is the first processor for Socket LGA 115x to launch as a Core i9 and with eight cores. It follows on from last fall’s Core i7-8700K and adds higher frequencies along with two additional cores. The performance of the Core i9-9900K is accordingly enormous, beating even 10-core workstation processors. The price, however, turns out to be high in two respects.
Alongside the Core i9-9900K, Intel brings the Core i7-9700K and the Core i5-9600K to market. Both are based on the same octa-core design, though hyper-threading is, atypically, disabled on the i7. The Core i5-9600K has six active cores without SMT and thus largely corresponds to a Core i5-8600K with more clock speed. Internally, Intel calls the three 9th Gen desktop chips Coffee Lake Refresh, since the previous 8th Gen from 2017 is called Coffee Lake.
The new processors are intended for Socket LGA 1151 v2, introduced last year, so they fit into existing motherboards. The Coffee Lake Refresh is compatible with boards using the Z370, H370, Q370, B360 and H310 chipsets, provided an updated UEFI is available. In our short test, the Core i9-9900K ran perfectly on an Asus Maximus X Hero (Z370), an MSI B360 Gaming Plus and a Gigabyte H310M-H. However, we do not recommend using the chip on particularly cheap boards, as with the appropriate settings it can draw around 200 watts and put an excessive load on the voltage converters – more on this later.
New is the Z390 chipset, which Intel’s partners use primarily for expensive boards. Technically, its die, manufactured in 14 nm, is identical to that of the B360 and thus smaller than the 22 nm Z370. The Z390 offers six USB 3.1 Gen2 ports, ten USB 3.0 ports, six SATA 6 Gbit/s ports and 24 PCIe Gen3 lanes. The B360 has fewer ports, and the older Z370 lacks native USB 3.1 Gen2, so lanes have to be spent on additional controllers. The Z390 and the B360 integrate the MAC portion for 802.11ac Wi-Fi at 1.733 Gbit/s; the PHY sits externally, for example on an AC 9560 card.
As usual, Intel only names the base and maximum boost clocks for the Core i9-9900K, Core i7-9700K and Core i5-9600K, not the individual steps. According to our information, however, the Core i9-9900K can boost up to two cores to 5 GHz, up to four to 4.9 GHz, up to six to 4.8 GHz and all eight to 4.7 GHz. The previous Core i7-8700K manages up to 4.7 GHz single-core and up to 4.3 GHz all-core – 500 MHz less on six cores.
Verdict: Great performing CPU for Machine Learning & Data Science at a higher price
The Core i9-9900K is Intel’s attempt to squeeze one last round out of the Skylake architecture, in use since 2015, and the repeatedly refined 14 nm process: with eight cores at 4.7 GHz and a dual-core boost of 5 GHz, the chip beats the competition and on average even outperforms Intel’s own $1,000 Core i9-7900X.
The octa-core scores with its high-clocked cores in multi-threaded software such as Blender or x265 encoding. For largely single-threaded applications and in games, the 5 GHz frequency makes it unbeatable. However, the gap to the previous Core i7-8700K and the Ryzen 7 2700X only becomes significant when the Core i9-9900K can fully utilize its clock rate.
If the mainboard limits it to 95 watts, the processor stays behind its potential. Only at 200 watts does it show what it is capable of, increasing speed by up to 20 percent depending on the application. The drawback is an exorbitant rise in power consumption – more than 100 additional watts – which Intel can only dissipate because the Core i9-9900K’s heat spreader is soldered instead of attached with thermal paste.
All in all, the i9-9900K is a great performing CPU for Machine Learning & Data Science, but at a higher price than AMD’s top CPU. We would rather recommend the AMD Ryzen 9 3900X, which is priced very fairly and performs better than the i9-9900K.
Ranking Third: AMD Ryzen 7 3800X
- Solid blend of single- and multi-threaded performance
- PCIe 4.0 support
- Bundled cooler
- Best price-performance ratio
- Requires X570 motherboard for PCIe 4.0
Best price-performance CPU for Machine Learning & Data Science
In the benchmark duel, AMD’s Ryzen 7 3800X has to compete against its own Ryzen 3000 siblings as well as Intel’s Core i7-9700K and Core i9-9900K. Intel wins in games, but overall the Ryzen 7 3800X is the better all-rounder.
In AMD’s current lineup of Ryzen 3000 processors, the Ryzen 7 3800X sits between the Ryzen 7 3700X and the Ryzen 9 3900X. In our eyes, however, this CPU has received too little attention so far. We are therefore making up for that with an extensive test, sending the Ryzen 7 3800X with its eight cores and 16 threads into a benchmark duel against the current Intel processors as well as its own Ryzen 3000 siblings.
AMD Ryzen 7 3700X: the better option for price-conscious buyers
The bottom line is that the Ryzen 7 3800X is an impressive product that can compete with Intel’s comparably positioned processors. The Ryzen 7 3700X, likewise equipped with 8 cores and 16 threads, reaches the same performance when overclocked and costs considerably less – which is why bargain hunters should rather pick that CPU.
The Ryzen 7 3800X offers a base clock of 3.9 GHz ex works and is supposed to boost up to 4.5 GHz. The TDP is stated at 105 watts and the price is currently a flat $429. Facing it is the Ryzen 7 3700X, whose 65-watt TDP corset is tighter and whose base clock is 300 MHz lower, but which boosts up to 4.4 GHz.
However, the Ryzen 7 3700X only works more economically at factory settings. As soon as Precision Boost Overdrive (PBO) is enabled, its power consumption rises significantly above that of the Ryzen 7 3800X – with and without PBO – in all tested workloads. The Intel representatives, the Core i7-9700K and Core i9-9900K overclocked to 5.1 and 5.0 GHz respectively, also need noticeably more power. Even with an all-core overclock of 4.3 GHz at 1.42 volts, the Ryzen 7 3800X is usually much more frugal.
With the mentioned overclock, the average temperatures of the Ryzen 7 3800X under a Corsair H115i AiO water cooler were 80, 81.64 and 84.8 degrees Celsius for longer x264 and x265 encoding runs and Y-cruncher. A maximum of 91 degrees Celsius was measured in the Y-cruncher test, though it lasted only about one second.
An MSI MEG X570 Godlike served as the base, by the way. For memory, tomshardware.com used two 8 GiB G.Skill FlareX DDR4-3200 modules, which were overclocked to DDR4-3600 on the tested Ryzen 3000 processors. The second-generation Ryzen CPUs were run with DDR4-2933 and DDR4-3466, with an Nvidia GeForce RTX 2080 Ti as the graphics card in all test systems. In addition, a 2 TB Intel DC4510 SSD and a 1,600-watt EVGA SuperNOVA 1600 T2 power supply were used.
Verdict: Best price-performance CPU for Machine Learning & Data Science
In the final benchmark course, the Ryzen 7 3800X mostly duels its Ryzen 3000 siblings in games, where the Intel representatives are usually clearly in front. In return, there is a real battle in the synthetic benchmarks as well as in workloads such as rendering, encoding, compression and encryption. That is why the recommendation goes to the Ryzen 7 3800X over the Core i7-9700K if you want to do more than just game on your computer – the AMD CPU is the better all-rounder. The X570 platform with PCIe 4.0 also speaks in its favor. All in all, the Ryzen 7 3800X offers the best price-performance ratio for Machine Learning & Data Science.
Which hardware is better suited for AI & Machine Learning acceleration?
Modern hardware accelerators have brought the practical application of artificial intelligence in IT and industry within reach. But which technology is better suited for this purpose: GPUs, DSPs, programmable FPGAs or proprietary, dedicated processors?
Artificial intelligence and machine learning – these topics are not new. Universities have been dealing with the topic since the 1950s. But it is only in recent years that demonstrations such as the self-learning AlphaGo computer or extensive tests on autonomous driving have ensured that AI has become a tangible topic that is already finding practical application in everyday life. In particular, the speed and extent to which so-called neural networks can be trained has increased rapidly in recent years.
Modern hardware accelerators dedicated to AI have made the real-time application of artificial intelligence possible today. The technological approaches to such dedicated AI acceleration are extremely diverse. Processor manufacturer Intel, for example, has declared AI an important trend and pursues it in several technological directions: on the one hand, it is pushing machine learning based on FPGAs; on the other, the company offers dedicated processors designed for neural networks in the form of Nervana. In the latter area in particular, the chip giant faces competition from – but also supports – numerous start-ups that offer their own chip solutions for AI acceleration.
The ideal “neural” processor: CPU, GPU or FPGA? Or a dedicated ASIC?
“Basically, it’s wrong to ask which is better for artificial intelligence – GPU, ASIC or FPGA,” says Doug Burger, Distinguished Engineer at MSR NExT and member of Microsoft’s “Project Brainwave” team. “These technologies are all just a means to an end for implementing a suitable architecture for a neural network. The question that remains open is: what is the most appropriate architecture? That is still a matter of debate.”
In recent years, NVIDIA graphics cards have been used increasingly in academic circles to train self-learning algorithms. This is because the massively parallel architectures of GPUs and their design for high data throughput suit not only graphics computation but also AI acceleration. For this reason, the GPU manufacturer now offers dedicated platforms designed for AI applications, such as Jetson, whose core is the graphics processing unit.
But GPUs are not the only hardware well suited to high, parallel data streaming. It is exactly this property that makes FPGAs attractive for telecommunications or as co-processors in data centers. DSPs (digital signal processors) also come into consideration as additional hardware acceleration, for very similar reasons, to make AI applicable in real time. And Google, which recently introduced version 3.0 of its Tensor Processing Unit (TPU), relies on an application-specific integrated circuit (ASIC) to provide the necessary acceleration for AI training.
With all these different approaches, however, it is difficult to keep track and to compare the available technologies efficiently. What exactly matters in hardware acceleration for artificial intelligence? What are the individual strengths of the respective approaches in this field? In which areas can these advantages best be exploited? We asked various developers and solution providers about this; Cadence, Intel, Lattice, Microsoft, NVIDIA and several other companies responded. Over the course of the week, we plan to present detailed statements from some of the respondents.
Hyper-scale data centers make AI possible for everyone
Enormous computing power is one of the main reasons why AI is currently booming and becoming practical. Cloud computing, fast Internet connections and the resulting easy access to powerful data centers now make supercomputers and HPC (high-performance computing) accessible to everyone. It is primarily here that much of the modern hardware designed for AI comes into play.
A major breakthrough for the modern perception of artificial intelligence in practical applications came in June 2012 as part of the Google Brain project: AI researchers from Google and Professor Andrew Ng from Stanford University had trained an AI cluster that could automatically recognize cats in YouTube videos and distinguish them from humans. Training the artificial intelligence required a cluster of 2,000 CPUs working in a Google data center. A short time later, NVIDIA teamed up with Ng to repeat the experiment on GPUs. The result: 12 GPUs were enough for the AI training to achieve the same result that had previously required 2,000 CPUs.
“Deep learning is an AI technique that enables machine learning by training neural networks with large amounts of data to solve problems,” said Axel Köhler, deep learning solution architect at NVIDIA. “Like 3D graphics, deep learning is a parallel computational problem, meaning that large amounts of data must be processed simultaneously. The multi-core architecture of the GPU is ideal for this type of task.”
GPUs (graphics processing units) were originally designed to map the way 3D graphics engines execute their code – things like geometry setup, texture mapping, memory access and shaders. To do this as efficiently as possible – and above all to relieve the computer’s central CPU – GPUs carry numerous specialized processor cores that perform these tasks with high parallelism. NVIDIA accordingly calls the individual processing units within the GPU “streaming multiprocessors” (SMs) – the more SMs a GPU has, the more tasks the device can handle in parallel. This structure, and especially this enormous parallelism, also benefits the training of AI algorithms.
FPGAs also promise high parallelism, high data throughput and low latency. These logic devices are often used in data centers to support the CPUs, where they are well suited to fast data interfaces and to data preprocessing that offloads the CPUs. “For a wide range of AI applications in the data center (including reasoning systems, machine learning, training and deep learning inference), computing systems with Intel Xeon processors are used,” says Stephan Gillich, Director of Artificial Intelligence and Computing at Intel Germany. “The advantage is that classic data analysis is also carried out on these systems. If necessary, the Xeon-based platforms can be accelerated with Intel FPGAs (field-programmable gate arrays), for example for real-time analysis.”
But FPGAs can do more than that when it comes to supporting machine learning in the data center. First, there is the flexibility of easily reconfigurable hardware: the algorithms can change. An FPGA implementation can be programmed for maximum system performance and delivers it in a highly deterministic way, as opposed to a CPU-based approach that is subject to interrupts. With highly distributed logic resources, extensive interconnect schemes and plenty of distributed local memory, this allows a flexible deployment that supports many machine learning algorithms.
Core aspects of a hyperscale-based high-performance computer
Does this automatically make FPGAs more suitable due to their flexibility? NVIDIA’s Axel Köhler disagrees: “To meet the challenge of implementing Deep Learning on a broad scale, the technology must overcome a total of seven challenges: Programmability, latency, accuracy, size, (data) throughput, power efficiency and learning rate. Meeting these challenges requires more than just adding an ASIC or FPGA to a data center. Hyperscale data centers are the most complex computers ever built”.
In addition, FPGAs – especially the high-end FPGAs used in data centers – are considered inaccessible to developers, complicated and difficult to program. Köhler, by contrast, points to the many positive experiences universities in particular have had doing AI research on GPUs – and to the numerous resulting frameworks that make developing and training AI algorithms easier with GPU support, in theory at least.
Another way is to rely on dedicated processors that are specifically tailored to the requirements of neural networks. For example, Intel has the Nervana Neural Network Processor (NNP) in its product range – a technology that was purchased together with the tech start-up of the same name in 2016 and incorporated into the portfolio. “AI solutions must be increasingly scalable and fast, while accommodating ever larger data models,” says Stephan Gillich. “The architecture of the Intel NNP was developed specifically for deep learning training and is characterized by high flexibility and scalability as well as fast and powerful memory. Large amounts of data can be stored directly on the chip and can be retrieved in a very short time”.
A real performance comparison of the different existing platforms does not yet exist. In August, however, Google, Baidu and the universities of Harvard and Stanford plan to publish the MLPerf machine learning benchmark for exactly this purpose.
Key features of mass market AI hardware
Until now, the talk has always been of the cloud or the data center. But AI will also play an important role on edge devices and in the mass consumer market, as all respondents confirmed. “Increasingly, AI applications will emerge in mobile, AR/VR headsets, surveillance and automotive for on-device AI,” said Pulin Desai, product marketing director for the Tensilica Vision DSP product line. “At the same time, these markets require a mix of embedded vision and AI on the devices themselves in order to provide a wide range of advanced features.”
There are also currently no real benchmarks for machine learning on edge or end devices – so-called inference. To provide more clarity here, the Embedded Microprocessor Benchmark Consortium (EEMBC) is developing a benchmark suite specifically for this purpose.
Why a benchmark of its own? Consumer devices and products that work at the edge of the cloud have different requirements than supercomputers or high-performance computers in the data center. For one thing, an AI solution for the end market must be embedded, as Pulin Desai explains: “For all markets, from mobile phones to cars, a large amount of data has to be processed ‘on the fly’. While the training of a neural network can usually take place offline, the applications that use these neural networks must be embedded in their own system, regardless of the market.” And energy efficiency plays a much bigger role: “Just as we don’t carry data centers around in our cars or on our equipment, we can’t carry power sources dedicated to AI with us wherever we go.”
Peter Torelli, president of the EEMBC, puts this into perspective: “A simple camera for face recognition or a device that converts speech to text cannot afford to rely on a 300 W GPU. But in an ADAS it is perfectly feasible – for Level 5 systems even a must.”
Future-proofing also plays an important role. “As neural network processing evolves, products that use neural networks may have to be reprogrammed while still in development, before delivery. The platform must be able to grow with the industry,” Desai continues. Deepak Boppana, senior director of product and segment marketing at Lattice Semiconductor, shares this view: “Ultimately, it comes down to a combination of flexibility and programmability,” he says. According to him, an AI acceleration device at the edge must address four key aspects that matter less in the data center: energy efficiency, chip size, quantization and, taken together, cost.
AI for Edge Deployment – pre-processing before the cloud
NVIDIA also emphasizes that it offers its GPU solutions end-to-end – not only for the data center but also for end devices. Lattice’s Deepak Boppana counters that CPU and GPU solutions are sometimes too powerful for use on the end device – and therefore too power-hungry. “In machine learning there is the issue of quantization – essentially the bit width at which your AI model runs,” says Boppana. “The more bits you have, such as 16 bits, the better the accuracy of your final solution. But you will also draw more power.”
A scalable solution such as a low-end FPGA is much more practical here. “In applications where you don’t really need high accuracy, you can use lower quantization – such as 8-bit or down to 1-bit,” says Boppana. “This gives the customer much more flexibility in the design specification. GPUs and CPUs typically offer only 16-bit – whether you need that much accuracy or not – which typically consumes much more power.” Think of a smart speaker, a simple smart-home application or an AI assistant on a smartphone. Note that we are talking about low-end FPGAs with a relatively small number of programmable logic elements – not the high-end devices with more than 4 million logic units used in the data center.
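Boppana’s quantization point can be made concrete with a small sketch: symmetric 8-bit post-training quantization of a float32 weight vector. This is an illustration only – the weight distribution is synthetic, and real toolchains use more refined schemes:

```python
# Hedged sketch of post-training quantization: mapping float32
# weights to 8-bit integers with a single scale factor.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)

# Symmetric 8-bit quantization: one scale maps [-max, max] to [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure the accuracy cost.
restored = q.astype(np.float32) * scale
max_err = np.abs(weights - restored).max()

print(f"storage: {weights.nbytes} -> {q.nbytes} bytes")
print(f"max absolute error: {max_err:.4f}")
```

The int8 copy is a quarter the size of the float32 original, and the worst-case rounding error stays within half a quantization step – the accuracy-versus-power trade-off Boppana describes.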
When it comes to AI in the context of embedded vision in particular, says Cadence’s Pulin Desai, DSPs are increasingly proving to be a popular solution. DSPs get by with low clock frequencies for signal processing, achieve high parallelism thanks to processor pipelines built into the architecture, and – because they can be integrated as IP into an SoC or ASIC – require little space and power.
Here, as so often, the classic trade-offs between ASICs and FPGAs arise: ASICs usually require a lengthy initial development phase but are subsequently cheaper in mass production, are considered faster and more efficient than FPGAs, and are easier to handle because they are developed for one specific field of application. FPGAs, on the other hand, are considered complicated to use, but their reprogrammability gives them a great advantage in future-proofing, low recurring costs and time-to-market.
With AI hardware it always depends on the application
Which hardware is best suited for artificial intelligence? “Every application has specific requirements for the technology used,” says Stephan Gillich of Intel. In addition to the FPGA- and NNP-based approaches already mentioned, the company therefore offers further solutions tailored to specific needs – for computer vision (Movidius), intelligent speech and audio (GNA), cognitive computing software (Saffron) or autonomous driving, for example the Mobileye EyeQ SoC, which Intel has in the past compared with NVIDIA’s GPU-based Xavier platform.
What does it look like from the side of companies that do not produce AI hardware but want to use it in their solutions? “As far as neural network training is concerned, faster Internet connections and more extensive cloud offerings have created new opportunities in recent years,” says Sandro Cerato, chief technology officer of the Power Management & Multimarket Division at Infineon Technologies AG. Providers such as Amazon Web Services (AWS), Microsoft’s Azure cloud and the web services of the Chinese operator Alibaba now give virtually everyone access to high-performance data centers and HPC. Microsoft, for example, relies on a combination of Intel Xeon processors and Stratix 10 FPGAs in its data centers and in its AI undertaking “Project Brainwave”.
Artificial intelligence for everyone requires rethinking
“If you use cloud services for training neural networks, it seems at first glance irrelevant which hardware is used for this. With tools, frameworks and libraries under open-source license, such as TensorFlow or Caffe, and corresponding data sets with which the future AI is to be trained, machine learning can be accomplished relatively easily,” Sandro Cerato continues from his own experience. “In addition, only a minimum of proprietary software code is required – whether on GPUs, CPUs, NNPs or FPGAs. However, if you want to train an AI on your own hardware, you have to weigh some considerations – especially if the question of how fast it should go is paramount.”
To make the transition from training to inference as seamless as possible, NVIDIA is particularly committed to its end-to-end approach: “Our hardware and software stack encompasses the entire AI ecosystem, both in the training and inferencing phases,” said Axel Köhler of NVIDIA. “From the cloud to the local AI data center to intelligent IoT devices and individual workstations, NVIDIA’s goal is to democratize AI by making the essential tools widely available with the capabilities, form factors, and scalability needed by developers, scientists, and IT managers.”
Lattice’s Deepak Boppana is more critical: “To take this approach to AI on their devices, existing designs would have to be completely revamped and redesigned. It’s not easy to incorporate a standard chip solution that can be seamlessly integrated into an existing design. With an FPGA-based approach, there is not much need for developers to deal with new hardware – FPGAs can integrate the technology into these existing designs much better,” says Boppana.
“You can’t avoid this problem,” says Peter Torelli of the EEMBC. “Developers will need to educate themselves about the AI models they want to implement before they make the necessary hardware choices. This is not a feature that can be added at the push of a button, as is the case with additional interfaces. There’s a real learning curve here.”
“All the old paradigms that AI research was looking at back in the 1970s and 1980s are now coming back to the surface,” says Doug Burger. “Those who are not aware of this are now asking themselves: which is better, FPGA or GPU? But that’s the wrong comparison! FPGAs or ASICs are just two ways to put an NPU to work. The more important question is: what is the right architecture for an NPU? We have our opinion on this. Google and NVIDIA have their own opinions. And Intel even has several different views. This is the crucial question here, and that’s the big debate we’re going to have over the next three or four years.”