As the benefits of device scaling diminish, designers are beginning to build AI into systems so that more data can be processed locally. Chipmakers are investigating new architectures that significantly increase the amount of data that can be processed per watt and per clock cycle, laying the groundwork for major changes in chip architecture over the next few decades.
Chip and system vendors are changing direction

All major chip and system vendors are changing direction, and they have sparked an architectural race that touches everything from how data is read into memory, to how data is processed and managed, to how the various elements are ultimately packaged into a single chip. Node shrinks will continue, but no one is betting everything on scaling to cope with the explosive growth of data from sensors and from ever more machine-to-machine communication.
Among these changes, several deserve attention:
New processor architectures focus on processing larger blocks of data per clock cycle, sometimes with lower precision, or on giving certain operations higher priority, depending on the application.
New memory architectures are under development that will change the way data is stored, read, written, and accessed.
More targeted processing elements are being scattered around the system, positioned as close as possible to the memory they use. In the future, accelerators will be selected based on data type and application.
AI itself is being studied as a way to mix different data types into patterns, effectively increasing data density and minimizing differences between data.
Packaging is now a core component of the architecture, with increasing emphasis on how easily a design can be modified.
Rambus distinguished inventor Steven Woo said: "Several trends are driving people to get the most out of their existing solutions. In the data center, people want to squeeze all the performance out of hardware and software, and that is forcing a rethink of data center economics. The cost of innovation is very high, and technology switching is the bottleneck, so we will see dedicated silicon, and we will see many ways of improving the efficiency of computing. If we can reduce the data exchanged between memory and I/O, we can have a major impact."
The change is particularly evident in edge devices
This change is particularly evident in edge devices. System vendors have suddenly realized that if tens of billions of devices send all the data they generate to the cloud, the volume will be overwhelming. But processing huge amounts of data on edge devices raises a new challenge: processing performance must improve without significantly increasing energy consumption.
Robert Ober, chief platform architect for Nvidia's Tesla product line, said: "The focus now is on reduced precision. This is not just about more compute cycles; it is about putting more data into memory by using 16-bit instruction formats. So efficiency cannot be raised simply by stuffing data into caches. Statistically, the results of the two approaches are the same."
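Ober's point about reduced precision can be illustrated with a small sketch: storing values as float16 instead of float32 halves the memory footprint, while for well-scaled data the results stay statistically close to full precision. NumPy here stands in for what the hardware would do natively; the array size and error tolerance are illustrative assumptions, not figures from Nvidia.

```python
import numpy as np

# A block of 1024 values in [0, 1], first at full precision, then reduced.
block_f32 = np.linspace(0.0, 1.0, 1024, dtype=np.float32)
block_f16 = block_f32.astype(np.float16)

bytes_f32 = block_f32.nbytes   # 1024 * 4 = 4096 bytes
bytes_f16 = block_f16.nbytes   # 1024 * 2 = 2048 bytes, half the traffic

# The reduced-precision copy stays close to the full-precision original:
# float16 carries ~3 decimal digits, so the worst-case error here is small.
max_err = np.max(np.abs(block_f32 - block_f16.astype(np.float32)))
print(bytes_f32, bytes_f16, max_err < 1e-3)  # 4096 2048 True
```

Halving bytes per element means twice as many values fit in the same memory and move across the same bus per cycle, which is exactly the trade Ober describes.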
Ober predicts that through a series of architectural optimizations, it is possible to double the processing speed every two years for the foreseeable future. “We will see the most cutting-edge changes,” he said.
"To do this, we need to solve three bottlenecks. The first is compute. The second is memory; in some models the problem is memory access, in others it is computation. The third is host bandwidth and I/O bandwidth. We need to do a lot of work to optimize storage and networking."
Some of this work has already been done. In a talk at the Hot Chips 2018 conference, Jeff Rupley, lead architect at Samsung's Austin R&D Center, pointed to several major architectural changes in Samsung's M3 processor. One is that it executes more instructions per clock cycle: six, versus four in its predecessor, the M2. Combined with a branch predictor built on neural networks (roughly analogous to prefetching in search) and an instruction queue twice as deep, these changes are expected to address those bottlenecks.
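To see what a branch predictor buys, here is a toy sketch of the classic two-bit saturating-counter predictor, the baseline that neural predictors like the M3's improve upon. This is not Samsung's design; the state encoding and the example branch stream are illustrative assumptions.

```python
# A 2-bit saturating counter: states 0-3, where >=2 means "predict taken".
# The counter strengthens on each taken branch and weakens on each
# not-taken branch, so one-off flips do not immediately change prediction.
def predict_stream(outcomes):
    state = 2                 # start weakly "predict taken"
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch taken 9 times, then falling through once, mispredicts
# only on the final iteration.
acc = predict_stream([True] * 9 + [False])
print(acc)  # 0.9
```

Neural predictors extend this idea by correlating each branch with the history of many other branches, pushing accuracy high enough to keep a six-wide machine fed.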
Viewed another way, these changes shift the focus of innovation from manufacturing and process technology to front-end architecture and design, as well as the back-end manufacturing flow. Process innovation will continue, but the 15% to 20% performance gain from each new node comes with great complexity and can no longer keep up with today's rapidly growing data.
Victor Peng, president and CEO of Xilinx, said in a Hot Chips talk: "Change is occurring at an exponential rate. 10ZB (10^21 bytes) of data is generated every year, and most of it is unstructured."
New approaches to memory
Handling this much data requires rethinking every component of a system, from how data is processed to how it is stored.
Carlos Maciàn, senior director of innovation at eSilicon EMEA, said: "Attempts to build new memory architectures have so far come to nothing. The problem is that you need to read an entire row and then pick one bit from it. An alternative is to build memory that can be read from left to right as well as from top to bottom. It is also possible to go further and distribute computation to the location nearest each memory."
These changes include changing the way memory is read, the location of memory, the type of processing elements, and methods for using AI to optimize how data is stored, positioned, processed, and moved throughout the system.
"In the case of sparse data, what if we could read just one byte at a time from a memory array, or eight consecutive bytes from a single byte lane, without spending energy on the bytes or byte lanes we are not interested in?" said Marc Greenberg, product marketing director at Cadence.
"This kind of change may attract more interest in the future. Take HBM2's architecture as an example: the HBM2 die stack is organized into 16 virtual channels, each 64 bits wide, so an access to any channel needs only four consecutive 64-bit words. That makes it entirely possible to build a 1024-bit-wide data array, writing horizontally while reading four 64-bit words at a time vertically."
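The horizontal-write, vertical-read pattern Greenberg describes can be sketched with a small array. The sizes follow the figures in the quote (a 1024-bit row is sixteen 64-bit words; a vertical read pulls four words from one channel); the data values are arbitrary.

```python
import numpy as np

# Sixteen 64-bit words make one 1024-bit-wide row; stack four rows so a
# vertical read can return four consecutive 64-bit words per channel.
ROW_WORDS = 16
array = np.arange(4 * ROW_WORDS, dtype=np.uint64).reshape(4, ROW_WORDS)

# Write horizontally: each row is one 1024-bit write.
row_write = array[1]            # words 16..31 land in a single wide write

# Read vertically: four 64-bit words from one virtual channel (column 3),
# without touching the other fifteen channels.
vertical_read = array[:, 3]
print(vertical_read.tolist())   # [3, 19, 35, 51]
```

The point of the organization is that the vertical read activates only one channel, rather than burning energy across the full 1024-bit row.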
Memory is a core component of the von Neumann architecture, but it has also become one of the biggest areas of experimentation. Dan Bouvier, chief architect of AMD's client products, said: "One of the biggest drags is the virtual memory system, which moves data in a lot of unusual ways. You have to do address translation constantly, and we have grown used to that. But if you eliminate bank conflicts in DRAM, data streams much more efficiently. A discrete GPU can run DRAM at 90% efficiency, which is remarkable. If data transmission were smoother, APUs and CPUs could also reach 80% to 85% efficiency."
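A bank conflict happens when two back-to-back accesses land on the same DRAM bank, forcing the second to wait. This toy model (not AMD's, with an assumed 8-bank, low-bits address mapping) shows why streaming addresses interleave cleanly while a strided pattern keeps hitting one bank.

```python
# Count back-to-back accesses that hit the same bank under a simple
# "address mod number-of-banks" mapping.
NUM_BANKS = 8

def count_conflicts(addresses):
    conflicts = 0
    last_bank = None
    for addr in addresses:
        bank = addr % NUM_BANKS       # low-order bits choose the bank
        if bank == last_bank:
            conflicts += 1            # consecutive hit on one bank stalls
        last_bank = bank
    return conflicts

streaming = list(range(64))           # sequential: rotates through all banks
strided   = list(range(0, 512, 8))    # stride 8: every access hits bank 0

print(count_conflicts(streaming), count_conflicts(strided))  # 0 63
```

Real controllers reorder requests and use smarter address hashes precisely to turn the second pattern into something closer to the first.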
Charlie Janac, chairman and CEO of Arteris IP, said: "There are four main configurations: many-to-many, memory subsystems, low-power I/O, and mesh and ring topologies. All four can be placed on the same chip, which is the practice when building IoT chips. Or you can add a high-throughput HBM subsystem. But complexity rises sharply, because some workloads depend on the specific chip, and each chip has its own particular workloads and pins. Some IoT chips handle massive amounts of data, especially the radar and LiDAR chips in cars. Without some kind of advanced interconnect, these chips would not be possible."
The challenge is to minimize data movement, to maximize data flow when data must be moved, and to strike a balance between local and centralized processing, all without consuming too much energy.
Rajesh Ramanujam, product marketing manager at NetSpeed Systems, said: "On one side there are bandwidth issues. You do everything possible to avoid moving data, so you move data as close to the processor as possible. And if you do have to move data, you try to compress it. But none of this comes for free; it all has to be weighed at the system level. Every step has to be considered from multiple angles: whether to use memory in the traditional read-write way, or to adopt newer memory technologies. In some cases you have to change how data is stored. If you want faster performance, that usually means more area, which affects power consumption. Then you have to consider security, and the problem of data overload as well."
That is why so much attention is going to processing on edge devices, and to throughput between multiple processing elements. One example is an AI engine that performs its own analytics on solid-state storage.
Ned Varnica, principal engineer at Marvell, said: "You can load a model directly into the hardware and do the processing in the SSD controller. Today the host in a cloud service does this. If every drive sent its data to the cloud, it would create enormous network traffic. It is better to let the edge device process the data itself, so the host only needs to send a command containing metadata. Then the more storage devices you have, the more processing power you have. The benefits are huge."
This approach is notable for the flexibility it gives data movement across different applications. The host can generate a task and send it to the storage device for characterization, with only metadata or computed results coming back. Alternatively, the storage device can store the data, preprocess it, and generate the metadata, tags, and indexes that the host later retrieves for analysis.
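The division of labor described above can be sketched as a toy protocol: the host issues a query, a drive-side routine scans the data where it lives, and only a small summary crosses the network instead of every record. The record format, threshold, and function names here are hypothetical, purely for illustration.

```python
# 1000 fake sensor records living "on the drive"; each record would
# otherwise have to be shipped to the host for filtering.
records = [{"id": i, "temp": 20 + (i % 15)} for i in range(1000)]

def drive_side_characterize(threshold):
    """Runs on the storage device: filter locally, return metadata only."""
    matches = [r["id"] for r in records if r["temp"] > threshold]
    return {"count": len(matches), "first": matches[0] if matches else None}

# The host sends one command and gets back a summary a few bytes long,
# rather than pulling all 1000 records across the network.
summary = drive_side_characterize(30)
print(summary)  # {'count': 264, 'first': 11}
```

The asymmetry is the whole point: the query and the summary are tiny, while the data that never moves is large, and it scales as Varnica describes, since each added drive brings its own compute.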
This is just one option; there are others. Samsung's Rupley points in particular to out-of-order execution and instruction fusion, in which two instructions are decoded at a time and fused into a single operation.
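Instruction fusion can be sketched with a toy decoder (the details are illustrative, not Samsung's): when a multiply's result feeds the very next add, the pair is emitted as one fused multiply-add, so the back end retires one operation instead of two.

```python
# Instructions are (opcode, dest, src1, src2) tuples. The decoder scans
# adjacent pairs and fuses MUL followed by a dependent ADD into one FMA.
def fuse(instructions):
    fused, i = [], 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (nxt and cur[0] == "MUL" and nxt[0] == "ADD"
                and cur[1] == nxt[2]):          # MUL's dest feeds ADD's src
            # FMA computes dest = src1 * src2 + addend in a single op.
            fused.append(("FMA", nxt[1], cur[2], cur[3], nxt[3]))
            i += 2                              # consumed two instructions
        else:
            fused.append(cur)
            i += 1
    return fused

prog = [("MUL", "t0", "a", "b"),
        ("ADD", "r0", "t0", "c"),
        ("SUB", "r1", "r0", "d")]
print(fuse(prog))  # [('FMA', 'r0', 'a', 'b', 'c'), ('SUB', 'r1', 'r0', 'd')]
```

The fused program does the same work in two operations instead of three, which is how fusion raises effective instructions per cycle without widening the machine.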
AI supervision and optimization
Running through all of this is artificial intelligence, the newest element of chip architecture. Functions are no longer managed solely by the operating system and middleware; they are distributed at various points on the chip, at the system level, and across different chips. In some cases, a neural network can be built into the chip itself.
Mike Gianfagna, vice president of marketing at eSilicon, said: "What we are really doing is packing more things together and changing the traditional approach. With AI and machine learning we can spread this across the system and get more efficient, more predictable processing. In some cases that means different chips in a system; in other cases, the same package."
Arm has released its first machine learning chip, which is scheduled to ship into multiple markets later this year. Arm fellow Ian Bratt said: "This is a new processor. It has a basic building block, a compute engine, with a MAC engine and a DMA engine, plus a control and broadcast network. There are 16 such compute engines in total, and at 1GHz in a 7nm process they can process 4 trillion operations per second."
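A quick back-of-the-envelope check on those figures (taking "4 trillion" as operations per second, which is an assumption about the quote): at 1GHz, 4 trillion operations per second means 4,000 operations every cycle, or 250 per compute engine.

```python
# Sanity-check the quoted figures: 4 trillion ops/s, 1 GHz clock, 16 engines.
total_ops_per_sec = 4e12
clock_hz = 1e9
engines = 16

ops_per_cycle = total_ops_per_sec / clock_hz          # ops completed per cycle
ops_per_engine_per_cycle = ops_per_cycle / engines    # share of each engine

# If each MAC is counted as two ops (multiply + add), that implies roughly
# 125 MACs per engine per cycle.
print(ops_per_cycle, ops_per_engine_per_cycle)  # 4000.0 250.0
```

The numbers hang together only if each engine completes hundreds of operations per cycle, which is why the quote describes a wide MAC engine rather than a conventional scalar pipeline.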
Because Arm works with ecosystem partners, its chip is more general-purpose and configurable than other AI/ML chips still under development.
Rather than putting everything into one monolithic core architecture, Arm organizes processing by function, so each compute engine can be responsible for a different task. Bratt says there are four key functions: static task scheduling, efficient convolution, mechanisms for reducing bandwidth, and programmability for future designs.
Nvidia, meanwhile, has taken a different path, placing a dedicated deep learning engine next to the GPU to optimize the flow of images and video.
By adopting some or all of these approaches, chip vendors say they can double chip performance every two years, keeping up with the explosive growth of data while keeping power consumption within bounds.
This is about more than just more compute. It is the starting point for change across chip design and system engineering, in which chips begin to track the growth of data rather than being limited by hardware and software.
Synopsys chairman and co-CEO Aart de Geus said: "When computers first entered companies, many people felt the world was moving too fast. They had been doing their accounting on paper. Exponential growth took off from there, and now we are about to see the same thing again. What is being developed now can be thought of as the evolution of the ledger from punched cards. In farming, you have to water and fertilize on the right dates and as the temperature rises, and that is exactly where machine learning can bring significant progress."
He is not alone in this assessment. Wally Rhines, president and CEO of Mentor, a Siemens business, said: "People will eventually embrace new architectures. New architectures will get designed, and in most cases they will include machine learning, just as your brain learns from experience. I have seen more than 20 companies building their own AI processors, each with a specific purpose. You will see them in more and more applications, and they will eventually challenge the traditional von Neumann architecture. Neuromorphic computing will become mainstream, and it is a big step toward improving compute efficiency, reducing cost, and improving mobility and connectivity."