A view on future HPC architectures: An introduction to Google’s Tensor Processing Units
Announced in May 2016 at the Google I/O conference, Google’s Tensor Processing Units (TPUs) are a promising AI-oriented architecture. They were originally designed to help achieve business and research breakthroughs in machine learning. They are available to the public through Google’s Cloud Platform, and are compatible with TensorFlow and Google’s own XLA (“Accelerated Linear Algebra”) compiler.
Machine learning models are known to rely on matrix-heavy operations. Seen from that angle, the components of a TPU suggest that it could prove useful for subjects beyond those it was originally designed for.
At Aldwin, our main focus is HPC problems, such as physical simulation, that deal with similar data structures.
The current state of HPC architectures
To begin with, HPC can be seen as a set of concepts and methods for reducing the execution time or the energy consumption of computational algorithms. Improvements can come from software or from hardware: HPC targets both.
Nowadays, hardware is a key component of HPC. Its progress and capabilities can be a limiting factor for software development: users can only select the software implementation that best fits their architecture. From a simple single-core computer to a complex supercomputer using thousands of Central Processing Units (CPUs), the range of performance is very wide.
A CPU can be seen as the core electronic circuitry of a computer, performing the basic arithmetic, logical and control operations specified by a program. Nowadays, many computers use a multi-core processor, which is a single chip containing two or more CPU cores. OpenMP is an example of an API that supports shared-memory multiprocessing programming.
Graphics Processing Units (GPUs) are another type of electronic circuit, designed to be good at computer graphics and image processing. Unlike modern multi-core CPUs, GPUs are made of thousands of relatively small units that run in parallel, which lends itself well to massively parallel computing. Deep learning contributed to the democratization of general-purpose computing on GPUs, beyond the graphics operations they were originally built for.
The Top500 ranking project is a good indicator of how CPUs and GPUs are used in HPC.
Published since June 1993, it ranks the 500 most powerful supercomputer systems in the world; as of June 2018, Summit is the most powerful one. It comes as no surprise that it relies on a hybrid architecture of CPUs (9,216 IBM POWER9) and GPUs (27,648 Nvidia Tesla). Since tools such as CUDA (developed by Nvidia) now make it easy to write programs in which the CPU offloads work to GPUs, this is the most common hybrid combination. For instance, GROMACS, used in molecular dynamics, can run on both architectures for better performance.
Google’s answer to hardware limitations
The previous paragraphs raise the question of hardware limitations in the face of increasingly complex operations, data sets and architectures. The stagnation of Moore’s law, and energy and power becoming the key limiters, add to this.
Two solutions to this problem can be seen for the near future: building even bigger supercomputers, like the Sunway TaihuLight and its 10,649,600 CPU cores, or turning to specialized architectures. Google chose the latter option when they deployed Tensor Processing Units in their datacenters in 2015.
The article “In-Datacenter Performance Analysis of a Tensor Processing Unit” (Norman Jouppi et al.) gives a detailed description of a TPU. Google faced hardware limitations in 2013, when a projection showed that their use of Deep Neural Networks could require doubling their datacenters’ computation capacity. They decided to design a custom ASIC to support it, and within 15 months, TPUs were deployed.
A TPU is designed to be a coprocessor on the PCIe I/O bus, and is closer in spirit to a Floating-Point Unit coprocessor than to a GPU. Its main component is a Matrix Multiply Unit containing 256×256 Multiplier-Accumulators that perform 8-bit multiply-and-adds on signed or unsigned integers. This component is what makes the TPU suited to matrix-heavy operations.
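To make the Matrix Multiply Unit more concrete, here is a minimal Python sketch of the arithmetic it performs: 8-bit integer inputs whose products are summed in a wider accumulator. It models the math only, not the data movement of the real hardware. The function name `int8_matmul` is ours, and the 700 MHz clock used for the peak-throughput estimate is the figure we recall from the Jouppi et al. paper, so treat it as an assumption.

```python
# Sketch of the arithmetic done by a matrix multiply unit: signed 8-bit
# inputs, products summed in a wider accumulator. Illustrative only.

INT8_MIN, INT8_MAX = -128, 127


def int8_matmul(a, b):
    """Multiply two matrices whose entries are signed 8-bit integers.

    Each output element is a chain of multiply-and-adds. Python integers
    never overflow, which stands in for the wider hardware accumulator.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    for matrix in (a, b):
        for row in matrix:
            if any(not INT8_MIN <= v <= INT8_MAX for v in row):
                raise ValueError("entries must fit in a signed 8-bit integer")
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
            out[i][j] = acc
    return out


# Peak-throughput arithmetic for a 256x256 array of multiplier-accumulators.
MACS = 256 * 256                # number of multiply-accumulate cells
CLOCK_HZ = 700e6                # assumption: clock rate quoted in the paper
PEAK_OPS = MACS * 2 * CLOCK_HZ  # one MAC counts as 2 operations per cycle

product = int8_matmul([[50, -100], [127, 25]], [[1, 2], [3, 4]])
print(product)
print(f"peak = {PEAK_OPS / 1e12:.1f} tera-operations per second")
```

Under these assumptions, the 65,536 cells each contribute one multiply and one add per cycle, which is where a peak figure in the range of 90 tera-operations per second comes from.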
In another article, “Quantifying the performance of the TPU, our first machine learning chip”, Google gives performance and energy consumption numbers for their own production AI workloads. On neural network inference, TPUs are said to be 15 to 30 times faster than contemporary GPUs and CPUs, and to deliver a 30 to 80 times improvement in tera-operations per second per Watt (TOPS/Watt).
Exploring new possibilities
The last two sections recalled the basic purpose of HPC. At Aldwin, we believe HPC and AI share a common basis: leveraging advanced computing architectures to make breakthrough discoveries. We are therefore working to bridge the gap between these fields by exploring the use of TPUs to run a non-ML application: SeWaS.
Aldwin developed a Seismic Wave Simulator (SeWaS) following a fully task-based model. Like many other simulation applications, it relies on computational grids, represented as multidimensional arrays.
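One way to see why a grid-based simulation might map onto matrix hardware is that a finite-difference stencil applied along one axis of a grid can be written as a product with a banded matrix. The sketch below assumes a 3-point, second-order stencil on a 1D grid. This is an illustrative choice, not necessarily the scheme used in SeWaS, and all function names are hypothetical.

```python
# A 1D finite-difference stencil written two ways: as a loop over grid
# points, and as a product with a banded matrix. Illustrative assumption,
# not taken from the SeWaS code.


def laplacian_stencil(u):
    """Apply u[i-1] - 2*u[i] + u[i+1] to the interior points of the grid."""
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):
        out[i] = u[i - 1] - 2.0 * u[i] + u[i + 1]
    return out


def laplacian_matrix(n):
    """Build the banded matrix whose product with u gives the same result."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        m[i][i - 1], m[i][i], m[i][i + 1] = 1.0, -2.0, 1.0
    return m


def matvec(m, u):
    """Dense matrix-vector product, the kind of operation matrix units run."""
    return [sum(m_ij * u_j for m_ij, u_j in zip(row, u)) for row in m]


u = [float(i * i) for i in range(6)]  # samples of x^2 on a unit grid
by_stencil = laplacian_stencil(u)
by_matrix = matvec(laplacian_matrix(len(u)), u)
print(by_stencil, by_matrix)
```

The matrix form does more arithmetic than the loop, since most entries are zero, so whether it pays off depends on how cheap dense multiply-accumulates are on the target hardware.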
Moreover, its model is quite similar to that of TensorFlow, the main API used by Google to write code for TPUs: SeWaS also uses nodes as operations and edges as data flow. Using TensorFlow’s C++ API, we are rewriting the original SeWaS code and trying to port it to TPUs, to benefit from the Matrix Multiply Unit and the other specifics of this architecture.
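The "nodes as operations, edges as data flow" model can be illustrated with a small sketch in plain Python. It reproduces neither the TensorFlow API nor the SeWaS runtime. The `Graph` class and the node names are hypothetical, and it relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Graph:
    """Tiny dataflow graph: nodes hold operations, edges carry values."""

    def __init__(self):
        self.ops = {}     # node name -> callable
        self.inputs = {}  # node name -> names of the upstream nodes

    def add(self, name, op, *inputs):
        """Register an operation and the nodes whose outputs it consumes."""
        self.ops[name] = op
        self.inputs[name] = inputs

    def run(self, feeds):
        """Evaluate each node after the nodes it depends on."""
        values = dict(feeds)
        deps = {name: set(ins) for name, ins in self.inputs.items()}
        for name in TopologicalSorter(deps).static_order():
            if name in values:  # fed from outside, nothing to compute
                continue
            args = [values[i] for i in self.inputs[name]]
            values[name] = self.ops[name](*args)
        return values


# Hypothetical two-node graph: scale a velocity field, then add an offset.
g = Graph()
g.add("scaled", lambda v, dt: [dt * x for x in v], "velocity", "dt")
g.add("position", lambda p, s: [a + b for a, b in zip(p, s)], "start", "scaled")
result = g.run({"velocity": [1.0, 2.0], "dt": 0.5, "start": [0.0, 10.0]})
print(result["position"])
```

This sketch runs the nodes one after another. A task-based runtime can instead schedule nodes that do not depend on each other in parallel, which is what makes the model attractive for HPC.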
We will give more details (C++ design and implementation, hardware limitations and benchmarks) about this project in an upcoming blog article.
https://www.top500.org/ – Top500
https://www.ibm.com/thought-leadership/summit-supercomputer/ – Summit supercomputer
https://www.top500.org/system/178764 – Sunway TaihuLight supercomputer
https://fr.wikipedia.org/wiki/Application-specific_integrated_circuit – ASIC
https://github.com/aneoconsulting/SeWaS – SeWaS
https://github.com/tensorflow/tensorflow – TensorFlow