Introduction to Google’s Tensor Processing Units

Written by Vincent Piumi on 11 July 2019

Announced in May 2016 at the Google I/O conference, Google’s Tensor Processing Units (TPUs) are a promising new AI-oriented architecture. They were originally designed to help achieve business and research breakthroughs in machine learning, and they are now available to the public via the Google Cloud Platform. They are compatible with TensorFlow and Google’s own XLA (“Accelerated Linear Algebra”) compiler.

It is common knowledge that machine learning models perform matrix-heavy operations. However, looked at from another perspective, the components of Tensor Processing Units suggest that they could prove to be useful in other fields as well.

Our main focus at Aldwin is HPC-related problems, such as physical simulation, that deal with similar data structures.

The current state of HPC architectures

Firstly, HPC can be seen as a set of concepts and methods to decrease the execution time, or improve the energy consumption, of computational algorithms. Such improvements can be software- or hardware-related: HPC targets both.

Nowadays, hardware is a key component of HPC. Its progress and capabilities can be a limiting factor for software development: the user can only select the software implementation that best fits their architecture. From a single-core computer to a complex supercomputer using thousands of Central Processing Units (CPUs), the range of achievable performance is wide.

CPUs versus GPUs

CPUs act as the core electronic circuitry performing the basic arithmetic, logic and control operations specified by a computer program. Nowadays, many computers employ a multi-core processor, a single chip containing two or more processing cores. OpenMP is an example of an API that supports shared-memory multiprocessing programming.
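As a minimal illustration of this shared-memory model (a generic sketch, not code from any application discussed in this article), the loop below is distributed over the available CPU cores with a single OpenMP directive; it would typically be compiled with a flag such as -fopenmp:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    // Each iteration is independent, so OpenMP can split the loop
    // across the threads running on the available CPU cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}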

Graphics Processing Units (GPUs) are another type of electronic circuit. They are designed to be good at manipulating computer graphics and image processing. Unlike modern multi-core CPUs, GPUs are made of thousands of relatively small units that run in parallel, which lends them well to massively parallel computing. Deep learning contributed to the democratization of general-purpose computing on GPUs, beyond their original graphics workloads.

The best indicator of CPU and GPU usage in HPC is probably the Top500 [1] ranking project.

Produced since June 1993, it ranks the 500 most powerful supercomputer systems in the world; as of June 2018, Summit [2] is the most powerful one. It comes as no surprise that it relies on a hybrid architecture of CPUs (9,216 IBM POWER9) and GPUs (27,648 Nvidia Tesla). Since languages such as CUDA (developed by Nvidia) now make it easy to work directly with both architectures, this is the most common combination found. For instance, GROMACS, used in molecular dynamics, can run on both architectures for better performance.

Google’s Tensor Processing Units, an answer to hardware limitations

The previous paragraphs raise the question of hardware limitations when dealing with ever more complex operations, data sets and architectures, not to mention the stagnation of Moore’s law and the fact that energy and power have become the key limiters.

One can see two solutions to this problem in the near future: going for even bigger supercomputers, like the Sunway TaihuLight and its 10,649,600 CPU cores [3], or going for specialized architectures. The latter option was the one chosen by Google in 2015, when they launched Tensor Processing Units.

The article “In-Datacenter Performance Analysis of a Tensor Processing Unit” (Norman Jouppi et al.) gives the best definition of a Tensor Processing Unit. Google faced hardware limitations in 2013: at that time, a projection showed that their Deep Neural Network usage could double their own datacenters’ computation demand. They decided to design a custom ASIC [4] to support it, and within 15 months, TPUs were deployed.

A TPU is designed to be a coprocessor on the PCIe I/O bus. It is closer in spirit to a Floating-Point Unit coprocessor than to a GPU. Its main component is a Matrix Multiply Unit containing 256×256 Multiplier-Accumulators, which can perform 8-bit multiply-and-adds on signed or unsigned integers. This component is what makes the chip well suited to matrix-heavy operations.
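To give an idea of the operation this unit accelerates in hardware, the sketch below shows the equivalent computation in software: an 8-bit integer matrix product with 32-bit accumulation (a simplified illustration, not a model of the actual TPU pipeline):

#include <cstdint>
#include <vector>

// Software equivalent of the 8-bit multiply-and-add matrix product:
// C = A * B, where A and B hold 8-bit integers and the partial products
// are accumulated into 32-bit integers to avoid overflow.
std::vector<int32_t> matmul_int8(const std::vector<int8_t>& A,
                                 const std::vector<int8_t>& B,
                                 int n)  // square n x n matrices, row-major
{
    std::vector<int32_t> C(n * n, 0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                // One multiply-and-add per step; the Matrix Multiply Unit
                // wires 256×256 of these multiplier-accumulators together
                // so that they all work in parallel.
                C[i * n + j] += int32_t(A[i * n + k]) * int32_t(B[k * n + j]);
    return C;
}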

In the article “Quantifying the performance of the TPU, our first machine learning chip”, Google gives performance and energy consumption numbers related to their own production AI workloads. TPUs are said to be 15 to 30 times faster than contemporary GPUs and CPUs for neural network inference. Moreover, they show a 30 to 80 times improvement in tera-operations per Watt of energy consumed.

Exploring new possibilities

The previous paragraphs recalled the basic purpose of HPC. At Aldwin, we believe HPC and AI share a common basis: leveraging advanced computing architectures to make breakthrough discoveries. Consequently, we are moving forward in bridging the gap between these fields by exploring the use of TPUs for running a non-ML application: SeWaS.

Aldwin developed a Seismic Wave Simulator (SeWaS) following a fully task-based model [5]. Like many other simulation applications, it relies on computational grids, represented as multidimensional arrays.
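To make that representation concrete, such a grid is typically stored as a flat, contiguous array with an explicit (i, j, k) indexing scheme, along the lines of the hypothetical sketch below (not actual SeWaS code):

#include <cstddef>
#include <vector>

// Hypothetical 3D grid of scalar values (e.g. one field of a simulation),
// stored contiguously so it maps naturally onto tensor operations.
struct Grid3D {
    std::size_t nx, ny, nz;
    std::vector<float> data;

    Grid3D(std::size_t nx, std::size_t ny, std::size_t nz)
        : nx(nx), ny(ny), nz(nz), data(nx * ny * nz, 0.0f) {}

    // Row-major flattening of the (i, j, k) index.
    float& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[(i * ny + j) * nz + k];
    }
};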

Moreover, it uses a model quite similar to TensorFlow’s [6], which is the main API used by Google to write code for Tensor Processing Units: like TensorFlow, SeWaS uses nodes as operations and edges as data flow. Using TensorFlow’s C++ API, we are rewriting the original SeWaS code and porting it to TPUs, so as to benefit from the Matrix Multiply Unit and the specifics of this architecture.
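As a rough illustration of this programming model (based on the standard TensorFlow C++ example, not an excerpt of the actual SeWaS port), the sketch below builds a tiny graph in which a matrix multiplication is a node and the tensors flowing along the edges are the data; as written it runs on CPU, and targeting a TPU additionally involves the XLA compiler:

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"

int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;

  // Nodes are operations; edges carry tensors between them.
  Scope root = Scope::NewRootScope();
  auto a = Const(root, {{1.f, 2.f}, {3.f, 4.f}});  // 2x2 constant matrix
  auto b = Const(root, {{5.f, 6.f}, {7.f, 8.f}});
  auto product = MatMul(root, a, b);               // matrix-multiply node

  // The session places the graph on an available device and executes it.
  ClientSession session(root);
  std::vector<Tensor> outputs;
  TF_CHECK_OK(session.Run({product}, &outputs));

  LOG(INFO) << outputs[0].matrix<float>();
  return 0;
}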

[1] https://www.top500.org/ – Top500
[2] https://www.ibm.com/thought-leadership/summit-supercomputer/ – Summit supercomputer
[3] https://www.top500.org/system/178764 – Sunway TaihuLight supercomputer
[4] https://fr.wikipedia.org/wiki/Application-specific_integrated_circuit – ASIC
[5] https://github.com/aneoconsulting/SeWaS – SeWaS
[6] https://github.com/tensorflow/tensorflow – TensorFlow