GPU porting for HPC applications
High performance computing is required for various industrial and research applications. Those applications aim at simulating weather, car crashes, aircraft engines, nuclear reactors, fundamental interactions, building structures, and more.
The TOP500 list ranks the most powerful supercomputers in the world, and one cannot help but notice their huge power consumption, up to tens of MW. This raises two challenges: supplying enough electrical power to run them and dissipating all the heat they generate. Consequently, there is a focus on getting more compute power while keeping power consumption low.
Because power consumption grows faster than linearly with clock frequency (dynamic power scales with the frequency times the square of the supply voltage, and the voltage has to rise with the frequency), the trend in recent years has been to reduce clock speed and add more compute cores. At the extreme of this idea are GPUs (Graphics Processing Units), which are composed of up to several thousand cores clocked at around 1.5 GHz. For comparison, a classical CPU is composed of a few tens of cores clocked between roughly 2 GHz and 3 GHz. A single GPU core is much less powerful than a CPU core, but there are many of them. This allows great compute power with relatively low power consumption and heat dissipation.
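A first-order model makes this trade-off concrete. The sketch below assumes the textbook relation for the dynamic power of CMOS logic and, as a simplification, a supply voltage proportional to the clock frequency. Real chips have a voltage floor and static leakage, so the effective exponent in practice is lower than the idealized one shown here. The symbols are: activity factor \alpha, switched capacitance C, supply voltage V, clock frequency f, and core count N.

```latex
\[ P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f \]
\[ V \propto f \;\Longrightarrow\; P_{\mathrm{dyn}} \propto f^{3} \]
\[ \text{one core at } f: \quad \text{throughput} \propto f, \qquad P \propto f^{3} \]
\[ N \text{ cores at } f/N: \quad \text{throughput} \propto N \cdot \frac{f}{N} = f, \qquad
   P \propto N \left( \frac{f}{N} \right)^{3} = \frac{f^{3}}{N^{2}} \]
```

Under these assumptions, and for a perfectly parallel workload, many slow cores deliver the same throughput as one fast core for a fraction of the power.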
The Green500 list is similar to the TOP500 list, except that it ranks supercomputers by energy efficiency rather than by computing power. In the November 2018 list, eighteen of the twenty top-ranked supercomputers in the Green500 list use GPUs.
GPUs have great potential, though there are two challenges to be addressed when using them:
- The algorithm should be able to spread the workload efficiently across thousands of cores. This requires a high level of parallelism to be exposed.
- The GPU acts as a coprocessor, meaning the CPU offloads work to the GPU and gets the result back at the end of the calculation. This implies data transfers between the CPU and the GPU, which can slow down the overall computation.
ANEO is working with an HPC system integrator to port some applications to the GPU architecture and demonstrate the gain of using GPU-based supercomputers. The focus is on two applications:
- Lattice QCD simulation: simulates the fundamental strong interaction within neutrons or protons.
- Fluid dynamics simulation: simulates fluid dynamics with chemical reactions, such as in a helicopter engine.
Lattice QCD simulation
The lattice QCD application requires some parts of its calculation to be carried out with high precision, higher than 64-bit floating-point numbers allow. The original implementation uses a CPU library for this, which is a bottleneck for two reasons: first, the CPU library implementation is rather slow; second, the data is copied from the GPU to the CPU for the high precision part and then copied back to the GPU for the rest of the computation. This part of the computation is actually one algorithm applied millions of times to different data. GPUs are ideal candidates for this type of workload, as there are many independent, identical tasks to perform and the huge number of GPU compute cores can all work concurrently.
High precision computation is not natively supported by GPUs, nor by most CPUs. One way of doing high precision computation is to represent a number with two floating-point variables: the first variable holds the high part of the number, the second holds the low part. A 64-bit floating-point variable provides about 16 significant decimal digits. Representing a number with two 64-bit floating-point variables therefore provides up to about 32 significant decimal digits.
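The two-variable representation described above is commonly known as double-double arithmetic. The following is a minimal host-side C++ sketch of it, not the library used in the project. The names `dd`, `two_sum`, `two_prod`, `dd_add` and `dd_mul` are hypothetical, and the addition uses a simplified variant that trades a little accuracy for fewer operations. It assumes IEEE 754 doubles and a compiler that is not run with value-unsafe options such as `-ffast-math`.

```cpp
#include <cmath>

// A number represented as hi + lo, with |lo| much smaller than |hi|.
struct dd {
    double hi;
    double lo;
};

// Error-free sum (Knuth's TwoSum): returns s and e such that
// s = fl(a + b) and s + e == a + b exactly.
inline dd two_sum(double a, double b) {
    double s = a + b;
    double bb = s - a;
    double e = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// Error-free product using a fused multiply-add: returns p and e such that
// p = fl(a * b) and p + e == a * b exactly (barring overflow/underflow).
inline dd two_prod(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);
    return {p, e};
}

// Simplified double-double addition: add the high parts exactly,
// fold in the low parts, then renormalize.
inline dd dd_add(dd x, dd y) {
    dd s = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    return two_sum(s.hi, lo);
}

// Double-double multiplication: exact product of the high parts plus the
// cross terms; the lo*lo term is below the target precision and is dropped.
inline dd dd_mul(dd x, dd y) {
    dd p = two_prod(x.hi, y.hi);
    double lo = p.lo + (x.hi * y.lo + x.lo * y.hi);
    return two_sum(p.hi, lo);
}
```

As an illustration, computing (1 + 2^-60) - 1 in plain doubles gives 0, because 2^-60 is below the resolution of a double near 1, whereas `dd_add` keeps the small term in the `lo` component and recovers 2^-60. Each operation uses only ordinary additions, multiplications and one fused multiply-add, which is why the same scheme maps well onto GPU cores.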
An algorithm based on this two-variable number type was implemented using a GPU high precision library. For this use case, it yields a large speedup over the reference CPU implementation and requires fewer CPU/GPU transfers. As a result, this part of the calculation becomes negligible relative to the overall computation time.
Fluid dynamics simulation
GPU optimization sometimes requires rewriting whole parts of an application, which can make the code obscure to non-specialists. This project requires porting the code to the GPU architecture with a non-intrusive method that keeps the code structure as is and keeps the changes comprehensible to non-GPU experts. Directive-based solutions are the tools of choice in this situation.
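To show what a directive-based port can look like, here is a minimal sketch using OpenACC, one example of such a solution. The source does not say which directive standard the project uses, so OpenACC is an assumption here, and the function `axpy` is a hypothetical example rather than code from the application.

```cpp
#include <cstddef>
#include <vector>

// Computes y <- a * x + y. Assumes x and y have the same length.
// The loop body is unchanged from a plain CPU version; only the
// directive above it is added.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();

    // With an OpenACC compiler: copy x to the GPU, copy y in and back out,
    // and run the iterations in parallel on the device.
    // Without OpenACC support the pragma is ignored and the loop runs
    // serially on the CPU with the same result.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The design point is that the original loop stays readable and still compiles for the CPU, which is what makes the approach non-intrusive.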
The application involves complex, nested and chained data structures, which makes it a challenge to copy the data to the GPU and use it there without changing the data structures. Data transfers are handled carefully inside the routines that create, update and delete those structures. This minimizes CPU/GPU communication, since only the data needed for the computation is copied.
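The idea of tying device data management to the routines that create, update and delete a structure can be sketched as follows. This is an illustration under assumptions, not the application's code: the types `Field` and `Block` and all function names are hypothetical, OpenACC unstructured data directives are assumed as the directive mechanism, and the nesting is reduced to a single level.

```cpp
#include <cstddef>

// Hypothetical nested structure: a Block owns several Fields.
struct Field { std::size_t n; double* v; };
struct Block { Field* density; Field* energy; };

// Create: allocate on the host and reserve matching device memory.
// The device copy is not initialized until field_push is called.
Field* field_create(std::size_t n) {
    Field* f = new Field{n, new double[n]()};
    double* v = f->v;
    #pragma acc enter data create(v[0:n])
    return f;
}

// Update: send host values to the device after a host-side change.
void field_push(Field* f) {
    double* v = f->v; const std::size_t n = f->n;
    #pragma acc update device(v[0:n])
}

// Update: bring device results back to the host.
void field_pull(Field* f) {
    double* v = f->v; const std::size_t n = f->n;
    #pragma acc update self(v[0:n])
}

// Delete: release the device copy, then the host memory.
void field_destroy(Field* f) {
    double* v = f->v; const std::size_t n = f->n;
    #pragma acc exit data delete(v[0:n])
    delete[] v;
    delete f;
}

Block* block_create(std::size_t n) {
    return new Block{field_create(n), field_create(n)};
}

void block_destroy(Block* b) {
    field_destroy(b->density);
    field_destroy(b->energy);
    delete b;
}

// Compute routine: no transfer here, the data is already on the device.
void block_scale_energy(Block* b, double a) {
    double* e = b->energy->v; const std::size_t n = b->energy->n;
    #pragma acc parallel loop present(e[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        e[i] *= a;
    }
}
```

Because the compute routine only declares the data as present, only the field that actually changed is transferred, and the structure definitions themselves are untouched. Without OpenACC support the directives are ignored and the code runs on the CPU.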
At each iteration of the algorithm, all nodes involved in the calculation exchange some data. Often, even in GPU-accelerated codes, these data exchanges are handled by the CPU. To illustrate this, consider two nodes named A and B, each composed of a CPU and a GPU. When a data exchange from node A to node B is needed, the data first has to be copied from GPU A to CPU A, then from CPU A to CPU B, and finally from CPU B to GPU B. The added CPU/GPU transfers are very expensive in terms of time, so the implementation should avoid them. This is possible by copying the data directly from GPU A to GPU B, since the GPUs are connected to the network.
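The difference between the two paths can be illustrated with a toy model that runs entirely in host memory and simply counts copies. Everything in it is hypothetical (`Node`, `Traffic`, `exchange_staged`, `exchange_direct`); a real implementation would rely on a GPU-aware communication layer, which the source does not name and which is not shown here.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for one cluster node with separate CPU and GPU memories.
struct Node {
    std::vector<double> cpu_mem;
    std::vector<double> gpu_mem;
};

// Records how many copies were made and how many bytes they moved.
struct Traffic {
    std::size_t hops = 0;
    std::size_t bytes = 0;
};

// One copy over any link (CPU/GPU bus or network), with accounting.
static void copy_buf(const std::vector<double>& src,
                     std::vector<double>& dst, Traffic& t) {
    dst = src;
    t.hops += 1;
    t.bytes += src.size() * sizeof(double);
}

// CPU-staged exchange: GPU A -> CPU A -> CPU B -> GPU B.
void exchange_staged(Node& a, Node& b, Traffic& t) {
    copy_buf(a.gpu_mem, a.cpu_mem, t);
    copy_buf(a.cpu_mem, b.cpu_mem, t);
    copy_buf(b.cpu_mem, b.gpu_mem, t);
}

// Direct exchange: GPU A -> GPU B in a single transfer.
void exchange_direct(const Node& a, Node& b, Traffic& t) {
    copy_buf(a.gpu_mem, b.gpu_mem, t);
}
```

In this model the staged path makes three copies and moves three times as many bytes as the direct path for the same payload. The model says nothing about actual link speeds, which would have to be measured on the target machine.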
Performing all the computation on the GPU and avoiding most CPU/GPU transfers is expected to result in a more efficient application.