

#### Green Flash

#### High performance computing for real-time science

#### Contribution from Observatoire de Paris on WP4 Final Design Review, April 6<sup>th</sup> 2018



Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014



### WP 4 : Accelerators for realtime HPC

Assess various HW accelerator options on a realtime application

- GPU: led by OdP with contribution from UoD
- Xeon Phi: led by UoD
- FPGA: led by UoD with contribution from OdP

Assess performance of same hardware on complex data pipeline

- Supervisor module for AO: led by OdP
- Criterion optimization and large matrix inversion





































Prototype using latest generation GPU cluster








#### System architecture

#### Master-Slave approach



1. Send frame to master devices
2. Send slopes to each slave device
3. Compute assigned slopes & MVM
4. Gather and sum all command vectors
5. Send back command vector to interface
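The scatter, compute, and gather steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the slopes are taken as already computed, the command matrix is split by contiguous column blocks (one block per slave), and the names `split_ranges`, `slave_partial_mvm`, and `master_slave_mvm` are hypothetical, not taken from the project code.

```python
def split_ranges(n, parts):
    """Split range(n) into `parts` contiguous, near-equal (start, stop) chunks."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def slave_partial_mvm(cmat, slopes, start, stop):
    """Step 3: one slave multiplies its column block by its assigned slopes."""
    return [sum(row[j] * slopes[j] for j in range(start, stop)) for row in cmat]


def master_slave_mvm(cmat, slopes, n_slaves):
    """Steps 2 to 5: scatter slopes, compute partial MVMs, gather and sum."""
    commands = [0.0] * len(cmat)
    for start, stop in split_ranges(len(slopes), n_slaves):      # step 2
        partial = slave_partial_mvm(cmat, slopes, start, stop)   # step 3
        commands = [c + p for c, p in zip(commands, partial)]    # step 4
    return commands                                              # step 5


# Tiny example: 2 actuators, 4 slopes, 3 slaves (assumed sizes).
cmat = [[1.0, 2.0, 3.0, 4.0],
        [0.5, 0.0, -1.0, 2.0]]
slopes = [1.0, -1.0, 2.0, 0.5]
reference = [sum(c * s for c, s in zip(row, slopes)) for row in cmat]
result = master_slave_mvm(cmat, slopes, n_slaves=3)
print(result, reference)
```

Because each slave only needs its own column block and its own slice of the slope vector, the only data exchanged after the scatter is one partial command vector per slave.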









#### Standard GPU programming implementation





#### Persistent kernels and DMA












Profile is dominated by MVM

- The master GPU receives the data from the WFS (simplest datapath), computes the slopes and distributes them over the slave GPUs
- With several nodes, data from the WFS is shared between the two node masters, but a single RTC master collects the data







### **Optimizing GPU-FPGA sync**

#### FPGA writes/reads directly to/from GPU memory

Using only writes would be better, though.












#### **Optimizing GPU-FPGA sync**



Little to no improvement, but the CPU is freed for other kinds of computation.




#### Tested various system configurations

Tests were performed on a DGX-1 platform (only 1 GPU for SCAO).

| Name | N slopes | N actuators | Goal frame rate |
|------|----------|-------------|-----------------|
| SCAO | 10048    | 5316        | 1000 FPS        |
| LTAO | 60288    | 5316        | 500 FPS         |
| MCAO | 60288    | 15316       | 500 FPS         |
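To put the table in perspective, the sustained multiply-accumulate rate each configuration requires can be estimated as slopes × actuators × frame rate. The short sketch below does that arithmetic. It assumes one dense MVM per frame and ignores slope computation and data transfers; the function name `required_gmacs` is illustrative only.

```python
# (N slopes, N actuators, goal frame rate in FPS), values from the table above.
configs = {
    "SCAO": (10048, 5316, 1000),
    "LTAO": (60288, 5316, 500),
    "MCAO": (60288, 15316, 500),
}


def required_gmacs(n_slopes, n_actuators, fps):
    """MAC operations per second for one dense MVM per frame, in GMAC/s."""
    return n_slopes * n_actuators * fps / 1e9


for name, (n_slopes, n_actuators, fps) in configs.items():
    print(f"{name}: {required_gmacs(n_slopes, n_actuators, fps):.1f} GMAC/s")
# SCAO: 53.4 GMAC/s
# LTAO: 160.2 GMAC/s
# MCAO: 461.7 GMAC/s
```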









#### Initial results: SCAO

#### Persistent kernels versus standard kernels





#### Initial results: LTAO/MCAO

#### LTAO running on 4 GPUs, MCAO on 8 GPUs





#### SCAO case is not large enough to feed a GPU!





#### Initial results: throughput

Reaching "only" 731 GMAC/s today (8xP100).

A 2x speedup can be expected from newer GPU generations within 1-2 years (faster HBM).
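The link between throughput and memory speed can be made explicit with a rough estimate. The sketch below assumes the MVM is memory-bound, that the command matrix is stored in single precision (4 bytes per element), and that each element is read once per MAC; none of these assumptions is stated on the slide, and the function name is illustrative.

```python
def per_gpu_bandwidth_gbs(gmacs, n_gpus, bytes_per_element=4):
    """Matrix traffic per GPU in GB/s, assuming one matrix element read per MAC."""
    return gmacs * bytes_per_element / n_gpus


bandwidth = per_gpu_bandwidth_gbs(gmacs=731, n_gpus=8)
print(f"{bandwidth:.1f} GB/s per GPU")  # 365.5 GB/s per GPU
```

Under these assumptions the figure can be compared directly with the memory bandwidth the vendor quotes for each GPU generation.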














Supervisory module: uses the output data stream from the RT pipeline to re-optimize the control matrix.

Two stages: function optimization (gradient descent) and Cholesky inversion, up to 100 TFLOP/s.









Mix of cost function optimization for parameter identification ("Learn" process) and linear algebra for reconstructor matrix computation ("Apply" process)





Parameter identification ("Learn" process)

- Fitting the measurement covariance matrix to a model including system and turbulence parameters
- Using a score function

$$F(x) = \sum_{k=1}^{N^2} [Cmm_k - f_k(x)]^2$$

- Levenberg-Marquardt algorithm for function optimization
- Example of turbulence profile reconstruction


- Dual stage process (5 layers + 40 layers)
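As an illustration of the least-squares fit described above, here is a minimal Levenberg-Marquardt sketch in plain Python. It minimizes the same kind of score function, a sum of squared differences between data and model, but on a toy two-parameter exponential model with synthetic data rather than on a covariance matrix model. The model, the data, and the names `model`, `cost`, and `lm_fit` are all hypothetical stand-ins.

```python
import math


def model(x, t):
    """Toy model f_k(x) = a * exp(-b * t_k); stands in for the covariance model."""
    a, b = x
    return [a * math.exp(-b * tk) for tk in t]


def cost(x, t, y):
    """Score function F(x) = sum_k [y_k - f_k(x)]^2."""
    return sum((yk - fk) ** 2 for yk, fk in zip(y, model(x, t)))


def lm_fit(x0, t, y, lam=1e-3, max_iter=200, tol=1e-24):
    """Levenberg-Marquardt with multiplicative damping on the diagonal of J^T J."""
    x = list(x0)
    f_cur = cost(x, t, y)
    for _ in range(max_iter):
        if f_cur < tol or lam > 1e12:
            break
        a, b = x
        r = [yk - fk for yk, fk in zip(y, model(x, t))]       # residuals
        j0 = [math.exp(-b * tk) for tk in t]                  # df/da
        j1 = [-a * tk * math.exp(-b * tk) for tk in t]        # df/db
        h00 = sum(u * u for u in j0)
        h01 = sum(u * v for u, v in zip(j0, j1))
        h11 = sum(v * v for v in j1)
        g0 = sum(u * rk for u, rk in zip(j0, r))
        g1 = sum(v * rk for v, rk in zip(j1, r))
        d00, d11 = h00 * (1.0 + lam), h11 * (1.0 + lam)       # damped diagonal
        det = d00 * d11 - h01 * h01
        step = [(g0 * d11 - h01 * g1) / det, (d00 * g1 - h01 * g0) / det]
        trial = [a + step[0], b + step[1]]
        f_trial = cost(trial, t, y)
        if f_trial < f_cur:          # accept the step, trust the model more
            x, f_cur = trial, f_trial
            lam = max(lam / 10.0, 1e-12)
        else:                        # reject the step, increase damping
            lam *= 10.0
    return x, f_cur


t = [0.5 * k for k in range(9)]
y = model([2.0, 0.7], t)             # noise-free synthetic data
x_fit, f_fit = lm_fit([1.0, 0.1], t, y)
print(x_fit, f_fit)
```

A dual stage fit of the kind mentioned on the slide would presumably call such a routine twice, using the result of the coarse pass to initialize the fine one; that chaining is not shown here.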






Performance for parameter identification ("Learn" process)

Multi-GPU process, including matrix generation and LM fit.

Time to solution for a matrix size of 86k: 240 s (4 minutes)

- First pass (5 layers): 25 s
- Second pass (40 layers): 213 s










Reconstructor matrix computation ("apply" process)

- Compute the tomographic reconstructor matrix from the covariance matrix between the "truth" sensor and the other WFS, and the inverse of the measurement covariance matrix

$R' = Ctm \cdot Cmm_f^{-1}$

- Various methods can be used. "Brute force": direct solver
- Standard LAPACK routine "posv": mostly compute-bound, high level of scalability
- Highly portable code: explore various architectures by using standard vendor-provided math libraries
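The direct-solver approach can be sketched as follows: instead of forming the inverse explicitly, factor the symmetric positive definite measurement covariance matrix with Cholesky and solve one triangular system pair per row of Ctm, which is what a "posv"-type routine does. The plain-Python version below is a didactic stand-in for the vendor libraries, with made-up 3x3 data; the function names are illustrative.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def solve_spd(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(low)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def reconstructor(ctm, cmm):
    """R = Ctm * Cmm^-1. Since Cmm is symmetric, each row r of R solves Cmm r = row of Ctm."""
    low = cholesky(cmm)
    return [solve_spd(low, row) for row in ctm]


# Made-up example data: 3 measurements, 2 "truth" sensor outputs.
cmm = [[4.0, 1.0, 0.0],
       [1.0, 3.0, 1.0],
       [0.0, 1.0, 2.0]]
ctm = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0]]
R = reconstructor(ctm, cmm)
# Check: R * Cmm should reproduce Ctm.
check = [[sum(R[i][k] * cmm[k][j] for k in range(3)) for j in range(3)] for i in range(2)]
print(check)
```

The factorization cost grows with the cube of the matrix size and is dominated by dense matrix-matrix operations, which is consistent with the compute-bound behaviour noted above.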





Performance for reconstructor matrix computation ("apply" process)

Comparing the latest generation of GPU (NVIDIA P100) with the latest generation of Intel Xeon Phi (KNL)

Record time-to-solution on DGX-1 for MAORY / HARMONI at full scale (100k x 100k matrix): 25 s to compute the tomographic reconstructor









Performance evolution over time on different platforms

- Comparing generations of GPUs and CPUs (+ Xeon Phi)





State of the art performance on NVIDIA DGX-1 with V100

- 1.6x speedup versus P100, using the BLAS library from KAUST





Time to solution when computing 16 and 32 tomographic reconstructors in parallel

- 10 s per reconstructor with P100 and 7.5 s with V100
- Brute force computation of the optimal M4 control matrix (averaging over the FoV) is feasible within a few minutes
- Here again, we demonstrate that typical system scales are not large enough to feed the newest generations of GPUs with workload efficiently











Task 4.1 (OdP):

- D4.1: GPU cluster for RT-box design and test report (OdP, M6, submitted)
- D4.2: GPU cluster for RT-box prototype (OdP, M24, submitted)

Task 4.2 (OdP):

- D4.3: GPU cluster for supervisor design and test report (OdP, M6, submitted)

Task 4.3 (UoD):

- D4.4: Intel Xeon Phi cluster for RT-box prototype design and test report (UoD, submitted)
- D4.5: Intel Xeon Phi cluster for RT-box prototype (UoD, M24, delayed to M30)

Task 4.4 (UoD):

- D4.6: FPGA cluster for RT-box prototype design and test report (UoD, M24, submitted)
- D4.7: FPGA cluster for RT-box prototype (UoD, M36)






