# EulerP Tests
Test machine: 2× Intel Xeon Gold 6326 @ 2.90 GHz, 2× NVIDIA A100 80GB PCIe, no P2P GPU access.
AMD CPU server: 2× AMD EPYC 9754.
## Periodic shock box
Test solver: 2nd order + Barth limiter + SSPRK3, CFL = 0.5.
Performance is measured in cell-iterations per second (CI/s): number of cells × time-step iterations per second.
Grid: 1024² cells.
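For scale (arithmetic from the numbers below, not a separate measurement): the 7.5 MCI/s measured on one A100 corresponds to

$$\frac{7.5\times10^{6}\ \mathrm{CI/s}}{1024^{2}\ \mathrm{cells}} \approx 7.2\ \mathrm{time\ steps/s}.$$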
| Machine | Performance (CI/s) | Power, estimated by software (W) | Efficiency (MCI/kJ) |
|---|---|---|---|
| 1 A100 | 7.5M | 120 (GPU) + 175 (CPU package) + 25 (RAM) = 320 | 23.4 |
| 2 A100 | 14.7M | 120 (GPU) × 2 + 175 (CPU package) + 25 (RAM) = 440 | 33.4 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
Using 16 OMP threads × 2 ranks performs nearly the same as (slightly worse than) 32 ranks.
Figure: GPU Python profile results.

Figure: CPU Python profile results.

Figure: Nsight Systems (nsys) results.
Huge RX (host-to-device memcpy) traffic at 6-8 GB/s. Why?
GPU occupancy (nvtop): only ~60-70%.
Problem: unintended `to_device()` calls when initializing (re-checking) the face buffer.
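A minimal sketch of the failure mode, with hypothetical names (the project's actual `to_device()` wrapper and face-buffer code are not shown here): a recheck helper that unconditionally uploads the host copy runs every step, producing exactly the kind of steady host-to-device stream nsys showed.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical face-buffer handle; field and function names are illustrative.
struct FaceBuffer {
    double* h_data = nullptr;   // host copy
    double* d_data = nullptr;   // device copy
    std::size_t n = 0;
    bool device_valid = false;  // is the device copy up to date?
};

// Bad: called from the per-step recheck path, this re-uploads every time,
// showing up as multi-GB/s of RX traffic.
void recheck_faces_bad(FaceBuffer& fb) {
    cudaMemcpy(fb.d_data, fb.h_data, fb.n * sizeof(double),
               cudaMemcpyHostToDevice);
}

// Fixed: upload only when the device copy is actually stale.
void recheck_faces_fixed(FaceBuffer& fb) {
    if (!fb.device_valid) {
        cudaMemcpy(fb.d_data, fb.h_data, fb.n * sizeof(double),
                   cudaMemcpyHostToDevice);
        fb.device_valid = true;  // one-time upload
    }
}
```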
### Fixing extra to_device
Fixed:
| Machine | Performance (CI/s) | Power, estimated by software (W) | Efficiency (MCI/kJ) |
|---|---|---|---|
| 1 A100 | 22.7M | 195 (GPU) + 165 (CPU package) + 20 (RAM) = 380 | 59.7 |
| 2 A100 | 39.8M | 180 (GPU) × 2 + 170 (CPU package) + 20 (RAM) = 550 | 72.3 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
GPU occupancy (nvtop): ~90%; RX/TX down to a few MB/s.
2 GPUs vs. 1 GPU: 88% strong-scaling efficiency (39.8M / (2 × 22.7M) ≈ 0.88).
### Optimized RecGradient and RecFace2nd
Primary optimization: a local (per-thread) cache.
With only `RecGradient` optimized, compare staging the output through shared memory (a shared shuffle write) against writing directly to global memory:
- Shared write (to the 3×5 gradient) vs. direct write, total: 36.4 MCI/s vs. 35.8 MCI/s
- `RecGradient` alone, 2677 iterations: 8.79 s vs. 9.99 s
Around a 10% improvement in the kernel time. A sketch of the two write strategies follows.
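An illustrative sketch, not the project's actual kernels. The assumption here is an AoS output layout, `grad[cell * NCOMP + c]`, where a direct per-thread write is stride-15 across the warp; the shared variant stages a block tile in `__shared__` memory and flushes it contiguously:

```cpp
constexpr int NCOMP = 15;   // 3 x 5 gradient entries per cell
constexpr int BLOCK = 128;

// Stand-in for the real reconstruction arithmetic.
__device__ double compute_component(int cell, int c) {
    return cell + 0.1 * c;
}

// Direct write: each thread scatters its 15 values, stride-NCOMP apart.
__global__ void rec_gradient_direct(double* __restrict__ grad, int ncells) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncells) return;
    for (int c = 0; c < NCOMP; ++c)
        grad[cell * NCOMP + c] = compute_component(cell, c);
}

// Shared shuffle write: stage the block's output, then flush coalesced.
__global__ void rec_gradient_shared(double* __restrict__ grad, int ncells) {
    __shared__ double tile[BLOCK * NCOMP];
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < ncells)
        for (int c = 0; c < NCOMP; ++c)
            tile[threadIdx.x * NCOMP + c] = compute_component(cell, c);
    __syncthreads();  // every thread reaches this, OOB ones included
    int base = blockIdx.x * blockDim.x;            // first cell of this block
    int todo = min(BLOCK, ncells - base) * NCOMP;  // valid doubles in the tile
    for (int i = threadIdx.x; i < todo; i += blockDim.x)
        grad[base * NCOMP + i] = tile[i];          // contiguous, coalesced stores
}
```

The arithmetic is identical in both; only the store pattern differs, which is consistent with the ~10% kernel-time difference measured above.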
**Pitfall.** If the buffer-write function uses `__shared__` + `__syncthreads()` internally, every thread in the block must reach the barrier, so you might be tempted to handle out-of-bounds (OOB) threads like this:
```cpp
int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
if (tid_global >= max)
{
    write_data(dummy_data); // keep OOB threads at the barrier inside write_data
    return;
}
// ... do the calculation ...
write_data(real_data);
```
This handles OOB threads, but it leaves two call sites: if `write_data` is templated or inlined, the compiler can instantiate separate copies whose `__shared__` buffers (and barriers) no longer match, and the block can diverge.
Safe pattern:

```cpp
int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
t_buffer real_data{};     // default-initialized; harmless dummy for OOB threads
if (tid_global < max)
{
    // ... do the calculation, fill real_data ...
}
write_data(real_data);    // single call site: every thread reaches the same barrier
```
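For concreteness, a minimal sketch of what such a `write_data` helper could look like (hypothetical signature; the project's real helper, which takes a single argument, is not shown). The `__syncthreads()` inside is exactly why every thread, OOB ones included, must make the call:

```cpp
constexpr int NCOMP = 15;
constexpr int BLOCK = 128;

struct t_buffer { double v[NCOMP]; };

// Block-cooperative staged write; assumes all BLOCK threads call it together.
__device__ void write_data(const t_buffer& data, double* out, int ncells) {
    __shared__ double tile[BLOCK * NCOMP];
    for (int c = 0; c < NCOMP; ++c)
        tile[threadIdx.x * NCOMP + c] = data.v[c];
    __syncthreads();  // undefined behavior if any thread of the block skipped the call
    int base = blockIdx.x * blockDim.x;
    int todo = min(BLOCK, ncells - base) * NCOMP;  // OOB threads' dummy data is never flushed
    for (int i = threadIdx.x; i < todo; i += blockDim.x)
        out[base * NCOMP + i] = tile[i];
}
```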
With both `RecGradient` and `RecFace2nd` optimized, performance:
| Machine | Performance (CI/s) | Power, estimated by software (W) | Efficiency (MCI/kJ) |
|---|---|---|---|
| 1 A100 | 58.1M | 237 (GPU) + 170 (CPU package) + 21 (RAM) = 428 | 136 |
| 2 A100 | 84.5M | 185 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 566 | 149 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
Occupancy: 86% (1 GPU) / 71% (2 GPUs).
### Write coalescing optimized
| Machine | Performance (CI/s) | Power, estimated by software (W) | Efficiency (MCI/kJ) |
|---|---|---|---|
| 1 A100 | 72.3M | 245 (GPU) + 170 (CPU package) + 21 (RAM) = 436 | 166 |
| 2 A100 | 98.2M | 195 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 586 | 168 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
| 64 CPU cores (AMD) | 10.9M | 384 (CPU package) + 45? (RAM) = 429? | 25.41 |
| 128 CPU cores (AMD) | 17.2M | 524 (CPU package) + 45? (RAM) = 569? | 30.23 |
| 256 CPU cores (AMD) | 19.6M | 570 (CPU package) + 45? (RAM) = 615? | 31.87 |
Occupancy: 83% (1 GPU) / 66% (2 GPUs).
Larger case, 256³ cells (3D, more work per cell):
| Machine | Performance (CI/s) | Power, estimated by software (W) | Efficiency (MCI/kJ) |
|---|---|---|---|
| 1 A100 | 60.9M | 298 (GPU) + 170 (CPU package) + 21 (RAM) = 489 | 124 |
| 2 A100 | 78.1M | 230 (GPU) × 2 + 178 (CPU package) + 22 (RAM) = 660 | 118 |
| 32 CPU cores | 3.29M | 370 (CPU package) + 45 (RAM) = 415 | 7.93 |
Occupancy: 99% (1 GPU) / 90% (2 GPUs).