EulerP Tests

The machine: 2× Intel(R) Xeon(R) Gold 6326 CPU @ 2.90 GHz, 2× NVIDIA A100 80 GB PCIe, no P2P GPU access.

AMD CPU server: 2× AMD EPYC 9754.

Periodic shock box

Test solver: 2nd order + Barth limiter + SSPRK3, CFL = 0.5.

Throughput is measured in cell iterations per second (CI/s): number of cells × iterations / wall time. Efficiency is throughput per unit energy, MCI/kJ = (MCI/s) / kW; e.g. 7.5 MCI/s at 320 W gives 7.5 / 0.320 ≈ 23.4 MCI/kJ.

Using 1024^2 cells.

| Machine | Performance (CI/s) | Power, software-estimated (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 7.5M | 120 (GPU) + 175 (CPU package) + 25 (RAM) = 320 | 23.4 |
| 2 A100 | 14.7M | 120 (GPU) × 2 + 175 (CPU package) + 25 (RAM) = 440 | 33.4 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

Using 16 OMP threads × 2 ranks performs nearly the same as (slightly worse than) 32 ranks.

GPU Python profile results: cProfile_GPU

CPU Python profile results: cProfile_CPU

NSYS results: cProfile_GPU

Huge RX (host-to-device memcpy) traffic, 6-8 GB/s. Why?

GPU occupancy (nvtop): ~60-70%!

Problem: unintended to_device() calls while initializing (re-checking) the face buffer.
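
The stray transfers were on the Python side (to_device()), but the failure mode is generic. Below is a minimal CUDA C++ sketch of the same bug class, with all names (step_kernel, face_buf, run) hypothetical: an unchanged buffer is re-uploaded every step, and the fix is simply hoisting the copy out of the time loop.

#include <cuda_runtime.h>
#include <vector>

__global__ void step_kernel(float *faces, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) faces[i] += 1.0f;          // stand-in for the real face update
}

void run(float *h_face_buf, int n_faces, int n_steps)
{
    float *d_face_buf;
    cudaMalloc(&d_face_buf, n_faces * sizeof(float));

    // BUG pattern: re-uploading an unchanged buffer every iteration; this is
    // the kind of loop that shows up as GB/s of sustained host-to-device RX.
    //
    //   for (int s = 0; s < n_steps; ++s) {
    //       cudaMemcpy(d_face_buf, h_face_buf, n_faces * sizeof(float),
    //                  cudaMemcpyHostToDevice);   // unintended "to_device"
    //       step_kernel<<<(n_faces + 255) / 256, 256>>>(d_face_buf, n_faces);
    //   }

    // Fix: copy once, keep the buffer resident on the device.
    cudaMemcpy(d_face_buf, h_face_buf, n_faces * sizeof(float),
               cudaMemcpyHostToDevice);
    for (int s = 0; s < n_steps; ++s)
        step_kernel<<<(n_faces + 255) / 256, 256>>>(d_face_buf, n_faces);

    cudaMemcpy(h_face_buf, d_face_buf, n_faces * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_face_buf);
}

int main()
{
    std::vector<float> faces(1 << 20, 0.0f);
    run(faces.data(), static_cast<int>(faces.size()), 100);
    return 0;
}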

Fixing extra to_device

Fixed:

| Machine | Performance (CI/s) | Power, software-estimated (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 22.7M | 195 (GPU) + 165 (CPU package) + 20 (RAM) = 380 | 59.7 |
| 2 A100 | 39.8M | 180 (GPU) × 2 + 170 (CPU package) + 20 (RAM) = 550 | 72.3 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

GPU occupancy (nvtop): ~90%; RX/TX traffic is down to several MB/s.

2 GPUs vs. 1 GPU: 39.8M / (2 × 22.7M) ≈ 88% strong-scaling efficiency.

Optimized RecGradient and RecFace2nd

Primary optimization: a local cache, i.e. accumulating each cell's intermediate values thread-locally instead of repeatedly touching global memory.

With only RecGradient optimized, compare writing the result through a shared-memory shuffle against writing it directly (a sketch of both follows this list):

  • total: 36.4 MCI/s vs. 35.8 MCI/s

  • RecGradient: 2677 iterations, 8.79 s vs. 9.99 s

The shared write (staging the 3×5 gradient) is around a 10% improvement in kernel time.
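
Neither kernel's source is shown above, so the following is only a hypothetical CUDA sketch of the two ideas, assuming a 3 directions × 5 variables gradient per cell (NG = 15) stored cell-major; rec_gradient_direct, rec_gradient_shared, and all other names are made up. Each thread first accumulates its gradient in registers (the local cache); the variants differ only in how the result reaches global memory.

#define NG 15       // assumed: 3 directions x 5 variables per cell
#define BLOCK 128

// Direct write: thread c stores its NG values at stride NG, so within a
// warp the stores for a given component k are NG floats apart.
__global__ void rec_gradient_direct(const float *in, float *grad, int n_cells)
{
    int c = blockDim.x * blockIdx.x + threadIdx.x;
    if (c >= n_cells) return;
    float g[NG];                          // local cache in registers
    for (int k = 0; k < NG; ++k)
        g[k] = in[c] * (k + 1);           // stand-in for the reconstruction
    for (int k = 0; k < NG; ++k)
        grad[c * NG + k] = g[k];          // strided, poorly coalesced
}

// Shared-shuffle write: stage the block's gradients in __shared__ memory,
// then flush so that consecutive threads store to consecutive addresses.
__global__ void rec_gradient_shared(const float *in, float *grad, int n_cells)
{
    __shared__ float buf[BLOCK * NG];
    int c = blockDim.x * blockIdx.x + threadIdx.x;
    float g[NG] = {0.0f};                 // dummy values for OOB threads
    if (c < n_cells) {
        for (int k = 0; k < NG; ++k)
            g[k] = in[c] * (k + 1);
    }
    for (int k = 0; k < NG; ++k)
        buf[threadIdx.x * NG + k] = g[k];
    __syncthreads();                      // every thread reaches this (see Pitfall below)
    // buf[j] corresponds to grad[blockIdx.x * BLOCK * NG + j]; flush it with
    // consecutive threads writing consecutive floats.
    for (int j = threadIdx.x; j < BLOCK * NG; j += BLOCK) {
        int idx = blockIdx.x * BLOCK * NG + j;
        if (idx < n_cells * NG)           // tail of the last block
            grad[idx] = buf[j];
    }
}

Note that the out-of-bounds guard sits on the computation and on the final store, not on an early return, for exactly the reason discussed in the Pitfall below.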

Pitfall

If the buffer-write function uses __shared__ memory plus __syncthreads() internally, the tempting way to handle out-of-bounds (OOB) threads is an early exit:

int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
if (tid_global >= max)
{
  write_data(dummy_data);  // OOB threads push a dummy value...
  return;                  // ...and leave the block early
}
do_calculation...;
write_data(real_data);     // in-bounds threads hit a barrier the OOB threads never reach

This is broken: once the OOB threads have returned, the __syncthreads() inside the in-bounds threads' final write_data call is no longer reached by every thread of the block, which is undefined behavior; and if write_data is templated or inlined, the two call sites' __shared__ buffers can diverge.

Safe pattern:

int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
t_buffer real_data;        // left with dummy contents by OOB threads
if (tid_global < max)
{
  do_calculation...;       // only in-bounds threads compute
}
write_data(real_data);     // every thread reaches the same call, so the barrier inside is uniform
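
To make the safe pattern concrete, here is a minimal self-contained CUDA example (all names hypothetical, not the solver's code): every thread, in-bounds or not, calls write_data exactly once, so the barrier inside it is reached uniformly, and the OOB guard moves onto the final store instead.

#include <cstdio>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;

// Stand-in for the real buffer write: stages one value per thread in
// shared memory, synchronizes, then stores. The __syncthreads() means
// ALL threads of the block must call this function.
__device__ void write_data(float *out, float v, int tid_global, int n)
{
    __shared__ float buf[BLOCK];
    buf[threadIdx.x] = v;
    __syncthreads();               // reached by every thread, OOB included
    if (tid_global < n)            // the OOB guard lives on the store
        out[tid_global] = buf[threadIdx.x];
}

__global__ void kernel(float *out, int n)
{
    int tid_global = blockDim.x * blockIdx.x + threadIdx.x;
    float real_data = 0.0f;        // dummy value for OOB threads
    if (tid_global < n)
        real_data = 2.0f * tid_global;  // "do_calculation" stand-in
    write_data(out, real_data, tid_global, n);  // unconditional: no barrier divergence
}

int main()
{
    const int n = 1000;            // deliberately not a multiple of BLOCK
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    kernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_out, n);
    float h[4];
    cudaMemcpy(h, d_out, sizeof h, cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", h[0], h[1], h[2], h[3]);
    cudaFree(d_out);
    return 0;
}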

With both RecGradient and RecFace2nd optimized:

| Machine | Performance (CI/s) | Power, software-estimated (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 58.1M | 237 (GPU) + 170 (CPU package) + 21 (RAM) = 428 | 136 |
| 2 A100 | 84.5M | 185 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 566 | 149 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |

Occupancy: 86% (1 GPU) / 71% (2 GPUs).

Write coalescing optimized

| Machine | Performance (CI/s) | Power, software-estimated (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 72.3M | 245 (GPU) + 170 (CPU package) + 21 (RAM) = 436 | 166 |
| 2 A100 | 98.2M | 195 (GPU) × 2 + 175 (CPU package) + 21 (RAM) = 586 | 168 |
| 32 CPU cores | 4.2M | 360 (CPU package) + 45 (RAM) = 405 | 10.37 |
| 64 CPU cores (AMD) | 10.9M | 384 (CPU package) + 45? (RAM) = 429? | 25.41 |
| 128 CPU cores (AMD) | 17.2M | 524 (CPU package) + 45? (RAM) = 569? | 30.23 |
| 256 CPU cores (AMD) | 19.6M | 570 (CPU package) + 45? (RAM) = 615? | 31.87 |

Occupancy: 83% (1 GPU) / 66% (2 GPUs).
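
The layout change behind this step isn't spelled out above; purely as a reference for what a coalesced write means here, below is a hypothetical CUDA sketch contrasting array-of-structures (AoS) and structure-of-arrays (SoA) stores. NVAR = 5 conserved variables and both layouts are assumptions, not the solver's actual code.

#define NVAR 5   // assumed: 5 conserved variables per cell

// Uncoalesced: array-of-structures. Thread c writes u[c*NVAR + v], so a
// warp's stores for a given v are NVAR floats apart and each warp touches
// several cache lines per variable.
__global__ void write_aos(float *u, int n_cells)
{
    int c = blockDim.x * blockIdx.x + threadIdx.x;
    if (c >= n_cells) return;
    for (int v = 0; v < NVAR; ++v)
        u[c * NVAR + v] = 1.0f;     // stand-in for the computed state
}

// Coalesced: structure-of-arrays. For each variable v, consecutive threads
// write consecutive addresses u[v*n_cells + c].
__global__ void write_soa(float *u, int n_cells)
{
    int c = blockDim.x * blockIdx.x + threadIdx.x;
    if (c >= n_cells) return;
    for (int v = 0; v < NVAR; ++v)
        u[v * n_cells + c] = 1.0f;
}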

Larger case 256^3 (3D, more work per cell)

| Machine | Performance (CI/s) | Power, software-estimated (W) | Efficiency (MCI/kJ) |
| --- | --- | --- | --- |
| 1 A100 | 60.9M | 298 (GPU) + 170 (CPU package) + 21 (RAM) = 489 | 124 |
| 2 A100 | 78.1M | 230 (GPU) × 2 + 178 (CPU package) + 22 (RAM) = 660 | 118 |
| 32 CPU cores | 3.29M | 370 (CPU package) + 45 (RAM) = 415 | 7.93 |

Occupancy: 99% (1 GPU) / 90% (2 GPUs).