DNDSR 0.1.0.dev1+gcd065ad
Distributed Numeric Data Structure for CFV
This guide explains how to checkpoint and restart solver state using DNDSR's serialization system. The key feature is redistribution: a checkpoint written with 4 MPI ranks can be read back on 8 without any special handling in the solver code.
For the internal design see Serialization.
A CFD solver periodically writes its solution to disk so that the computation can resume after a crash or be continued on a different machine. Two difficulties arise:
DNDSR solves both with a two-layer system:
1. **SerializerH5** writes a single .h5 file using MPI-parallel HDF5. Each array is stored as a contiguous global dataset with per-rank offset metadata (the rank_offsets companion dataset).
2. **ArrayPair::WriteSerialize** stores an origIndex vector alongside the data: a partition-independent global cell ID. On read, ReadSerializeRedistributed uses this to pull each rank's cells regardless of how the file was partitioned.

The solver builds a serializer, opens a file, and writes its DOF array. The origIndex vector enables redistribution on read.
(Source: Euler/EulerSolver_PrintData.hxx:844 for the real implementation.)
What goes into the H5 file (under path "u"):
| Dataset | Content |
|---|---|
| u/father/data | The raw DOF values (real vector) |
| u/father/size | Per-rank row count |
| u/father/data::rank_offsets | Cumulative offsets for each rank's data slice |
| u/origIndex | Partition-independent cell IDs |
| u/redistributable | Flag = 1 (enables redistribution on read) |
When the rank count matches the checkpoint, the read is straightforward:
Reading with a different rank count is the main feature. The solver code is nearly identical to the same-rank-count case; only the read call changes from ReadSerialize to ReadSerializeRedistributed.
(Source: Euler/EulerSolver_PrintData.hxx:942 for the real implementation; DNDS/ArrayPair.hpp:479 for the redistribution logic.)
When the file was written with np_old = 4 and is read with np_new = 8:
1. Each rank reads ~N_global / 8 rows of both the data and the origIndex vector. This distribution is balanced but incorrect: the rows do not yet correspond to this rank's mesh cells. (See ParArray::ReadSerializer at DNDS/ArrayTransformer.hpp:165.)
2. RedistributeArrayWithTransformer (at DNDS/ArrayRedistributor.hpp:235) creates a temporary ArrayTransformer. Each rank announces which origIndex values it needs (from newOrigIdx). The transformer's ghost-pull mechanism fetches the corresponding rows from whichever rank holds them after the even-split read.
3. The pulled rows end up on each rank in the order given by newOrigIdx.

This works because origIndex is partition-independent: it does not change between runs or between different rank counts.
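The even-split-then-pull idea can be modelled serially, without MPI or HDF5. The following is a minimal sketch, not DNDSR's implementation: `Row`, `evenSplitRead`, and `pullByOrigIndex` are hypothetical names introduced here, and a plain lookup table stands in for the ArrayTransformer's ghost-pull communication.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// One stored row: the partition-independent cell ID plus its DOF values.
// (Hypothetical type; DNDSR stores data and origIndex as separate arrays.)
struct Row {
    long origIndex;
    std::vector<double> dof;
};

// Hypothetical helper: split the file's rows evenly over npNew ranks,
// ignoring which cells each rank actually owns.
std::vector<std::vector<Row>> evenSplitRead(const std::vector<Row>& fileRows,
                                            std::size_t npNew) {
    assert(npNew > 0);
    std::vector<std::vector<Row>> perRank(npNew);
    const std::size_t n = fileRows.size();
    for (std::size_t r = 0; r < npNew; ++r) {
        const std::size_t begin = n * r / npNew;
        const std::size_t end = n * (r + 1) / npNew;
        for (std::size_t i = begin; i < end; ++i)
            perRank[r].push_back(fileRows[i]);
    }
    return perRank;
}

// Hypothetical helper: one rank requests the rows listed in newOrigIdx and
// receives them in that order. The map plays the role of "whichever rank
// holds the row after the even-split read".
std::vector<Row> pullByOrigIndex(const std::vector<std::vector<Row>>& perRank,
                                 const std::vector<long>& newOrigIdx) {
    std::unordered_map<long, const Row*> holder;
    for (const auto& rows : perRank)
        for (const auto& row : rows)
            holder[row.origIndex] = &row;

    std::vector<Row> result;
    result.reserve(newOrigIdx.size());
    for (long id : newOrigIdx)
        result.push_back(*holder.at(id));  // throws if the ID is not in the file
    return result;
}
```

In this model, the order of rows in the file never matters to the caller: a rank that passes `{6, 2}` as `newOrigIdx` gets the rows for cells 6 and 2, in that order, wherever the even split happened to place them.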
DNDSR supports reading a checkpoint from a different solver variant. For example, an Euler (5-variable) checkpoint can be read into an SA (6-variable) solver. The key is ReadSerializerMeta, which peeks at the stored dimensions without reading data.
(Source: Euler/EulerSolver_PrintData.hxx:1001.)
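Once the stored row width is known, the remaining work is a per-cell copy into wider rows. The sketch below illustrates only that copy, on a flat row-major buffer. `adaptRowWidth` is a hypothetical helper, not part of DNDSR, and the fill value used for the extra variable is an assumption; the real solver may initialize it differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: convert a flat row-major buffer holding storedVars
// values per cell into one holding solverVars values per cell. Variables
// present in both layouts are copied; any extra solver variables are set
// to fillValue (an assumed placeholder).
std::vector<double> adaptRowWidth(const std::vector<double>& stored,
                                  std::size_t storedVars,
                                  std::size_t solverVars,
                                  double fillValue) {
    assert(storedVars > 0 && solverVars > 0);
    assert(stored.size() % storedVars == 0);

    const std::size_t nCells = stored.size() / storedVars;
    const std::size_t nCopy = std::min(storedVars, solverVars);

    std::vector<double> out(nCells * solverVars, fillValue);
    for (std::size_t c = 0; c < nCells; ++c)
        for (std::size_t v = 0; v < nCopy; ++v)
            out[c * solverVars + v] = stored[c * storedVars + v];
    return out;
}
```

Calling `adaptRowWidth(eulerData, 5, 6, fill)` keeps the five Euler variables of every cell and appends one slot per cell holding `fill`.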
For writing/reading individual arrays without the ArrayPair wrapper (e.g. a custom output not tied to solver DOFs):
ArrayGlobalOffset_Parts tells the writer to compute per-rank offsets via MPI_Scan. ArrayGlobalOffset_Unknown tells the reader to auto-detect this rank's slice from the stored rank_offsets companion. See DNDS/SerializerBase.hpp:20 for all offset modes.
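The offset arithmetic behind these modes can be shown serially. This is a sketch under stated assumptions rather than DNDSR code: `inclusiveScan`, `Slice`, and `sliceForRank` are hypothetical names, and the loop reproduces what each rank would obtain from MPI_Scan with MPI_SUM over its local row count. The exact on-disk layout of rank_offsets is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum over per-rank row counts. Entry r equals the value
// rank r would receive from MPI_Scan(MPI_SUM) on its local row count.
std::vector<std::size_t> inclusiveScan(const std::vector<std::size_t>& rowsPerRank) {
    std::vector<std::size_t> scan(rowsPerRank.size());
    std::size_t running = 0;
    for (std::size_t r = 0; r < rowsPerRank.size(); ++r) {
        running += rowsPerRank[r];
        scan[r] = running;
    }
    return scan;
}

// Half-open row range [begin, end) in the global dataset.
struct Slice {
    std::size_t begin;
    std::size_t end;
};

// A rank's slice ends at its inclusive-scan value and starts one local
// row count earlier.
Slice sliceForRank(const std::vector<std::size_t>& rowsPerRank, std::size_t rank) {
    const std::vector<std::size_t> scan = inclusiveScan(rowsPerRank);
    return Slice{scan.at(rank) - rowsPerRank.at(rank), scan.at(rank)};
}
```

With row counts `{3, 5, 0, 2}`, rank 1 owns global rows 3 through 7, and rank 2, which has no rows, gets the empty range starting at 8.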