DNDSR 0.1.0.dev1+gcd065ad
Distributed Numeric Data Structure for CFV
For simplicity, the single-block mesh supports only 1-to-1 definition of periodic BCs; that is, the "main" and "donor" are the same boundary meshes up to the given rigid-body transformation.
The periodic donor and main must be non-adjacent (sharing no nodes), i.e. no pair in the main-donor node mapping is the identical node. Reasoning: if the periodicity required a shared node, that node would be a singularity point; replace such a point with a small face carrying a symmetry condition.
Then record cell2nodePi and face2nodePi to augment cell2node etc., and:
nodePbi (periodic bits info): bit1-bit2-bit3 means [if peri1][if peri2][if peri3], currently stored in a uint8_t. The bits record whether the node (as seen from the element) was transferred from a periodic group. If the bits are 1-1-1, then p -> transBack_1 -> transBack_2 -> transBack_3 -> p_current, so to obtain the geometrically correct coordinate p, apply the reverse transformations. See Geom::UnstructuredMesh::Deduplicate1to1Periodic() and Geom::PeriodicInfo::GetCoordByBits().
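The bit decoding can be sketched in isolation. This is a hedged illustration of the GetCoordByBits() idea, not the DNDS implementation: the three periodic transforms are modeled as plain translations (the real code stores rigid-body transforms), and `kShift`, `trans`, `transBack`, and `getCoordByBits` are names invented here.

```cpp
#include <array>
#include <cstdint>

// Illustrative stand-ins for the three periodic transforms (translations only).
using Vec3 = std::array<double, 3>;

static const Vec3 kShift[3] = {{1.0, 0.0, 0.0}, {0.0, 2.0, 0.0}, {0.0, 0.0, 3.0}};

// transBack_i: the transform applied during de-duplication.
Vec3 transBack(const Vec3 &p, int i) {
    return {p[0] - kShift[i][0], p[1] - kShift[i][1], p[2] - kShift[i][2]};
}
// trans_i: its inverse.
Vec3 trans(const Vec3 &p, int i) {
    return {p[0] + kShift[i][0], p[1] + kShift[i][1], p[2] + kShift[i][2]};
}

// Recover the geometry-correct coordinate p from the stored p_current:
// p_current = transBack_3(transBack_2(transBack_1(p))) when all bits are set,
// so undo the transBacks in reverse order for every bit set in nodePbi.
Vec3 getCoordByBits(Vec3 pCurrent, uint8_t nodePbi) {
    for (int i = 2; i >= 0; --i)          // undo transBack_3 first
        if (nodePbi & (1u << i))
            pCurrent = trans(pCurrent, i);
    return pCurrent;
}
```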
In the sense of the periodically duplicated (original geometry) mesh, a node within a cell is distinct only when both cell2node(iCell,ic2n) and cell2nodePi(iCell,ic2n) differ; so use both cell2node and cell2nodePi for inner-cell coordinate calculation.
In the sense of the de-duplicated mesh, cell2node(iCell,ic2n) is unique and points to one coordinate.
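The inner-cell coordinate gathering described above can be sketched as follows. All names here are illustrative, not DNDS API: `cell2nodeRow`/`cell2nodePiRow` stand for one row of cell2node/cell2nodePi, and `coordByBits` stands for a GetCoordByBits-style callback.

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Vec3 = std::array<double, 3>;

// Gather the geometric node coordinates of one cell in the de-duplicated
// mesh: cell2node gives the unique node index, and the periodic bits in
// cell2nodePi decide the transform mapping the stored coordinate back to
// this cell's frame.
std::vector<Vec3> cellNodeCoords(
    const std::vector<int> &cell2nodeRow,        // node ids of this cell
    const std::vector<uint8_t> &cell2nodePiRow,  // periodic bits per node
    const std::vector<Vec3> &coords,             // de-duplicated node coords
    Vec3 (*coordByBits)(const Vec3 &, uint8_t))  // GetCoordByBits-style hook
{
    std::vector<Vec3> out(cell2nodeRow.size());
    for (std::size_t i = 0; i < cell2nodeRow.size(); ++i)
        out[i] = coordByBits(coords[cell2nodeRow[i]], cell2nodePiRow[i]);
    return out;
}
```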
To build the interpolation faces: two faces are the same only when both their face2node and face2nodePi entries match.
face2nodePi is obtained from the first cell, just as face2node and faceAtr are obtained from the first cell (a de-duplicated face may lie on either the periodic-donor or the main side).
To query a face's coordinates within a cell, check whether the face is periodic: if it is a donor face, face2cell[1] needs the face trans-backed; if a main face, face2cell[1] needs the face trans-ed; face2cell[0] is always good.
To query the other cell's coordinates through a face: if the face is periodic-main and the current cell is face2cell[0], the other cell needs trans-back; if the current cell is face2cell[1], the other cell needs trans.
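The two rules above can be encoded as a small decision table. This is a hedged sketch: the enums and function names are invented for illustration (only the periodic-main rule for the other cell was recorded above, so the donor case is left as "no action" here).

```cpp
#include <cstdint>

enum class PeriType : uint8_t { NotPeriodic, Main, Donor };
enum class Action : uint8_t { None, Trans, TransBack };

// Which transform to apply to the *face* coordinates when querying them
// from the side face2cell[side] (side = 0 or 1):
Action faceCoordAction(PeriType f, int side) {
    if (f == PeriType::NotPeriodic || side == 0)
        return Action::None;  // face2cell[0] is always good
    return f == PeriType::Donor ? Action::TransBack : Action::Trans;
}

// Which transform to apply to the *other cell's* coordinates when crossing
// a periodic-main face from the cell face2cell[side]:
Action otherCellAction(PeriType f, int side) {
    if (f != PeriType::Main)
        return Action::None;  // only the main-face rule was recorded above
    return side == 0 ? Action::TransBack : Action::Trans;
}
```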
Point-to-point communication in MPI can cause a communication race when the pattern of a collective-like call is very dense (close to an all-to-all) while the machine topology is sparse (as on a typical supercomputer).
Posting all the transfers at once (e.g. a batch of MPI_Isend/MPI_Irecv to every peer before any completes) launches multiple data transfers in MPI simultaneously. That is fine on a dense machine such as a NUMA box, but it causes network racing and saturation on a sparse machine such as a typical multi-node supercomputer. The best practice for communication performance would be an MPI_Alltoall-style exchange, which is implemented with the machine's topology in mind. On the other hand, alltoall is always global and brings extra zero-sized-message communication overhead, which is O(np). So DNDS assumes the communication pattern is sparse enough, and when its density exceeds that of the machine, the current implementation falls back to one-by-one send/recv, i.e. posting and completing one receive at a time. Sending one-by-one is equally effective at resolving the network race.
Properly, chunk-by-chunk transfers should be used on a sparse machine, informed by the local network capacity. But since sparsely decomposed 3-D meshes generally have very small communication overhead, this optimization is deferred.
For the DLR case used for eulerSA3D at d082620525d1d9a07889c9e8c1a9bede70ebe236, on the GS machine, the time is 2.1582/it.
If using nVarsFixed = -1 (instead of the fixed 6), the time is 3.3944/it.
must use residual instead of uinc for convergence monitoring!!!
Run times on HZAU:
- hm3 run: 237m14.899s
- ESDIRK4: 121m49.267s
- BDF2 (3x): 90m18.275s
OpenMPI 3.x seems problematic (module load mpi/openmpi/3.1.6-gcc-9.3.0), at least for large mesh partitioning; each MPI_Barrier can even take 1 s!
Note that when switching MPI (or other libraries), it is best to check the CMake cache entries (safest: restart VSCode or work in a standalone terminal), since they may remain unchanged.
Step time for CylinderA1_L's vortex-street case (internal 40x80):
- with std::pow in the faceFlux function: starts at **~5s**, reduces to **~0.2s** at ts 80
- with std::pow replaced by std::sqrt: starts at **~1.3s**, reduces to **~0.2s** at ts 80
- with std::pow kept, but compiled with Intel icpc: starts at **~0.4s**
- with std::sqrt and Intel icpc: starts at **~0.4s**
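The substitution behind those numbers is mathematically trivial: for x >= 0, std::pow(x, 0.5) equals std::sqrt(x), but std::sqrt typically lowers to a single hardware instruction while a generic std::pow call goes through a slow path in some libm builds. A minimal sketch (the helper names are hypothetical, not the faceFlux code):

```cpp
#include <cmath>

// Hypothetical helpers illustrating the swap described above: identical
// results for nonnegative inputs, very different cost on some compilers.
double fluxScaleSlow(double x) { return std::pow(x, 0.5); } // generic pow path
double fluxScaleFast(double x) { return std::sqrt(x); }     // hardware sqrt
```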
Using a v110 API to enable use with CGNS 4.3.0? Not working?
Using oneAPI's MPI on Windows, in PowerShell:
RHS time (intOrderVR=5, cacheDiffBase=true):

| case | intOrder=5 | intOrder=3 |
|---|---|---|
| IV10U_10_bisect3, serial | 0.112 | 0.079 |
| Uniform32_3D_Periodic, 4 | 0.781 | 0.353 |
Running with OpenMP needs caution: mpirun/srun must allocate enough cores to each process.
For OpenMPI running on a local machine:
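For example (a hedged sketch: the flags are standard Open MPI options and `./app` is a placeholder, not the exact command originally used). Open MPI's default process binding can pin each rank, leaving its OpenMP threads to share one core, so disabling binding and exporting the thread count is one known remedy:

```shell
# Illustrative only: standard Open MPI options; "./app" is a placeholder.
# --bind-to none lets each rank's OpenMP threads spread over the allocated
# cores; -x forwards OMP_NUM_THREADS to the spawned processes.
export OMP_NUM_THREADS=4
mpirun -np 2 --bind-to none -x OMP_NUM_THREADS ./app
```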
Remember to use meshReorderCells to allow local partitions.