11. Performance

11.1. Release builds

Use CMAKE_BUILD_TYPE=Release for production simulations. Debug builds are valuable for diagnosis but can be dramatically slower in particle and geometry loops.

11.2. OpenMP

Enable OpenMP at configure time and select the thread count at runtime:

cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DPHONOMC_ENABLE_OPENMP=ON
cmake --build build -j
OMP_NUM_THREADS=32 ./build/PhonoMC input.toml

Do not assume that the largest available thread count is fastest. Compare a short representative run at 1, 2, 4, 8, and higher thread counts. Memory bandwidth, dynamic collision scheduling, grid reductions, and the active-mode table size can limit scaling.
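The comparison described above can be scripted. The following is a minimal sketch, not part of PhonoMC: the helper names time_run, sweep, and report are hypothetical, and it assumes only that the binary honors OMP_NUM_THREADS as shown in the command above.

```python
import os
import subprocess
import time


def time_run(cmd, threads):
    """Run cmd once with OMP_NUM_THREADS=threads; return wall-clock seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    # check=True raises if the run fails, so a crash is not mistaken for a fast run.
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(cmd, thread_counts):
    """Return {threads: seconds} for each requested thread count."""
    return {n: time_run(cmd, n) for n in thread_counts}


def report(timings):
    """Format speedup and efficiency relative to the smallest thread count."""
    base_n = min(timings)
    base_t = timings[base_n]
    lines = []
    for n in sorted(timings):
        speedup = base_t / timings[n]
        efficiency = speedup * base_n / n
        lines.append(f"{n:>4} threads  {timings[n]:8.2f} s  "
                     f"speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
    return lines
```

A typical call would be report(sweep(["./build/PhonoMC", "short_input.toml"], [1, 2, 4, 8, 16, 32])), where short_input.toml stands for a hypothetical reduced-iteration copy of the production input. Single timings are noisy, so repeating each run and taking the minimum is a reasonable refinement.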

11.3. Built-in profiling

Enable the stage timers in the input file:

[simulation]
profile_timers = true

The runtime summary separates major stages such as main particle advance, particle removal, injection construction, collision-cache updates, injected particle advance, temperature update, lifetime scattering, and statistics.

11.4. Scaling parameters

particle_count

Usually the largest direct memory and transport-work multiplier. Increase it only after a low-count case is geometrically correct.

iterations

Controls total simulated time and total transport work.

grid_xyz

Increases temperature/flux reduction buffers and output width. For mesh geometries, initialization also builds a full bounding-box voxel map. The runtime position-to-grid lookup remains constant time.
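To illustrate why the lookup cost does not grow with grid resolution, here is a minimal sketch of a constant-time position-to-cell mapping. It assumes a uniform, axis-aligned grid and an x-fastest flattening order; the function name and arguments are hypothetical, and PhonoMC's actual indexing may differ.

```python
import math


def grid_id(position, box_min, cell_size, grid_xyz):
    """Map a 3-D point to a flattened cell index, or -1 if it lies outside the grid."""
    idx = []
    for p, lo, h, n in zip(position, box_min, cell_size, grid_xyz):
        # One division and one floor per axis, independent of the number of cells.
        i = math.floor((p - lo) / h)
        if i < 0 or i >= n:
            return -1
        idx.append(i)
    ix, iy, iz = idx
    nx, ny, _ = grid_xyz
    # Assumed layout: x varies fastest, then y, then z.
    return ix + nx * (iy + ny * iz)
```

The arithmetic per lookup is fixed, so refining grid_xyz raises memory use and reduction cost but not the per-particle lookup cost.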

temperature_lookup_dt

Smaller values create larger temperature-energy tables and increase initialization work, while improving interpolation resolution.

convergence_write_interval

Larger values reduce file I/O and the frequency at which conductivity statistics are computed.

11.5. Memory considerations

Particle state is stored in parallel arrays for modes, positions, velocities, temperatures, occupations, energy, grid IDs, collision locations, collision facets, boundary conditions, and alive flags. Ten million particles therefore require substantially more memory than the position array alone suggests.

OpenMP grid reductions allocate one buffer per thread, so their total size is proportional to thread count × grid-cell count. Very fine grids combined with many threads can consume significant additional memory.
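A rough budget can be worked out before launching a large run. The sketch below is only an estimate under stated assumptions: the byte widths per field, the number of reduced quantities per cell (fields_per_cell), and the 8-byte value size are illustrative guesses, not PhonoMC's actual storage layout.

```python
# Assumed bytes per particle for each parallel array (illustrative widths only).
PARTICLE_FIELD_BYTES = {
    "mode": 4,
    "position": 3 * 8,
    "velocity": 3 * 8,
    "temperature": 8,
    "occupation": 8,
    "energy": 8,
    "grid_id": 4,
    "collision_location": 3 * 8,
    "collision_facet": 4,
    "boundary_condition": 1,
    "alive": 1,
}


def particle_bytes(particle_count, field_bytes=PARTICLE_FIELD_BYTES):
    """Total bytes across all per-particle arrays."""
    return particle_count * sum(field_bytes.values())


def reduction_bytes(threads, grid_xyz, fields_per_cell=4, bytes_per_value=8):
    """Total bytes of per-thread reduction buffers: threads x cells x fields x width."""
    nx, ny, nz = grid_xyz
    return threads * nx * ny * nz * fields_per_cell * bytes_per_value
```

With these assumed widths, the full particle state is several times larger than the position array alone, and the reduction buffers grow linearly with both the thread count and the number of grid cells.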

11.7. Known performance hotspots

  • boundary intersection and collision processing

  • rough-boundary mode selection

  • particle-array compaction after absorption

  • temperature and heat-flux grid reductions

  • initialization of detailed mesh-volume sampling data

Surface winding repair and volume-consistency checks are initialization-only, linear passes over existing faces or tetrahedra. They do not add work to the per-particle time-stepping path.

11.8. Reproducibility

Set simulation.random_seed explicitly for production comparisons. OpenMP threads receive independently derived streams; therefore exact replay also requires the same thread count and scheduling environment.
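The dependence on thread count can be illustrated with a small model. This is a sketch of the general idea only: the hash-based stream derivation and the static round-robin particle assignment are assumptions made for illustration, not PhonoMC's actual generator or scheduling.

```python
import hashlib
import random


def stream_for(seed, thread_id):
    """Derive a per-thread generator from (seed, thread_id) via a hash (assumed scheme)."""
    digest = hashlib.sha256(f"{seed}:{thread_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "little"))


def draw_per_particle(seed, threads, particle_count):
    """Assign particle i to thread i % threads and draw one number for each particle."""
    streams = [stream_for(seed, t) for t in range(threads)]
    return [streams[i % threads].random() for i in range(particle_count)]
```

In this model, the same seed and thread count reproduce the same draws, while changing the thread count changes which stream serves each particle. Dynamic scheduling has a similar effect, because the particle-to-thread assignment can then vary from run to run.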