Performance
===========

Release builds
--------------

Use ``CMAKE_BUILD_TYPE=Release`` for production simulations. Debug builds are
valuable for diagnosis but can be dramatically slower in particle and geometry
loops.

OpenMP
------

Enable OpenMP at configure time and select the thread count at runtime:

.. code-block:: bash

   cmake -S . -B build \
     -DCMAKE_BUILD_TYPE=Release \
     -DPHONOMC_ENABLE_OPENMP=ON
   cmake --build build -j
   OMP_NUM_THREADS=32 ./build/PhonoMC input.toml

Do not assume that the largest available thread count is fastest. Compare a
short representative run at 1, 2, 4, 8, and higher thread counts. Memory
bandwidth, dynamic collision scheduling, grid reductions, and the active-mode
table size can limit scaling.
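
The comparison above can be scripted. The following is a minimal sketch, not
part of PhonoMC: the helper names ``time_run``, ``scaling_table``, and
``benchmark`` are hypothetical, and the only assumptions about the code are the
executable path and the ``OMP_NUM_THREADS`` variable shown earlier. It reports
speedup and parallel efficiency relative to the lowest thread count measured.

.. code-block:: python

   import os
   import subprocess
   import time


   def time_run(command, threads):
       """Run `command` once with OMP_NUM_THREADS set; return wall time in seconds."""
       env = dict(os.environ, OMP_NUM_THREADS=str(threads))
       start = time.perf_counter()
       subprocess.run(command, env=env, check=True)
       return time.perf_counter() - start


   def scaling_table(timings):
       """Map thread count -> (speedup, efficiency) relative to the smallest count."""
       base_threads = min(timings)
       base_time = timings[base_threads]
       table = {}
       for threads in sorted(timings):
           speedup = base_time / timings[threads]
           # Efficiency of 1.0 means perfect scaling from the baseline run.
           efficiency = speedup * base_threads / threads
           table[threads] = (speedup, efficiency)
       return table


   def benchmark(command, thread_counts):
       """Time one run per thread count and summarize the scaling."""
       return scaling_table({n: time_run(command, n) for n in thread_counts})

For example, ``benchmark(["./build/PhonoMC", "input.toml"], [1, 2, 4, 8, 16, 32])``
times one run per thread count. If efficiency falls sharply between two
counts, the lower one is usually the better production setting.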

Built-in profiling
------------------

Enable the built-in stage timers in the input file:

.. code-block:: toml

   [simulation]
   profile_timers = true

The runtime summary separates major stages such as main particle advance,
particle removal, injection construction, collision-cache updates, injected
particle advance, temperature update, lifetime scattering, and statistics.

Scaling parameters
------------------

``particle_count``
   Usually the largest direct memory and transport-work multiplier. Increase
   it only after a low-count case is geometrically correct.

``iterations``
   Controls total simulated time and total transport work.

``grid_xyz``
   Increases temperature/flux reduction buffers and output width. For mesh
   geometries, initialization also builds a full bounding-box voxel map. The
   runtime position-to-grid lookup remains constant time.

``temperature_lookup_dt``
   Smaller values create larger temperature-energy tables and increase
   initialization work, while improving interpolation resolution.

``convergence_write_interval``
   Larger values reduce file I/O and conductivity-statistics frequency.
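
To illustrate why a finer ``grid_xyz`` enlarges buffers without slowing the
per-particle lookup, here is a minimal sketch of a uniform-grid index
computation. It is an illustration under stated assumptions, not PhonoMC's
implementation: the function name, the axis-aligned bounding box, and the
x-fastest flattening order are all hypothetical.

.. code-block:: python

   def grid_index(position, box_min, box_max, grid_xyz):
       """Return the flattened cell index of `position`, or None if outside the box."""
       cell = []
       for x, lo, hi, n in zip(position, box_min, box_max, grid_xyz):
           if not (lo <= x <= hi):
               return None
           # Scale to [0, n] and clamp so the upper face maps to the last cell.
           cell.append(min(int((x - lo) / (hi - lo) * n), n - 1))
       ix, iy, iz = cell
       nx, ny, _ = grid_xyz
       # Assumed layout: x varies fastest, then y, then z.
       return ix + nx * (iy + ny * iz)

The work per lookup is three divisions and a few multiplications regardless
of the cell count, while the number of cells to reduce and write grows as
``nx * ny * nz``.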

Memory considerations
---------------------

Particle state is stored in parallel arrays for modes, positions, velocities,
temperatures, occupations, energy, grid IDs, collision locations, collision
facets, boundary conditions, and alive flags. Ten million particles therefore
require substantially more memory than the position array alone suggests.

OpenMP grid reductions allocate per-thread buffers, so their total size is
proportional to the thread count multiplied by the number of grid cells. Very
fine grids combined with many threads can consume significant additional
memory.
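
A rough estimate helps before launching a large case. The sketch below is
illustrative only: the byte widths per array, the four reduced values per
cell, and the function names are assumptions made for this example, not
PhonoMC's actual storage layout.

.. code-block:: python

   # Assumed bytes per particle for each parallel array (hypothetical widths).
   ASSUMED_BYTES_PER_PARTICLE = {
       "mode": 4,
       "position": 3 * 8,
       "velocity": 3 * 8,
       "temperature": 8,
       "occupation": 8,
       "energy": 8,
       "grid_id": 4,
       "collision_location": 3 * 8,
       "collision_facet": 4,
       "boundary_condition": 4,
       "alive": 1,
   }


   def particle_bytes(particle_count, layout=ASSUMED_BYTES_PER_PARTICLE):
       """Total bytes for all per-particle arrays under the assumed layout."""
       return particle_count * sum(layout.values())


   def reduction_bytes(threads, grid_xyz, values_per_cell=4, bytes_per_value=8):
       """Total bytes for per-thread reduction buffers (assumed 4 doubles per cell)."""
       nx, ny, nz = grid_xyz
       return threads * nx * ny * nz * values_per_cell * bytes_per_value

Under these assumed widths, ten million particles occupy about 1.13 GB in
total, compared with 0.24 GB for the position array alone, and 32 threads on
a 100 × 100 × 100 grid add roughly 1.02 GB of reduction buffers.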

Recommended scaling workflow
----------------------------

1. Validate input with a very small case.
2. Choose a grid using a grid-sensitivity study.
3. Select a stable time step.
4. Increase particles until noise is acceptable.
5. Increase iterations until observables are stationary.
6. Benchmark thread count on the final problem shape.
7. Enable profiling for at least one representative production run.
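
For step 4, a pilot run can guide the particle count. The sketch below
assumes that statistical noise falls as the inverse square root of the
particle count, which holds for independent Monte Carlo samples and may only
be approximate for a given observable. The function name is hypothetical.

.. code-block:: python

   import math


   def particles_for_target_noise(pilot_count, pilot_noise, target_noise):
       """Estimate the particle count needed to reach `target_noise`.

       Assumes noise scales as 1 / sqrt(particle count).
       """
       if pilot_count <= 0 or pilot_noise <= 0 or target_noise <= 0:
           raise ValueError("counts and noise levels must be positive")
       return math.ceil(pilot_count * (pilot_noise / target_noise) ** 2)

Under this assumption, halving the noise requires about four times as many
particles, so memory and run time rise much faster than accuracy improves.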

Known performance hotspots
--------------------------

- boundary intersection and collision processing
- rough-boundary mode selection
- particle-array compaction after absorption
- temperature and heat-flux grid reductions
- initialization of detailed mesh-volume sampling data

Surface winding repair and volume-consistency checks are initialization-only,
linear passes over existing faces or tetrahedra. They do not add work to the
per-particle time-stepping path.
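
To show why compaction appears in the list above, here is a minimal sketch
of order-preserving compaction over parallel arrays. It is an assumption
about the general technique, not PhonoMC's code; the function name and the
dictionary-of-lists layout are hypothetical.

.. code-block:: python

   def compact_alive(arrays, alive):
       """Drop dead particles from every parallel array in place; return survivors."""
       write = 0
       for read, is_alive in enumerate(alive):
           if is_alive:
               if write != read:
                   # Every array must move the same element to stay aligned.
                   for values in arrays.values():
                       values[write] = values[read]
               write += 1
       # Truncate the tails that now hold stale copies.
       for values in arrays.values():
           del values[write:]
       alive[:] = [True] * write
       return write

Each surviving particle is copied once per array, so the cost grows with both
the particle count and the number of state arrays.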

Reproducibility
---------------

Set ``simulation.random_seed`` explicitly for production comparisons. Each
OpenMP thread receives its own independently derived random stream, so exact
replay also requires the same thread count and scheduling environment.
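
The dependence on thread count can be seen in a small sketch. The stream
derivation below (hashing the seed with a thread index) and the round-robin
particle ownership are assumptions chosen for illustration; the way PhonoMC
derives its streams is not described here.

.. code-block:: python

   import hashlib
   import random


   def thread_stream(seed, thread_id):
       """Derive one generator per thread from the seed (hypothetical derivation)."""
       digest = hashlib.sha256(f"{seed}:{thread_id}".encode()).digest()
       return random.Random(int.from_bytes(digest[:8], "big"))


   def draws_per_particle(seed, particle_count, thread_count):
       """Give each particle one draw from the stream of the thread that owns it."""
       streams = [thread_stream(seed, t) for t in range(thread_count)]
       # Static round-robin ownership; dynamic scheduling would make the
       # assignment depend on run-time timing as well.
       return [streams[i % thread_count].random() for i in range(particle_count)]

With the same seed and thread count the draws repeat exactly. Changing the
thread count changes which stream, and which position in that stream, each
particle consumes, so the results differ even though the seed is unchanged.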