On-node Parallelism
*******************

FleCSI tasks can launch *kernels* to exploit fine-grained, on-node parallelism.
These kernels run inside the task body and are typically mapped to hardware threads by Kokkos.
FleCSI provides a unified API for launching kernels through an *executor*, using constructs like ``forall`` and ``reduceall``.

Simple Parallel Loop
++++++++++++++++++++

FleCSI provides a convenient API to iterate over data using the ``forall`` construct.
This construct is based on ``Kokkos::parallel_for`` and enables efficient, parallel execution over a range of elements.

Consider the example below, where each value in the accessor is incremented:

.. code-block:: c++

  void modify(exec::accelerator s, mesh::accessor t, field::accessor p) noexcept {
    // c is declared by forall and takes each value in t.cells()
    s.executor().forall(c, t.cells()) {
      p[c] += 1;
    };
  }

In this snippet, the loop index ``c`` is implicitly declared by the ``forall`` construct.
The iteration is over the range ``t.cells()``, which provides access to the mesh entities visible to the task.
The loop body is a lambda expression, so it must be followed by a semicolon.

Advanced Ranges
^^^^^^^^^^^^^^^

FleCSI also supports iterating over an ``mdspan`` using ``mdiota_view``:

.. code-block:: c++

  s.executor().forall(
    mi, mdiota_view(
      md, exec::full_range, exec::prefix_range{2}
    )) { /* ... */ };

A subset of the array can be selected using parameters such as ``full_range``, ``prefix_range``, or ``sub_range``.

Reductions Inside Kernels
+++++++++++++++++++++++++

To obtain a reduced value from a parallel iteration, a task uses the ``reduceall`` construct (based on ``Kokkos::parallel_reduce``).
For instance, the following kernel computes the maximum value over all cells:

.. code-block:: c++

  void reduce1(exec::accelerator s, mesh::accessor t, field::accessor p) noexcept {
    auto res = s.executor().reduceall(
      c, up, t.cells(), exec::fold::max, double) {
      up(p[c]); // contribute this cell's value to the maximum
    };
  }

Here, ``up`` is an additional variable declared by ``reduceall`` to collect values of the given type ``double`` from the kernel, while ``exec::fold::max`` defines the fold operation.

Further executor features
+++++++++++++++++++++++++

Executors support additional member functions for special cases.
Some of these specify optional, composable behavior for a kernel launch:

.. code-block:: c++

  void modify(exec::accelerator s, mesh::accessor t, field::accessor p) noexcept {
    s.executor()
      .named("named forall")
      .threads<64, 1>()
      .for_each(t.cells(), [p] FLECSI_INLINE_TARGET (auto c) { p[c] += 2; });
  }

In this example, ``named("named forall")`` attaches a label to the kernel, which Kokkos Tools can use for profiling and debugging.
``threads<64, 1>()`` specifies the number of threads and blocks for the construct, here one block of 64 threads.
``for_each`` is a function template equivalent of ``forall`` that directly accepts a function object.
``reduce`` is the equivalent for ``reduceall``; function objects used with either must be declared with ``FLECSI_INLINE_TARGET`` for compatibility with typical GPU compilers.
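
As a mental model for the ``up`` collector used in reduction kernels, the following plain C++ sketch may help.
It is a serial stand-in and not FleCSI code: ``max_fold``, ``reduce_sketch``, and ``max_over_cells`` are hypothetical names introduced only for illustration, and the sketch makes no claim about how FleCSI or Kokkos implement reductions.
It only shows the idea that the kernel body is invoked once per element and that each call to ``up`` folds one value into the result.

.. code-block:: c++

  #include <algorithm>
  #include <cstddef>
  #include <limits>
  #include <numeric>
  #include <vector>

  // Hypothetical fold: an identity value plus a rule to combine two values.
  struct max_fold {
    static double identity() { return std::numeric_limits<double>::lowest(); }
    static double combine(double a, double b) { return std::max(a, b); }
  };

  // Serial model of a reduction kernel (illustrative assumption, not FleCSI API):
  // body(c, up) runs once per element; up(v) folds v into the accumulator.
  template<class Fold, class Range, class Body>
  double reduce_sketch(const Range & range, Body body) {
    double acc = Fold::identity();
    auto up = [&acc](double v) { acc = Fold::combine(acc, v); };
    for(const auto & c : range)
      body(c, up);
    return acc;
  }

  // Maximum of the values in p, indexed by a hypothetical range of cell ids.
  inline double max_over_cells(const std::vector<double> & p) {
    std::vector<std::size_t> cells(p.size());
    std::iota(cells.begin(), cells.end(), std::size_t{0});
    return reduce_sketch<max_fold>(
      cells, [&p](std::size_t c, auto up) { up(p[c]); });
  }

In this sketch an empty range yields the fold's identity value; a real parallel reduction additionally combines per-thread partial results, which the serial loop does not model.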