On-node Parallelism

FleCSI tasks can launch kernels to exploit fine-grained, on-node parallelism. These kernels operate inside the task body and are typically mapped to hardware threads by Kokkos.

FleCSI provides a unified API for launching kernels through an executor, using constructs like forall and reduceall.

Simple Parallel Loop

FleCSI provides a convenient API to iterate over data using the forall construct. This construct is based on Kokkos::parallel_for and enables efficient, parallel execution over a range of elements.

Consider the example below, where each value in the accessor is incremented:

void modify(exec::accelerator s,
            mesh::accessor<ro> t,
            field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}

In this snippet, the loop index c is implicitly declared by the forall construct. The iteration is over the range t.cells(), which provides access to the mesh entities visible to the task. The loop body becomes a lambda expression, so the construct must be terminated with a semicolon.

Advanced Ranges

FleCSI also supports iterating over an mdspan using mdiota_view:

// md is a multidimensional (mdspan-like) view; mi is the corresponding
// multidimensional loop index.
s.executor().forall(
  mi,
  mdiota_view(
    md,
    exec::full_range,     // the full extent in one dimension
    exec::prefix_range{2} // only the first 2 indices in the other
  )) { /* ... */ };

A subset of the view can be selected by passing a range specifier for each dimension, such as full_range, prefix_range, or sub_range.
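
As a further illustration, the following sketch restricts one dimension to a contiguous index window. It assumes that sub_range takes begin and end bounds; consult the FleCSI reference for the exact interface:

s.executor().forall(
  mi,
  mdiota_view(
    md,
    exec::full_range,
    exec::sub_range{1, 3} // assumed begin/end bounds selecting indices 1 and 2
  )) { /* use mi to index into md */ };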

Reductions Inside Kernels

To obtain a reduced value from a parallel iteration, a task uses the reduceall construct (based on Kokkos::parallel_reduce).

For instance, the following kernel computes the maximum value over all cells:

void reduce1(exec::accelerator s,
             mesh::accessor<ro> t,
             field<float>::accessor<ro> p) noexcept {
  // res receives the maximum of p over all cells visible to the task.
  auto res = s.executor().reduceall(
    c, up, t.cells(), exec::fold::max, double) {
      up(p[c]);
  };
}

Here, up is an additional variable declared by reduceall; it collects values of the given type double from the kernel, while exec::fold::max defines the fold operation. The reduced result is returned by the construct and stored in res.
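
As a further sketch, a sum over the same cells could be written as follows; this assumes that a fold::sum operation is provided alongside fold::max, and the reduced value assigned to total can then be used by the task:

void reduce2(exec::accelerator s,
             mesh::accessor<ro> t,
             field<float>::accessor<ro> p) noexcept {
  // Hypothetical sum reduction; exec::fold::sum is assumed to exist
  // alongside exec::fold::max.
  auto total = s.executor().reduceall(
    c, up, t.cells(), exec::fold::sum, double) {
      up(p[c]);
  };
  // total now holds the sum of p over all cells visible to the task.
}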

Further Executor Features

Executors support additional member functions for special cases. Some of these specify optional, composable behavior for a kernel launch:

void modify(exec::accelerator s,
            mesh::accessor<ro> t,
            field<double>::accessor<rw> p) noexcept {
  s.executor()
    .named("named forall")
    .threads<64, 1>()
    .for_each(t.cells(),
      [p] FLECSI_INLINE_TARGET (auto c) {
        p[c] += 2;
      });
}

In this example, named("named forall") attaches a label to the kernel that can be used by Kokkos Tools for profiling and debugging. threads<64, 1>() specifies the thread and block counts for the launch, here one block of 64 threads. for_each is the function-template equivalent of forall and directly accepts a function object; reduce is the equivalent for reduceall. Function objects used with these must be declared with FLECSI_INLINE_TARGET for compatibility with typical GPU compilers.