On-node Parallelism
FleCSI tasks can launch kernels to exploit fine-grained, on-node parallelism. These kernels operate inside the task body and are typically mapped to hardware threads by Kokkos.
FleCSI provides a unified API for launching kernels through an executor, using constructs like forall and reduceall.
Simple Parallel Loop
FleCSI provides a convenient API to iterate over data using the forall construct.
This construct is based on Kokkos::parallel_for and enables efficient, parallel execution over a range of elements.
Consider the example below, where each value in the accessor is incremented:
void modify(exec::accelerator s,
  mesh::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}
In this snippet, the loop index c is implicitly declared by the forall construct.
The iteration is over the range t.cells(), which provides access to the mesh entities visible to the task.
The loop body is a lambda expression that must be followed by a semicolon.
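Because the kernel body is an ordinary statement block, it can use any of the accessors passed to the task. The following sketch adds a hypothetical second field accessor q and otherwise reuses only the constructs shown above:
scale(exec::accelerator s,
  mesh::accessor<ro> t,
  field<double>::accessor<ro> q, // hypothetical second field
  field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 0.5 * q[c]; // the index c addresses both accessors
  };
}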
Advanced Ranges
FleCSI also supports iterating over an mdspan using mdiota_view:
s.executor().forall(
  mi,
  mdiota_view(
    md, // md is the mdspan to iterate over
    exec::full_range,
    exec::prefix_range{2})) { /* ... */ };
A subset of the array can be selected by passing a range for each dimension: full_range, prefix_range, or sub_range.
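For instance, sub_range can restrict the iteration to an interior block of one dimension. The sketch below assumes that sub_range is constructed from begin/end offsets; the exact bounds semantics should be checked against the FleCSI reference.
s.executor().forall(
  mi,
  mdiota_view(
    md,
    exec::sub_range{1, 4}, // assumed begin/end offsets: indices 1, 2, 3
    exec::full_range)) { /* ... */ };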
Reductions Inside Kernels
To obtain a reduced value from a parallel iteration, a task uses the reduceall construct (based on Kokkos::parallel_reduce).
For instance, the following kernel computes the maximum value over all cells:
void reduce1(exec::accelerator s,
  mesh::accessor<ro> t,
  field<float>::accessor<ro> p) noexcept {
  auto res = s.executor().reduceall(
    c, up, t.cells(), exec::fold::max, double) {
    up(p[c]); // contribute each cell value to the reduction
  };
  // res holds the maximum over the iterated cells
}
Here, up is an additional variable declared by reduceall that collects values of the given type double from the kernel body, while exec::fold::max defines the fold operation.
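Other folds follow the same pattern. The sketch below assumes a sum fold, exec::fold::sum, is available alongside max and prod, and that the task can return the reduced value directly:
double total(exec::accelerator s,
  mesh::accessor<ro> t,
  field<double>::accessor<ro> p) noexcept {
  auto res = s.executor().reduceall(
    c, up, t.cells(), exec::fold::sum, double) {
    up(p[c]); // contribute each cell value to the running sum
  };
  return res; // reduced value available after the kernel completes
}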
Further Executor Features
Executors support additional member functions for special cases. Some of these specify optional, composable behavior for a kernel launch:
void modify(exec::accelerator s,
  mesh::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor()
    .named("named forall")
    .threads<64, 1>()
    .for_each(t.cells(),
      [p] FLECSI_INLINE_TARGET (auto c) { p[c] += 2; });
}
In this example, named("named forall") attaches a label to the kernel that can be used by Kokkos Tools for profiling and debugging.
threads<64, 1>() specifies the number of threads and blocks for the construct, here one block of 64 threads.
for_each is the function-template equivalent of forall that directly accepts a function object.
reduce is the equivalent for reduceall; function objects used with these must be declared with FLECSI_INLINE_TARGET for compatibility with typical GPU compilers.
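As an illustration, a reusable function object can be passed to for_each. The functor type below (shift) is hypothetical, not part of the FleCSI API; it assumes that the accessor can be stored by copy in the functor and that the FLECSI_INLINE_TARGET annotation is applied to its call operator:
struct shift { // hypothetical functor for illustration
  field<double>::accessor<rw> p;
  double delta;
  template<class C>
  FLECSI_INLINE_TARGET void operator()(C c) const {
    p[c] += delta;
  }
};

void modify_by(exec::accelerator s,
  mesh::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor().for_each(t.cells(), shift{p, 2.0});
}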