On-node Parallelism
FleCSI tasks can launch kernels to exploit fine-grained, on-node parallelism. These kernels operate inside the task body and are typically mapped to hardware threads by Kokkos.
FleCSI provides a unified API for launching kernels through an executor, using constructs such as forall and reduceall.
Simple Parallel Loop
FleCSI provides a convenient API to iterate over data using the forall construct. It is based on Kokkos::parallel_for and enables efficient, parallel execution over a range of elements. Consider the example below, where each value in the accessor is incremented:
void modify(exec::accelerator s,
            mesh::accessor<ro> t,
            field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}
In this snippet, the loop index c is implicitly declared by the forall construct. The iteration is over the range t.cells(), which provides access to the mesh entities visible to the task. The loop body is a lambda expression that must be followed by a semicolon.
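For readers who know Kokkos, the construct above corresponds roughly to a plain Kokkos::parallel_for over the same range. The sketch below only illustrates that mapping and is not FleCSI's implementation; the raw pointer and cell count are stand-ins for the accessor and t.cells().

// Rough Kokkos analogue of the forall construct above (illustrative only).
#include <Kokkos_Core.hpp>
#include <cstddef>

void modify_kokkos(double * p, std::size_t ncells) {
  Kokkos::parallel_for(
    "modify",                            // kernel label
    Kokkos::RangePolicy<>(0, ncells),    // iteration range, analogous to t.cells()
    KOKKOS_LAMBDA(const std::size_t c) { // body, analogous to the forall body
      p[c] += 1;
    });
}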
Advanced Ranges
FleCSI also supports iterating over an mdspan using mdiota_view:
s.executor().forall(mi,
                    mdiota_view(md,
                                exec::full_range,
                                exec::prefix_range{2})) {
  /* ... */
};
A subset of the array can be selected using parameters such as full_range, prefix_range, or sub_range.
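For example, a two-dimensional mdspan md could be traversed over only part of one extent while the other is visited in full. The snippet below is a sketch under the assumption that sub_range is built from a begin and an end index; consult the FleCSI reference documentation for the exact form.

// Illustrative only: sub_range{1, 3} is assumed to select indices 1 and 2 of
// one extent, while full_range selects the other extent completely.
s.executor().forall(mi,
                    mdiota_view(md,
                                exec::sub_range{1, 3},
                                exec::full_range)) {
  /* use the multidimensional index mi ... */
};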
Reductions Inside Kernels
To obtain a reduced value from a parallel iteration, a task uses the reduceall construct (based on Kokkos::parallel_reduce). For instance, the following kernel computes the maximum value over all cells:
void reduce1(exec::accelerator s,
             mesh::accessor<ro> t,
             field<float>::accessor<ro> p) noexcept {
  // res receives the maximum of p over all cells.
  auto res = s.executor().reduceall(
    c, up, t.cells(), exec::fold::max, double) {
    up(p[c]);
  };
}
Here, up is an additional variable declared by reduceall to accumulate values of the given type double from the kernel, while exec::fold::max defines the fold operation.
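Because reduceall is built on Kokkos::parallel_reduce, the kernel above corresponds roughly to the following plain Kokkos reduction. This is an illustrative sketch of that mapping, not FleCSI's implementation; the raw pointer and cell count again stand in for the accessor and t.cells().

// Rough Kokkos analogue of the reduceall construct above (illustrative only).
#include <Kokkos_Core.hpp>
#include <cstddef>

double reduce1_kokkos(const float * p, std::size_t ncells) {
  double res = 0.0;
  Kokkos::parallel_reduce(
    "reduce1",                         // kernel label
    Kokkos::RangePolicy<>(0, ncells),  // iteration range, analogous to t.cells()
    KOKKOS_LAMBDA(const std::size_t c, double & up) {
      if (p[c] > up)                   // fold each value into the partial result
        up = p[c];
    },
    Kokkos::Max<double>(res));         // fold operation, analogous to exec::fold::max
  return res;
}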
Further Executor Features
Executors support additional member functions for special cases. Some of these specify optional, composable behavior for a kernel launch:
void modify(exec::accelerator s,
            mesh::accessor<ro> t,
            field<double>::accessor<rw> p) noexcept {
  s.executor()
    .named("named forall") // label visible to Kokkos Tools
    .threads<64, 1>()      // one block of 64 threads
    .for_each(t.cells(),
              [p] FLECSI_INLINE_TARGET (auto c) {
                p[c] += 2;
              });
}
In this example, named("named forall") attaches a label to the kernel that can be used by Kokkos Tools for profiling and debugging. threads<64, 1>() specifies the number of threads and blocks for the construct, here one block of 64 threads. for_each is a function-template equivalent of forall that directly accepts a function object, and reduce is the equivalent of reduceall; function objects used with these must be declared with FLECSI_INLINE_TARGET for compatibility with typical GPU compilers.
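By analogy with for_each, a reduction written with the reduce member function might look like the sketch below. The exact signature of reduce is not shown in this section, so the way the fold operation and value type are supplied here is an assumption; consult the FleCSI reference documentation for the precise form.

// Hypothetical sketch: the argument order of reduce() and the way the fold
// operation and reduction type are passed are assumptions, chosen by analogy
// with reduceall(c, up, t.cells(), exec::fold::max, double).
void reduce2(exec::accelerator s,
             mesh::accessor<ro> t,
             field<float>::accessor<ro> p) noexcept {
  auto res = s.executor()
               .named("named reduce")
               .reduce<exec::fold::max, double>(
                 t.cells(),
                 [p] FLECSI_INLINE_TARGET (auto c, auto & up) { up(p[c]); });
}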