Shared-memory parallelism
FleCSI provides two levels of parallelism: distributed-memory parallelism and shared-memory parallelism. Distributed-memory parallelism is provided through topology coloring and distribution of the data among processes (shards). For shared-memory parallelism, FleCSI provides the forall and reduceall macros, which currently use the Kokkos programming model.
Example 1: forall
This example is a modification of the data-dense tutorial example that replaces the data copy with a modify task that supports Kokkos. The accelerator is an execution space that uses Kokkos parallelism on a GPU, or via OpenMP, if available. Every execution space has an executor that implements its parallelism (if any) via a forall macro that can be used as a member function.
Note
With the Legion backend, OpenMP task execution can be improved with the omp processor type; Legion knows to assign an entire node to such a task.
Warning
With the MPI backend, running one process per node with toc tasks, or one process per core with omp tasks, likely leads to poor performance.
#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This tutorial is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
modify(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();
  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // CPU task (the default)
  s.execute<init>(canonical, pf);
  // Automatically select an execution space based on the Kokkos
  // configuration; the runtime moves data between host and device.
  s.execute<modify>(exec::on, canonical, pf);
  // CPU task
  s.execute<print>(canonical, pf);
}

control::action<advance, cp::advance> advance_action;
control::action<advance, cp::advance> advance_action;
Example 2: reduceall
This example instead uses reduce1 and reduce2 tasks, which use the reduceall macro interface and the reduce function template interface, respectively. The former accepts two names declared for use in the body: the range element, as for forall, and a function that accepts values for the reduction. The latter supports further composition, such as client-library interfaces that accept kernel functors; its analog for forall is called for_each. Any of these can have a name attached, as illustrated with the named function.
reduce2 also illustrates defining a task as a function template so that it can use an execution space chosen by FleCSI. For syntactic reasons, the function template is wrapped in a struct; it is always named task and has just one template parameter, which is the execution space. Note that, to let FleCSI select which specializations to instantiate, the definition of the function template must be available when the task is launched (rather than being defined in another source file). The application can influence that choice: here, the gpu execution space is taken to be undesirable and is disabled by deleting its template specialization.
#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This tutorial is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
reduce1(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<ro> p) noexcept {
  auto res = s.executor().named("reduce1").reduceall(
    c, up, t.cells(), exec::fold::max, double) {
    up(p[c]);
  }; // reduceall
  flog_assert(res == 6.0, res << " != 6.0");
}

struct reduce2 {
  template<class S>
  static void
  task(S s, canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
    auto res = s.executor().template reduce<exec::fold::max, double>(
      t.cells(), FLECSI_LAMBDA(auto c, auto up) { up(p[c]); });
    flog_assert(res == 6.0, res << " != 6.0");
  }
};

// Disable the gpu execution space for reduce2 by deleting its
// specialization.
template<>
void reduce2::task(exec::gpu,
  canon::accessor<ro>,
  field<double>::accessor<ro>) noexcept = delete;

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();
  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // CPU task (the default)
  s.execute<init>(canonical, pf);
  s.execute<reduce1>(exec::on, canonical, pf);
  s.execute<reduce2>(exec::on, canonical, pf);
  // CPU task
  s.execute<print>(canonical, pf);
}

control::action<advance, cp::advance> advance_action;