Distributed and shared memory parallelism
FleCSI provides two levels of parallelism: distributed-memory parallelism and shared-memory parallelism.
Distributed-memory parallelism is provided through topology coloring and the distribution of data between different processes (shards). For shared-memory parallelism, FleCSI provides the forall and reduceall macros, which currently use the Kokkos programming model.
Shared memory
Example 1: forall macro / parallel_for interface
This example extends the data-dense tutorial example with an additional “modify” task that uses the forall macro / parallel_for interface. The “modify” task is executed on the FleCSI default accelerator.
The second template parameter to the execute function is a processor_type, with loc (latency-optimized core) as the default. default_accelerator is the processor type corresponding to the Kokkos default execution space. For example, if Kokkos is built with both the Cuda and Serial backends, Cuda is the default execution space, which corresponds to the toc (throughput-optimized core) processor type in FleCSI.
Note
With the Legion backend, OpenMP task execution can be improved with the omp processor type. Legion knows to assign an entire node to such a task.
Warning
With the MPI backend, running one process per node with toc tasks, or one process per core with omp tasks, likely leads to poor performance.
#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This tutorial is based on the 4-data/3-dense.cc tutorial example.
// Here we add a forall / parallel_for interface.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
modify(canon::accessor<ro> t, field<double>::accessor<rw> p) {
  forall(c, t.cells(), "modify") { p[c] += 1; };
} // modify

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy &) {
  canon::slot canonical;
  canonical.allocate(canon::mpi_coloring("test.txt"));

  auto pf = pressure(canonical);

  // CPU task (default, loc processor type)
  execute<init>(canonical, pf);
  // Accelerated task, executed on the Kokkos default execution space.
  // If Kokkos is built with a GPU backend, the default execution space
  // is the GPU; the runtime moves data between host and device.
  execute<modify, default_accelerator>(canonical, pf);
  // CPU task
  execute<print>(canonical, pf);
}

control::action<advance, cp::advance> advance_action;
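The forall macro is shorthand for the lambda-based parallel_for interface, which can also be called directly, mirroring the parallel_reduce call shown in Example 2 below. A sketch of a hypothetical modify2 task (not part of the listing above), assuming the flecsi::exec::parallel_for interface and the FLECSI_LAMBDA annotation macro:

```cpp
// Hypothetical variant of the "modify" task using parallel_for directly;
// FLECSI_LAMBDA supplies the host/device annotations the backend requires.
void
modify2(canon::accessor<ro> t, field<double>::accessor<rw> p) {
  flecsi::exec::parallel_for(
    t.cells(), FLECSI_LAMBDA(auto c) { p[c] += 1; }, std::string("modify2"));
} // modify2
```

Like modify, such a task would be launched with execute<modify2, default_accelerator>(canonical, pf);.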
Example 2: reduceall macro / parallel_reduce interface
This example extends the data-dense tutorial example with additional “reduce1” and “reduce2” tasks that use the reduceall macro / parallel_reduce interface. Both “reduce” tasks are executed on the FleCSI default accelerator.
// Copyright (c) 2016, Triad National Security, LLC
// All rights reserved.

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This tutorial is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
reduce1(canon::accessor<ro> t, field<double>::accessor<ro> p) {
  auto res = reduceall(c, up, t.cells(), exec::fold::max, double, "reduce1") {
    up(p[c]);
  }; // reduceall
  flog_assert(res == 6.0, res << " != 6.0");
} // reduce1

void
reduce2(canon::accessor<ro> t, field<double>::accessor<ro> p) {
  auto res = flecsi::exec::parallel_reduce<exec::fold::max, double>(
    t.cells(),
    FLECSI_LAMBDA(auto c, auto up) { up(p[c]); },
    std::string("reduce2"));
  flog_assert(res == 6.0, res << " != 6.0");
} // reduce2

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy &) {
  canon::slot canonical;
  canonical.allocate(canon::mpi_coloring("test.txt"));

  auto pf = pressure(canonical);

  // CPU task (default)
  execute<init>(canonical, pf);
  // reduction tasks, executed on the default accelerator
  execute<reduce1, default_accelerator>(canonical, pf);
  execute<reduce2, default_accelerator>(canonical, pf);
  // CPU task
  execute<print>(canonical, pf);
}

control::action<advance, cp::advance> advance_action;