Shared-memory parallelism

FleCSI provides two levels of parallelism: distributed-memory parallelism and shared-memory parallelism.

Distributed-memory parallelism is provided through topology coloring and the distribution of data among processes (shards). Shared-memory parallelism is provided through the forall and reduceall macros, which currently use the Kokkos programming model.


Example 1: forall

This example is a modification of the data-dense tutorial example, replacing the data copy with a modify task that supports Kokkos. The accelerator execution space uses Kokkos parallelism on a GPU, or via OpenMP if available. Every execution space has an executor that implements its parallelism (if any); the executor provides a forall macro that can be used as a member function.

Note

With the Legion backend, OpenMP task execution can be improved with the omp processor type. Legion knows to assign an entire node to such a task.

Warning

With the MPI backend, running one process per node with toc tasks or one process per core with omp tasks likely leads to poor performance.

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This example is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
modify(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();

  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // cpu task, default
  s.execute<init>(canonical, pf);
  // Automatically select an execution space based on Kokkos configuration.
  // The runtime moves data between the host and device.
  s.execute<modify>(exec::on, canonical, pf);
  // cpu task
  s.execute<print>(canonical, pf);
}
control::action<advance, cp::advance> advance_action;

Example 2: reduceall

This example instead uses reduce1 and reduce2 tasks, which use the reduceall macro interface and the reduce function-template interface, respectively. The former accepts two names to be declared for use in the body: the range element, as for forall, and a function that accepts values for the reduction. The latter supports further composition, such as client-library interfaces that accept kernel functors; its analog for forall is called for_each. Any of these can have a name attached, as illustrated with the named function.

reduce2 also illustrates defining a task as a function template so that it can use an execution space chosen by FleCSI. For syntactic reasons, the function template is wrapped in a struct; it must be named task and have exactly one template parameter, the execution space. Note that, to let FleCSI select which specializations to instantiate, the definition of the function template must be visible where the task is launched (rather than being defined in another source file). The application can influence that choice: here, the gpu execution space is taken to be undesirable and is disabled by deleting its template specialization.

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This example is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
reduce1(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<ro> p) noexcept {
  auto res = s.executor().named("reduce1").reduceall(
    c, up, t.cells(), exec::fold::max, double) {
    up(p[c]);
  }; // reduceall

  flog_assert(res == 6.0, res << " != 6.0");
}

struct reduce2 {
  template<class S>
  static void
  task(S s, canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
    auto res = s.executor().template reduce<exec::fold::max, double>(
      t.cells(), FLECSI_LAMBDA(auto c, auto up) { up(p[c]); });

    flog_assert(res == 6.0, res << " != 6.0");
  }
};
template<>
void reduce2::task(exec::gpu,
  canon::accessor<ro>,
  field<double>::accessor<ro>) noexcept = delete;

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();

  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // cpu task, default
  s.execute<init>(canonical, pf);
  s.execute<reduce1>(exec::on, canonical, pf);
  s.execute<reduce2>(exec::on, canonical, pf);
  // cpu task
  s.execute<print>(canonical, pf);
}
control::action<advance, cp::advance> advance_action;