Shared-memory parallelism

FleCSI provides two levels of parallelism: distributed-memory parallelism and shared-memory parallelism.

Distributed-memory parallelism is provided through topology coloring and the distribution of data among processes (shards). Shared-memory parallelism is provided through the forall and reduceall macros, which currently use the Kokkos programming model.


Example 1: forall

This example is a modification of the data-dense tutorial example, replacing the data copy with a modify task that supports Kokkos. The accelerator execution space uses Kokkos parallelism on a GPU, or via OpenMP if available. Every execution space has an executor that implements its parallelism (if any); the executor provides a forall macro that can be used as a member function.

Note

With the Legion backend, OpenMP task execution can be improved with the omp processor type. Legion knows to assign an entire node to such a task.

Warning

With the MPI backend, running one process per node with toc tasks or one process per core with omp tasks likely leads to poor performance.

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This example is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
modify(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<rw> p) noexcept {
  s.executor().forall(c, t.cells()) {
    p[c] += 1;
  };
}

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();

  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // cpu task, default
  s.execute<init>(canonical, pf);
  // Automatically select an execution space based on Kokkos configuration.
  // The runtime moves data between the host and device.
  s.execute<modify>(exec::on, canonical, pf);
  // cpu task
  s.execute<print>(canonical, pf);
}
control::action<advance, cp::advance> advance_action;

Example 2: reduceall

This example instead uses reduce1 and reduce2 tasks, which use the reduceall macro interface and the reduce function-template interface, respectively. The former accepts two names to be declared for use in the body: the range element, as for forall, and a function that accepts values for the reduction. The latter supports further composition, such as client-library interfaces that accept kernel functors; its analog for forall is called for_each. Any of these can have a name attached, as illustrated with the named function.

reduce2 also illustrates defining a task as a function template so that it can use an execution space chosen by FleCSI. For syntactic reasons, the function template is wrapped in a struct; it must be named task and have exactly one template parameter, the execution space. Note that, to let FleCSI select which specializations to instantiate, the definition of the function template must be visible where the task is launched (rather than being defined in another source file). The application can influence that choice: here, the gpu execution space is taken to be undesirable and is disabled by deleting its template specialization.

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>

#include "../3-execution/control.hh"
#include "../4-data/canonical.hh"

// This example is based on the 4-data/3-dense.cc tutorial example.

using namespace flecsi;

const field<double>::definition<canon, canon::cells> pressure;

void
init(canon::accessor<ro> t, field<double>::accessor<wo> p) noexcept {
  std::size_t off{0};
  for(const auto c : t.cells()) {
    p[c] = (off++) * 2.0;
  } // for
} // init

void
reduce1(exec::accelerator s,
  canon::accessor<ro> t,
  field<double>::accessor<ro> p) noexcept {
  auto res = s.executor().named("reduce1").reduceall(
    c, up, t.cells(), exec::fold::max, double) {
    up(p[c]);
  }; // reduceall

  flog_assert(res == 6.0, res << " != 6.0");
}

struct reduce2 {
  template<class S>
  static void
  task(S s, canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
    auto res = s.executor().template reduce<exec::fold::max, double>(
      t.cells(), FLECSI_LAMBDA(auto c, auto up) { up(p[c]); });

    flog_assert(res == 6.0, res << " != 6.0");
  }
};
template<>
void reduce2::task(exec::gpu,
  canon::accessor<ro>,
  field<double>::accessor<ro>) noexcept = delete;

void
print(canon::accessor<ro> t, field<double>::accessor<ro> p) noexcept {
  std::size_t off{0};
  for(auto c : t.cells()) {
    flog(info) << "cell " << off++ << " has pressure " << p[c] << std::endl;
  } // for
} // print

void
advance(control_policy & p) {
  auto & s = p.scheduler();

  canon::topology canonical(s, canon::mpi_coloring(s, "test.txt"));

  auto pf = pressure(canonical);

  // cpu task, default
  s.execute<init>(canonical, pf);
  s.execute<reduce1>(exec::on, canonical, pf);
  s.execute<reduce2>(exec::on, canonical, pf);
  // cpu task
  s.execute<print>(canonical, pf);
}
control::action<advance, cp::advance> advance_action;