DAXPY

This example presents a simple but complete example of a FleCSI program that manipulates a distributed data structure. Specifically, the program uses FleCSI to perform a parallel DAXPY (double-precision aX + Y) vector operation. The sequential C++ code looks like this:

for (int i = 0; i < N; ++i)
  y[i] += a*x[i];

To further demonstrate FleCSI capabilities and typical program structure we will also allocate and initialize the distributed vectors before the DAXPY operation and report the sum of all elements in Y at the end, as in the following sequential code:

// Allocate and initialize the two vectors.
std::vector<double> x(N), y(N);
for (size_t i = 0; i < N; ++i) {
  x[i] = static_cast<double>(i);
  y[i] = 0.0;
}

// Perform the DAXPY operation.
const double a = 12.34;
for (size_t i = 0; i < N; ++i)
  y[i] += a*x[i];

// Report the sum over y.
double sum = 0.0;
for (size_t i = 0; i < N; ++i)
  sum += y[i];
std::cout << "The sum over all elements in the final vector is " << sum << std::endl;

In the above we arbitrarily initialize X[i] ← i and Y[i] ← 0.

For pedagogical purposes the FleCSI version of the above, which we’ll call “FLAXPY”, is expressed slightly differently from how a full application would more naturally be implemented:

  • FLAXPY is presented as a single file rather than as separate (and typically multiple) source and header files.

  • C++ namespaces are referenced explicitly rather than imported with using.

  • Some function, method, and variable names are more verbose than they would commonly be.

Preliminaries

Here’s a simple CMakeLists.txt file for building FLAXPY:

cmake_minimum_required(VERSION 3.12)
project(Flaxpy LANGUAGES CXX C)

set(CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_STANDARD 17)

find_package(FleCSI REQUIRED)

add_executable(flaxpy flaxpy.cc)
target_link_libraries(flaxpy FleCSI::FleCSI)

FLAXPY is implemented as a single file, flaxpy.cc. We begin by including the header files needed to access the data model, execution model, and other FleCSI components:

#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>
#include <flecsi/runtime.hh>

For user convenience, we define a --length (abbreviation: -l) command-line option for specifying the length of vectors X and Y and with a default of 1,000,000 elements. To do so we declare a variable of type flecsi::program_option, templated on the option type, which in this case is std::size_t. We name the variable vector_length and define it within a flaxpy namespace, which other source files—of which there are none in this simple example—could import. vector_length will be used at run time to access the vector length.

// In a larger program, this namespace would typically appear in a header file,
// where the inline keywords are necessary.
namespace flaxpy {

// Let the user specify the vector length on the command line.
inline flecsi::program_option<std::size_t> vector_length(
  "Flaxpy-specific Options",
  "length,l",
  "Specify the length of the vectors to add.",
  {{flecsi::option_default, 1000000}});

Data structures

FleCSI does not provide ready-to-use, distributed data-structure types. Rather, it provides “proto data-structure types” called core topologies. These require additional compile-time information, such as the number of dimensions of a multidimensional array, and additional run-time information, such as how to distribute their data, to form a concrete data structure. Applications are expected to define specializations to provide all of this information.

FLAXPY is based on the user core topology, so named because it is arguably the simplest core topology that behaves as a user would expect. It is essentially a 1-D vector of user-defined fields with no support for ghost cells. All core topologies specify a coloring type, which represents additional run-time data the topology needs to produce a concrete data structure. A specialization must define a color member function that accepts whatever parameters make sense for that specialization and returns a coloring. The user core topology defines its coloring type as a std::vector<std::size_t> that represents the number of vector indices to assign to each color. (A color is a unit of data upon which a point task operates.) user does not require that the specialization provide any compile-time information, but most other core topologies do.

FleCSI provides the equal_map utility for dividing indices as equally as possible among colors. The following helper function, still within the flaxpy namespace, handles mapping vector_length number of indices (see Preliminaries above) onto a given number of colors:

inline flecsi::util::equal_map
divide_indices_among_colors(flecsi::Color ncolors) {
  return {vector_length.value(), ncolors};
}

// Define a distributed vector type called dist_vector as a specialization

Given that helper function, constructing a specialization of user is trivial. FLAXPY names its specialization (still within the flaxpy namespace) dist_vector:

struct dist_vector
  : flecsi::topo::specialization<flecsi::topo::user, dist_vector> {
  // Return the number of indices to assign to each color.
  static coloring color() {
    // Specify one color per process, and distribute indices accordingly.
    std::vector<std::size_t> ret;
    for(auto c : divide_indices_among_colors(flecsi::processes()))
      ret.push_back(c.size());
    return ret;
  }
};

Note that the specialization is responsible for choosing the number of colors. dist_vector’s color method queries FleCSI for the number of processes and uses that value for the color count.

At this point we have what is effectively a distributed 1-D vector data type that is templated over the element type. The next step is to specify the element type. In FleCSI, each element of a data structure comprises one or more fields. One can think of fields as named columns in a tabular representation of data. FLAXPY adds two fields of type double: x_field and y_field. These are added outside of the flaxpy namespace, in an anonymous namespace. flaxpy.cc uses this anonymous namespace to indicate that its contents are meaningful only locally and not needed by the rest of the application.

// For clarity we specify flecsi::data::layout::dense as a template
// parameter, but this is in fact the default and would normally be
// omitted.
using one_field = flecsi::field<double, flecsi::data::layout::dense>;
const one_field::definition<flaxpy::dist_vector> x_field, y_field;

one_field is defined in the above to save typing, both here and in task definitions (see Tasks below).

Specializations typically require run-time information to produce a usable object. This information may not be available until a number of libraries (FleCSI, Legion, MPI, and the like) have initialized and perhaps synchronized across a distributed system. To allow the lifetime of these objects to be controlled properly, FleCSI imposes a particular means of instantiating a specialization based on what it calls slots. The (topology) slot that will be used within FLAXPY’s Actions are defined within the control policy (discussed next).

Control flow

Recall from the Control Model section that a FleCSI application’s control flow is defined in terms of a control pointactiontask hierarchy. FLAXPY’s overall control flow is illustrated in Fig. 12. The control points of the Control model are drawn as white rounded rectangles; actions are drawn as blue ellipses; and tasks are drawn as green rectangles. As indicated by the figure, FLAXPY is a simple application and uses a trivial sequence of control points (no looping), trivial DAGs of actions (comprising a single node apiece), and trivial task launches (only one per action).

../../_images/flaxpy-control-model.svg

Fig. 12 FLAXPY control model

The bulk of this section is presented in top-down fashion. That is, function invocations are presented in advance of the functions themselves. With the exception of the code appearing in the Control model section, all of the control-flow code listed below appears in an anonymous namespace, again, to indicate that it is meaningful only locally and not needed by the rest of the application.

Control model

FLAXPY defines three control points: initialize, mul_add, and finalize. These are introduced via an enumerated type, which FLAXPY calls cp and defines within the flaxpy namespace:

enum class cp { initialize, mul_add, finalize };

FleCSI expects to be able to convert a cp to a string by dereferencing it. This requires overloading the * operator as follows, still within the flaxpy namespace:

inline const char *
operator*(cp control_point) {
  switch(control_point) {
    case cp::initialize:
      return "initialize";
    case cp::mul_add:
      return "mul_add";
    case cp::finalize:
      return "finalize";
  }
  flog_fatal("invalid control point");
}

Once an application defines its control points it specifies a sequential order for them to execute. (The equivalent of a while loop can be expressed with flecsi::control_base::cycle, and loops can be nested.) FLAXPY indicates with the following code that initialize runs first, then mul_add, and lastly finalize. It also defines a topology slot to hold the field data:

struct control_policy : flecsi::run::control_base {
  using control_points_enum = cp;
  using control_points =
    list<point<cp::initialize>, point<cp::mul_add>, point<cp::finalize>>;

  dist_vector::slot dist_vector_slot;
};

The preceding control_policy class is used to define a fully qualified control type that implements the control policy:

using control = flecsi::run::control<control_policy>;

Actions

Actions, implemented as C++ functions, are associated with control points. The following code associates the initialize_action action with the initialize control point, the mul_add_action action with the mul_add control point, and the finalize_action action with the finalize control point:

flaxpy::control::action<initialize_action, flaxpy::cp::initialize> init;
flaxpy::control::action<mul_add_action, flaxpy::cp::mul_add> ma;
flaxpy::control::action<finalize_action, flaxpy::cp::finalize> fin;

The variables declared by the preceding code (init, ma, and fin) are never used. They exist only for the side effects induced by instantiating a flaxpy::control::action.

The initialize_action action uses the slot and color function defined above in Data structures to allocate memory for the dist_vector specialization. Once this memory is allocated, the action launches an initialize_vectors_task task, granting each constituent point task access to a subset of X and Y via the x_field and y_field fields declared in Data structures.

void
initialize_action(flaxpy::control_policy & policy) {
  policy.dist_vector_slot.allocate(flaxpy::dist_vector::mpi_coloring());
  flecsi::execute<initialize_vectors_task>(
    x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}

The mul_add_action action spawns mul_add_task tasks, passing then a scalar constant a directly and access to a subset of X and Y via x_field and y_field:

void
mul_add_action(flaxpy::control_policy & policy) {
  const double a = 12.34; // Arbitrary scalar value to multiply
  flecsi::execute<mul_add_task>(
    a, x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}

The third and final action, finalize_action, sums the elements of Y by initiating a global reduction. Because they represent a global reduction, the reduce_y_task tasks are spawned using flecsi::reduce instead of flecsi::execute as in the preceding two actions. finalize_action uses the FleCSI logging facility, FLOG, to output the sum. Finally, the function deallocates the memory previously allocated by initialize_action.

void
finalize_action(flaxpy::control_policy & policy) {
  double sum = flecsi::reduce<reduce_y_task, flecsi::exec::fold::sum>(
    y_field(policy.dist_vector_slot))
                 .get();
  flog(info) << "The sum over all elements in the final vector is " << sum
             << std::endl;
}

Tasks

Tasks are functions that collectively and (usually) concurrently process a distributed data structure. flecsi::execute, as seen in the preceding section, spawns one point task per color. Each point task is individually responsible for processing a subspace (or “color”) of the distributed data structure. Because FleCSI follows Legion’s data and concurrency model, a point task is provided access to a subspace via an accessor templated on an access right: ro (read only), wo (write only), rw (read/write), or na (no access).

The initialize_vectors_task task requests write-only access to subspaces of X and Y because write-only access is necessary to initialize a field. The task uses divide_indices_among_colors, defined above in Data structures, to compute the number of vector indices to which a point task has access and the global index corresponding to local index 0. Once these are known, the task initializes X[i] ← i and Y[i] ← 0 over its subset of the distributed X and Y vectors. FLAXPY uses FleCSI’s forall macro to locally parallelize the initialization of Y. (This example works with thread parallelism but not on a GPU, since one field would be accessed on the host and the other on the device.)

void
initialize_vectors_task(one_field::accessor<flecsi::wo> x_acc,
  one_field::accessor<flecsi::wo> y_acc) {
  // Arbitrarily initialize x[i] = i and y[i] = 0.  We use a forall
  // for the latter because it can run in parallel without access to
  // the index variable.
  auto p = x_acc.span().begin();
  for(size_t i :
    flaxpy::divide_indices_among_colors(flecsi::colors())[flecsi::color()])
    *p++ = i;
  forall(elt, y_acc.span(), "init_y") { elt = 0; };
}

mul_add_task is the simplest of FLAXPY’s three tasks but also the one that performs the core DAXPY computation. It accepts a scalar a and requests read-only access to a subspace of X and read/write access to a subspace of Y. The task then computes Y[i] ← aX[i] + Y[i] over its subset of the distributed X and Y vectors.

void
mul_add_task(double a,
  one_field::accessor<flecsi::ro> x_acc,
  one_field::accessor<flecsi::rw> y_acc) {
  std::size_t num_local_elts = x_acc.span().size();
  for(std::size_t i = 0; i < num_local_elts; ++i)
    y_acc[i] += a * x_acc[i];
}

The third and final task, reduce_y_task, computes and returns the sum of a subspace of Y. For this it requests read-only access to the subspace and uses FleCSI’s reduceall macro to locally parallelize (e.g., using thread parallelism) the summation.

double
reduce_y_task(one_field::accessor<flecsi::ro> y_acc) {
  auto local_sum = reduceall(elt,
    accum,
    y_acc.span(),
    flecsi::exec::fold::sum,
    double,
    "reduce_y_task") {
    accum(elt);
  };
  return local_sum;
}

Program initialization

FLAXPY’s main function, expressed outside of any namespace, is largely boilerplate. It initializes FleCSI, executes the FLAXPY code according to the control flow defined above in Control flow, and finalizes FleCSI.

int
main(int argc, char ** argv) {
  // Initialize the FleCSI run-time system.
  flecsi::getopt()(argc, argv);
  const flecsi::run::dependencies_guard dg;
  const flecsi::runtime run;
  flecsi::flog::add_output_stream("clog", std::clog, true);
  // Execute our code control point by control point.
  return run.control<flaxpy::control>();
}

Usage

FLAXPY can be built and run as follows:

cd tutorial/standalone/flaxpy
mkdir build
cd build
cmake ..
make -j
mpiexec -n 8 ./flaxpy

Depending on your installation, you may need to execute mpirun, srun, or another launcher instead of mpiexec. The -n 8 specifies 8-way process parallelism. The output should look like this:

The sum over all elements in the final vector is 6.16999e+12

Try

  • passing --help to flaxpy to view the supported command-line options,

  • passing --length=2000000 to flaxpy to run DAXPY on a vector that is twice as long as the default,

  • running with a different amount of process parallelism, perhaps across multiple nodes in a cluster, or

  • [advanced] modifying flaxpy.cc to construct a specialization of narray (a multidimensional-array core topology) instead of a specialization of user. See the source code to the Poisson example, in particular tutorial/standalone/poisson/include/specialization/mesh.hh, for an example of the use of narray.