# DAXPY

This example presents a simple but complete example of a FleCSI program that manipulates a distributed data structure. Specifically, the program uses FleCSI to perform a parallel DAXPY (double-precision aX + Y) vector operation. The sequential C++ code looks like this:

```for (int i = 0; i < N; ++i)
y[i] += a*x[i];
```

To further demonstrate FleCSI capabilities and typical program structure we will also allocate and initialize the distributed vectors before the DAXPY operation and report the sum of all elements in Y at the end, as in the following sequential code:

```// Allocate and initialize the two vectors.
std::vector<double> x(N), y(N);
for (size_t i = 0; i < N; ++i) {
x[i] = static_cast<double>(i);
y[i] = 0.0;
}

// Perform the DAXPY operation.
const double a = 12.34;
for (size_t i = 0; i < N; ++i)
y[i] += a*x[i];

// Report the sum over y.
double sum = 0.0;
for (size_t i = 0; i < N; ++i)
sum += y[i];
std::cout << "The sum over all elements in the final vector is " << sum << std::endl;
```

In the above we arbitrarily initialize X[i] ← i and Y[i] ← 0.

For pedagogical purposes the FleCSI version of the above, which we’ll call “FLAXPY”, is expressed slightly differently from how a full application would more naturally be implemented:

• FLAXPY is presented as a single file rather than as separate (and typically multiple) source and header files.

• C++ namespaces are referenced explicitly rather than imported with `using`.

• Some function, method, and variable names are more verbose than they would commonly be.

## Preliminaries

Here’s a simple `CMakeLists.txt` file for building FLAXPY:

```cmake_minimum_required(VERSION 3.12)
project(Flaxpy LANGUAGES CXX C)

set(CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_STANDARD 17)

find_package(FleCSI REQUIRED)

```

FLAXPY is implemented as a single file, `flaxpy.cc`. We begin by including the header files needed to access the data model, execution model, and other FleCSI components:

```#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>
#include <flecsi/run/control.hh>
```

For user convenience, we define a `--length` (abbreviation: `-l`) command-line option for specifying the length of vectors X and Y and with a default of 1,000,000 elements. To do so we declare a variable of type `flecsi::program_option`, templated on the option type, which in this case is `std::size_t`. We name the variable `vector_length` and define it within a `flaxpy` namespace, which other source files—of which there are none in this simple example—could import. `vector_length` will be used at run time to access the vector length.

```// In a larger program, this namespace would typically appear in a header file,
// where the inline keywords are necessary.
namespace flaxpy {

// Let the user specify the vector length on the command line.
inline flecsi::program_option<std::size_t> vector_length(
"Flaxpy-specific Options",
"length,l",
"Specify the length of the vectors to add.",
{{flecsi::option_default, 1000000}});
```

## Data structures

FleCSI does not provide ready-to-use, distributed data-structure types. Rather, it provides “proto data-structure types” called core topologies. These require additional compile-time information, such as the number of dimensions of a multidimensional array, and additional run-time information, such as how to distribute their data, to form a concrete data structure. Applications are expected to define specializations to provide all of this information.

FLAXPY is based on the `user` core topology, so named because it is arguably the simplest core topology that behaves as a user would expect. It is essentially a 1-D vector of user-defined fields with no support for ghost cells. All core topologies specify a `coloring` type, which represents additional run-time data the topology needs to produce a concrete data structure. A specialization must define a `color` member function that accepts whatever parameters make sense for that specialization and returns a `coloring`. The `user` core topology defines its `coloring` type as a `std::vector<std::size_t>` that represents the number of vector indices to assign to each color. (A color is a unit of data upon which a point task operates.) `user` does not require that the specialization provide any compile-time information, but most other core topologies do.

FleCSI provides the `equal_map` utility for dividing indices as equally as possible among colors. The following helper function, still within the `flaxpy` namespace, handles mapping `vector_length` number of indices (see Preliminaries above) onto a given number of colors:

```inline flecsi::util::equal_map
divide_indices_among_colors(flecsi::Color ncolors) {
return {vector_length.value(), ncolors};
}

// Define a distributed vector type called dist_vector as a specialization
```

Given that helper function, constructing a specialization of `user` is trivial. FLAXPY names its specialization (still within the `flaxpy` namespace) `dist_vector`:

```struct dist_vector
: flecsi::topo::specialization<flecsi::topo::user, dist_vector> {
// Return the number of indices to assign to each color.
static coloring color() {
// Specify one color per process, and distribute indices accordingly.
std::vector<std::size_t> ret;
for(auto c : divide_indices_among_colors(flecsi::processes()))
ret.push_back(c.size());
return ret;
}
};
```

Note that the specialization is responsible for choosing the number of colors. `dist_vector`’s `color` method queries FleCSI for the number of processes and uses that value for the color count.

At this point we have what is effectively a distributed 1-D vector data type that is templated over the element type. The next step is to specify the element type. In FleCSI, each element of a data structure comprises one or more fields. One can think of fields as named columns in a tabular representation of data. FLAXPY adds two fields of type `double`: `x_field` and `y_field`. These are added outside of the `flaxpy` namespace, in an anonymous namespace. `flaxpy.cc` uses this anonymous namespace to indicate that its contents are meaningful only locally and not needed by the rest of the application.

```// For clarity we specify flecsi::data::layout::dense as a template
// parameter, but this is in fact the default and would normally be
// omitted.
template<typename T>
using one_field = flecsi::field<T, flecsi::data::layout::dense>;
const one_field<double>::definition<flaxpy::dist_vector> x_field, y_field;
```

`one_field` is defined in the above to save typing, both here and in task definitions (see Tasks below).

Specializations typically require run-time information to produce a usable object. This information may not be available until a number of libraries (FleCSI, Legion, MPI, and the like) have initialized and perhaps synchronized across a distributed system. To allow the lifetime of these objects to be controlled properly, FleCSI imposes a particular means of instantiating a specialization based on what it calls slots. The (topology) slot that will be used within FLAXPY’s Actions are defined within the control policy (discussed next).

## Control flow

Recall from the Control Model section that a FleCSI application’s control flow is defined in terms of a control pointactiontask hierarchy. FLAXPY’s overall control flow is illustrated in Fig. 9. The control points of the Control model are drawn as white rounded rectangles; actions are drawn as blue ellipses; and tasks are drawn as green rectangles. As indicated by the figure, FLAXPY is a simple application and uses a trivial sequence of control points (no looping), trivial DAGs of actions (comprising a single node apiece), and trivial task launches (only one per action).

The bulk of this section is presented in top-down fashion. That is, function invocations are presented in advance of the functions themselves. With the exception of the code appearing in the Control model section, all of the control-flow code listed below appears in an anonymous namespace, again, to indicate that it is meaningful only locally and not needed by the rest of the application.

### Control model

FLAXPY defines three control points: `initialize`, `mul_add`, and `finalize`. These are introduced via an enumerated type, which FLAXPY calls `cp` and defines within the `flaxpy` namespace:

```enum class cp { initialize, mul_add, finalize };
```

FleCSI expects to be able to convert a `cp` to a string by dereferencing it. This requires overloading the `*` operator as follows, still within the `flaxpy` namespace:

```inline const char *
operator*(cp control_point) {
switch(control_point) {
case cp::initialize:
return "initialize";
case cp::finalize:
return "finalize";
}
flog_fatal("invalid control point");
}
```

Once an application defines its control points it specifies a sequential order for them to execute. (The equivalent of a `while` loop can be expressed with `flecsi::control_base::cycle`, and loops can be nested.) FLAXPY indicates with the following code that `initialize` runs first, then `mul_add`, and lastly `finalize`. It also defines a topology slot to hold the field data:

```struct control_policy : flecsi::run::control_base {
using control_points_enum = cp;
struct node_policy {};
using control_points =

dist_vector::slot dist_vector_slot;
};
```

The preceding `control_policy` class is used to define a fully qualified control type that implements the control policy:

```using control = flecsi::run::control<control_policy>;
```

### Actions

Actions, implemented as C++ functions, are associated with control points. The following code associates the `initialize_action` action with the `initialize` control point, the `mul_add_action` action with the `mul_add` control point, and the `finalize_action` action with the `finalize` control point:

```flaxpy::control::action<initialize_action, flaxpy::cp::initialize> init;
flaxpy::control::action<finalize_action, flaxpy::cp::finalize> fin;
```

The variables declared by the preceding code (`init`, `ma`, and `fin`) are never used. They exist only for the side effects induced by instantiating a `flaxpy::control::action`.

The `initialize_action` action uses a coloring slot to initialize the topology slot defined above in Control model, allocating memory for the `dist_vector` specialization. Once this memory is allocated, the action launches an `initialize_vectors_task` task, granting each constituent point task access to a subset of X and Y via the `x_field` and `y_field` fields declared in Data structures.

```void
initialize_action(flaxpy::control_policy & policy) {
// Declare a coloring of the distributed vector.
flaxpy::dist_vector::cslot dist_vector_cslot;
dist_vector_cslot.allocate();
policy.dist_vector_slot.allocate(dist_vector_cslot.get());
x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}
```

The `mul_add_action` action spawns `mul_add_task` tasks, passing then a scalar constant a directly and access to a subset of X and Y via `x_field` and `y_field`:

```void
const double a = 12.34; // Arbitrary scalar value to multiply
a, x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}
```

The third and final action, `finalize_action`, sums the elements of Y by initiating a global reduction. Because they represent a global reduction, the `reduce_y_task` tasks are spawned using `flecsi::reduce` instead of `flecsi::execute` as in the preceding two actions. `finalize_action` uses the FleCSI logging facility, FLOG, to output the sum. Finally, the function deallocates the memory previously allocated by `initialize_action`.

```void
finalize_action(flaxpy::control_policy & policy) {
y_field(policy.dist_vector_slot))
.get();
flog(info) << "The sum over all elements in the final vector is " << sum
<< std::endl;
}
```

Tasks are functions that collectively and (usually) concurrently process a distributed data structure. `flecsi::execute`, as seen in the preceding section, spawns one point task per color. Each point task is individually responsible for processing a subspace (or “color”) of the distributed data structure. Because FleCSI follows Legion’s data and concurrency model, a point task is provided access to a subspace via an accessor templated on an access right: `ro` (read only), `wo` (write only), `rw` (read/write), or `na` (no access).

The `initialize_vectors_task` task requests write-only access to subspaces of X and Y because write-only access is necessary to initialize a field. The task uses `divide_indices_among_colors`, defined above in Data structures, to compute the number of vector indices to which a point task has access and the global index corresponding to local index 0. Once these are known, the task initializes X[i] ← i and Y[i] ← 0 over its subset of the distributed X and Y vectors. FLAXPY uses FleCSI’s `forall` macro to locally parallelize (e.g., using thread parallelism) the initialization of Y.

```void
one_field<double>::accessor<flecsi::wo> y_acc) {
// Arbitrarily initialize x[i] = i and y[i] = 0.  We use a forall
// for the latter because it can run in parallel without access to
// the index variable.
auto p = x_acc.span().begin();
for(size_t i :
flaxpy::divide_indices_among_colors(flecsi::colors())[flecsi::color()])
*p++ = i;
forall(elt, y_acc.span(), "init_y") { elt = 0; };
}
```

`mul_add_task` is the simplest of FLAXPY’s three tasks but also the one that performs the core DAXPY computation. It accepts a scalar a and requests read-only access to a subspace of X and read/write access to a subspace of Y. The task then computes Y[i] ← aX[i] + Y[i] over its subset of the distributed X and Y vectors.

```void
one_field<double>::accessor<flecsi::ro> x_acc,
one_field<double>::accessor<flecsi::rw> y_acc) {
std::size_t num_local_elts = x_acc.span().size();
for(std::size_t i = 0; i < num_local_elts; ++i)
y_acc[i] += a * x_acc[i];
}
```

The third and final task, `reduce_y_task`, computes and returns the sum of a subspace of Y. For this it requests read-only access to the subspace and uses FleCSI’s `reduceall` macro to locally parallelize (e.g., using thread parallelism) the summation.

```double
auto local_sum = reduceall(elt,
accum,
y_acc.span(),
flecsi::exec::fold::sum,
double,
accum(elt);
};
return local_sum;
}
```

## Program initialization

FLAXPY’s `main` function, expressed outside of any namespace, is largely boilerplate. It initializes FleCSI, executes the FLAXPY code according to the control flow defined above in Control flow, and finalizes FleCSI.

```int
main(int argc, char ** argv) {
// Initialize the FleCSI run-time system.
auto status = flecsi::initialize(argc, argv);
status = flaxpy::control::check_status(status);
if(status != flecsi::run::status::success) {
return status < flecsi::run::status::clean ? 0 : status;
}

// Execute our code control point by control point.
status = flecsi::start(flaxpy::control::execute);

// Finalize the FleCSI run-time system.
flecsi::finalize();
return status;
}
```

## Usage

FLAXPY can be built and run as follows:

```cd tutorial/standalone/flaxpy
mkdir build
cd build
cmake ..
make -j
mpiexec -n 8 ./flaxpy
```

Depending on your installation, you may need to execute `mpirun`, `srun`, or another launcher instead of `mpiexec`. The `-n 8` specifies 8-way process parallelism. The output should look like this:

```The sum over all elements in the final vector is 6.16999e+12
```

Try

• passing `--help` to `flaxpy` to view the supported command-line options,

• passing `--length=2000000` to `flaxpy` to run DAXPY on a vector that is twice as long as the default,

• running with a different amount of process parallelism, perhaps across multiple nodes in a cluster, or

• [advanced] modifying `flaxpy.cc` to construct a specialization of `narray` (a multidimensional-array core topology) instead of a specialization of `user`. See the source code to the Poisson example, in particular `tutorial/standalone/poisson/include/specialization/mesh.hh`, for an example of the use of `narray`.