DAXPY
This section presents a simple but complete example of a FleCSI program that manipulates a distributed data structure. Specifically, the program uses FleCSI to perform a parallel DAXPY (double-precision a⋅X + Y) vector operation. The sequential C++ code looks like this:
for (int i = 0; i < N; ++i)
y[i] += a*x[i];
To further demonstrate FleCSI capabilities and typical program structure we will also allocate and initialize the distributed vectors before the DAXPY operation and report the sum of all elements in Y at the end, as in the following sequential code:
// Allocate and initialize the two vectors.
std::vector<double> x(N), y(N);
for (size_t i = 0; i < N; ++i) {
x[i] = static_cast<double>(i);
y[i] = 0.0;
}
// Perform the DAXPY operation.
const double a = 12.34;
for (size_t i = 0; i < N; ++i)
y[i] += a*x[i];
// Report the sum over y.
double sum = 0.0;
for (size_t i = 0; i < N; ++i)
sum += y[i];
std::cout << "The sum over all elements in the final vector is " << sum << std::endl;
In the above we arbitrarily initialize X[i] ← i and Y[i] ← 0.
For pedagogical purposes the FleCSI version of the above, which we’ll call “FLAXPY”, is expressed slightly differently from how a full application would more naturally be implemented:

- FLAXPY is presented as a single file rather than as separate (and typically multiple) source and header files.
- C++ namespaces are referenced explicitly rather than imported with using.
- Some function, method, and variable names are more verbose than they would commonly be.
Preliminaries
Here’s a simple CMakeLists.txt file for building FLAXPY:
cmake_minimum_required(VERSION 3.12)
project(Flaxpy LANGUAGES CXX C)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
find_package(FleCSI REQUIRED)
add_executable(flaxpy flaxpy.cc)
target_link_libraries(flaxpy FleCSI::FleCSI)
FLAXPY is implemented as a single file, flaxpy.cc. We begin by including the header files needed to access the data model, execution model, and other FleCSI components:
#include <flecsi/data.hh>
#include <flecsi/execution.hh>
#include <flecsi/flog.hh>
#include <flecsi/runtime.hh>
For user convenience, we define a --length (abbreviation: -l) command-line option for specifying the length of vectors X and Y, with a default of 1,000,000 elements. To do so we declare a variable of type flecsi::program_option, templated on the option type, which in this case is std::size_t. We name the variable vector_length and define it within a flaxpy namespace, which other source files (of which there are none in this simple example) could import. vector_length will be used at run time to access the vector length.
// In a larger program, this namespace would typically appear in a header file,
// where the inline keywords are necessary.
namespace flaxpy {
// Let the user specify the vector length on the command line.
inline flecsi::program_option<std::size_t> vector_length(
"Flaxpy-specific Options",
"length,l",
"Specify the length of the vectors to add.",
{{flecsi::option_default, 1000000}});
Data structures
FleCSI does not provide ready-to-use, distributed data-structure types. Rather, it provides “proto data-structure types” called core topologies. These require additional compile-time information, such as the number of dimensions of a multidimensional array, and additional run-time information, such as how to distribute their data, to form a concrete data structure. Applications are expected to define specializations to provide all of this information.
FLAXPY is based on the user core topology, so named because it is arguably the simplest core topology that behaves as a user would expect. It is essentially a 1-D vector of user-defined fields with no support for ghost cells.
All core topologies specify a coloring type, which represents additional run-time data the topology needs to produce a concrete data structure. A specialization must define a color member function that accepts whatever parameters make sense for that specialization and returns a coloring. The user core topology defines its coloring type as a std::vector<std::size_t> that represents the number of vector indices to assign to each color. (A color is a unit of data upon which a point task operates.) user does not require that the specialization provide any compile-time information, but most other core topologies do.
FleCSI provides the equal_map utility for dividing indices as equally as possible among colors. The following helper function, still within the flaxpy namespace, handles mapping vector_length indices (see Preliminaries above) onto a given number of colors:
inline flecsi::util::equal_map
divide_indices_among_colors(flecsi::Color ncolors) {
return {vector_length.value(), ncolors};
}
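For intuition, here is a small illustration of equal_map. This snippet is not part of flaxpy.cc, and the assumption that the larger blocks come first is ours:
// Hypothetical snippet: divide 10 indices as equally as possible among
// 3 colors, yielding contiguous blocks of sizes 4, 3, and 3.
flecsi::util::equal_map em(10, 3);
for(auto block : em)
  flog(info) << "a block of " << block.size() << " indices" << std::endl;
// Indexing the map with a color yields that color's block of global
// indices, a property that initialize_vectors_task (below) relies on:
for(std::size_t i : em[0])
  flog(info) << "global index " << i << std::endl;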
Given that helper function, constructing a specialization of user is trivial. FLAXPY names its specialization (still within the flaxpy namespace) dist_vector:
// Define a distributed vector type called dist_vector as a specialization
// of the user core topology.
struct dist_vector
  : flecsi::topo::specialization<flecsi::topo::user, dist_vector> {
// Return the number of indices to assign to each color.
static coloring color() {
// Specify one color per process, and distribute indices accordingly.
std::vector<std::size_t> ret;
for(auto c : divide_indices_among_colors(flecsi::processes()))
ret.push_back(c.size());
return ret;
}
};
Note that the specialization is responsible for choosing the number of colors. dist_vector’s color method queries FleCSI for the number of processes and uses that value for the color count.
At this point we have what is effectively a distributed 1-D vector data type that is templated over the element type. The next step is to specify the element type. In FleCSI, each element of a data structure comprises one or more fields. One can think of fields as named columns in a tabular representation of data. FLAXPY adds two fields of type double: x_field and y_field. These are added outside of the flaxpy namespace, in an anonymous namespace. flaxpy.cc uses this anonymous namespace to indicate that its contents are meaningful only locally and not needed by the rest of the application.
// For clarity we specify flecsi::data::layout::dense as a template
// parameter, but this is in fact the default and would normally be
// omitted.
using one_field = flecsi::field<double, flecsi::data::layout::dense>;
const one_field::definition<flaxpy::dist_vector> x_field, y_field;
one_field is defined in the above to save typing, both here and in task definitions (see Tasks below).
Specializations typically require run-time information to produce a usable object. This information may not be available until a number of libraries (FleCSI, Legion, MPI, and the like) have initialized and perhaps synchronized across a distributed system. To allow the lifetime of these objects to be controlled properly, FleCSI imposes a particular means of instantiating a specialization based on what it calls slots. The topology slot that FLAXPY’s actions will use is defined within the control policy (discussed next).
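Concretely, and anticipating the Control flow section below, FLAXPY declares its slot as a data member of the control policy and fills it in an action once run-time information is available:
// Declared in FLAXPY's control policy (see Control flow):
dist_vector::slot dist_vector_slot;
// Executed in FLAXPY's initialize action:
policy.dist_vector_slot.allocate(flaxpy::dist_vector::mpi_coloring());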
Control flow
Recall from the Control Model section that a FleCSI application’s control flow is defined in terms of a control point—action—task hierarchy. FLAXPY’s overall control flow is illustrated in Fig. 12. The control points of the Control model are drawn as white rounded rectangles; actions are drawn as blue ellipses; and tasks are drawn as green rectangles. As indicated by the figure, FLAXPY is a simple application and uses a trivial sequence of control points (no looping), trivial DAGs of actions (comprising a single node apiece), and trivial task launches (only one per action).
The bulk of this section is presented in top-down fashion. That is, function invocations are presented in advance of the functions themselves. With the exception of the code appearing in the Control model section, all of the control-flow code listed below appears in an anonymous namespace, again, to indicate that it is meaningful only locally and not needed by the rest of the application.
Control model
FLAXPY defines three control points: initialize, mul_add, and finalize. These are introduced via an enumerated type, which FLAXPY calls cp and defines within the flaxpy namespace:
enum class cp { initialize, mul_add, finalize };
FleCSI expects to be able to convert a cp to a string by dereferencing it. This requires overloading the * operator as follows, still within the flaxpy namespace:
inline const char *
operator*(cp control_point) {
switch(control_point) {
case cp::initialize:
return "initialize";
case cp::mul_add:
return "mul_add";
case cp::finalize:
return "finalize";
}
flog_fatal("invalid control point");
}
Once an application defines its control points, it specifies a sequential order for them to execute. (The equivalent of a while loop can be expressed with flecsi::control_base::cycle, and loops can be nested.) FLAXPY indicates with the following code that initialize runs first, then mul_add, and lastly finalize. It also defines a topology slot to hold the field data:
struct control_policy : flecsi::run::control_base {
using control_points_enum = cp;
using control_points =
list<point<cp::initialize>, point<cp::mul_add>, point<cp::finalize>>;
dist_vector::slot dist_vector_slot;
};
The preceding control_policy class is used to define a fully qualified control type that implements the control policy:
using control = flecsi::run::control<control_policy>;
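Although FLAXPY needs no loop, a control policy with a cycle might look roughly like the following sketch. The looping_policy and keep_going names are invented for illustration, and we assume that cycle accepts a predicate function as its first template parameter:
// Hypothetical sketch only; FLAXPY itself uses the straight-line policy above.
struct looping_policy : flecsi::run::control_base {
  using control_points_enum = cp;
  static bool keep_going(); // predicate: return false to exit the loop
  using control_points = list<point<cp::initialize>,
    cycle<keep_going, point<cp::mul_add>>,
    point<cp::finalize>>;
  dist_vector::slot dist_vector_slot;
};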
Actions
Actions, implemented as C++ functions, are associated with control points. The following code associates the initialize_action action with the initialize control point, the mul_add_action action with the mul_add control point, and the finalize_action action with the finalize control point:
flaxpy::control::action<initialize_action, flaxpy::cp::initialize> init;
flaxpy::control::action<mul_add_action, flaxpy::cp::mul_add> ma;
flaxpy::control::action<finalize_action, flaxpy::cp::finalize> fin;
The variables declared by the preceding code (init, ma, and fin) are never used. They exist only for the side effects induced by instantiating a flaxpy::control::action.
The initialize_action action uses the slot and color function defined above in Data structures to allocate memory for the dist_vector specialization. (The mpi_coloring expression in the code below invokes the specialization’s color function to construct the coloring.) Once this memory is allocated, the action launches an initialize_vectors_task task, granting each constituent point task access to a subset of X and Y via the x_field and y_field fields declared in Data structures.
void
initialize_action(flaxpy::control_policy & policy) {
policy.dist_vector_slot.allocate(flaxpy::dist_vector::mpi_coloring());
flecsi::execute<initialize_vectors_task>(
x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}
The mul_add_action action spawns mul_add_task tasks, passing them a scalar constant a directly and access to a subset of X and Y via x_field and y_field:
void
mul_add_action(flaxpy::control_policy & policy) {
const double a = 12.34; // Arbitrary scalar value to multiply
flecsi::execute<mul_add_task>(
a, x_field(policy.dist_vector_slot), y_field(policy.dist_vector_slot));
}
The third and final action, finalize_action, sums the elements of Y by initiating a global reduction. Because they represent a global reduction, the reduce_y_task tasks are spawned using flecsi::reduce instead of flecsi::execute as in the preceding two actions. flecsi::reduce returns a future; invoking its get method waits for and returns the globally reduced value. finalize_action uses the FleCSI logging facility, FLOG, to output the sum. The memory previously allocated by initialize_action is released automatically when the control policy, and with it the topology slot, is destroyed at the end of the run.
void
finalize_action(flaxpy::control_policy & policy) {
double sum = flecsi::reduce<reduce_y_task, flecsi::exec::fold::sum>(
y_field(policy.dist_vector_slot))
.get();
flog(info) << "The sum over all elements in the final vector is " << sum
<< std::endl;
}
Tasks
Tasks are functions that collectively and (usually) concurrently process a distributed data structure. flecsi::execute, as seen in the preceding section, spawns one point task per color. Each point task is individually responsible for processing a subspace (or “color”) of the distributed data structure. Because FleCSI follows Legion’s data and concurrency model, a point task is provided access to a subspace via an accessor templated on an access right: ro (read only), wo (write only), rw (read/write), or na (no access).
The initialize_vectors_task task requests write-only access to subspaces of X and Y because write-only access is necessary to initialize a field. The task uses divide_indices_among_colors, defined above in Data structures, to compute the number of vector indices to which a point task has access and the global index corresponding to local index 0. Once these are known, the task initializes X[i] ← i and Y[i] ← 0 over its subset of the distributed X and Y vectors. FLAXPY uses FleCSI’s forall macro to locally parallelize the initialization of Y. (This example works with thread parallelism but not on a GPU, since one field would be accessed on the host and the other on the device.)
void
initialize_vectors_task(one_field::accessor<flecsi::wo> x_acc,
one_field::accessor<flecsi::wo> y_acc) {
// Arbitrarily initialize x[i] = i and y[i] = 0. We use a forall
// for the latter because it can run in parallel without access to
// the index variable.
auto p = x_acc.span().begin();
for(size_t i :
flaxpy::divide_indices_among_colors(flecsi::colors())[flecsi::color()])
*p++ = i;
forall(elt, y_acc.span(), "init_y") { elt = 0; };
}
mul_add_task is the simplest of FLAXPY’s three tasks but also the one that performs the core DAXPY computation. It accepts a scalar a and requests read-only access to a subspace of X and read/write access to a subspace of Y. The task then computes Y[i] ← a⋅X[i] + Y[i] over its subset of the distributed X and Y vectors.
void
mul_add_task(double a,
one_field::accessor<flecsi::ro> x_acc,
one_field::accessor<flecsi::rw> y_acc) {
std::size_t num_local_elts = x_acc.span().size();
for(std::size_t i = 0; i < num_local_elts; ++i)
y_acc[i] += a * x_acc[i];
}
The third and final task, reduce_y_task, computes and returns the sum of a subspace of Y. For this it requests read-only access to the subspace and uses FleCSI’s reduceall macro to locally parallelize the summation (e.g., using thread parallelism). Within the macro’s body, each element elt is folded into the running sum by invoking the accumulator accum.
double
reduce_y_task(one_field::accessor<flecsi::ro> y_acc) {
auto local_sum = reduceall(elt,
accum,
y_acc.span(),
flecsi::exec::fold::sum,
double,
"reduce_y_task") {
accum(elt);
};
return local_sum;
}
Program initialization
FLAXPY’s main function, expressed outside of any namespace, is largely boilerplate. It initializes FleCSI, executes the FLAXPY code according to the control flow defined above in Control flow, and finalizes FleCSI.
int
main(int argc, char ** argv) {
// Initialize the FleCSI run-time system.
flecsi::getopt()(argc, argv);
const flecsi::run::dependencies_guard dg;
const flecsi::runtime run;
flecsi::flog::add_output_stream("clog", std::clog, true);
// Execute our code control point by control point.
return run.control<flaxpy::control>();
}
Usage
FLAXPY can be built and run as follows:
cd tutorial/standalone/flaxpy
mkdir build
cd build
cmake ..
make -j
mpiexec -n 8 ./flaxpy
Depending on your installation, you may need to execute mpirun, srun, or another launcher instead of mpiexec. The -n 8 specifies 8-way process parallelism. The output should look like this:
The sum over all elements in the final vector is 6.16999e+12
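This value is easy to check: because X[i] = i, the final Y[i] = a⋅i, so the sum is a⋅N(N−1)/2 = 12.34 × 499,999,500,000 ≈ 6.16999×10^12 for the default N = 1,000,000.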
Try

- passing --help to flaxpy to view the supported command-line options,
- passing --length=2000000 to flaxpy to run DAXPY on a vector that is twice as long as the default,
- running with a different amount of process parallelism, perhaps across multiple nodes in a cluster, or
- [advanced] modifying flaxpy.cc to construct a specialization of narray (a multidimensional-array core topology) instead of a specialization of user. See the source code to the Poisson example, in particular tutorial/standalone/poisson/include/specialization/mesh.hh, for an example of the use of narray.