DAXPY
*****
This section presents a simple but complete example of a FleCSI
program that manipulates a distributed data structure. Specifically,
the program uses FleCSI to perform a parallel `DAXPY
`_
(double-precision *a*\ ⋅\ *X* + *Y*) vector operation. The sequential
C++ code looks like this:
.. code:: C++

   for (int i = 0; i < N; ++i)
     y[i] += a*x[i];

To further demonstrate FleCSI capabilities and typical program
structure, we will also allocate and initialize the distributed vectors
before the DAXPY operation and report the sum of all elements in *Y*
at the end, as in the following sequential code:
.. code:: C++

   // Allocate and initialize the two vectors.
   std::vector<double> x(N), y(N);
   for (size_t i = 0; i < N; ++i) {
     x[i] = static_cast<double>(i);
     y[i] = 0.0;
   }

   // Perform the DAXPY operation.
   const double a = 12.34;
   for (size_t i = 0; i < N; ++i)
     y[i] += a*x[i];

   // Report the sum over y.
   double sum = 0.0;
   for (size_t i = 0; i < N; ++i)
     sum += y[i];
   std::cout << "The sum over all elements in the final vector is " << sum << std::endl;

In the above we arbitrarily initialize *X*\ [*i*] ← *i* and *Y*\ [*i*]
← 0.
For pedagogical purposes the FleCSI version of the above, which we'll
call "FLAXPY", is expressed slightly differently from how a full
application would more naturally be implemented:
* FLAXPY is presented as a single file rather than as separate (and
typically multiple) source and header files.
* C++ namespaces are referenced explicitly rather than imported with
``using``.
* Some function, method, and variable names are more verbose than they
would commonly be.
Preliminaries
+++++++++++++
Here's a simple ``CMakeLists.txt`` file for building FLAXPY:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/CMakeLists.txt
:language: cmake
:start-at: cmake_minimum
:end-at: target_link
FLAXPY is implemented as a single file, ``flaxpy.cc``. We begin by
including the header files needed to access the data model, execution
model, and other FleCSI components:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: #include
:end-at: /run
For user convenience, we define a ``--length`` (abbreviation: ``-l``)
command-line option for specifying the length of vectors *X* and *Y*,
with a default of 1,000,000 elements. To do so we declare a
variable of type ``flecsi::program_option``, templated on the option
type, which in this case is ``std::size_t``. We name the variable
``vector_length`` and define it within a ``flaxpy`` namespace, which
other source files—of which there are none in this simple
example—could import. ``vector_length`` will be used at run time to
access the vector length.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: larger program
:end-at: ;
Data structures
+++++++++++++++
FleCSI does not provide ready-to-use, distributed data-structure types.
Rather, it provides "proto data-structure types" called *core topologies*.
These require additional compile-time information, such as the number
of dimensions of a multidimensional array, and additional run-time
information, such as how to distribute their data, to form a concrete
data structure. Applications are expected to define *specializations*
to provide all of this information.
FLAXPY is based on the ``user`` core topology, so named because it is
arguably the simplest core topology that behaves as a user would expect.
It is essentially a 1-D vector of user-defined *fields* with
no support for ghost cells.
All core topologies specify a ``coloring`` type, which represents
additional run-time data the topology needs
to produce a concrete data structure.
A specialization must define a ``color`` member function that
accepts whatever parameters make sense for that specialization
and returns a ``coloring``.
The ``user`` core topology defines its ``coloring`` type as
a ``std::vector<std::size_t>`` that represents
the number of vector indices to assign to each *color*.
(A color is a unit of data upon which a point task operates.)
``user`` does not require that the specialization provide
any compile-time information,
but most other core topologies do.
FleCSI provides the ``equal_map`` utility for dividing indices as equally as possible among colors.
The following helper function, still within the ``flaxpy`` namespace,
maps ``vector_length`` indices (see
`Preliminaries`_ above) onto a given number of colors:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: equal_map
:end-at: // Define
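
To make "as equally as possible" concrete, the following standalone
sketch divides a number of indices among colors in plain C++. It does
not use or reproduce FleCSI's ``equal_map``. The function name
``split_evenly`` and the choice to give the first ``n % colors`` colors
one extra index are assumptions made purely for illustration.

.. code:: C++

   #include <cstddef>
   #include <iostream>
   #include <vector>

   // Hypothetical helper (not part of FleCSI): return the number of
   // indices assigned to each color when n indices are split as evenly
   // as possible.  Assumes colors > 0.
   std::vector<std::size_t> split_evenly(std::size_t n, std::size_t colors) {
     std::vector<std::size_t> counts(colors, n / colors);
     // Hand the leftover indices, one apiece, to the first n % colors colors.
     for (std::size_t c = 0; c < n % colors; ++c)
       ++counts[c];
     return counts;
   }

   int main() {
     for (std::size_t count : split_evenly(10, 4))
       std::cout << count << ' ';
     std::cout << '\n';
   }

With 10 indices and 4 colors, this sketch assigns 3, 3, 2, and 2
indices, so no two colors differ by more than one index.
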
Given that helper function, constructing a specialization of ``user``
is trivial. FLAXPY names its specialization (still within the
``flaxpy`` namespace) ``dist_vector``:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: struct dist_vector
:end-at: };
Note that the specialization is responsible for choosing the number of
colors. ``dist_vector``'s ``color`` method queries FleCSI for the
number of processes and uses that value for the color count.
At this point we have what is effectively a distributed 1-D vector
data type that is templated over the element type. The next step is
to specify the element type. In FleCSI, each element of a data
structure comprises one or more *fields*. One can think of fields as
named columns in a tabular representation of data. FLAXPY adds two
fields of type ``double``: ``x_field`` and ``y_field``. These are
added outside of the ``flaxpy`` namespace, in an anonymous namespace.
``flaxpy.cc`` uses this anonymous namespace to indicate that its
contents are meaningful only locally and not needed by the rest of the
application.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: For clarity
:end-at: definition
``one_field`` is defined in the above to save typing,
both here and in task definitions
(see `Tasks`_ below).
Specializations typically require run-time information to produce a
usable object. This information may not be available until a number
of libraries (FleCSI, Legion, MPI, and the like) have initialized and
perhaps synchronized across a distributed system.
To allow the lifetime of these objects to be controlled properly, FleCSI imposes a particular means of instantiating a specialization based on what it calls *slots*.
The (topology) *slot* that will be used within FLAXPY's `Actions`_ is defined within the control policy (discussed next).
Control flow
++++++++++++
Recall from the :doc:`control` section that
a FleCSI application's control flow is defined in terms of a
*control point*—*action*—*task* hierarchy.
FLAXPY's overall control flow is illustrated in :numref:`flaxpy_control`.
The control points of the `Control model`_ are drawn as white rounded rectangles;
`actions `_ are drawn as blue ellipses;
and `tasks `_ are drawn as green rectangles.
As indicated by the figure, FLAXPY is a simple application
and uses a trivial sequence of control points (no looping),
trivial DAGs of actions (comprising a single node apiece),
and trivial task launches (only one per action).
.. _flaxpy_control:
.. figure:: images/flaxpy-control-model.svg
:align: center
FLAXPY control model
The bulk of this section is presented in top-down fashion.
That is, function invocations are presented
in advance of the functions themselves.
With the exception of the code appearing in the `Control model`_ section,
all of the control-flow code listed below appears in an anonymous namespace,
again, to indicate that it is meaningful only locally
and not needed by the rest of the application.
Control model
-------------
FLAXPY defines three control points: ``initialize``, ``mul_add``, and
``finalize``. These are introduced via an enumerated type, which
FLAXPY calls ``cp`` and defines within the ``flaxpy`` namespace:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: enum class cp
:end-at: cp
FleCSI expects to be able to convert a ``cp`` to a string by
dereferencing it. This requires overloading the ``*`` operator as
follows, still within the ``flaxpy`` namespace:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: inline const char *
:end-before: // Define
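
For readers unfamiliar with overloading unary ``*`` on an enumerated
type, here is a minimal standalone sketch of the technique in plain C++.
The ``phase`` enumeration and its enumerator names are hypothetical and
unrelated to FLAXPY's ``cp``; the sketch shows only the language
mechanism, not how FleCSI consumes the result.

.. code:: C++

   #include <initializer_list>
   #include <iostream>

   // Hypothetical enumeration used only for this sketch.
   enum class phase { setup, compute, teardown };

   // Overload unary * so that "dereferencing" a phase yields its name.
   inline const char * operator*(phase p) {
     switch (p) {
       case phase::setup:
         return "setup";
       case phase::compute:
         return "compute";
       case phase::teardown:
         return "teardown";
     }
     return "unknown"; // not reached for valid enumerators
   }

   int main() {
     for (phase p : {phase::setup, phase::compute, phase::teardown})
       std::cout << *p << '\n';
   }

The overload is legal because at least one operand of an overloaded
operator must have class or enumeration type, and ``phase`` satisfies
that requirement.
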
Once an application defines its control points
it specifies a sequential order for them to execute.
(The equivalent of a ``while`` loop
can be expressed with ``flecsi::control_base::cycle``,
and loops can be nested.)
FLAXPY indicates with the following code that
``initialize`` runs first,
then ``mul_add``,
and lastly ``finalize``.
It also defines a topology slot to hold the field data:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: control_base
:end-before: // Define
The preceding ``control_policy`` class is used to define a fully
qualified control type that implements the control policy:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: run::control<
:end-at: <
Actions
-------
Actions, implemented as C++ functions, are associated with control
points. The following code associates the ``initialize_action``
action with the ``initialize`` control point, the ``mul_add_action``
action with the ``mul_add`` control point, and the
``finalize_action`` action with the ``finalize`` control point:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-at: control::action
:end-at: fin;
The variables declared by the preceding code (``init``, ``ma``, and
``fin``) are never used. They exist only for the side effects induced
by instantiating a ``flaxpy::control::action``.
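
Registering something as a side effect of constructing a
namespace-scope object is a general C++ idiom. The following standalone
sketch illustrates it under simplifying assumptions: ``registry`` and
``registrar`` are hypothetical names, and a flat list of callbacks
stands in for FleCSI's far richer control model. It requires C++17.

.. code:: C++

   #include <functional>
   #include <iostream>
   #include <string>
   #include <utility>
   #include <vector>

   using entry = std::pair<std::string, std::function<void()>>;

   // Hypothetical registry for this sketch.  The function-local static
   // guarantees that the list exists before anything is added to it.
   std::vector<entry> & registry() {
     static std::vector<entry> entries;
     return entries;
   }

   // Constructing a registrar records a named function in the registry.
   struct registrar {
     registrar(std::string name, std::function<void()> f) {
       registry().emplace_back(std::move(name), std::move(f));
     }
   };

   void hello() { std::cout << "hello\n"; }
   void goodbye() { std::cout << "goodbye\n"; }

   // These objects are never referenced again; only their construction
   // matters.
   const registrar reg_hello("first", hello);
   const registrar reg_goodbye("second", goodbye);

   int main() {
     // Run everything that was registered before main started.
     for (auto & [name, f] : registry()) {
       std::cout << name << ": ";
       f();
     }
   }
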
The ``initialize_action`` action uses the slot and ``color`` function
defined above in `Data structures`_
to allocate memory for the ``dist_vector`` specialization.
Once this memory is allocated,
the action launches an ``initialize_vectors_task`` task,
granting each constituent point task access to a subset of *X* and *Y*
via the ``x_field`` and ``y_field`` fields declared in `Data structures`_.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: for the initialize
:end-at: }
The ``mul_add_action`` action spawns ``mul_add_task`` tasks,
passing them a scalar constant *a* directly and access to a subset of
*X* and *Y* via ``x_field`` and ``y_field``:
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: for the mul_add
:end-at: }
The third and final action, ``finalize_action``, sums the elements
of *Y* by initiating a global reduction. Because they represent a
global reduction, the ``reduce_y_task`` tasks are spawned using
``flecsi::reduce`` instead of ``flecsi::execute`` as in the preceding
two actions. ``finalize_action`` uses the FleCSI logging facility,
FLOG, to output the sum. Finally, the function deallocates the memory
previously allocated by ``initialize_action``.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: for the finalize
:end-at: }
Tasks
-----
Tasks are functions that collectively and (usually) concurrently
process a distributed data structure.
``flecsi::execute``, as seen in the preceding section,
spawns one point task per color.
Each point task is individually responsible for processing a subspace
(or "color") of the distributed data structure.
Because FleCSI follows `Legion `_'s data
and concurrency model,
a point task is provided access to a subspace
via an *accessor* templated on an access right:
``ro`` (read only),
``wo`` (write only),
``rw`` (read/write), or
``na`` (no access).
The ``initialize_vectors_task`` task requests
write-only access to subspaces of *X* and *Y*
because write-only access is necessary to initialize a field.
The task uses ``divide_indices_among_colors``,
defined above in `Data structures`_,
to compute the number of vector indices to which a point task has access
and the global index corresponding to local index 0.
Once these are known, the task initializes
*X*\ [*i*] ← *i* and *Y*\ [*i*] ← 0
over its subset of the distributed *X* and *Y* vectors.
FLAXPY uses FleCSI's ``forall`` macro to locally parallelize
the initialization of *Y*.
(This example works with thread parallelism but not on a GPU, since one field would be accessed on the host and the other on the device.)
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: that initializes
:end-before: for the initialize
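
The relationship between a point task's local indices and the global
vector indices can be sketched without FleCSI. In the following
standalone example, an ordinary loop over colors stands in for the
concurrently executing point tasks. The ``slice`` type, the
``slice_of`` function, and the rule that the first ``n % colors``
colors own one extra index are all assumptions made for illustration.

.. code:: C++

   #include <cstddef>
   #include <iostream>
   #include <vector>

   // Hypothetical description of one color's share of an n-element
   // vector: how many indices it owns and the global index that
   // corresponds to its local index 0.
   struct slice {
     std::size_t count;
     std::size_t first_global;
   };

   // Even split assumed for this sketch: the first n % colors colors own
   // one extra index.  Assumes colors > 0 and color < colors.
   slice slice_of(std::size_t n, std::size_t colors, std::size_t color) {
     const std::size_t base = n / colors, extra = n % colors;
     const std::size_t count = base + (color < extra ? 1 : 0);
     const std::size_t first = color * base + (color < extra ? color : extra);
     return {count, first};
   }

   int main() {
     const std::size_t n = 10, colors = 4;
     std::vector<double> x(n, -1.0), y(n, -1.0);
     // Each iteration stands in for one point task working on its slice.
     for (std::size_t color = 0; color < colors; ++color) {
       const slice s = slice_of(n, colors, color);
       for (std::size_t local = 0; local < s.count; ++local) {
         const std::size_t global = s.first_global + local;
         x[global] = static_cast<double>(global); // X[i] <- i
         y[global] = 0.0; // Y[i] <- 0
       }
     }
     for (double v : x)
       std::cout << v << ' ';
     std::cout << '\n';
   }

Because the slices are disjoint and together cover every index, each
element is written exactly once regardless of the order in which the
colors are processed.
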
``mul_add_task`` is the simplest of FLAXPY's three tasks
but also the one that performs the core DAXPY computation.
It accepts a scalar *a* and requests read-only access to a subspace of *X*
and read/write access to a subspace of *Y*.
The task then computes
*Y*\ [*i*] ← *a*\ ⋅\ *X*\ [*i*] + *Y*\ [*i*]
over its subset of the distributed *X* and *Y* vectors.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: that assigns
:end-at: }
The third and final task, ``reduce_y_task``,
computes and returns the sum of a subspace of *Y*.
For this it requests read-only access to the subspace
and uses FleCSI's ``reduceall`` macro to locally parallelize
(e.g., using thread parallelism)
the summation.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: that adds up
:end-before: for the finalize
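
The overall pattern, in which each point task sums its own subspace and
a global reduction combines the partial results, can be sketched in
plain C++ as follows. This is an illustration only: the function names
are hypothetical, the loop over colors runs sequentially, and no FleCSI
facilities are used.

.. code:: C++

   #include <cstddef>
   #include <iostream>
   #include <numeric>
   #include <vector>

   // Sum one contiguous slice of y, as a single point task might.
   double partial_sum(const std::vector<double> & y,
     std::size_t first,
     std::size_t count) {
     const auto begin = y.begin() + static_cast<std::ptrdiff_t>(first);
     return std::accumulate(
       begin, begin + static_cast<std::ptrdiff_t>(count), 0.0);
   }

   // Combine the per-color partial sums into one total.  The even split
   // of indices among colors is an assumption of this sketch.
   // Assumes colors > 0.
   double two_level_sum(const std::vector<double> & y, std::size_t colors) {
     const std::size_t n = y.size();
     double total = 0.0;
     std::size_t first = 0;
     for (std::size_t color = 0; color < colors; ++color) {
       const std::size_t count = n / colors + (color < n % colors ? 1 : 0);
       total += partial_sum(y, first, count);
       first += count;
     }
     return total;
   }

   int main() {
     std::vector<double> y(10);
     std::iota(y.begin(), y.end(), 0.0); // y = 0, 1, ..., 9
     std::cout << two_level_sum(y, 4) << '\n'; // prints 45
   }

Note that a different number of colors changes the order in which
floating-point values are added, so the low-order digits of a sum over
non-integer data may vary with the degree of parallelism.
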
Program initialization
++++++++++++++++++++++
FLAXPY's ``main`` function, expressed outside of any namespace, is
largely boilerplate. It initializes FleCSI, executes the FLAXPY code
according to the control flow defined above in `Control flow`_, and
finalizes FleCSI.
.. literalinclude:: ../../../../tutorial/standalone/flaxpy/flaxpy.cc
:language: cpp
:start-after: main program
Usage
+++++
FLAXPY can be built and run as follows:
.. code:: bash

   cd tutorial/standalone/flaxpy
   mkdir build
   cd build
   cmake ..
   make -j
   mpiexec -n 8 ./flaxpy

Depending on your installation, you may need to execute ``mpirun``,
``srun``, or another launcher instead of ``mpiexec``. The ``-n 8``
specifies 8-way process parallelism. The output should look like
this:
.. code:: none

   The sum over all elements in the final vector is 6.16999e+12

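
As a sanity check on that value, note that with *X*\ [*i*] = *i* and
*Y*\ [*i*] initially 0, the final sum equals
*a*\ ⋅\ *N*\ (*N* − 1)/2. The following standalone sketch evaluates this
closed form, assuming FLAXPY uses the same constant *a* = 12.34 as the
sequential code above and the default length of 1,000,000. The function
name ``expected_sum`` is hypothetical.

.. code:: C++

   #include <cstddef>
   #include <iostream>

   // Closed form of the sum over i = 0, ..., n-1 of a*i, which is
   // a * n * (n - 1) / 2.
   double expected_sum(double a, std::size_t n) {
     return a * (static_cast<double>(n) * static_cast<double>(n - 1) / 2.0);
   }

   int main() {
     // Default stream formatting shows six significant digits.
     std::cout << expected_sum(12.34, 1000000) << '\n'; // prints 6.16999e+12
   }

The closed form and the parallel summation may differ in their
low-order digits because they add the terms in different orders, but
they agree to the six significant digits shown.
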
Try
* passing ``--help`` to ``flaxpy`` to view the supported command-line
options,
* passing ``--length=2000000`` to ``flaxpy`` to run DAXPY on a vector
that is twice as long as the default,
* running with a different amount of process parallelism, perhaps
across multiple nodes in a cluster, or
* [**advanced**] modifying ``flaxpy.cc`` to construct a specialization
of ``narray`` (a multidimensional-array core topology) instead of a
specialization of ``user``. See the source code to the
:doc:`Poisson example `, in particular
``tutorial/standalone/poisson/include/specialization/mesh.hh``, for
an example of the use of ``narray``.