.. This file is written with one sentence per physical line.

Internal Summary
****************

This page describes the structure and operation of FleCSI in developers' terms but at a high level.
It is meant to provide context for the details found when reading a source file (especially for the first time).

Components
++++++++++

FleCSI performs only a few fundamental actions, each of which corresponds to one of the namespaces/directories under ``flecsi``:

* ``data``: Allocate distributed one-dimensional arrays of arbitrary trivial types (*fields*).
* ``topo``: Organize those arrays into *topologies* that represent computational physics domains of several kinds.
* ``exec``: Call user-specified functions (*tasks*) with pointers to the allocated memory (in the form of *accessors*).
* ``io``: Save and restore the contents of fields to and from disk.
* ``flog``: Aggregate diagnostic output from multiple processes.

The implementation of these components is divided between the "front end" and one of a number of *backends* that leverage some external mechanism for allocating memory, transferring distributed data, and executing tasks.
The common backend API comprises a small set of classes and function templates that are sufficient to implement the front end; each is called an *entry point*.
Most entry points are defined in files named ``policy.hh`` in a backend-specific directory in a component.
``topo`` and ``flog`` are implemented entirely in the front end.

The reference backend uses Legion for these purposes, which imposes `stringent requirements <https://legion.stanford.edu/tutorial/hybrid.html>`_ on the application because of its implicit operation across processors and memory spaces.
The conceit is that code (in FleCSI and its clients) that is compatible with Legion will also work with most other backends.

The MPI backend uses MPI to transfer data; since MPI does not facilitate general allocation and task launches, these are implemented in terms of ordinary C++ constructs.
It is not intended that it support every FleCSI feature, since some would require implementing a task system of complexity comparable to that of Legion.
Neither is it required for using existing MPI-based libraries; Legion itself uses MPI and can be made compatible with them.
The HPX backend also uses MPI for the data transfer; it however only supports trivial launch maps.

Other components provide support for the above activities:

* ``run``: Maintain internal state, communicate with the external mechanism, and invoke structured sequences of callback functions to perform a simulation.
* ``util``: Organize local data, support compile-time computation, and implement unit-testing assertions akin to those in Google Test.

Of these, ``util`` is implemented entirely in the front end.

Every component has a single user-level header with a similar (if longer) name directly in ``flecsi/``.

Hierarchy
^^^^^^^^^

Since the facilities provided by FleCSI are general-purpose, many of them are used by (other parts of) FleCSI itself.
As a result, there are several circular dependencies among components (when ignoring their internal structure), but there is also a general hierarchy among them.

``io`` is highest because it provides a user-level service in terms of the other components.
Similarly, ``topo`` is high because nontrivial topologies themselves need to allocate and use field data.
Generally, ``exec`` is higher than ``data`` because launching a task involves obtaining access to allocated data, but the latter is the most dispersed among the layers.
``run`` is lower because it provides certain global variables and the access to the underlying communication system, except that its support for callbacks is outside the hierarchy altogether.
``flog`` is lower still, since it relies merely on MPI for communication.
Finally, ``util`` is the lowest, generic level.

There is also an expected hierarchy beyond FleCSI.
In particular, it is expected that the topology *specializations* required as policies for the core topology *categories* (*e.g.*, to specify the number of spatial dimensions) are provided by intermediate clients like FleCSI-sp that also provide domain-specific interfaces to application code.
(Topology specializations are not C++ explicit or partial specializations, although they may be generated by a class template.)

Compound clients
^^^^^^^^^^^^^^^^

FleCSI's design is meant to support a client simulation program that comprises a number of subsystems with minimal coordination among them.
For example, fields are registered on merely a specialization, endowing all instances of its topology type with the field in question without requiring changes to its definition.
(Memory is allocated only if the field is used, so clients can err on the side of registering whatever fields might be useful.)
Additionally, the registrations take the form of non-local variables such that any source file in the program can contribute them independently.

Another example is the callback-based control flow system, which allows actions (and ordering constraints) to be associated with each part of the overall simulation cycle structure.
Again, each such action is a non-local variable that can be supplied separately anywhere in the program.

Data
++++

Task separation
^^^^^^^^^^^^^^^

Field data appears as actual, borrowed arrays in a task but as owning, opaque handles in the caller.
As such, certain *caller-only* or *task-only* types are used only in one context or the other, and the types of some task arguments differ from the types of the parameters they indirectly initialize.
Conveniently, this separation also guarantees thread safety in the event that tasks execute concurrently with their callers in the same process.

Storage
^^^^^^^

The FleCSI data model is based on `that of Legion <https://legion.stanford.edu/tutorial/logical_regions.html>`_, with certain simplifications.
Each field has a value at each of several *index points*, each of which is a pair of abstract integers.
The first indicates the *color* to which the index point pertains, which determines the execution agent that typically operates on the associated field values.
The second is an index within the index points for that color; both are zero-based.
(This structure, where field values at certain index points are periodically copied to other colors for access by other processes, is known as an "Extended Index Space".)

A *region* is the domain for some number of fields: a rectangular subregion of the two-dimensional integer lattice whose lower-left corner is always (0,0).
(Unlike in Legion, FleCSI uses a separate list of fields for each index space.)
The *rows* of that rectangle, one per color, are often considered separately.

To support memory allocations that vary among rows or over time, fields are accessed via a *partition*, which is simply a prefix of each row of a region.
(Often, the length of each such prefix is itself stored in a field that is accessed via a preexisting partition.)
See :numref:`regions` for a visual depiction of regions, prefixes, and partitions.
Several partitions may exist for a single region, although it is common for there to be just one or else just one at any time.
All access the same field values, which survive as long as the region does.
The backends are expected to allocate memory only as necessary for partitions that are actually used with a given field.

.. _regions:
.. figure:: images/region_partition_prefix.png

  Diagram illustrating a region, prefix, and partition.  Regions are merely logical entities,
  whereas actual memory allocations are managed through partitions.  Partitions are a collection of
  prefixes, one for each color.  A prefix always begins at index point 0 of its color.

-----

Each topology is a fixed-size set of *index spaces* (*e.g.*, the cells, vertices, and faces of a mesh), each of which is backed by a region (and a partition).
Often, other regions and partitions are included, organized into *subtopologies* that handle common kinds of data like the count of mesh elements.

The classes ``region_base`` and ``partition`` (and the concrete ``rows`` and ``prefixes`` derived from the latter) are entry points.
Most other entry points for data handling concern *copy engines*, which are uniquely responsible for transferring data between rows.
(FleCSI does not use Legion's support for overlapping partitions as a means of transferring data.)
Each is defined in terms of a field whose value at a destination of the copy is its source index point.
(Since an index point may be copied to more than one destination simultaneously, the inverse function does not exist.)

All of these types are caller-only; their APIs are defined in ``topology.hh``.
Since class members cannot be declared without defining the class, those declarations are discarded by the preprocessor except when building API documentation.
The class template ``partitioned`` addresses the common case where a ``region`` and a ``partition`` are needed together.
*Copy plans* wrap copy engines and the initialization of their associated fields; they are defined in ``copy_plan.hh`` along with ``buffers``, a topology that transfers dynamic amounts of data with datagrams.

Layouts
^^^^^^^

The backend is expected merely to provide uninitialized storage arrays for each field and ``memcpy`` it appropriately.
Therefore, ``sizeof(T)`` and the partition size is sufficient information to allocate it, but (if it is used with any non-MPI task) the type must be self-contained and trivially relocatable.
(This is not a formal C++ classification; note that ``std::tuple<int>`` is not trivially copyable.)

This support is called the ``raw`` *layout*.
Higher-level layouts are implemented in terms of it:

* ``dense`` creates ``T`` objects in a ``raw`` array.
* ``single`` provides syntactic sugar for the case of arrays of length 1 (per color).
* ``ragged`` stores a resizable array of ``T`` at *each* index point, as if the field type were ``std::vector<T>``.
  The elements of the arrays are packed in an underlying ``raw`` field (with slack space to reduce reallocations, as with ``std::vector`` itself); the offsets of the beginning of each array are stored in a separate ``dense`` field.
* ``sparse`` stores a mapping from integers to ``T`` at each index point, as if the field type were ``std::map<std::size_t,T>``.
  The implementation is simply a ``ragged`` field of ``std::pair<std::size_t,T>``, with each row sorted by the key.
* ``particle`` stores a set of ``T`` objects bounded by the size of the index space.
  The implementation augments ``T`` with a "skip field" that allows efficient iteration, insertion, and deletion.

This enumeration is defined in ``layout.hh``.

Definition
^^^^^^^^^^

The various types used for working with a field are exposed as members of ``field<T,L>``, where ``T`` is the field data type and ``L`` is the layout (which defaults to ``dense``).
Application code, topology specializations, and topologies all register fields the same way, by defining caller-only objects of type ``field<T,L>::definition<P,S>``.
``P`` here is the topology type (specialization, sometimes called a "Policy"), and ``S`` is the index space (of the type ``P::index_space``, which is typically an enumeration).
These are often non-local variables.

If ``L`` is ``raw``, the field is registered on the global FleCSI *context* with a field ID drawn from a global counter, organized by topology type and index space.
Otherwise, the ``definition`` recursively registers appropriate underlying fields (via specializations of the helper class templates ``field_base`` and ``field_register``).
These types are defined in ``field.hh`` (but, as a principal name used by application code, ``field`` appears directly in the ``flecsi`` namespace).

``topology`` objects are also caller-only; special support for creating application-level instances is provided by ``scheduler::allocate``.
Using ``specialization::ptr`` allows deferring the initialization of a topology instance that is a member of a control policy object until an appropriate action.
It also provides a second phase of initialization that can be used to launch tasks operating on the new topology object.

Topology objects are constructed from *colorings*, which are descriptions of the computational domain as ordinary C++ data rather than fields.
For reasons of efficiency and interoperability, these are often constructed by special "MPI tasks" (described below).
The class template ``coloring_slot``, defined in ``coloring.hh`` automates invoking such tasks.

Access
^^^^^^

Access to a field is requested with a caller-only *field reference* which identifies a field and a topology instance.
A field reference may be passed as an argument for an accessor parameter of a task.
Accessors are task-only; their types are usually spelled ``field<T,L>::accessor<P,...>``, where each ``P`` is a *privilege* that specifies read and write permissions for some part of the field.

When a single privilege is supplied (``field<T,L>::accessor<P>``), it refers to *owned* index
points.  For two privileges (``field<T,L>::accessor<P1,P2>``), the first (``P1``) refers to *owned*
index points and the second (``P2``) to *ghosts* that may be copied automatically from
pre-specified shared points.  For three privileges (``field<T,L>::accessor<P1,P2,P3>``), the first
(``P1``) refers to *exclusive* index points, the second (``P2``) to *shared* index points, and the
third (``P3``) to *ghosts*.
Ghost copies are performed only when ghosts are read and not written and shared points have been written more recently than the previous read or write.
(There is no mechanism at present to overlap the ghost copies with a task that does not require access to ghosts or write access to shared points.)

Internally, all of the (variable number of) privileges for an accessor are combined into a *privilege pack*.
The syntax ``field<T,L>::accessor1<P>`` may be used to reuse such a pack.
The actual type of an accessor is ``data::accessor<L,T,P>``, which must be used for template argument deduction to apply.

When the arrays for one or more index points in a ``ragged`` or ``sparse`` field are resized, they must be repacked.
To do so efficiently, the interface for such resizing is provided by accessor variants called *mutators*, which use temporary storage (from the ordinary heap) to track changes made by a task and then apply those changes when it finishes.
They automatically resize such fields (according to a policy set by the topology) to maintain slack space for insertions, but the process simply fails if that guess is overrun.
Mutators also have permissions: write-only mutators (re)initialize a field (to all empty structures).
Multiple permissions distinguish mutators that trigger ghost copies from those that implement them.

Accessors of different layouts form a hierarchy parallel to that of field definitions.
The ultimately underlying ``raw`` accessors merely store a ``util::span<T>``, along with a field ID used to look up the storage.
Higher-level accessors implement additional behavior, including certain automatic task launches.
Additionally, ``ragged`` mutators are implemented in terms of the same underlying accessors as ``ragged`` accessors, and ``sparse`` mutators are in turn a wrapper around them.
All these types are defined in ``accessor.hh``, but the (undefined) primary templates are declared in the lower-level ``field.hh``.

Because the structural information about a topology is often necessary for using the physics fields defined on it, each topology defines a *topology accessor* type that packages accessors for the fields that hold that structural information (registered by the topology itself), further extending the hierarchy of composite accessors.
Topology accessors are of course also task-only; a topology accessor parameter is matched by a ``topology`` argument.
The topology's ``access`` type is used wrapped in the ``topology_accessor`` class template defined in ``topology_accessor.hh``.

To help specify the members of topology accessors, which typically are accessors for preselected fields, ``field.hh`` also defines the class template ``accessor_member`` that accepts (a reference to) the field as a template argument and automatically initializes the accessor with the correct field ID.
(The field ID is not known until runtime, but the *location* where it will be stored is known at compile time.)

*Multi-color accessors* allow an execution agent to access data outside of its color (beyond that supplied by ghost copies), including the special case of data outside of the data model altogether (*e.g.*, distributed objects created by MPI-based libraries).
These take the form of a sequence of color-accessor pairs inside a task; the accessor can also be a mutator or a topology accessor.
They are created from a *launch map* that specifies which colors should be made available where (discussed further below).
The MPI backend supports only trivial launch maps that specify some subset of the normal arrangement.

See below about the special case of *reduction accessors*.

Execution
+++++++++

Levels
^^^^^^

The execution of a FleCSI-based application is divided into three levels:

#. The outer program creates a ``flecsi::runtime`` object and calls ``control`` on it.
#. The control model executed by ``control`` creates topologies and executes tasks.
#. The tasks perform all field access (including some needed during topology creation).

The first two are executed as normal `SPMD <https://en.wikipedia.org/wiki/SPMD>`_ jobs; only the control model can launch tasks.
(With Legion, the control model executes in the top-level *task*, but we reserve that word for the leaf tasks.)
When arguments that describe data across multiple colors (*e.g.*, field references) are passed to a task, an *index launch* takes place that executes an instance of the task (a *point task*) for each color.
(Often, there are the same number of colors as MPI ranks.)
Otherwise, a *single launch* of the task is performed, which can be useful for custom I/O operations.

Inherited from Legion is the requirement that all interactions with FleCSI be identical in every instance of the first two levels (called a *shard* in the second case).
(This is akin to the rules for collective communication operations with MPI.)
This includes that the arguments passed to tasks must be identical across shards, which sometimes necessitates providing extra information from which the point task may select according to its color.

Task launches
^^^^^^^^^^^^^

In general, task parameters must be movable and if they are not references they must be copyable.
However, special conversions are first applied to certain arguments like field references recognized via the ``replace_argument`` mechanism in ``launch.hh``.

In addition to converting arguments that identify resources, those resources are recruited for the task's use.
For fields, this involves identifying the responsible ``partition`` from the topology on the caller side.
(For Legion, its associated Legion handles are then identified as resources needed for the task launch, controlling data movement and parallelism discovery.)
The *global topology* (described further below) is a special case: it uses a ``region`` directly, so all point tasks use the same field values.
A task that writes to a global topology instance must therefore be a single launch or use a reduction accessor, which combines values from all point tasks using a reduction operation (see below).
On the task side, the recruited resources and the accessors' field IDs are consulted to obtain values for the contained ``span`` objects.
Because a task's parameters are destroyed as soon as it returns, state accumulated by mutators is stored in separate *buffers* that can be processed afterwards.

On both sides, caller and task, various tag base classes are used to recognize relevant FleCSI types for the parameters.
The most important of these is ``send_tag``: task-parameter types that inherit from it provide a member function template called ``send`` to send an object from a task caller to the task itself.
This ``send`` method also accepts a callback function.
When applying operations to the task parameters, instead of duplicating code to handle the lower-level task parameters, it relies on this callback mechanism.
With it, task parameters are able to decompose themselves down to simpler parameters that have specific processing defined, such as raw accessors that underlie accessors and mutators.
This composition process is handled via inheritance and aggregation, and the ``send`` method may be called multiple times and for various purposes depending on the backend and the path taken to send the object between caller and task.

A call to ``execute<F>`` can return before the task does; it returns a *future* that can be used to wait on the task to finish and obtain its return value (if any).
(Legion provides a mechanism for nontrivial class types to serialize themselves when so returned.)
A call that performs an index launch returns a ``future<T,launch_type_t::index>`` that can access the value returned from each point task.
Such an object can also be passed as a task argument for a ``future<T>`` parameter, which produces an index launch that processes all values in parallel.

The values returned by an index launch may also be *reduced* to a single value (still expressed as a future).
The reduction operation is expressed as a type, passed as a template argument to ``reduce``, with members ``combine`` and ``identity``, which may optionally be templates.
The most common reduction operations are provided in the ``exec::fold`` namespace, defined in ``fold.hh``; the generic interface is adapted to each backend in ``*/reduction_wrapper.hh``.

The function template ``execute`` simply forwards to ``reduce`` with ``void`` as the (non-)reduction type; both are defined in ``execution.hh``.
In turn, ``reduce`` performs periodic log aggregation and then calls the ``reduce_internal`` entry point defined in ``*/policy.hh``.
Certain implementations of ``send`` may themselves execute tasks to prepare field data for the requested task, which means that ``reduce_internal`` is in general *reentrant*.

Common portions of the argument and parameter handling are defined in ``prolog.hh`` and ``buffers.hh``.
The undefined primary template for ``future`` is declared in ``launch.hh``, along with documentation-only definitions of the single- and index-launch specializations.
The backend-specific implementations are in ``*/future.hh``.

Explicit parallelism
^^^^^^^^^^^^^^^^^^^^

Tasks and the control model cannot in general use parallel communication because it might collide with implicit or concurrent communication on the part of the backend and because any two of them may be executed sequentially or concurrently.
For cases where such communication is needed (*e.g.*, to use an MPI-based library), a task can be executed as an *MPI task* via the optional template argument to ``execute``/``reduce``.
Like the control model, an MPI task is executed in parallel on every MPI rank; moreover, no other tasks or internal communication are executed concurrently, and the call is synchronous.
Because the call is not made in an ordinary task context, ``context_t::color`` must not be used by an MPI task; ``context_t::process``, which is always available, has the desired meaning in that context.

Because the execution of an MPI task has largely the same semantics as an ordinary function call, arbitrary arguments may be passed to it.
The usual argument replacements still apply, which allows MPI tasks to have access to both fields and objects created by the caller.
Arguments that are not so interpreted need not have the same value on every shard.
However, return values must follow the ordinary rules (so as to support futures and reductions).

FleCSI also provides, in ``launch.hh`` and ``kernel.hh``, a wrapper interface for simple Kokkos parallel loops and reductions, including macros ``forall`` and ``reduceall`` that are followed by a lambda body (and a semicolon, since the lambda is an expression).
The same reduction types as for index launches are supported.

Topologies
++++++++++

Interfaces
^^^^^^^^^^

The basic responsibility of a (core) topology type is to provide access to fields.
The client machinery for task launches is backend-specific but not topology-specific; it uses a common interface specified by documentation-only example in ``core.hh``.
The interface also enables specializations to be defined; the core topology class template accepts the specialization as a template parameter (but neither inherits from the other as in the `CRTP <https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern>`_).

Inheriting from ``specialization`` enables some simple default options, prohibits construction to prevent confusion with the topology type, and provides several type aliases, several of which are meant for application code to use in preference to the formal names.
In turn, the specialization can specify index spaces, provide a factory function for coloring objects, extend the core ``access`` type, and supply other category-specific parameters, as also presented in ``core.hh``.
The coloring *type*, however, is defined by the category (independent of specialization).

The ``unstructured`` topology is a useful example of the metaprogramming techniques used to define index spaces and fields that are parametrized by the policy.
Its unit tests also provide examples of specializations.

Subcomponents
^^^^^^^^^^^^^

For constructing complex, user-facing topologies, a number of simple topologies are defined for use as subtopologies (typically as data members of type ``sub::topology``).
Some of these are so trivial as to merely inherit from the appropriate specialization of ``specialization`` with an empty class body.
The most fundamental of these is ``resize``, which holds the sizes needed to construct a nontrivial ``partition``.
It is defined in ``size.hh`` in terms of the even lower-level ``color`` and ``column`` machinery (from ``color.hh``) that define fields with a fixed number (1 for ``column``) of values per color.

The type ``topo::repartition`` augments ``data::partition`` with a ``resize`` and schedules a task to initialize them properly; it too can be combined with a ``region`` with the *ed* suffix.
It is defined in ``index.hh``, along with higher-level subtopologies that provide the backing store for the ``ragged`` layout as well as the user-level ``index`` topology that supports ``ragged`` fields.
The topology ``ragged`` is itself a class template, parametrized with the specialization to distinguish ``ragged`` field registrations on each.
For several of these types, there is a helper class (template) of the same name prefixed with ``with_`` to be used as a base class.

Launch maps are constructed from several auxiliary topology types that apply a partial permutation to a region's existing rows.
The permutation is represented by a ``borrow`` object, which is a data entry point implemented in a nontrivial sense only for Legion.
With several such permutations, an arbitrary many-to-many mapping can be expressed.
Topology accessors are supported with a system of wrapper classes that emulate the underlying topology instance, including support for all its index spaces, ragged fields, and other topology-specific details.

Several templates are defined in ``utility_types.hh`` to assist in defining topologies.
In particular, ``topo::id`` serves to distinguish in user-facing interfaces the indices for different index spaces.

Predefined
^^^^^^^^^^

Because the global and index topologies do not need user-defined specializations, a predefined specialization is provided of each (and the categories are suffixed with ``_category``).
A deprecated global instance of each is defined in ``flecsi/data.hh``.
Each backend's initialization code uses the ``data_guard`` type to manage their lifetimes.

.. I/O

Utilities
+++++++++

Since ``util`` is self-contained and has little internal interaction, there is little need for centralized, prose description.
However, a few utilities (beyond the serialization already mentioned) have sufficiently wide relevance as to deserve mention.

Simplified backports of several range utilities from C++20 are provided in ``array_ref.hh``.
The intent is to switch to the ``std`` versions (with only trivial code changes) when compiler support becomes available.

Fixed-size containers that associate an enumeration value with each element are provided in ``constant.hh``.
The class template ``key_array`` derives from ``std::array``; similarly ``key_tuple`` derives from ``std::tuple``.

An idiomatic C++ interface for certain MPI routines is provided in ``mpi.hh``.
A wide variety of types is supported via automatic ``MPI_Datatype`` creation or via serialization.

Unit-testing macros that can be used in tasks are provided by ``unit.hh``.
They use return values (rather than global variables like Google Test), so the driver function must check the return value of a task with its own assertion.
A main program to be linked with each unit test source file is provided.