intro.rst 5.68 KB
Newer Older
1
2
3
4

Introduction
============

Vojtech Cima's avatar
Vojtech Cima committed
5
6
7
HyperLoom is a platform for defining and executing workflow pipelines in a
distributed environment. HyperLoom aims to be a highly scalable framework
that is able to efficiently execute millions of interconnected tasks on hundreds of computational nodes.
8

Vojtech Cima's avatar
Vojtech Cima committed
9
User defines and submits a plan - a computational graph (Directed Acyclic Graph) that captures dependencies between computational tasks. The HyperLoom infrastructure then automatically schedules the tasks on available nodes while managing all necessary data transfers.
10
11
12
13

Architecture
------------

Vojtech Cima's avatar
Vojtech Cima committed
14
15
HyperLoom architecture is depicted in :numref:`architecture`. HyperLoom consist of a server process that manages worker processes running on computational nodes and a client component that provides an user interface to HyperLoom.

16
17
The main components are:

Vojtech Cima's avatar
Vojtech Cima committed
18
* **client** -- The Python gateway to HyperLoom -- it allows users to programmatically chain computational tasks into a plan and submit the plan to the server. It also provides a functionality to gather results of the submitted tasks after the computation finishes.
19

Vojtech Cima's avatar
Vojtech Cima committed
20
* **server** -- receives and decomposes a HyperLoom plan and reactively schedules tasks to run on available computational resources provided by workers.
21

Vojtech Cima's avatar
Vojtech Cima committed
22
* **worker** -- executes and runs tasks as scheduled by the server and inform the server about the task states. HyperLoom provides options to extend worker functionality by defining custom task or data types. (Server and worker are written in C++.)
23
24
25
26
27

.. figure:: arch.png
   :width: 400
   :alt: Architecture scheme
   :name: architecture
Vojtech Cima's avatar
Vojtech Cima committed
28
   :align: center
29

Vojtech Cima's avatar
Vojtech Cima committed
30
   Architecture of HyperLoom
31
32
33
34
35
36
37


Basic terms
-----------

The basic elements of Loom's programming model are: **data object**, **task**,
and **plan**. A **data object** is an arbitrary data structure that can be
Vojtech Cima's avatar
Vojtech Cima committed
38
39
serialized/deserialized. A **task** represents a computational unit that produces data
objects. A **plan** is a set of interconnected tasks.
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81

Tasks
+++++

A **task** is an object representing a computation together with its
dependencies and a configuration. Each task has the following attributes:

* Task inputs -- task's prerequisites (some other tasks)
* Task type -- the specification of the procedure that should be executed
* Task policy -- defines how should be the task scheduled
* Configuration -- a sequence of bytes that is interpreted according the task type
* Resource constraints

By task *execution*, we mean executing a procedure according to *task type*,
which takes data objects and configuration, and returns a new data object. The
input data objects are obtained as a result of executing tasks defined in task
inputs. Resource constraints serve to express that a task execution may need
some specific hardware or number of processes.


Plan
++++

**Plan** is a set of tasks. Plan has to form a finite asyclic directed
multigraph where nodes are tasks and arcs express input dependencies between
tasks. *Plan execution* is an execution of tasks according to the dependencies
defined in the graph.

.. Note::

  * We have formally restricted each task to return only a single data object as
    its result. However, a task can produce more results by returning an array of
    data objects.
  * Input data objects are always results of a previous tasks. To create a
    specific constant data object, there is a standard task (``tasks.const`` in
    Python API) that takes no input and only creates a data object from its
    configuration.


Symbols
+++++++

Vojtech Cima's avatar
Vojtech Cima committed
82
Customization and extendability are important concepts of HyperLoom. HyperLoom is designed
83
to enable creating customized workers that providies new task types, data
Vojtech Cima's avatar
Vojtech Cima committed
84
objects and resources. HyperLoom uses the concept of name spaces to avoid potential
85
86
87
88
89
90
name clashes between different workers. Each type of data object, task type and
resource type is identified by a symbol. Symbols are hierarchically organized
and the slash character `/` is used as the separator of each level (e.g.
`loom/data/const`). All built-in task types, data object types, and resource
types always start with `loom/` prefix. Other objects introduced in a a
specialized worker should introduce its own prefix.
Vojtech Cima's avatar
Vojtech Cima committed
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130


Data objects
++++++++++++

Data objects are fundamental entities in HyperLoom. They represent values that serves
as arguments and results of tasks. There are the following build-in basic types
of data objects:

* **Plain object** -- An anonymous sequence of bytes without any additional
  interpretation by HyperLoom.

* **File** -- A handler to an external file on shared file system. From the
  user's perspective, it behaves like a plain object; except when a data
  transfer between nodes occurs, only a path to the file is transferred.

* **Array** -- A sequence of arbitrary data objects

* **Index** -- A logical view over a D-Object data object with a list of positions.
  It is used to slice data according some positions (e.g. positions of the
  new-line character to extract lines). It behaves like an array without
  explicit storing of each entry.

* **PyObj** -- Contains an arbitrary Python object

We call objects that are able to provide a content as continous
chunk of memory as **D-Objects**. Plain object and File object are D-Objects;
Array, Index, and PyObj are *not* D-Objects.

Each data object

* **size** -- the number of bytes needed to store the object
* **length** -- the number of 'inner pieces'. Length is zero when an object has no
  inner structure. Plain objects and files have always zero length; an array has length
  equal to number of lements in the array.

.. Note:: **size** is an approximation. For a plain object, it is the length of
          data itself without any metada. The size of an array is a sum of sizes
          of elements. The size of PyObj is obtained by ``sys.getsizeof``.