StarPU Handbook
Note: by default, StarPU does not let CPU workers sleep, so that they can react to task release as quickly as possible. For idle time to really let CPU cores save energy, one needs to use the configure option --enable-blocking-drivers.
If the application can provide an energy consumption performance model (through the field starpu_codelet::energy_model), StarPU will take it into account when distributing tasks. The target function that the dmda scheduler minimizes then becomes alpha * T_execution + beta * T_data_transfer + gamma * Consumption, where Consumption is the estimated task consumption in Joules. To tune this parameter, use for instance export STARPU_SCHED_GAMMA=3000 (STARPU_SCHED_GAMMA) to express that each Joule (i.e. 1 kW during 1000 µs) is worth a 3000 µs execution time penalty. Setting alpha and beta to zero makes the scheduler take only energy consumption into account.
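For instance, assuming the alpha and beta weights are exposed through the STARPU_SCHED_ALPHA and STARPU_SCHED_BETA environment variables, an energy-only tuning would look like:
export STARPU_SCHED_ALPHA=0 STARPU_SCHED_BETA=0 STARPU_SCHED_GAMMA=3000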
This is however not sufficient to correctly optimize energy: the scheduler would simply tend to run all computations on the most energy-conserving processing unit. To account for the consumption of the whole machine (including idle processing units), the idle power of the machine should be given by setting export STARPU_IDLE_POWER=200 (STARPU_IDLE_POWER) for 200 W, for instance. This value can often be obtained from the machine's power supply, e.g. by running
ipmitool -I lanplus -H mymachine-ipmi -U myuser -P mypasswd sdr type Current
The energy actually consumed by the total execution can be displayed by setting export STARPU_PROFILING=1 STARPU_WORKER_STATS=1 (STARPU_PROFILING and STARPU_WORKER_STATS).
For OpenCL devices, on-line task consumption measurement is currently supported through the OpenCL extension CL_PROFILING_POWER_CONSUMED, implemented in the MoviSim simulator.
For CUDA devices, on-line task consumption measurement is supported on V100 cards and beyond. This however only works for quite long tasks, since the measurement granularity is about 10ms.
Applications can however provide explicit measurements by feeding the energy performance model by hand. Fine-grain measurement is often not feasible with the feedback provided by the hardware, so users can for instance run a given task a thousand times, measure the global consumption for that series of tasks, divide it by a thousand, repeat for varying kinds of tasks and task sizes, and eventually feed StarPU with these manual measurements. For CUDA devices starting with V100, the starpu_energy_start() and starpu_energy_stop() helpers, described in Measuring energy and power with StarPU below, make it easy.
For older models, one can use nvidia-smi -q -d POWER to get the current consumption in Watts. Multiplying this value by the average duration of a single task gives the consumption of the task in Joules, which can be given to starpu_perfmodel_update_history() (exemplified in Performance Model Example with the performance model energy_model).
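As a rough sketch of this conversion, with hypothetical values for the power reading and the average task duration, and with task, workerid and energy_model standing for the measured task, the CUDA worker used, and the energy performance model:
double power_watts = 250.;       /* as reported by nvidia-smi -q -d POWER (hypothetical value) */
double avg_duration_us = 1200.;  /* average duration of one task, measured by the application */
double joules = power_watts * avg_duration_us / 1e6;  /* Watts * seconds = Joules */
starpu_perfmodel_update_history(&energy_model, task,
                                starpu_worker_get_perf_archtype(workerid, task->sched_ctx),
                                workerid, 0, joules);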
Another way to provide the energy performance is to define a perfmodel with starpu_perfmodel::type STARPU_PER_ARCH or STARPU_PER_WORKER, and set the field starpu_perfmodel::arch_cost_function or starpu_perfmodel::worker_cost_function to a function which returns the estimated consumption of the task in Joules. Such a function can for instance use starpu_task_expected_length() on the task (in µs), multiply it by the typical power consumption of the device (in W), and divide by 1000000 to get Joules. An example is in the file tests/perfmodels/regression_based_energy.c.
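A minimal sketch of such a cost function, assuming a hypothetical typical device power of 200 W:
/* Estimated energy in Joules, derived from the expected task length.
   The 200 W figure is a placeholder for the device's typical power draw. */
static double energy_cost_function(struct starpu_task *task, struct starpu_perfmodel_arch *arch, unsigned nimpl)
{
    double length_us = starpu_task_expected_length(task, arch, nimpl);
    return length_us * 200. / 1000000.;  /* µs * W / 1e6 = Joules */
}

static struct starpu_perfmodel energy_model =
{
    .type = STARPU_PER_ARCH,
    .arch_cost_function = energy_cost_function,
    .symbol = "my_energy_model",
};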
StarPU also provides functions to measure the energy consumed by the system during execution: starpu_energy_use() declares the energy consumed by a task, while starpu_energy_used() returns the total energy consumed since the start of the measurement.
We have extended the performance model of StarPU to measure energy and power values of CPUs. These values are measured using the existing Performance API (PAPI) analysis library. PAPI provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.
The measurement uses RAPL events, which are available on CPU architectures: RAPL_ENERGY_PKG, which represents the energy consumption of the whole CPU socket, and RAPL_ENERGY_DRAM, which represents the RAM energy consumption. PAPI provides a generic, portable interface for the hardware performance counters available on all modern CPUs, as well as for some other components of interest that are scattered across the chip and system.
In order to use the right RAPL events for energy measurement, users should check which RAPL events are available on the machine, using the following command:
$ papi_native_avail
Depending on the system configuration, users may have to run this as root to get the performance counter values.
Since the measurement is for all the CPUs and the memory, the approach taken here is to run a series of tasks on all of them and to take the overall measurement.
In this example, we launch several tasks of the same type in parallel. To measure the energy requirement of a program, we call starpu_energy_start(), which initializes the energy measurement counters, and starpu_energy_stop(struct starpu_perfmodel *model, struct starpu_task *task, unsigned nimpl, unsigned ntasks, int workerid, enum starpu_worker_archtype archi), which stops counting and updates the performance model. This yields the average energy requirement of a single task. The example below illustrates this for a given task type.
The example starts 40 times more tasks of the same type than there are CPU execution units. Once the tasks are distributed over all CPUs, the latter are all executing the same type of tasks (with the same data size and parameters); each CPU will in the end execute 40 tasks. A specimen task is then constructed and passed to starpu_energy_stop(), which will fold into the performance model the energy requirement measurement for that type and size of task.
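A condensed sketch of this pattern, assuming a codelet cl with its starpu_codelet::energy_model set to energy_model, and a registered data handle handle (all hypothetical names):
unsigned i, ntasks = 40 * starpu_cpu_worker_get_count();

/* Start the energy counters for the CPU workers */
starpu_energy_start(-1, STARPU_CPU_WORKER);

/* Submit the same kind of task many times so that all CPUs are busy */
for (i = 0; i < ntasks; i++)
    starpu_task_insert(&cl, STARPU_RW, handle, 0);
starpu_task_wait_for_all();

/* Build a specimen task of the same type and size, then fold the
   average per-task energy into the performance model */
struct starpu_task *specimen = starpu_task_build(&cl, STARPU_RW, handle, 0);
starpu_energy_stop(&energy_model, specimen, 0, ntasks, -1, STARPU_CPU_WORKER);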
For the energy and power measurements, depending on the system configuration, users may have to run applications as root to use the PAPI library.
The function starpu_energy_stop() uses PAPI_stop() to stop counting and store the counter values into an array. We compute both the energy in Joules and the power consumption in Watts, and call starpu_perfmodel_update_history() to provide these explicit measurements to the performance model.
An example is available in the file tests/perfmodels/regression_based_memset.c.
In some cases, one may want to force some scheduling, for instance force a given set of tasks to GPU0, another set to GPU1, etc. while letting some other tasks be scheduled on any other device. This can indeed be useful to guide StarPU into some work distribution, while still letting some degree of dynamism. For instance, to force execution of a task on CUDA0:
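/* sketch: assuming task points to the struct starpu_task to be pinned */
task->execute_on_a_specific_worker = 1;
task->workerid = starpu_worker_get_by_type(STARPU_CUDA_WORKER, 0);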
An example is in the file tests/errorcheck/invalid_tasks.c.
or equivalently
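starpu_task_insert(&cl,
                   STARPU_EXECUTE_ON_WORKER, starpu_worker_get_by_type(STARPU_CUDA_WORKER, 0),
                   /* ... the usual data and arguments ... */
                   0);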
One can also specify a set of workers which are allowed to take the task, as an array of bits, for instance to allow workers 2 and 42:
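/* sketch: a bitmap with one bit per worker, allowing only workers 2 and 42 */
task->workerids = calloc(2, sizeof(uint32_t));
task->workerids[2/32] |= (1 << (2%32));
task->workerids[42/32] |= (1 << (42%32));
task->workerids_len = 2;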
One can also specify the order in which tasks must be executed by setting the field starpu_task::workerorder. An example is available in the file tests/main/execute_schedule.c. If this field is set to a non-zero value, it provides the per-worker consecutive order in which tasks will be executed, starting from 1. For a given such task, the worker will thus not execute it before all the tasks with a smaller order value have been executed, notably in case those tasks are not available yet due to some dependencies. This eventually gives total control of task scheduling, and StarPU will only serve as a "self-timed" task runtime. Of course, the provided order has to be runnable, i.e. a task should not depend on another task bound to the same worker with a larger order value.
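A minimal sketch, pinning two tasks to the same (hypothetically chosen) worker 0 and fixing their relative order:
taskA->execute_on_a_specific_worker = 1;
taskA->workerid = 0;
taskA->workerorder = 1;  /* executed first on worker 0 */
taskB->execute_on_a_specific_worker = 1;
taskB->workerid = 0;
taskB->workerorder = 2;  /* executed second, even if it becomes ready earlier */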
Note however that using scheduling contexts while statically scheduling tasks on workers could be tricky. Be careful to schedule the tasks exactly on the workers of the corresponding contexts, otherwise the workers' corresponding scheduling structures may not be allocated or the execution of the application may deadlock. Moreover, the hypervisor should not be used when statically scheduling tasks.
Within Heteroprio, one priority per processing unit type is assigned to each task, such that a task has several priorities. Each worker pops the task that has the highest priority for the hardware type it uses, which could be CPU or CUDA for example. Therefore, the priorities have to be used to manage the critical path, but also to promote the consumption of tasks by the most appropriate workers.
The tasks are stored inside buckets, where each bucket corresponds to a priority set. Each worker then uses an indirect access array to know the order in which it should access the buckets. Moreover, all the tasks inside a bucket must be compatible with (at least) all the processing units that may access it.
These priorities are now automatically assigned by Heteroprio in auto calibration mode using heuristics. If you want to set these priorities manually, you can change STARPU_HETEROPRIO_USE_AUTO_CALIBRATION and follow the example below.
In this example code, we have 5 types of tasks. CPU workers can compute all of them, but CUDA workers can only execute tasks of types 0 and 1, and are expected to run 20 and 30 times faster than the CPU, respectively.
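A sketch of the corresponding initialization callback, using the Heteroprio configuration functions (see the full source referenced below for the exact setup):
void init_heteroprio(unsigned sched_ctx)
{
    /* CPU workers use 5 buckets, visited in the natural order */
    starpu_heteroprio_set_nb_prios(sched_ctx, STARPU_CPU_WORKER, 5);
    unsigned idx;
    for (idx = 0; idx < 5; ++idx)
    {
        starpu_heteroprio_set_mapping(sched_ctx, STARPU_CPU_WORKER, idx, idx);
        starpu_heteroprio_set_faster_arch(sched_ctx, STARPU_CPU_WORKER, idx);
    }
    /* CUDA workers use 2 buckets, and look at bucket 1 before bucket 0 */
    starpu_heteroprio_set_nb_prios(sched_ctx, STARPU_CUDA_WORKER, 2);
    starpu_heteroprio_set_mapping(sched_ctx, STARPU_CUDA_WORKER, 0, 1);
    starpu_heteroprio_set_mapping(sched_ctx, STARPU_CUDA_WORKER, 1, 0);
    /* CUDA is the fastest for buckets 0 and 1; the CPU is 20 and 30 times slower */
    starpu_heteroprio_set_faster_arch(sched_ctx, STARPU_CUDA_WORKER, 0);
    starpu_heteroprio_set_arch_slow_factor(sched_ctx, STARPU_CPU_WORKER, 0, 20.0f);
    starpu_heteroprio_set_faster_arch(sched_ctx, STARPU_CUDA_WORKER, 1);
    starpu_heteroprio_set_arch_slow_factor(sched_ctx, STARPU_CPU_WORKER, 1, 30.0f);
}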
Then, when a task is inserted, its priority will be used to select in which bucket it has to be stored. So, in the given example, the priority of a task will be between 0 and 4 inclusive. However, tasks of priorities 0-1 must provide CPU and CUDA kernels, and tasks of priorities 2-4 must provide (at least) CPU kernels. The full source code of this example is available in the file examples/scheduler/heteroprio_test.c.
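For instance, a task of type 3 (CPU-only in this example, with a hypothetical codelet cl3 and data handle) could be inserted as:
starpu_task_insert(&cl3, STARPU_PRIORITY, 3, STARPU_RW, handle, 0);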
Heteroprio supports a mode where locality is evaluated to guide the distribution of tasks (see https://peerj.com/articles/cs-190.pdf). Currently, this mode can be enabled with a dedicated function or with the STARPU_HETEROPRIO_USE_LA environment variable, and can be configured using environment variables.
In this mode, multiple strategies are available to determine which memory node's workers are the most qualified to execute a specific task; the strategy can be set with the STARPU_LAHETEROPRIO_PUSH environment variable.
Other environment variables to configure LaHeteroprio are documented in Configuring LAHeteroprio.
In this mode, Heteroprio saves data about each program execution, in order to improve future ones. By default, these files are stored in the folder used by perfmodel, but this can be changed using the STARPU_HETEROPRIO_DATA_DIR environment variable. You can also specify the data filename directly using STARPU_HETEROPRIO_DATA_FILE.
Additionally, to assign priorities to tasks, Heteroprio needs a way to detect that some tasks are similar. By default, Heteroprio groups tasks with the same perfmodel, or with the same codelet name if no perfmodel was assigned. This behavior can be changed to only consider the codelet name by setting STARPU_HETEROPRIO_CODELET_GROUPING_STRATEGY to 1.
Other environment variables to configure AutoHeteroprio are documented in Configuring AutoHeteroprio.