7.1 Task Scheduling Policies
The basics of the scheduling policy are the following:
- The scheduler gets to schedule tasks (push operation) when they become ready to be executed, i.e. when they are no longer waiting for tags, data dependencies or task dependencies.
- Workers pull tasks (pop operation) one by one from the scheduler.
This means scheduling policies usually contain at least one queue of tasks, to store them between the time when they become available and the time when a worker grabs them.
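The push/pop protocol can be pictured as a shared task queue between the scheduler and the workers. The sketch below is purely illustrative C (the toy_* names are hypothetical and this is not StarPU code), showing the two operations that every scheduling policy implements in some form:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical toy task type, standing in for a real StarPU task. */
struct toy_task { void (*func)(void *); void *arg; struct toy_task *next; };

/* One shared FIFO protected by a mutex: the scheduler pushes ready tasks,
 * workers pop them one by one (roughly what the eager policy does). */
struct toy_queue {
    struct toy_task *head, *tail;
    pthread_mutex_t lock;
};

/* push: called when a task becomes ready, i.e. all its dependencies are met. */
static void toy_push(struct toy_queue *q, struct toy_task *t)
{
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_mutex_unlock(&q->lock);
}

/* pop: called by an idle worker to grab the next task (NULL if none is ready). */
static struct toy_task *toy_pop(struct toy_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct toy_task *t = q->head;
    if (t) { q->head = t->next; if (!q->head) q->tail = NULL; }
    pthread_mutex_unlock(&q->lock);
    return t;
}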
By default, StarPU uses the work-stealing scheduler lws, because it provides good load balancing and locality even if the application codelets do not have performance models. Other non-modelling scheduling policies can be selected among the list below, thanks to the environment variable STARPU_SCHED. For instance, export STARPU_SCHED=dmda. Use STARPU_SCHED=help to get the list of available schedulers. Use the function starpu_sched_get_predefined_policies() to retrieve the NULL-terminated array of all predefined scheduling policies available in StarPU. Use starpu_sched_get_sched_policy_in_ctx() or starpu_sched_get_sched_policy() to retrieve the scheduling policy of a task within a specific context or the default context, respectively.
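The scheduler can also be chosen programmatically at initialization time, and the predefined policies can be inspected from the application. Below is a minimal sketch using starpu_sched_get_predefined_policies() and the sched_policy_name field of struct starpu_conf; the policy_name and policy_description fields are assumed to be exposed as in recent StarPU releases:

#include <stdio.h>
#include <starpu.h>

int main(void)
{
    /* Select a scheduler programmatically instead of through STARPU_SCHED. */
    struct starpu_conf conf;
    starpu_conf_init(&conf);
    conf.sched_policy_name = "dmda";   /* same effect as export STARPU_SCHED=dmda */

    if (starpu_init(&conf) != 0)
        return 1;

    /* Print the NULL-terminated array of predefined scheduling policies. */
    struct starpu_sched_policy **policies = starpu_sched_get_predefined_policies();
    for (unsigned i = 0; policies[i] != NULL; i++)
        printf("%s: %s\n", policies[i]->policy_name, policies[i]->policy_description);

    /* ... submit tasks ... */
    starpu_shutdown();
    return 0;
}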
7.1.1 Non Performance Modelling Policies
- The eager scheduler uses a central task queue, from which all workers draw tasks to work on concurrently. This however does not permit prefetching data, since the scheduling decision is taken late. If a task has a non-zero priority, it is put at the front of the queue.
- The random scheduler uses a queue per worker, and distributes tasks randomly according to assumed worker overall performance.
- The ws (work stealing) scheduler uses a queue per worker, and schedules a task on the worker which released it by default. When a worker becomes idle, it steals a task from the most loaded worker.
- The lws (locality work stealing) scheduler uses a queue per worker, and schedules a task on the worker which released it by default. When a worker becomes idle, it steals a task from neighbor workers. It also takes into account priorities.
- The prio scheduler also uses a central task queue, but sorts tasks by the priority specified by the programmer (a small priority example follows this list).
- The heteroprio scheduler uses different priorities for the different processing units. This scheduler must be configured as described in the corresponding section in order to work correctly and deliver high performance.
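For the priority-aware policies above (eager, prio, lws), the priority is simply set on the task before submission. A minimal sketch, assuming a codelet and a registered data handle created elsewhere:

#include <starpu.h>

/* Submit one high-priority task; `cl` and `handle` stand for a codelet and a
 * registered data handle created elsewhere in the application. */
int submit_urgent_task(struct starpu_codelet *cl, starpu_data_handle_t handle)
{
    struct starpu_task *task = starpu_task_create();
    task->cl = cl;
    task->handles[0] = handle;
    /* Priority-aware policies (eager, prio, lws, ...) run higher values first. */
    task->priority = STARPU_MAX_PRIO;
    return starpu_task_submit(task);
}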
7.1.2 Performance Model-Based Task Scheduling Policies
If (and only if) your application codelets have performance models (PerformanceModelExample), you should change the scheduler thanks to the environment variable STARPU_SCHED to select one of the policies below, in order to take advantage of StarPU's performance modelling. For instance, export STARPU_SCHED=dmda. Use STARPU_SCHED=help to get the list of available schedulers.
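For reference, attaching a history-based performance model to a codelet is enough for the policies below to start exploiting it once calibrated. A minimal sketch, with hypothetical kernel and symbol names:

#include <starpu.h>

/* Hypothetical CPU implementation of the kernel. */
static void my_kernel_cpu(void *buffers[], void *cl_arg)
{
    (void)buffers; (void)cl_arg;
    /* ... actual computation ... */
}

/* History-based model: StarPU measures executions and builds the model
 * automatically; the symbol identifies the model on disk. */
static struct starpu_perfmodel my_perf_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "my_kernel_model",
};

static struct starpu_codelet my_cl =
{
    .cpu_funcs = { my_kernel_cpu },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &my_perf_model,
};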
Note: Depending on the performance model type chosen, some preliminary calibration runs may be needed for the model to converge. If the calibration has not been done, or is not yet sufficient, or if no performance model is specified for a codelet, every task built from this codelet will be scheduled using an eager fallback policy.
Troubleshooting: Configuring and recompiling StarPU with the configure option --enable-verbose displays some statistics at the end of execution about the percentage of tasks which have been scheduled by a DM* family policy using performance model hints. A low or zero percentage may be the sign that performance models are not converging or that codelets do not have performance models enabled.
- The dm (deque model) scheduler takes task execution performance models into account to perform a HEFT-like scheduling strategy: it schedules tasks where their termination time will be minimal (sketched after this list). The difference with HEFT is that dm schedules tasks as soon as they become available, and thus in the order they become available, without taking priorities into account.
- The dmda (deque model data aware) scheduler is similar to dm, but it also takes into account data transfer time.
- The dmdap (deque model data aware prio) scheduler is similar to dmda, except that it sorts tasks by priority order, which brings it even closer to HEFT by respecting priorities after having made the scheduling decision (but it still schedules tasks in the order they become available).
- The dmdar (deque model data aware ready) scheduler is similar to dmda, but it also privileges tasks whose data buffers are already available on the target device.
- The dmdas combines dmdap and dmdar: it sorts tasks by priority order, but for a given priority it will privilege tasks whose data buffers are already available on the target device.
- The dmdasd (deque model data aware sorted decision) scheduler is similar to dmdas, except that when scheduling a task, it takes into account its priority when computing the minimum completion time, since this task may get executed before others, and thus the latter should be ignored.
- The heft (heterogeneous earliest finish time) scheduler is a deprecated alias for dmda.
- The pheft (parallel HEFT) scheduler is similar to dmda, but it also supports parallel tasks (still experimental). It should not be used when several contexts using it are being executed simultaneously.
- The peager (parallel eager) scheduler is similar to eager, but it also supports parallel tasks (still experimental). It should not be used when several contexts using it are being executed simultaneously.
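Schematically, the dm/dmda decision for one ready task can be pictured as the greedy loop below. This is an illustration only, not StarPU's actual implementation; the predict_execution and predict_transfer callbacks are hypothetical stand-ins for the performance-model and bus-calibration estimates:

/* Illustration only: pick, for one ready task, the worker minimizing the
 * predicted termination time (dm), optionally adding the predicted data
 * transfer time (dmda). */
static int pick_worker(int nworkers,
                       const double *worker_available_at,          /* when each worker frees up */
                       double (*predict_execution)(int worker),    /* performance-model estimate */
                       double (*predict_transfer)(int worker))     /* bus-calibration estimate */
{
    int best = -1;
    double best_end = 0.0;
    for (int w = 0; w < nworkers; w++)
    {
        double end = worker_available_at[w]
                   + predict_execution(w)
                   + predict_transfer(w);   /* dm would omit this term */
        if (best < 0 || end < best_end)
        {
            best = w;
            best_end = end;
        }
    }
    return best;
}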
7.1.3 Modularized Schedulers
StarPU provides a powerful way to implement schedulers, as documented in DefiningANewModularSchedulingPolicy. It is currently shipped with the following pre-defined Modularized Schedulers:
- modular-eager, modular-eager-prefetching are eager-based schedulers (without and with prefetching): they are naive schedulers, which try to map a task on the first available resource they find. The prefetching variant queues several tasks in advance to be able to do data prefetching. This may however degrade load balancing a bit.
- modular-prio, modular-prio-prefetching, modular-eager-prio are prio-based schedulers (without / with prefetching), similar to the eager-based schedulers. They can handle tasks which have a defined priority and schedule them accordingly. The modular-eager-prio variant integrates the eager and priority queues in a single component, which allows it to do a better job at pushing tasks.
- modular-random, modular-random-prio, modular-random-prefetching, modular-random-prio-prefetching are random-based schedulers (without/with prefetching): they randomly select a resource to map each task on.
- modular-ws implements work stealing: it maps tasks to workers in round-robin order, but allows workers to steal work from other workers.
- modular-heft, modular-heft2, and modular-heft-prio are HEFT schedulers: they map tasks to workers using a heuristic very close to Heterogeneous Earliest Finish Time. To work efficiently, they need every task submitted to StarPU to have a defined performance model (PerformanceModelCalibration), but they can handle tasks without one. modular-heft just takes tasks in order. modular-heft2 takes at most 5 tasks of the same priority and checks which one fits best. modular-heft-prio is similar to modular-heft, but only decides the memory node, not the exact worker, just pushing tasks to one central queue per memory node. By default, they sort tasks by priority and privilege running first a task which has most of its data already available on the target. This behavior can however be changed with STARPU_SCHED_SORTED_ABOVE, STARPU_SCHED_SORTED_BELOW, and STARPU_SCHED_READY (see the sketch after this list).
- modular-heteroprio is a heteroprio scheduler: it maps tasks to workers similarly to HEFT, but first attributes accelerated tasks to GPUs, then not-so-accelerated tasks to CPUs.
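As with the other policies, a modular scheduler is selected through STARPU_SCHED, and the variables above can be set alongside it. A minimal sketch, assuming the usual 0/1 boolean convention for STARPU_SCHED_READY and that the variables are set before starpu_init():

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Environment variables must be set before starpu_init() is called. */
    setenv("STARPU_SCHED", "modular-heft-prio", 1);
    /* Assumed 0/1 convention: disable the "ready data first" tie-breaking. */
    setenv("STARPU_SCHED_READY", "0", 1);

    if (starpu_init(NULL) != 0)
        return 1;
    /* ... submit tasks ... */
    starpu_shutdown();
    return 0;
}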
7.2 Task Distribution Vs Data Transfer
Distributing tasks to balance the load induces a data transfer penalty. StarPU thus needs to find a balance between both. The target function that StarPU's dmda scheduler tries to minimize is alpha * T_execution + beta * T_data_transfer, where T_execution is the estimated execution time of the codelet (usually accurate), and T_data_transfer is the estimated data transfer time. The latter is estimated based on bus calibration before execution start, i.e. with an idle machine, thus without contention. You can force bus re-calibration by running the tool starpu_calibrate_bus. The beta parameter defaults to 1, but it can be worth trying to tweak it, for instance by using export STARPU_SCHED_BETA=2 (STARPU_SCHED_BETA), since during real application execution contention makes transfer times bigger. This is of course imprecise, but in practice a rough estimation already gives the good results that a precise estimation would give.
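As a worked example of how beta shifts decisions, consider a task with an estimated 2 ms execution and 3 ms transfer on a GPU, versus 10 ms execution and no transfer on a CPU: with beta = 1 the GPU wins (5 ms vs 10 ms), while with beta = 3 the CPU wins (11 ms vs 10 ms). The small standalone program below (illustrative numbers only) computes exactly this:

#include <stdio.h>

/* Worked example of the dmda objective alpha * T_execution + beta * T_data_transfer,
 * with made-up times in milliseconds. */
int main(void)
{
    const double alpha = 1.0;
    const double t_exec_gpu = 2.0,  t_xfer_gpu = 3.0;   /* fast kernel, data must move */
    const double t_exec_cpu = 10.0, t_xfer_cpu = 0.0;   /* slow kernel, data already local */

    for (double beta = 1.0; beta <= 3.0; beta += 2.0)
    {
        double gpu = alpha * t_exec_gpu + beta * t_xfer_gpu;
        double cpu = alpha * t_exec_cpu + beta * t_xfer_cpu;
        printf("beta=%.0f: GPU=%.1f ms, CPU=%.1f ms -> schedule on %s\n",
               beta, gpu, cpu, gpu < cpu ? "GPU" : "CPU");
    }
    return 0;
}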