StarPU Handbook
10. Tasks In StarPU

10.1 Task Granularity

Like any other runtime, StarPU has some overhead to manage tasks. Since it does smart scheduling and data management, this overhead is not always negligible. The overhead is typically on the order of a couple of microseconds, which is actually much smaller than the CUDA overhead itself. Each task should therefore do substantially more work than that, so that the overhead becomes negligible. The offline performance feedback can provide a measure of task length, which should be checked if poor performance is observed. To get a sense of how scalability depends on task size, one can run tests/microbenchs/tasks_size_overhead.sh, which draws curves of the speedup of independent tasks of very small sizes.

This benchmark is installed in $STARPU_PATH/lib/starpu/examples/. It gives a glimpse into how long a task should be (in µs) for StarPU overhead to be low enough to keep efficiency. Running tasks_size_overhead.sh generates a plot of the speedup of tasks of various sizes, depending on the number of CPUs being used.

For example, in the figure below, for 128 µs tasks (the red line), StarPU overhead is low enough to guarantee a good speedup as long as the number of CPUs does not exceed 36. With the same number of CPUs, however, 64 µs tasks (the black line) cannot achieve a good speedup: the number of CPUs has to be decreased to about 17 to keep efficiency.

To determine what task size your application is actually using, one can use starpu_fxt_data_trace (see Data trace and tasks length).

The choice of scheduler also affects the overhead: for instance, the scheduler dmda takes time to make a decision, while eager does not. tasks_size_overhead.sh can again be used to get a sense of how much impact that has on the target machine.
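
The trade-off discussed in this section can be made concrete with a rough model. The sketch below assumes a fixed per-task overhead and no contention between workers, which is a simplification: the benchmark curves show that efficiency also degrades as the number of CPUs grows. The helper names task_efficiency and print_efficiency_table are hypothetical and not part of StarPU, and the overhead value is whatever the caller chooses to pass.

```c
#include <stdio.h>

/* Fraction of time spent on useful work when every task of task_us
 * microseconds also pays overhead_us microseconds of runtime overhead.
 * Hypothetical helper for illustration, not a StarPU function. */
double task_efficiency(double task_us, double overhead_us)
{
    return task_us / (task_us + overhead_us);
}

/* Print the modelled efficiency for a few task lengths. */
void print_efficiency_table(double overhead_us)
{
    const double sizes_us[] = { 4.0, 16.0, 64.0, 128.0, 1024.0 };
    size_t i;

    for (i = 0; i < sizeof(sizes_us) / sizeof(sizes_us[0]); i++)
        printf("%7.0f us task: %.1f%% useful work\n",
               sizes_us[i],
               100.0 * task_efficiency(sizes_us[i], overhead_us));
}
```

With an assumed overhead of 2 µs, a 128 µs task spends 128/130 of its time on useful work under this model, whereas a 4 µs task spends only two thirds. Measured curves from tasks_size_overhead.sh remain the reference for a given machine and scheduler.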

10.2 Task Submission

To let StarPU make online optimizations, tasks should be submitted asynchronously as much as possible. Ideally, the application submits all its tasks and then simply calls starpu_task_wait_for_all() or starpu_data_unregister() to wait for termination. StarPU is then able to rework the whole schedule, overlap computation with communication, manage accelerator local memory usage, etc. A simple example is in the file examples/basic_examples/variable.c.

10.3 Task Priorities

By default, StarPU will consider the tasks in the order they are submitted by the application. If the application programmer knows that some tasks should be performed in priority (for instance because their output is needed by many other tasks and may thus be a bottleneck if not executed early enough), the field starpu_task::priority should be set to provide the priority information to StarPU. Here is an example: examples/heat/dw_factolu_tag.c.

10.4 Setting Many Data Handles For a Task

The maximum number of data a task can manage is fixed by the macro STARPU_NMAXBUFS, whose default value can be changed through the configure option --enable-maxbuffers.

However, it is possible to define tasks managing more data by using the field starpu_task::dyn_handles when defining a task and the field starpu_codelet::dyn_modes when defining the corresponding codelet.

enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
    STARPU_R, STARPU_R, /* ... one mode per buffer, STARPU_NMAXBUFS+1 in total */
};

struct starpu_codelet dummy_big_cl =
{
    .cuda_funcs = { dummy_big_kernel },
    .opencl_funcs = { dummy_big_kernel },
    .cpu_funcs = { dummy_big_kernel },
    .cpu_funcs_name = { "dummy_big_kernel" },
    .nbuffers = STARPU_NMAXBUFS+1,
    .dyn_modes = modes
};

struct starpu_task *task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
    task->dyn_handles[i] = handle;
}
starpu_task_submit(task);

The same task can also be submitted with starpu_task_insert(), passing the handles through STARPU_DATA_ARRAY:

starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
    handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
                   STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
                   STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
                   0);

The whole code for this example is available in the file examples/basic_examples/dynamic_handles.c.

10.5 Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is set with starpu_codelet::nbuffers. This field can however be set to STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers must be set, and starpu_task::modes (or starpu_task::dyn_modes, see Setting Many Data Handles For a Task) should be used to specify the modes for the handles. Examples in examples/basic_examples/dynamic_handles.c show how to implement it.

10.6 Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease the creation and submission of tasks.

Here is the implementation of a codelet:

void func_cpu(void *descr[], void *_args)
{
    int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
    float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
    *x0 = *x0 * ifactor;
    *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
    .cpu_funcs = { func_cpu },
    .cpu_funcs_name = { "func_cpu" },
    .nbuffers = 2,
    .modes = { STARPU_RW, STARPU_RW }
};

And the call to the function starpu_task_insert():

starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);

The call to starpu_task_insert() is equivalent to the following code:

struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
void *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);

In the example file tests/main/insert_task_value.c, we use these two ways to create and submit tasks.

Instead of calling starpu_codelet_pack_args(), one can also call starpu_codelet_pack_arg_init(), then starpu_codelet_pack_arg() for each argument, then starpu_codelet_pack_arg_fini() as follows:

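struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
struct starpu_codelet_pack_arg_data state;
starpu_codelet_pack_arg_init(&state);
starpu_codelet_pack_arg(&state, &ifactor, sizeof(ifactor));
starpu_codelet_pack_arg(&state, &ffactor, sizeof(ffactor));
starpu_codelet_pack_arg_fini(&state, &task->cl_arg, &task->cl_arg_size);
int ret = starpu_task_submit(task);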

A full code example is in file tests/main/pack.c.

Here is a similar call using STARPU_DATA_ARRAY:

starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);

If some part of the task insertion depends on the value of some computation, the macro STARPU_DATA_ACQUIRE_CB can be very convenient. For instance, assuming that the index variable i was registered as handle i_handle:

/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));

The macro STARPU_DATA_ACQUIRE_CB submits an asynchronous request for acquiring data i for the main application, and will execute the code given as the third parameter when it is acquired. In other words, as soon as the value of i computed by the codelet which_index can be read, the portion of code passed as the third parameter of STARPU_DATA_ACQUIRE_CB will be executed, and is allowed to read from i to use it e.g. as an index. Note that this macro is only available when compiling StarPU with the compiler gcc. In the example file tests/datawizard/acquire_cb_insert.c, this macro is used.

StarPU also provides a utility function starpu_codelet_unpack_args() to retrieve the STARPU_VALUE arguments passed to the task. There are several ways of calling starpu_codelet_unpack_args(), shown in turn below. The full code examples are available in the file tests/main/insert_task_value.c.

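/* Retrieve all the arguments in a single call */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

/* Retrieve only the first argument (a 0 pointer stops the unpacking),
 * then retrieve all of them with a second call */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, 0);
    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

/* Retrieve the first argument and copy the remaining ones into buffer,
 * which is then unpacked by a later call */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;
    char buffer[100];

    starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
    starpu_codelet_unpack_args(buffer, &ffactor);
}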

Instead of calling starpu_codelet_unpack_args(), one can also call starpu_codelet_unpack_arg_init(), then starpu_codelet_unpack_arg() or starpu_codelet_dup_arg() or starpu_codelet_pick_arg() for each argument, then starpu_codelet_unpack_arg_fini() as follows:

void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;
    struct starpu_codelet_pack_arg_data state;
    /* Total size of the packed argument buffer */
    size_t size = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, size);
    starpu_codelet_unpack_arg(&state, &ifactor, sizeof(ifactor));
    starpu_codelet_unpack_arg(&state, &ffactor, sizeof(ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}
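
With starpu_codelet_dup_arg(), the size of each argument is returned instead of being supplied, and the pointer receives a copy of the argument:

void func_cpu(void *descr[], void *_args)
{
    int *ifactor;
    float *ffactor;
    size_t size;
    struct starpu_codelet_pack_arg_data state;
    size_t psize = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, psize);
    starpu_codelet_dup_arg(&state, (void**)&ifactor, &size);
    assert(size == sizeof(*ifactor));
    starpu_codelet_dup_arg(&state, (void**)&ffactor, &size);
    assert(size == sizeof(*ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}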

With starpu_codelet_pick_arg(), the pointer is set to the argument inside the packed buffer, without copying it:

void func_cpu(void *descr[], void *_args)
{
    int *ifactor;
    float *ffactor;
    size_t size;
    struct starpu_codelet_pack_arg_data state;
    size_t psize = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, psize);
    starpu_codelet_pick_arg(&state, (void**)&ifactor, &size);
    assert(size == sizeof(*ifactor));
    starpu_codelet_pick_arg(&state, (void**)&ffactor, &size);
    assert(size == sizeof(*ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}

During unpacking, one can also call the function starpu_codelet_unpack_discard_arg() to skip an argument without saving it.

A full code example is in file tests/main/pack.c.
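
The size expression used in the unpacking examples (one int, two size_t, then the int and float payloads) suggests how the argument buffer is laid out: an argument count followed, for each argument, by its size and its bytes. The plain C sketch below mimics such a layout to show where each term of the expression comes from. It is an assumption inferred from that expression, not StarPU's actual implementation, and all toy_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Append one argument as [size_t size][payload]; returns the new offset. */
size_t toy_pack_arg(char *buf, size_t off, const void *ptr, size_t size)
{
    memcpy(buf + off, &size, sizeof(size));
    off += sizeof(size);
    memcpy(buf + off, ptr, size);
    return off + size;
}

/* Pack an argument count, then an int and a float.
 * Returns the total number of bytes written to buf. */
size_t toy_pack_two(char *buf, int ifactor, float ffactor)
{
    int nargs = 2;
    size_t off = 0;

    memcpy(buf + off, &nargs, sizeof(nargs));
    off += sizeof(nargs);
    off = toy_pack_arg(buf, off, &ifactor, sizeof(ifactor));
    off = toy_pack_arg(buf, off, &ffactor, sizeof(ffactor));
    return off;
}

/* Read the argument stored at offset off into ptr and report its size
 * through *size; returns the offset of the next argument. */
size_t toy_unpack_arg(const char *buf, size_t off, void *ptr, size_t *size)
{
    memcpy(size, buf + off, sizeof(*size));
    off += sizeof(*size);
    memcpy(ptr, buf + off, *size);
    return off + *size;
}
```

In this toy layout, toy_pack_two() writes exactly sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float) bytes, which matches the expression used above. Real code should rely on the StarPU pack and unpack functions rather than on any assumption about the buffer format.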