StarPU Handbook - StarPU Extensions
21. Creating Parallel Workers On A Machine

21.1 General Ideas

Parallel workers are a concept introduced in this paper where they are called clusters.

The granularity problem is tackled by using resource aggregation: instead of dynamically splitting tasks, resources are aggregated to process coarse grain tasks in a parallel fashion. This is built on top of scheduling contexts to be able to handle any type of parallel tasks.

This comes from a basic idea, making use of two levels of parallelism in a DAG. We keep the DAG parallelism, but consider on top of it that a task can contain internal parallelism. A good example is if each task in the DAG is OpenMP enabled.

The particularity of such tasks is that we will combine the power of two runtime systems: StarPU will manage the DAG parallelism and another runtime (e.g. OpenMP) will manage the internal parallelism. The challenge is in creating an interface between the two runtime systems so that StarPU can regroup cores inside a machine (creating what we call a parallel worker) on top of which the parallel tasks (e.g. OpenMP tasks) will be run in a contained fashion.

The aim of the parallel worker API is to facilitate this process automatically. For this purpose, we depend on the hwloc tool to detect the machine configuration and then partition it into usable parallel workers.

An example of code running on parallel workers is available in examples/sched_ctx/parallel_workers.c.

Let's first look at how to create a parallel worker.

To enable parallel workers in StarPU, one needs to set the configure option --enable-parallel-worker.

21.2 Workers Creating Parallel Workers

Partitioning a machine into parallel workers with the parallel worker API is fairly straightforward. The simplest way is to state under which machine topology level we wish to regroup all resources. This level is a hwloc object, of the type hwloc_obj_type_t. More information can be found in the hwloc documentation.

Once a parallel worker is created, the full machine is represented with an opaque structure starpu_parallel_worker_config. This can be printed to show the current machine state.

struct starpu_parallel_worker_config *parallel_workers;
parallel_workers = starpu_parallel_worker_init(HWLOC_OBJ_SOCKET, 0);
starpu_parallel_worker_print(parallel_workers);
/* submit some tasks with OpenMP computations */
/* we are back to the default StarPU state */
int starpu_parallel_worker_print(struct starpu_parallel_worker_config *parallel_workers)
int starpu_parallel_worker_shutdown(struct starpu_parallel_worker_config *parallel_workers)
struct starpu_parallel_worker_config * starpu_parallel_worker_init(hwloc_obj_type_t parallel_worker_level,...)

The following graphic is an example of what a particular machine can look like once parallel workers are created. The main difference is that we have less worker queues and tasks which will be executed on several resources at once. The execution of these tasks will be left to the internal runtime system, represented with a dashed box around the resources.

StarPU using parallel tasks

Creating parallel workers as shown in the example above will create workers able to execute OpenMP code by default. The parallel worker creation function starpu_parallel_worker_init() takes optional parameters after the hwloc object (always terminated by the value 0) which allow parametrizing the parallel workers creation. These parameters can help to create parallel workers of a type different from OpenMP, or create a more precise partition of the machine.

This is explained in Section Creating Custom Parallel Workers.

Before starpu_shutdown(), we call starpu_parallel_worker_shutdown() to delete the parallel worker configuration.

21.3 Example Of Constraining OpenMP

Parallel workers require being able to constrain the runtime managing the internal task parallelism (internal runtime) to the resources set by StarPU. The purpose of this is to express how StarPU must communicate with the internal runtime to achieve the required cooperation. In the case of OpenMP, StarPU will provide an awake thread from the parallel worker to execute this liaison. It will then provide on demand the process ids of the other resources supposed to be in the region. Finally, thanks to an OpenMP region, we can create the required number of threads and bind each of them on the correct region. These will then be reused each time we encounter a #pragma omp parallel in the following computations of our program.

The following graphic is an example of what an OpenMP-type parallel worker looks like and how it is represented in StarPU. We can see that one StarPU (black) thread is awake, and we need to create on the other resources the OpenMP threads (in pink).

StarPU with an OpenMP parallel worker

Finally, the following code shows how to force OpenMP to cooperate with StarPU and create the aforementioned OpenMP threads constrained in the parallel worker's resources set:

void starpu_parallel_worker_openmp_prologue(void * sched_ctx_id)
{
int sched_ctx = *(int*)sched_ctx_id;
int *cpuids = NULL;
int ncpuids = 0;
int workerid = starpu_worker_get_id();
//we can target only CPU workers
{
//grab all the ids inside the parallel worker
starpu_sched_ctx_get_available_cpuids(sched_ctx, &cpuids, &ncpuids);
//set the number of threads
omp_set_num_threads(ncpuids);
#pragma omp parallel
{
//bind each threads to its respective resource
starpu_sched_ctx_bind_current_thread_to_cpuid(cpuids[omp_get_thread_num()]);
}
free(cpuids);
}
return;
}
void starpu_parallel_worker_openmp_prologue(void *)
enum starpu_worker_archtype starpu_worker_get_type(int id)
int starpu_worker_get_id(void)
@ STARPU_CPU_WORKER
Definition: starpu_worker.h:67

This function is the default function used when calling starpu_parallel_worker_init() without extra parameter.

Parallel workers are based on several tools and models already available within StarPU contexts, and merely extend contexts. More on contexts can be read in Section Scheduling Contexts.

A similar example is available in the file examples/sched_ctx/parallel_code.c.

21.4 Creating Custom Parallel Workers

Parallel workers can be created either with the predefined types provided within StarPU, or with user-defined functions to bind another runtime inside StarPU.

The predefined parallel worker types provided by StarPU are STARPU_PARALLEL_WORKER_OPENMP, STARPU_PARALLEL_WORKER_INTEL_OPENMP_MKL and STARPU_PARALLEL_WORKER_GNU_OPENMP_MKL.

If StarPU is compiled with the MKL library, STARPU_PARALLEL_WORKER_GNU_OPENMP_MKL uses MKL functions to set the number of threads, which is more reliable when using an OpenMP implementation different from the Intel one. Otherwise, it will behave as STARPU_PARALLEL_WORKER_INTEL_OPENMP_MKL.

The parallel worker type is set when calling the function starpu_parallel_worker_init() with the parameter STARPU_PARALLEL_WORKER_TYPE as in the example below, which is creating a MKL parallel worker.

struct starpu_parallel_worker_config *parallel_workers;
parallel_workers = starpu_parallel_worker_init(HWLOC_OBJ_SOCKET,
0);
#define STARPU_PARALLEL_WORKER_TYPE
Definition: starpu_parallel_worker.h:83
@ STARPU_PARALLEL_WORKER_GNU_OPENMP_MKL
Definition: starpu_parallel_worker.h:113

Using the default type STARPU_PARALLEL_WORKER_OPENMP is similar to calling starpu_parallel_worker_init() without any extra parameter.

An example is available in examples/parallel_workers/parallel_workers.c.

Users can also define their own function.

void foo_func(void* foo_arg);
int foo_arg = 0;
struct starpu_parallel_worker_config *parallel_workers;
parallel_workers = starpu_parallel_worker_init(HWLOC_OBJ_SOCKET,
0);
#define STARPU_PARALLEL_WORKER_CREATE_FUNC_ARG
Definition: starpu_parallel_worker.h:79
#define STARPU_PARALLEL_WORKER_CREATE_FUNC
Definition: starpu_parallel_worker.h:74

An example is available in examples/parallel_workers/parallel_workers_func.c.

Parameters that can be given to starpu_parallel_worker_init() are STARPU_PARALLEL_WORKER_MIN_NB, STARPU_PARALLEL_WORKER_MAX_NB, STARPU_PARALLEL_WORKER_NB, STARPU_PARALLEL_WORKER_POLICY_NAME, STARPU_PARALLEL_WORKER_POLICY_STRUCT, STARPU_PARALLEL_WORKER_KEEP_HOMOGENEOUS, STARPU_PARALLEL_WORKER_PREFERE_MIN, STARPU_PARALLEL_WORKER_CREATE_FUNC, STARPU_PARALLEL_WORKER_CREATE_FUNC_ARG, STARPU_PARALLEL_WORKER_TYPE, STARPU_PARALLEL_WORKER_AWAKE_WORKERS, STARPU_PARALLEL_WORKER_PARTITION_ONE, STARPU_PARALLEL_WORKER_NEW and STARPU_PARALLEL_WORKER_NCORES.

21.5 Parallel Workers With Scheduling

As previously mentioned, the parallel worker API is implemented on top of Scheduling Contexts. Its main addition is to ease the creation of a machine CPU partition with no overlapping by using hwloc, whereas scheduling contexts can use any number of any type of resources.

It is therefore possible, but not recommended, to create parallel workers using the scheduling contexts API. This can be useful mostly in the most complex machine configurations, where users have to dimension precisely parallel workers by hand using their own algorithm.

/* the list of resources the context will manage */
int workerids[3] = {1, 3, 10};
/* indicate the list of workers assigned to it, the number of workers,
the name of the context and the scheduling policy to be used within
the context */
int id_ctx = starpu_sched_ctx_create(workerids, 3, "my_ctx", 0);
/* let StarPU know that the following tasks will be submitted to this context */
starpu_sched_ctx_set_task_context(id);
task->prologue_callback_pop_func=&runtime_interface_function_here;
/* submit the task to StarPU */
int starpu_task_submit(struct starpu_task *task)
unsigned starpu_sched_ctx_create(int *workerids_ctx, int nworkers_ctx, const char *sched_ctx_name,...)

As this example illustrates, creating a context without scheduling policy will create a parallel worker. The interface function between StarPU and the other runtime must be specified through the field starpu_task::prologue_callback_pop_func. Such a function can be similar to the OpenMP thread team creation one (see above). An example is available in examples/sched_ctx/parallel_tasks_reuse_handle.c.

Note that the OpenMP mode is the default mode both for parallel workers and contexts. The result of a parallel worker creation is a woken-up master worker and sleeping "slaves" which allow the master to run tasks on their resources.

To create a parallel worker with woken-up workers, the flag STARPU_SCHED_CTX_AWAKE_WORKERS must be set when using the scheduling context API function starpu_sched_ctx_create(), or the flag STARPU_PARALLEL_WORKER_AWAKE_WORKERS must be set when using the parallel worker API function starpu_parallel worker_init().