Commit 3a2f55b

Add RFC for creation and use of NUMA arenas
1 parent 6a57193 commit 3a2f55b

1 file changed (+147, -0 lines changed)

@@ -0,0 +1,147 @@
#+title: API to Facilitate Instantiation and Use of oneTBB's Task Arenas Constrained to NUMA Nodes

*Note:* This is a sub-RFC of https://github.com/oneapi-src/oneTBB/pull/1535.

* Introduction
Consider the example from the "Setting the preferred NUMA node" section of the
[[https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html][Guiding Task Scheduler Execution]] page of the oneTBB Developer Guide.

** Motivating example
#+begin_src C++
std::vector<tbb::numa_node_id> numa_indexes = tbb::info::numa_nodes(); // [0]
std::vector<tbb::task_arena> arenas(numa_indexes.size()); // [1]
std::vector<tbb::task_group> task_groups(numa_indexes.size()); // [2]

for(unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].initialize(tbb::task_arena::constraints(numa_indexes[j])); // [3]
    arenas[j].execute([&task_groups, &j](){ // [4]
        task_groups[j].run([](){/*some parallel stuff*/});
    });
}

for(unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].execute([&task_groups, &j](){ task_groups[j].wait(); }); // [5]
}
#+end_src

Users of oneTBB typically employ this technique to keep oneTBB worker threads
within NUMA nodes while still utilizing all the parallelism of the platform.
The pattern first queries how many NUMA nodes the system has. The user then
creates that many ~tbb::task_arena~ objects, constraining each to a dedicated
NUMA node. Along with the ~tbb::task_arena~ objects, the user instantiates the
same number of ~tbb::task_group~ objects, with which the oneTBB tasks are going
to be associated. The ~tbb::task_group~ objects are needed because they allow
waiting for work completion, as the ~tbb::task_arena~ class does not provide
synchronization semantics on its own. The work is then submitted into each of
the arena objects and waited for at the end.

** Interface issues and inconveniences:
- [0] - Getting the number of NUMA nodes is not a goal in itself, but rather a
  necessity to know how many objects to initialize afterwards.
- [1] - Explicit step for creating one ~tbb::task_arena~ object per NUMA node.
  Note that by default arena objects are constructed with a slot reserved for
  the master thread, which in this particular example usually results in
  undersubscription, as the master thread can join only one arena at a time to
  help with work processing.
- [2] - Separate step for instantiating the same number of ~tbb::task_group~
  objects, into which the actual work is going to be submitted. Note that the
  user also needs to make sure the size of ~arenas~ matches the size of
  ~task_groups~.
- [3] - Actual tying of ~tbb::task_arena~ instances to the corresponding NUMA
  nodes. Note that the user needs to make sure the indices of ~tbb::task_arena~
  objects match the corresponding indices of NUMA nodes.
- [4] - Actual work submission point. It is relatively easy to make a mistake
  here by using the ~tbb::task_arena::enqueue~ method instead. In this case, not
  only might the work be submitted after the synchronization point [5], but the
  capture of the loop counter ~j~ by reference also becomes a mistake, which at
  best results in submitting the work into an incorrect ~tbb::task_group~, and
  at worst in a segmentation fault, since the loop counter might no longer
  exist by the time the functor starts its execution.
- [5] - Synchronization point, where the user again needs to make sure the
  corresponding indices are used. Otherwise, the waiting might be done in an
  unrelated ~tbb::task_arena~. It is also possible to mistakenly use the
  ~tbb::task_arena::enqueue~ method here, with the same consequences as
  outlined in the previous bullet, but since this is a synchronization point,
  the blocking call is usually used.
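
The capture pitfall described in [4] is not specific to oneTBB; it applies to
any deferred execution. The following is a minimal sketch that uses only the
C++ standard library and no oneTBB calls. The helper names
~make_deferred_tasks~ and ~run_all~ are hypothetical. A vector of
~std::function~ objects stands in for non-blocking submission: the functors run
only after the submitting loop has ended, so the loop counter has to be
captured by value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Builds one deferred task per slot. The tasks run later, after this function
// and its loop counter are gone, which mimics non-blocking work submission.
std::vector<std::function<void()>> make_deferred_tasks(std::vector<int>& slots) {
    std::vector<std::function<void()>> tasks;
    for (std::size_t j = 0; j < slots.size(); ++j) {
        // Capture j BY VALUE: each functor keeps its own copy of the index.
        // Capturing [&slots, &j] instead would leave a dangling reference to j
        // once the loop ends (undefined behavior when the task finally runs).
        tasks.push_back([&slots, j]() { slots[j] = static_cast<int>(j) + 1; });
    }
    return tasks;
}

// Runs the deferred tasks; stands in for a scheduler executing them later.
void run_all(const std::vector<std::function<void()>>& tasks) {
    for (const auto& task : tasks) task();
}
```

With a blocking call such as ~tbb::task_arena::execute~, capturing ~j~ by
reference is safe because the functor finishes before the loop advances. The
by-value capture only becomes mandatory once submission is deferred.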

The proposal below addresses these issues.

* Proposal
Introduce a simplified interface to:
- Constrain a task arena to a specific NUMA node,
- Submit work into constrained task arenas, and
- Wait for completion of the submitted work.

Since the new interface represents a constrained ~tbb::task_arena~, the
proposed name is ~tbb::constrained_task_arena~. Not including the word "numa"
in the name leaves room to extend it to other types of constraints in the
future.

** Usage Example
#+begin_src C++
std::vector<tbb::constrained_task_arena> numa_arenas =
    tbb::initialize_numa_constrained_arenas();

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].enqueue([](){/*some parallel stuff*/});
}

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].wait();
}
#+end_src

** New arena interface
The example above requires a new class named ~tbb::constrained_task_arena~. On
the one hand, it is a ~tbb::task_arena~ that isolates work execution from other
parallel work executed by oneTBB. On the other hand, it is a constrained arena
that is associated with a certain NUMA node and allows efficient and less
error-prone work submission in this particular usage scenario.

#+begin_src C++
namespace tbb {

class constrained_task_arena : protected task_arena {
public:
    // Using-declarations name the member only, without parentheses.
    using task_arena::is_active;
    using task_arena::terminate;

    using task_arena::max_concurrency;

    using task_arena::enqueue;

    // Blocks until the work submitted through enqueue() has finished.
    void wait();
private:
    constrained_task_arena(tbb::task_arena::constraints);
    friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
};

// Namespace-scope declaration: a friend declaration alone does not make the
// function visible to ordinary name lookup.
std::vector<constrained_task_arena> initialize_numa_constrained_arenas();

}
#+end_src

The interface exposes only the methods necessary to submit parallel work and
wait for it. Most of the exposed member functions are taken from the base
~tbb::task_arena~ class. Implementation-wise, the new task arena would include
an associated ~tbb::task_group~ instance, with which enqueued work will be
implicitly associated.

The ~tbb::constrained_task_arena::wait~ method waits for the work in the
associated ~tbb::task_group~ to finish, if any was submitted using the
~tbb::constrained_task_arena::enqueue~ method.

An instance of the ~tbb::constrained_task_arena~ class can be created only by
the ~tbb::initialize_numa_constrained_arenas~ function, whose sole purpose is to
instantiate a ~std::vector~ of initialized ~tbb::constrained_task_arena~
instances, each constrained to its own NUMA node of the platform and created
without reserved slots, and to return this vector to the caller.
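
The semantics described above (non-blocking ~enqueue~, a blocking ~wait~, and
creation only through a factory function) can be illustrated with a toy model.
This sketch is built only on the C++ standard library. It is not oneTBB code and
not the proposed implementation: one worker thread stands in for the arena's
workers, a pending-work counter stands in for the associated ~tbb::task_group~,
and no NUMA pinning is performed. The names ~sketch::constrained_arena_model~
and ~sketch::make_arena_models~ are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of oneTBB

class constrained_arena_model;
// Declared at namespace scope so that callers can find the friend factory.
std::vector<constrained_arena_model> make_arena_models(std::size_t count);

class constrained_arena_model {
public:
    constrained_arena_model(constrained_arena_model&&) = default;
    ~constrained_arena_model() {
        if (!s_) return;  // a moved-from object owns nothing
        {
            std::lock_guard<std::mutex> lock(s_->m);
            s_->stop = true;
        }
        s_->cv.notify_all();
        s_->worker.join();
    }

    // Non-blocking submission; the work is tracked by the arena itself.
    void enqueue(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lock(s_->m);
            s_->queue.push_back(std::move(f));
            ++s_->pending;
        }
        s_->cv.notify_all();
    }

    // Blocks until everything enqueued so far has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(s_->m);
        s_->cv.wait(lock, [this] { return s_->pending == 0; });
    }

private:
    struct state {
        std::mutex m;
        std::condition_variable cv;
        std::deque<std::function<void()>> queue;
        std::size_t pending = 0;  // plays the role of the task group
        bool stop = false;
        std::thread worker;
    };

    constrained_arena_model() : s_(new state) {
        state* s = s_.get();
        s->worker = std::thread([s] {
            std::unique_lock<std::mutex> lock(s->m);
            for (;;) {
                s->cv.wait(lock, [s] { return s->stop || !s->queue.empty(); });
                if (s->queue.empty()) return;  // stop requested, queue drained
                auto f = std::move(s->queue.front());
                s->queue.pop_front();
                lock.unlock();
                f();  // run the task without holding the lock
                lock.lock();
                if (--s->pending == 0) s->cv.notify_all();
            }
        });
    }

    std::unique_ptr<state> s_;  // keeps the non-movable state at a stable address
    friend std::vector<constrained_arena_model> make_arena_models(std::size_t);
};

std::vector<constrained_arena_model> make_arena_models(std::size_t count) {
    std::vector<constrained_arena_model> arenas;
    arenas.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // The friend constructs the object itself and moves it in; emplace_back
        // would construct inside the library, where the constructor is private.
        arenas.push_back(constrained_arena_model());
    }
    return arenas;
}

}  // namespace sketch
```

One design point the model makes visible: elements of a ~std::vector~ must be
movable, so the non-movable synchronization state is held behind a
~std::unique_ptr~. Since ~tbb::task_group~ is neither copyable nor movable, a
real implementation that embeds one would presumably need a similar
indirection.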

* Open Questions
1. Should the interface for creating constrained task arenas support other
   construction parameters (e.g., max_concurrency, number of reserved slots,
   priority, other constraints) from the very beginning, or is the proposed
   interface enough for the first iteration, with these parameters added in the
   future when the need arises?
2. Should the new task arena allow re-initializing it, possibly with different
   parameters, after its creation?
3. Should the new task arena interface allow copying of its settings by exposing
   its copy constructor, similarly to what ~tbb::task_arena~ does?
