Exploiting full computational power of current more and more hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. Unfortunately, most operating systems only provide a poor scheduling API that does not allow applications to transmit valuable scheduling hints to the system. In a previous paper, we showed that using a bubble-based thread scheduler can significantly improve applications' performance in a portable way. However, since multithreaded applications have various scheduling requirements, there is no universal scheduler that could meet all these needs. In this paper, we present a framework that allows scheduling experts to implement and experiment with customized thread schedulers. It provides a powerful API for dynamically distributing bubbles among the machine in a high-level, portable, and efficient way. Several examples show how experts can then develop, debug and tune their own portable bubble schedulers.
arXiv:0706.2069v1 [cs.DC] 14 Jun 2007
Building Portable Thread Schedulers for Hierarchical Multiprocessors: the BubbleSched Framework
Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier
INRIA Futurs - LaBRI
351 cours de la libération
33405 Talence cedex, France
{thibault,namyst,wacrenier}@labri.fr
Keywords: Threads, Scheduling, Bubbles, NUMA, SMP, Multi-Core, SMT.
1 Introduction
Both AMD and Intel now provide quad-core chips and are heading for 100-core chips. In the low-end market, this is the emerging part of a deep trend, already visible in the scientific computation market, towards more and more complex machines (e.g. Sun WildFire, SGI Altix, Bull NovaScale). Such large shared-memory machines are typically based on Non-Uniform Memory Architectures (NUMA). Recent technologies such as Simultaneous Multi-Threading (SMT) and multi-core chips make these architectures even more hierarchical.
Exploiting these machines efficiently is a real challenge, and a thread scheduler is faced with dilemmas when trying to take into account the memory hierarchy and the CPU utilization simultaneously. On NUMA machines for instance, threads should generally be scheduled as close to their data as possible, but bandwidth-consuming threads should rather be distributed over different chips. The core scheduler of operating systems can often be influenced, but it lacks knowledge of the precise application behavior: for instance, adaptively-refined meshes entail very irregular and unpredictable behavior. A good solution would be to let application programmers take control of the scheduling, but writing a whole scheduler for hierarchical machines is a very difficult task.
In a previous paper [10], we introduced the bubble scheduling concept, which helps to express the inherent parallel structure of multithreaded applications in a way that can be efficiently exploited by the underlying thread scheduler. Bubbles are abstractions that recursively group threads which work together. The first proof-of-concept implementation of our bubble scheduler featured a generic hard-coded scheduler. However, applications may have different scheduling requirements and thus may attach different semantics to bubbles, for instance enforcing memory affinity or emphasizing a high frequency of global synchronization operations. Obviously, no generic scheduler can meet all these needs. In this paper, we present BubbleSched, a framework designed to ease the development and the evaluation of customized, high-level thread schedulers.
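The recursive grouping that bubbles express can be illustrated with a minimal sketch. It is written in Python for brevity and is not the BubbleSched API: the names `Thread`, `Bubble`, `insert`, `threads` and `depth` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Thread:
    """A schedulable thread, identified by name only (hypothetical)."""
    name: str


@dataclass
class Bubble:
    """A group of threads and nested bubbles that work together."""
    name: str
    members: List[Union["Thread", "Bubble"]] = field(default_factory=list)

    def insert(self, member):
        """Add a thread or a sub-bubble, then return self for chaining."""
        self.members.append(member)
        return self

    def threads(self):
        """Recursively collect every thread held by this bubble."""
        found = []
        for member in self.members:
            if isinstance(member, Bubble):
                found.extend(member.threads())
            else:
                found.append(member)
        return found

    def depth(self):
        """Nesting depth: 1 for a bubble that holds only threads."""
        inner = [m.depth() for m in self.members if isinstance(m, Bubble)]
        return 1 + max(inner, default=0)


# Two pairs of closely cooperating threads, grouped in one application bubble.
pair_a = Bubble("pair_a").insert(Thread("t0")).insert(Thread("t1"))
pair_b = Bubble("pair_b").insert(Thread("t2")).insert(Thread("t3"))
app = Bubble("app").insert(pair_a).insert(pair_b)
```

In this sketch, a scheduler that receives `app` can see that `t0` and `t1` belong together more tightly than `t0` and `t2`, which is the kind of structural hint a flat list of threads cannot carry.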
2 On the Design of Thread Schedulers
Designing a thread scheduler for hierarchical machines is complex because it means finding an application-specific compromise between lots of constraints: favoring affinities between threads and memory, taking advantage of all computational power, reducing synchronization cost, etc.
2.1 What Input Can a Scheduler Expect?
To make appropriate decisions at execution time, a thread scheduler can combine a number of parameters to evaluate the goodness of each potential scheduling action. These parameters can be collected from several places, at different times.
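One simple way to picture such a combination is a weighted score over candidate actions. The sketch below is only an assumption made for illustration: the text does not specify how parameters are combined, and the `Action` fields, the weight names, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A candidate scheduling action with hypothetical, normalized estimates."""
    name: str
    memory_affinity: float   # 0..1, share of the thread's data that stays local
    cpu_utilization: float   # 0..1, share of processors kept busy
    sync_cost: float         # 0..1, relative synchronization overhead


def goodness(action, weights):
    """Weighted sum: rewards affinity and utilization, penalizes sync cost."""
    return (weights["affinity"] * action.memory_affinity
            + weights["utilization"] * action.cpu_utilization
            - weights["sync"] * action.sync_cost)


def best_action(actions, weights):
    """Return the candidate with the highest goodness."""
    return max(actions, key=lambda a: goodness(a, weights))


candidates = [
    Action("keep_on_local_node", 1.0, 0.5, 0.2),
    Action("spread_over_chips", 0.4, 1.0, 0.6),
]

# Application-specific weights: a memory-bound code versus a compute-bound one.
memory_bound = {"affinity": 2.0, "utilization": 1.0, "sync": 1.0}
compute_bound = {"affinity": 0.5, "utilization": 2.0, "sync": 0.5}
```

With these made-up numbers, `best_action(candidates, memory_bound)` selects `keep_on_local_node`, while `best_action(candidates, compute_bound)` selects `spread_over_chips`. The same candidates rank differently once the weights change, which is why the compromise has to be application-specific.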
At runtime, some useful knowledge about the target machine can be discovered. The scheduler can get not only the number of processors but also the architecture hierarchy: how processors and memory banks are connected, how cache levels are shared between processors, etc. Moreover, indications of how well the threads are using the underlying processors can be fetched from performance counters. Some NUMA chips can also report the ratio of remote memory accesses. The scheduler can hence check whether threads and data are prop