\documentclass[conference]{IEEEtran}
%\usepackage{latex8}
%\usepackage{times}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{algorithm,algorithmic}
%\ifCLASSINFOpdf
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
\begin{document}
\title{A
Stable
Broadcast
Algorithm}
\author{
\IEEEauthorblockN{Kei
Takahashi,
Hideo
Saito,
Takeshi
Shibata
and
Kenjiro
Taura}
\IEEEauthorblockA{The
University
of
Tokyo\\
Email:
\{kay,
h\_saito,
shibata,
tau\}@}}
\maketitle
%
\begin{abstract}
%OK
Distributing
large
data
to
many
nodes,
known
as
a
broadcast
or
a
multicast,
is
an
important
operation
in
parallel
and
distributed
computing.
Most
previous
broadcast
algorithms
explicitly
or
implicitly
try
to
deliver
data
to
all
nodes
at the same rate.
While
being
reasonable
for
homogeneous
environments
where
all
nodes
have
similar
receiving
capabilities,
such
algorithms
may
suffer
from
slowly
receiving
nodes
in
heterogeneous
settings.
In
such
settings,
each
node
desires
to
receive
data
at
its
largest
possible
bandwidth
and
start
computation
as
soon
as
it
receives
the
data.
In
this
paper
we
propose
to
say
a
broadcast
is
{\em
stable\/}
when
the
bandwidth
to
a
node
is
never
sacrificed
by
the
presence
of
other,
possibly
slow,
receiving
nodes,
and
propose stability
as
a
desired
property
of
broadcast
algorithms.
In
addition,
we
show
a
simple
and
efficient
stable
broadcast algorithm when
the
topology
among
nodes
is
a
tree
and
each
link
has
a
symmetric
bandwidth.
This
work
improves
upon
previously
proposed
algorithms
such
as
FPFR
and
Balanced
Multicast.
For
general
graphs,
it
outperforms
them
when
the
network
is
heterogeneous
and
for
trees,
it
is
stable
and
optimal.
Our
simulation
confirmed
that
our
algorithm
has
the
desired
properties.
In
heterogeneous
environments,
our
scheme
yielded
twice
the
aggregate
bandwidth
compared
to
a
simpler
algorithm
using
a
single
pipeline.
We
have
also
performed
an
experiment
on
a
real
environment
to
demonstrate
its
practicality.
\end{abstract}
\IEEEpeerreviewmaketitle
\section{Introduction}
\label{sec:background}
There
are
growing
demands
to
effectively
execute
data-intensive
applications
in
distributed,
wide-area
environments.
Since
data
transfers
occupy
a
considerable
part
of
the
total
processing
time
in
the
execution
of
such
applications,
efficient
data
transfer
techniques
are
required.
One
of
the
most
common,
practical
transfer
problems
is
broadcast.
In
a
broadcast,
one
or
more
{\it
source}
nodes
have
some
data
that
need
to
be
copied
to
some
{\it
destination}
nodes.
Various
algorithms
have
been
proposed
to
optimize
broadcasts
according
to
various
criteria,
such
as
completion
time
and
aggregate
bandwidth
(the
sum
of
the
bandwidths
of
all
nodes)~\cite{fastreplica,balanced,fpfr,mpichg2,bittorrent,mob}.
Such
existing
broadcast
algorithms
have
a
few
shortcomings,
especially
for
data-intensive
applications
executed
in
distributed,
wide-area
environments.
One
is
that
they
are
not
{\it
stable},
in
that
the
addition
of
a
new
node
to
an
existing
broadcast
can
potentially
lower
the
bandwidth
of
a
node
already
participating
in
the
broadcast.
This
complicates
task
scheduling,
because
earlier
scheduling
decisions
may
have
depended
on
a
particular
node
receiving
data
at
a
particular
rate.
Another
shortcoming
of
existing
broadcast
algorithms
is
that
they
are
not
{\it
optimal}
in
terms
of
aggregate
bandwidth.
Even
though
many
existing
algorithms
do
optimize
for
aggregate
bandwidth,
they
do
so
under
the
assumption
that
all
nodes
have
the
same
incoming
bandwidth.
As
a
result,
they
do
not
explore
the
possibility
of
different
nodes
receiving
data
at
different
rates,
missing
the
opportunity
of
having
nodes
that
receive
data
early
start
computation
without
waiting
for
other
nodes.
In
wide-area
environments
where
link
speeds
can
vary
a
great
amount,
it
is
especially
important
to
prevent
nodes
with
slow
links
from
slowing
down
nodes
with
faster
links.
In
this
paper,
we
propose
a
broadcast
algorithm
that
uses
bandwidth-annotated
topology
information
to
optimize
for
aggregate
bandwidth.
Our
algorithm
improves
an
existing
algorithm
called
FPFR~\cite{fpfr},
and
performs
broadcasts
with
higher
aggregate
bandwidth
than
FPFR
for
any
graph
topology.
Moreover,
for
tree
topologies,
we
prove
that
our
algorithm
satisfies
the
following
two
properties:
\begin{itemize}
\item
{\it
Stability:}
Adding
a
node
to
an
existing
broadcast
does
not
lower
the
incoming
bandwidth
of
any
node
already
participating
in
the
broadcast.
\item
{\it
Optimality:}
A
broadcast
performed
using
our
algorithm
achieves
the
maximum
possible
aggregate
incoming
bandwidth.
\end{itemize}
While
our
algorithm
works
well
with
any
graph
topology,
we
use
tree
topologies
for
simplicity
in
the
exposition.
To
evaluate
our
algorithm,
we
used
the
topology
inference
tool
developed
by
Shirai
et
al.~\cite{shirai}
to
obtain
a
tree
topology,
and
performed
both
simulations
and
real-machine
experiments.
The
rest
of
this
paper
is
organized
as
follows.
Section~\ref{sec:broadcast} reviews some previous broadcast techniques.
Our
algorithm
as
well
as
proof
of
optimality
and
stability
are
shown
in
Section
\ref{sec:algorithm}.
Section \ref{sec:eva} presents the experiments and evaluation of our algorithm in a real environment,
and
we
conclude
the
paper
in
Section
\ref{sec:conclusion}.
\section{Related
Work}\label{sec:broadcast}
In this section, we review some existing broadcast techniques.
Although
much
research
has
been
done
to
improve
aggregate
bandwidth,
none
has
been
evaluated
in
terms
of
stability.
\subsection{Topology-unaware
Broadcast}
% flat tree
The
simplest
broadcast
algorithm
is
{\it
flat-tree},
in
which
a
source
node
directly
sends
data
to
all
the
destinations.
Because this algorithm performs poorly due to the bottleneck at the source, various improved broadcast algorithms have been proposed.
% Fast Replica
In
{\it
FastReplica}~\cite{fastreplica},
the
source
splits
data
into
small
pieces
and
sends
each
piece
to
a
different
destination.
Then,
the
destinations
construct
a
ring
connection,
exchange
pieces
and
finally
obtain
the
whole
file.
In
this
method,
the
broadcast
bandwidth
is
not
restricted
by
the
master.
However,
since
the
method
does
not
consider
network
topologies,
congestion may be induced
by
using
the
same
link
many
times.
For
example,
if
a
certain
link
is
used
by
$N$
transfers,
the
bandwidth
is
reduced
to
$1/N$.
In
addition,
since
a
node
connected
with
a
narrow
link
relays
data
to
other
node,
it
limits
the
performance
of
all
the
other
nodes.
% BitTorrent
{\it
BitTorrent}~\cite{bittorrent,
sched_bt}
and
some
other
work
\cite{mob}
adaptively
improve
the
transfer
graph,
by
changing
the
relaying
network.
First,
data
are
divided
into
small
pieces
and
each
node
is
allowed
to
receive
them
in
an
arbitrary
order.
Each
node
periodically
changes
the
parent,
and
nodes
with
a
broad
link
are
more
likely
to
relay
data.
As
a
result,
the
aggregate
bandwidth
is
increased.
However,
these
algorithms
cannot
always
reach
a
schedule
that
avoids
the
link
sharing.
In
addition,
it
takes
time
to
reach
a
better
schedule
since
these
methods
improve
the
transfer
tree
by
trial and error.
\subsection{Topology-aware
Broadcast}
To avoid such congestion and obtain a good schedule from the beginning,
some
methods
that
use
network
topology
information
have
been
proposed.
% Mpich
Karonis
et
al.
proposed
to
use
a
hierarchical
tree
that
describes
the
real
network
topology~\cite{mpichg2}.
A
broadcast
is
performed
by
tracing
the
tree:
a
node
relays
data
to
all
its
children.
Since
a
link
is
only
used
once
for
the
same
data,
it
can
avoid
self-induced
congestion.
However,
since
the
tree
has
many
branches,
the
aggregate
bandwidth
decreases.
If
a
node
has
$N$
children,
the
bandwidth
of
the
lower-level
nodes
is
limited
to
at most $1/N$.
% Pipeline
In
a
pipeline
transfer~\cite{shirai,
DBLP},
each
node
relays
data
to
only
one
other
node.
By
tracing
a
topology
in
a
depth-first
manner,
a
pipeline
is
obtained.
In
this
pipeline,
each
directional
link
is
used
only
once,
so
the
throughput
is
limited
by
the
narrowest
link
in
it.
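As a minimal illustration of this point (the link bandwidth values below are a hypothetical example, not taken from the paper), the throughput of a pipeline can be computed as:

```python
# The throughput of a pipeline is the bandwidth of its narrowest link,
# since every byte must traverse every link in the pipeline.
def pipeline_throughput(link_bandwidths):
    return min(link_bandwidths)

# A depth-first pipeline crossing links of bandwidth 10, 3 and 8:
print(pipeline_throughput([10, 3, 8]))  # -> 3
```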
\begin{figure}[tb]
\begin{center}
\includegraphics[width=9cm]{spanning.eps}\\
\end{center}
\caption{Effect
of
Adding
Nodes
with
Narrow
Bandwidth}
\label{fig:spanning}
\end{figure}
However,
the
depth-first
pipeline
cannot
achieve
good
performance
in
terms
of
aggregate
bandwidth
when
some
nodes
lack
bandwidth.
Figure
\ref{fig:spanning}
illustrates
a
situation
in
which
new
node
$D_{3}$
with
small
bandwidth
joins
the
broadcast.
One
source
$S$
holds
the
original
data, and three destinations $D_{0}, D_{1}, D_{2}$ need it.
The
label
on
each
link
shows
the
bidirectional
bandwidth.
Originally
the
aggregate
bandwidth
of
$D_{0}$,
$D_{1}$
and
$D_{2}$
is
$15$.
In
the
new
pipeline,
however,
the
aggregate
bandwidth
has
dropped
to
$6$.
This
deterioration
can
be
avoided
by
concurrently
using
multiple
transfers.
In
{\it
Multi-stream
Pipeline}
in
Figure~\ref{fig:spanning},
the
aggregate
bandwidth
is
improved
to
$16$.
%\begin{figure}[tb]
%\begin{center}
%\includegraphics[width=8cm]{flat_pipe.eps}\\
%\end{center}
%\caption{Flat tree and pipeline broadcast}
%\label{fig:flat_pipe}
%\end{figure}
%
In the {\it flat-tree} schedule, the link near the source is saturated, and the aggregate bandwidth is limited to 10.
Although
the
aggregate
bandwidth
is
slightly
improved
to
12
in
the
{\it
Random
Pipeline},
which
connects
the
source
and
destination
in
a
random
order,
each
node
receives a much smaller amount of data than in an exclusive direct transfer.
This
is
because
transfers
of
the
same
data
go
back
and
forth
over the
same
link
many
times.
\subsection{Multistream
Broadcast}
% FPFR
The
Fast
Parallel
File
Replication
(FPFR)
tool~\cite{fpfr}
uses
bandwidth-annotated
network
topology
information
to
perform
efficient
broadcasts.
It
iteratively
constructs
multiple
spanning
trees,
and
uses
them
concurrently.
By
utilizing
available
link
bandwidth
more
effectively,
FPFR
achieves
higher
aggregate
bandwidth
than
methods
that
only
use
a
single
spanning
tree.
Let
the
network
topology
and
available
bandwidth
of
each
link
be
given
in
advance.
The
first
tree
is
built
based
on
the
original
bandwidth
values.
The authors tried several ways to construct spanning trees, and the {\it depth-first} method,
which
traces
the
destinations
from
the
source
in
a
depth-first
manner,
achieved
the
best
performance.
After the first tree is formed, its throughput is subtracted from the original bandwidth of all the links it uses.
The
following
trees
are
repeatedly
built
with
the
subtracted
bandwidth
values
until
no
spanning
tree
can
be
created.
These
multiple
spanning
trees
are
concurrently
used
with
the
planned
bandwidth.
Each
tree
sends
different
pieces
of
the
file,
and
finally
every
destination
receives
the
whole
file.
FPFR
has
achieved
better
performance
compared
to
a
method
that
uses
only
one
tree.
{\it
Balanced
Multicasting}~\cite{balanced}
has
further
improved
FPFR
by
using
linear
programming
to
maximize
the
aggregate
bandwidth.
However,
we
think
FPFR
has
problems
with
stability.
Since
FPFR
only
uses
spanning
trees,
a
node
with
narrow
bandwidth
limits
the
aggregate
bandwidth
of
the
other
nodes.
In the case of a tree topology in particular, it outputs only one pipeline.
Since
the
throughput
of
a
pipeline
is
limited
by
the
narrowest
link
in
it,
the
stability
is
not
achieved.
\def
\tapa#1{\langle
{#1}
\rangle}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Algorithm}
\label{sec:algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Basic
Idea}
Like
many
broadcast
algorithms
referred to in Section~\ref{sec:broadcast},
our
algorithm
takes
as
input
a
bandwidth-annotated
network
topology,
a
single
source
node,
and
multiple
destination
nodes,
to
generate
a
set
of
{\em
transfer
trees\/}
each
of
which
is
labeled
with
its
throughput
(the
bandwidth
it
consumes).
%
We assume the following network model:
\begin{enumerate}
%
\item
The
network
topology
and
the
bandwidth
of
each
link
are
known.
The
network
contains
nodes
as
its
leaf
nodes
and
switches
as
its
internal
nodes.
%
\item
Each
link
is
bidirectional
and
has
a
symmetric
bandwidth
in
both
directions.
%
\item
There
is
only
one
source
node.
%
\end{enumerate}
In
this
model,
a
set
of
transfer
trees
is
said to be
{\em
feasible\/}
when
the
total
throughput
used
on
each
link
is
within
its
capacity.
In
practical
terms,
we
model
the
problem
assuming
that
each
switch
has
a
sufficiently
strong
backplane
so
the
transfer
rate
is
limited
only
by
capacities
of
hosts
or
individual
ports
of
switches,
not
by
internal
switching
capacities.
%
These
conditions
are
satisfied
in
many
multi-cluster
environments,
on
which
data-intensive
applications
are
executed.
%
Under
these
conditions,
a broadcast algorithm is stable when adding more destination nodes never decreases the bandwidth delivered to the original nodes.
Formally,
for
any
pair
of
destination
sets
$D$
and
$E$
such
that
$D
\subset
E$,
and
for
any
node
$n
\in
D$,
the
bandwidth
to
$n$
generated
with
destinations
$E$
is
never
smaller
than
that
generated
with
destinations
$D$.
We
propose
a
broadcast
algorithm
which
achieves higher aggregate bandwidth than FPFR and Balanced Multicasting.
Besides,
the
algorithm
is
shown
to
be
stable
and
optimal
with
respect
to
aggregate
bandwidth
when
the
network
satisfies
the
following
conditions.
\begin{itemize}
\item
The
topology
is
a
tree.
\item
Each
link
is
bidirectional
and
has
a
symmetric
capacity
in
both
directions.
\end{itemize}
The
basic
idea
of
the
algorithm
is
similar
to
FPFR
and
Balanced Multicasting,
in
that
it
builds
multiple
transfer
trees
and
sends
data
fragments
through
them.
The
main
difference
is
that
while
FPFR
stops
building
transfer
trees
as
soon
as
the
narrowest
link
saturates,
our
algorithm
continues
to
make
{\em
partial
trees\/}
that
involve
only
a
subset
of
the
destinations.
Such
trees
play
an
important
role
to
guarantee
that
our
algorithm
is
stable,
or
that
nodes
connected
to
the
source
with
higher
bandwidth
will
receive
data
with
their
highest
bandwidths.
Transfer
scheduling
becomes
slightly
more
complex
than
FPFR
to
guarantee
that
all
nodes
receive
all
data
even though some transfer trees contain only
a
subset
of
the
destinations.
Details
are
shown
in
Section~\ref{sec:transferusing}.
%
Performance
of
FPFR
may
be
limited
by
nodes
with
narrow
bandwidths.
%
This
is
fundamentally
because
it
uses
only
trees
involving
all
destinations
to
transfer
data.
%
If
we
assume
that
the
data
transferred
in
different
pipelines
do
not
overlap,
%
all
pipelines
are
required
to
be
created
as
spanning
trees.
%
However,
if
we
introduce
a BitTorrent-like
transfer
policy,
partial
trees
that
do
not
connect
every
destination
can
be
used.
%
Based
on
this
idea,
we
continue
to
build
partial
trees
even when
no
spanning
tree
can
be
constructed.
%
The
idea
is
also
applicable
to
the
Balanced
Multicasting
algorithm,
and
improves
it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{algorithm}
\caption{Constructing
Transfer
Trees
}
\label{alg}
\begin{algorithmic}
\REQUIRE
$\mathcal{T}$
is
a
network
topology.
$B_0(e)$
is
the
bandwidth
for
link
$e$.
$s$
is
the
source
node
and
$D_0$
is
a
set
of
destinations.
\STATE
$B
:=
B_0$;
$T:=
\varnothing$;
\WHILE{TRUE}
\STATE
$t:=$
the
tree
obtained
by
tracing
destinations
in a
depth-first
manner
from
$s$;
\IF{
$t$
contains
no
destination
}
\STATE
\textbf{break}
\ENDIF
\STATE
$u:=$
the
bandwidth
of
the
narrowest
link
in
$t$;
\STATE
$T:=
T
\cup
\{
\tapa{t,u}
\}$;
\FOR{
each
link
$e$
in
$t$
}
\STATE
$B(e):=B(e)-u$;
\ENDFOR
\ENDWHILE
\RETURN
$T$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Sending
Data
via
Transfer
Trees}
\label{algtrans}
\begin{algorithmic}
\REQUIRE
$t_1,
\ldots
,
t_n$
are
the
transfer
trees
returned
by
the
construction
algorithm
($t_i$
is
made
by
the
$i$-th
iteration).
$u_i$
is
the
throughput
of
$t_i$.
$D$
is
the
data
to
broadcast.
\FOR{$k:=n$
downto
1}
\STATE
$R:=$
data
nodes
in
$t_k$
have
not
yet
received;
\STATE
send
$R$
through
$t_1,
\ldots
,
t_k$,
allocating
to
$t_i$
the
amount
of
data
proportional
to
$u_i$;
\ENDFOR
\end{algorithmic}
\end{algorithm}
\subsection{Constructing
Transfer
Trees}
The
algorithm
to
create
transfer
trees
is
shown
in
Algorithm~\ref{alg}.
The
network
topology
$\mathcal{T}$
is
given
in
advance
as
a
directed
graph,
which
consists
of
computation
nodes
as
leaf
nodes
and
switches
as
intermediate
nodes.
Each
link
is
bidirectional
and
modeled
as
two
separate
edges
of
the
graph.
In
addition,
the
bandwidth
(capacity)
$B_0(e)$
for
each
link
$e$
is
given.
%
A path $p$ is called a \emph{pipeline} iff $p$ is a single path from the source to some destinations, in which no directed link is used twice and the bandwidth of every link in $p$ is greater than $0$.
Like
FPFR,
the
algorithm
repeats
building
transfer
trees.
The
first
tree
is
constructed
by
visiting
all
destinations
from
the
source
node
in
a
depth-first
manner.
As
a
result,
it
creates
a
tree
that
connects
the
source
and
every
destination.
The
throughput
of
this
tree
is
set
to
the
capacity
of
the
narrowest
link
in
it.
\begin{figure}%[tb]
\begin{center}
\includegraphics[width=9cm]{pipelines.eps}\\
\end{center}
\caption{The
Algorithm
to
Build
Transfer
Trees}
\label{fig:pipelines}
\end{figure}
\begin{figure}%[tb]
\begin{center}
\includegraphics[width=9cm]{exe.eps}\\
\end{center}
\caption{Execution
of
the
Broadcast}
\label{fig:exe}
\end{figure}
After
the
first
tree
is
obtained,
the bandwidth
of
each
link
available
for
subsequent
trees
is
calculated
by
subtracting
the
throughput
allocated
to
the
first
tree.
Edges
whose
capacities
become
zero
are
removed.
Note
that
at
least
one
edge
is
removed.
The
second
tree
is
constructed
with
this
reduced
bandwidth
map
using
the
same
depth-first
traversal.
Note, however, that this time
it
may
not
be
possible
to
reach
every
destination
from
the
source
as
some
edges
have
been
removed.
If
that
happens
we
connect
only
destinations
reachable
from
the
source.
Note
that
if
the
network
is
a
tree,
in
particular,
at
least
one
node
becomes
unreachable.
The
throughput
of
the
second
tree
is
again
set
to
the
capacity
of
its
narrowest
link,
and
is
then
subtracted
to
each
link
used
by
the
tree.
We
repeat
this
until
no destination remains reachable
from
the
source.
Figure~\ref{fig:pipelines}
illustrates
a
process
to
build
transfer
trees.
One
source
$S$
and
destinations
$D_{0}
\dots
D_{4}$
are
engaged
in
this
broadcast.
The
first
tree
connects
every
destination
$D_{0}
\dots
D_{4}$; its throughput is 3, limited by the link near $D_{0}$.
The
second
tree
is
built
after
subtracting
3
from
each
link
used
by
the
first
tree.
It
connects
$D_{1}
\dots
D_{4}$
with
the
throughput
of
2.
After
the
construction
of
the
third
and
the
fourth
trees,
the
available
bandwidth
map
does not allow any further transfer tree to be constructed from the source.
Consequently,
four
trees
are
obtained.
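The construction loop above can be sketched in Python as follows. This is a simplified illustration under the paper's tree-topology assumption; the function name, graph encoding, and the small example topology are our own, not part of the paper.

```python
# Sketch of the tree-construction loop (Algorithm 1) for a tree topology.
# `adj` maps each node to its neighbours; `bw` maps an undirected link,
# written frozenset({a, b}), to its remaining capacity.
def build_trees(adj, bw, src, dests):
    trees = []
    while True:
        used, reached = [], set()

        def dfs(n, parent):
            hit = n in dests
            if hit:
                reached.add(n)
            for m in adj[n]:
                e = frozenset((n, m))
                if m != parent and bw.get(e, 0) > 0 and dfs(m, n):
                    used.append(e)   # keep a link only if its branch
                    hit = True       # contains a reachable destination
            return hit

        dfs(src, None)
        if not reached:              # no destination reachable: stop
            break
        u = min(bw[e] for e in used) # throughput = narrowest link
        trees.append((reached, u))
        for e in used:
            bw[e] -= u               # subtract the consumed bandwidth
    return trees

# Hypothetical example: source S, switch X, destinations D0 (narrow link)
# and D1. The first tree reaches both with throughput 2; the second
# reaches only D1 with throughput 3, so D1 still gets its full 5 in total.
adj = {'S': ['X'], 'X': ['S', 'D0', 'D1'], 'D0': ['X'], 'D1': ['X']}
bw = {frozenset(('S', 'X')): 5, frozenset(('X', 'D0')): 2,
      frozenset(('X', 'D1')): 5}
trees = build_trees(adj, bw, 'S', {'D0', 'D1'})
print([(sorted(d), u) for d, u in trees])  # -> [(['D0', 'D1'], 2), (['D1'], 3)]
```

Note how the slow node $D_{0}$ does not cap the faster node's total rate: the partial second tree restores $D_{1}$'s remaining bandwidth, which is the stability property the section argues for.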
This
algorithm
can
also
be
applied
to multi-source broadcasts.
For
each
source,
the
algorithm
shown
above
is
performed.
After the trees originating from one source are constructed, the bandwidth used by those trees is subtracted. With this remaining bandwidth, trees starting from the next source are constructed.
\subsection{Transfer
Using
Multiple
Trees}
\label{sec:transferusing}
Once
the
transfer
trees
have
been
constructed,
we
send
data
fragments
using
them.
To
achieve
the
claimed
stability
and
optimality,
all
trees
must
be
effectively
utilized.
Algorithm~\ref{algtrans}
shows
how
the
transfer
is
done.
Data
are
sent
in
multiple
{\em
stages\/},
with
each
stage
using
a different
set
of
transfer
trees.
Let
$t_i$
be
the
transfer
tree
made
by
the
$i$-th
iteration
of
the
construction
algorithm
($i
=
1,
\ldots
,
n$)
and
denote
by
$N(t)$
the
set
of
destinations
involved
in
tree
$t$.
Note
that
we
have
$N(t_n)
\subset
N(t_{n-1})
\subset
\cdots
\subset
N(t_1)$.
In
the
first
stage,
{\em
the
entire
data\/}
is
partitioned
into
fragments
and
sent
using
{\em
all\/}
transfer
trees.
Data
must
be
allocated
to
trees
so
that
they
finish
the
transfer
at
the
same
time.
In
principle
this
can
be
achieved
by
allocating
to
each
tree
the
amount
of
data
proportional
to
its
throughput.
In
the
actual
implementation,
we
achieve
this
by
dividing
data
into
small
chunks
and
by
allocating
data
dynamically
to
trees.
Each
transfer
edge
is
implemented
by
a
TCP
connection
and
we
send
a
chunk
via
a
connection
only
when
the
connection
is
writable
(i.e.,
writing
to
it
does
not
block).
At
the
end
of
the
first
stage,
nodes
in
$N(t_n)$
will
have
obtained
the
entire
data
and
need
not
receive any more data.
Other
nodes
only
get
a
part
of
the
data,
which
will
be
delivered
in
later
stages.
In
the
second
stage,
we
send
data
that
the
nodes
in
$N(t_{n-1})$
have
just
missed
in
the
first
stage
(i.e.,
data
sent
through
$t_n$)
using
$t_1,
\ldots
,
t_{n-1}$,
with
the
same
policy
to
allocate
data
to
individual
trees.
The
second
stage
will
deliver
all
data
to
nodes
involved
in
$t_{n-1}$.
Similarly
in
the
third
stage,
data
that
the
nodes
in
$N(t_{n-2})$
have
not
yet
received
will
be
sent
via
$t_1,
\ldots
,
t_{n-2}$
so
they
will
have
all
data
in
the
end
of
the
stage.
We
repeat
this
until
all
destination
nodes
receive
all
data.
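Under a proportional-allocation assumption, the amount of data each tree carries in each stage can be sketched as follows. This is our own illustration; as noted above, the actual implementation allocates small chunks dynamically rather than computing these amounts in advance.

```python
# Sketch of the staged transfer (Algorithm 2), assuming data is allocated
# to trees in proportion to their throughputs u_1..u_n and that the
# destination sets are nested: N(t_n) subset ... subset N(t_1).
def schedule_stages(throughputs, data_size):
    n = len(throughputs)
    sent = [0.0] * n            # total data sent through each tree so far
    stages = []
    missing = float(data_size)  # stage 1 must deliver the entire data
    for k in range(n, 0, -1):   # this stage uses trees t_1 .. t_k
        total = sum(throughputs[:k])
        alloc = [missing * u / total for u in throughputs[:k]]
        for i, a in enumerate(alloc):
            sent[i] += a
        stages.append(alloc)
        if k > 1:
            # nodes of t_{k-1} have missed exactly the data that was ever
            # sent through t_k; the next stage must resend that amount
            missing = sent[k - 1]
    return stages

# Two trees with throughputs 3 and 2 broadcasting 10 units of data:
# stage 1 sends 6 units via t_1 and 4 via t_2; stage 2 resends those
# 4 units via t_1 alone, so a node only in t_1 also receives all 10.
print(schedule_stages([3, 2], 10))  # -> [[6.0, 4.0], [4.0]]
```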
Figure~\ref{fig:exe}(a)
illustrates
how
the
transfers
are
performed
with
trees
constructed
in
the
example
of
Figure~\ref{fig:pipelines}.
Consider node $D_{3}$, for example,
which
is
connected
to
the
source
with
bandwidth
5.
It
receives
data
in
the
first
three
stages
(using
$t_1$
and
$t_2$)
with
the
maximum
bandwidth
5.
%
It
is
involved
in
the
first
two
trees
($t_1$
and
$t_2$).
In
the
first
and
the
second
stages,
it receives
the
same
amount
of
data
as
in
{\it
Exclusive
Direct
Transfer},
which
is
shown
in
Figure
\ref{fig:exe}(b).
%
Thus,
the
stability
and
the
optimality
properties
hold
for
$D_{3}$.
They
also
hold
for
every
other
destination.
%
The
procedure
to
send
a
file
is
as
follows.
Let trees $T_{0} \dots T_{n-1}$ be constructed in this order.
Each
tree
stops
sending
data
when
every
destination
in
the
tree
has
received
the
whole
data.
Since
the
destinations
in
$T_{k}$
are
a
subset
of
the
destinations
in
$T_{k-1}$,
each
tree
leaves the transfer in the reverse order of construction.
%
Let
{\it
target\_data}
denote
the
data
that
have
not
been
sent
in
the
current
trees
$T_{0}
\dots
T_{k}$.
The
{\it
target\_data}
are
partitioned
by
the
trees,
and
sent
in
parallel.
When
all
the
{\it
target\_data}
are
sent,
destinations
in
$T_{k}$
have
received
the
whole
data.
Thus,
$T_{k}$
leaves
from
the
transfer,
and
{\it
target\_data}
are
updated.
By
repeating
this
procedure,
every
node
finally
receives
the
entire
data.
Note
that
in
this
procedure,
some
nodes
need
to
stay
in
the
transfer
for
relaying
data
to
the
rest
of
the
nodes
even
after
having
received
the
whole
data.
For
example,
in
Figure~\ref{fig:pipelines}(c),
node
$D1$
and
$D2$
needs
to
continue
relaying
to
$D3$.
To
avoid
this,
the
tree
could
be
planned
again,
as
shown
in
Figure~\ref{fig:pipelines}(d).
%
However, even if we reconstruct the tree, the aggregate bandwidth of the remaining nodes is not improved.
\subsection{Stability
and
Optimality
for
Tree
Topology}\label{sec:proof}
We
show
that
the
algorithm
introduced
in
the
Section~\ref{sec:algorithm}
is
stable
and
optimal
in
the
sense
of
aggregate
bandwidth
if
the
network
topology
is
a
tree.
The
proof
consists
of
three
parts.
First,
we
show
that
we
can
treat
each
link
as
an
undirected
link
during
tree
construction.
After
that,
we
show
some
properties of
the
set
of
links
used
in
direct
transfers.
Finally,
we
prove
that
the
set
of
trees
delivers
to
a
node
the
maximum
amount
of
data
that
can
be
received
by
the
node.
\subsubsection{Preparation:
a
Property
from
Symmetric
Links}
Basically,
we
need
to
treat
each
link
as
a
directed
link
during
the
construction
of
trees.
The bandwidths of the two directed links of the same physical link are calculated separately, and each directed link is added to a tree separately.
However,
when
each
link
has
the
same
bandwidth
in
the
two
directions,
we
can
treat
the
two
directed
links
as
one
link,
whose
bandwidth
is
given
by
the
smaller
bandwidth
of
the
two
directional
links.
The reason is as follows.
We
have
a
tree
topology
that
contains
a
source
$s$
and
some
destinations.
A
bidirectional
link
$e$
in
the
network
has
two
subtrees
connected
to
both
ends.
Since
the
network
is
a
tree,
the
source
node
$s$
is
contained
in
either
of
the
two
subtrees.
Assume
that
the
subtree
$T$
does
not
contain
the
source
node
$s$.
The
link
$e$
can
be
split
into
two
directional
links:
one
heading
for
$T$
and
the
other
away
from
it.
Let
the
term
{\it
forward
link}
$e_{f}$
and
{\it
backward
link}
$e_{b}$
denote
the
former
and
the
latter
link,
respectively.
This
is
shown
in
Figure~\ref{fig:l_ab}(a).
In
a
tree
network,
the
removal
of
$e$
from
the
topology
disconnects
any
node
in
$T$
and
$s$.
Therefore,
if
$e_{f}$
is
removed,
we
can
create
no
route
from
$s$
to
any
node
in
$T$.
Here,
assume
the
backward
link
$e_{b}$
is
adopted
by
a
tree.
Since
the
tree
uses
$e_{b}$,
it
has
reached
one
of
the
destinations
in
$T$.
Without
$e_{f}$,
we
cannot
create
a
route
from
$s$
to
any
node
in
$T$.
Thus,
when
a
tree
contains
$e_{b}$,
it
also
employs
the
corresponding
forward
link
$e_{f}$.
Therefore, $B_{k}(e_{f}) - B_{k-1}(e_{f}) \le B_{k}(e_{b}) - B_{k-1}(e_{b})$ holds, where $B_{k}(e)$ denotes the available bandwidth used in the construction of the $(k)$th tree. Since the bandwidth of a link is symmetric at first, $B_{0}(e_{b}) = B_{0}(e_{f})$. As a result, $B_{k}(e_{f}) \le B_{k}(e_{b})$ follows.
Thus,
the
bottleneck
in
the
tree
always
lies
on
the
forward
link.
Consequently,
during
the
construction
of
the
trees,
we
only
need
to
use
the
bandwidth
of
the
forward
link.
\begin{figure}[tb]
\begin{center}
\begin{tabular}{cc}
\multicolumn{2}{c}{\includegraphics[width=9cm]{l_ab.eps}}\\
(a)
Forward/Backward
Link
&
(b)
An
example
of
$L(a,
b)$
\\
\end{tabular}
\end{center}
\caption{Links
In
a
Tree
Symmetric
Network}
\label{fig:l_ab}
\end{figure}
\subsubsection{Properties
for
Single
Transfer}
Since
the
topology
is
a
tree,
there
is
exactly
one
route
that
connects
two
nodes
$a$
and
$b$
using
the
minimum
number
of
links.
Let
$L(a,
b)$
denote
this
set
of
links.
This
is
shown
in
Figure~\ref{fig:l_ab}(b).
The
removal
of
any
link
in
$L(a,
b)$
from
the
tree
disconnects
$a$
and
$b$.
Thus,
any
route
from
$a$
to
$b$
contains
all
the
links
in
$L(a,
b)$.
The
transfer
throughput
is
maximized
in
a
route
that
uses
each
link
in
$L(a,
b)$
once.
Let
$v_k(a,
b)$
denote
this
throughput
when
the
bandwidth
is
$B_k$:
\begin{equation}
\label{equ:v}
v_{k}(a,
b)
=
\min_{e
\in
L(a,
b)}(B_{k}(e))
\end{equation}
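Equation (\ref{equ:v}) can be checked with a small sketch: in a tree, the route between two nodes is unique, so $v_{k}(a, b)$ is simply the minimum remaining bandwidth along that route. The topology and bandwidth values below are hypothetical illustrations.

```python
# Sketch of equation (v): v(a, b) is the minimum bandwidth over the unique
# route L(a, b) in a tree. Links are written frozenset({x, y}).
def path_links(adj, a, b, parent=None):
    """Return the unique route a..b in a tree as a list of links."""
    if a == b:
        return []
    for m in adj[a]:
        if m != parent:
            rest = path_links(adj, m, b, a)
            if rest is not None:
                return [frozenset((a, m))] + rest
    return None  # b is not in this subtree

def v(adj, bw, a, b):
    return min(bw[e] for e in path_links(adj, a, b))

adj = {'s': ['x'], 'x': ['s', 'd0', 'd1'], 'd0': ['x'], 'd1': ['x']}
bw = {frozenset(('s', 'x')): 5, frozenset(('x', 'd0')): 2,
      frozenset(('x', 'd1')): 5}
print(v(adj, bw, 's', 'd0'))  # -> 2
```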
With
a
set
of
links
$(L(a,
b)
\cup
L(b,
c))$,
we
can
construct
a
path
from
$a$
to
$c$
via
$b$.
Therefore,
the
following
relation
holds:
\begin{equation}
\label{equ:l_subset}
(L(a,
b)
\cup
L(b,
c))
\supset
L(a,
c)
\end{equation}
Assume
we
iteratively
construct
a
total
of
$n$
trees.
From
the
source
$s$,
a
tree
traces
some
destinations
in
a
depth-first
manner.
Let
$D_{k}$
denote
the
set
of
destinations
included
in
the
$(k)$th
depth-first
tree,
and
let
$P_{k}$
denote
the
set
of
links
used
in
the
$(k)$th
tree.
Using
the
above
notations,
the
following
equations
(\ref{equ:pl})
and
(\ref{equ:non0})
hold.
Because the graph is a tree, we can write $P_{k}$ as follows:
\begin{equation}
\label{equ:pl}
P_{k} = \bigcup_{d \in D_k} L(s, d).
\end{equation}
Because the algorithm traces each path from $s$, the following condition holds:
\begin{equation}
d
\not\in
D_k
\text{
if
and
only
if
}
v_k(s,d)=
0.
\label{equ:non0}
\end{equation}
\subsubsection{Proof
of
Stability
and
Optimality}
Let
$d_{k,
0}
\dots
d_{k,
n-1}$
denote
the
nodes
in
the
destination
set
$D_{k}$,
where
$v_{k}(s,
d_{k,
i})
\le
v_{k}(s,
d_{k,
i+1})$.
Let
$u_{k}$
denote
the
throughput
of
the
tree
$P_{k}$.
Since $v_{k}(s, d_{k, i}) \le v_{k}(s, d_{k, i+1})$, we get the following:
\begin{equation}
\label{equ:u_eq_v}
u_{k} = \min_{e \in P_{k}}(B_{k}(e)) = \min_{0 \le i < n}(v_{k}(s, d_{k, i})) = v_{k}(s, d_{k, 0}).
\end{equation}
The
following
equation
holds.
\begin{equation}
\label{equ:d_sub_d}
D_{k+1}
=
D_{k}
\setminus
A_{k},
\end{equation}
where
$A_k
=
\{d
\in
D_{k}|
v_k(s,d)=v_k(s,d_{k,0})
\}$.
\begin{proof}[Proof of (\ref{equ:d_sub_d})]
From the definition of $B_k$,
\[
B_{k+1}(e) =
\begin{cases}
B_{k}(e) - u_k & \text{if $e \in P_k$,}\\
B_{k}(e) & \text{otherwise.}
\end{cases}
\]
Thus, from (\ref{equ:v}) and (\ref{equ:non0}),
\begin{equation}
v_{k+1}(s,d) =
\begin{cases}
v_{k}(s,d) - u_k & \text{if $d \in D_k$,}\\
0 & \text{otherwise.}
\end{cases}
\label{equ:db}
\end{equation}
If
$d
\in
A_k$,
namely
$
v_k(s,d)
=
v_k(s,d_{k,0})$,
then
$v_{k+1}(s,d)
=
u_k
-
u_k
=
0$
from
(\ref{equ:u_eq_v}).
If
$d
\in
D_k
\setminus
A_k$,
namely
$
v_k(s,d)
>
v_k(s,d_{k,0})$,
then
$v_{k+1}(s,d)
=
v_k(s,d)
-
u_k
>
0$.
From
(\ref{equ:non0}),
$D_{k+1}
=
\{d
\in
D_0
|
v_{k+1}(s,d)\neq
0
\}
=
D_k
\setminus
A_k$.
\end{proof}
From
(\ref{equ:d_sub_d}),
a
destination
$d$
in
$A_{k}$
receives
data
from
the
$(0)$th
to
the
$(k)$th
trees.
Let $w(d)$ denote the total bandwidth $d$ receives from the trees, that is,
\begin{equation}
\label{equ:d_and_u}
\forall d \in A_{k}, \quad w(d) = \sum_{0 \le i \le k}{u_{i}}.
\end{equation}
From
(\ref{equ:db}),
the
following
holds:
\begin{eqnarray*}
u_{k} &=& v_{k}(s, d_{k, 0}) = v_{k-1}(s, d_{k, 0}) - u_{k-1} \\
&=& v_{0}(s, d_{k, 0}) - \sum_{0 \le i < k}{u_{i}}.
\end{eqnarray*}
Thus,
\begin{equation}
v_{0}(s,
d_{k,
0})
=
\sum_{0
\le
i
\le
k}{u_{i}}.
\label{equ:v_sum_u}
\end{equation}
From
(\ref{equ:d_and_u})
and
(\ref{equ:v_sum_u}),
we
get
the
following:
\begin{equation}
\forall d \in A_{k}, \quad w(d) = v_{0}(s, d_{k, 0}).
\end{equation}
This means that the trees deliver to $d$ the same amount of data as a direct transfer from $s$ to $d$ would under the condition that there is no other traffic. From (\ref{equ:v}), this is the maximum amount of data $d$ can receive.
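A small numeric instance (hypothetical bandwidths) makes the telescoping sums concrete: suppose three destinations have $v_{0}(s,d)$ equal to $400$, $700$, and $1000$. Then

```latex
\begin{align*}
u_0 &= v_0(s, d_{0,0}) = 400,              & w(d_{0,0}) &= u_0 = 400,\\
u_1 &= v_1(s, d_{1,0}) = 700 - 400 = 300,  & w(d_{1,0}) &= u_0 + u_1 = 700,\\
u_2 &= v_2(s, d_{2,0}) = 1000 - 700 = 300, & w(d_{2,0}) &= u_0 + u_1 + u_2 = 1000,
\end{align*}
```

so each destination's total $w(d)$ equals its exclusive-transfer bandwidth $v_{0}(s,d)$.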
From
the
definition
of
$A_k$
and
(\ref{equ:d_sub_d}),
every
destination
is
included
in
one
of
$A_{0},
\dots,
A_{n-1}$,
where $n$ is the number of trees.
Consequently, it is proved that our trees deliver to each node the maximum amount of data that it can receive; that is, they achieve the same bandwidth as exclusive direct transfers.
The broadcast is also stable because the bandwidth each node receives does not depend on the other destinations.
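This stability can be checked numerically. The sketch below uses a hypothetical star topology in which every destination hangs directly off the source (so each $L(s,d)$ is a single link), assuming machine and switch capacities are not bottlenecks; it compares what the original nodes receive before and after a slow node joins.

```python
# Numeric check of stability on a hypothetical star topology: every
# destination is linked directly to the source, so L(s, d) = {d's link}.

def broadcast_totals(bandwidths):
    """Run the iterative partial-tree construction and return the total
    bandwidth w(d) delivered to each destination."""
    bw = dict(bandwidths)
    total = {d: 0 for d in bw}
    while True:
        D = [d for d in bw if bw[d] > 0]   # destinations still reachable
        if not D:
            break
        u = min(bw[d] for d in D)          # throughput of this tree
        for d in D:
            bw[d] -= u
            total[d] += u
    return total

fast = {f"n{i}": 900 for i in range(9)}    # nine fast nodes (toy values)
w_before = broadcast_totals(fast)
w_after = broadcast_totals(dict(fast, slow=100))  # add one slow node

# Stability: adding the slow node leaves the fast nodes' bandwidth intact.
assert all(w_after[d] == w_before[d] for d in fast)
assert w_after['slow'] == 100
```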
\begin{figure*}[tb]
\begin{center}
\includegraphics[width=18cm]{sim.eps}
\end{center}
\caption{The
Relative
Bandwidth
to
the
{\it
FlatTree}
Algorithm
in
Simulations}
\label{fig:sim}
\end{figure*}
\begin{figure*}[tb]
\begin{center}
\includegraphics[width=13cm]{real.eps}
\end{center}
\caption{Real Machine Experiments}
\label{fig:real}
\end{figure*}
\section{Evaluation}
\label{sec:eva}
We
implemented
the
algorithm,
and
evaluated
it
in
both
a
simulation
and
a
real
environment.
Five algorithms were compared:
\begin{description}
\item[Ours]\
\\
This
is
the
algorithm
we
proposed.
Multiple
partial
trees
are
iteratively
constructed
in
a
depth-first
manner,
and
transfers
are
performed
in
parallel.
\item[Depth-First
(FPFR
\cite{fpfr}-like)]\
\\
The
algorithm
constructs
multiple
spanning
trees
in
a
depth-first
manner
\cite{balanced,
fpfr}.
Since the topology is a tree, only one tree is constructed.
\item[Dijkstra]\
\\
This algorithm iteratively builds a tree in a greedy manner: it repeatedly picks the unreached destination that can be reached with the maximum possible bandwidth from the already-reached nodes.
The
method
is
explained
in
\cite{fpfr},
and
a
similar
method
is
proposed
in
\cite{DBLP}.
\item[Random
Pipeline]\
\\
This
algorithm
randomly
creates
a
tree.
One
node
is
randomly
chosen
from
all
the
unreached
destinations
at
a
time.
For
each
condition,
we
generated
100
candidates
and
chose
the
candidate
with
the
largest
aggregate
bandwidth.
\item[Flat
Tree]\
\\
The
source
sends
the
data
to
all
the
destinations.
\end{description}
Except for {\it FlatTree}, every algorithm requires a bandwidth-annotated topology.
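As an illustration of the {\it Dijkstra} baseline above, the following sketch implements a widest-path greedy selection, which is our reading of the description; the graph, node names, and bandwidths are hypothetical, and this is not the paper's implementation.

```python
# Sketch of the greedy ("Dijkstra"-style) baseline: repeatedly attach the
# unreached destination reachable with the widest bottleneck bandwidth.
import heapq

def widest_path_order(links, source, dests):
    """links: {(a, b): bandwidth} for undirected edges.
    Returns {destination: bottleneck bandwidth when it was attached}."""
    adj = {}
    for (a, b), bw in links.items():
        adj.setdefault(a, []).append((b, bw))
        adj.setdefault(b, []).append((a, bw))
    best = {source: float('inf')}
    heap = [(-best[source], source)]        # max-heap via negated bandwidth
    attached = {}
    while heap and len(attached) < len(dests):
        neg, node = heapq.heappop(heap)
        if -neg < best.get(node, 0):
            continue                        # stale heap entry
        if node in dests and node not in attached:
            attached[node] = -neg           # attach at this bottleneck
        for nxt, bw in adj.get(node, []):
            cap = min(-neg, bw)             # bottleneck of the path via node
            if cap > best.get(nxt, 0):
                best[nxt] = cap
                heapq.heappush(heap, (-cap, nxt))
    return attached

order = widest_path_order(
    {('s', 'a'): 1000, ('a', 'b'): 500, ('s', 'c'): 300},
    's', {'a', 'b', 'c'})
print(order)
```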
\subsection{Simulation}
A
simulator
has
been
implemented
to
evaluate
broadcast
algorithms.
We
make
the
throughputs
of
the
machines
and
switches
large
enough
compared
to
the
link
bandwidths,
so
that
they
do
not
become
bottlenecks.
We
used
a
tree
topology
with
400
nodes,
taken
from
the
real
environment.
During
the
experiment,
three
different
bandwidth
distributions
were
tested:
\begin{description}
\item[Uniform
Random]\
\\
A
uniform
random
value
is
assigned
to
each
link
to examine the general behavior of these algorithms.
Both
{\it
low-variance}
(from
500
to
1000)
and
{\it
high-variance}
(from
100
to
1000)
conditions
are
tested.
\item[Mixed
Fast
and
Slow
Links]\
\\
While 80\% of the links have broad bandwidth (1000), the others have a narrow one (100). This models a situation in which some nodes have slower NICs or some switches are slow.
\item[Random
Inter-Cluster]\
\\
In this condition, inter-switch links are assigned random bandwidths (from 100 to 1000).
\end{description}
The simulation was performed 10 times for each combination of algorithm, bandwidth distribution, and number of destinations.
Both
symmetric
and
asymmetric
conditions
are
tested
for
each
bandwidth
distribution.
The
result
is
shown
in
Figure~\ref{fig:sim}.
The vertical axis shows the aggregate bandwidth relative to that of the {\it FlatTree} algorithm.
Our algorithm took only 2 milliseconds to compute the schedule.
Our
method
achieved
the
best
performance
in
every
symmetric
condition.
As
shown
in
Figure~\ref{fig:sim}(c),
the
improvement
is
especially
notable
when
fast
and
slow
links
are
mixed.
In this case, the performance of the other algorithms drops because they lack stability.
In
an
asymmetric
network,
the
performance
of
our
algorithm
is
the
best
except
for
one
case,
Figure~\ref{fig:sim}(d).
Even
in
this
case,
the
difference
is
only
$3\%$.
As
a
result,
the
superiority
of
our
algorithm
is
confirmed.
We also performed an experiment that simulates multiple-source broadcasts; the results are shown in Figure~\ref{fig:sim}(i).
As we varied the number of destinations from $10$ to $50$, the number of sources was also varied from $1$ to $5$.
Figure~\ref{fig:sim}(i) shows that the performance relative to {\it FlatTree} is unchanged in the multi-source case.
\subsection{Real
Machine
Experiment}
We
also
performed
some
experiments
using
a
real
machine
environment.
This
environment
had
the
tree
topology
shown
in
Figure~\ref{fig:top},
with
105
nodes
in
4
clusters.
The bandwidths of the links are also shown in Figure~\ref{fig:top}.
We
performed
broadcasts
among
10,
47
and
105
nodes.
Table~\ref{tab:realnodes}
shows
the
number
of
nodes
from
each
cluster.
Each
condition
was
tested
four
times,
and
the
best
aggregate
bandwidth
was
taken.
\begin{table}
\begin{center}
\caption{Number
of
Nodes
Used
in
the
Experiments}
\label{tab:realnodes}
\begin{tabular}{|c||c|c|c|c|}
\hline
Number
of
Destinations
&
A
&
B
&
C
&
D
\\
\hline
\hline
105
Nodes
&
59
&
9
&
35
&
2
\\
\hline
47
Nodes
&
30
&
1
&
16
&
1
\\
\hline
11
Nodes
&
6
&
1
&
3
&
1
\\
\hline
\end{tabular}
\end{center}
\end{table}
Figure~\ref{fig:real}
illustrates
the
results.
Since the environment is heterogeneous, our algorithm improved the performance significantly.
The aggregate bandwidth increased by $2.1$ to $2.6$ times compared to the best of the other algorithms.
However,
the
bandwidth
was
$30\%$
to
$45\%$
worse
than
the
optimal
value
predicted
by
simulation.
An investigation revealed that our assumption that each link is bidirectional, and thus has an independent (full-duplex) bandwidth in each direction, was subtly violated because transferring data in both directions saturated the CPUs. While each node could send or receive at 900Mbps independently, it could only sustain 750Mbps in each direction when sending and receiving simultaneously. In the simulation, the link adjoining such a node was modeled as having 900Mbps in each direction, but in reality each direction only had 750Mbps when the node was used to relay data.
\begin{figure}[tb]
\begin{center}
\includegraphics[width=8.5cm]{top.eps}
\end{center}
\caption{The
Real
Environment
Topology}
\label{fig:top}
\end{figure}
To
demonstrate
the
stability
of
our
broadcast,
another
test
was
performed.
In
this
experiment,
a
node
with
small
bandwidth
joined
a
9-node
broadcast.
Figure~\ref{fig:real}(d) shows the change in the aggregate bandwidth of the original 9 nodes for the four broadcast algorithms.
While the aggregate bandwidth dropped to $52\%$ in {\it Depth-First}, which works the same as {\it FPFR} for a tree, the deterioration in our algorithm was only $2.5\%$.
Although the drops in the other two algorithms were also small, our algorithm yielded 1.5 times their aggregate bandwidth.
\section{Conclusion}
\label{sec:conclusion}%
In
this
paper,
we
introduced
the
notion
of
{\it
stability}
in
broadcasts,
and
proposed
a
simple
and
efficient
broadcast
for
heterogeneous
environments.
Our broadcast improves on a previously proposed class of broadcast algorithms that includes FPFR
and
Balanced
Multicasting,
focusing
on
the
fact
that
each
node
should
receive
data
as
fast
as
possible
and
start
computation
without
waiting
for
other
nodes.
Like
FPFR
and
Balanced
Multicasting,
our
broadcast
achieves
high
bandwidth
by
forwarding
data
along
multiple
spanning
trees.
In
addition,
our
broadcast
avoids
the
effect
of
nodes
with
narrow
bandwidth
by
forwarding
data
along
multiple
partial
trees.
While
the
slowest
node
only
receives
data
from
the
spanning
trees,
faster
nodes
also
receive
data
from
the
partial
trees.
For
general
graphs,
our
broadcast
will
always
outperform
FPFR
and
Balanced
Multicasting,
and
for
trees,
we
proved
that
it
is
{\it
stable}
as
well
as
{\it
optimal}.
In
a
real
environment
with
100
machines
in
4
clusters,
our
scheme
achieved
$2.1$
to
$2.6$
times the aggregate bandwidth of the best of the other algorithms.
We
also
demonstrated
the
stability
by
adding
a
slow
node
to
a
broadcast.
While
the
aggregate
bandwidth
dropped
to
$52\%$
in
{\it
depth-first},
which
works
the
same
as
{\it
FPFR}
for
a
tree,
the
deterioration
in
our
algorithm
was
only
$2.5\%$.
Simulations also showed that our algorithm performs well under many bandwidth distributions.
We proposed a stable broadcast algorithm under the assumption of a tree topology with symmetric links.
As we have seen in section \ref{sec:algorithm}, our algorithm has two key points.
First,
we
use
the
bandwidth-annotated
topology
information
to
avoid
self-induced
congestion.
If
the
tree
is
constructed
without
considering
the
network
topology,
the
broadcast
performance
is
possibly
limited
because
transfers
for
the
same
data
use
the
same
link
many
times.
By
constructing
a
pipeline
in
a
depth-first
manner,
a
link
is
used
only
once
for
the
same
kind
of
data.
As
a
result,
the
transfer
performance
is
improved.
Second, we use multiple partial pipelines in parallel to eliminate the effect of nodes with narrow bandwidth.
While the node connected to the source with the narrowest bandwidth receives data from only one pipeline, more pipelines deliver data to nodes connected over links with broader bandwidth. Therefore, each node receives as much data as in a single exclusive transfer from the source to that node.
The
stability
of
our
algorithm
is
proved
in
section
\ref{sec:proof}.
Our algorithm was tested both in simulations and in a real network environment. The proposed scheme achieved the best performance and remained stable when nodes with narrow bandwidth were added.
In an environment with large bandwidth variance, our scheme yielded twice the aggregate bandwidth of a single depth-first pipeline.
Even
in
an
asymmetric
environment,
our
algorithm
outperformed
all
other
methods
we
have
tested.
We have also performed an experiment in a real environment to demonstrate its practicality.
As future work, we plan to devise a stable algorithm for an arbitrary graph network, not only for a tree.
Since the algorithm uses the bandwidth of each link, we also plan to develop a fast algorithm to measure the bandwidths of all the links in the network.
%-------------------------------------------------------------------------
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,reference}
\end{document}