\documentclass[conference]{IEEEtran}
%\usepackage{latex8}
%\usepackage{times}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{algorithm,algorithmic}

%\ifCLASSINFOpdf


% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}


\begin{document}
\title{A Stable Broadcast Algorithm}
\author{
\IEEEauthorblockN{Kei Takahashi, Hideo Saito, Takeshi Shibata and Kenjiro Taura}
\IEEEauthorblockA{The University of Tokyo\\
Email: \{kay, h\_saito, shibata, tau\}@}}

\maketitle

% \begin{abstract} %OK
% In many data-intensive applications, each node can start processing as soon as it has received required data.
% Thus, each node is desired to receive the data in the broadest-possible bandwidth.
% However, the bandwidth each node receives is possibly degraded in most broadcast algorithms compared to an exclusive direct transfer from the source.
% We say a broadcast is {\em stable} when the bandwidth of some nodes is never diminished by adding other nodes.
% Many cluster environments have a tree network topology with links that have the same bandwidth in the both directions.
% Under this assumption, we propose a stable broadcast algorithm that uses multiple partial pipelines. Pipelines are iteratively constructed in a depth-first manner, and each pipeline connects some of the destinations.
% We proved that our broadcast algorithm delivers the same amount of data to each node as when an exclusive data transfer is performed to it. As a result, the aggregate bandwidth, the cumulative sum of data that receiving by all nodes in a unit time, is maximized. %better describes the performance of a long-message broadcast,
% Our simulation has shown that our algorithm achieves the best performance and is stable for adding nodes with narrow bandwidth. In a large bandwidth variance environment, our scheme yielded twice the aggregate bandwidth compared to a single depth-first pipeline. We have also performed a experiment on a real environment to demonstrate its practicality.
% \end{abstract}

\begin{abstract} %OK
Distributing large data to many nodes, known as a broadcast or a multicast, is an important operation in parallel and distributed computing.
Most previous broadcast algorithms explicitly or implicitly try to deliver data to all nodes at the same rate.
While this is reasonable for homogeneous environments where all nodes have similar receiving capabilities, such algorithms may suffer from slowly receiving nodes in heterogeneous settings.
In such settings, each node should receive data at its largest possible bandwidth and start computation as soon as it has the data.
In this paper, we say a broadcast is {\em stable\/} when the bandwidth to a node is never sacrificed by the presence of other, possibly slow, receiving nodes, and we propose stability as a desired property of broadcast algorithms.
In addition, we present a simple and efficient stable broadcast algorithm for the case where the topology among nodes is a tree and each link has a symmetric bandwidth.
This work improves upon previously proposed algorithms such as FPFR and Balanced Multicasting: for general graphs, it outperforms them when the network is heterogeneous, and for trees, it is stable and optimal.
Our simulation confirmed that our algorithm has the desired properties; in heterogeneous environments, our scheme yielded twice the aggregate bandwidth of a simpler algorithm using a single pipeline.
We have also performed an experiment in a real environment to demonstrate its practicality.
\end{abstract}

\IEEEpeerreviewmaketitle


\section{Introduction} \label{sec:background}

There are growing demands to effectively execute data intensive
applications in distributed, wide-area environments. Since data
transfers occupy a considerable part of the total processing time in
the execution of such applications, efficient data transfer techniques
are required.

One of the most common, practical transfer problems is broadcast. In
a broadcast, one or more {\it source} nodes have some data that need
to be copied to some {\it destination} nodes. Various algorithms have
been proposed to optimize broadcasts according to various criteria,
such as completion time and aggregate bandwidth (the sum of the
bandwidths of all
nodes)~\cite{fastreplica,balanced,fpfr,mpichg2,bittorrent,mob}.

Such existing broadcast algorithms have a few shortcomings, especially
for data intensive applications executed in distributed, wide-area
environments. One is that they are not {\it stable}, in that the
addition of a new node to an existing broadcast can potentially lower
the bandwidth of a node already participating in the broadcast. This
complicates task scheduling, because earlier scheduling decisions may
have depended on a particular node receiving data at a particular
rate. Another shortcoming of existing broadcast algorithms is that
they are not {\it optimal} in terms of aggregate bandwidth. Even
though many existing algorithms do optimize for aggregate bandwidth,
they do so under the assumption that all nodes have the same incoming
bandwidth. As a result, they do not explore the possibility of
different nodes receiving data at different rates, missing the
opportunity of having nodes that receive data early start computation
without waiting for other nodes. In wide-area environments where link
speeds can vary a great amount, it is especially important to prevent
nodes with slow links from slowing down nodes with faster links.

In this paper, we propose a broadcast algorithm that uses
bandwidth-annotated topology information to optimize for aggregate
bandwidth. Our algorithm improves an existing algorithm called
FPFR~\cite{fpfr}, and performs broadcasts with higher aggregate
bandwidth than FPFR for any graph topology. Moreover, for tree
topologies, we prove that our algorithm satisfies the following two
properties:

\begin{itemize}
\item {\it Stability:} Adding a node to an existing broadcast does not
lower the incoming bandwidth of any node already participating in the
broadcast.
\item {\it Optimality:} A broadcast performed using our algorithm
achieves the maximum possible aggregate incoming bandwidth.
\end{itemize}

While our algorithm works well with any graph topology, we use
tree topologies for simplicity in the exposition.
To evaluate our algorithm, we used
the topology inference tool developed by Shirai et al.~\cite{shirai} to
obtain a tree topology, and performed both simulations and
real-machine experiments.

The rest of this paper is organized as follows. Section~\ref{sec:broadcast} reviews previous broadcast techniques.
Our algorithm, together with proofs of its optimality and stability, is presented
in Section~\ref{sec:algorithm}. Section~\ref{sec:eva} describes the
experiments and evaluation of our algorithm in a real environment,
and we conclude the paper in Section~\ref{sec:conclusion}.

\section{Related Work}\label{sec:broadcast}

In this section, we review existing broadcast techniques.
Although much research has been done to improve aggregate bandwidth,
none of it has been evaluated in terms of stability.
%Many of them uses bandwidth-annotated topology.

\subsection{Topology-unaware Broadcast}
% flat tree
The simplest broadcast algorithm is {\it flat-tree}, in which the source
node directly sends data to all the destinations. Since this algorithm
performs poorly because of the bottleneck at the source, more
sophisticated broadcast algorithms have been devised.

% Fast Replica
In {\it FastReplica}~\cite{fastreplica}, the source splits the data into
small pieces and sends each piece to a different destination. The
destinations then form a ring connection, exchange pieces, and
finally obtain the whole file. In this method, the broadcast
bandwidth is not restricted by the source. However, since the method
does not consider the network topology, congestion may be induced
by using the same link many times: if a certain link is
shared by $N$ transfers, the bandwidth of each is reduced to $1/N$. In addition,
since a node connected by a narrow link relays data to other nodes,
it limits the performance of all the other nodes.

% BitTorrent
{\it BitTorrent}~\cite{bittorrent, sched_bt} and related work
\cite{mob} adaptively improve the transfer graph by changing the
relaying network. Data are divided into small pieces, and each
node is allowed to receive them in an arbitrary order. Each node
periodically changes its parent, and nodes with broad links are more
likely to relay data; as a result, the aggregate bandwidth
increases. However, these algorithms cannot always reach a schedule
that avoids link sharing. In addition, it takes time to reach a
better schedule, since these methods improve the transfer tree by
trial and error.


\subsection{Topology-aware Broadcast}
To obtain a good schedule from the beginning, some methods
that use network topology information have been proposed.
% Mpich
Karonis et al. proposed to use a hierarchical tree that describes the
real network topology~\cite{mpichg2}. A broadcast is performed by
tracing the tree: each node relays data to all its children. Since a
link is used only once for the same data, self-induced
congestion is avoided. However, since the tree has many branches, the aggregate
bandwidth decreases: if a node has $N$ children, the bandwidth of its
lower-level nodes is limited to at most $1/N$ of the node's own incoming bandwidth.
%Since it does not consider the bandwidth of each link, for a long message, bandwidth matters,
%in terms of RTT is placed in near place, and a broadcast is performed using that tree.


% Pipeline
In a pipeline transfer~\cite{shirai, DBLP}, each node relays data to
only one other node. A pipeline is obtained by tracing the topology
in a depth-first manner. In this pipeline, each directional link is used
only once, so the throughput is limited by the narrowest link in it.
%If the bandwidth of every link is the same, every node can receive the same amount of data as the link bandwidth.

%In a tree symmetric network, the depth-first pipeline is stable for the slowest node, that means, the slowest node always receives the same amount of data.
%The aggregate bandwidth is improved by dwindling the pipeline

\begin{figure}[tb]
\begin{center}
\includegraphics[width=9cm]{spanning.eps}\\
\end{center}
\caption{Effect of Adding Nodes with Narrow Bandwidth}
\label{fig:spanning}
\end{figure}

However, the depth-first pipeline cannot achieve good aggregate
bandwidth when some nodes lack bandwidth. Figure
\ref{fig:spanning} illustrates a situation in which a new node $D_{3}$
with small bandwidth joins the broadcast: one source $S$ holds the
original data, and three destinations $D_{0}, D_{1}, D_{2}$ need it. The label on
each link shows its bidirectional bandwidth. Originally, the aggregate
bandwidth of $D_{0}$, $D_{1}$ and $D_{2}$ is $15$; in the new
pipeline, however, the aggregate bandwidth drops to $6$.
This deterioration can be avoided by using multiple
transfers concurrently: with the {\it Multi-stream Pipeline} in
Figure~\ref{fig:spanning}, the aggregate bandwidth improves to
$16$.
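The arithmetic behind such an improvement can be sketched in a few lines of Python. The capacities below are hypothetical illustration values (not those of Figure~\ref{fig:spanning}), and we assume the links near the source are wide enough to carry both streams concurrently.

```python
# Hypothetical per-destination bottleneck capacities to the source;
# D3 is the slow newcomer (illustration values only).
caps = {"D0": 5, "D1": 5, "D2": 5, "D3": 1}

# A single depth-first pipeline threads every destination, so each
# node is throttled to the narrowest link in the pipeline.
single_rate = min(caps.values())
single_aggregate = single_rate * len(caps)

# Multi-stream: keep the full pipeline at the slow rate, and run a
# second, partial pipeline over the fast nodes' leftover capacity.
fast = [d for d in caps if caps[d] > single_rate]
partial_rate = min(caps[d] for d in fast) - single_rate
multi_aggregate = single_aggregate + partial_rate * len(fast)
```

With these numbers, the single pipeline yields an aggregate bandwidth of $4$, while the two concurrent streams yield $16$.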

%\begin{figure}[tb]
%\begin{center}
%\includegraphics[width=8cm]{flat_pipe.eps}\\
%\end{center}
%\caption{Flat tree and pipeline broadcast}
%\label{fig:flat_pipe}
%\end{figure}
%
%In the {\it Flat tree} schedule, the link near the source stalls, and the aggregate bandwidth is limited to 10. Although the aggregate bandwidth is slightly improved to 12 in the {\it Random Pipeline}, which connects the source and destination in a random order, each node receives much smaller amount of data than in a execlusive direct transfer. This is because transfers of the same data go back and forth the same link many times.

\subsection{Multistream Broadcast}

% FPFR
The Fast Parallel File Replication (FPFR) tool~\cite{fpfr} uses
bandwidth-annotated network topology information to perform efficient
broadcasts. It iteratively constructs multiple spanning trees, and
uses them concurrently. By utilizing available link bandwidth more
effectively, FPFR achieves higher aggregate bandwidth than methods
that only use a single spanning tree.

Let the network topology and the available bandwidth of each link be given
in advance. The first tree is built from the original bandwidth
values. The authors tried several ways to construct spanning trees, and the
{\it depth-first} method, which traces the destinations from the
source in a depth-first manner, achieved the best performance.
%: {\it depth-first}, {\it breadth-first}and {\it Dijkstra}.
After the first tree is formed, its throughput is
subtracted from the bandwidth of every link it uses. Subsequent
trees are repeatedly built with the reduced bandwidth values until
no spanning tree can be created.
%The details to build a tree is as follows.
%The {\it depth-first} method traces the destinations from the source in a depth-first manner.
%The only difference in {\it breadth-first} method is that destinations are traced in breadth-first style.
%In the {\it Dijkstra} method, the source and destinations already connected to the source form
%the `` source set''. A destination that can be reached from the source set in the widest bandwidth
%are picked and put into the source set. This operation is continued until every destination is
%connected to the source.
These multiple spanning trees are used concurrently with the planned
bandwidths. Each tree sends different pieces of the file, and finally
every destination receives the whole file.

FPFR achieves better performance than methods that use
only one tree. {\it Balanced Multicasting}~\cite{balanced}
further improves on FPFR by using linear programming to maximize the
aggregate bandwidth. However, both share a problem with
stability. Since FPFR only uses spanning trees, a node with narrow
bandwidth limits the aggregate bandwidth of the other nodes.
In the case of a tree topology, in particular, FPFR outputs only one pipeline;
since the throughput of a pipeline is limited by its narrowest link,
stability is not achieved.
%seriously undermined.


\def \tapa#1{\langle {#1} \rangle}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Algorithm} \label{sec:algorithm}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Basic Idea}

Like many broadcast algorithms introduced in Section~\ref{sec:broadcast},
our algorithm takes as input a bandwidth-annotated network topology, a single source node, and multiple destination nodes, and generates a set of {\em transfer trees\/}, each of which is labeled with its throughput (the bandwidth it consumes).
% \begin{enumerate}
% \item The network topology and the bandwidth of each link are known. The network contains nodes as its leaf nodes and switches as its internal nodes.
% \item Each link is bidirectional and has a symmetric bandwidth in both directions.
% \item There is only one source node.
% \end{enumerate}
In this model, a set of transfer trees is said to be {\em feasible\/} when the total throughput used on each link is within its capacity.
In practical terms, we assume that each switch has a sufficiently strong backplane, so the transfer rate is limited only by the capacities of hosts or individual switch ports, not by internal switching capacities.

% These conditions are satisfied in many multi-cluster environment, on which data-intensive applications are executed.
% Under these conditions,
A broadcast algorithm is stable when adding more destination nodes never decreases the bandwidth delivered to the original nodes.
Formally, for any pair of destination sets $D$ and $E$ such that $D \subset E$, and for any node $n \in D$, the bandwidth to $n$ generated with destinations $E$ is never smaller than that generated with destinations $D$.


% the planed bandwidth for each destination node never decrease if some destinations are added and the algorithm executed again.

We propose a broadcast algorithm that achieves higher aggregate bandwidth than FPFR and Balanced Multicasting.
Moreover, the algorithm is shown to be stable and optimal with respect to aggregate bandwidth when the network satisfies the following conditions.
\begin{itemize}
\item The topology is a tree.
\item Each link is bidirectional and has a symmetric capacity in both directions.
\end{itemize}

The basic idea of the algorithm is similar to FPFR and Balanced Multicasting, in that it builds multiple transfer trees and sends data fragments through them.
The main difference is that while FPFR stops building transfer trees as soon as the narrowest link saturates, our algorithm continues to build {\em partial trees\/} that involve only a subset of the destinations.
Such trees play an important role in guaranteeing that our algorithm is stable, i.e., that nodes connected to the source with higher bandwidth receive data at their highest bandwidths.
Transfer scheduling becomes slightly more complex than in FPFR, to guarantee that all nodes receive all data even though some transfer trees contain only a subset of the destinations.
Details are shown in Section~\ref{sec:transferusing}.
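As an illustration of the stability property, the following Python sketch runs a simplified version of the tree construction described in this section on an invented undirected tree topology (all node names and capacities are hypothetical), and checks that adding a slow destination leaves the planned bandwidths of the original destinations unchanged.

```python
def tree_path(adj, src, dst, parent=None):
    """Return the unique list of undirected links from src to dst in a tree."""
    if src == dst:
        return []
    for nxt in adj[src]:
        if nxt != parent:
            rest = tree_path(adj, nxt, dst, src)
            if rest is not None:
                return [frozenset((src, nxt))] + rest
    return None

def plan(adj, cap, src, dests):
    """Repeatedly build partial trees over the reachable destinations
    and return the planned bandwidth of each destination."""
    cap = dict(cap)                      # work on a copy
    bw = {d: 0 for d in dests}
    while True:
        reached = [d for d in dests
                   if all(cap[e] > 0 for e in tree_path(adj, src, d))]
        if not reached:
            return bw
        links = set().union(*(tree_path(adj, src, d) for d in reached))
        u = min(cap[e] for e in links)   # tree throughput = narrowest link
        for d in reached:
            bw[d] += u
        for e in links:
            cap[e] -= u                  # reserve the tree's bandwidth

# Invented topology: S - sw1 - sw2, with hosts hanging off the switches.
adj = {"S": ["sw1"], "sw1": ["S", "D1", "sw2"],
       "sw2": ["sw1", "D2", "D3"],
       "D1": ["sw1"], "D2": ["sw2"], "D3": ["sw2"]}
cap = {frozenset(("S", "sw1")): 5, frozenset(("sw1", "D1")): 5,
       frozenset(("sw1", "sw2")): 3, frozenset(("sw2", "D2")): 3,
       frozenset(("sw2", "D3")): 2}

small = plan(adj, cap, "S", ["D1", "D2"])        # without the slow node
big = plan(adj, cap, "S", ["D1", "D2", "D3"])    # with the slow node D3
```

In this example, $D_1$ and $D_2$ keep their planned bandwidths (5 and 3) after the slow node $D_3$ joins, which is exactly the stability condition above.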
% Performance of FPFR may be limited by nodes with narrow bandwidths.
% This is fundamentally because it uses only trees involving all destinations to transfer data.
% If we assume that the data transferred in different pipelines do not overlap,
% all pipelines are required to be created as spanning trees.
% However, if we introduce BitTorrerant-like transfer policy, partial trees that do not connect every destination can be used.
% Based on this idea, we continue to build partial trees even no spanning tree can be constructed.
% The idea is also applicable to the Balanced Multicasting algorithm, and improves it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{algorithm}
\caption{Constructing Transfer Trees } \label{alg}
\begin{algorithmic}
\REQUIRE $\mathcal{T}$ is a network topology. $B_0(e)$ is the bandwidth for link $e$. $s$ is the source node and $D_0$ is a set of destinations.
\STATE $B := B_0$; $T:= \varnothing$;
\WHILE{TRUE}
\STATE $t:=$ the tree obtained by tracing destinations in a
depth-first manner from $s$;
\IF{ $t$ contains no destination }
\STATE \textbf{break}
\ENDIF
\STATE $u:=$ the bandwidth of the narrowest link in $t$;
\STATE $T:= T \cup \{ \tapa{t,u} \}$;
\FOR{ each link $e$ in $t$ }
\STATE $B(e):=B(e)-u$;
\ENDFOR
\ENDWHILE
\RETURN $T$
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Sending Data via Transfer Trees} \label{algtrans}
\begin{algorithmic}
\REQUIRE $t_1, \ldots , t_n$ are the transfer trees returned by the construction algorithm ($t_i$ is made by the $i$-th iteration). $u_i$ is the throughput of $t_i$. $D$ is the data to broadcast.
\FOR{$k:=n$ downto 1}
\STATE $R:=$ data nodes in $t_k$ have not yet received;
\STATE send $R$ through $t_1, \ldots , t_k$, allocating to $t_i$ the amount of data proportional to $u_i$;
\ENDFOR
\end{algorithmic}
\end{algorithm}

\subsection{Constructing Transfer Trees}

The algorithm to create transfer trees is shown in Algorithm~\ref{alg}.
The network topology $\mathcal{T}$ is given in advance
as a directed graph, whose leaf nodes are computation
nodes and whose intermediate nodes are switches.
Each link is bidirectional and is modeled as two separate edges of the graph.
In addition, the bandwidth (capacity) $B_0(e)$ of each link $e$ is given.
% $p$ is called \emph{pipeline} iff $p$ is a single path from the source
% to some destinations, where no directed link overlaps and the bandwidth of every link in $p$ is more than $0$.

%We call the first bandwidth on each link {\it original bandwidth}.

%With this bandwidth-annotated topology, the first pipeline is
%constructed.
Like FPFR, the algorithm repeatedly builds transfer trees.
The first tree is constructed by visiting all destinations from the source node in a depth-first manner; as a result, it connects the source and every destination.
The throughput of this tree is set to the capacity of the narrowest link in it.

\begin{figure}%[tb]
\begin{center}
\includegraphics[width=9cm]{pipelines.eps}\\
\end{center}
\caption{The Algorithm to Build Transfer Trees}
\label{fig:pipelines}
\end{figure}

\begin{figure}%[tb]
\begin{center}
\includegraphics[width=9cm]{exe.eps}\\
\end{center}
\caption{Execution of the Broadcast}
\label{fig:exe}
\end{figure}

After the first tree is obtained, the bandwidth of each link available for subsequent trees is calculated by subtracting the throughput allocated to the first tree.
Edges whose remaining capacities become zero are removed; note that at least one edge is removed.
The second tree is constructed over this reduced bandwidth map using the same depth-first traversal.
Note, however, that this time it may not be possible to reach every destination from the source, since some edges have been removed.
If that happens, we connect only the destinations reachable from the source.
In particular, if the network is a tree, at least one destination becomes unreachable.
The throughput of the second tree is again set to the capacity of its narrowest link, which is then subtracted from each link used by the tree.
We repeat this until no destination is reachable from the source.
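For the undirected tree model, the construction above can be sketched compactly in Python; the topology, capacities, and node names below are invented for illustration and are not those of Figure~\ref{fig:pipelines}. Since the topology is a tree, the $k$-th transfer tree is simply the union of the unique paths from the source to the destinations still reachable over links with remaining capacity.

```python
def tree_path(adj, src, dst, parent=None):
    """Return the unique list of undirected links from src to dst in a tree."""
    if src == dst:
        return []
    for nxt in adj[src]:
        if nxt != parent:
            rest = tree_path(adj, nxt, dst, src)
            if rest is not None:
                return [frozenset((src, nxt))] + rest
    return None

def build_trees(adj, cap, src, dests):
    """Sketch of Algorithm 1: return a list of (destinations, throughput)."""
    cap = dict(cap)                      # remaining capacity per link
    trees = []
    while True:
        reached = [d for d in dests
                   if all(cap[e] > 0 for e in tree_path(adj, src, d))]
        if not reached:
            return trees
        links = set().union(*(tree_path(adj, src, d) for d in reached))
        u = min(cap[e] for e in links)   # narrowest link sets the throughput
        trees.append((reached, u))
        for e in links:
            cap[e] -= u                  # subtract the allocated bandwidth

# Invented example topology (not the one in the figure).
adj = {"S": ["sw1"], "sw1": ["S", "D1", "sw2"],
       "sw2": ["sw1", "D2", "D3"],
       "D1": ["sw1"], "D2": ["sw2"], "D3": ["sw2"]}
cap = {frozenset(("S", "sw1")): 5, frozenset(("sw1", "D1")): 5,
       frozenset(("sw1", "sw2")): 3, frozenset(("sw2", "D2")): 3,
       frozenset(("sw2", "D3")): 2}
trees = build_trees(adj, cap, "S", ["D1", "D2", "D3"])
```

Here three trees are produced: a full tree of throughput 2, a partial tree over $D_1, D_2$ of throughput 1, and a final tree over $D_1$ alone of throughput 2, so each destination is planned its full bottleneck bandwidth.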

Figure~\ref{fig:pipelines} illustrates the process of building transfer trees.
One source $S$ and destinations $D_{0} \dots D_{4}$ are engaged in this broadcast.
The first tree connects every destination $D_{0} \dots D_{4}$; its throughput is 3, limited by the link near $D_{0}$.
The second tree is built after subtracting 3 from each link used by the first tree; it connects $D_{1} \dots D_{4}$ with a throughput of 2.
After the construction of the third and the fourth trees, the remaining bandwidth map does not allow any further tree to be constructed from the source.
Consequently, four trees are obtained.

This algorithm can also be applied to multi-source broadcasts. The algorithm shown above is performed for each source in turn: after the trees originating from one source are constructed, the bandwidth used by those trees is subtracted, and trees starting from the next source are constructed with the remaining bandwidth.



\subsection{Transfer Using Multiple Trees} \label{sec:transferusing}

Once the transfer trees have been constructed, we send data fragments using them.
To achieve the claimed stability and optimality, all trees must be effectively utilized.
Algorithm~\ref{algtrans} shows how the transfer is done.

Data are sent in multiple {\em stages\/}, with each stage using a different set of transfer trees.
Let $t_i$ be the transfer tree made by the $i$-th iteration of the construction algorithm ($i = 1, \ldots , n$) and denote by $N(t)$ the set of destinations involved in tree $t$.
Note that we have $N(t_n) \subset N(t_{n-1}) \subset \cdots \subset N(t_1)$.

In the first stage, {\em the entire data\/} is partitioned into fragments and sent using {\em all\/} transfer trees.
Data must be allocated to the trees so that they finish the transfer at the same time.
In principle, this can be achieved by allocating to each tree an amount of data proportional to its throughput.
In the actual implementation, we achieve this by dividing the data into small chunks and allocating chunks to trees dynamically.
Each transfer edge is implemented by a TCP connection, and we send a chunk via a connection only when the connection is writable (i.e., writing to it does not block).
At the end of the first stage, the nodes in $N(t_n)$ will have obtained the entire data and need not receive any more.
The other nodes have only part of the data; the rest will be delivered in later stages.

In the second stage, the data that the nodes in $N(t_{n-1})$ missed in the first stage (i.e., the data sent through $t_n$) is sent using $t_1, \ldots , t_{n-1}$, with the same policy for allocating data to individual trees.
The second stage thus delivers all data to the nodes involved in $t_{n-1}$.
Similarly, in the third stage, the data that the nodes in $N(t_{n-2})$ have not yet received is sent via $t_1, \ldots , t_{n-2}$, so that they have all data at the end of the stage.
We repeat this until all destination nodes have received all data.
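The staged schedule can be checked with simple bookkeeping. The sketch below simulates Algorithm~\ref{algtrans} on a hypothetical set of three nested trees (destination sets and throughputs are invented), tracking how much data each node has received; exact fractions avoid rounding artifacts.

```python
from fractions import Fraction

def run_stages(trees, data_size):
    """Simulate the staged transfer: trees is a list of
    (destinations, throughput) pairs in construction order t_1..t_n,
    with nested destination sets. Returns the payload of each stage."""
    n = len(trees)
    size = Fraction(data_size)
    received = {d: Fraction(0) for d in trees[0][0]}
    payloads = []
    for k in range(n, 0, -1):               # stage using t_1 .. t_k
        active = trees[:k]
        # every still-incomplete node of t_k misses the same amount
        missing = size - min(received[d] for d in active[-1][0])
        payloads.append(missing)
        total_u = sum(u for _, u in active)
        for dests, u in active:
            share = missing * u / total_u   # allocation proportional to u_i
            for d in dests:
                # already-complete nodes only relay; cap their account
                received[d] = min(size, received[d] + share)
    assert all(r == size for r in received.values())
    return payloads

# Hypothetical nested trees, e.g. as produced by the construction step.
trees = [(["D1", "D2", "D3"], 2), (["D1", "D2"], 1), (["D1"], 2)]
payloads = run_stages(trees, 10)
```

With these trees, the first stage sends the whole 10 units through all three trees, after which $D_1$ is complete; the second stage resends the 4 units that went through $t_3$, and the last stage resends the $10/3$ units that $D_3$ is still missing.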

Figure~\ref{fig:exe}(a) illustrates how the transfers are performed with the trees constructed in the example of Figure~\ref{fig:pipelines}.
Consider the node $D_{3}$, which is connected to the source with bandwidth 5:
it receives data in the first three stages (through $t_1$ and $t_2$) at its maximum bandwidth of 5.
% It is involved in the first two trees ($t_1$ and $t_2$). In the first and the second stages, receives the same amount of data as in {\it Exclusive Direct Transfer}, which is shown in Figure \ref{fig:exe}(b).
% Thus, the stability and the optimality properties hold for $D_{3}$. They also hold for every other destination.

% The procedure to send a file is as follows. Let tree $T_{0} \dots T_{n-1}$ are constructed in this order. Each tree stops sending data when every destination in the tree has received the whole data. Since the destinations in $T_{k}$ are a subset of the destinations in $T_{k-1}$, each tree leaves in the opposite order of the construction.

% Let {\it target\_data} denote the data that have not been sent in the current trees $T_{0} \dots T_{k}$. The {\it target\_data} are partitioned by the trees, and sent in parallel. When all the {\it target\_data} are sent, destinations in $T_{k}$ have received the whole data. Thus, $T_{k}$ leaves from the transfer, and {\it target\_data} are updated. By repeating this procedure, every node finally receives the entire data.

%Since we use multiple trees in parallel, the receiver should permit to receive them in an arbitrary order.
Note that in this procedure, some nodes need to stay in the transfer to relay data to the rest of the nodes even after having received the whole data.
For example, in Figure~\ref{fig:pipelines}(c), nodes $D_{1}$ and $D_{2}$ need to continue relaying to $D_{3}$.
To avoid this, the trees can be re-planned, as shown in Figure~\ref{fig:pipelines}(d).

% but even we reconstructs the tree, the aggregate bandwidth on the rest of the nodes is not improved.


\subsection{Stability and Optimality for Tree Topology}\label{sec:proof}
We show that the algorithm introduced in Section~\ref{sec:algorithm}
is stable and optimal in the sense of aggregate bandwidth
if the network topology is a tree.
The proof consists of three parts.
First, we show that each link can be treated as an undirected link during tree construction.
Next, we establish some properties of the sets of links used in single transfers.
Finally, we prove that the set of trees delivers to each node the maximum amount of data that the node can receive.
%each node can receive as much data as in the direct transfer
%in the case of
%exclusive direct transfer.


\subsubsection{Preparation: a Property from Symmetric Link}%Directional and Bidirecional Link}
In general, we need to treat each link as a directed link during the construction of trees: the bandwidths of the two directed links on the same physical link are calculated separately, and each is added to a tree separately. However, when each link has the same bandwidth in both directions, we can treat the two directed links as a single link, whose bandwidth is the smaller of the two. The reason is as follows.

Consider a tree topology that contains a source $s$ and some destinations. A bidirectional link $e$ in the network has a subtree connected to each of its two ends. Since the network is a tree, the source node $s$ is contained in exactly one of the two subtrees. Let $T$ denote the subtree that does not contain the source node $s$.

The link $e$ can be split into two directional links: one heading toward $T$ and the other away from it. We call the former the {\it forward link} $e_{f}$ and the latter the {\it backward link} $e_{b}$, as shown in Figure~\ref{fig:l_ab}(a). In a tree network, removing $e$ from the topology disconnects every node in $T$ from $s$; therefore, if $e_{f}$ is removed, no route from $s$ to any node in $T$ can be created.

Now assume the backward link $e_{b}$ is adopted by a tree. Since the tree uses $e_{b}$, it has reached one of the destinations in $T$, and without $e_{f}$, no route from $s$ to any node in $T$ can be created. Thus, whenever a tree contains $e_{b}$, it also contains the corresponding forward link $e_{f}$. Therefore, the bandwidth subtracted from $e_{f}$ during the construction of each tree is at least that subtracted from $e_{b}$; that is, $B_{k}(e_{f}) - B_{k-1}(e_{f}) \le B_{k}(e_{b}) - B_{k-1}(e_{b})$, where $B_{k}(e)$ denotes the bandwidth of $e$ available in the construction of the $k$-th tree.

Since the bandwidth of a link is initially symmetric, $B_{0}(e_{b}) = B_{0}(e_{f})$.
It follows that $B_{k}(e_{f}) \le B_{k}(e_{b})$ for every $k$.

Thus, the bottleneck always lies on the forward link. Consequently, during the construction of the trees, we only need to track the bandwidth of the forward links.

\begin{figure}[tb]
\begin{center}
\begin{tabular}{cc}
\multicolumn{2}{c}{\includegraphics[width=9cm]{l_ab.eps}}\\
(a) Forward/Backward Link & (b) An example of $L(a, b)$ \\
\end{tabular}
\end{center}
\caption{Links In a Tree Symmetric Network}
\label{fig:l_ab}
\end{figure}


\subsubsection{Properties for Single Transfer}

Since the topology is a tree, there is exactly one route that connects
two nodes $a$ and $b$ using the minimum number of links. Let $L(a, b)$ denote
the set of links on this route, as shown in Figure~\ref{fig:l_ab}(b). The
removal of any link in $L(a, b)$ from the tree disconnects $a$ and
$b$; thus, any route from $a$ to $b$ contains all the links in $L(a,
b)$. The transfer throughput is therefore maximized by a route that uses each link
in $L(a, b)$ exactly once.
Let $v_k(a, b)$ denote this throughput under the bandwidth map $B_k$:
%Let $v(a, b)$ denote this throughput, and it holds the following:
\begin{equation}
\label{equ:v}
v_{k}(a, b) = \min_{e \in L(a, b)}(B_{k}(e))
\end{equation}

With the set of links $L(a, b) \cup L(b, c)$, we can construct a route from $a$ to $c$ via $b$. Therefore, the following relation holds:
\begin{equation}
\label{equ:l_subset}
L(a, b) \cup L(b, c) \supseteq L(a, c)
\end{equation}
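These definitions are straightforward to realize in code. The sketch below computes $L(a, b)$ and the corresponding bottleneck throughput on an invented tree topology (all names and capacities are illustration values), and checks the subset relation~(\ref{equ:l_subset}) for one concrete triple of nodes.

```python
def links_between(adj, a, b, parent=None):
    """Return L(a, b): the unique set of links connecting a and b in a tree."""
    if a == b:
        return frozenset()
    for nxt in adj[a]:
        if nxt != parent:
            rest = links_between(adj, nxt, b, a)
            if rest is not None:
                return rest | {frozenset((a, nxt))}
    return None

def v(adj, cap, a, b):
    """Throughput of a transfer from a to b: the narrowest link on L(a, b)."""
    return min(cap[e] for e in links_between(adj, a, b))

# Invented topology and symmetric capacities (illustration values only).
adj = {"S": ["sw1"], "sw1": ["S", "D1", "sw2"],
       "sw2": ["sw1", "D2", "D3"],
       "D1": ["sw1"], "D2": ["sw2"], "D3": ["sw2"]}
cap = {frozenset(("S", "sw1")): 5, frozenset(("sw1", "D1")): 5,
       frozenset(("sw1", "sw2")): 3, frozenset(("sw2", "D2")): 3,
       frozenset(("sw2", "D3")): 2}
bottleneck = v(adj, cap, "S", "D2")   # narrowest link on the S-D2 path
```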

Assume that we iteratively construct a total of $n$ trees; from the source
$s$, each tree traces some destinations in a depth-first manner. Let
$D_{k}$ denote the set of destinations included in the $k$-th depth-first
tree, and let
$P_{k}$ denote the set of links used in the $k$-th tree.
%The bandwidth of each link changes during the construction of the pipelines.
%Let $v_{k}(a, b)$ denote the throughput from $a$ to $b$ in the
%construction of the $(k)$th pipeline.
%Name each node in the destinations $D_{k}$ as $d_{k, 0} \dots d_{k, n-1}$ to satisfy $v_{k}(s, d_{k, i}) \le v_{k}(s, d_{k, i+1})$.
%Among $P_{k}$ and $L(s, d_{k,0}) \dots L(s, d_{k,n-1})$, the following relation holds:

Using the above notation, the following equations (\ref{equ:pl}) and (\ref{equ:non0}) hold.
Because the graph is a tree, %we can write $P_{k}$ as the following:
\begin{equation}
\label{equ:pl}
%P_{k} = L(s, d_{k,0}) \cup \dots \cup L(s, d_{k,n-1}),
P_{k} = \bigcup_{d \in D_k} L(s, d).
\end{equation}
Because the $k$th tree traces exactly the destinations that still have residual bandwidth from $s$,
%the following condition holds.
\begin{equation}
d \not\in D_k \text{ if and only if } v_k(s,d)= 0.
\label{equ:non0}
\end{equation}


%\begin{description}
%\item[Proof of (\ref{equ:pl}):]
%The following is one of the requirement conditions for $P$:

%\begin{equation}
%\forall d \in D, L(s,d) \subset P
%\end{equation}

%One sufficient condition for $P$ is:

%\begin{equation}
%\forall a, b \in (\{s\} \cup D), L(a, b) \subset P
%\end{equation}

%From the definition of $L$:
%\begin{equation}
%\forall d \in D, L(d) \subset (L(d_{0}) \cup \dots \cup L(d_{n-1}))
%\end{equation}

%From (\ref{equ:l_subset}),
%\begin{equation}
%\forall a, b \in D, L(a, b) \subset (L(d_{0}) \cup \dots \cup L(d_{n-1}))
%\end{equation}
%Consequently, (\ref{equ:pl}) has been proved.
%\end{description}


\subsubsection{Proof of Stability and Optimality}

Let $d_{k, 0}, \dots, d_{k, n-1}$ denote the nodes in the destination
set $D_{k}$, ordered so that $v_{k}(s, d_{k, i}) \le v_{k}(s, d_{k, i+1})$. Let
$u_{k}$ denote the throughput of the tree $P_{k}$.
%From $v_{k}(s, d_{k, i}) \le v_{k}(s, d_{k, i+1})$,
We get the following:
\begin{eqnarray}
\label{equ:u_eq_v}
u_{k} = \min_{e \in P_{k}}(B_{k}(e)) %= min_{e \in (L_{0} \cup \dots \cup L_{n-1})}(B(e))
= \min_{0 \le i < n}(v_{k}(s, d_{k, i})) = v_{k}(s, d_{k, 0}).
\end{eqnarray}
%Let us define $A_{k}$ as follows:
%\begin{equation}
%\label{equ:define_da}
%A_{k} = \{ d | d \in D_{k}, v_{k}(s, d) = v_{k}(s, d_{k,0}) \}.
%\end{equation}
%Among $A_{k}$, $D_{k}$ and $D_{k+1}$, the following holds:
The following equation holds.
\begin{equation}
\label{equ:d_sub_d}
D_{k+1} = D_{k} \setminus A_{k},
\end{equation}
where $A_k = \{d \in D_{k}| v_k(s,d)=v_k(s,d_{k,0}) \}$.

\begin{proof}
(Proof of (\ref{equ:d_sub_d}))

From the definition of $B_k$,
\[
%\forall e \in D_{k}, B_{k+1}(e) = B_{k}(e) - u_{k} \\
%\forall e \notin D_{k}, B_{k+1}(e) = B_{k}(e)
B_{k+1}(e) =
\begin{cases}
B_{k}(e) - u_k & \text{if $e \in P_k$,}\\
B_{k}(e) &\text{otherwise.}
\end{cases}
\]
Thus, from (\ref{equ:non0}),
\begin{equation}
%\forall e \in D_{k}, B_{k+1}(e) = B_{k}(e) - u_{k} \\
%\forall e \notin D_{k}, B_{k+1}(e) = B_{k}(e)
v_{k+1}(s,d) =
\begin{cases}
v_{k}(s,d) - u_k & \text{if $d \in D_k$,}\\
0 &\text{otherwise.}
\end{cases}
\label{equ:db}
\end{equation}

If $d \in A_k$, namely $ v_k(s,d) = v_k(s,d_{k,0})$, then
$v_{k+1}(s,d) = u_k - u_k = 0$ from (\ref{equ:u_eq_v}). If $d \in D_k
\setminus A_k$, namely $ v_k(s,d) > v_k(s,d_{k,0})$, then $v_{k+1}(s,d) =
v_k(s,d) - u_k > 0$.

From (\ref{equ:non0}), $D_{k+1} = \{d \in D_0 | v_{k+1}(s,d)\neq 0 \} =
D_k \setminus A_k$.
\end{proof}

From (\ref{equ:d_sub_d}), a destination $d$ in $A_{k}$ receives data from the $0$th through the $k$th trees.
Let $w(d)$ denote the total bandwidth $d$ receives from the trees,
that is,
%and the following holds:
\begin{equation}
\label{equ:d_and_u}
%\forall d \in A_{k},
w(d) = \sum_{0 \le i \le k}{u_{i}}.
\end{equation}

From (\ref{equ:db}), the following holds:
\begin{eqnarray*}
u_{k} = v_{k}(s, d_{k, 0}) = v_{k-1}(s, d_{k, 0}) - u_{k-1} \\
= v_{0}(s, d_{k, 0}) - \sum_{0 \le i < k}{u_{i}}.
\end{eqnarray*}
Thus
\begin{equation}
v_{0}(s, d_{k, 0}) = \sum_{0 \le i \le k}{u_{i}}. \label{equ:v_sum_u}
\end{equation}

From (\ref{equ:d_and_u}) and (\ref{equ:v_sum_u}), we get the following:
\begin{equation}
\forall d \in A_{k},\quad w(d) = v_{0}(s, d_{k, 0}).
\end{equation}
This means that the trees deliver to $d$ the same amount of data as a direct transfer from $s$ to $d$
would deliver if there were no other traffic. From (\ref{equ:v}), this is the maximum amount of data
$d$ can receive.
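As a concrete illustration with made-up bandwidths, suppose three destinations have $v_0(s, d) = 300$, $800$, and $1000$. Each round removes the slowest remaining set $A_k$ and carries its residual bandwidth:

```latex
% Hypothetical bandwidths, for illustration only.
\begin{align*}
u_0 &= 300, \\
u_1 &= 800 - 300 = 500, \\
u_2 &= 1000 - 800 = 200,
\end{align*}
```

so the totals $w(d)$ telescope to each node's own $v_0(s, d)$: the slowest node receives $300$, the next $300 + 500 = 800$, and the fastest $300 + 500 + 200 = 1000$.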

From the definition of $A_k$ and (\ref{equ:d_sub_d}), every destination is included
in one of $A_{0}, \dots, A_{n-1}$, where $n$ is the number of trees.

Consequently, we have proved that our trees deliver to each node the maximum amount of data that the node can receive.
%TAG achieve the same bandwidth as in the exclusive direct transfers.
The broadcast is also stable, because the bandwidth each node receives does not depend on the other destinations.

\begin{figure*}[tb]
\begin{center}
\includegraphics[width=18cm]{sim.eps}
\end{center}
\caption{Aggregate Bandwidth Relative to the {\it FlatTree} Algorithm in Simulations}
\label{fig:sim}
\end{figure*}

\begin{figure*}[tb]
\begin{center}
\includegraphics[width=13cm]{real.eps}
\end{center}
\caption{Real Machine Experiments}
\label{fig:real}
\end{figure*}



\section{Evaluation} \label{sec:eva}
We implemented the algorithm and evaluated it both in simulation and in a real environment.
We compared five algorithms:
\begin{description}
\item[Ours]\ \\
This is the algorithm we propose. Multiple partial trees are constructed iteratively in a depth-first manner, and the transfers are performed in parallel.

\item[Depth-First (FPFR \cite{fpfr}-like)]\ \\
The algorithm constructs multiple spanning trees in a depth-first manner \cite{balanced, fpfr}.
%Since the topology is a tree, only one tree is constructed.

\item[Dijkstra]\ \\
This algorithm builds a tree iteratively in a greedy manner: at each step, it picks the unreached destination that can be reached with the maximum possible bandwidth from the already-reached nodes. The method is explained in \cite{fpfr}, and a similar method is proposed in \cite{DBLP}.

\item[Random Pipeline]\ \\
This algorithm creates a tree at random: one node at a time is chosen randomly from the unreached destinations.
For each condition, we generated 100 candidate trees and chose the one with the largest aggregate bandwidth.

\item[Flat Tree]\ \\
The source sends the data to all the destinations.
\end{description}

Except for {\it Flat Tree}, every algorithm requires a bandwidth-annotated topology.
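To make the construction behind {\it Ours} concrete, the following Python sketch (our own illustration; the data structures are hypothetical and not taken from the actual implementation) mimics the iterative scheme on a tree topology: each round traces the destinations that still have residual bandwidth from the source, and then subtracts the round's bottleneck throughput from every link it uses.

```python
def links_on_path(adj, a, b):
    """Links on the unique route from a to b in a tree (iterative DFS)."""
    stack = [(a, None, [])]
    while stack:
        node, prev, links = stack.pop()
        if node == b:
            return links
        for nxt in adj[node]:
            if nxt != prev:
                stack.append((nxt, node, links + [(node, nxt)]))

def partial_trees(adj, bandwidth, s, dests):
    """Round k keeps the destinations with v_k(s, d) > 0 (the set D_k),
    takes the union of their routes as the tree P_k, and subtracts the
    bottleneck u_k from every link in P_k (B_{k+1}(e) = B_k(e) - u_k)."""
    bw = dict(bandwidth)                     # residual bandwidths B_k
    trees = []
    while True:
        links = set()                        # P_k
        for d in dests:
            route = links_on_path(adj, s, d)
            if min(bw[e] for e in route) > 0:
                links.update(route)          # d is still reachable
        if not links:
            break
        u = min(bw[e] for e in links)        # u_k = min over P_k of B_k(e)
        for e in links:
            bw[e] -= u
        trees.append((links, u))
    return trees

# toy tree: s -- x (1000), x -- d1 (300), x -- d2 (800)
adj = {'s': ['x'], 'x': ['s', 'd1', 'd2'], 'd1': ['x'], 'd2': ['x']}
bw0 = {('s', 'x'): 1000, ('x', 'd1'): 300, ('x', 'd2'): 800}
trees = partial_trees(adj, bw0, 's', ['d1', 'd2'])
```

On this toy tree, the slow destination receives only the first tree (throughput 300), while the faster one also receives a second partial tree (an additional 500), so each node's total matches its exclusive direct-transfer bandwidth.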



\subsection{Simulation}

We implemented a simulator to evaluate the broadcast algorithms. We
made the throughputs of the machines and switches large enough,
compared to the link bandwidths, that they never become
bottlenecks. We used a tree topology with 400 nodes, taken from a
real environment. Three different bandwidth
distributions were tested:

\begin{description}
\item[Uniform Random]\ \\
A uniform random value is assigned to each link
to see general behavior of these algorithms. Both {\it low-variance}
(from 500 to 1000) and {\it high-variance} (from 100 to 1000)
conditions are tested.

\item[Mixed Fast and Slow Links]\ \\
While 80\% of the links have broad
bandwidth (1000), the others have narrow bandwidth (100). This models a
situation in which some nodes have a slower NIC or some switches are slow.

\item[Random Inter-Cluster]\ \\
In this condition, inter-switch links are
assigned random bandwidths (from 100 to 1000).
\end{description}

The simulation was performed 10 times for each combination of algorithm, bandwidth distribution, and number of destinations.
Both symmetric and asymmetric conditions were tested for each bandwidth distribution.

The results are shown in Figure~\ref{fig:sim}. The vertical axis shows
the aggregate bandwidth relative to that of the {\it FlatTree} algorithm. Our algorithm took only 2 milliseconds to compute the schedule.
Our method achieved the best performance in every symmetric condition.
As shown in Figure~\ref{fig:sim}(c), the improvement is especially notable when fast and slow links are mixed;
in this case, the performance of the other algorithms drops because of their lack of stability.
In the asymmetric networks, our algorithm performed best except in one case, Figure~\ref{fig:sim}(d),
and even there the difference is only $3\%$.
These results confirm the superiority of our algorithm.


We also performed an experiment that simulates a multiple-source broadcast, shown in Figure~\ref{fig:sim}(i).
While we changed the number of destinations from $10$ to $50$, the number of sources was also changed from $1$ to $5$.
Figure~\ref{fig:sim}(i) shows that the performance relative to {\it FlatTree} does not change in the multi-source case.




\subsection{Real Machine Experiment} % TODO

We also performed experiments in a real machine environment.
This environment had the tree topology shown in Figure~\ref{fig:top},
with 105 nodes in 4 clusters; the bandwidths of the links are also shown in Figure~\ref{fig:top}.
We performed broadcasts among 10, 47
and 105 nodes. Table~\ref{tab:realnodes} shows the number of nodes
taken from each cluster. Each condition was tested four times, and the best
aggregate bandwidth was taken.

%% Some experiments have done by using the real environment. The environment has a tree topology shown in Figure~\ref{fig:top} with 105 nodes in 4 clusters. We performed broadcast among 10, 47, and 105 nodes. Table \ref{tab:realbw} describes the bandwidth conditions, and Table \ref{tab:realnodes} shows the number of nodes from each cluster. Each conditions are tested four times, and the best aggregate bandwidth is taken.


\begin{table}
\begin{center}
\caption{Number of Nodes Used in the Experiments}
\label{tab:realnodes}
\begin{tabular}{|c||c|c|c|c|}
\hline
Number of Destinations & A & B & C & D \\ \hline \hline
105 Nodes & 59 & 9 & 35 & 2 \\ \hline
47 Nodes & 30 & 1 & 16 & 1 \\ \hline
11 Nodes & 6 & 1 & 3 & 1 \\ \hline
\end{tabular}
\end{center}
\end{table}

Figure~\ref{fig:real} illustrates the results. Since the environment
is heterogeneous, our algorithm improved the performance significantly:
the aggregate bandwidth increased by $2.1$ to $2.6$ times
compared to the best result among the other algorithms. However, the
bandwidth was $30\%$ to $45\%$ worse than the optimal value predicted
by simulation. An investigation revealed that our assumption that
each link is bidirectional and thus has an independent (full-duplex)
bandwidth in each direction was subtly violated, because transferring
data in both directions saturated the CPUs. While each node could send or
receive at 900Mbps on its own, it could only sustain 750Mbps in each
direction when sending and receiving simultaneously. In the simulation,
the link adjoining such a node was modeled as having 900Mbps in each
direction, but in reality each direction only had 750Mbps when the node
was used to relay data.

%% However, the
%% bandwidth was $30\%$ to $45\%$ worse than simulated value, which is
%% the total of the bandwidth in exclusive direct transfer to each
%% destination. This is because the processing capacity on each node was
%% the bottleneck. While a node performed 900Mbps when a node has only
%% one transfer, the throughput dropped to 750Mbps each when the node
%% treated both incoming and outgoing transfers.


\begin{figure}[tb]
\begin{center}
\begin{tabular}{ccc}
\includegraphics[width=8.5cm]{top.eps} &
\end{tabular}
\end{center}
\caption{The Real Environment Topology}
\label{fig:top}
\end{figure}

To demonstrate the stability of our broadcast, another test was
performed, in which a node with small bandwidth joined a
9-node broadcast. Figure~\ref{fig:real}(d) shows the change in the
aggregate bandwidth of the original 9 nodes for four broadcast
algorithms.
While the aggregate bandwidth dropped to $52\%$ in {\it Depth-First}, which works the same as
{\it FPFR} on a tree, the deterioration in our algorithm was only $2.5\%$.
Although the drops in the other two algorithms were also small, our algorithm
yielded 1.5 times their aggregate bandwidth.


%% To demonstrate the sability, another test has performed. In this experiment, a node with small bandwidth joined to a broadcast in 9 nodes.
%% The change in the aggregate throughput of the older 9 nodes are compared among four broadcast algorithms. It is shown in \ref{fig:real}(d).
%% As can be seen, the deterioration is up to $2.5\%$ in our method.




\section{Conclusion} \label{sec:conclusion}% OK

In this paper, we introduced the notion of {\it stability} in
broadcasts, and proposed a simple and efficient broadcast for
heterogeneous environments.

Our broadcast improves on a previously proposed class of
broadcast algorithms that includes FPFR and Balanced Multicasting,
focusing on the fact that each node should receive data as fast as
possible and start computation without waiting for other nodes. Like
FPFR and Balanced Multicasting, our broadcast achieves high bandwidth
by forwarding data along multiple spanning trees. In addition, our
broadcast avoids the effect of nodes with narrow bandwidth by
forwarding data along multiple partial trees. While the slowest node
only receives data from the spanning trees, faster nodes also receive
data from the partial trees.

For general graphs, our broadcast will always outperform FPFR and
Balanced Multicasting, and for trees, we proved that it is {\it
stable} as well as {\it optimal}. In a real environment with 100 machines
in 4 clusters, our scheme achieved $2.1$ to $2.6$ times aggregate bandwidth
compared to the best result in the other algorithms. We also demonstrated
the stability by adding a slow node to a broadcast. While the aggregate
bandwidth dropped to $52\%$ in {\it Depth-First}, which works the same as
{\it FPFR} on a tree, the deterioration in our algorithm was only $2.5\%$.
Simulations also showed that our algorithm performs well under
many bandwidth distributions.
% even in an asymmetric environment, our algorithm
%outperformed all other methods that we have tested. We also performed
%an experiment in a real environment to demonstrate its practicality.

%% In this paper, we introduced a notion of {\it stable broadcast}. In conventional broadcast algorithms, adding a new node with narrow bandwidth degrades the bandwidth received by old nodes. A stable broadcast algorithm delivers the same amount of data to each node as when an exclusive data transfer is performed to it. Consequently, the aggregate bandwidth of some nodes are not diminished by adding other nodes. In a tree topology, a stable broadcast achieves in many criterion, such as aggregate bandwidth and the total transfer completion time.

%% We proposed a stable broadcast algorithm under the assumption of a tree symmetric network. As we have seen in section \ref{sec:algorithm}, our algorithm has two points. First, we use the bandwidth-annotated topology information to avoid self-induced congestion. If the tree is constructed without considering the network topology, the broadcast performance is possibly limited because transfers for the same data use the same link many times. By constructing a pipeline in a depth-first manner, a link is used only once for the same kind of data. As a result, the transfer performance is improved.

%% Second, we use multiple partial pipelines in parallel to omit the effect of a node with narrow bandwidth.
%% While the node connected to the source in the narrowest bandwidth receives data from only one pipeline, more numbers of pipelines delivers data to a node connected in broader bandwidth links. Therefore, each node receives as much amount of data as in a single transfer to that node from the source. %, which is the maximum

%% The stability of our algorithm is proved in section \ref{sec:proof}. Our algorithm is tested on both in simulations and in a real network environment. The proposed scheme achieves the best performance and is stable for adding nodes with narrow bandwidth. In a large bandwidth variance environment, our scheme yielded twice the aggregate bandwidth compared to a single depth-first pipeline. Even in an asymmetric environment, our algorithm outperformed all other methods we have tested. We have also performed a experiment on a real environment to demonstrate its practicality.

In future work, we plan to extend the algorithm to arbitrary graph networks, not only trees. Since the algorithm uses the bandwidth of each link, we also plan to develop a fast method to measure the bandwidth of all the links in the network.


%-------------------------------------------------------------------------
%% \nocite{ex1,ex2}
%\bibliographystyle{latex8}
%\bibliography{reference}

\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,reference}
\end{document}