Measuring network flow sizes is important for tasks like accounting/billing, network forensics and security. Per-flow accounting is considered hard because it requires that many counters be updated at very high speed, and the large, fast memories needed to store the counters are prohibitively expensive. Therefore, current approaches aim to obtain approximate flow counts; that is, to detect large "elephant" flows and then measure their sizes. Recently the authors and their collaborators have developed [1] a novel method for per-flow traffic measurement that is fast, highly memory-efficient and accurate. At the core of this method is a novel counter architecture called "counter braids." In this paper, we analyze the performance of the counter braid architecture under a Maximum Likelihood (ML) flow size estimation algorithm and show that it is optimal; that is, the number of bits needed to store the size of a flow matches the entropy lower bound. While the ML algorithm is optimal, it is too complex to implement. In [1] we developed an easy-to-implement and efficient message passing algorithm for estimating flow sizes.
This paper addresses a theoretical problem arising in a novel approach to network traffic measurement that the authors and their collaborators have recently developed. We refer the reader to [1] for technological background, motivation, related literature and other details. To keep this paper self-contained, we summarize the background and restrict the literature survey to what is relevant for the results of this paper. Background. Measuring the sizes of network flows on high-speed links is a technologically challenging problem [2]. The nature of the data to be measured is as follows. At any given time, several tens or hundreds of thousands of flows can be active on core Internet links. Packets arrive roughly once every 40-50 nanoseconds on these links, which currently run at 10 Gbps. Finally, flow size distributions are heavy-tailed, giving rise to the well-known decomposition of flows into a large number of short "mice" and a few large "elephants." As a rule of thumb, network traffic follows an "80-20 rule": 80% of the flows are small, while the remaining 20% (the large flows) carry about 80% of the packets or bytes.
This implies that measuring flow sizes accurately requires a large array of counters which can be updated at very high speeds, and a good counter management algorithm for updating counts, installing new counters when flows initiate and uninstalling them when flows terminate.

Yi Lu is with the Department of Electrical Engineering, Stanford University, yi.lu@stanford.edu. Andrea Montanari is with the Departments of Electrical Engineering and Statistics, Stanford University, montanari@stanford.edu. Balaji Prabhakar is with the Department of Electrical Engineering, Stanford University, balaji@stanford.edu.
Since high-speed large memories are either too expensive or simply infeasible with current technology, the bulk of research on traffic measurement has focused on approximate counting methods. These approaches aim at detecting elephant flows and measuring their sizes. Counter braids. In [1] we develop a novel counter architecture, called “counter braids”, which is fast, highly memory-efficient, and accurately measures all flow sizes, not just the elephants. We briefly review this architecture using the following simple example.
Suppose we are given 5 numbers and are told that four of them are no more than 2 bits long while the fifth can be 8 bits long. We are not told which is which! Figures 1 and 2 present two approaches for storing the values of the 5 numbers. The first one corresponds to a traditional array of counters, whereby the same number of memory registers is allocated to each measured variable (flow). The structure in Fig. 2 is more efficient in memory, but retrieving the count values is less straightforward, requiring a flow size estimation algorithm. Viewed from an information-theoretic perspective, the design of an efficient counting scheme together with a good flow size estimation algorithm is equivalent to the design of an efficient source code [3]. However, the applications we consider impose a stringent constraint on such a code: each time the size of a flow changes (because a new packet arrives), a small number of operations must suffice to update the stored information. This is not the case with standard source codes, where changing a single letter in the source stream may completely alter the compressed version.
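To make the braided storage concrete, the following is a minimal sketch of the two-layer idea behind Fig. 2 (our illustration, not code from [1]; the counter widths and the fixed braid mapping are assumptions made for this example): each number gets a shallow first-layer counter, and overflows carry into a small set of shared, deeper second-layer counters.

```python
# Illustrative sketch of a two-layer "braided" counter store.
# Parameters are hypothetical, not taken from Figs. 1-2.
FIRST_LAYER_BITS = 2                       # shallow counter per number
MAX_FIRST = (1 << FIRST_LAYER_BITS) - 1    # = 3

first = [0] * 5    # one 2-bit counter for each of the 5 numbers
second = [0] * 2   # two shared deeper counters (say 6 bits each)

def braid(i):
    # Second-layer counters backing up first-layer counter i.
    # A fixed mapping here; [1] uses (pseudo)random sparse hashing.
    return [i % 2]

def increment(i):
    """Count one more unit for number i."""
    first[i] += 1
    if first[i] > MAX_FIRST:               # overflow: wrap around and
        first[i] = 0                       # carry into the shared layer
        for j in braid(i):
            second[j] += 1
```

Storing five 8-bit numbers naively takes 40 bits; here the two layers use 5·2 + 2·6 = 22 bits. The price is exactly the point made above: second-layer counters are shared, so reading a value back requires estimation. The count of number i is first[i] plus 2^2 times the carries attributable to i, and those carries must be disentangled from the carries of the other numbers.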
In this paper we prove that, under a probabilistic model for the flow sizes (namely, that they form a vector of iid random variables), counter braids achieve a compression rate equal to the entropy of the flow size distribution, in the large system limit. That is, for any rate larger than the flow entropy, the flow sizes can be recovered from the counter values with error probability vanishing in the large system limit. Further, we prove that optimal compression can be achieved using braids that are sparse. The result is nonobvious, since counter braids form a rather restrictive family of architectures.
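As a purely illustrative instance of the entropy bound, suppose the flow sizes were iid Geometric(1/2) on {1, 2, ...} (a distribution we pick for illustration; it is not the paper's model). The result then says an optimal braid needs about H = 2 bits per flow asymptotically, versus the 32 or more bits a fixed-width counter array would provision per flow:

```python
import math

p = 0.5                                  # hypothetical flow-size model:
pmf = lambda k: (1 - p) ** (k - 1) * p   # Geometric(1/2) on {1, 2, ...}

# Entropy of the flow-size distribution, i.e. the compression lower
# bound; the tail beyond k = 200 is numerically negligible.
H = -sum(pmf(k) * math.log2(pmf(k)) for k in range(1, 201))
# For Geometric(p), H = h(p)/p, so H = 2 bits when p = 1/2.
```

Any counting scheme whose per-flow rate exceeds this H can in principle recover all flow sizes; the theorem states that sparse counter braids actually attain it.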
Our treatment makes use of techniques from the theory of low-density parity-check (LDPC) codes, and the whole construction is inspired by that of LDPC codes [4], [5]. The LDPC construction carries over to the source coding problem thanks to the standard equivalence between coding over discrete memoryless symmetric channels and compressing iid discrete random variables [6]. However, the key ideas in the present paper were developed to deal with the fact that the flow sizes are a priori unbounded. In channel coding language, this would be equivalent to using a countably infinite input alphabet.
Finally, we insist on using sparse braids for two reasons. First, sparsity allows the stored values to be updated with a small (typically bounded) number of operations. Second, it is easy to see that ML decoding of counter braids is NP-hard, since it contains ML decoding of linear codes as a special case [7]. However, thanks to the sparseness of the braid, low-complexity message passing algorithms become effective; an easy-to-implement such algorithm was developed in [1].
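To illustrate why sparsity helps decoding, here is a toy iterative decoder (our simplification, not the algorithm of [1]): each flow estimate is repeatedly set to the smallest nonnegative slack left for it in any counter it touches. The instance is deliberately tiny and loop-free so that the iteration converges exactly; the algorithm in [1] uses alternating message updates and is analyzed for random sparse braids.

```python
# Toy instance: flows f0, f1 with true sizes 2 and 5; each counter
# holds the exact sum of the flows hashed into it.
flow_to_counters = {0: [0, 1], 1: [1, 2]}   # sparse, loop-free graph
counters = [2, 7, 5]                        # c0=f0, c1=f0+f1, c2=f1

# Invert the adjacency once.
counter_to_flows = {c: [] for c in range(len(counters))}
for f, cs in flow_to_counters.items():
    for c in cs:
        counter_to_flows[c].append(f)

def decode(iters=10):
    """Iteratively tighten each flow estimate against its counters."""
    est = {f: 0 for f in flow_to_counters}
    for _ in range(iters):
        for f, cs in flow_to_counters.items():
            # Slack of counter c for flow f: what remains after
            # subtracting the current estimates of the other flows.
            est[f] = max(0, min(counters[c]
                                - sum(est[g] for g in counter_to_flows[c]
                                      if g != f)
                                for c in cs))
    return est
```

On this instance decode() returns {0: 2, 1: 5}, recovering both flows exactly. Each sweep touches only the few counters adjacent to each flow, which is precisely the operational benefit of keeping the braid sparse.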