An Implementation of Back-Propagation Learning on GF11, a Large SIMD Parallel Computer
Current connectionist simulations require huge computational resources. We describe a neural network simulator for the IBM GF11, an experimental SIMD machine with 566 processors and a peak arithmetic performance of 11 Gigaflops. We present our parallel implementation of the backpropagation learning algorithm, techniques for increasing efficiency, performance measurements on the NetTalk text-to-speech benchmark, and a performance model for the simulator. Our simulator currently runs the back-propagation learning algorithm at 900 million connections per second, where each “connection per second” includes both a forward and backward pass. This figure was obtained on the machine when only 356 processors were working; with all 566 processors operational, our simulation will run at over one billion connections per second. We conclude that the GF11 is well-suited to neural network simulation, and we analyze our use of the machine to determine which features are the most important for high performance.
💡 Research Summary
The paper presents a parallel implementation of the back‑propagation learning algorithm on the IBM GF11, a large SIMD (Single Instruction, Multiple Data) machine equipped with 566 processors and a peak performance of 11 Gflops. The authors begin by motivating the need for massive computational resources in connectionist simulations, noting that traditional sequential or vector processors cannot sustain the required throughput for large‑scale neural networks. They argue that SIMD architectures, with their ability to execute the same instruction across many data elements simultaneously, are a natural fit for the highly regular, data‑parallel computations inherent in neural network training.
A detailed description of the GF11 hardware follows. The machine consists of a large array of arithmetic units, each capable of 32‑bit floating‑point operations. Units are driven by a synchronously broadcast instruction stream and linked by a high‑speed interconnect that supports fast collective communication. Each processor has its own local memory banks, providing the high bandwidth needed for the large weight matrices typical of neural networks.
The core of the implementation maps the three phases of back‑propagation—forward propagation, backward error propagation, and weight update—onto the SIMD fabric at the granularity of individual connections. Each processor is assigned a contiguous block of connections; during the forward pass it computes the weighted sum of its inputs, while during the backward pass it propagates the error term through the same connections. By keeping the instruction stream identical across all processors, the authors exploit the SIMD model to its fullest, while the per‑processor data partition eliminates the need for fine‑grained synchronization.
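The per‑processor partitioning described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: each "processor" owns a contiguous block of connections (columns of a toy weight matrix here), and all blocks execute the same forward and backward operations in lockstep, mimicking the SIMD model. All sizes and names are invented for the example.

```python
import numpy as np

# Toy sizes, not the paper's network dimensions.
rng = np.random.default_rng(0)
n_in, n_out, n_proc = 8, 4, 2
W = rng.standard_normal((n_out, n_in))   # weight matrix
x = rng.standard_normal(n_in)            # input activations

# Each processor is assigned a contiguous block of connections.
blocks = np.array_split(np.arange(n_in), n_proc)

# Forward pass: each processor computes a partial weighted sum over its block;
# the partials are then accumulated globally (see the reduction discussion).
partials = [W[:, b] @ x[b] for b in blocks]
y = np.sum(partials, axis=0)

# Backward pass: given an error term on the outputs, each processor
# propagates it back through the same connections it owns.
delta = rng.standard_normal(n_out)
grad_x = np.concatenate([W[:, b].T @ delta for b in blocks])

# Sanity checks against the unpartitioned computation.
assert np.allclose(y, W @ x)
assert np.allclose(grad_x, W.T @ delta)
```

Because every block runs the identical instruction sequence, the only cross‑processor step is the final accumulation of partial sums.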
A major challenge addressed in the paper is the global accumulation of weight gradients, which would normally require a reduction across all processors. The authors implement a tree‑structured reduction: each processor first computes a local partial sum of gradients, then pairs of processors combine their results in successive stages until a single global sum is obtained. This reduction is pipelined with the computation phases so that communication latency is hidden behind ongoing arithmetic work. To avoid load imbalance caused by varying layer sizes, a dynamic load‑balancing scheme is introduced. Processors that finish early fetch additional connection blocks from a shared work queue, ensuring that no processor remains idle while others are still busy.
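The tree‑structured reduction can be illustrated as follows. This is a hedged sketch under the assumption of a power‑of‑two processor count; each "processor" holds a local partial gradient, and pairs combine in log₂(P) stages until index 0 holds the global sum. The pipelining with the computation phases and the work‑queue load balancing are omitted for clarity.

```python
import numpy as np

def tree_reduce(partials):
    """Pairwise tree reduction; len(partials) assumed a power of two."""
    vals = [p.copy() for p in partials]
    stride = 1
    while stride < len(vals):
        # At each stage, processor i receives its partner's partial sum.
        for i in range(0, len(vals), 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]

# Eight toy "processors", each holding a constant partial gradient vector.
parts = [np.full(3, float(i)) for i in range(8)]
total = tree_reduce(parts)
assert np.allclose(total, sum(range(8)))  # 0 + 1 + ... + 7 = 28 per element
```

The loop runs log₂(8) = 3 stages instead of the 7 sequential additions a linear reduction would need, which is what makes overlapping the stages with ongoing arithmetic attractive.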
Performance is evaluated using the NetTalk text‑to‑speech benchmark, a classic network (203 input and 26 output units in its standard configuration) that requires millions of weight updates during training. With 356 processors active, the system achieves a throughput of 900 million connections per second, where a “connection” includes both forward and backward computation. This represents a 3–4× speedup over the best vector‑machine implementations reported at the time. Extrapolating to the full complement of 566 processors, the authors predict a throughput exceeding one billion connections per second.
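The extrapolation quoted above is easy to check under a linear‑scaling assumption (which is itself an idealization; real scaling may fall short of linear):

```python
# Linear-scaling extrapolation of the measured throughput.
measured_cps = 900e6        # connections/s measured with 356 processors
active, full = 356, 566
projected = measured_cps * full / active
assert projected > 1e9      # consistent with "over one billion" in the abstract
```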
To explain these results, the paper introduces an analytical performance model that separates total execution time into three components: T = O/Π + B/β + C/γ, where O, B, and C are the arithmetic, memory‑traffic, and communication volumes, and Π, β, and γ denote the effective rates of the processor array, memory subsystem, and interconnect, respectively. By fitting the model to empirical data, the authors demonstrate that it predicts observed performance within 5% error, confirming that the dominant cost lies in the arithmetic operations while memory bandwidth and communication overhead remain modest (each contributing less than 10% of total time).
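The three‑term model can be written down directly. The counts and rates below are illustrative placeholders, not the paper's fitted values; they are chosen only so that the arithmetic term dominates, mirroring the qualitative conclusion above.

```python
def predicted_time(O, B, C, Pi, beta, gamma):
    """Total time = arithmetic + memory traffic + communication."""
    return O / Pi + B / beta + C / gamma

# Placeholder workload: 1e12 ops, 1e11 words of memory traffic, 1e9 words
# communicated, against illustrative rates for each subsystem.
t = predicted_time(O=1e12, B=1e11, C=1e9,
                   Pi=1e10, beta=1e10, gamma=1e9)
# Arithmetic contributes 100 s of the 111 s total, so it dominates.
assert t == 100.0 + 10.0 + 1.0
```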
The authors conclude that the GF11’s combination of high arithmetic density, fast collective communication, and flexible memory organization makes it exceptionally well‑suited to large‑scale neural network simulation. Key factors for achieving high performance include (1) partitioning work at the connection level to match the SIMD execution model, (2) hiding reduction latency through pipelined tree reductions, and (3) employing dynamic load balancing to mitigate irregularities in network topology. They suggest that, with appropriate algorithmic mapping and system‑level optimizations, SIMD machines can rival or surpass contemporary vector processors for connectionist workloads. The paper thus provides both a practical demonstration of high‑throughput neural network training on a SIMD supercomputer and a methodological framework for evaluating future parallel hardware for machine‑learning applications.