Accelerating the Execution of Matrix Languages on the Cell Broadband Engine Architecture
Matrix languages, including MATLAB and Octave, are established standards for applications in science and engineering. They provide interactive programming environments that are easy to use due to their scripting languages with matrix data types. Current implementations of matrix languages do not fully utilise high-performance, special-purpose chip architectures such as the IBM PowerXCell processor (Cell), which is currently used in the fastest computer in the world. We present a new framework that extends Octave to harness the computational power of the Cell. With this framework the programmer is relieved of the burden of introducing explicit notions of parallelism. Instead the programmer uses a new matrix data-type to execute matrix operations in parallel on the synergistic processing elements (SPEs) of the Cell. We employ lazy evaluation semantics for our new matrix data-type to obtain execution traces of matrix operations. Traces are converted to data dependence graphs; operations in the data dependence graph are lowered (split into sub-matrices), scheduled and executed on the SPEs. Thereby we exploit (1) data parallelism, (2) instruction level parallelism, (3) pipeline parallelism and (4) task parallelism of matrix language programs. We conducted extensive experiments to show the validity of our approach. Our Cell-based implementation achieves speedups of up to a factor of 12 over code run on recent Intel Core2 Quad processors.
💡 Research Summary
The paper addresses the gap between high‑level matrix languages such as MATLAB and Octave and the specialized, high‑performance IBM PowerXCell processor, which powers some of the world’s fastest computers. While Cell’s architecture—comprising a general‑purpose PowerPC core (PPE) and eight Synergistic Processing Elements (SPEs) with 256 KB local stores and SIMD units—offers massive parallelism, traditional matrix language implementations cannot exploit it because they lack explicit parallel constructs and are not tuned to the memory hierarchy.
To bridge this gap, the authors extend Octave with a new matrix data type called cellMatrix. Instead of evaluating operations immediately, the framework employs lazy evaluation, recording every matrix operation in an execution trace. The trace is then transformed into a data‑dependence graph, where nodes represent individual matrix primitives (e.g., addition, multiplication) and edges encode data flow.
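The deferred-evaluation idea can be illustrated with a minimal sketch: a matrix wrapper whose operators build graph nodes instead of computing, so that a full trace of operations accumulates until a result is actually demanded. The class name, methods, and list-of-rows representation below are illustrative assumptions, not the paper's actual cellMatrix API.

```python
class LazyMatrix:
    """Sketch of a lazily evaluated matrix type: `+` and `@` record
    operation nodes rather than computing results immediately."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    @classmethod
    def const(cls, rows):
        # Leaf node holding concrete data (plain list of rows).
        return cls("const", value=[list(r) for r in rows])

    def __add__(self, other):
        return LazyMatrix("add", (self, other))      # record, don't compute

    def __matmul__(self, other):
        return LazyMatrix("mul", (self, other))      # record, don't compute

    def force(self):
        """Walk the recorded dependence graph bottom-up, memoising results.
        In the paper's framework this is the point where the trace is
        lowered, scheduled, and dispatched to the SPEs instead."""
        if self.value is None:
            a, b = (n.force() for n in self.inputs)
            if self.op == "add":
                self.value = [[x + y for x, y in zip(ra, rb)]
                              for ra, rb in zip(a, b)]
            else:  # "mul": naive dense matrix product
                self.value = [[sum(x * y for x, y in zip(row, col))
                               for col in zip(*b)] for row in a]
        return self.value
```

Building `C = (A @ B) + B` touches no data; only `C.force()` triggers evaluation of the whole accumulated graph, which is what gives the runtime a complete dependence picture to optimise.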
The graph undergoes a lowering phase that partitions each matrix into sub‑matrices (tiles) sized to fit an SPE’s local store. This tiling creates many independent tasks, enabling the exploitation of four forms of parallelism:
- Data parallelism – identical operations on different tiles;
- Instruction‑level parallelism – SIMD‑friendly re‑ordering of low‑level instructions;
- Pipeline parallelism – overlapping DMA transfers, computation, and result writes across SPEs;
- Task parallelism – concurrent execution of unrelated graph nodes.
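The lowering step above can be sketched as a simple tiling computation: given a local-store budget and the number of tile operands that must be resident at once, choose a square tile size and enumerate the resulting sub-matrices. The 64 KB budget, 4-byte elements, and three-operand assumption are illustrative; a real SPE runtime must also reserve local store for code, stack, and DMA buffers.

```python
import math

def tile_matrix(rows, cols, budget_bytes=64 * 1024, elem_bytes=4, operands=3):
    """Split an (rows x cols) matrix into square tiles small enough that
    `operands` tiles (e.g. two inputs and one result) fit together in a
    local-store budget. Returns the tile side length and a list of
    (row_offset, col_offset, tile_rows, tile_cols) descriptors."""
    max_elems = budget_bytes // (elem_bytes * operands)  # elements per tile
    side = max(1, math.isqrt(max_elems))                 # square tile side
    tiles = [(r, c, min(side, rows - r), min(side, cols - c))
             for r in range(0, rows, side)
             for c in range(0, cols, side)]
    return side, tiles
```

Each descriptor corresponds to one DMA-transferable unit of work, and element-wise operations on distinct tiles become the independent tasks the scheduler farms out.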
A custom scheduler respects the dependence edges while dynamically assigning tiles to SPEs, balancing load and minimizing inter‑SPE communication. The implementation also handles practical issues such as limited local memory, DMA latency, and the need for dynamic tile resizing when a computation exceeds the available store.
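A scheduler of this kind can be approximated by greedy list scheduling: a task becomes ready once all its predecessors have finished, and each ready task is placed on the earliest-free SPE. This sketch only models load balance; the cost model, task names, and eight-SPE default are assumptions, and the paper's scheduler additionally accounts for DMA traffic between tiles.

```python
import heapq

def schedule(tasks, deps, num_spes=8):
    """Greedy list scheduler over a dependence graph.
    `tasks` maps task -> cost; `deps` maps task -> set of predecessors.
    Returns ((task, spe, start_time) in dispatch order, makespan)."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    succs = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)
    finish = {}                                   # task -> completion time
    spes = [(0.0, i) for i in range(num_spes)]    # min-heap of (free-at, spe id)
    heapq.heapify(spes)
    ready = [t for t in tasks if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop(0)
        free_at, spe = heapq.heappop(spes)        # earliest-free SPE
        # A task may start only after its SPE is free and all inputs exist.
        start = max(free_at,
                    max((finish[p] for p in deps.get(t, ())), default=0.0))
        finish[t] = start + tasks[t]
        order.append((t, spe, start))
        heapq.heappush(spes, (finish[t], spe))
        for s in succs[t]:                        # release dependent tasks
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order, max(finish.values(), default=0.0)
```

On a diamond-shaped graph with two SPEs, the two independent entry tasks run concurrently while the join task waits for both, which is exactly the dependence-respecting behaviour the summary describes.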
Experimental evaluation covers a suite of Octave benchmarks, including dense linear algebra (matrix multiplication, LU factorization), image‑processing kernels (convolution, geometric transforms), and signal‑processing routines (FFT). Compared against a contemporary Intel Core 2 Quad system, the Cell‑based version achieves average speedups of 6× and peak speedups of up to 12×. The authors attribute the gains primarily to effective pipelining and the high arithmetic density of SPEs when operating on suitably sized tiles.
The study demonstrates that high‑level matrix languages can be retrofitted to exploit heterogeneous architectures without exposing programmers to low‑level parallel programming. The authors suggest future work on extending the approach to other many‑core platforms (GPUs, Xeon Phi), refining automatic tile‑size selection, and integrating dynamic runtime adaptation for irregular workloads.