FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 Single-Precision FPU, and a 43.7GFLOPS/W at 74.6GFLOPS/mm2 Double-Precision FPU, in 28nm UTBB FDSOI
📝 Abstract
FPMax implements four FPUs optimized for latency- or throughput-oriented workloads in two precisions, fabricated in 28nm UTBB FDSOI. Each unit's parameters, e.g., pipeline stages and Booth encoding, were optimized to yield 1.42ns latency at 110GFLOPS/W (SP) and 1.39ns latency at 36GFLOPS/W (DP). At 100% activity, body-bias control improves energy efficiency by about 20%; at 10% activity the saving is almost 2x. Keywords: FPU, energy efficiency, hardware generator, SOI
📄 Content
Jing Pu¹, Sameh Galal², Xuan Yang¹, Ofer Shacham³, Mark Horowitz¹
¹Stanford University, Stanford, CA, USA; ²Soft Machines Inc., Santa Clara, CA, USA; ³Google Inc., Mountain View, CA, USA

Introduction

Floating-point (FP) computation has become ubiquitous in digital systems, for both sequential and parallel workloads. To help designers create efficient implementations for these different environments, we created FPGen, an FPU generator that distills design innovations from 50 years of published research on FPU design [1]. FPGen explores different implementation techniques to find those that optimize the design for the target application's power, performance, and area constraints. The FPMax chip evaluates the capability of FPGen and of Ultra-Thin Body and BOX fully-depleted SOI (UTBB FDSOI) technology by incorporating four generated FP multiply-accumulate (FMAC) units, optimized for different precisions and for either latency or throughput applications, in ST 28nm UTBB FDSOI LVT technology. All four FPUs are fully pipelined, implement IEEE-compliant rounding, and use internal forwarding before rounding [8]. They use widely different implementations for the FMAC (Table I). The designs optimized to minimize latency use a cascade multiply-add (CMA) architecture with a Wallace tree to sum the partial products (PPs).
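As a quick sanity check on the headline figures, the efficiency numbers convert directly into per-operation energy. This is plain unit conversion, using only the numbers from the title:

```python
# GFLOPS/W is dimensionally GFLOP/J, so energy per operation follows
# directly from the headline numbers: pJ/FLOP = 1000 / (GFLOPS/W).

def pj_per_flop(gflops_per_watt: float) -> float:
    """Energy per floating-point operation, in picojoules."""
    return 1000.0 / gflops_per_watt

print(f"SP: {pj_per_flop(106.0):.2f} pJ/FLOP")  # ~9.43 pJ
print(f"DP: {pj_per_flop(43.7):.2f} pJ/FLOP")   # ~22.88 pJ
```
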
The throughput-optimized designs use a fused multiply-add (FMA) architecture with simpler combiners for the multiplication.

FPU Architectures

For latency-oriented FPUs, the primary metrics are the energy per FLOP and the average latency per FLOP. In an FMA [2], the latencies for feeding the result back as a multiplier input or as an adder input are the same (Fig. 1(a)). In many applications, however, accumulation dependencies tend to be more common. A CMA architecture has a longer total latency but a shorter path for accumulation (Fig. 1(b)), and thus performs better on practical workloads. Fig. 2(a) shows the pipeline of our 5-stage double-precision (DP) CMA. With internal bypasses, the un-rounded result at stage 4 can be forwarded either to the multiplier input at stage 1 or to the adder input at stage 3 or earlier. We define the average latency penalty as the average number of cycles a dependent operation (either an accumulation or a multiplication) must stall before its data is available [1]. Our experiments show that, compared to 5-cycle FMAs with and without unrounded-result forwarding, the DP CMA achieves 37% and 57% lower average latency penalty on the SPEC FP benchmarks, respectively (Fig. 2(c)). For the single-precision (SP) latency-optimized unit, we explored a more deeply pipelined, faster-clocked design. The longer clock cycle allows the DP unit to use Booth-3 encoding to reduce area and energy, while the SP unit uses the more traditional Booth-2 encoding.
Fig. 1. Simplified block diagrams for single-precision (a) FMA and (b) CMA (adapted from [2]).
Fig. 2. (a) SP and (b) DP CMA pipelines and internal bypasses, and (c) latency comparison for CMA and FMA with and without bypasses.
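The average-latency-penalty metric above can be sketched as a small weighted-stall model. The dependency mix and forwarding latencies below are hypothetical illustrations, not chip measurements; only the shape of the comparison (equal FMA feedback paths vs. a short CMA accumulate bypass) comes from the paper:

```python
# Sketch of the "average latency penalty" metric: the mean number of cycles
# a dependent operation stalls before its forwarded input is usable.
# All numbers below are illustrative, NOT measured values.

def avg_latency_penalty(dep_mix, usable_after):
    """dep_mix: {dep_type: fraction of dependent ops};
    usable_after: {dep_type: cycles after issue until the forwarded
    result is usable}. A dependent op issued the very next cycle
    stalls usable_after - 1 cycles."""
    return sum(frac * max(0, usable_after[kind] - 1)
               for kind, frac in dep_mix.items())

mix = {"accumulate": 0.8, "multiply": 0.2}   # accumulation-heavy, per the paper
fma = {"accumulate": 5, "multiply": 5}       # FMA: both feedback paths equal
cma = {"accumulate": 2, "multiply": 5}       # CMA: short accumulate bypass

print(avg_latency_penalty(mix, fma))  # 4.0
print(avg_latency_penalty(mix, cma))  # 1.6
```

Under this toy mix the CMA's short accumulation path dominates the average, which is the qualitative effect behind the 37%/57% reductions reported for SPEC FP.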
For GPU-type applications with abundant parallelism, the latency of an individual operation is less critical. As a result, the metrics become the energy per FLOP and the compute efficiency in GFLOPS/mm2 [2]. We find that FMAs are more area-efficient than CMAs. The focus on area and energy efficiency again leads to the use of Booth-3 encoding, and also to simpler combiner structures for the multiplier partial products: the DP unit uses a simple array, and the SP unit uses a modified array called a ZM structure [3].
Chip Implementation and Measurement
The design parameters of the FPUs were selected from the Pareto curves of energy vs. performance, shown in Fig. 3 for the SP throughput designs. The curve with triangle marks represents the performance of designs with different architectural parameters simulated at 1V su