An Architectural Approach for Decoding and Distributing Functions in FPUs in a Functional Processor System
The main goal of this research is to develop the concepts of a novel processor system called the Functional Processor System. The work carried out in this proposal concentrates on decoding function pipelines and distributing them among FPUs as part of a scheduling approach. Because functional programs are super-level programs that express requirements only at the functional level, decoding functions and distributing them across heterogeneous functional processor units is a challenge. We explore the possibility of segregating functions from the application program and distributing them to the relevant FPUs using address-mapping techniques. Here we pursue the notion of feeding functions into the processor farm rather than having the processor fetch instructions or functions and execute them. This work is carried out at a theoretical level; realizing it in hardware would require a long-term effort, likely involving a large industrial team and a pragmatic time frame.
💡 Research Summary
The paper introduces a novel processor architecture called the Functional Processor System (FPS), which departs from the traditional instruction‑centric design by treating programs as collections of high‑level functions and dispatching these functions directly to specialized functional processing units (FPUs). The authors begin by describing a “function pipeline decoder” that parses an executable binary, identifies function boundaries, builds a call graph, and extracts a rich set of metadata for each function. This metadata includes the function’s unique identifier, the type of computation it performs (integer, floating‑point, vector, neural‑network, etc.), estimated execution time, memory access patterns, and data dependencies.
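The paper does not publish code for the decoder, but the per-function metadata it describes can be sketched as a simple data model. The names below (`FunctionMeta`, `FuncKind`, `build_call_graph`) are hypothetical illustrations, not identifiers from the paper:

```python
from dataclasses import dataclass, field
from enum import Enum

class FuncKind(Enum):
    INTEGER = "integer"
    FLOAT = "float"
    VECTOR = "vector"
    NEURAL = "neural"

@dataclass
class FunctionMeta:
    func_id: int                  # unique identifier assigned by the decoder
    kind: FuncKind                # type of computation the function performs
    est_cycles: int               # estimated execution time
    mem_footprint: int            # proxy for the memory access pattern (bytes touched)
    callees: list[int] = field(default_factory=list)  # call-graph edges

def build_call_graph(functions: list[FunctionMeta]) -> dict[int, list[int]]:
    """Map each function ID to the IDs of the functions it calls."""
    return {f.func_id: list(f.callees) for f in functions}

funcs = [
    FunctionMeta(0, FuncKind.INTEGER, 120, 4096, callees=[1, 2]),
    FunctionMeta(1, FuncKind.VECTOR, 900, 65536),
    FunctionMeta(2, FuncKind.FLOAT, 300, 8192),
]
graph = build_call_graph(funcs)   # {0: [1, 2], 1: [], 2: []}
```

In a real decoder the metadata would be extracted by parsing the binary's symbol table and control-flow graph rather than declared by hand; the structure above only illustrates what the scheduler would consume.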
Using this information, the authors propose an address‑mapping mechanism that creates a runtime table linking each function ID to the most appropriate FPU type. The mapping takes into account three primary factors: (1) the specialization of the target FPU, (2) the current load on each FPU, and (3) the cost of moving data between functions. For example, a function that performs large matrix multiplications would be mapped to a vector‑oriented FPU, while a control‑heavy function with many conditional branches would be assigned to a general‑purpose integer FPU.
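The three mapping factors can be combined into a single score per candidate FPU. The weighting constants and the `score`/`map_function` names below are illustrative assumptions, not values from the paper:

```python
def score(fpu: dict, func: dict, transfer_cost: int) -> float:
    # Higher is better: reward specialization, penalize current load and data movement.
    match = 2.0 if fpu["kind"] == func["kind"] else 0.5
    return match - 0.01 * fpu["load"] - 0.001 * transfer_cost

def map_function(func: dict, fpus: list[dict], transfer_costs: dict) -> int:
    """Return the ID of the FPU with the best combined score for this function."""
    best = max(fpus, key=lambda f: score(f, func, transfer_costs[f["id"]]))
    return best["id"]

fpus = [
    {"id": 0, "kind": "vector", "load": 10},
    {"id": 1, "kind": "integer", "load": 2},
]
matmul = {"id": 7, "kind": "vector"}           # the paper's matrix-multiply example
mapping = {matmul["id"]: map_function(matmul, fpus, {0: 100, 1: 100})}
# The vector FPU wins despite its higher load: {7: 0}
```

The runtime table the paper describes would simply be this mapping maintained for every function ID, refreshed as FPU loads change.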
The scheduling component of FPS consults the mapping table whenever a function call occurs. Instead of the conventional fetch‑decode‑execute cycle, the system adopts a “function push” model: the caller packages the function’s input data and pushes it into the local buffer of the selected FPU. After execution, the result is returned via a shared memory region or a message‑passing queue, ready for the next function in the call chain. This model preserves pipeline depth while dramatically reducing synchronization overhead, because functions are treated as independent work items rather than sequences of low‑level instructions.
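The "function push" model maps naturally onto a worker with a local buffer: the caller packages inputs and pushes the work item; the FPU drains its queue and writes results to a shared region. A minimal single-FPU sketch, using a thread and queue as stand-ins for hardware buffers (all names are illustrative):

```python
import queue
import threading

def fpu_worker(inbox: queue.Queue, results: dict) -> None:
    """Drain the FPU's local buffer of pushed function work items."""
    while True:
        item = inbox.get()
        if item is None:                     # shutdown sentinel
            break
        func_id, fn, args = item
        results[func_id] = fn(*args)         # "execute" the pushed function
        inbox.task_done()

inbox = queue.Queue()       # the FPU's local buffer
results = {}                # stand-in for the shared memory region
worker = threading.Thread(target=fpu_worker, args=(inbox, results))
worker.start()

# Caller packages the input data and pushes the call into the buffer.
inbox.put((1, lambda a, b: a + b, (2, 3)))
inbox.put((2, lambda xs: sum(xs), ([1, 2, 3],)))
inbox.put(None)
worker.join()
# results == {1: 5, 2: 6}
```

Note the inversion the paper emphasizes: nothing here fetches instructions; work arrives at the FPU as complete, self-contained function invocations.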
Two scheduling policies are examined. The first, a FIFO‑based “function‑first” policy, preserves the original call order and is simple to implement, but it may underutilize heterogeneous resources. The second, a “dynamic resource‑optimisation” policy, continuously monitors FPU load and function characteristics to assign each function to the most suitable unit at runtime. Simulation results show that the dynamic policy yields an average performance improvement of over 30 % compared to FIFO, with the greatest gains observed in workloads dominated by floating‑point or vector operations.
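The gap between the two policies can be shown with a toy makespan simulation. The cost model (a mismatched FPU takes 3x the cycles) and the round-robin interpretation of the FIFO policy are assumptions for illustration, not the paper's simulator:

```python
def run(funcs: list[tuple[str, int]], fpus: list[str], policy: str) -> int:
    """Return the makespan: funcs = [(kind, cycles), ...], fpus = [kind, ...]."""
    load = [0] * len(fpus)
    for i, (kind, cycles) in enumerate(funcs):
        if policy == "fifo":
            j = i % len(fpus)   # preserve call order, ignore specialization
        else:                   # dynamic resource optimisation
            j = min(range(len(fpus)),
                    key=lambda k: load[k] + (cycles if fpus[k] == kind
                                             else 3 * cycles))
        load[j] += cycles if fpus[j] == kind else 3 * cycles
    return max(load)

funcs = [("vector", 100)] * 4 + [("integer", 50)] * 4
fpus = ["vector", "integer"]
fifo = run(funcs, fpus, "fifo")        # 700 cycles
dyn = run(funcs, fpus, "dynamic")      # 450 cycles
```

Even this crude model reproduces the qualitative result: the dynamic policy wins most when the workload is dominated by functions (here, vector ones) that are expensive on the wrong unit.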
Despite these promising results, the authors acknowledge several practical challenges that must be addressed before hardware realization. Function decoding itself can be computationally expensive; if the decoder becomes a bottleneck, overall latency may increase. Heterogeneous FPUs require a common interface and memory model, which raises issues of bus arbitration, cache coherence, and power management. Moreover, the architecture must differentiate between stateless functions (which can be freely dispatched) and stateful functions that require explicit context handling; the paper suggests a separate state‑management unit for the latter.
In summary, the research proposes a paradigm shift: moving from instruction‑level parallelism to function‑level parallelism by “feeding” functions into a farm of specialised processors. The current work remains at a theoretical and simulation stage, and the authors stress that a full hardware implementation would demand a large, multidisciplinary engineering effort and a realistic development timeline. Future work outlined includes tighter integration with compilers for static function analysis, hardware‑accelerated decoders, dynamic re‑scheduling mechanisms, and power‑aware FPU designs. The paper thus sets a foundation for exploring how function‑centric execution models could overcome the scalability limits of conventional CPUs.