Memory-constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms


Authors: Shuoxin Lin, Jiahao Wu, Shuvra S. Bhattacharyya

SHUOXIN LIN, University of Maryland, USA
JIAHAO WU, University of Maryland, USA
SHUVRA S. BHATTACHARYYA, University of Maryland, USA and Tampere University of Technology, Finland

The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task and data-level parallelism to meet throughput requirements of digital signal processing (DSP) applications. Moreover, in the presence of system-level memory constraints, hand optimization of code to satisfy these requirements is inefficient and error-prone, and can therefore greatly slow down development time or result in highly underutilized processing resources. In this paper, we present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level memory constraints. The methods operate on synchronous dataflow representations, which are widely used in the design of embedded systems for signal and information processing. We show that our novel methods can significantly improve system throughput compared to previous vectorization and scheduling approaches under the same memory constraints. In addition, we present a practical case study of applying our methods to significantly improve the throughput of an orthogonal frequency division multiplexing (OFDM) receiver system for wireless communications.

Additional Key Words and Phrases: Dataflow models, design optimization, heterogeneous computing, signal processing systems, software synthesis

1 INTRODUCTION

Heterogeneous multiprocessor platforms are of increasing relevance in the design and implementation of many kinds of embedded systems.
Among these platforms, heterogeneous CPU-GPU platforms (HCGPs), which integrate multicore central processing units (CPUs) and graphics processing units (GPUs), have been shown to significantly boost throughput for many applications. System-level performance optimization requires efficient utilization of both CPU cores and GPUs on HCGPs. In embedded system designs, multiple system constraints must be met, including memory, latency or cost requirements. Manual performance tuning on a case-by-case basis suffers from inefficiency and can lead to highly sub-optimal solutions. When system constraints or the target platforms are changed, the designer often needs to repeat the same process, which further reduces development productivity and increases the chance of introducing implementation errors. Therefore, methods for HCGPs that are based on high-level models, and systematically explore parallelization opportunities, are highly desirable.

Dataflow models provide high-level abstractions for specifying, analyzing and implementing a wide range of embedded system applications (e.g., see [2]). A dataflow graph is a directed graph G = (V, E) with a set of vertices (actors) V and a set of edges E. An actor v ∈ V represents a computational task of arbitrary complexity. An edge e = (u, v) ∈ E represents a first-in, first-out (FIFO) buffer that stores data values as they are produced by u and consumed by v.

Authors' addresses: Shuoxin Lin, University of Maryland, College Park, MD, 20742, USA, slin07@umd.edu; Jiahao Wu, University of Maryland, College Park, MD, 20742, USA, jiahao@umd.edu; Shuvra S. Bhattacharyya, University of Maryland, College Park, MD, 20742, USA, and Tampere University of Technology, Tampere, Finland, ssb@umd.edu. To appear in the ACM Transactions on Embedded Computing Systems.
Fig. 1. An illustration of parallelism expressed using SDF graphs: (a) task parallelism, (b) pipeline parallelism, (c) data parallelism based on vectorization.

These data values are called tokens, and represent the basic unit of data that is processed by actors. When an actor fires, it consumes tokens from its input edges, executes its associated task, and produces tokens on its output edges. Synchronous dataflow (SDF) is a specialized form of dataflow in which the numbers of tokens produced and consumed on each edge are constant across all firings of its source and sink actors [15]. These two numbers are called the production rate and consumption rate of an edge. Generally, the production rate and consumption rate of an SDF edge can take on any positive integer value. SDF graphs are powerful tools for analyzing and optimizing important system-level metrics, including memory requirements, latency, and throughput. Additionally, SDF graphs naturally expose pipeline, task and data parallelism across distinct actors and distinct firings of the same actor, as illustrated in Figure 1. Pipeline and task parallelism can be exploited by assigning actors to different cores or processors (Figure 1(a) and 1(b)), while exploitation of data parallelism can be enhanced by vectorization of actors such that different sets of tokens are processed by the same actor concurrently on data-parallel hardware (Figure 1(c)).

GPUs in HCGPs accelerate computational tasks by supporting large-scale data parallelism with hundreds or thousands of SIMD (single instruction multiple data) processors. GPUs can achieve high throughput gain over CPUs when parallel data is abundant. However, when parallel data is insufficient, GPU performance can be worse compared to CPU cores. For an SDF graph, a sufficient amount of parallel data may not be present to effectively utilize a GPU.
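The token and firing semantics just described can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation; the `Edge` class and the placeholder zero-valued tokens are our own assumptions.

```python
from collections import deque

class Edge:
    """A FIFO buffer between two actors with fixed SDF rates."""
    def __init__(self, src, snk, prd, cns):
        self.src, self.snk = src, snk
        self.prd, self.cns = prd, cns   # production / consumption rate
        self.tokens = deque()

def fire(actor, graph):
    """Fire `actor`: consume cns(e) tokens from each input edge and
    produce prd(e) tokens on each output edge."""
    ins  = [e for e in graph if e.snk == actor]
    outs = [e for e in graph if e.src == actor]
    if any(len(e.tokens) < e.cns for e in ins):
        raise RuntimeError(f"{actor} is not fireable")
    consumed = [[e.tokens.popleft() for _ in range(e.cns)] for e in ins]
    for e in outs:                       # placeholder task: emit prd(e) tokens
        e.tokens.extend(0 for _ in range(e.prd))
    return consumed

# A two-actor pipeline A -> B with prd = 1, cns = 2:
g = [Edge("A", "B", prd=1, cns=2)]
fire("A", g); fire("A", g)   # two firings of A accumulate 2 tokens
fire("B", g)                 # now B is fireable and drains the edge
```

Because cns(e) = 2, the sink B only becomes fireable after two firings of the source A, which is exactly the rate mismatch that repetition counts (Section 3) capture.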
In this case, vectorization can be of great utility in improving the degree of exposed data parallelism, and the effective utilization of GPU resources. However, previous research on scheduling and software synthesis from SDF graphs has focused largely on task and pipeline parallelism, therefore providing inadequate support for GPU-targeted design flows. The developments in this paper are intended to address this gap.

In general, the average time required for an actor firing scales differently in terms of the vectorization factor between a CPU and GPU. Additionally, overheads involving interprocessor communication and synchronization can limit or even negate performance gains achieved through vectorization. Thus, effective throughput optimization for HCGPs requires rigorous joint consideration of vectorization and scheduling.

In this paper, we develop integrated vectorization and scheduling (IVS) techniques for software synthesis targeted to HCGPs. These techniques jointly consider vectorization and scheduling for thorough optimization of SDF graphs. We refer to this problem of joint vectorization and scheduling as the SDF vectorization-scheduling throughput optimization (VSTO) problem, or simply as "VSTO". Our contribution is summarized as follows. First, we formally present the VSTO problem for HCGPs. Second, we develop a set of novel vectorization and scheduling techniques for VSTO under memory constraints. Third, we propose a new scheduling strategy called Σ-scheduling that is effective for mapping dataflow actors on heterogeneous computing platforms. Finally, we demonstrate our approaches to VSTO by applying them to a large collection of synthetic, randomly-generated dataflow graphs and an Orthogonal Frequency Division Multiplexing (OFDM) receiver.

2 RELATED WORK

SDF throughput analysis under resource constraints using explicit state space exploration has been studied in [9].
In [19], the authors present a scheduling algorithm for SDF graphs that applies static topological analysis and vectorization to improve SDF throughput and memory usage on shared-memory, multicore platforms. In [4], a buffer optimization technique for pipelined, multicore schedules is discussed.

Earlier work on SDF vectorization has focused on throughput optimization for single-processor implementation on programmable digital signal processors, and more recently, on multicore implementation. SDF vectorization techniques to maximize throughput for single-processor implementation were first developed in [20]. In [14], the authors presented methods to construct vectorized, single-processor schedules that optimize throughput under memory constraints. In [13], the authors presented techniques for maximizing throughput when simulating SDF graphs on multicore platforms. These techniques simultaneously optimize vectorization, inter-thread communication, and buffer memory management. In these works, SIMD architectures are not involved, and vectorization is applied to reduce synchronization overhead and context switching rather than to exploit data parallelism.

Various studies have targeted automated exploitation of parallelism to map dataflow models onto heterogeneous computing platforms. Design tools that exploit various forms of parallelism using CUDA or OpenCL have been developed in [5, 17, 21]. These tools assume that vectorization has been specified by the designer, and map an actor onto a GPU whenever a GPU-accelerated implementation of the actor is available. For such actors, these tools do not take into account the possibility that CPU-targeted execution may be more efficient. In [29], SDF graphs are automatically vectorized, transformed to single-rate SDF graphs, and then scheduled using Mixed-Integer Programming techniques. However, this approach does not take memory constraints into account.
Intuitively, a single-rate SDF graph is one in which all actors are fired at the same average rate. This concept is discussed in more detail in Section 3. When SDF graphs are converted to single-rate graphs, they can be scheduled in the same way that task graphs are scheduled in programming environments such as StarPU [1], FastFlow [10], and OmpSS [8]. These environments support run-time task graph scheduling and parallelization on hybrid CPU-GPU platforms. StarPU, for example, uses the Heterogeneous Earliest Finish Time (HEFT) heuristic to schedule tasks on HCGPs. However, these programming models cannot directly be applied to multirate SDF graphs; a designer must manually vectorize the graph and convert it to a single-rate SDF graph before working with it in such environments. In addition to requiring such manual transformation, this process limits the flexibility in vectorization and scheduling for SDF execution, which can lead to inefficient memory usage and execution-time performance.

Dataflow models can be used at arbitrary levels of abstraction in computing systems, and hence compilation optimization of dataflow programs is also investigated at various levels of abstraction. For example, the works in [12, 28] focus on improving the throughput of GPU kernels that are represented by dataflow graphs. The aim of those works is to generate high-performance GPU kernel code through better utilization of on-device resources. In contrast, the methods introduced in this paper focus on optimizing the mapping of coarse-grain, system-level dataflow models onto CPU-GPU platforms, where each actor can encompass a computational task of arbitrary complexity, and can encapsulate one or multiple kernels.

In this work, we go beyond the previous works by jointly considering SDF vectorization and scheduling for HCGPs under memory constraints.
To our knowledge, our work is the first to take memory constraints into account in the context of SDF vectorization and scheduling for heterogeneous computing platforms. Our methods are not restricted to single-rate SDF graphs, and are capable of deriving efficient, memory-constrained vectorization configurations. The techniques in this paper are developed in the DIF-GPU framework, which was presented in [16]. DIF-GPU incorporates techniques for minimizing runtime overhead through compile-time scheduling and incorporation of carefully-designed protocols for interprocessor communication.

3 BACKGROUND

The HCGPs that we target in this paper consist of one multi-core CPU and one GPU each. This class of multicore architectures is widely used in embedded systems. In our targeted class of HCGPs, we refer to the CPU as the host, as it controls overall execution flow and manages the associated GPU, and we refer to the GPU as the device. The device receives instructions and data from the host.

Additionally, in the target architecture, there exists a context transfer overhead when an application's execution path switches between CPU cores and a GPU. This overhead can include the time for interprocessor communication and synchronization, context switching, and transferring data from one memory address to another. Although most existing embedded HCGPs provide shared physical memory, this context transfer overhead can still be significant, and in general varies from one architecture/application to another [11]. We refer to such context transfer overhead as host-to-device (H2D) or device-to-host (D2H) context transfer, depending on the direction.

Given an SDF graph G = (V, E) and an actor v ∈ V, we denote the sets of input and output edges of v as in(v) and out(v), respectively. Given an edge e ∈ E, we denote the source and sink actors of e by src(e) and snk(e), respectively.
We denote as prd(e) the number of tokens produced onto e by each firing of src(e), and similarly, we denote as cns(e) the number of tokens consumed from e by each firing of snk(e).

Signal processing systems represented as SDF graphs are often required to be executed indefinitely, that is, iterated through a number of iterations for which no useful bound is known in advance. To support such indefinite execution, the concepts of consistency and periodic schedules in SDF graphs are important [15]. An SDF graph is consistent if it has a periodic schedule, which is a sequence of actor executions that does not deadlock, fires each actor at least once, and produces no net change in the number of tokens on each edge. Consistent SDF graphs can be executed indefinitely with finite buffer memory requirements. Furthermore, for each actor v ∈ V in a consistent SDF graph G = (V, E), there is a unique repetition count q(v), which gives the minimum number of firings of v in a periodic schedule. We call a set of actor firings in which each actor v fires exactly q(v) times an iteration of G. Figure 2(a) shows an SDF graph example, where each repetition count is denoted as <q(v)> above the corresponding actor v. In this example, prd(e_AB) = 1, cns(e_AB) = 2, q(A) = 2, and q(C) = 7.

If q(v) = 1 for every actor v ∈ V, then G is called a single-rate SDF graph, as shown in Figure 2(c). Because each actor needs to fire only once to complete an iteration of G, single-rate SDF graphs can be scheduled the same way as task graphs (e.g., see [29]). In a task graph, nodes represent computational tasks, and edges represent dependencies associated with pairs of nodes, without any specific data structure implied for inter-actor communication. A wide variety of algorithms have been developed for scheduling task graphs onto multiprocessor systems (e.g., see [24]).
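The repetition counts q(v) can be computed by solving the SDF balance equations q(src(e)) × prd(e) = q(snk(e)) × cns(e) over all edges. The sketch below is our own illustration (not the authors' tool) for connected graphs; the three-actor chain in the example is a hypothetical graph chosen only to reproduce the counts q(A) = 2, q(C) = 7 quoted above for Figure 2(a).

```python
from fractions import Fraction
from functools import reduce
from math import lcm

def repetition_vector(actors, edges):
    """Solve the balance equations q(src)*prd == q(snk)*cns on every edge
    of a connected SDF graph; return the minimal integer q(v) per actor.
    edges: list of (src, snk, prd, cns) tuples."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate rational rates over the graph
        changed = False
        for src, snk, prd, cns in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * prd / cns; changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * cns / prd; changed = True
    for src, snk, prd, cns in edges:    # consistency check on every edge
        assert q[src] * prd == q[snk] * cns, "graph is not consistent"
    scale = reduce(lcm, (f.denominator for f in q.values()), 1)
    return {v: int(f * scale) for v, f in q.items()}

# Hypothetical chain A -> B -> C with prd/cns rates (1,2) and (7,1):
print(repetition_vector(["A", "B", "C"],
                        [("A", "B", 1, 2), ("B", "C", 7, 1)]))
# {'A': 2, 'B': 1, 'C': 7}
```

Scaling the rational solution by the least common multiple of the denominators yields the smallest positive integer vector, which is exactly the repetition count of the text.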
For implementation of G, we assume a static buffer allocation model, where we allocate a FIFO buffer of fixed, finite size ("buffer bound") buf(e) for each edge e ∈ E. When an actor v fires, it must satisfy: (1) for each edge e_i ∈ in(v), e_i contains at least cns(e_i) tokens, and (2) for each edge e_o ∈ out(v), e_o contains no more than (buf(e_o) − prd(e_o)) tokens. When this condition is met, the actor is said to be bounded-buffer fireable, and SDF graph execution following this rule is called bounded-buffer execution.

The minimum buffer requirement for an SDF graph G, mbr(G), is the minimum over all periodic schedules of the amount of memory (in units of tokens) required to implement the dataflow edges in a given graph (see [25]).

Fig. 2. An example of vectorization and minimum buffer requirements. (a) Original graph. (b) Actor-level vectorization of A by 2. (c) Graph-level vectorization with β = 2. (d) Actor-level vectorization of B by 2.

A lower bound mbr(e) on the minimum buffer requirement for a delayless SDF edge e can be determined by

mbr(e) = prd(e) + cns(e) − gcd(prd(e), cns(e)),    (1)

where gcd represents the greatest common divisor operator [3]. The corresponding lower bound on mbr(G) is the sum of mbr(e) over all edges: mbr(G) ≥ Σ_{e ∈ E} mbr(e). Although this lower bound is not always achievable, it is achievable for the dataflow graphs in Figure 2.

We represent the individual processors in the target multiprocessor platform as P = {p_1, p_2, ..., p_N}, where p_1, p_2, ..., p_{N−1} represent the available CPU cores, and p_N represents the GPU. When scheduling G onto the platform, actor firings are assigned to processors to be executed. In this context, we say that an actor v ∈ V is mapped onto processor p ∈ P if all firings of v are assigned to execute on p.
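Equation (1) and the bounded-buffer firing condition translate directly into code. The sketch below is a minimal illustration; the edge-id dictionaries are our own encoding, not the paper's data structures.

```python
from math import gcd

def mbr_edge(prd, cns):
    """Eq. (1): lower bound on the buffer requirement of a delayless
    SDF edge, mbr(e) = prd(e) + cns(e) - gcd(prd(e), cns(e))."""
    return prd + cns - gcd(prd, cns)

def bounded_buffer_fireable(actor, edges, tokens, buf):
    """Bounded-buffer firing condition: each input edge of `actor` holds
    at least cns(e) tokens, and each output edge has room for prd(e)
    more tokens under its bound buf(e).  `edges` maps an edge id to a
    (src, snk, prd, cns) tuple; `tokens` and `buf` are keyed by edge id."""
    for eid, (src, snk, prd, cns) in edges.items():
        if snk == actor and tokens[eid] < cns:
            return False          # rule (1) violated: not enough input tokens
        if src == actor and tokens[eid] + prd > buf[eid]:
            return False          # rule (2) violated: output buffer too full
    return True

# Edge e_AB of Figure 2(a): prd = 1, cns = 2, so mbr(e_AB) = 2.
edges = {"eAB": ("A", "B", 1, 2)}
buf = {"eAB": mbr_edge(1, 2)}
print(mbr_edge(1, 2))                                        # 2
print(bounded_buffer_fireable("A", edges, {"eAB": 2}, buf))  # False (buffer full)
print(bounded_buffer_fireable("B", edges, {"eAB": 2}, buf))  # True
```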
As mentioned in Section 2, we assume in this paper that the input SDF graphs for vectorization and software synthesis are acyclic. First, cycles in synchronous dataflow models may impose complex constraints on which vectorization degrees are valid for actors [20]. Second, cycles introduce complex trade-offs between code size and buffer memory minimization in SDF graphs, which are also relevant to memory-constrained vectorization problems (e.g., see [3]). Third, acyclic SDF graphs encompass a broad class of important signal processing applications, so techniques for this class have significant practical relevance [3]. Currently in our framework, we assume that actor vectorizations are constrained only by memory, and not by cycles in the input graph. Investigating vectorization with topological constraints caused by cycles is an interesting direction for future work.

4 PROBLEM FORMULATION

In this section, we formally define the VSTO problem for HCGPs. We begin by defining the concept of actor-level vectorization. Given a consistent SDF graph G = (V, E) and an actor v ∈ V, the vectorization of v by a vectorization degree (VCD) b is defined as a transformation of G that involves the following set of operations: (1) replacing v by v_b, where firing v_b is equivalent to b consecutive firings of v; (2) replacing each edge e_i ∈ in(v) by an edge e'_i such that cns(e'_i) = b × cns(e_i) and prd(e'_i) = prd(e_i); and (3) replacing each edge e_o ∈ out(v) by an edge e'_o such that prd(e'_o) = b × prd(e_o) and cns(e'_o) = cns(e_o). We refer to the actor v_b as the b-vectorized actor of v, and the transformed graph that results from the vectorization operation as vect(G, v, b). For example, in Figure 2, G_b = vect(G_a, A, 2).
The denition of vectorization that we adopt her e corresponds to a dataow graph transformation that is consistent with the vectorization concept intr oduced by Ritz et al. [ 20 ], as opposed to the aggregation of basic operations that corresponds to vectorization in compilers for procedural programming languages. If G is a consistent, acyclic SDF graph, then vect ( G , v , b ) is also consistent for any v ∈ V , and any positive integer b . Howev er , in this work, we restrict the set of allowable vectorization degrees to the set alwb ( G , v ) , which is dene d as alwb ( G , v ) = { n ∈ { 1 , 2 , . . . } | ( n is a factor of q ( v )) or ( n is a multiple of q ( v )) } . (2) Equation 2 refers specically to positive integer factors and multiples. For example, if q ( v ) = 8 , then alwb ( G , v ) = { 1 , 2 , 4 , 8 , 16 , 24 , . . . } . V ectorization of an actor v that is restricted to alwb ( G , v ) enables fast derivation of repetition counts for G ′ = vect ( G , v , b ) , which in turn facilitates incremental vectorization techniques, wher e actors are selected for vectorization one at a time accor ding to specic greedy criteria. In particular , if b is a factor of q ( v ) , then q ( G ′ , v ) = q ( G , v )/ b , while the repetition counts of all other actors are unchanged. Similarly , if b is a multiple of q ( v ) , then q ( G ′ , v ) = 1 , while for any other actor u , v , q ( G ′ u ) = b q ( G , u )/ q ( G , v ) . In Section 5 , we discuss specic techniques for incremental vectorization that apply these forms of repetition count updates. On HCGPs, v ectorized actors can exploit SIMD processors such as GP Us to execute multiple rings of the same actor in parallel. Note that although parallel processing of tokens cannot in general be applied easily to stateful actors, vectorization may still benet dataow execution by reducing overheads associated with inter-processor communication, synchronization and context switching. 
In the presence of memory constraints, there are limits to the amount of vectorization that can be applied. For example, as we can see in Figure 2, vectorizing A by 2 (Fig. 2(b)) and vectorizing B by 2 (Fig. 2(d)) result in different increases to the minimum buffer requirement.

To represent SDF graphs with vectorized actors and their relationships with the original graphs, we define vectorized SDF graphs (VSDFs) as follows.

Definition 4.1. Suppose that G = (V, E) is a consistent SDF graph, b_v ∈ alwb(G, v) is a VCD for each v, and B = {(v, b_v) | v ∈ V}. Then the B-vectorized SDF graph of G is defined as G_B = (V_B, E_B), where (1) each v_B ∈ V_B is the b_v-vectorized actor of v, (2) each edge e_B = (x_B, y_B) in G_B is derived from the corresponding edge (x, y) ∈ E, and (3) for each e_B = (u_B, v_B) ∈ E_B, prd(e_B) = b_u × prd(e) and cns(e_B) = b_v × cns(e), where e = (u, v).

The vectorized graph G_B is an SDF graph. We define a restricted form of vectorization, called graph-level vectorization (GLV), in which a common "repetitions vector multiplier" β ∈ {1, 2, ...} is used for all actors in the input graph. That is, b_v = β × q(G, v) for all v ∈ V. In this context, we refer to β as the graph vectorization degree (GVD). Under GLV, G_B is a single-rate SDF graph. However, vectorization does not need to be confined to GLV. We refer to the more general form of vectorization as actor-level vectorization (ALV). For example, Figure 2(c) shows the vectorized graph that corresponds to Figure 2(a) with GLV and β = 2. Figure 2(d) shows the vectorized graph that results from applying ALV to Figure 2(a) with b_B = 2.
As discussed in Section 2, the conventional approach to solving VSTO involves three steps: (1) the designer or design tool sets the GVD based on memory constraints, (2) converts the SDF graph into a single-rate SDF graph using GLV, and (3) generates a schedule using task graph scheduling methods. Compared to ALV, GLV can require significantly larger buffers (see Figure 2(c)). The vectorization methods that we present in this paper go beyond these conventional approaches by considering general ALV solutions instead of being restricted only to GLV solutions.

For multiprocessor scheduling of ALV solutions, we introduce in this work a general scheduling strategy, which is suitable for HCGPs, and can loosely be viewed as a variant of the list scheduling strategy. This variant is adapted for memory-constrained, multiprocessor mapping of transformed graphs that result from ALV. This strategy is a static scheduling strategy that operates using compile-time estimates of actor execution times. The general strategy is defined as follows.

Definition 4.2. Given a consistent SDF graph G = (V, E), and a multiprocessor target architecture with a set of processors P, the Σ-scheduling strategy (1) statically assigns each actor v ∈ V to a processor p ∈ P, (2) statically determines a buffer bound buf(e) for each edge e ∈ E, and (3) iteratively selects a bounded-buffer fireable actor to fire on its assigned processor p as soon as p has completed all previously assigned executions. An algorithm that conforms to this scheduling strategy completes when all actors in G have been scheduled using the iterative process of Step (3).

The Σ-scheduling strategy is closely related to the Ω-scheduling strategy, which was introduced in [13]. Both the Σ and Ω strategies satisfy Parts (1) and (2) of Definition 4.2; the main difference is that with respect to Part (3), Σ-scheduling maps actors onto a finite number of processors, while Ω-scheduling assumes an unlimited number of processors.
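To make step (3) of Definition 4.2 concrete, here is a minimal list-scheduling-style simulator in the spirit of Σ-scheduling. It is a deliberate simplification, not the paper's algorithm: each actor fires once, and precedence edges stand in for full bounded-buffer SDF semantics; the mapping, execution times, and graph in the example are all hypothetical.

```python
def sigma_schedule(actors, mapping, exec_time, deps):
    """Sketch of steps (1) and (3) of Definition 4.2 for single-firing
    actors.  `mapping` is the static actor-to-processor assignment,
    `exec_time[v]` a compile-time estimate, `deps[v]` the predecessors
    that must finish before v can fire."""
    finish, proc_free = {}, {}
    remaining = set(actors)

    def start_time(v):
        # v starts when its processor is free and all predecessors are done
        return max([proc_free.get(mapping[v], 0.0)] +
                   [finish[d] for d in deps[v]])

    while remaining:
        ready = [v for v in remaining if all(d in finish for d in deps[v])]
        v = min(ready, key=start_time)       # fire the earliest-startable actor
        finish[v] = start_time(v) + exec_time[v]
        proc_free[mapping[v]] = finish[v]
        remaining.remove(v)
    return finish

# Hypothetical HCGP with two CPU cores and a GPU; A feeds B and C,
# which both feed D.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
mapping = {"A": "cpu0", "B": "cpu1", "C": "gpu", "D": "cpu0"}
times = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.0}
print(sigma_schedule(list(deps), mapping, times, deps))
```

Running this gives finish times {A: 1.0, B: 3.0, C: 1.5, D: 4.0}: D cannot start before its slowest predecessor B finishes, even though its processor is free earlier.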
Additionally, in our application of Σ-scheduling, we perform ALV to construct the input graph to the strategy. In contrast, Ω-scheduling in [13] is applied to the original (unvectorized) SDF graph.

To determine the buffer bounds {buf(e)} in Σ-scheduling, we apply the Ω-buffering technique defined in [13]. This technique derives the buffer bounds by applying Ω-scheduling, and setting each bound equal to the corresponding buffer size that results from Ω-scheduling. We refer to the buffer bound buf(e) for each edge e that is computed in this way as the Ω buffer bound for e. It is shown in [13] that Ω-buffering sustains maximum throughput for SDF graphs under Ω-scheduling, so that imposing these bounds imposes no theoretical limitation on throughput. Given an SDF graph G = (V, E), we denote by Ωbuf(G) the total buffer memory cost for G as determined by Ω-scheduling: Ωbuf(G) = Σ_{e ∈ E} buf(e).

Definition 4.3. Suppose that G = (V, E) is a consistent SDF graph, b_v ∈ alwb(G, v) is a VCD for each v, B = {(v, b_v) | v ∈ V}, S_B is a periodic schedule for the B-vectorized graph G_B, and T(S_B) is an estimate of the time required to execute a single iteration of S_B. Then, from the fundamental properties of periodic SDF schedules [15], we can derive a unique positive integer J(S_B, G), which we call the blocking factor of S_B relative to G, such that S_B executes each v ∈ V exactly (J(S_B, G) × q(G, v)) times. In this context, we define the relative throughput of S_B, or the throughput of S_B relative to G, as the quotient J(S_B, G)/T(S_B). This metric gives the average number of iterations of the original (unvectorized) SDF graph that is executed per unit time by the schedule S_B. As an example, in Figure 2, executing one iteration of G_b, G_c or G_d is equivalent to executing two iterations of G_a.
Thus, J(S_B, G_b) = J(S_B, G_c) = J(S_B, G_d) = 2. Intuitively, vectorization improves relative throughput when T(S_B) < J(S_B, G) × T(S), where S is the best available minimal-periodic (unvectorized) schedule for G. Such efficiency in the vectorized execution time T(S_B) can be achieved due to improved utilization of processing resources under carefully-optimized GLV and ALV configurations.

A limitation of the vectorization techniques developed in this paper is that they may increase latency, and thus, they may not be suitable for implementations in which latency is a critical performance metric. However, it is envisioned that the methods developed in this paper provide a useful foundation that can be built upon for latency-aware vectorization. Investigating adaptations of these methods to take latency constraints into account is an interesting direction for future work.

Based on the definitions introduced in this section, we formulate the VSTO problem as follows.

Definition 4.4. Let G = (V, E) be a consistent SDF graph, and P = {p_1, p_2, ..., p_N} be the set of processors in an HCGP, where p_1, p_2, ..., p_{N−1} represent the CPU cores, and p_N represents the GPU. Given a total memory budget M (a positive integer), the vectorization-scheduling throughput optimization problem, or VSTO problem, associated with G and P is the problem of finding a set B of vectorization degrees, and a schedule S_B for G_B = (V_B, E_B), such that the throughput of S_B relative to G is maximized subject to Ωbuf(G_B) ≤ M.

We refer to a set of ordered pairs C = {(v, c_v) | (v ∈ V) and (c_v ∈ alwb(G, v))} as an ALV configuration for G. Note that if an actor is not represented within a given ALV configuration (i.e., it does not appear as the first element of any ordered pair in the set), then the actor is assumed to be unvectorized (equivalent to a vectorization degree of 1).
Thus, the VSTO problem can be thought of as the problem of jointly determining an ALV configuration B together with a schedule for G_B such that the resulting schedule optimizes throughput subject to a given buffer memory constraint M.

The vectorization formulation and techniques developed in this paper assume that each SDF edge (FIFO buffer) is implemented in a separate block of memory. Various techniques have been developed in recent years to share memory efficiently among edges in multirate SDF graphs (e.g., see [6, 27]). Extending the techniques in this paper to incorporate such memory sharing techniques is a useful direction for future work.

5 VECTORIZATION AND SCHEDULING WITH MEMORY CONSTRAINTS

In this section, we develop three main heuristics for the VSTO problem, called Incremental Actor Vectorization (IAV), N-candidates IAV, and Mapping-Based Devectorization. These three heuristics can be viewed as "peers" in the sense that any one of them may be the preferable choice for a given application. Thus, the designer or a design tool can apply all three of these complementary methods and select the best result for a given application. This is how we have integrated the three heuristics in our DIF-GPU software framework. More details on the integration with software synthesis and associated experimental results are discussed in Section 6 and Section 7.

5.1 Incremental Actor Vectorization

In this section, we define a general approach for searching the space of ALV configurations that is based on selecting and vectorizing actors one at a time using specific greedy criteria. We refer to this general approach as Incremental Actor Vectorization (IAV). Each iteration of IAV, called an IAV iteration, involves the selection and vectorization of a single actor. This results in a sequence of intermediate vectorized graphs, I_1, I_2, ...
, I_N, where I_i is the transformed graph that results from IAV iteration i, and N is the total number of iterations before IAV terminates. The approach is incremental in both the dimension of actors and that of vectorization degrees; that is, each IAV iteration selects a single actor v and increases its vectorization degree to the next highest element of alwb(G, v). Given an actor v that has an associated vectorization degree b_v, we refer to this process of replacing b_v with the next highest element min{x ∈ alwb(G, v) | x > b_v} as stepping up the vectorization of v, or just "stepping up v".

In IAV, we define a "score" function to guide the vectorization process. At each algorithm iteration, IAV selects an actor that has the highest score among all actors whose stepping up would not result in a violation of the given memory budget M. Analogous to how different priority functions can be used to select tasks in multiprocessor list scheduling (e.g., see [24]), different score functions can be used to apply different ALV criteria in IAV. This contributes to a novel design space for the development of integrated vectorization and scheduling techniques.

The specific score functions that we experiment with in this work first apply Σ-scheduling to generate a schedule µ(i) of the current intermediate vectorized graph I_i onto the target HCGP P, and then use a specific metric to estimate the potential "gain" of each candidate stepping-up operation relative to the processor assignment associated with µ(i). Given a schedule S returned by Σ-scheduling and a dataflow graph G = (V, E), we define the processor assignment associated with S as the function mp_S : V → P such that for each v ∈ V, mp_S(v) gives the processor to which actor v is mapped according to S. The initial schedule µ(0) is derived by applying Σ-scheduling to the input (unvectorized) graph for IAV.
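Stepping up is simply a selection from a finite prefix of alwb(G, v); a one-function sketch (our own illustration):

```python
def step_up(allowed, b):
    """Stepping up: replace the current degree b with
    min{ x in alwb(G, v) | x > b }.  `allowed` is a finite prefix of
    alwb(G, v); returns None when no larger element is listed."""
    larger = [x for x in allowed if x > b]
    return min(larger) if larger else None

# With q(v) = 8, alwb(G, v) begins {1, 2, 4, 8, 16, 24, ...}:
seq = [1, 2, 4, 8, 16, 24]
print([step_up(seq, b) for b in [1, 2, 4, 8, 16]])  # [2, 4, 8, 16, 24]
```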
Algorithm 1 shows a pseudocode description of the IAV approach that employs this mapping-based method of score function formulation. In the remainder of this paper, we refer to the mapping-based form of IAV shown in Algorithm 1 as "Σ-IAV".

ALGORITHM 1: Integrated vectorization and mapping using Σ-IAV.
Function incrementalVectorize(G, P, M)
    initialize configs = ∅, G_B = G, B = {(v, 1) | v ∈ V};
    while memSize(G_B) ≤ M do
        mp = generateMapping(G_B, P);
        v* = argmax_{v ∈ V} score(B, mp, v);
        B(v*) = nextVCD(v*, b_{v*});
        G_B = vectorize(G, B);
        if memSize(G_B) ≤ M then
            configs = configs ∪ {(B, mp)};
        end
    end
    return argmax_{c ∈ configs} throughput(G, c)

In Algorithm 1, generateMapping is a placeholder for any Σ-scheduling technique that is applied to map a given intermediate vectorized graph onto the targeted heterogeneous platform P. In our implementation of Σ-IAV, we employ a specific Σ-scheduling technique called Incremental Actor Re-assignment (IAR) as the generateMapping function. The IAR technique is discussed further in Section 4. The function nextVCD(v, b_v) gives the smallest element of alwb(v) that exceeds b_v. The function throughput referenced in Algorithm 1 represents a placeholder for any function that is used to estimate the throughput of a mapping that is generated by generateMapping for an intermediate vectorized graph. In our implementation of Σ-IAV, we employ an efficient simulation-based approach for this kind of throughput estimation. This simulation approach is discussed further in Section 5.5. In general, heuristic-based mapping techniques, including our techniques, do not guarantee an optimal schedule. It is therefore possible for the throughput to get worse during incremental vectorization. For this reason, we assess the throughput of each computed configuration and then select the configuration that results in the best throughput.
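To make the control flow of Algorithm 1 concrete, the following Python sketch implements the same greedy loop over a dictionary-based ALV configuration. All of the named callables (`mem_size`, `generate_mapping`, `score`, `throughput`) and the toy values in the demo are hypothetical stand-ins for the corresponding components described in the text, not the DIF-GPU implementation.

```python
def incremental_vectorize(actors, alwb, mem_size, generate_mapping,
                          score, throughput, M):
    """Greedy sketch of Sigma-IAV (Algorithm 1): while the memory bound M
    holds, step up the highest-scoring actor; return the best stored
    (configuration, mapping) pair by estimated throughput."""
    B = {v: 1 for v in actors}          # start fully unvectorized
    configs = []
    while mem_size(B) <= M:
        mp = generate_mapping(B)
        cands = [v for v in actors if any(x > B[v] for x in alwb[v])]
        if not cands:                   # no actor can be stepped up further
            break
        v_star = max(cands, key=lambda v: score(B, mp, v))
        B = dict(B)
        B[v_star] = min(x for x in alwb[v_star] if x > B[v_star])
        if mem_size(B) <= M:            # keep only in-budget configurations
            configs.append((dict(B), mp))
    return max(configs, key=throughput) if configs else None

# Toy demo: two actors; memory cost = sum of degrees; the score always
# prefers actor 'A'; throughput is just the total vectorization degree.
best = incremental_vectorize(
    ['A', 'B'], {'A': [1, 2, 4], 'B': [1, 2]},
    mem_size=lambda B: sum(B.values()),
    generate_mapping=lambda B: {v: 'cpu0' for v in B},
    score=lambda B, mp, v: {'A': 2, 'B': 1}[v],
    throughput=lambda c: sum(c[0].values()),
    M=5)
print(best[0])   # -> {'A': 4, 'B': 1}
```

In the demo, stepping B up to 2 would exceed the budget M = 5, so the final in-budget configuration vectorizes only A, mirroring how Algorithm 1 discards over-budget configurations.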
We formulate and experiment with two specific score functions in this work. We refer to these score functions as time-saving (TMSV) and time-saving-per-byte (TMSVPB). The TMSV score for actor v during IAV iteration i is defined as the largest adjusted execution time reduction achievable (across all processors in P) when stepping up v. This "adjusted" time reduction is computed relative to the execution of v on mp_{μ(i)}(v), and is normalized by the vectorization degree. The units of this adjusted time reduction are thus "seconds per unit of vectorization". This score can be expressed as:

    tmsv(v, i) = max_{p ∈ P} ( t(v, b, mp_{μ(i)}(v)) / b − t(v, b′, p) / b′ ),    (3)

where b is the current VCD of v (in IAV iteration i), and b′ ∈ alwb(v) is the VCD that would result from stepping up v. For a given actor v, vectorization degree b ∈ profiled(v), and processor p ∈ P, t(v, b, p) gives the profiling-derived estimate for the execution time of v on p with vectorization degree b. Here, we use "profiling" as a general term that encompasses any method for deriving a compile-time estimate for the execution time of a vectorized actor execution. The specific approach to profiling that we use in our experiments is discussed in Section 6.

Fig. 3. A simple example to illustrate Σ-IAV using the TMSV score function.

Figure 3 shows a simple example of vectorization using the TMSV score function. The table in this figure provides analytical models, in terms of the vectorization degree v, that are used to derive the profiling function t. For example, the models estimate that actor A requires approximately (0.5 × v) units of time to execute. The IAV process begins with an unvectorized graph and an initial mapping where all actors are mapped to the CPU core. In the first IAV iteration (i = 0) shown in Figure 3, A has the largest TMSV score, so it is selected, and a new mapping is generated based on the VCDs.
In the second iteration, B has the largest TMSV score, so B is vectorized (stepped up), and the mapping is updated again. This process continues until no more vectorization operations can be carried out without exceeding the memory budget M. Under memory constraints, we expect that it will be more useful to consider the increase in buffer requirements when selecting actors for ALV. This motivates our formulation of the TMSVPB score function. Here, "PB" stands for "per byte." This memory-aware score function can be formulated as:

    tmsvpb(v, i) = max_{p ∈ P} ( ( t(v, b, mp_{μ(i)}(v)) / b − t(v, b′, p) / b′ ) / ( Ω_buf(G_{B′}) − Ω_buf(G_{B(i)}) + ϵ ) ),    (4)

where B(i) represents the current ALV configuration in IAV iteration i, and B′ = B(i) − {(v, b)} ∪ {(v, b′)} represents the candidate configuration that results from stepping up v. Here, ϵ is a small constant to avoid division by 0 when Ω_buf(G_{B′}) = Ω_buf(G_{B(i)}). Thus, the TMSVPB function favors actors whose vectorization results in throughput improvement without excessive increase in buffer requirements.

5.2 N-Candidates IAV

Our proposed Σ-IAV approach has two drawbacks: (1) it selects only one actor at each step, and (2) with the TMSV and TMSVPB score functions, the selections are based primarily on actor execution times, and do not take into account the SDF graph topology. We alleviate the first drawback by storing multiple vectorized-graph candidates to consider in each IAV iteration following the very first iteration. In particular, we store the N candidate graphs that provide the highest throughput when processed by Σ-scheduling. Here, N is a parameter that can be controlled by the designer or tool developer. The second drawback can be addressed by applying Σ-scheduling to optimize throughput over each actor for every candidate graph.
That is, for each candidate graph Y that is stored, and each actor v, we apply Σ-scheduling to the transformed graph that results from stepping up v in Y. We then take the best result from all of these Σ-scheduling-based evaluations to determine the vectorization operation that is to be applied in the associated IAV iteration. This approach results in some increase in complexity, but has the potential to perform significantly more thorough optimization at a relatively high level of design abstraction. We refer to this modified Σ-IAV approach as N-candidates IAV. Algorithm 2 provides a pseudocode description of N-candidates IAV. Here, the notation c.1 denotes the first element of the ordered pair c, and configs[1 : N] denotes the list that consists of the first N elements of the list configs. The function visited(B′) tests whether the vectorization configuration B′ has been examined before during operation of the algorithm.

ALGORITHM 2: A pseudocode description of N-candidates IAV.
Function nCandidatesVectorize(G = (V, E), P, M, N)
    initialize B = {(v, 1) | v ∈ V}, mp = generateMapping(G, P), configs = {(B, mp)}, flag = true;
    while flag = true do
        flag = false;
        foreach c ∈ configs do
            foreach v ∈ V do
                B′ = c.1 − {(v, b_v)} ∪ {(v, nextVCD(v, b_v))};
                if (visited(B′) = false) and (Ω_buf(G_{B′}) ≤ M) then
                    mp = generateMapping(G_{B′}, P);
                    configs = configs ∪ {(B′, mp)};
                    flag = true, visited(B′) = true;
                end
            end
        end
        sortByThroughput(configs);
        configs = configs[1 : N];
    end
    return argmax_{c ∈ configs} throughput(G, c)

As with our implementation of Σ-IAV, we employ in our implementation of N-candidates IAV the IAR technique (Section 4) as the generateMapping function. Similarly, our implementation of N-candidates IAV incorporates the simulation-based throughput estimation technique that is discussed in Section 5.5.
This estimation technique corresponds to the function called throughput in Algorithm 2. Intuitively, N-candidates IAV is a greedy method that tries to avoid unsatisfactory search paths by retaining multiple intermediate vectorized graphs during each IAV iteration. Larger values for the parameter N allow more extensive design space exploration at the cost of greater running time.

Fig. 4. An example that illustrates the utility of devectorization. (a) The original graph. (b) The graph with GVD = N applied. (c) The graph with devectorization applied to all CPU-mapped actors (C, F, K, S).

When N = 1, N-candidates IAV reduces to IAV with the score function being the estimated throughput ("throughput") of the transformed graph that results from the selected vectorization operation. In our implementation of N-candidates IAV, we estimate throughput using simulation. This simulation approach is discussed further in Section 5.5. In Algorithm 2, throughput(G, c) represents the estimate of throughput that is derived in this way for a given intermediate vectorized graph G that is based on ALV configuration c. Score functions other than throughput can also be used in N-candidates IAV. However, in our experiments, we found that among TMSV, TMSVPB, and throughput, the throughput score function produces the best results. Investigation of other score functions in this context is an interesting direction for future work. In our experiments, we use N = |V| as the number of candidates to be stored. We select N = |V| so that NIAV keeps a number of candidates that scales with the number of actors in the dataflow graph while keeping analysis time manageable. IAR, IAV, and NIAV are all greedy heuristics that use an evaluation metric (score function) to select vectorization choices at each step.
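The retain-and-expand behavior of N-candidates IAV amounts to a beam search of width N over ALV configurations. The sketch below is a simplified, hypothetical rendering of that idea (configurations stored as sorted tuples; `mem_size`, `generate_mapping`, and `throughput` are stand-in callables), not the DIF-GPU implementation.

```python
def n_candidates_vectorize(actors, alwb, mem_size, generate_mapping,
                           throughput, M, N):
    """Beam-search sketch of N-candidates IAV (Algorithm 2): expand every
    retained configuration by each legal single-actor step-up, then keep
    only the N highest-throughput (configuration, mapping) pairs."""
    key0 = tuple(sorted((v, 1) for v in actors))
    configs = [(key0, generate_mapping(dict(key0)))]
    visited = {key0}
    changed = True
    while changed:
        changed = False
        for B_key, _ in list(configs):        # snapshot of the current beam
            B = dict(B_key)
            for v in actors:
                higher = [x for x in alwb[v] if x > B[v]]
                if not higher:
                    continue
                B2 = dict(B)
                B2[v] = min(higher)            # step up v
                key = tuple(sorted(B2.items()))
                if key not in visited and mem_size(B2) <= M:
                    visited.add(key)
                    configs.append((key, generate_mapping(B2)))
                    changed = True
        configs.sort(key=throughput, reverse=True)
        configs = configs[:N]                  # retain the N best candidates
    return max(configs, key=throughput)

# Toy demo: memory cost = sum of degrees; throughput = total degree.
best = n_candidates_vectorize(
    ['A', 'B'], {'A': [1, 2, 4], 'B': [1, 2]},
    mem_size=lambda B: sum(B.values()),
    generate_mapping=lambda B: {v: 'cpu0' for v in B},
    throughput=lambda c: sum(dict(c[0]).values()),
    M=5, N=2)
print(dict(best[0]))   # -> {'A': 4, 'B': 1}
```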
Investigation of other types of heuristics to further improve vectorization is a useful direction for future work.

5.3 Mapping-Based Devectorization

N-candidates IAV is an incremental vectorization method that starts with an unvectorized graph, and gradually increases the VCDs of selected actors. In some cases, it may be advantageous to also consider decreasing VCDs during the optimization process. Such decreasing of VCDs can be useful to reduce memory consumption associated with selected actors so that memory can be dedicated to groups of other actors that provide greater throughput benefit through vectorization. A specific form of decrease that we consider in this section is devectorization, where an actor with VCD b > 1 is transformed to have no vectorization (VCD of unity). Figure 4(a) shows an example of this kind of scenario. Here, S (source), K (sink), F (fork), and C (combine) are computationally simple actors without potential for GPU acceleration, and with only very limited potential for speedup through CPU-based vectorization. On the other hand, actors A1, A2, A3, A4 have GPU-accelerated versions with significant throughput gain. In this case, however, the overall throughput gain is limited by the slowest of the four A_i's, so incrementally vectorizing individual A_i's does not directly impact throughput gain. To provide memory-efficient vectorization in cases where this kind of scenario is of dominant concern, we propose another vectorization method called Mapping-Based Devectorization (MBD). In contrast with ALV-based incremental vectorization, MBD applies GLV to first vectorize all vectorizable actors, and then performs devectorization on the transformed graph derived from GLV. MBD is useful in devectorizing actors that have relatively low CPU-based performance gain through vectorization, and in jointly considering vectorization improvements produced by groups of actors.
MBD performs GLV, generates a processor assignment A, and then evaluates for devectorization each actor that is mapped to a CPU core in A. If a given devectorization operation keeps the throughput at or above a pre-defined fraction r of the original throughput, the actor is devectorized. In our experiments, we set the threshold r empirically by experimenting with different values of r. We found in our experiments that r = 0.95 achieves the maximum throughput gain for MBD (see Section 6) on the same set of random graphs. The optimal choice of r may change for a different set of graphs. Alternatively, r can be customized for a given graph by performing a search (such as a binary search) to optimize this parameter. Although the MBD algorithm begins by applying GLV, the algorithm produces solutions that are in general ALV solutions. This is because of the application of devectorization later in the algorithm, which in general results in heterogeneous vectorization degrees across the set of actors in the input graph. In principle, the processor assignment A can be generated using any multiprocessor task graph scheduling technique. In our implementation of MBD, we employ the Heterogeneous Earliest Finish Time (HEFT) heuristic (e.g., see [1, 26]) to generate a schedule for the transformed graph that results from GLV, and then we extract the processor assignment from this generated schedule. Devectorization saves memory that would otherwise be spent on low-impact vectorization of actors that are mapped onto CPU cores. When memory constraints are loose enough to allow GLV, the MBD technique, based on the memory savings achieved through devectorization, may improve throughput by allowing greater GVDs to be applied. Figure 4(c) illustrates the application of MBD. In this example, since actors C, F, K, and S are mapped onto CPU cores, they are devectorized. As a result of this devectorization, the buffer requirements on edges (S, F) and (C, K) are reduced to 1 for each edge.
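The GLV-then-devectorize loop described above can be sketched as follows. This is a hypothetical simplification of MBD (uniform graph-level degree, stand-in callables, 'cpu'/'gpu' processor labels), shown only to make the control flow concrete.

```python
def mbd(actors, mem_size, generate_mapping, throughput, M, r=0.95):
    """Sketch of Mapping-Based Devectorization: raise a graph-level
    vectorization degree gvd, devectorize CPU-mapped actors that cost
    less than a fraction (1 - r) of throughput to unwind, and return
    the last configuration that fit within the memory bound M."""
    prev = None
    gvd = 1
    while True:
        B = {v: gvd for v in actors}           # uniform GLV configuration
        mp = generate_mapping(B)
        for v in (a for a in actors if mp[a] == 'cpu'):
            B2 = dict(B)
            B2[v] = 1                           # candidate devectorization
            if throughput(B2, mp) >= r * throughput(B, mp):
                B = B2                          # cheap to devectorize: keep it
        if mem_size(B) > M:
            return prev                         # last in-budget configuration
        prev = (B, mp)
        gvd += 1

# Toy demo: 'S' is a CPU-bound source whose degree does not affect
# throughput; 'A' is GPU-accelerated and determines the throughput.
best = mbd(['S', 'A'],
           mem_size=lambda B: sum(B.values()),
           generate_mapping=lambda B: {'S': 'cpu', 'A': 'gpu'},
           throughput=lambda B, mp: B['A'],
           M=6)
print(best[0])   # -> {'S': 1, 'A': 5}
```

In the demo, a uniform GVD would stall at degree 3 under the same budget (memory cost 6), whereas devectorizing S frees enough of the budget to raise A's degree to 5, illustrating the memory-savings argument in the text.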
Algorithm 3 provides a pseudocode description for MBD.

ALGORITHM 3: Mapping-Based Devectorization (MBD).
Function mappingBasedDevectorize(G = (V, E), P, M)
    initialize B = {(v, 1) | v ∈ V}, mp = generateMapping(G, P), configs = {(B, mp)}, G_B = G, gvd = 1;
    repeat
        B′ = B, mp′ = mp;
        B = graphVectDegrees(G, gvd);
        G_B = vectorize(G, B);
        mp = generateMapping(G_B);
        cpu_actors = {v ∈ V | v is mapped to a CPU core};
        foreach v ∈ cpu_actors do
            B′′ = B − {(v, b_v)} ∪ {(v, 1)};
            if throughput(G, (B′′, mp)) ≥ r × throughput(G, (B, mp)) then
                B = B′′;
            end
        end
        gvd = gvd + 1;
    until memSize(G_B) > M;
    return (B′, mp′)

5.4 Mapping Actors onto HCGPs

The Σ-IAV and N-candidates IAV methods presented in Section 5.1 and Section 5.2, respectively, both employ Σ-scheduling throughout the optimization process to generate schedules for intermediate vectorized graphs. The Σ-scheduling approach is useful in our iterative optimization context because it provides moderate-complexity, bounded-buffer scheduling of multirate SDF graphs. As mentioned in Section 5.1 and Section 5.2, we develop a specific Σ-scheduling technique called Incremental Actor Re-assignment (IAR) for use in both Σ-IAV and N-candidates IAV. In this section, we elaborate on the IAR technique. In contrast to time-intensive scheduling methods such as Mixed Integer Linear Programming and Genetic Algorithms, IAR is designed with computational efficiency as a primary objective. This is because IAR is invoked repeatedly during each IAV iteration; in particular, it is invoked for each candidate ALV configuration. Intuitively, IAR incrementally moves actors in Σ schedules from "busier" (more loaded) processors to less busy ones. Algorithm 4 provides a pseudocode description of the IAR method.
IAR initializes the actor assignment by mapping all actors that have GPU-accelerated versions onto the GPU, and all other actors onto a single CPU core. This results in an initial assignment that utilizes at most two processors (the GPU and one CPU core).

ALGORITHM 4: Incremental Actor Re-assignment (IAR).
Function generateMapping(G, P)
    for v ∈ V, initialize bestMp(v) = p_N if t(v, p_N) < ∞, and bestMp(v) = p_1 otherwise;
    initialize best = throughput(G, bestMp);
    foreach v ∈ V do
        mp = bestMp, th = best, p* = bestMp(v);
        Q = {q ∈ P | q ≠ p*};
        foreach p ∈ Q do
            mp′ = mp − {(v, mp(v))} ∪ {(v, p)};
            th′ = throughput(G, mp′);
            if th′ > th then
                mp = mp′, th = th′;
            end
        end
        if th > best then
            bestMp = mp, best = th;
        end
    end
    return (bestMp, best)

Then IAR iteratively computes the maximum throughput gain for all actor-processor pairs, and selects the pair that gives the highest throughput at each iteration. In this context, selection of an actor-processor pair (a, p) means that the current processor assignment of actor a will be discarded, and actor a will be assigned ("moved") to processor p. For this selection process, only actors that have not yet been selected during previous iterations are considered. The throughput gain is computed with the aid of the function denoted in Algorithm 4 as throughput. This function invokes the simulation-based throughput estimator discussed in Section 5.5. Each actor is moved at most once during execution of IAR.

5.5 Throughput Estimation

For compile-time throughput estimation, we have developed a throughput simulator for SDF graphs that follows bounded-buffer execution semantics (defined in Section 3) with a statically-determined processor assignment, as derived by the Σ strategy introduced in Section 4. The inputs to the simulator are: (1) the transformed SDF graph G_v that results from the candidate set of vectorization operations under evaluation; (2) the Σ mapping for G_v that is generated by IAR; (3) the Ω buffer bound for each edge in G_v; (4) an estimate of the execution time for each actor in G_v; and (5) an estimate of the context transfer time between the main memory and the device memory on the target platform.

Fig. 5. Layered structure of vectorization, scheduling, and performance estimation in the proposed design optimization framework.

To estimate the throughput of a vectorized SDF graph, we first map vectorized actors onto processors, and follow the approach of Σ-scheduling defined in Definition 4.2 to compute the schedule. Throughput is then estimated by simulating the execution of the derived schedule. In our experiments, the execution time estimates under different vectorization degrees for each actor, as well as the context transfer time, are derived from measurements of actor and context transfer execution on the target HCGP.

5.6 Summary

Figure 5 summarizes the developments of this section by illustrating relationships among the key analysis and optimization techniques that have been introduced. Recall that IAV, HEFT, and MBD stand, respectively, for incremental actor vectorization, heterogeneous earliest finish time, and mapping-based devectorization. Each directed edge in Figure 5 represents usage of one technique (at the sink of the edge) by another (at the source of the edge). For example, IAR is used by Σ-IAV.

6 EXPERIMENTS USING SYNTHETIC GRAPHS

In this section, we demonstrate the effectiveness of the models and methods developed in Section 5 through experiments that study throughput gain and running time. We compare our methods with the approach of applying graph-level vectorization (GLV) followed by task-graph scheduling. We use Heterogeneous Earliest Finish Time (HEFT) as the task-graph scheduling method in this comparison. HEFT is a commonly used task-graph scheduling method for HCGPs (e.g., see [1]).
The integration of HEFT with GLV can be viewed as a natural way to integrate SDF vectorization and scheduling using conventional techniques. We refer to the combination of GLV and HEFT as the GLV-HEFT baseline, or simply as GLV-HEFT. As implied by this terminology, GLV-HEFT is employed in this experimental study as a baseline for evaluating our proposed methods. The GLV-HEFT baseline applies both GPU acceleration and CPU-GPU multiprocessor scheduling. We demonstrate in this section that the ALV and IAR scheduling methods developed in this paper provide significant throughput gain over this baseline approach under given memory constraints.

6.1 Experimental Setup

We have developed an integrated software synthesis framework called DIF-GPU to provide a streamlined workflow that combines actor-level/graph-level vectorization, multirate/single-rate SDF scheduling, code generation, and runtime support on heterogeneous computing platforms with multi-core CPUs and GPUs. For details about the DIF-GPU framework, we refer readers to [16]. In the experiments presented in this section, we employ an HCGP consisting of a quad-core Intel i5-6400 CPU and an NVIDIA GeForce GTX 750 GPU. Actor implementations that are developed for multi-core CPU and GPU execution are compiled using GCC 4.6.3 and the NVIDIA CUDA compiler (NVCC) 7.0, respectively.

6.2 Synthetic Graph Generation

We use Task Graphs For Free (TGFF) [7] to generate large sets of synthetic SDF graphs with varied size and complexity. Key parameter settings that we use in TGFF are as follows: the maximum in-degree and out-degree for graph nodes are both set to 3, and the average and multiplier for the lower bound on the number of graph nodes are both set to 20. From the graph topologies generated by TGFF, we randomly map each graph vertex to a specific DSP actor type that has both a CPU-targeted and GPU-targeted implementation.
We perform this vertex-to-actor mapping for all actors in each randomly-generated graph. A broad set of DSP actor types, including actors for cross-correlation, FIR filtering, FFT computation, and vector algebra, are considered when performing this mapping. The GPU-accelerated implementations of these actors provide speedups from 1X to 20X compared to the corresponding multicore CPU implementations. This use of TGFF in conjunction with randomly generated actor mappings helps us to evaluate the performance of our proposed methods on a large variety of graph topologies. In our experiments, the source and sink actors are selected from a pool of different implementations of data sources and sinks. Because the input/output interfacing functionality in an embedded HCGP is typically implemented on a CPU, we assume that source and sink actors can only be mapped onto CPU cores. We profile the actors by measuring the execution times of the actors' firings on the target platform under a series of vectorization degrees. This profiled data is then used as input to the evaluated vectorization and scheduling techniques. The profiled data is also used to simulate the vectorization-integrated schedules that are derived from the proposed and baseline techniques. This simulation is based on the throughput simulator presented in Section 5.5. We use simulation here to enable efficient, automated comparisons across a large variety of different graph structures. In Section 7, we complement this simulation-based evaluation approach with our experimental evaluation of a case study involving an orthogonal frequency-division multiplexing (OFDM) receiver. The evaluation in Section 7 is performed by synthesizing software using DIF-GPU for the targeted HCGP platform, executing the synthesized software on the target platform, and measuring the resulting execution time performance.
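The simulation-based evaluation above relies on the bounded-buffer throughput simulator of Section 5.5. As a much simpler stand-in that conveys the kinds of inputs involved, one can lower-bound the iteration period by the busiest processor's work per graph iteration, charging a context-transfer cost on each CPU-GPU crossing edge. The sketch below is hypothetical (our own simplification, not the simulator used in the paper), with 'cpu0'/'gpu' as made-up processor labels.

```python
def estimate_throughput(edges, exec_time, mp, xfer_time):
    """Crude throughput bound for a mapped SDF graph: reciprocal of the
    busiest processor's total work per graph iteration, with a context
    transfer charged (here, to the GPU) for each CPU<->GPU crossing edge."""
    load = {}
    for v, t in exec_time.items():
        load[mp[v]] = load.get(mp[v], 0.0) + t
    for u, v in edges:
        if (mp[u] == 'gpu') != (mp[v] == 'gpu'):   # crossing edge
            load['gpu'] = load.get('gpu', 0.0) + xfer_time
    return 1.0 / max(load.values())

# Toy pipeline A -> B -> C with B offloaded to the GPU.
thr = estimate_throughput(
    edges=[('A', 'B'), ('B', 'C')],
    exec_time={'A': 1.0, 'B': 4.0, 'C': 1.0},
    mp={'A': 'cpu0', 'B': 'gpu', 'C': 'cpu0'},
    xfer_time=0.5)
print(thr)   # -> 0.2  (GPU load 4.0 + 2 * 0.5 dominates cpu0 load 2.0)
```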
6.3 Vectorization

In this section, we apply the different ALV methods introduced in Section 5 to a large collection of synthetic SDF graphs, and evaluate the performance of the derived schedules by simulating their bounded-buffer execution. Note that the baseline for speedup here is GLV-HEFT, where extensive vectorization has been applied, and both the CPU and GPU are used to schedule dataflow actors. This is a much "higher" baseline than a single- or multi-core CPU implementation without vectorization. Therefore, the speedup computed over this baseline is relatively small. The synthetic graphs are generated using TGFF together with randomized vertex-to-actor mappings, as described in Section 6.2. We evaluate the speedup over the GLV-HEFT baseline under different memory constraints. To compare speedups across SDF graphs that have different sizes (i.e., different numbers of actors and edges) and different multirate properties (as defined by the production and consumption rates on the actor ports), we introduce a concept of relative memory bounds as a normalized representation for memory constraints.

Fig. 6. Average speedups measured for four ALV techniques introduced in Section 5: TMSV Σ-IAV, TMSVPB Σ-IAV, N-candidates IAV, and MBD.

Given an algorithm A for performing GLV, the relative memory bound M(G) for an SDF graph G is defined as M(G) = M0 × α, where M0 is the memory cost of the GLV solution derived by Algorithm A when applied to G with GVD = 1, and α is a constant that represents the "tightness/looseness" of the applied memory constraint. We experiment with α ∈ {1.0, 1.5, ..., 4.5, 5.0} to cover a series of memory constraints ranging, respectively, from tight to loose.
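The relative memory bound is just a scaling of the GVD = 1 GLV memory cost. A minimal illustration (the M0 value below is made up for the example):

```python
def relative_memory_bound(m0, alpha):
    """M(G) = M0 * alpha, where M0 is the buffer cost of the GVD = 1 GLV
    solution and alpha controls the tightness of the memory constraint."""
    return m0 * alpha

# the alpha sweep used in Section 6.3: 1.0, 1.5, ..., 4.5, 5.0
alphas = [x / 2 for x in range(2, 11)]
bounds = [relative_memory_bound(1000, a) for a in alphas]
print(bounds[0], bounds[-1])   # -> 1000.0 5000.0
```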
Figure 6 shows the average simulated speedup that we measured from a set of randomly generated SDF graphs for the different ALV techniques that were introduced in Section 5. As mentioned previously, these speedups are in comparison to baseline solutions that are derived using the GLV-HEFT baseline technique. These results are for a target platform configuration that consists of 1 CPU core and 1 GPU. Here, "TMSV Σ-IAV" and "TMSVPB Σ-IAV" represent the Σ-IAV algorithm with the TMSV and TMSVPB score functions, respectively. The measured throughput gain ranges from 0.8X to 2.4X, and also exhibits significant variation from one SDF graph to another. We refer to ALV-IAR as the meta-algorithm that results from applying all four of the proposed ALV techniques, and selecting the best result from among the four derived solutions. In Section 7, we perform further experimental analysis of the ALV-IAR method, which provides a way to leverage complementary benefits of all of the key ALV techniques introduced in Section 5. ALV-IAR is useful, in particular, for design scenarios that can tolerate the relatively large optimization time that is required by N-candidates IAV, which dominates the time required by ALV-IAR. ALV-IAR demonstrates average and maximum speedup values of 1.36X and 2.9X on the benchmark set. We see that N-candidates IAV provides the largest average speedup by a significant margin, and this algorithm also provides the largest maximum speedup. We anticipate that this is because N-candidates IAV uses more vectorization candidate solutions throughout the search process. The other three ALV techniques achieve similar average and maximum throughput gain. We have also observed that the average speedup of the ALV methods increases until M(G) = 2.5 and then gradually drops off. When M(G) is close to 1, there is little room for vectorization, so ALV and GLV achieve similar throughput, and the average speedup is close to 1.
As M(G) increases to 2.5, more flexibility is provided for ALV to vectorize for better performance than GLV. When M(G) increases beyond 2.5, the memory is sufficient to allow relatively large vectorizations for all actors, so the throughput gain from enabling further vectorization diminishes. Although the MBD method and the two Σ-IAV methods achieve smaller average speedup compared to N-candidates IAV, they run significantly faster (see Section 6.4), and can be useful in cases where quicker turnaround time is desired from the software synthesis process. In addition, there are cases where they perform better than N-candidates IAV.

Fig. 7. Runtime of ALV methods under different memory constraints: (a) M = 2M0, and (b) M = 4M0.

6.4 Runtime

In this section, we compare the measured running times of the four proposed ALV techniques. We tested the running times of the ALV techniques on the same set S_g of randomly generated SDF graphs that we used in the experiments reported on in Section 6.3. The set S_g consists of 120 graphs, where the number of nodes in a given graph ranges from 3 to 30. Figure 7 shows the measured running times for the four ALV methods with respect to the number of nodes in the input graph. For each of the four ALV methods, there are 120 points plotted in each part of the figure: one point for each graph in S_g. Thus, Figure 7(a) and Figure 7(b) each depicts a total of 4 × 120 = 480 plotted points. Figure 7 presents running time results associated with two different memory constraints: M = 2M0 in Figure 7(a), and M = 4M0 in Figure 7(b) (see the discussion on relative memory bounds in Section 6.3).
These two memory constraints are used to represent relatively tight and loose memory budgets, respectively. The vertical axes in Figure 7 correspond to s^(1/4), where s is the measured running time in seconds. Here, we apply an exponent of (1/4) to help improve clarity in depicting the large number of displayed points. The ALV methods sorted from fastest to slowest are: MBD, Σ-IAV with the TMSV score function, Σ-IAV with the TMSVPB score function, and NIAV. Note that the TMSVPB score function runs more slowly than TMSV due to the computation cost of Ω_buf(G_{B′}) in the denominator of Equation 4. This cost involves recomputing the buffer requirements for all of the edges in G. Table 1 shows the running times of the ALV methods on a specific graph with 22 nodes and 33 edges. This graph is selected randomly to provide further insight into variations in running time among the four ALV methods.

Table 1. The running times (in seconds) of the ALV methods on a specific SDF graph with 22 nodes and 33 edges.

              TMSV Σ-IAV   TMSVPB Σ-IAV   NIAV   MBD
  M = 2M0         2.0           8.4        320   0.1
  M = 4M0        13.0          35.9       3500   0.7

In our experiments, we find that MBD typically finishes within 1 second, while the running times of the two Σ-IAV methods usually range from several seconds up to a few minutes. We expect that this kind of running time profile is acceptable in many coarse-grain dataflow design scenarios in the embedded signal processing domain, where actors typically perform higher-level signal processing operations, and therefore the number of nodes in the graphs is limited compared to other types of dataflow graphs that are based on fine-grained actors. The running time of N-candidates IAV is generally the longest among all four methods, and grows rapidly with the number of nodes. In our experiments with an SDF graph having 30 nodes, for example, N-candidates IAV takes 3 hours to finish its computation. Therefore, N-candidates IAV is more suitable in situations where the SDF graph is relatively small, design turnaround time is not critical, or solution quality is of utmost importance.

7 CASE STUDY: OFDM

In this section, we demonstrate the effectiveness of our new ALV-integrated software synthesis framework through a case study involving an orthogonal frequency-division multiplexing (OFDM) receiver (OFDM-RX). The OFDM-RX is an adapted version of the OFDM system described in [18].

Fig. 8. SDF model of the OFDM-RX application.

Figure 8 shows an SDF model for the OFDM-RX application. The value above each actor in Figure 8 gives the repetition count of the actor. Table 2 lists the actors in this SDF model and describes their corresponding functions. The system can operate with different parameter values, as shown in Table 3.

Table 2. Actors in the OFDM-RX application.

  Actor   Description
  src     Read samples of the input signal.
  syn     Perform time-domain synchronization.
  cfo     Remove carrier frequency offsets.
  rcp     Remove cyclic prefix.
  fft     Perform Fast Fourier Transform on symbols.
  dmp     Map OFDM symbols into a bit stream.
  snk     Write the bit stream to the output.

7.1 System Implementation and Profiling

We have implemented the OFDM-RX actors using the Lightweight Dataflow Environment (LIDE), which provides a programming methodology and associated application programming interfaces (APIs) for implementing dataflow graph actors and edges in a wide variety of platform-oriented languages, such as C, C++, CUDA, and Verilog [22, 23]. In our OFDM-RX system, GPU-accelerated implementations are available for all actors other than the src and snk actors. The src and snk actors are not mapped to the GPU in our implementation because of the input/output operations that are involved in these actors. We have profiled the execution times for the OFDM-RX actors on both the CPU and GPU. Figure 9 summarizes the average execution times per SDF graph iteration for the actors.
This average time can be expressed as t_T(v) = q(v) t(v), where q represents the repetitions vector of the enclosing SDF graph, and t(v) represents the average execution time measured for a single firing of v. These execution-time estimates are measured on both the CPU and the GPU with L = 256, and with the actors vectorized to process different numbers of data frames per vectorized invocation.

Table 3. Parameters in the OFDM-RX application model, along with the settings or values we use in our experiments.

  Parameter   Description                                        Values
  L           Number of subcarriers per OFDM symbol              [128, 256, 512, 1024]
  N           Number of OFDM symbols per frame                   10
  L_cp        Length of the cyclic prefix per OFDM symbol        (9/128)L
  M           Number of bits per sample                          4
  D           Length of data excluding training symbols          (N − 1)(L + L_cp)
  F           Length of a frame                                  N(L + L_cp)
  S           Size of the sample stream                          2F

Fig. 9. t_T(v) values on (a) the CPU and (b) the GPU for OFDM-RX actors that are vectorized to process multiple frames in each firing.

Observe from Figure 9 that the distribution of the t_T(v) values in OFDM-RX is uneven: the syn and cfo actors dominate the execution times on both the CPU and the GPU. Also, observe that although actor execution times are roughly proportional to the number of frames N_F, they increase at different rates in relation to N_F; for example, t_T(cfo) on the GPU grows very slowly with increases in N_F, while t_T(syn) grows much faster.

7.2 Software Synthesis with GLV-HEFT

We first measure the performance improvement achieved by GLV-HEFT when integrated into our DIF-GPU software synthesis framework. Here, we measure the system throughput under 11 different configurations without any memory constraints imposed. These measurements are performed on software implementations that are generated automatically using DIF-GPU integrated with GLV-HEFT. In contrast to the relative throughput metric (see Section 4) that is used as a general performance metric in Section 6, we employ frames per second as a throughput metric more specific to the OFDM-RX application. We denote the results (throughput values) from these measurements by Φ_0, Φ_1, ..., Φ_10. Here, Φ_0 denotes the throughput when the input graph is not vectorized and all actors are mapped onto a single CPU core. For b ∈ {1, 2, ..., 10}, Φ_b represents the throughput obtained when GLV is applied with GVD = b, and HEFT is used to schedule the resulting vectorized graph (GLV-HEFT) [16].

Fig. 10. Speedup of the OFDM-RX application over a single-CPU implementation for different GVD values and different values of the bandwidth parameter L.

Figure 10 shows the speedup in throughput of GLV over the single-CPU implementation, and compares Φ_0 and Φ_10 in more detail for different values of L. The maximum measured speedups achieved here are 10.1X, 18.1X, 31.9X, and 41.1X for L = 128, 256, 512, and 1024, respectively. In our experiments, the actors in the dataflow model are coarse-grained signal-processing modules. Before vectorization, the actors already encapsulate multiple steps of processing on large signal arrays, and extensively utilize GPU data parallelism. For example, the unvectorized syn actor in the OFDM receiver application consists of multiple steps of cross-correlation on 20 OFDM symbols. Therefore, the saturation at small GVD levels seen in Figure 10 is to be expected.

7.3 Software Synthesis with ALV-IAR

In this section, we perform measurements and comparisons that involve software implementations generated automatically using DIF-GPU integrated with ALV-IAR. The experiments are performed under different memory budgets and different levels of the bandwidth L (an application-level parameter).
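The frame-related quantities in Table 3 are all derived from the bandwidth parameter L (and the symbols-per-frame count N), so the experiment configurations above can be enumerated directly. A small sketch following the Table 3 formulas; integer division is used for L_cp, which is exact for the L values considered here since each is a multiple of 128:

```python
# Derived OFDM-RX parameters from Table 3, as functions of the number of
# subcarriers per symbol L and the number of OFDM symbols per frame N.
def frame_parameters(L, N=10):
    L_cp = (9 * L) // 128       # cyclic-prefix length per OFDM symbol
    D = (N - 1) * (L + L_cp)    # data length, excluding training symbols
    F = N * (L + L_cp)          # frame length in samples
    S = 2 * F                   # size of the sample stream
    return {"L_cp": L_cp, "D": D, "F": F, "S": S}

for L in [128, 256, 512, 1024]:
    print(L, frame_parameters(L))
# e.g., L=128 -> {'L_cp': 9, 'D': 1233, 'F': 1370, 'S': 2740}
```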
For comparison, we apply DIF-GPU integrated with GLV-HEFT to synthesize software that incorporates vectorized schedules constructed using GLV-HEFT instead of ALV-IAR. Table 4 shows an example of the vectorization degrees and processor assignments derived for OFDM-RX under a specific memory constraint, which is selected to represent one that is neither very tight nor very loose. These vectorized scheduling results are derived by ALV-IAR, and the throughput is measured by executing the resulting software implementation synthesized by DIF-GPU. The vectorization and processor assignment (mapping) results are shown in Table 4 as lists of values that correspond to the graph actors in their topological order (src, syn, ..., snk). The values 0 and 1 in the Mapping column represent the CPU core and the GPU, respectively. The results in Table 4 show that ALV-IAR produces a 1.2X speedup compared to the baseline technique for the selected memory constraint.

Fig. 11. Memory-constrained throughput of OFDM-RX systems with different levels of memory budget M and bandwidth L using ALV-IAR compared to the GLV-HEFT baseline. These experiments are performed on an Intel i5-6400 CPU and an NVIDIA GeForce GTX 750 GPU. (a) L = 128, (b) L = 256, (c) L = 512, (d) L = 1024.

Table 4. Vectorization degrees and mapping results generated by ALV-IAR and GLV-HEFT under the memory constraint M = 2.8 Mb and L = 512.
  Method     Vectorization          Mapping           Φ (10^3/s)
  ALV-IAR    [1,3,12,1,1,1,1]       [0,1,1,0,0,0,0]   3.15
  GLV-HEFT   [4,4,4,36,36,36,144]   [0,1,1,1,1,0,0]   2.60

The memory budgets are set to M = b log(L) × 10^5, where b ∈ {1, 2, ..., 10}. We compare the throughput levels of implementations generated using the two methods, ALV-IAR and GLV-HEFT, as shown in Figure 11. The results in Figure 11 show that using actor-level vectorization and Σ scheduling, we are able to obtain system throughput that consistently exceeds that provided by the baseline method under the same memory constraint. When memory constraints are relatively tight, GLV has difficulty adequately exploiting the data parallelism in the OFDM-RX system. ALV-IAR alleviates this problem by focusing memory resources on vectorizing selected, performance-critical actors. Specifically, ALV-IAR successfully identifies syn and cfo as the two actors that benefit most from vectorized execution on the GPU. Prioritizing the vectorization of these two actors avoids wasting memory on vectorizations that have relatively little or no impact on overall system performance. This is reflected by a large throughput gain when b ≤ 4. When the memory constraint is relaxed, the gap in throughput gain between ALV-IAR and GLV narrows, as the data parallelism in the system can be exploited more effectively by GLV under loose memory constraints.

When optimizing the OFDM-RX system, ALV-IAR maps only syn and cfo onto the GPU, and assigns the other actors to the CPU to utilize pipeline parallelism in the system. Under this mapping, firings of syn and cfo from subsequent frames can be executed in parallel with firings of rcp, fft, dmp, and snk from earlier frames. In these experiments, the maximum measured speedup values of ALV-IAR over GLV-HEFT are 2.66X, 2.45X, 1.94X, and 1.71X for L = 128, 256, 512, and 1024, respectively. The maximum speedup values of ALV-IAR compared to a single-core, unvectorized CPU baseline implementation are 11.1X, 19.8X, 33.8X, and 47.6X for L = 128, 256, 512, and 1024, respectively. Although the speedup of ALV-IAR over GLV-HEFT in this application is significantly higher than the average speedup in Section 6.3, it still falls within the range of the maximum speedup reported in Section 6.3. We expect that this is attributable to the relatively simple, chain-structured topology of the application's dataflow graph.

The measurements described above are carried out on a CPU-GPU architecture in a desktop computer platform. To complement these experiments with an embedded platform, we investigate the performance of ALV-IAR and GLV-HEFT by performing the same experiments on an NVIDIA Jetson TX1 (TX1). The TX1 is a popular embedded platform that consists of a quad-core ARM A57 CPU and an NVIDIA Maxwell GPU with 256 CUDA cores. The results are summarized in Figure 12, and are found to be similar to those obtained using the desktop platform. More specifically, the maximum speedup values of ALV-IAR over GLV-HEFT measured on the TX1 are 2.4X, 3.5X, 2.4X, and 2.5X for L = 128, 256, 512, and 1024, respectively. These results show that ALV-IAR also consistently outperforms GLV-HEFT by a significant margin on the TX1.

Fig. 12. Memory-constrained throughput of OFDM-RX systems with different levels of memory budget M and bandwidth L using ALV-IAR compared to the GLV-HEFT baseline on the NVIDIA Jetson TX1. (a) L = 128, (b) L = 256, (c) L = 512, (d) L = 1024.
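The pipeline-parallel mapping described above can be reasoned about with a simple steady-state model: once the pipeline is full, throughput is bounded by the slowest stage, since the GPU stage (syn, cfo) for frame k overlaps the CPU stage (the remaining actors) for earlier frames. The stage times below are illustrative assumptions, not measured values:

```python
# Steady-state two-stage pipeline model: throughput = 1 / max(stage times).
# Per-frame stage times (ms) are illustrative placeholders, not profiled data.
gpu_stage = {"syn": 0.9, "cfo": 0.4}                              # mapped to GPU
cpu_stage = {"src": 0.2, "rcp": 0.1, "fft": 0.3, "dmp": 0.2, "snk": 0.2}

stage_times = [sum(gpu_stage.values()), sum(cpu_stage.values())]
frames_per_ms = 1.0 / max(stage_times)   # bounded by the slowest stage
print(f"{frames_per_ms * 1000:.0f} frames/s")  # -> 769 frames/s
```

Under this model, relaxing the memory budget helps only insofar as further vectorization shortens the bottleneck stage, which is consistent with the narrowing gap between ALV-IAR and GLV under loose constraints.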
In summary, the throughput improvement obtained by HCGP acceleration using the methods developed in this work facilitates real-time, memory-constrained processing of OFDM signals. Such acceleration can benefit a variety of software-defined radio and cognitive radio applications.

8 CONCLUSION

In this paper, we have investigated memory-constrained throughput optimization for synchronous dataflow (SDF) graphs on heterogeneous CPU-GPU platforms. We have developed novel methods for Integrated Vectorization and Scheduling (IVS) that provide throughput- and memory-efficient implementations on the targeted class of platforms. We have integrated these IVS methods into the DIF-GPU framework, which provides capabilities for automated synthesis of GPU software from high-level dataflow graphs specified using the dataflow interchange format (DIF). Our development of novel IVS methods and their integration into DIF-GPU provides a streamlined workflow for automated exploitation of pipeline, data, and task-level parallelism from SDF graphs. We have demonstrated our IVS methods through extensive experiments involving a large collection of diverse, synthetic SDF graphs, as well as through a practical embedded signal processing case study involving a wireless communications receiver based on orthogonal frequency division multiplexing. The results of our experiments demonstrate that our proposed new methods for IVS provide significant improvements in system throughput when mapping SDF graphs onto CPU-GPU platforms. Our proposed methods also provide a range of useful trade-offs between analysis time and speedup that designers can select among depending on their specific preferences and constraints.

9 ACKNOWLEDGMENTS

This research was supported in part by the Laboratory for Telecommunication Sciences and the National Science Foundation.

REFERENCES

[1] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. 2011.
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Journal of Concurrency and Computation: Practice & Experience 23, 2 (February 2011), 187–198.
[2] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala (Eds.). 2013. Handbook of Signal Processing Systems (second ed.). Springer. http://dx.doi.org/10.1007/978-1-4614-6859-2 ISBN: 978-1-4614-6858-5 (Print); 978-1-4614-6859-2 (Online).
[3] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. 1996. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers.
[4] Y. Chen and H. Zhou. 2012. Buffer minimization in pipelined SDF scheduling on multi-core platforms. In Proceedings of the Asia South Pacific Design Automation Conference. 127–132.
[5] F. Ciccozzi. 2013. Automatic Synthesis of Heterogeneous CPU-GPU Embedded Applications from a UML Profile. In Proceedings of the International Workshop on Model Based Architecting and Construction of Embedded Systems.
[6] K. Desnos, M. Pelcat, J.-F. Nezan, and S. Aridhi. 2015. Buffer merging technique for minimizing memory footprints of Synchronous Dataflow specifications. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 1111–1115.
[7] R. P. Dick, D. L. Rhodes, and W. Wolf. 1998. TGFF: Task Graphs for Free. In Proceedings of the International Workshop on Hardware/Software Codesign. 97–101.
[8] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. 2011. Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters 21, 2 (2011).
[9] A. H. Ghamarian, M. C. W. Geilen, S. Stuijk, T. Basten, A. J. M. Moonen, M. J. G. Bekooij, B. D. Theelen, and M. R. Mousavi. 2006. Throughput analysis of synchronous data flow graphs. In Proceedings of the International Conference on Application of Concurrency to System Design.
[10] M. Goli, M. T. Garba, and H. González-Vélez. 2012.
Streaming Dynamic Coarse-Grained CPU/GPU Workloads with Heterogeneous Pipelines in FastFlow. In Proceedings of HPCC-ICESS. 445–452.
[11] C. Gregg and K. Hazelwood. 2011. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 134–144.
[12] A. Hagiescu, H. P. Huynh, W.-F. Wong, and R. S. M. Goh. 2011. Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs. In Proceedings of the International Symposium on Parallel and Distributed Processing. 467–478.
[13] C. Hsu, J. Pino, and S. S. Bhattacharyya. 2011. Multithreaded Simulation for Synchronous Dataflow Graphs. ACM Transactions on Design Automation of Electronic Systems 16, 3 (June 2011), 25:1–25:23.
[14] M. Ko, C. Shen, and S. S. Bhattacharyya. 2008. Memory-constrained Block Processing for DSP Software Optimization. Journal of Signal Processing Systems 50, 2 (February 2008), 163–177.
[15] E. A. Lee and D. G. Messerschmitt. 1987. Synchronous Dataflow. Proc. IEEE 75, 9 (September 1987), 1235–1245.
[16] S. Lin, Y. Liu, W. Plishker, and S. S. Bhattacharyya. 2016. A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU–GPU Platforms. In Proceedings of the International Workshop on Software and Compilers for Embedded Systems. Sankt Goar, Germany, 20–29. http://portal.acm.org/dl.cfm
[17] W. Lund, S. Kanur, J. Ersfolk, L. Tsiopoulos, J. Lilius, J. Haldin, and U. Falk. 2015. Execution of Dataflow Process Networks on OpenCL Platforms. In Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 618–625.
[18] J. W. Massey, J. Starr, S. Lee, D. Lee, A. Gerstlauer, and R. W. Heath. 2012. Implementation of a real-time wireless interference alignment network. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers. 104–108.
[19] J. Park and W. J. Dally. 2010.
Buer-space ecient and deadlo ck-free scheduling of stream applications on multi-core architectures. (2010). [20] S. Ritz, M. Pankert, and H. Meyr . 1993. Optimum V ectorization of Scalable Synchronous Dataow Graphs. In Proceedings of the International Conference on Application Specic A rray Processors . [21] L. Schor , A. Tretter , T . Scherer , and L. Thiele. 2013. Exploiting the parallelism of heterogeneous systems using dataow graphs on top of OpenCL. In Proceedings of the IEEE W orkshop on Embe dded Systems for Real- Time Multimedia . 41–50. [22] C. Shen, W . Plishker , H. W u, and S. S. Bhattachary ya. 2010. A Lightweight Dataow Approach for Design and Implementation of SDR Systems. In Procee dings of the Wireless Innovation Conference and Product Exp osition . 640–645. [23] C. Shen, L. W ang, I. Cho, S. Kim, S. W on, W . Plishker , and S. S. Bhattachar yya. 2011. The DSPCAD Lightweight Dataow Environment: Introduction to LIDE V ersion 0.1 . Technical Report UMIA CS- TR-2011-17. Institute for Advanced Computer Studies, University of Maryland at College Park. http://hdl.handle.net/1903/12147. [24] S. Sriram and S. S. Bhattachary ya. 2009. Embedded Multiprocessors: Scheduling and Synchronization (second ed.). CRC Press. ISBN:1420048015. [25] S. Stuijk, M. Geilen, and T . Basten. 2006. Exploring Tradeos in Buer Requirements and Throughput Constraints for Synchronous Dataow Graphs. In Proceedings of the Design Automation Conference . [26] H. T opcuoglu, S. Hariri, and M.- Y . Wu. 2002. Performance-ee ctive and low-complexity task scheduling for heteroge- neous computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (2002), 260–274. [27] S. Tripakis, D . Bui, M. Geilen, B. Rodiers, and E. A. Lee. 2013. Compositionality in synchronous data o w: Modular code generation from hierarchical SDF graphs. A CM Transactions on Embedded Computing Systems 12, 3 (2013). [28] A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil. 2009. 
Software Pipelined Execution of Stream Programs on GPUs. In Proceedings of the International Symposium on Code Generation and Optimization. 200–209.
[29] G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. 2013. Integration of Dataflow-based Heterogeneous Multiprocessor Scheduling Techniques in GNU Radio. Journal of Signal Processing Systems 70, 2 (February 2013), 177–191.