DiffTaichi: Differentiable Programming for Physical Simulation



Published as a conference paper at ICLR 2020

Yuanming Hu†, Luke Anderson†, Tzu-Mao Li∗, Qi Sun‡, Nathan Carr‡, Jonathan Ragan-Kelley∗, Frédo Durand†
† MIT CSAIL {yuanming,lukea,fredo}@mit.edu  ‡ Adobe Research {qisu,ncarr}@adobe.com  ∗ UC Berkeley {tzumao,jrk}@berkeley.edu

Abstract

We present DiffTaichi, a new differentiable programming language tailored for building high-performance differentiable physical simulators. Based on an imperative programming language, DiffTaichi generates gradients of simulation steps using source code transformations that preserve arithmetic intensity and parallelism. A light-weight tape is used to record the whole simulation program structure and replay the gradient kernels in a reversed order, for end-to-end backpropagation. We demonstrate the performance and productivity of our language in gradient-based learning and optimization tasks on 10 different physical simulators. For example, a differentiable elastic object simulator written in our language is 4.2× shorter than the hand-engineered CUDA version yet runs as fast, and is 188× faster than the TensorFlow implementation. Using our differentiable programs, neural network controllers are typically optimized within only tens of iterations.

1 Introduction

Figure 1: Left: Our language allows us to seamlessly integrate a neural network (NN) controller and a physical simulation module, and update the weights of the controller or the initial state parameterization (blue). Our simulations typically have 512 to 2048 time steps, and each time step has up to one thousand parallel operations. Right: 10 differentiable simulators built with DiffTaichi.

Differentiable physical simulators are effective components in machine learning systems. For example, de Avila Belbute-Peres et al. (2018a) and Hu et al.
(2019b) have shown that controller optimization with differentiable simulators converges one to four orders of magnitude faster than model-free reinforcement learning algorithms. The presence of differentiable physical simulators in the inner loop of these applications makes their performance vitally important. Unfortunately, with existing tools it is difficult to implement these simulators with high performance.

We present DiffTaichi, a new differentiable programming language for high-performance physical simulation on both CPU and GPU. It is based on the Taichi programming language (Hu et al., 2019a). The DiffTaichi automatic differentiation system is designed to suit key language features required by physical simulation, yet often missing in existing differentiable programming tools, as detailed below:

Megakernels. Our language uses a "megakernel" approach, allowing the programmer to naturally fuse multiple stages of computation into a single kernel, which is later differentiated using source code transformations and just-in-time compilation. Compared to the linear algebra operators in TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017), DiffTaichi kernels have higher arithmetic intensity and are therefore more efficient for physical simulation tasks.

Imperative Parallel Programming. In contrast to the functional array programming languages that are popular in modern deep learning (Bergstra et al., 2010; Abadi et al., 2016; Li et al., 2018b), most traditional physical simulation programs are written in imperative languages such as Fortran and C++. DiffTaichi likewise adopts an imperative approach. The language provides parallel loops and control flow (such as "if" statements), which are widely used constructs in physical simulations: they simplify common tasks such as handling collisions, evaluating boundary conditions, and building iterative solvers.
Using an imperative style makes it easier to port existing physical simulation code to DiffTaichi.

Flexible Indexing. Existing parallel differentiable programming systems provide element-wise operations on arrays of the same shape, e.g. c[i, j] = a[i, j] + b[i, j]. However, many physical simulation operations, such as numerical stencils and particle-grid interactions, are not element-wise. Common simulation patterns such as y[p[i] * 2, j] = x[q[i + j]] can only be expressed with unintuitive scatter/gather operations in these existing systems, which are not only inefficient but also hard to develop and maintain. In DiffTaichi, by contrast, the programmer directly manipulates array elements via arbitrary indexing, which allows partial updates of global arrays and makes these common simulation patterns naturally expressible. The explicit indexing syntax also makes it easy for the compiler to perform access optimizations (Hu et al., 2019a).

These three requirements motivated us to design a tailored two-scale automatic differentiation system, which makes DiffTaichi especially suitable for developing complex, high-performance differentiable physical simulators, possibly with neural network controllers (Fig. 1, left). Using our language, we are able to quickly implement and automatically differentiate 10 physical simulators[1], covering rigid bodies, deformable objects, and fluids (Fig. 1, right). A comprehensive comparison between DiffTaichi and other differentiable programming tools is in Appendix A.

2 Background: The Taichi Programming Language

DiffTaichi is based on the Taichi programming language (Hu et al., 2019a). Taichi is an imperative programming language embedded in C++14. It delivers both high performance and high productivity on modern hardware.
The key design that distinguishes Taichi from other imperative programming languages such as C++/CUDA is the decoupling of computation from data structures. This allows programmers to easily switch between different data layouts and to access data structures with indices (i.e. x[i, j, k]) as if they were normal dense arrays, regardless of the underlying layout. The Taichi compiler then uses both the data structure and algorithm information to apply performance optimizations. Taichi provides "parallel-for" loops as a first-class construct. These designs make Taichi especially suitable for writing high-performance physical simulators. For more details, readers are referred to Hu et al. (2019a).

The DiffTaichi language frontend is embedded in Python, and a Python AST transformer compiles DiffTaichi code to Taichi intermediate representation (IR). Unlike Python, the DiffTaichi language is compiled, statically typed, parallel, and differentiable. We extend the Taichi compiler to further compile and automatically differentiate the generated Taichi IR into forward and backward executables.

[1] Our language, compiler, and simulator code is open-source. All the results in this work can be reproduced by a single Python script. Visual results in this work are presented in the supplemental video.

We demonstrate the language using a mass-spring simulator with three springs and three mass points, as shown on the right. In this section we introduce the forward simulator using the DiffTaichi frontend of Taichi, which is an easier-to-use wrapper of the Taichi C++14 frontend.

Allocating Global Variables. Firstly we allocate a set of global tensors to store the simulation state.
These tensors include a scalar loss of type float32; 2D tensors x, v, and force of size steps × n_objects and type float32x2; and 1D arrays of size n_springs for spring properties: spring_anchor_a (int32), spring_anchor_b (int32), and spring_length (float32).

Defining Kernels. A mass-spring system is modeled by Hooke's law:

F = k (‖x_a − x_b‖₂ − l₀) (x_a − x_b) / ‖x_a − x_b‖₂,

where k is the spring stiffness, F is the spring force, x_a and x_b are the positions of the two mass points, and l₀ is the rest length. The following kernel loops over all the springs and scatters forces to the mass points:

```python
@ti.kernel
def apply_spring_force(t: ti.i32):
    # Kernels can have parameters. Here t is a parameter with type int32.
    for i in range(n_springs):  # A parallel for, preferably on GPU
        a, b = spring_anchor_a[i], spring_anchor_b[i]
        x_a, x_b = x[t - 1, a], x[t - 1, b]
        dist = x_a - x_b
        length = dist.norm() + 1e-4
        F = (length - spring_length[i]) * spring_stiffness * dist / length
        # Apply spring impulses to mass points.
        force[t, a] += -F  # "+=" is atomic by default
        force[t, b] += F
```

For each particle i, we use semi-implicit Euler time integration with damping:

v_{t,i} = e^{−Δt·α} v_{t−1,i} + (Δt / m_i) F_{t,i},    x_{t,i} = x_{t−1,i} + Δt v_{t,i},

where v_{t,i}, x_{t,i}, and m_i are the velocity, position, and mass of particle i at time step t, respectively, and α is a damping factor. The kernel is as follows:

```python
@ti.kernel
def time_integrate(t: ti.i32):
    for i in range(n_objects):
        s = math.exp(-dt * damping)  # Compile-time evaluation, since dt and damping are constants
        v[t, i] = s * v[t - 1, i] + dt * force[t, i] / mass  # mass = 1 in this example
        x[t, i] = x[t - 1, i] + dt * v[t, i]
```

Assembling the Forward Simulator. With these components, we define the forward time integration:

```python
def forward():
    for t in range(1, steps):
        apply_spring_force(t)
        time_integrate(t)
```

3 Automatically Differentiating Physical Simulators in Taichi

The main goal of DiffTaichi's automatic differentiation (AD) system is to generate gradient simulators automatically, with minimal code changes relative to the traditional forward simulators.

Design Decision. Source code transformation (SCT) (Griewank & Walther, 2008) and tracing (Wengert, 1964) are common choices when designing AD systems. In our setting, using SCT to differentiate a whole simulator with thousands of time steps results in high performance yet poor flexibility and long compilation times. On the other hand, naively adopting tracing provides flexibility yet poor performance, since the "megakernel" structure is not preserved during backpropagation. To get both performance and flexibility, we developed a two-scale automatic differentiation system (Figure 2): we use SCT for differentiating within kernels, and a light-weight tape that stores only function pointers and arguments for end-to-end simulation differentiation. The global tensors are natural checkpoints for gradient evaluation.

Figure 2: Left: The DiffTaichi system. We reuse some infrastructure (white boxes) from Taichi, while the blue boxes are our extensions for differentiable programming. Right: The tape records kernel launches and replays the gradient kernels in reverse order during backpropagation.
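The tape mechanism can be illustrated with a minimal plain-Python sketch (names such as `Tape.launch` are hypothetical illustrations, not DiffTaichi's actual API): the forward pass records only (kernel, arguments) pairs, since intermediate values live in global tensors, and backpropagation replays the paired gradient kernels in reverse launch order.

```python
# Minimal sketch of a light-weight AD tape. The tape stores no intermediate
# buffers, only which kernels ran and with what arguments.
class Tape:
    def __init__(self):
        self.entries = []  # (forward_kernel, args) in launch order

    def launch(self, kernel, *args):
        self.entries.append((kernel, args))
        kernel(*args)  # run the forward (primal) kernel

    def backward(self):
        # Replay the gradient kernels in reverse launch order.
        for kernel, args in reversed(self.entries):
            kernel.grad(*args)

# A toy "kernel" pair operating on global arrays, mimicking the structure of
# a primal kernel and its generated adjoint.
x, y = [2.0], [0.0]          # primal global tensors
x_adj, y_adj = [0.0], [1.0]  # adjoint tensors; seed dL/dy = 1

def square(i):
    y[i] = x[i] * x[i]

def square_grad(i):
    x_adj[i] += 2 * x[i] * y_adj[i]  # accumulate, never overwrite

square.grad = square_grad

tape = Tape()
tape.launch(square, 0)
tape.backward()
print(y[0], x_adj[0])  # 4.0 and dy/dx = 4.0
```

Because the global tensors hold all intermediate state, replaying `kernel.grad` in reverse is all the tape has to do; this is what keeps it light-weight compared to systems that record every intermediate buffer.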
Assumption. Unlike functional programming languages, where immutable output buffers are generated, imperative programming allows programmers to freely modify global tensors. To make automatic differentiation well-defined under this setting, we make the following assumption on imperative kernels:

Global Data Access Rules: 1) If a global tensor element is written more than once, then starting from the second write, the write must come in the form of an atomic add ("accumulation"). 2) No read accesses happen to a global tensor element until its accumulation is done.

In forward simulators, programmers may need to make subtle changes to satisfy these rules. For instance, in the mass-spring simulation example, we record the whole history of x and v instead of keeping only the latest values. The resulting memory consumption can be alleviated via checkpointing, as discussed in Appendix D.

With these assumptions, kernels will not overwrite each other's outputs, and the goal of AD is clear: given a primal kernel f that takes as input X_1, X_2, ..., X_n and outputs (or accumulates to) Y_1, Y_2, ..., Y_m, the generated gradient (adjoint) kernel f* should take as input X_1, X_2, ..., X_n and Y*_1, Y*_2, ..., Y*_m and accumulate gradient contributions to X*_1, X*_2, ..., X*_n, where each X*_i is the adjoint of X_i, i.e. ∂(loss)/∂X_i.

Storage Control of Adjoint Tensors. Users can specify the storage of adjoint tensors using the Taichi data structure description language (Hu et al., 2019a), as if they were primal tensors. We also provide ti.root.lazy_grad() to automatically place the adjoint tensors following the layout of their primals.

3.1 Local AD: Differentiating Taichi Kernels Using Source Code Transforms

A typical Taichi kernel consists of multiple levels of for loops and a body block.
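The adjoint contract and access rules above can be sketched in plain Python (an illustration only; in DiffTaichi the adjoint is generated automatically). The primal kernel scatters into Y purely through accumulation, so the hand-written adjoint can accumulate ∂(loss)/∂X without worrying about overwritten values:

```python
# Primal kernel f: reads X, accumulates into Y, obeying the global data
# access rules (every write to y is an atomic-add-style accumulation).
n = 4
x = [1.0, 2.0, 3.0, 4.0]
a = [0, 1, 0, 1]          # scatter targets, like spring anchor indices
y = [0.0, 0.0]

def f():
    for i in range(n):
        y[a[i]] += x[i] * x[i]   # accumulation, never a plain overwrite

# Adjoint kernel f*: takes X and Y*, accumulates into X*:
#   x_adj[i] += d(y[a[i]])/d(x[i]) * y_adj[a[i]]
y_adj = [1.0, 1.0]               # seed, i.e. loss = y[0] + y[1]
x_adj = [0.0] * n

def f_adjoint():
    for i in range(n):
        x_adj[i] += 2 * x[i] * y_adj[a[i]]

f()
f_adjoint()
print(y)      # [10.0, 20.0]
print(x_adj)  # [2.0, 4.0, 6.0, 8.0]
```

Note that the adjoint loops over the same index set as the primal and only accumulates, so it parallelizes the same way the primal does (with atomic adds).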
To make later AD easier, we introduce two basic code transforms that simplify the loop body, as detailed below.

```cpp
// Original
int a = 0;
if (b > 0) { a = b; } else { a = 2 * b; }
a = a + 1;
return a;

// After flattening branching
int a = 0;
a = select(b > 0, b, 2 * b);
a = a + 1;
return a;

// After eliminating mutable variables
ssa1 = select(b > 0, b, 2 * b);
ssa2 = ssa1 + 1;
return ssa2;
```

Figure 3: Simple IR preprocessing before running the AD source code transform (top to bottom), demonstrated in C++. The actual Taichi IR is often more complex. Containing loops are omitted.

Flatten Branching. Branches are common in physical simulation, e.g. when implementing boundary conditions and collisions. To simplify the reverse-mode AD pass, we first replace "if" statements with ternary operators select(cond, value_if_true, value_if_false), whose gradients are clearly defined (Fig. 3, middle). This is a common transformation in program vectorization (e.g. Karrenberg & Hack (2011); Pharr & Mark (2012)).

Eliminate Mutable Local Variables. After removing branching, we end up with straight-line loop bodies. To further simplify the IR and make the procedure truly single-assignment, we apply a series of local-variable store-forwarding transforms until the mutable local variables are fully eliminated (Fig. 3, right).

After these two custom IR simplification transforms, DiffTaichi only has to differentiate straight-line code without mutable variables, which it achieves with reverse-mode AD using a standard source code transformation (Griewank & Walther, 2008). More details on this transform are in Appendix B.

Loops. Most loops in physical simulation are parallel loops, and during AD we preserve the parallel loop structures. For loops that are not explicitly marked as parallel, we reverse the loop order during AD transforms.
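The two transforms can be mimicked in plain Python (illustrative only; this is not Taichi IR). Branching is flattened into select(), mutable locals become single-assignment values, and the resulting straight-line code differentiates by simple reverse-mode rules, with the derivative of select being a select of derivatives:

```python
def select(cond, val_if_true, val_if_false):
    # Ternary operator with a well-defined gradient in each branch.
    return val_if_true if cond else val_if_false

# Original, branchy version: a = (b if b > 0 else 2*b); a = a + 1
def f_branchy(b):
    if b > 0:
        a = b
    else:
        a = 2 * b
    a = a + 1
    return a

# After flattening + store forwarding: straight-line, single-assignment.
def f_flat(b):
    ssa1 = select(b > 0, b, 2 * b)
    ssa2 = ssa1 + 1
    return ssa2

# Reverse-mode derivative of the straight-line form.
def f_flat_grad(b):
    d_ssa2 = 1.0                            # seed
    d_ssa1 = d_ssa2                         # ssa2 = ssa1 + 1
    d_b = select(b > 0, 1.0, 2.0) * d_ssa1  # gradient of the select
    return d_b

assert f_branchy(3.0) == f_flat(3.0) == 4.0
assert f_branchy(-3.0) == f_flat(-3.0) == -5.0
print(f_flat_grad(3.0), f_flat_grad(-3.0))  # 1.0 2.0
```

Because every value is assigned exactly once, the adjoint pass can simply walk the statements in reverse, which is what makes the preceding simplifications worthwhile.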
We do not support loops that carry a mutating local variable, since that would require a complex and costly run-time stack to maintain the history of local variables. Instead, users are instructed to use global variables that satisfy the global data access rules.

Parallelism and Thread Safety. For forward simulation, we inherit the "parallel-for" construct from Taichi to map each loop iteration onto CPU/GPU threads. Programmers use atomic operations for thread safety. Our system can automatically differentiate these atomic operations; gradient contributions in backward kernels are accumulated to the adjoint tensors via atomic adds.

3.2 Global AD: End-to-End Backpropagation Using a Light-Weight Tape

We construct a tape (Fig. 2, right) of the kernel execution so that gradient kernels can be replayed in reversed order. The tape is very light-weight: since the intermediate results are stored in global tensors, during forward simulation the tape records only kernel names and their (scalar) input parameters, unlike other differentiable functional array systems, where all the intermediate buffers have to be recorded by the tape. Whenever a DiffTaichi kernel is launched, we append the kernel function pointer and parameters to the tape. When evaluating gradients, we traverse the reversed tape and invoke the gradient kernels with the recorded parameters. Note that DiffTaichi AD evaluates gradients with respect to the input global tensors rather than the input parameters.

Learning/Optimization with Gradients. Now we revisit the mass-spring example and make it differentiable for optimization. Suppose the goal is to optimize the rest lengths of the springs so that the triangle formed by the three springs has area 0.2 at the end of the simulation. We first define the loss function:

```python
@ti.kernel
def compute_loss(t: ti.i32):
    x01 = x[t, 0] - x[t, 1]
    x02 = x[t, 0] - x[t, 2]
    # Triangle area from cross product
    area = ti.abs(0.5 * (x01[0] * x02[1] - x01[1] * x02[0]))
    target_area = 0.2
    loss[None] = ti.sqr(area - target_area)
    # Everything in Taichi is a tensor.
    # "loss" is a scalar (0-D tensor), thereby indexed with [None].
```

The programmer uses ti.Tape to memorize forward kernel launches; it automatically replays the gradients of these kernels in reverse for backpropagation. Initially the springs have rest lengths [0.1, 0.1, 0.14], and after optimization the rest lengths are [0.600, 0.600, 0.529]. This means the springs will expand the triangle according to Hooke's law and form a larger triangle. [Reproduce: mass_spring_simple.py]

```python
def main():
    for iter in range(200):
        with ti.Tape(loss):
            forward()
            compute_loss(steps - 1)
        print('Iter=', iter)
        print('Loss=', loss[None])
        # Gradient descent
        for i in range(n_springs):
            spring_length[i] -= lr * spring_length.grad[i]
```

[Figure: loss vs. gradient descent iteration for the spring rest length optimization.]

Complex Kernels. Sometimes the user may want to override the gradients provided by the compiler. For example, when differentiating a 3D singular value decomposition computed with an iterative solver, it is better to use a manually engineered SVD derivative subroutine for stability. We provide two additional decorators, ti.complex_kernel and ti.complex_kernel_grad, to override the default automatic differentiation, as detailed in Appendix C. Apart from custom gradients, complex kernels can also be used to implement checkpointing, as detailed in Appendix D.

4 Evaluation

We evaluate DiffTaichi on 10 different physical simulators covering large-scale continuum and small-scale rigid body simulations. All results can be reproduced with the provided script. The dynamic/optimization processes are visualized in the supplemental video.
In this section we focus our discussion on three simulators. More details on the simulators are in Appendix E.

4.1 Differentiable Continuum Mechanics for Elastic Objects [diffmpm]

First, we build a differentiable continuum simulation for soft robotics applications. The physical system is governed by momentum and mass conservation, i.e.

ρ Dv/Dt = ∇·σ + ρg,    Dρ/Dt + ρ ∇·v = 0.

We follow ChainQueen's implementation (Hu et al., 2019b) and use the moving least squares material point method (Hu et al., 2018) to simulate the system. We were able to easily translate the original CUDA simulator into DiffTaichi syntax. Using this simulator and an open-loop controller, we can easily train a soft robot to move forward (Fig. 1, diffmpm).

Performance and Productivity. Compared with the manual gradient implementation of Hu et al. (2019b), getting gradients in DiffTaichi is effortless. As a result, the DiffTaichi implementation is 4.2× shorter in terms of lines of code and runs almost as fast; compared with TensorFlow, the DiffTaichi code is 1.7× shorter and 188× faster (Table 1). The TensorFlow implementation is verbose due to the heavy use of tf.gather_nd/scatter_nd, array transposing, and broadcasting.

Table 1: diffmpm performance comparison on an NVIDIA GTX 1080 Ti GPU. We benchmark in 2D using 6.4K particles. For the lines of code, we only include the essential implementation, excluding boilerplate code. [Reproduce: python3 diffmpm_benchmark.py]

Approach     Forward Time   Backward Time   Total Time        # Lines of Code
TensorFlow   13.20 ms       35.70 ms        48.90 ms (188×)   190
CUDA         0.10 ms        0.14 ms         0.24 ms (0.92×)   460
DiffTaichi   0.11 ms        0.15 ms         0.26 ms (1.00×)   110

4.2 Differentiable Incompressible Fluid Simulator [smoke]

We implemented a smoke simulator (Fig.
1, smoke) with semi-Lagrangian advection (Stam, 1999) and implicit pressure projection, following the example in Autograd (Maclaurin et al., 2015). Using gradient descent optimization on the initial velocity field, we are able to find a velocity field that changes the pattern of the fluid into a target image (Fig. 7a in Appendix). We compare the performance of our system against PyTorch, Autograd, and JAX in Table 2.

Table 2: smoke benchmark against Autograd, PyTorch, and JAX. We used a 110 × 110 grid and 100 time steps, each with 6 Jacobi pressure projections. Note that the Autograd program uses float64 precision, which approximately doubles the run time. [Reproduce: python3 smoke_[autograd/pytorch/jax/taichi_cpu/taichi_gpu].py]

Approach                Forward Time   Backward Time   Total Time        # Essential LoC
PyTorch (CPU, f32)      405 ms         328 ms          733 ms (13.8×)    74
PyTorch (GPU, f32)      254 ms         457 ms          711 ms (13.4×)    74
Autograd (CPU, f64)     307 ms         1197 ms         1504 ms (28.4×)   51
JAX (GPU, f32)          24 ms          75 ms           99 ms (1.9×)      90
DiffTaichi (CPU, f32)   66 ms          132 ms          198 ms (3.7×)     75
DiffTaichi (GPU, f32)   24 ms          29 ms           53 ms (1.0×)      75

Note that, as an example from the Autograd library, this grid-based simulator is intentionally simplified to suit traditional array-based programs. For example, a periodic boundary condition is used so that Autograd can represent it using numpy.roll, without any branching. Still, Taichi delivers higher performance than these array-based systems. The whole program takes 10 seconds to run in DiffTaichi on a GPU, of which 2 seconds are spent on JIT compilation; JAX JIT compilation takes 2 minutes.

4.3 Differentiable Rigid Body Simulators [rigid_body]

We built an impulse-based (Catto, 2009) differentiable rigid body simulator (Fig. 1, rigid_body) for optimizing robot controllers.
This simulator supports rigid body collision and friction, spring forces, joints, and actuation. The simulation is end-to-end differentiable except at a countable number of discontinuities. Interestingly, although the forward simulator works well, naively differentiating it with DiffTaichi leads to completely misleading gradients, due to the rigid body collisions. We discuss the cause of and solution to this issue below.

Improving Collision Gradients. Consider the rigid ball example in Fig. 4 (left), where a rigid ball collides with a frictionless ground. Gravity is ignored, and due to conservation of kinetic energy the ball keeps a constant speed even after this elastic collision. In the forward simulation, using a small Δt often leads to a reasonable result, as in many physics simulators. Lowering the initial ball height will increase the final ball height, since there is less distance to travel before the ball hits the ground and more after (see the loss curves in Fig. 4, middle right). However, with a naive time integrator, no matter how small Δt is, the evaluated gradient of the final height w.r.t. the initial height will be 1 instead of −1. This counter-intuitive behavior is due to the fact that the time discretization itself is not differentiated by the compiler. Fig. 4 explains this effect in greater detail.

Figure 4: How gradients can go wrong with naive time integrators. For clarity we use a large Δt here. Left: Since collision detection only happens at multiples of Δt (2Δt in this case), lowering the initial position of the ball (light blue) leads to a lowered final position. Middle left: By improving the time integrator to support continuous time of impact (TOI), collisions can be detected at any time, e.g. 1.9Δt (light red). Now the blue ball ends up higher than the green one.
Middle right: Although the two time integration techniques lead to almost identical forward results (in practice Δt is small), the naive time integrator gives an incorrect gradient of 1, while adding TOI yields the correct gradient. Please see our supplemental video for a better demonstration. [Reproduce: python3 rigid_body_toi.py] Right: When zooming in, the loss of the naive integrator is decreasing, and the saw-tooth pattern explains the positive gradients. [Reproduce: python3 rigid_body_toi.py zoom]

Figure 5: Adding TOI greatly improves gradient and optimization quality (loss vs. gradient descent iterations for two mass-spring robots and one rigid body robot, naive vs. TOI). Each experiment is repeated five times. [Reproduce: python3 [mass_spring/rigid_body].py [1/2] plot && python3 plot_losses.py]

We propose a simple solution: adding continuous collision resolution (see, for example, Redon et al. (2002)), which considers the precise time of impact (TOI), to the forward program (Fig. 4, middle left). Although it barely improves the forward simulation (Fig. 4, middle right), the gradient is corrected effectively (Fig. 4, right). The details of continuous collision detection are in Appendix F.

In real-world simulators, we find the TOI technique leads to significant improvements in gradient quality in controller optimization tasks (Fig. 5). Having TOI or not barely affects forward simulation: in the supplemental video, we show that a robot controller optimized in a simulator with TOI actually works well in a simulator without TOI.
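The effect above can be reproduced with a plain-Python sketch (a simplification, not the paper's simulator): an elastic ball moving at constant speed bounces off the ground at x = 0, and we estimate d(final height)/d(initial height) by central finite differences for a naive integrator versus one with continuous time of impact.

```python
def simulate(x0, dt=0.1, steps=20, use_toi=False):
    x, v = x0, -1.0                      # constant-speed fall, gravity ignored
    for _ in range(steps):
        if use_toi and v < 0 and x + v * dt < 0:
            toi = x / -v                 # exact moment the ball reaches x = 0
            v = -v                       # elastic bounce at the impact time
            x = (dt - toi) * v           # advance the remaining sub-step upward
        else:
            x += v * dt                  # naive update
            if x < 0:
                v = -v                   # bounce detected one step too late
    return x

def fd_grad(x0, use_toi, eps=1e-2):
    # Central finite-difference estimate of d(final height)/d(initial height).
    return (simulate(x0 + eps, use_toi=use_toi)
            - simulate(x0 - eps, use_toi=use_toi)) / (2 * eps)

print(fd_grad(0.72, use_toi=False))  # ~ +1.0: naive integrator, wrong sign
print(fd_grad(0.72, use_toi=True))   # ~ -1.0: TOI recovers the true gradient
```

With the naive integrator, perturbing the initial height shifts the whole trajectory without moving the (step-quantized) collision time, so the final height moves in the same direction as the initial height; TOI lets the collision time vary continuously, restoring the correct −1 gradient.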
The takeaway is: differentiating physical simulators does not always yield useful gradients of the physical system being simulated, even if the simulator does forward simulation well. In Appendix G, we discuss some additional gradient issues we have encountered.

5 Related Work

Differentiable Programming. The recent rise of deep learning has motivated the development of differentiable programming libraries for deep NNs, most notably auto-differentiation frameworks such as Theano (Bergstra et al., 2010), TensorFlow (Abadi et al., 2016), and PyTorch (Paszke et al., 2017). However, physical simulation requires complex and customizable operations due to its intrinsic computational irregularity. Using the aforementioned frameworks, programmers have to compose these coarse-grained basic operations into the desired complex operations, which often leads to unsatisfactory performance.

Earlier work on automatic differentiation focuses on transforming existing scalar code to obtain derivatives (e.g. Utke et al. (2008); Hascoet & Pascual (2013); Pearlmutter & Siskind (2008)). More recently, modern programming languages have begun to support differentiable function transformations through annotation (e.g. Innes et al. (2019); Wei et al. (2019)). These frameworks enable differentiating general programming languages, yet they provide limited parallelism.

Differentiable array programming languages such as Halide (Ragan-Kelley et al., 2013; Li et al., 2018b), Autograd (Maclaurin et al., 2015), JAX (Bradbury et al., 2018), and Enoki (Jakob, 2019) operate on arrays instead of scalars to exploit parallelism. Instead of operating on immutable arrays, DiffTaichi uses an imperative style with flexible indexing to make porting existing physical simulation algorithms easier.

Differentiable Physical Simulators. Building differentiable simulators for robotics and machine learning has recently increased in popularity.
Without differentiable programming, Battaglia et al. (2016), Chang et al. (2016), and Mrowca et al. (2018) used NNs to approximate the physical process and used the NN gradients as approximate simulation gradients. Degrave et al. (2016) and de Avila Belbute-Peres et al. (2018b) used Theano and PyTorch, respectively, to build differentiable rigid body simulators. Schenck & Fox (2018) differentiate position-based fluids using custom CUDA kernels. Popović et al. (2000) used a differentiable rigid body simulator for manipulating physically based animations. The ChainQueen differentiable elastic object simulator (Hu et al., 2019b) implements forward and gradient versions of continuum mechanics in hand-written CUDA kernels, leading to performance two orders of magnitude higher than a pure TensorFlow implementation. Liang et al. (2019) built a differentiable cloth simulator for material estimation and motion control. The deep learning community also often incorporates differentiable rendering operations (OpenDR (Loper & Black, 2014), N3MR (Kato et al., 2018), redner (Li et al., 2018a), Mitsuba 2 (Nimier-David et al., 2019)) to learn from 3D scenes.

6 Conclusion

We have presented DiffTaichi, a new differentiable programming language designed specifically for building high-performance differentiable physical simulators. Motivated by the need to support megakernels, imperative programming, and flexible indexing, we developed a tailored two-scale automatic differentiation system. We used DiffTaichi to build 10 simulators and integrated them into deep neural networks, demonstrating the performance and productivity of DiffTaichi over existing systems. We hope our programming language can greatly lower the barrier to future research on differentiable physical simulation in the machine learning and robotics communities.
Bibliography

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. 2016.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., volume 1, pp. 3–10, 2010.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Erin Catto. Modeling and solving constraints. In Game Developers Conference, pp. 16, 2009.

Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. ICLR, 2016.

Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, pp. 7178–7189, 2018a.

Filipe de Avila Belbute-Peres, Kevin A Smith, Kelsey Allen, Joshua B Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. In Neural Information Processing Systems, 2018b.

Jonas Degrave, Michiel Hermans, Joni Dambre, et al. A differentiable physics engine for deep learning in robotics. arXiv preprint, 2016.

Ronald M Errico. What is an adjoint model?
Bulletin of the American Meteorological Society, 78(11):2577–2592, 1997.

Richard Gordon, Robert Bender, and Gabor T Herman. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography. Journal of Theoretical Biology, 29(3):471–481, 1970.

Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105. SIAM, 2008.

Laurent Hascoet and Valérie Pascual. The Tapenade automatic differentiation tool: Principles, model, and specification. 39(3):20:1–20:43, 2013.

Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. 37(4):150, 2018.

Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. Taichi: A language for high-performance computation on spatially sparse data structures. In SIGGRAPH Asia 2019 Technical Papers, pp. 201. ACM, 2019a.

Yuanming Hu, Jiancheng Liu, Andrew Spielberg, Joshua B Tenenbaum, William T Freeman, Jiajun Wu, Daniela Rus, and Wojciech Matusik. ChainQueen: A real-time differentiable physical simulator for soft robotics. Proceedings of IEEE International Conference on Robotics and Automation (ICRA), 2019b.

Mike Innes, Alan Edelman, Keno Fischer, Chris Rackauckas, Elliot Saba, Viral B Shah, and Will Tebbutt. Zygote: A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint, 2019.

Wenzel Jakob. Enoki: structured vectorization and differentiation on modern processor architectures, 2019. https://github.com/mitsuba-renderer/enoki.

Chenfanfu Jiang. The material point method for the physics-based simulation of solids and fluids. University of California, Los Angeles, 2015.

Ralf Karrenberg and Sebastian Hack. Whole-function vectorization.
In Code Generation and Optimization, pp. 141–150, 2011.

Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3907–3916, 2018.

Stig Larsson and Vidar Thomée. Partial differential equations with numerical methods, volume 45. Springer Science & Business Media, 2008.

Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, pp. 222. ACM, 2018a.

Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Frédo Durand, and Jonathan Ragan-Kelley. Differentiable programming for image processing and deep learning in Halide. ACM Transactions on Graphics (TOG), 37(4):139, 2018b.

Junbang Liang, Ming C Lin, and Vladlen Koltun. Differentiable cloth simulation for inverse problems. Advances in Neural Information Processing Systems, 2019.

Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pp. 154–169. Springer, 2014.

Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, volume 238, 2015.

Aleka McAdams, Andrew Selle, Rasmus Tamstorf, Joseph Teran, and Eftychios Sifakis. Computing the singular value decomposition of 3x3 matrices with minimal branching and elementary floating point operations. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2011.

Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Joshua B Tenenbaum, and Daniel LK Yamins. Flexible neural representation for physics prediction. arXiv preprint arXiv:1806.08047, 2018.

Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. Mitsuba 2: A retargetable forward and inverse renderer. Transactions on Graphics (Proceedings of SIGGRAPH Asia), 38(6), November 2019.
doi: 10.1145/3355089.3356498.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems, 30(2):7:1–7:36, 2008.

Matt Pharr and William R Mark. ispc: A SPMD compiler for high-performance CPU programming. In Innovative Parallel Computing, pp. 1–13, 2012.

Jovan Popović, Steven M Seitz, Michael Erdmann, Zoran Popović, and Andrew Witkin. Interactive manipulation of rigid body simulations. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 209–217. ACM Press/Addison-Wesley Publishing Co., 2000.

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not., 48(6):519–530, June 2013.

Stéphane Redon, Abderrahmane Kheddar, and Sabine Coquillart. Fast continuous collision detection between rigid bodies. In Computer Graphics Forum, volume 21, pp. 279–287. Wiley Online Library, 2002.

Connor Schenck and Dieter Fox. SPNets: Differentiable fluid dynamics for deep neural networks. arXiv preprint arXiv:1806.06094, 2018.

Jos Stam. Stable fluids. In SIGGRAPH, volume 99, pp. 121–128, 1999.

Andre Pradhana Tampubolon, Theodore Gast, Gergely Klár, Chuyuan Fu, Joseph Teran, Chenfanfu Jiang, and Ken Museth. Multi-species simulation of porous sand and water mixtures. ACM Transactions on Graphics (TOG), 36(4):105, 2017.

Jean Utke, Uwe Naumann, Mike Fagan, Nathan Tallent, Michelle Strout, Patrick Heimbach, Chris Hill, and Carl Wunsch.
OpenAD/F: A modular open-source tool for automatic differentiation of Fortran codes. 34(4):18, 2008.

Jui-Hsien Wang, Ante Qu, Timothy R Langlois, and Doug L James. Toward wave-based sound synthesis for computer animation. ACM Trans. Graph., 37(4):109, 2018.

Richard Wei, Dan Zheng, Marc Rasi, and Bart Chrzaszcz. Differentiable programming mega-proposal. https://github.com/apple/swift/blob/master/docs/DifferentiableProgramming.md, 2019. Accessed: 2019-09-25.

R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, August 1964.

A COMPARISON WITH EXISTING SYSTEMS

Workload differences between deep learning and differentiable physical simulation. Existing differentiable programming tools for deep learning are typically centered around large data blobs. For example, in AlexNet, the second convolution layer has size 27 × 27 × 128 × 128. These tools usually provide users with both low-level operations such as tensor add and mul, and high-level operations such as convolution. The bottleneck of typical deep-learning-based computer vision tasks is convolution, so the provided high-level operations, with very high arithmetic intensity [2], can fully exploit hardware capability. However, the provided operations are the "atoms" of these differentiable programming tools and cannot be further customized. Users often have to compose their desired high-level operations out of low-level ones, which introduces many temporary buffers and potentially excessive GPU kernel launches. As shown in Hu et al. (2019b), a pure TensorFlow implementation of a complex physical simulator is 132× slower than a CUDA implementation, due to excessive GPU kernel launches and the lack of producer-consumer locality [3]. The table below compares DiffTaichi with existing tools for building differentiable physical simulators.
Table 3: Comparisons between DiffTaichi and other differentiable programming tools. Note that this table only discusses features related to differentiable physical simulation; the other tools may not have been designed for this purpose. For example, PyTorch and TensorFlow are designed for classical deep learning tasks and have proven successful in their target domains. Also note that the XLA backend of TensorFlow and the JIT feature of PyTorch allow them to fuse operators to some extent, but for simulation we want complete operator fusion within megakernels. "Swift" AD (Wei et al., 2019) is partially implemented as of November 2019. "Julia" refers to Innes et al. (2019).

[Table 3 body: rows are the features GPU, Megakernels, Imperative Scheme, Parallelism, and Flexible Indexing; columns are DiffTaichi, PyTorch, TensorFlow, Enoki, JAX, Halide, Julia, and Swift. The per-cell marks did not survive text extraction.]

B DIFFERENTIATING STRAIGHT-LINE TAICHI KERNELS USING SOURCE CODE TRANSFORM

Primal and adjoint kernels. Recall that in DiffTaichi, (primal) kernels are operators that take as input multiple tensors (e.g., X, Y) and output another set of tensors. Mathematically, kernel f has the form

f(X_0, X_1, ..., X_n) = Y_0, Y_1, ..., Y_m.

Kernels usually execute uniform operations on these tensors. When it comes to differentiable programming, a loss function is defined on the final output tensors. The gradients of the loss function L with respect to each tensor are stored in adjoint tensors and computed via adjoint kernels. The adjoint tensor of (primal) tensor X_{ijk} is denoted X*_{ijk}; its entries are defined by X*_{ijk} = ∂L/∂X_{ijk}. At a high level, our automatic differentiation (AD) system transforms a primal kernel into its adjoint form via reverse-mode automatic differentiation:

(primal)  f(X_0, X_1, ..., X_n) = Y_0, Y_1, ..., Y_m
(adjoint) f*(X_0, X_1, ..., X_n, Y*_0, Y*_1, ..., Y*_m) = X*_0, X*_1, ..., X*_n.
[2] FLOPs per byte loaded from/stored to main memory.
[3] The CUDA kernels in Hu et al. (2019b) have much higher arithmetic intensity than the TensorFlow computational graph system. In other words, when implementing in CUDA, intermediate results are cached in registers, while in TensorFlow they are "cached" in main memory.

Differentiating within kernels: the "make_adjoint" pass (reverse-mode AD). After the preprocessing passes, which flatten branching and eliminate mutable local variables, the "make_adjoint" pass transforms a forward evaluation (primal) kernel into its gradient accumulation ("adjoint") kernel. It takes straight-line code directly and operates on the hierarchical intermediate representation (IR) of Taichi [4]. Multiple outer for loops are allowed in the primal kernel; the Taichi compiler distributes these parallel iterations onto CPU/GPU threads. During the "make_adjoint" pass, for each SSA instruction, a local adjoint variable is allocated to accumulate gradient contributions. The compiler traverses the statements in reverse order and accumulates the gradients into the corresponding adjoint local variables. For example, the 1D array operation y_i = sin(x_i^2) has the following IR representation:

```
for i in range(0, 16, step 1):
    %1 = load x[i]
    %2 = mul %1, %1
    %3 = sin(%2)
    store y[i] = %3
```

The primal kernel above is transformed into the following adjoint kernel:

```
for i in range(0, 16, step 1):
    // adjoint variables
    %1adj = alloca 0.0
    %2adj = alloca 0.0
    %3adj = alloca 0.0
    // original forward computation
    %1 = load x[i]
    %2 = mul %1, %1
    %3 = sin(%2)
    // reverse accumulation
    %4 = load y_adj[i]
    %3adj += %4
    %5 = cos(%2)
    %2adj += %3adj * %5
    %1adj += 2 * %1 * %2adj
    atomic add x_adj[i], %1adj
```

Note that for clarity the transformed code is not strictly SSA here; the actual IR has more instructions.
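For intuition, the adjoint transformation above can be replayed in plain Python. This is an illustrative sketch of our own (DiffTaichi performs this transformation at the IR level, not in Python); the names `primal`/`adjoint` and the loss L = Σ y_i are our choices, not part of the system:

```python
import math

n = 16
x = [0.1 * i for i in range(n)]
y = [0.0] * n
x_adj = [0.0] * n  # X*: dL/dx, accumulated by the adjoint kernel
y_adj = [1.0] * n  # Y*: dL/dy; with L = sum(y), every entry is 1

def primal():
    # mirrors: %1 = load x[i]; %2 = mul %1, %1; %3 = sin(%2); store y[i]
    for i in range(n):
        t1 = x[i]
        t2 = t1 * t1
        y[i] = math.sin(t2)

def adjoint():
    for i in range(n):
        # recompute the forward values the reverse pass needs
        t1 = x[i]
        t2 = t1 * t1
        # reverse accumulation, statement by statement in reverse order
        t3_adj = y_adj[i]                # %3adj += load y_adj[i]
        t2_adj = t3_adj * math.cos(t2)   # %2adj += %3adj * cos(%2)
        t1_adj = 2 * t1 * t2_adj         # %1adj += 2 * %1 * %2adj
        x_adj[i] += t1_adj               # atomic add x_adj[i], %1adj

primal()
adjoint()
```

The accumulated adjoints can be checked against the analytic derivative d(sin(x²))/dx = 2x cos(x²).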
A subsequent simplification pass removes the redundant instructions generated by the AD pass.

[4] Taichi uses a hierarchical static single assignment (SSA) intermediate representation (IR) as its internal program representation. The Taichi compiler applies multiple transform passes to lower and simplify the SSA IR in order to generate high-performance binary code.

C COMPLEX KERNELS

Here we demonstrate how to use complex kernels to override the automatic differentiation system. We use the singular value decomposition (SVD) of 3 × 3 matrices (M = UΣV*) as an example. Fast SVD solvers used in physical simulation are often iterative, yet directly evaluating the gradient of this iterative process is likely numerically unstable. Suppose we use McAdams et al. (2011) as the forward SVD solver and the method in Jiang (2015) (Section 2.1.1.2) to evaluate the gradients; the complex kernels are used as follows:

```python
# Do Singular Value Decomposition (SVD) on n matrices
@ti.kernel
def iterative_svd(num_iterations: ti.i32):
    for i in range(n):
        input = matrix_M[i]
        for iter in range(num_iterations):
            ...  # iteratively solve SVD using McAdams et al. 2011
        matrix_U[i] = ...
        matrix_Sigma[i] = ...
        matrix_V[i] = ...

# A custom complex kernel that wraps the iterative SVD kernel
@ti.complex_kernel
def svd_forward(num_iterations):
    iterative_svd(num_iterations)

@ti.kernel
def svd_gradient():
    for i in range(n):
        ...  # implement, for example, Section 2.1.1.2 of Jiang (2015)

# A complex kernel registered as the gradient of the svd_forward complex kernel
@ti.complex_kernel_grad(svd_forward)
def svd_backward(num_iterations):
    # differentiate SVD
    svd_gradient()
```

D CHECKPOINTING

In this section we demonstrate how to use checkpointing via complex kernels. The goal of checkpointing is to trade recomputation for memory space.
We demonstrate this using the diffmpm example, whose simulation cycle consists of a particle-to-grid transform (p2g), grid boundary conditions (grid_op), and a grid-to-particle transform (g2p). We assume the simulation has O(n) time steps.

D.1 RECOMPUTATION WITHIN TIME STEPS

A naive implementation without checkpointing allocates O(n) copies of the simulation grid, which can cost a lot of memory space. However, if we recompute the grid states during the backward simulation time step by redoing p2g and grid_op, we can reuse the grid states and allocate only one copy. This checkpointing optimization is demonstrated in the code below:

```python
@ti.complex_kernel
def advance(s):
    clear_grid()
    compute_actuation(s)
    p2g(s)
    grid_op()
    g2p(s)

@ti.complex_kernel_grad(advance)
def advance_grad(s):
    clear_grid()
    p2g(s)
    grid_op()  # recompute the grid
    g2p.grad(s)
    grid_op.grad()
    p2g.grad(s)
    compute_actuation.grad(s)
```

D.2 SEGMENT-WISE RECOMPUTATION

Given a simulation with O(n) time steps, if all simulation steps are recorded, the space consumption is O(n). This linear space consumption is sometimes too large for high-resolution simulations with long time horizons. Fortunately, we can reduce the space consumption using a segment-wise checkpointing trick: we split the simulation into segments of S steps, and in forward simulation store only the first simulation state of each segment. During backpropagation, when we need the remaining simulation states in a segment, we recompute them from the first state of that segment. If the segment size is O(S), we only need to store O(n/S) simulation steps for checkpoints and O(S) reusable simulation steps for backpropagation within segments. The total space consumption is O(S + n/S). Setting S = O(√n) reduces memory consumption from O(n) to O(√n), while the time complexity remains O(n).
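The segment-wise scheme can be sketched in plain Python for a toy scalar "simulation" x_{t+1} = f(x_t). Here f, df, the segment size, and the loss L = x_n are illustrative choices of ours, not DiffTaichi API:

```python
import math

def f(x):                  # one forward "simulation step"
    return math.sin(x) + 0.1 * x

def df(x):                 # its derivative w.r.t. the input state
    return math.cos(x) + 0.1

def backprop_checkpointed(x0, n, S):
    """Backpropagate dL/dx0 (L = x_n) storing only O(n/S + S) states."""
    assert n % S == 0
    # forward pass: keep one checkpoint per segment
    checkpoints = [x0]
    x = x0
    for t in range(n):
        x = f(x)
        if (t + 1) % S == 0 and t + 1 < n:
            checkpoints.append(x)
    # backward pass: recompute each segment's states from its checkpoint
    grad = 1.0
    for seg in reversed(range(n // S)):
        xs = [checkpoints[seg]]
        for _ in range(S):
            xs.append(f(xs[-1]))       # O(S) recomputation per segment
        for x_t in reversed(xs[:-1]):  # chain rule through the segment
            grad *= df(x_t)
    return grad

def backprop_full(x0, n):
    """Reference implementation: store all O(n) states."""
    xs = [x0]
    for _ in range(n):
        xs.append(f(xs[-1]))
    grad = 1.0
    for x_t in reversed(xs[:-1]):
        grad *= df(x_t)
    return grad
```

Both versions return the same gradient; only the memory footprint differs.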
E DETAILS ON 10 DIFFERENTIABLE SIMULATORS

E.1 DIFFERENTIABLE CONTINUUM MECHANICS FOR ELASTIC OBJECTS [diffmpm]

Figure 6: Controller optimization with our differentiable continuum simulators. Left: the 2D robot with four muscles. Middle: a 3D robot with 16 muscles and 30K particles crawling on the ground. [Reproduce: python3 [diffmpm/diffmpm3d].py] Right: we couple the robot (30K particles) with the liquid simulator (13K particles) and optimize its open-loop controller in this difficult situation. [Reproduce: python3 liquid.py]

E.2 DIFFERENTIABLE LIQUID SIMULATOR [liquid]

We follow the weakly compressible fluid model in Tampubolon et al. (2017) and implemented a 3D differentiable liquid simulator within the [diffmpm3d] framework. Our liquid simulation can be two-way coupled with elastic object simulation (Figure 6, right).

E.3 DIFFERENTIABLE INCOMPRESSIBLE FLUID SIMULATOR [smoke]

Figure 7: (a) smoke: (left to right) with an optimized initial smoke velocity field, the fluid changes its pattern to a "Taichi" symbol. [Reproduce: python3 smoke_taichi.py] (b) wave: unoptimized (top three) and optimized (bottom three) waves at time steps 3, 189, and 255. [Reproduce: python3 wave.py]

Backpropagating through pressure projection. We followed the baseline implementation in Autograd and used 10 Jacobi iterations for pressure projection. Technically, 10 Jacobi iterations are not sufficient to make the velocity field fully divergence-free. However, in this example it does a decent job, and we are able to successfully backpropagate through the unrolled 10 Jacobi iterations. In larger-scale simulations, 10 Jacobi iterations are likely not sufficient. Assuming the Poisson solve is done by an iterative solver (e.g.
multigrid preconditioned conjugate gradients, MGPCG) with 5 multigrid levels and 50 conjugate gradient iterations, automatic differentiation will likely not be able to provide gradients with sufficient numerical accuracy across this long iterative process. The accuracy is likely worse when conjugate gradients are present, as they are known to drift numerically as the number of iterations increases. In this case, the user can still use DiffTaichi to implement the forward MGPCG solver, while implementing the backward part of the Poisson solve manually, likely using adjoint methods (Errico, 1997). DiffTaichi provides "complex kernels" to override the built-in AD system, as shown in Appendix C.

E.4 DIFFERENTIABLE HEIGHT FIELD SHALLOW WATER SIMULATOR [wave]

We adopt the wave equation in Wang et al. (2018) to model shallow water height field evolution:

    \ddot{u} = c^2 \nabla^2 u + c\alpha \nabla^2 \dot{u},    (1)

where u is the height of the shallow water, c is the "speed of sound", and α is a damping coefficient. We use the \dot{u} and \ddot{u} notations for the first- and second-order partial derivatives of u with respect to time t, respectively. Wang et al. (2018) used the finite-difference time-domain (FDTD) method (Larsson & Thomée, 2008) to discretize Eqn. 1, yielding the update scheme

    u_{t,i,j} = 2u_{t-1,i,j} + (c^2 \Delta t^2 + c\alpha \Delta t)(\nabla^2 u)_{t-1,i,j} - u_{t-2,i,j} - c\alpha \Delta t (\nabla^2 u)_{t-2,i,j},

where

    (\nabla^2 u)_{t,i,j} = \frac{-4u_{t,i,j} + u_{t,i,j+1} + u_{t,i,j-1} + u_{t,i+1,j} + u_{t,i-1,j}}{\Delta x^2}.

We implemented this wave simulator in DiffTaichi to simulate shallow water. We used a grid of resolution 128 × 128 and 256 time steps. The loss function is defined as

    L = \sum_{i,j} \Delta x^2 (u_{T,i,j} - \hat{u}_{i,j})^2,

where T is the final time step and \hat{u} is the target height field. 200 gradient descent iterations are then used to optimize the initial height field. We set \hat{u} to be the pattern "Taichi", and Fig.
7b shows the unoptimized and optimized wave evolution.

E.5 DIFFERENTIABLE MASS-SPRING SYSTEM [mass_spring]

We extend the mass-spring system in the main text with ground collision and a NN controller. The time-of-impact fix is implemented for improved gradients. The optimization goal is to maximize the distance moved forward within 2048 time steps. We designed three mass-spring robots, as shown in Fig. 8 (left).

E.6 DIFFERENTIABLE BILLIARD SIMULATOR [billiards]

A differentiable rigid body simulator is built for optimizing a billiards strategy (Fig. 8, middle). We used forward Euler for the billiard ball motion, and conservation of momentum and kinetic energy for collision resolution.

Figure 8: Left: three mass-spring robots. The red and blue springs are actuated. A two-layer NN is used as the controller. [Reproduce: python3 mass_spring.py [1/2/3] train] Middle: optimizing billiards. The optimizer adjusts the initial position and velocity of the white ball so that the blue ball will reach the target destination (black dot). [Reproduce: python3 billiards.py] Right: optimizing a walking robot. The rigid robot is controlled with a NN controller and learned to walk in 20 gradient descent iterations. [Reproduce: python3 rigid_body.py]

E.7 DIFFERENTIABLE RIGID BODY SIMULATOR [rigid_body]

Are rigid body collisions differentiable? It is worth noting that discontinuities can occur in rigid body collisions, and at a countable number of discontinuities the objective function is non-differentiable. However, apart from these discontinuities, the process is still differentiable almost everywhere.
The situation of rigid body collision is somewhat similar to the "ReLU" activation function in neural networks: at the point x = 0, ReLU is not differentiable (although continuous), yet it is still widely adopted. The rigid body simulation case is more complex than ReLU, as we have not only non-differentiable points but also discontinuous points. Based on our experiments with these impulse-based rigid body simulators, we still find the gradients useful for optimization despite the discontinuities, especially with our time-of-impact fix.

E.8 DIFFERENTIABLE WATER RENDERER [water_renderer]

We implemented differentiable renderers to visualize the refracting water surfaces from wave. We use finite differences to reconstruct the water surface models based on the input height field and refract camera rays to sample the images, using bilinear interpolation for meaningful gradients. To show that our system works well with other differentiable programming systems, we use an adversarial optimization goal: fool VGG-16 into thinking that the refracted squirrel image is a goldfish (Fig. 9).

Figure 9: This three-stage program (simulation, rendering, recognition) is end-to-end differentiable. Our optimized initial water height field evolves to form a refraction pattern that perturbs the image into one that fools VGG16 (99.91% goldfish). [Reproduce: python3 water_renderer.py]

E.9 DIFFERENTIABLE VOLUME RENDERER [volume_renderer]

We implemented a basic volume renderer that simply uses ray marching (ignoring light, scattering, etc.) to integrate a density field over each camera ray. In this task, we render a number of target images from different viewpoints, with the camera rotated around the given volume.
The goal is then to optimize for the density field of the volume that would produce these target images: we render candidate images from the same viewpoints and compute an L2 loss between them and the target images, before performing gradient descent on the density field (Fig. 10). Essentially, this demonstrates how to use gradients to reconstruct 3D objects from X-ray images in a brute-force manner. Other approaches to this task include algebraic reconstruction techniques (ART) (Gordon et al., 1970).

Figure 10: Volume rendering of a bunny-shaped density field. Left: 3 (of the 7) target images. Right: optimized images of the middle bunny after iterations 2, 50, and 100. [Reproduce: python3 volume_renderer.py]

E.10 DIFFERENTIABLE ELECTRIC FIELD SIMULATOR [electric]

Recall Coulomb's law: F = (k q_1 q_2 / r^2) \hat{r}. In the right figure, there are eight electrodes carrying static charge. The red ball also carries static charge. The controller, a two-layer neural network, tries to manipulate the electrodes so that the red ball follows the path of the blue ball. The bigger the electrode, the more positive charge it carries.

F FIXING GRADIENTS WITH TIME OF IMPACT AND CONTINUOUS COLLISION DETECTION

Here is a naive time integrator in the mass-spring system example:

```python
@ti.kernel
def advance(t: ti.i32):
    for i in range(n_objects):
        s = math.exp(-dt * damping)
        new_v = s * v[t - 1, i] + dt * gravity * ti.Vector([0.0, 1.0])
        old_x = x[t - 1, i]
        depth = old_x[1] - ground_height
        if depth < 0 and new_v[1] < 0:
            # assuming a sticky ground (infinite coefficient of friction)
            new_v[0] = 0
            new_v[1] = 0
        # Without considering time of impact, we assume the whole dt uses new_v
        new_x = old_x + dt * new_v
        v[t, i] = new_v
        x[t, i] = new_x
```

Implementing TOI in this system is relatively straightforward:

```python
@ti.kernel
def advance_toi(t: ti.i32):
    for i in range(n_objects):
        s = math.exp(-dt * damping)
        old_v = s * v[t - 1, i] + dt * gravity * ti.Vector([0.0, 1.0])
        old_x = x[t - 1, i]
        new_x = old_x + dt * old_v
        toi = 0.0
        new_v = old_v
        if new_x[1] < ground_height and old_v[1] < -1e-4:
            # The 1e-4 safeguard is important for numerical stability
            toi = -(old_x[1] - ground_height) / old_v[1]  # compute the time of impact
            new_v = ti.Vector([0.0, 0.0])
        # Note that with time of impact, dt is divided into two parts:
        # the first part uses old_v, and the second part uses new_v
        new_x = old_x + toi * old_v + (dt - toi) * new_v
        v[t, i] = new_v
        x[t, i] = new_x
```

In rigid body simulation, the implementation follows the same idea but is slightly more complex. Please refer to rigid_body.py for more details.

G ADDITIONAL TIPS ON GRADIENT BEHAVIORS

Initialization matters: flat lands and local minima in physical processes. A trivial example of an objective flat land appears in billiards. Without proper initialization, gradient descent makes no progress since the gradients are zero (Fig. 11). Also note the local minimum near (−5, 0.03).

Figure 11: Left: the billiard scene used for scanning the initial velocity. Middle: most initial angles yield a flat objective (the final distance between the blue ball and the black destination) of 0.065, since the white ball does not collide with any other balls and has no effect on the pink ball via the chain reaction. Right: a zoomed-in view of the middle figure. The complex collisions lead to many local minima. [Reproduce: python3 billiards.py 1.0/0.23]

In mass_spring and rigid_body, once the robot falls down, gradient descent quickly becomes trapped.
A robot on the ground makes no further progress, no matter how it changes its controller. This leads to a more non-trivial local-minimum and zero-gradient case.

Ideal physical models are only "ideal": discontinuities and singularities. Real-world macroscopic physical processes are usually continuous. However, built upon ideal physical models, even forward physical simulation results can contain discontinuities. For example, in a rigid body model with friction, changing the initial rotation of the box can lead to different corners hitting the ground first, resulting in a discontinuity (Fig. 12). In electric and mass_spring, due to the 1/r^2 and 1/r terms, gradients can be very inaccurate as r → 0 because of numerical precision issues. Note that d(1/r)/dr = −1/r^2, so the gradient is more numerically problematic than the primal for small r. Safeguarding r is critically important for gradient stability.

Figure 12: Friction in rigid body collision is a common source of discontinuity. In this scene a rigid body hits the ground. Slightly rotating the rigid body changes which corner (A/B) hits the ground first, and different normal/friction impulses will be applied to the rigid body. This leads to a discontinuity in its final position (loss = final y coordinate; plotted for coefficients of friction 0 and 1). [Reproduce: python3 rigid_body_discontinuity.py]

Please see our supplemental video for more details.
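The r-safeguard discussed above can be sketched as follows. The clamp threshold eps and the helper names are hypothetical illustrations of ours, not taken from the electric or mass_spring code:

```python
import math

eps = 1e-4  # hypothetical clamp; keeps r (and hence 1/r^2, 1/r^3) bounded

def coulomb_force_mag(k, q1, q2, r):
    # F = k q1 q2 / r^2, with r clamped away from zero
    r_safe = max(r, eps)
    return k * q1 * q2 / r_safe ** 2

def coulomb_force_mag_grad(k, q1, q2, r):
    # dF/dr = -2 k q1 q2 / r^3; the same clamp bounds the gradient,
    # which blows up one order faster than the primal as r -> 0
    r_safe = max(r, eps)
    return -2.0 * k * q1 * q2 / r_safe ** 3

# Without the clamp, r = 0 would divide by zero; with it, both the force
# and its gradient stay finite.
```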
