Method

We begin by formalizing the time lapse video synthesis problem. Given a painting $`x_T`$, our task is to synthesize the past frames $`x_1,\cdots,x_{T-1}`$. Suppose we have a training set of real time lapse videos $`\{\mathbf{x}^{(i)}=x^{(i)}_1, \cdots, x^{(i)}_{T^{(i)}}\}`$. We first define a principled probabilistic model, and then learn its parameters using these videos. At test time, given a completed painting, we sample from the model to create new videos of realistic-looking painting processes.

Model

We propose a probabilistic, temporally recurrent model for changes made during the painting process. At each time instance $`t`$, the model predicts a pixel-wise intensity change $`\delta_t`$ that should be added to the previous frame to produce the current frame; that is, $`x_t = x_{t-1} + \delta_t`$. This change could represent one or multiple physical or digital paint strokes, or other effects such as erasing or fading.
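The additive update above can be made concrete with a toy example. The following is only a sketch: the frames are short lists of grayscale intensities, and the helper names `frame_deltas` and `apply_deltas` are hypothetical, introduced here for illustration.

```python
def frame_deltas(frames):
    """Return the per-pixel change delta_t between consecutive frames."""
    return [
        [cur - prev for prev, cur in zip(f_prev, f_cur)]
        for f_prev, f_cur in zip(frames, frames[1:])
    ]


def apply_deltas(first_frame, deltas):
    """Rebuild a sequence by repeatedly applying x_t = x_{t-1} + delta_t."""
    frames = [list(first_frame)]
    for delta in deltas:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames


# Toy "video": three frames of four grayscale pixels (made-up values).
video = [
    [1.0, 1.0, 1.0, 1.0],   # blank canvas
    [0.5, 1.0, 1.0, 1.0],   # one dark stroke
    [0.5, 0.25, 1.0, 1.0],  # a second stroke
]
deltas = frame_deltas(video)
rebuilt = apply_deltas(video[0], deltas)
```

Each entry of `deltas` is zero wherever the artist left the canvas untouched, which is why a single change can represent one stroke, several strokes, or an erasure.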

We model $`\delta_t`$ as being generated from a random latent variable $`z_t`$, the completed piece $`x_T`$, and the image content at the previous time step $`x_{t-1}`$; the likelihood is $`p_{\theta}(\delta_t|z_t,x_{t-1};x_T)`$ (Figure 1). Using a random variable $`z_t`$ helps to capture the stochastic nature of painting. Using both $`x_T`$ and $`x_{t-1}`$ enables the model to capture time-varying effects such as the progression of coarse to fine brush sizes, while the Markovian assumption facilitates learning from a small number of video examples.

The proposed probabilistic model. Circles represent random variables; the shaded circle denotes a variable that is observed at inference time. The rounded rectangle represents model parameters.

It is common to define such image likelihoods as a per-pixel normal distribution, which results in an L2 image similarity loss term in maximum likelihood formulations. In synthesis tasks, using L2 loss often produces blurry results. We instead design our image similarity loss as the L1 distance in pixel space and the L2 distance in a perceptual feature space. Perceptual losses are commonly used in image synthesis and style transfer tasks to produce sharper and more visually pleasing results. We use the L2 distance between normalized VGG16 features as described in . We let the likelihood take the form:

\begin{align}
&p_{\theta}(\delta_t|z_t,x_{t-1};x_T)\nonumber\\
&\propto e^{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|}\mathcal{N}\big(V(x_{t-1} + \delta_t); V(x_{t-1} + \hat{\delta}_t), \sigma_2^2\mathbbm{I}\big),\label{eq:prob_form_likelihood}
\end{align}

where $`\hat{\delta}_t=g_{\theta}(z_t,x_{t-1},x_T)`$, $`g_{\theta}(\cdot)`$ represents a function parameterized by $`\theta`$, $`V(\cdot)`$ is a function that extracts normalized VGG16 features, and $`\sigma_1,\sigma_2`$ are fixed noise parameters.
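To illustrate how this likelihood turns into a loss, the sketch below evaluates its negative logarithm up to additive constants. It is not the paper's implementation: `toy_features` is a hypothetical stand-in for $`V(\cdot)`$ (it is not VGG16), images are flat lists of floats, and all function names are assumptions made for this example.

```python
def l1_distance(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_l2_distance(a, b):
    """Sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_features(image):
    """Hypothetical stand-in for V(.): neighbouring-pixel differences. NOT VGG16."""
    return [b - a for a, b in zip(image, image[1:])]


def negative_log_likelihood(delta, delta_hat, x_prev, sigma1, sigma2,
                            features=toy_features):
    """-log p(delta | z, x_prev; x_T) up to an additive constant."""
    x_true = [p + d for p, d in zip(x_prev, delta)]
    x_pred = [p + d for p, d in zip(x_prev, delta_hat)]
    # Laplace-like pixel term: L1 distance scaled by 1 / sigma1.
    pixel_term = l1_distance(delta, delta_hat) / sigma1
    # Gaussian feature term: squared L2 distance scaled by 1 / (2 * sigma2^2).
    feature_term = squared_l2_distance(features(x_true), features(x_pred))
    feature_term /= 2.0 * sigma2 ** 2
    return pixel_term + feature_term


x_prev = [1.0, 1.0, 1.0, 1.0]
true_delta = [-0.5, 0.0, 0.0, 0.0]
nll_perfect = negative_log_likelihood(true_delta, true_delta, x_prev, 1.0, 1.0)
nll_no_change = negative_log_likelihood(true_delta, [0.0] * 4, x_prev, 1.0, 1.0)
```

Smaller values of $`\sigma_1`$ and $`\sigma_2`$ weight the corresponding term more heavily, which is how the fixed noise parameters act as loss weights.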

We assume the latent variable $`z_t`$ is generated from the multivariate standard normal distribution:

\begin{align}
p(z_t)&=\mathcal{N}(z_t; 0, \mathbbm{I}).\label{eq:prob_form_prior}
\end{align}

We aim to find model parameters $`\theta`$ that best explain all videos in our dataset:

\begin{align}
&\argmax_\theta \Pi_i \Pi_t p_\theta(\delta^{(i)}_t,x^{(i)}_{t-1}, x^{(i)}_{T^{(i)}})\nonumber\\\vspace{-5pt}
&=\argmax_\theta \Pi_i \Pi_t \int_{z_t}p_\theta(\delta_t^{(i)}|z_t,x_{t-1}^{(i)};x_{T^{(i)}}^{(i)})p(z_t)dz_t.\label{eq:optimization}
\end{align}

This integral is intractable, and the posterior $`p(z_t|\delta_t,x_{t-1};x_T)`$ is also intractable, preventing the use of the EM algorithm. We instead use variational inference and introduce an approximate posterior distribution $`p(z_t|\delta_t,x_{t-1};x_T)\approx q_\phi(z_t|\delta_t,x_{t-1};x_T)`$. We let this approximate distribution take the form of a multivariate normal:

\begin{align}
&q_\phi(z_t|\delta_t,x_{t-1}, x_T)\nonumber\\
&\hspace{5pt}=\mathcal{N}\big(z_t; \mu_\phi(\delta_t,x_{t-1},x_T), \Sigma_\phi(\delta_t,x_{t-1},x_T)\big),\label{eq:prob_form_posterior}
\end{align}

where $`\mu_\phi(\cdot),\Sigma_\phi(\cdot)`$ are functions parameterized by $`\phi`$, and $`\Sigma_\phi(\cdot)`$ is diagonal.
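Because $`\Sigma_\phi`$ is diagonal, a sample from $`q_\phi`$ can be drawn one dimension at a time. The sketch below uses the standard reparameterization $`z = \mu + \sqrt{\Sigma}\,\epsilon`$ with $`\epsilon \sim \mathcal{N}(0, \mathbbm{I})`$. The text above does not state how sampling is implemented, so treat this as an assumption about common CVAE practice. The function name `sample_latent` is hypothetical.

```python
import math
import random


def sample_latent(mu, var, rng):
    """Draw z ~ N(mu, diag(var)) as z = mu + sqrt(var) * eps, eps ~ N(0, 1).

    mu and var are lists holding the mean and the diagonal of the covariance.
    """
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


rng = random.Random(0)  # seeded only to make the example repeatable
z = sample_latent([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], rng)
```

Setting `mu` to zeros and `var` to ones, as in the usage line, reduces this to sampling from the prior $`p(z_t)`$.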

Neural network framework

Neural network architecture. We implement our model using a conditional variational autoencoder framework. At training time, the network is encouraged to reconstruct the current frame $`x_t`$, while sampling the latent $`z_t`$ from a distribution that is close to the standard normal. At test time, the encoding branch is removed, and $`z_t`$ is sampled from the standard normal. We use the shorthand $`\hat{\delta}_t = g_\theta(z_t, x_{t-1}, x_T)`$, $`\hat{x}_t = x_{t-1} + \hat{\delta}_t`$.

We implement the functions $`g_\theta`$, $`\mu_\phi`$ and $`\Sigma_\phi`$ as convolutional encoder-decoders parameterized by $`\theta`$ and $`\phi`$, using a conditional variational autoencoder (CVAE) framework. We use an architecture similar to . We summarize our architecture in Figure 2 and include full details in the appendix.

Learning

We learn model parameters using short sequences from the training video dataset, which we discuss in further detail in Section 20.1. We use two stages of optimization to facilitate convergence: pairwise optimization, followed by sequence optimization.

Pairwise optimization

From Equations [eq:optimization] and [eq:prob_form_posterior], we obtain an expression for each pair of consecutive frames (a derivation is provided in the appendix):

\begin{align}
&\log  p_\theta(\delta_t,x_{t-1}, x_T)\nonumber\\
&\geq\mathbb{E}_{z_t\sim q_\phi(z_t|x_{t-1}, \delta_t;x_T)}\big[\log p_\theta(\delta_t|z_t,x_{t-1};x_T)\big]\nonumber\\
&\hspace{10pt}- KL[q_\phi(z_t|\delta_t,x_{t-1};x_T)||p(z_t)],\label{eq:elbo}
\end{align}

where $`KL[\cdot||\cdot]`$ denotes the Kullback-Leibler divergence. Combining Equations [eq:prob_form_likelihood], [eq:prob_form_prior], [eq:prob_form_posterior], and [eq:elbo], we minimize:

\begin{align}
&\mathcal{L}_{KL} + \frac{1}{\sigma_1} \mathcal{L}_{L1}(\delta_t, \hat{\delta}_t) \nonumber\\
&+\frac{1}{2\sigma_2^2} \mathcal{L}_{L2}(V(x_{t-1} + \delta_t), V(x_{t-1} + \hat{\delta}_t)),\label{eq:loss_pairwise}
\end{align}

where $`\mathcal{L}_{KL}= \frac{1}{2}\big(- \log\Sigma_\phi + \Sigma_\phi + \mu_\phi^2\big)`$, and the image similarity terms $`\mathcal{L}_{L1},\mathcal{L}_{L2}`$ represent L1 and L2 distance respectively.
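As a rough numerical check of the KL term, the sketch below evaluates the closed-form divergence between a diagonal Gaussian and the standard normal. Two details are assumptions of this example rather than statements from the text: the sum over latent dimensions is written explicitly, and the constant $`-1`$ per dimension is kept so that the value is exactly zero when the posterior equals the prior. That constant does not affect the minimizer. The function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)], summed over latent dimensions."""
    return 0.5 * sum(
        -math.log(v) + v + m * m - 1.0  # the -1 is constant in mu and var
        for m, v in zip(mu, var)
    )


kl_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # posterior == prior
kl_shifted = kl_to_standard_normal([1.0], [1.0])          # mean shifted by one
```

The divergence grows as the mean moves away from zero or the variance moves away from one, which is what pulls the encoder's output toward the standard normal during training.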

We optimize Equation [eq:loss_pairwise] on single time steps, which we obtain by sampling all pairs of consecutive frames from the dataset. We also train the model to produce the first frame $`x_1`$ from videos that begin with a blank canvas, given a white input frame $`x_{blank}`$, and $`x_T`$. These starter sequences teach the model how to start a painting at inference time.
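The construction of these single-step training examples can be sketched as follows. This is a minimal illustration, assuming each video is a list of frames and that the caller knows whether the recording starts from a blank canvas. The function name `training_pairs` and the tuple layout are hypothetical.

```python
def training_pairs(video, blank_frame=None):
    """Return (x_prev, x_cur, x_final) examples for one video.

    If blank_frame is given, a starter example (x_blank, x_1, x_T) is added
    so the model also learns how to begin a painting.
    """
    final = video[-1]
    pairs = [(prev, cur, final) for prev, cur in zip(video, video[1:])]
    if blank_frame is not None:
        pairs.insert(0, (blank_frame, video[0], final))
    return pairs


# Toy video of three two-pixel frames (made-up values); white is 1.0 here.
video = [[0.9, 1.0], [0.5, 1.0], [0.5, 0.2]]
blank = [1.0, 1.0]
pairs = training_pairs(video, blank_frame=blank)
```

Every example carries the completed painting $`x_T`$, since the model conditions on it at every time step.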

Sequential CVAE training. Our model is trained to reconstruct a real frame (outlined in green) while building upon its previous predictions for S time steps.

Sequence optimization

To synthesize an entire video, we run our model recurrently for multiple time steps, building upon its own predicted frames. It is common when making sequential predictions to observe compounding errors or artifacts over time. We use a novel training scheme to encourage outputs of the model to be accurate and realistic over multiple time steps. We alternate between two sequential training modes.

Sequential CVAE training encourages sequences of frames to be well-captured by the learned distribution, by reducing the compounding of errors. We train the model sequentially for several time steps, predicting each intermediate frame $`\hat{x}_t`$ using the model’s prediction from the previous time step: $`\hat{x}_t=\hat{x}_{t-1} + g_{\theta}(z_t,\hat{x}_{t-1},x_T)`$ for $`z_t \sim q_\phi(z_t|x_t-\hat{x}_{t-1},\hat{x}_{t-1}, x_T)`$. We compare each predicted frame to its corresponding real frame using the image similarity losses in Equation [eq:loss_pairwise]. We illustrate this in Figure 3.

Sequential sampling training encourages random samples from our learned distribution to look like realistic partially completed paintings. During inference (described below), we rely on sampling from the prior $`p(z_t)`$ at each time step to synthesize new videos. A limitation of the variational strategy is the limited coverage of the latent space $`z_t`$ during training, sometimes leading to predictions during inference $`\hat{x}_t=\hat{x}_{t-1} + g_{\theta}(z_t,\hat{x}_{t-1},x_T)`$, $`z_t\sim p(z_t)`$ that are unrealistic. To compensate for this, we introduce supervision on such samples by amending the image similarity term in Equation [eq:elbo] with a conditional critic loss term:
\begin{align}
\mathcal{L}_{critic}=&\mathbb{E}_{z_t\sim p(z_t)}\big[D_\psi\big(\hat{x}_t,\hat{x}_{t-1},x_T\big)\big]\nonumber\\
&- \mathbb{E}_{x_t}\big[D_\psi(x_t,x_{t-1},x_T)\big],
\end{align}
where $`D_\psi(\cdot)`$ is a critic function with parameters $`\psi`$. This critic encourages the distribution of sampled changes $`{\hat{\delta}_t=g_\theta(z_t,\hat{x}_{t-1},x_T),z_t \sim p(z_t)}`$ to match the distribution of training painting changes $`\delta_t`$. We use a critic architecture based on and optimize it using WGAN-GP .
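The two expectations in the critic loss can be estimated by averaging critic scores over a batch of sampled and real frame triplets. The sketch below shows only that averaging. `toy_critic` is a hypothetical scoring function standing in for $`D_\psi`$, and the WGAN-GP gradient penalty is omitted because it requires automatic differentiation.

```python
def critic_loss(critic, sampled_triplets, real_triplets):
    """Mean critic score on sampled frames minus mean score on real frames.

    Each triplet is (x_cur, x_prev, x_final), matching the argument order of D.
    """
    sampled_mean = sum(critic(*t) for t in sampled_triplets) / len(sampled_triplets)
    real_mean = sum(critic(*t) for t in real_triplets) / len(real_triplets)
    return sampled_mean - real_mean


def toy_critic(x_cur, x_prev, x_final):
    """Hypothetical critic: scores a frame higher the closer it is to x_final."""
    return -sum(abs(c - f) for c, f in zip(x_cur, x_final))


x_final = [0.5, 0.5]
x_prev = [1.0, 1.0]
sampled = [([0.0, 0.0], x_prev, x_final)]  # an implausible sampled frame
real = [([0.5, 0.5], x_prev, x_final)]     # a real frame from the training set
loss = critic_loss(toy_critic, sampled, real)
```

Under the usual Wasserstein formulation, the critic is updated to make this quantity more negative, while the generator is updated to raise the critic's score on sampled frames.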

In addition to the critic loss, we apply the image similarity losses (discussed above) after $`\tau`$ time steps, to encourage the model to eventually produce the completed painting. This training scheme is summarized in Figure 4.

Sequential sampling training. We use a conditional frame critic to encourage all frames sampled from our model to look realistic. The image similarity loss on the final frame encourages the model to complete the painting in τ time steps.

Inference: video synthesis

Given a completed painting $`x_T`$ and learned model parameters $`\theta,\phi`$, we synthesize videos by sampling from the model at each time step. Specifically, we synthesize each frame $`\hat{x}_t=\hat{x}_{t-1}+g_\theta(z_t,\hat{x}_{t-1},x_T)`$ using the synthesized previous frame $`\hat{x}_{t-1}`$ and a randomly sampled $`z_t \sim p(z_t)`$. We start each video using $`\hat{x}_{0}=x_{blank}`$, a blank frame.
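The sampling procedure above amounts to a short recurrent loop. The following sketch assumes frames are flat lists of intensities in which white equals 1.0, and it uses `toy_generator` as a hypothetical stand-in for the learned $`g_\theta`$. Neither the function names nor the generator's behavior come from the trained model.

```python
import math
import random


def synthesize_video(x_final, generator, num_steps, latent_dim, rng,
                     blank_value=1.0):
    """Roll the model forward from a blank frame, sampling z_t ~ N(0, I)."""
    frame = [blank_value] * len(x_final)  # x_hat_0 = x_blank
    frames = []
    for _ in range(num_steps):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        delta_hat = generator(z, frame, x_final)
        # x_hat_t = x_hat_{t-1} + g(z_t, x_hat_{t-1}, x_T)
        frame = [p + d for p, d in zip(frame, delta_hat)]
        frames.append(frame)
    return frames


def toy_generator(z, x_prev, x_final):
    """Hypothetical g_theta: move a random fraction of the way toward x_final."""
    step = 1.0 / (1.0 + math.exp(-z[0]))  # sigmoid maps z[0] into (0, 1)
    return [step * (f - p) for p, f in zip(x_prev, x_final)]


rng = random.Random(0)  # seeded only to make the example repeatable
target = [0.2, 0.8, 0.5]
frames = synthesize_video(target, toy_generator, num_steps=5, latent_dim=2, rng=rng)
```

Running the loop again with a different seed yields a different sequence for the same `target`, which mirrors how repeated sampling produces varied videos of one painting.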

Implementation

We implement our model using Keras and TensorFlow. We experimentally selected the hyperparameters controlling the reconstruction loss weights to be $`\sigma_1=0.01`$ and $`\sigma_2=0.1`$, using the validation set.

Introduction

Skilled artists can often look at a piece of artwork and determine how to recreate it. In this work, we explore whether we can use machine learning and computer vision to mimic this ability. We define a new video synthesis problem: given a painting, can we synthesize a time lapse video depicting how an artist might have painted it?

We present a probabilistic model for synthesizing time lapse videos of paintings. We demonstrate our model on Still Life with a Watermelon and Pomegranates by Paul Cezanne (top), and Wheat Field with Cypresses by Vincent van Gogh (bottom).

Artistic time lapses present many challenges for video synthesis methods. There is a great deal of variation in how people create art. Suppose two artists are asked to paint the same landscape. One artist might start with the sky, while the other might start with the mountains in the distance. One might finish each object before moving on to the next, while the other might work a little at a time on each object. During the painting process, there are often few visual cues indicating where the artist will apply the next stroke. The painting process is also long, often spanning hundreds of paint strokes and dozens of minutes.

In this work, we present a solution to the painting time lapse synthesis problem. We begin by defining the problem and describing its unique challenges. We then derive a principled, learning-based model to capture a distribution of steps that a human might use to create a given painting. We introduce a training scheme that encourages the method to produce realistic changes over many time steps. We demonstrate that our model can learn to solve this task, even when trained using a small, noisy dataset of painting time lapses collected from the web. We show that human evaluators almost always prefer our method to an existing video synthesis baseline, and often find our results indistinguishable from time lapses produced by real artists.

This work presents several technical contributions:

We use a probabilistic model to capture stochastic decisions made by artists, thereby capturing a distribution of plausible ways to create a painting.
Unlike work in future frame prediction or frame interpolation, we synthesize long-term videos spanning dozens of time steps and many real-time minutes.
We demonstrate a model that successfully learns from painting time lapses “from the wild.” This data is small and noisy, having been collected from uncontrolled environments with variable lighting, spatial resolution and video capture rates.

Conclusion

In this work, we introduced a new video synthesis problem: making time lapse videos that depict the creation of paintings. We proposed a recurrent probabilistic model that captures the stochastic decisions of human artists. We introduced an alternating sequential training scheme that encourages the model to make realistic predictions over many time steps. We demonstrated our model on digital and watercolor paintings, and used it to synthesize realistic and varied painting videos. Our results, including human evaluations, indicate that the proposed model is a powerful first tool for capturing stochastic changes from small video datasets.

Acknowledgments

We thank Zoya Bylinskii of Adobe Inc. for her insights around designing effective and accurate user studies. This work was funded by Wistron Corporation.

ELBO derivation

We provide the full derivation of our model and losses from Equation [eq:optimization]. We start with our goal of finding model parameters $`\theta`$ that maximize the following probability for all videos and all $`t`$:

\begin{align*}
&p_\theta(\delta_t,x_{t-1}; x_T)\nonumber\\\vspace{-5pt}
\propto &\hspace{1pt}p_\theta(\delta_t|x_{t-1}; x_T)\nonumber\\\vspace{-5pt}
= &\int_{z_t}\hspace{-2pt}p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)dz_t.
\end{align*}

We use variational inference and introduce an approximate posterior distribution $`q_\phi(z_t|\delta_t,x_{t-1};x_T)`$.

\begin{align}
&\int_{z_t}p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)dz_t\nonumber\\
=&\int_{z_t}p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)\frac{q_\phi(z_t|\delta_t,x_{t-1};x_T)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}dz_t\nonumber\\
\propto&\log\int_{z_t}p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)\frac{q_\phi(z_t|\delta_t,x_{t-1};x_T)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}dz_t\nonumber\\
=&\log\int_{z_t}\frac{p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}q_\phi(z_t|\delta_t,x_{t-1};x_T)dz_t\nonumber\\
=&\log\mathbb{E}_{z_t\sim q_\phi(z_t|\delta_t,x_{t-1};x_T)}\bigg[\frac{p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}\bigg].
\end{align}

We use the shorthand $`z_t\sim q_\phi`$ for $`z_t\sim q_\phi(z_t|\delta_t,x_{t-1};x_T)`$, and apply Jensen’s inequality:

\begin{align}
&\log\mathbb{E}_{z_t\sim q_\phi}\bigg[\frac{p_\theta(\delta_t|z_t,x_{t-1};x_T)p(z_t)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}\bigg]\nonumber\\
\geq&\hspace{1pt}\mathbb{E}_{z_t\sim q_\phi}\big[\log p_\theta(\delta_t|z_t,x_{t-1};x_T)\big]\nonumber\\
&+ \mathbb{E}_{z_t\sim q_\phi}\big[\log\frac{p(z_t)}{q_\phi(z_t|\delta_t,x_{t-1};x_T)}\big]\nonumber\\
=&\hspace{1pt}\mathbb{E}_{z_t\sim q_\phi}\big[\log p_\theta(\delta_t|z_t,x_{t-1};x_T)\big]\nonumber\\
&- KL[q_\phi(z_t|\delta_t,x_{t-1};x_T)||p(z_t)],
\end{align}

where $`KL[\cdot||\cdot]`$ is the Kullback-Leibler divergence, arriving at the ELBO presented in Equation [eq:elbo] in the paper.

Combining the first term in Equation [eq:elbo] with our image likelihood defined in Equation [eq:prob_form_likelihood]:

\begin{align}
&\mathbb{E}_{z_t\sim q_\phi} \log p_\theta(\delta_t|z_t,x_{t-1};x_T)\nonumber\\
\propto&\hspace{1pt}\mathbb{E}_{z_t\sim q_\phi}\big[ \log e^{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|}\nonumber\\
&+\log\mathcal{N}\big(V(x_{t-1} + \delta_t); V(x_{t-1} + \hat{\delta}_t), \sigma_2^2\mathbbm{I}\big)\big]\nonumber\\
=&\mathbb{E}_{z_t\sim q_\phi}\bigg[{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|}\nonumber\\
&+\log\frac{1}{\sqrt{2\pi \sigma_2^2}}\exp\big(-\frac{(V(x_{t-1} + \delta_t)- V(x_{t-1} + \hat{\delta}_t))^2}{2\sigma_2^2}\big)\bigg]\nonumber\\
\propto&\hspace{1pt}\mathbb{E}_{z_t\sim q_\phi}\bigg[{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|}\nonumber\\
&-\frac{1}{2\sigma_2^2}(V(x_{t-1} + \delta_t)- V(x_{t-1} + \hat{\delta}_t))^2\bigg],
\end{align}

giving us the image similarity losses in Equation [eq:loss_pairwise]. We derive $`\mathcal{L}_{KL}`$ in Equation [eq:loss_pairwise] by similarly taking the logarithm of the normal distributions defined in Equations [eq:prob_form_prior] and [eq:prob_form_posterior].
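For completeness, that last step can be sketched as follows. The index $`j`$ over latent dimensions, the symbols $`\mu_j`$ and $`\Sigma_j`$ for the $`j`$-th mean and diagonal covariance entry, and $`z_{t,j}`$ for the $`j`$-th component of $`z_t`$ are notation introduced here for illustration. The $`\log 2\pi`$ terms of the two normal densities cancel, and we use $`\mathbb{E}_{z_t\sim q_\phi}[(z_{t,j}-\mu_j)^2]=\Sigma_j`$ and $`\mathbb{E}_{z_t\sim q_\phi}[z_{t,j}^2]=\Sigma_j+\mu_j^2`$:

\begin{align*}
&KL[q_\phi(z_t|\delta_t,x_{t-1};x_T)||p(z_t)]\\
=&\hspace{1pt}\mathbb{E}_{z_t\sim q_\phi}\big[\log q_\phi(z_t|\delta_t,x_{t-1};x_T) - \log p(z_t)\big]\\
=&\sum_j\mathbb{E}_{z_t\sim q_\phi}\bigg[-\frac{1}{2}\log\Sigma_j - \frac{(z_{t,j}-\mu_j)^2}{2\Sigma_j} + \frac{z_{t,j}^2}{2}\bigg]\\
=&\hspace{1pt}\frac{1}{2}\sum_j\big(-\log\Sigma_j + \Sigma_j + \mu_j^2 - 1\big).
\end{align*}

Dropping the constant $`-1`$ and leaving the sum over dimensions implicit recovers the expression for $`\mathcal{L}_{KL}`$ given with Equation [eq:loss_pairwise].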

Related work

To the best of our knowledge, this is the first work that models and synthesizes distributions of videos of the past, given a single final frame. The most similar work to ours is a recent method called visual deprojection. Given a single input image depicting a temporal aggregation of frames, their model captures a distribution of videos that could have produced that image. We compare our method to theirs in our experiments. Here, we review additional related research in three main areas: video prediction, video interpolation, and art synthesis.

Video prediction

Video prediction, or future frame prediction, is the problem of predicting the next frame or few frames of a video given a sequence of past frames. Early work in this area focused on predicting motion trajectories or synthesizing motions in small frames. Recent methods train convolutional neural networks on large video datasets to synthesize videos of natural scenes and human actions. A recent work on time lapse synthesis focuses on outdoor scenes, simulating illumination changes over time while keeping the content of the scene constant. In contrast, creating painting time lapses requires adding content while keeping illumination constant. Another recent time lapse method outputs only a few frames depicting specific physical processes: melting, rotting, or flowers blooming.

Our problem differs from video prediction in several key ways. First, most video prediction methods focus on short time scales, synthesizing frames on the order of seconds into the future, and encompassing relatively small changes. In contrast, painting time lapses span minutes or even hours, and depict dramatic content changes over time. Second, most video predictors output a single most likely sequence, making them ill-suited for capturing a variety of different plausible painting trajectories. One study uses a conditional variational autoencoder to model a distribution of plausible future frames of moving humans. We build upon these ideas to model painting changes across multiple time steps. Finally, video prediction methods focus on natural videos, which depict motions of people and objects or physical processes. The input frames often contain visual cues about how the motion, action or physical process will progress, limiting the space of possibilities that must be captured. In contrast, snapshots of paintings provide few visual cues, leading to many plausible future trajectories.

Video frame interpolation

Our problem can be thought of as a long-term frame interpolation task between a blank canvas and a completed work of art, with many possible painting trajectories between them. In video frame interpolation, the goal is to temporally interpolate between two frames in time. Classical approaches focus on natural videos, and estimate dense flow fields or phase to guide interpolation. More recent methods use convolutional neural networks to directly synthesize the interpolated frame, or combine flow fields with estimates of scene information. Most frame interpolation methods predict a single or a few intermediate frames, and are not easily extended to predicting long sequences, or predicting distributions of sequences.

Art synthesis

The graphics community has long been interested in simulating physically realistic paint strokes in digital media. Many existing methods focus on physics-based models of fluids or brush bristles. More recent learning-based methods leverage datasets of real paint strokes, often posing the artistic stroke synthesis problem as a texture transfer or style transfer problem. Several works focus on simulating watercolor-specific effects such as edge darkening. We focus on capturing large-scale, long-term painting processes, rather than fine-scale details of individual paint strokes.

In style transfer, images are transformed to simulate a specific style, such as a painting-like style or a cartoon-like style. More recently, neural networks have been used for generalized artistic style transfer. We leverage insights from these methods to synthesize realistic progressions of paintings.

Several recent papers apply reinforcement learning or similar techniques to the process of painting. These approaches involve designing parameterized brush strokes, and then training an agent to apply strokes to produce a given painting. Some works focus on specific artistic tasks such as hatching or other repetitive strokes. These approaches require careful hand-engineering, and are not optimized to produce varied or realistic painting progressions. In contrast, we learn a broad set of effects from real painting time lapse data.

Problem overview

Several real painting progressions of similar-looking scenes. Each artist fills in the house, sky and field in a different order.

Given a completed painting, our goal is to synthesize different ways that an artist might have created it. We work with recordings of digital and watercolor painting time lapses collected from video websites. Compared to natural videos of scenes and human actions, videos of paintings present unique challenges.

High Variability

Painting trajectories: Even for the same scene, different artists will likely paint objects in different temporal orders (Figure 6).
Painting rates: Artists work at different speeds, and apply paint in different amounts.
Scales and shapes: Over the course of a painting, artists use strokes that vary in size and shape. Artists often use broad strokes early on, and add fine details later.
Data availability: Due to the limited number of available videos in the wild, it is challenging to gather a dataset that captures the aforementioned types of variability.

Medium-specific challenges

Non-paint effects: In digital art applications (e.g., ), there are many tools that apply local blurring, smudging, or specialized paint brush shapes. Artists can also apply global effects simulating varied lighting or tones.
Erasing effects: In digital art applications, artists can erase or undo past actions, as shown in Figure 7.
Physical effects in watercolor paintings: Watercolor painting videos exhibit distinctive effects resulting from the physical interaction of paint, water, and paper. These effects include specular lighting on wet paint, pigments fading as they dry, and water spreading from the point of contact with the brush (Figure 8).

In this work, we design a learning-based model to handle the challenges of high variability and painting medium-specific effects.

Example digital painting sequences. These sequences show a variety of ways to add paint, including fine strokes and filling (row 1), and broad strokes (row 3). We use red boxes to outline challenges, including erasing (row 2) and drastic changes in color and composition (row 3).

Example watercolor painting sequences. The outlined areas highlight some watercolor-specific challenges, including changes in lighting (row 1), diffusion and fading effects as paint dries (row 2), and specular effects on wet paint (row 3).