
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.19817
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

We seek to answer the question: what can a motion-blurred image reveal about a scene's past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors learned on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken.

Full Content

Fig. 1. (a) Given a motion-blurred input image, our approach uses a large-scale video diffusion model to generate frames that reveal scene motion during the exposure and predict what may have occurred just before and after the image was captured. We illustrate scene motion predicted by our method with (b) an output video frame and (c) tracking from an off-the-shelf method [Karaev et al. 2024]. The resulting videos capture complex scene dynamics, enabling downstream applications including (d) bringing historical images to life: we show insets of three sharp generated video frames (red, green, and blue bars indicate each frame's exposure window) and visualize subtle motions revealed by the video as a 2D motion field computed by RAFT [Teed and Deng 2020].

We can also recover dynamic 3D structure and camera poses by applying a recent structure-from-motion technique to our output video [Li et al. 2025]. Video results are included in the supplemental webpage. Photos: (top) © Thales Antônio, iStock; (bottom) U.S. National Archives and Records, public domain.

1 Introduction

"Only photography has been able to divide human life into a series of moments, each of them has the value of a complete existence." -Eadweard Muybridge (attributed)

A motion-blurred image is produced when the camera or scene moves during an exposure, causing scene content to smear across the image. Typically, motion blur is undesirable as it obscures image details and degrades visual quality, rendering an image unusable for downstream tasks. An alternative view, however, is that motion blur can be highly informative about a scene's dynamics because it encodes spatiotemporal information over the time of capture. As such, blurred images can potentially be exploited to analyze the motions in a scene [Karaev et al. 2024; Wang et al. 2023], to recover 3D scene information [Li et al. 2025], and to draw inferences about what occurred just before or just after a given shot [Vondrick et al. 2016]. Inspired by recent advances in large video diffusion models [Wang et al. 2025], which can generate plausible videos from limited input [Yang et al. 2025], we consider the question: what can a single motion-blurred image reveal about a scene's past, present, and future?

Prior work related to this question has formulated motion blur analysis as an image restoration problem, that is, recovering a single sharp image corresponding to a specific moment within the exposure (the scene's "present"). This is a long-standing, ill-posed inverse problem, initially tackled with classical optimization techniques [Perrone and Favaro 2016] and hand-crafted deblurring priors [Fergus et al. 2006; Levin et al. 2009]. Closer to our line of inquiry, deep networks [Nah et al. 2017] and generative models [Xiao et al. 2024b] have improved deblurring performance by learning a function that maps motion-blurred images to their restored counterparts. Intriguingly, by learning several such restoration functions, each tuned to a different moment within the exposure, it is now possible to map a motion-blurred photo to a short video clip [Jin et al. 2018].

Despite this progress in revealing a scene's present with a video, existing methods often struggle with complex scene dynamics and rapid motions. These methods train on tens of thousands of input-output pairs of blurry and sharp images, implicitly treating them as a prior on scene motion and appearance during the exposure [Pham et al. 2023; Zhong et al. 2023]. However, motion blur in casually-captured photos is far too diverse to be accurately modeled with datasets of this size due to the sheer number of contributing factors, including object deformations, independent motions, occlusions and disocclusions, camera shake, and a wide range of shutter speeds.

Large video diffusion models, on the other hand, are trained on millions of video clips and billions of images. These models have demonstrated an ability to generate photorealistic, temporally-consistent video sequences from as little information as a text prompt [Liu et al. 2024]. Most significantly, they can generate plausible reconstructions of a scene's past or future appearance given a single uncorrupted input image [Brooks et al. 2024; Lu et al. 2024]. Recent work has also shown that these models are highly effective at solving inverse problems in imaging and sensing [Chihaoui and Favaro 2025; Chung et al. 2023; Kawar et al. 2022; Kwon and Ye 2024; Song et al. 2023; Xiao et al. 2024a], effectively acting as general-purpose priors over the space of natural images and videos.

Here, we introduce a method that repurposes such a large, pre-trained video diffusion model [Yang et al. 2025] to synthesize video frames before, during, and after the exposure window of a blurry image, and then uses those frames for tracking and 3D reconstruction (see Figure 1). Our method is specifically designed to (1) leverage large-scale pre-training, (2) allow precise control over the exposure start time and duration of each frame, and (3) enable predictions of the past and future as well. Our formulation essentially treats motion blur analysis as a conditional video generation problem rather than one of image restoration.

Our method is robust and versatile, generalizing to challenging in-the-wild images that include scenes of dancers, concerts, sports events, deforming cloth, moving animals, cityscapes, and nature scenes, and it can even exploit motion blur in historical photos to bring them to life as short video clips. We show that our approach achieves state-of-the-art performance when predicting the present and can capably extrapolate complex scene dynamics into the past and future. Finally, we demonstrate that our output videos reveal complex camera trajectories, intricate motions, and dynamic phenomena from just one image, and can support downstream tasks such as tracking, pose estimation, and multi-view 4D reconstruction.

Blind deconvolution. Similar to our problem, blind deconvolution takes as input a single blurred observation, but seeks to explain it as the convolution of a sharp image with a spatially-invariant motion blur kernel [Cho and Lee 2009; Fergus et al. 2006; Krishnan et al. 2011; Kundur and Hatzinakos 1996; Levin et al. 2009; Shan et al. 2008]. Spatially-varying motion blur can be restored to some extent by constraining the blur kernel to a low-dimensional manifold [Hirsch et al. 2011], incorporating image self-similarity [Michaeli and Irani 2014], or using deep learning [Noroozi et al. 2017; Sun et al. 2015]. Still, there are issues with using such approaches to restore "in the wild" motion blur, where camera motion and scene-dependent effects such as parallax, deformations, occlusions, and disocclusions preclude the use of simple neural models or small datasets.

Video from motion-blurred images. Jin et al. [2018] introduced the problem of restoring several video frames from a blurry image. They found that a key challenge for this task is that the restored frames can take on any order; for example, they could be played forwards, backwards, or in a shuffled order and still reproduce the blurry image when averaged together. To resolve this ambiguity, Jin et al. trained a network to first restore a frame corresponding to the middle of the exposure, and then sequentially restore adjacent frames. Building on this idea, subsequent works have explored a variety of solutions, incorporating carefully designed priors [Li et al. 2021; Pham et al. 2023; Zhang et al. 2021], loss functions [Niu et al. 2021; Purohit et al. 2019; Zhang et al. 2020], and network architectures [Zhong et al. 2022, 2023]. These approaches have trouble handling the blur found in casually-captured photos because their curated datasets are far too limited to be representative of real-world blur (a few thousand video clips at most).

Single-image animation. Similar to our goal, single-frame animation [Holynski et al. 2021; Siarohin et al. 2019] aims to generate video sequences given a sharp image with no motion blur. This is accomplished by representing motion information through motion fields [Holynski et al. 2021], driving videos [Siarohin et al. 2019], or motion textures [Li et al. 2024a], and then warping and rendering the input image. Our work instead aims to derive motion information from the motion blur present within the input image itself, while also deblurring to recover the original sharp frames.

Leveraging large image diffusion models. Internet-scale image datasets [Byeon et al. 2022; Schuhmann et al. 2022] have led to powerful diffusion models that can synthesize photorealistic images from a text prompt [Saharia et al. 2022]. With large diffusion models increasingly becoming available pre-trained and open-source, emerging techniques are repurposing these models as generic image priors for a variety of tasks [Sun et al. 2024; Taubner et al. 2025].

Closest in spirit to our work, Xiao et al. [2024a] and Chihaoui and Favaro [2025] use pre-trained image models to generate high-quality images from degraded inputs without requiring knowledge of the specific degradation process. While our approach also leverages a large-scale pre-trained model in order to remain agnostic about the specific causes of motion blur (i.e., camera motion, scene motion, etc.), our use of a video model is not agnostic to the degradation itself: motion blur is fundamentally due to time-varying appearance over the exposure, and video clips serve as a complete and physically-accurate explanation of this degradation.

Leveraging video diffusion models. Concurrent with our work, Pang et al. [2025] pre-train a video diffusion model on small synthetic and captured datasets in order to generate video from a motion-blurred image of a robot arm. Their method was evaluated only on images with simulated blur and, as in past research on recovering videos from motion-blurred images, does not leverage large-scale pre-trained models as we do.

What do large video diffusion models know about motion blur? Large-scale video models already encode a great deal of information about the relation between 3D geometry and time-varying appearance, despite not being explicitly supervised on such a relation [Brooks et al. 2024; Li et al. 2024b]. Most pertinent to our setting, the size and diversity of their video datasets imply that motion blur of various causes and degrees is already part of their training data. We thus posit that these models already have strong intrinsic priors over the input to our method, a motion-blurred image, and its connection to scene dynamics. Figure 2 shows a preliminary experiment that is highly suggestive of such priors: when given a text prompt and a motion-blurred image as conditioning input, a recent large-scale video diffusion model [Runway AI 2025] predicts future video frames that are consistent with the image's motion blur. We aim to build on this capability by fine-tuning large-scale video diffusion models to predict frames occurring before, during, and after the moment of capture, with precise control over each frame's exposure start time and duration.

We assume a general model for motion blur and use a large, pre-trained video diffusion model to represent the space of natural videos. Under this model, an image $I$ captures the time-varying irradiance $E(t)$ at the sensor plane over its exposure interval $[t_s, t_e]$ as

$$I = g\!\left(\int_{t_s}^{t_e} E(t)\,dt\right), \tag{1}$$

where $g$ is the camera response function [Debevec and Malik 1997]. The pre-trained video diffusion model, in turn, provides a prior over the space of natural videos $V$ [Ho et al. 2022; Xing et al. 2024]:

$$V \sim p(V). \tag{2}$$

Sampling from this distribution is accomplished by initializing the output video with standard Gaussian noise and then using the video model to iteratively denoise it, through a reverse diffusion process [Ho et al. 2020]. To improve computational and memory efficiency, most video models represent their output in the compressed latent space of a pre-trained video encoder [Blattmann et al. 2023a,b]. This enables the model to predict a low-resolution latent video $\tilde{V}$, where each latent frame encodes multiple high-resolution frames of the output video. Once the reverse diffusion process completes, a pre-trained video decoder network recovers the high-resolution output $V$ from the latent video $\tilde{V}$.
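To make the sampling procedure concrete, here is a minimal sketch of DDPM-style ancestral sampling in a latent space. The denoiser, the linear beta schedule, and the latent shape are stand-ins for illustration; the actual method conditions the denoiser on the blurred-image latent and per-frame exposure intervals and decodes the result with the pre-trained VAE.

```python
import torch

def sample_latent_video(denoiser, shape=(4, 16, 45, 80), num_steps=50, device="cpu"):
    """DDPM ancestral sampling in a latent space (illustrative).

    `denoiser(z_t, t)` is assumed to predict the noise present at step t.
    """
    # A simple linear beta schedule; the real model's schedule may differ.
    betas = torch.linspace(1e-4, 2e-2, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(1, *shape, device=device)   # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(z, t)                # predicted noise at step t
        mean = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise
    return z  # latent video; a pre-trained VAE decoder would map this to pixels

# Toy call with a stand-in denoiser:
latents = sample_latent_video(lambda z, t: torch.zeros_like(z), num_steps=10)
```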

4 Generating the Past, Present and Future

We modify a pre-trained video diffusion model to enable conditioning on both (1) a motion-blurred image and (2) a set of exposure intervals corresponding to the individual frames of the output video:

$$\tilde{V} \sim p(\tilde{V} \mid \tilde{I}, \mathcal{T}_1, \ldots, \mathcal{T}_F), \tag{3}$$

where $F$ is the number of generated frames and both the output video $\tilde{V}$ and the motion-blurred image $\tilde{I}$ are encoded in the latent space. This conditioning gives considerable flexibility in the timespan of videos generated by the model. For instance, we can output videos that predict the "present" by choosing $\mathcal{T}_1, \ldots, \mathcal{T}_F$ to all lie within the input image's actual exposure interval. Alternatively, by conditioning the model on intervals that lie outside the actual exposure, we can output videos that extend into the "past" and/or the "future."
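As an illustration of how such exposure intervals might be constructed, the sketch below builds per-frame (start, end) pairs in the normalized units used later in the paper (the input exposure mapped to [-0.5, 0.5]). The symmetric twice-the-exposure span for past/present/future generation and the duty-cycle parameter for dead time are illustrative assumptions, not the paper's exact interface.

```python
import numpy as np

def exposure_intervals(num_frames, span=(-0.5, 0.5), duty_cycle=1.0):
    """Per-frame (start, end) exposure intervals, in units where the input
    image's exposure occupies [-0.5, 0.5]. duty_cycle < 1 leaves dead time
    between consecutive frames."""
    slot = (span[1] - span[0]) / num_frames
    starts = np.linspace(span[0], span[1], num_frames, endpoint=False)
    return [(s, s + duty_cycle * slot) for s in starts]

# "Present": 16 frames tiling the input exposure.
present = exposure_intervals(16)
# "Past/present/future": 16 frames covering twice the exposure (assumed to be
# centered on the input exposure).
past_present_future = exposure_intervals(16, span=(-1.0, 1.0))
```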

Our approach builds on the text-to-video model CogVideoX-2B [Yang et al. 2025]. This is a 2-billion parameter diffusion transformer [Peebles and Xie 2023] that we modify to support the conditioning of Equation 3. Figure 3 shows an overview of the model.

Latent space encoding. We encode the motion-blurred image into a single latent frame using the pre-trained variational auto-encoder (VAE) associated with the CogVideoX model. The VAE compresses the spatial resolution of the input blurry image by a factor of eight, and for videos it also compresses the temporal resolution by a factor of four. Hence, the model processes a latent video $\tilde{V}$ with dimensions $(\tilde{F}, D, H, W)$, where $\tilde{F} = F/4$ is the number of latent frames, $D$ is the dimension of the latent space, and $H$ and $W$ are the height and width of the latent frames. Based on the temporal compression ratio, the $i$-th frame of the latent video is associated with four exposure intervals, which we represent as a vector $\mathcal{T}_i \in \mathbb{R}^8$. For the first latent frame, for example, $\mathcal{T}_1$ contains the start and end times of the first four output frames.
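A small sketch of the grouping implied by the 4x temporal compression: the per-output-frame intervals are packed four at a time into one 8-vector per latent frame. The helper and shapes are illustrative.

```python
import numpy as np

def latent_frame_intervals(frame_intervals):
    """Group per-output-frame (start, end) intervals in blocks of four,
    matching the VAE's 4x temporal compression, yielding one vector in R^8
    per latent frame."""
    intervals = np.asarray(frame_intervals, dtype=np.float32)   # shape (F, 2)
    assert intervals.shape[0] % 4 == 0, "F is assumed to be a multiple of 4"
    return intervals.reshape(-1, 8)                             # shape (F/4, 8)

# Example: 16 output frames tiling the normalized exposure [-0.5, 0.5].
frames = [(-0.5 + i / 16.0, -0.5 + (i + 1) / 16.0) for i in range(16)]
T = latent_frame_intervals(frames)   # T[0] holds the intervals of frames 1-4
```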

Position encoding. Before passing the latent motion-blurred image $\tilde{I}$ and the noise-initialized latent video frames $\tilde{V}$ as input to the model, we concatenate them along the temporal dimension as $[\tilde{I}, \tilde{V}]$.

We then modulate them with a position encoding signal that indicates the spatial and temporal location of each latent pixel. We follow the standard sinusoidal position encoding [Su et al. 2024], adopted by CogVideoX-2B (see Supp. Section S2.1).

Exposure interval encoding. We also apply an exposure interval encoding to encode the start and end times of each latent frame. To compute this encoding, we first assign the normalized interval $\mathcal{T} = [-0.5, 0.5]$ to the exposure of the input motion-blurred image. Intervals $\mathcal{T}_1, \ldots, \mathcal{T}_F$ are then expressed relative to this normalized interval. We encode the four exposure intervals associated with a latent video frame by applying a sinusoidal positional encoding function $\gamma$ to each coordinate of $\mathcal{T}_i$, concatenating the resulting vectors, and projecting via a linear layer.

Following Vaswani et al. [2017] and Mildenhall et al. [2021], we define the sinusoidal positional encoding function to be

$$\gamma(x) = \big(\sin(\nu_1 x), \cos(\nu_1 x), \ldots, \sin(\nu_N x), \cos(\nu_N x)\big),$$

where $\nu_i$ is an encoding frequency and $N$ is the number of such frequencies. Lastly, we encode the time interval of the latent input motion-blurred image in a similar fashion, after first replicating $\mathcal{T}$ four times to create a vector $\bar{\mathcal{T}} = [\mathcal{T}, \mathcal{T}, \mathcal{T}, \mathcal{T}]$ that matches the dimension of $\mathcal{T}_i$. See Supp. Section S2.1 for additional details.
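A minimal sketch of this encoding: γ maps each of the eight interval coordinates to sines and cosines at N frequencies, and a linear layer projects the concatenation to the transformer width. The frequency choice (powers of two times pi) and the hidden size are assumptions for illustration.

```python
import torch
import torch.nn as nn

def gamma(x, num_freqs=8):
    """Sinusoidal encoding of a tensor of scalars:
    [sin(v_1 x), cos(v_1 x), ..., sin(v_N x), cos(v_N x)],
    with assumed frequencies v_i = 2^(i-1) * pi."""
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
    angles = x[..., None] * freqs                        # (..., N)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class ExposureIntervalEncoding(nn.Module):
    """Encode the 8 interval coordinates of each latent frame and project to
    the transformer's hidden size (the hidden size here is illustrative)."""
    def __init__(self, num_freqs=8, hidden_dim=1920):
        super().__init__()
        self.num_freqs = num_freqs
        self.proj = nn.Linear(8 * 2 * num_freqs, hidden_dim)

    def forward(self, T):                                # T: (num_latent_frames, 8)
        enc = gamma(T, self.num_freqs)                   # (num_latent_frames, 8, 2N)
        return self.proj(enc.flatten(1))                 # (num_latent_frames, hidden_dim)

embedding = ExposureIntervalEncoding()(torch.rand(4, 8) - 0.5)   # example call
```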

Fine-tuning. To fine-tune the diffusion transformer, we sample a motion-blurred image and its corresponding sharp video sequence. We then encode these frames using the VAE; add noise $\epsilon$ to the latent video frames according to the diffusion process; apply position and exposure-interval encoding to the noisy latent frames; patchify them; and feed them to the video diffusion transformer. We optimize all parameters of the video diffusion transformer to minimize $\mathbb{E}_{\tilde{V}, \epsilon}\,\lVert \hat{V} - \tilde{V} \rVert^2$, i.e., the expected L2 difference between the denoised latent video $\hat{V}$ and the clean latent video $\tilde{V}$.
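A compact sketch of one such training step, with a stand-in model and noise schedule; in the actual method the objective is computed on patchified, position- and exposure-encoded latents inside the CogVideoX transformer.

```python
import torch

def finetune_step_loss(model, clean_latents, blur_latent, T, alpha_bars):
    """L2 between the model's denoised latent video and the clean latent
    video, at a randomly drawn diffusion step (illustrative)."""
    t = torch.randint(0, len(alpha_bars), (1,)).item()
    eps = torch.randn_like(clean_latents)
    noisy = alpha_bars[t].sqrt() * clean_latents + (1 - alpha_bars[t]).sqrt() * eps
    denoised = model(noisy, blur_latent, T, t)     # conditioned denoising prediction
    return ((denoised - clean_latents) ** 2).mean()

# Toy call with a stand-in model that simply echoes its input:
alpha_bars = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
loss = finetune_step_loss(lambda z, blur, T, t: z,
                          torch.randn(4, 16, 45, 80),   # clean latent video
                          torch.randn(1, 16, 45, 80),   # blurred-image latent
                          torch.rand(4, 8) - 0.5,       # exposure intervals
                          alpha_bars)
```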

Optimization and inference. We fine-tune the entire model, including the diffusion transformer and the linear projection layer for the exposure interval encoding, using a batch size of 64 across 16 NVIDIA L40 GPUs for 10 days. Training is conducted for 20,000 iterations with the AdamW optimizer [Loshchilov and Hutter 2019] and a learning rate of $10^{-4}$. We also fine-tune the model for unconditional generation by setting $\tilde{I} = 0$, with the same 20% dropout percentage as the original CogVideoX model. At inference time, we use 50 diffusion steps with a DDPM solver [Ho et al. 2020] and classifier-free guidance [Ho and Salimans 2021] with a guidance scale of 1.1, which takes ~2 minutes on an NVIDIA L40 GPU.
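For reference, classifier-free guidance at inference combines the conditional prediction with an unconditional one obtained by zeroing the image conditioning, matching the dropout used during fine-tuning. A minimal sketch, agnostic to whether the model predicts noise or denoised latents:

```python
import torch

def cfg_prediction(model, z_t, blur_latent, T, t, guidance_scale=1.1):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by the guidance scale."""
    pred_cond = model(z_t, blur_latent, T, t)
    pred_uncond = model(z_t, torch.zeros_like(blur_latent), T, t)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy call with a stand-in model:
out = cfg_prediction(lambda z, blur, T, t: torch.zeros_like(z),
                     torch.randn(1, 4, 16, 45, 80),
                     torch.randn(1, 1, 16, 45, 80),
                     torch.rand(4, 8) - 0.5, t=10)
```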

Fine-tuning datasets. To evaluate our method against baselines and to test its performance on challenging in-the-wild scenes, we fine-tune three versions of the model. The first two are fine-tuned on specific datasets for direct comparison with baselines, as detailed in Section 5.1. To enhance generalization to complex real-world imagery, we also fine-tune a third version on a diverse compilation of high-FPS videos drawn from GoPro [Nah et al. 2017] (240 FPS), Adobe240 [Su et al. 2017] (240 FPS), REDS [Nah et al. 2019] (120 FPS), and iPhone240 [Shimizu et al. 2023] (240 FPS), with 507 frames per clip. All frames were resized to the model's native resolution of 1280 × 720.

Blur simulation. We simulate blur by averaging a sequence of consecutive frames from the video data according to Equation 1. This requires taking two considerations into account. First, since RGB colors in source videos may be stored in a gamma-corrected color space, which is non-linear, we perform all frame averaging in linear sRGB space. Second, naive averaging of consecutive frames from a source video may not give a good approximation to the blur integral if there is significant "dead time" between them. This occurs, for example, when the frame exposure interval is less than the timespan between frames. To mitigate this issue, we use a frame interpolator [Zhong et al. 2024] to temporally upsample all videos to 1920 FPS prior to fine-tuning. See Supp. Section S2.2 for more details.
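The averaging step can be sketched as follows; a power-law gamma of 2.2 stands in for the true sRGB transfer curve, and the frames are assumed to come from the temporally upsampled (e.g., 1920 FPS) video.

```python
import numpy as np

def simulate_blur(frames_srgb, gamma_val=2.2):
    """Simulate a motion-blurred image by averaging consecutive frames in
    (approximately) linear space, then re-applying the display transform."""
    frames = np.asarray(frames_srgb, dtype=np.float32) / 255.0   # (T, H, W, 3)
    linear = frames ** gamma_val                                  # undo gamma
    blurred_linear = linear.mean(axis=0)                          # integrate over the exposure
    return (blurred_linear ** (1.0 / gamma_val) * 255.0).astype(np.uint8)

# Example: a 16/1920 s exposure simulated from 16 upsampled frames.
blurry = simulate_blur(np.random.randint(0, 256, (16, 720, 1280, 3), dtype=np.uint8))
```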

Video generation modes. We fine-tune the model so that it can operate in either of two modes: present-only generation and past/present/future generation. At each fine-tuning iteration, we sample a random number of frames for each mode, drawn from a batch of 64 randomly selected videos.

In the present-prediction mode, we sample two to sixteen consecutive 120 FPS or 240 FPS frames from these videos. We simulate motion blur as described above for the frames themselves (e.g., summing sixteen 1920 FPS frames to obtain one frame exposed for 1/120 seconds), and for their combined exposure interval (which can span up to 16/120 seconds). The model must then generate these 120 or 240 FPS video frames when given their motion-blurred sum as the conditioning signal. To improve robustness to even larger motion blurs, we sample 32 or 48 consecutive frames at 240 FPS and sum them to simulate motion blur due to an exposure interval of 32/240 or 48/240 seconds, respectively. In this case, the model must generate 16 frames that span this interval, each having an exposure of 1/240 seconds. This allows the model to generate videos whose frames have both short exposures and controllable dead time (please see Section 7.3 for an analysis of this capability).

In the past/present/future mode, we sample two to eight consecutive frames at 120 or 240 FPS, to simulate motion blur from exposures between 2/240 and 8/120 seconds. The model must then generate four to sixteen frames that span twice that exposure, thereby predicting dynamic appearance both before and after the exposure.

In Section 5.1 we compare our model to baselines on the task of generating frames during the present (during the exposure), and in Section 5.2 we assess generating the past, present and the future (from before to after the exposure).

Datasets. We first compare our model on the GoPro [Nah et al. 2017] dataset, which contains over 3000 blurred images along with paired 7-frame videos that were used to synthetically generate the blurred images. For this comparison, our model is trained only with 1280 × 720 sequences from the GoPro dataset. To be compatible with all baselines, we downsample the model outputs to 640 × 360 resolution for evaluation. We also compare our model on the B-AIST++ dataset, following Zhong et al. [2022]. The dataset contains videos of dancers and corresponding bounding boxes (192 × 160) used to crop and assess the model output. We utilize the original train and test splits for both datasets.

Baselines. We compare our model against the most relevant and recent baselines with publicly available code. On the GoPro dataset, we evaluate against MotionETR [Zhang et al. 2021], which reconstructs a video from a single motion-blurred image by first predicting a sharp image and an associated motion field, then warping the sharp image using the estimated motion. We also compare to the method of Jin et al. [2018], which employs an ordering-invariant loss to train networks that recover a sequence of sharp frames from a blurred input. For the B-AIST++ dataset, we compare to Animation from Blur [Zhong et al. 2022], which uses coarse optical flow estimation to guide video reconstruction via a generative adversarial network [Sohn et al. 2015].

Metrics. Quantitative evaluation requires care due to the inherent motion ambiguities in recovering videos from a single motion-blurred image. For instance, both a video and its time-reversed counterpart may be equally valid predictions given the same input. To address this, prior work compares both the predicted video and its time-reversed version to the ground truth, reporting image quality metrics for the direction that yields better performance [Zhong et al. 2022]. While this approach accounts for global motion ambiguity at the frame level, it overlooks local motion variations across the image.

To better capture such spatially varying ambiguities, we additionally report metrics computed on forward and time-reversed patches of the predicted videos, allowing for more fine-grained evaluation.

More formally, let $\hat{V}^{(p)} \in \mathbb{R}^{F \times H_p \times W_p}$ be the $p$-th patch of a predicted video with $F$ frames, let $\hat{V}^{(p)}_{\mathrm{rev}}$ be the patch at the same spatial location in the temporally-reversed video, and let $V^{(p)}$ be the corresponding patch in the ground-truth video. Then, we can define the bidirectional patch-based version $M_p$ of a standard image quality metric $M(\cdot, \cdot)$ as

$$M_p = \mathrm{best}\Big( M\big(\hat{V}^{(p)}, V^{(p)}\big),\; M\big(\hat{V}^{(p)}_{\mathrm{rev}}, V^{(p)}\big) \Big), \tag{6}$$

where $\mathrm{best}(\cdot,\cdot)$ selects the temporal direction with the better score (higher for PSNR and SSIM, lower for LPIPS).

We employ five evaluation metrics in total. The first three are standard image-based metrics: peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM) [Wang et al. 2004], and LPIPS [Zhang et al. 2018]. For the GoPro dataset, we use a patch size of 1 × 1 for PSNR and 40 × 40 for SSIM and LPIPS. For the B-AIST++ dataset, which requires evaluation on smaller crops of 192 × 160 given by labeled bounding boxes, we use patch sizes of 1 × 1 for PSNR and 32 × 32 for SSIM and LPIPS. When computing PSNR, we slightly modify Equation 6 to better capture frame-level signal-to-noise characteristics. Specifically, for each patch, we first select either the forward or reverse MSE based on Equation 6; then, we average the patch-wise MSEs to obtain a single frame-level MSE. PSNR is computed from this value, and the final score is reported as the average PSNR across all frames.
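A minimal sketch of this bidirectional patch PSNR, assuming videos normalized to [0, 1] with spatial dimensions divisible by the patch size (an illustration, not the authors' evaluation code):

```python
import numpy as np

def bidirectional_patch_psnr(pred, gt, patch=1):
    """Per patch, keep the MSE of the better temporal direction; average patch
    MSEs into a frame-level MSE; convert to PSNR; average over frames.
    Inputs have shape (F, H, W, C) with values in [0, 1]."""
    rev = pred[::-1]
    num_frames, H, W, C = pred.shape
    psnrs = []
    for f in range(num_frames):
        def patch_mse(se):   # per-patch MSE of a squared-error map (H, W, C)
            return se.reshape(H // patch, patch, W // patch, patch, C).mean(axis=(1, 3, 4))
        mse_fwd = patch_mse((pred[f] - gt[f]) ** 2)
        mse_rev = patch_mse((rev[f] - gt[f]) ** 2)
        frame_mse = np.minimum(mse_fwd, mse_rev).mean()
        psnrs.append(10 * np.log10(1.0 / max(frame_mse, 1e-12)))
    return float(np.mean(psnrs))

score = bidirectional_patch_psnr(np.random.rand(7, 360, 640, 3),
                                 np.random.rand(7, 360, 640, 3))
```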

For the remaining metrics, we use Fréchet Video Distance (FVD) [Unterthiner et al. 2019] to measure the distributional similarity between generated and ground-truth videos. FVD is computed at full resolution using videos played in the forward direction. Finally, we report end-point error (EPE) [Zhang et al. 2020] using the RAFT [Teed and Deng 2020] optical flow estimator. EPE measures the difference in optical flow between the predicted and ground-truth videos, from the first to the last frame. This metric is computed bidirectionally, selecting per pixel the temporal direction (forward or backward) that minimizes the flow error.

Results. Across all evaluation metrics, our model consistently outperforms the baselines (see Tables 1 and 2), demonstrating both accurate motion prediction and high image quality. Qualitatively, our method performs well on the GoPro and B-AIST++ datasets, as shown in Figure 5. Additional video comparisons are available on the supplemental webpage. Notably, the model successfully handles challenging and complex effects, such as spatially varying blur, occlusions, and disocclusions. Compared to baselines like MotionETR and Animation from Blur, our results appear more natural and free of warping artifacts. We attribute this partly to the large-scale pre-training of the base video model and the resulting strong priors that it learns on natural videos [Shao et al. 2024]. Overall, our generated videos not only look more realistic but are also closer to the distribution of ground-truth videos in the datasets, as measured by FVD.

Our method robustly reconstructs both the timespan of capture and subsequent motion, as demonstrated through an additional evaluation on the GoPro dataset. We omit baseline comparisons in this setting, as the baselines were not trained for this task. We evaluate our model on the GoPro dataset by synthesizing blurred images from 7 sharp 240 FPS frames. The model then predicts a 13-frame sequence: three frames before the moment of capture, seven frames during the exposure, and three frames after the capture. We plot the bidirectional patch PSNR in Figure 4. Interestingly, the quality remains consistent across the seven middle frames, suggesting that each frame within the motion blur is equally well constrained. Although accuracy degrades for predictions outside the moment of capture, the video frames still fall within 20-30 dB PSNR. The results suggest that motion-blurred images provide strong cues for past and future prediction. We show qualitative examples of past and future generation in the supplemental webpage.

Fig. 5. We compare against MotionETR [Zhang et al. 2021], Jin et al. [2018], and Animation from Blur [Zhong et al. 2022] and find that our method recovers significantly clearer output video frames, with motion tracks that are more consistent with the ground-truth video sequence (tracks estimated using CoTracker [Karaev et al. 2024]). (c-d) Additional baseline comparisons on in-the-wild data. We find that Jin et al. [2018] and MotionETR fail to recover sharp video frames on these challenging, in-the-wild sequences, likely due to the more limited scale of their training datasets and learned motion priors (compare to our results on these scenes in Figures 1 and 6).

Fig. 6. Applications of the proposed method (please see the video results in the supplemental webpage). (a) We demonstrate generating the past, present, and future from in-the-wild motion-blurred images, as indicated by the red, green, and blue labels. The method recovers sharp frames and scene dynamics from a mixer (top) and a busy city street (bottom). Note the complex motion trajectories recovered by applying an off-the-shelf tracker [Karaev et al. 2024] to our generated videos. (b) By exploiting motion blur in historical photos, we reveal scene dynamics, e.g., the movement of Muhammad Ali in a boxing match or astronaut John Glenn picking up a camera. (c) We bring images "to life" by predicting 3D scene dynamics and camera poses from our generated video frames with off-the-shelf structure-from-motion methods [Li et al. 2025]. (d) We even recover subtle motions in black-and-white photographs captured during World War II, over 80 years ago. We reveal motions in the generated deblurred frames (insets) by applying an optical flow method [Teed and Deng 2020] to a past and a future predicted frame. (e) Finally, we recover 3D facial dynamics from a motion-blurred image by applying a face tracker to our output video [Taubner et al. 2024]. Historical photos: (1) Coast Guard Lands the British Marines (1944).

We test our model on a wide range of applications, including generating video from challenging in-the-wild motion-blurred photos (Section 6.1), bringing historical motion-blurred photos to life (Section 6.2), reconstructing 3D scene dynamics (Section 6.3), and recovering 3D human head pose from motion-blurred portrait photos (Section 6.4).

We evaluate our model on motion-blurred images from in-the-wild scenes and find that it consistently produces realistic videos across a wide range of scenarios. Figures 1 and 6(a) and our supplemental webpage show qualitative examples. Our model handles diverse motion types, including running, biking, surfing, gymnastics, object manipulation, water splashes, and even complex circular motion such as children riding a merry-go-round. The scenes span a broad spectrum of subjects, from people and animals to kitchen implements, swaying tree branches, and falling confetti. We even recover crisp videos from multiple blurred objects with different motion trajectories, such as bustling city intersections filled with cars and pedestrians. We show the recovered motion in Figures 1 and 6(a) by applying an off-the-shelf tracking algorithm [Karaev et al. 2024], and we visualize the dense-tracked motion paths in the videos on the supplemental webpage.

Qualitative comparisons with baselines, including Jin et al. [2018] and MotionETR, are provided in Figure 5 and the supplemental webpage. We use their publicly available models trained on the GoPro dataset. The baselines struggle to generalize to the varied content and motion types present in our test set and are unable to generate frames beyond the motion-blurred interval.

We find that our model generalizes to historical photos, enabling recovery of video for dynamic scenes that were photographed, in some cases, over 80 years ago. For example, we leverage subtle motion blur cues to recover the direction and magnitude of the motions of American soldiers (Figure 1) and British marines (Figure 6(d)) piling from Coast Guard landing barges onto the French coast on June 6, 1944 during the Allied invasion of Normandy. For these results, we show 2D motion fields by applying off-the-shelf optical flow prediction [Teed and Deng 2020] to our past- and future-predicted video frames. We can also watch Muhammad Ali land a blow on Jürgen Blin in our video, generated from a photo of their 1971 boxing match. Finally, another result, from a 1998 photo, captures the astronaut John Glenn carefully handling a 24 mm camera.

To reconstruct dynamic 3D (i.e., 4D) scenes from our generated videos, we apply MegaSaM [Li et al. 2025] and extract dense, dynamic 3D point clouds and corresponding camera poses. The reconstructions preserve spatial and temporal coherence, and we show visualizations of the reconstructed camera trajectories and scene geometries in Figures 1 and 6(c) and in the supplemental webpage. In this fashion, our approach can be applied to in-the-wild or historical photos, bringing them to life through 4D visualization.

Our approach enables two distinct but complementary forms of 3D understanding from a single motion-blurred image. First, in scenes exhibiting significant rigid motion, such as turning heads or fast-moving limbs, our generated video frames reveal temporally coherent disparity cues that would otherwise be lost in a single blurred frame. This disparity information becomes accessible only because our model produces geometrically consistent video sequences rather than merely plausible frame-by-frame generations.

Second, in dynamic scenes with non-rigid structure, our synthesized videos can be lifted to coherent 4D representations that are physically plausible in space and time. Specifically, we show that our output videos can be lifted into a single 4D representation of dynamic scene geometry in a consistent world coordinate space. Such reconstructions are only possible when the input video maintains 3D consistency across time and when multi-view techniques are applied to exploit cross-frame correspondences.

We show that our model enables reconstruction of 3D human head pose using the disparity cues in a motion-blurred portrait photograph. Specifically, we apply our method to the photo shown in Figure 6(e) and recover sharp video frames showing the rigid motion of the head. We then apply a 3D human head tracker that predicts 2D facial landmarks for each frame and registers a parametric 3D human head model to the landmarks through a joint optimization procedure across all frames [Taubner et al. 2024]. We show the tracked 2D landmarks and the recovered animated 3D head model in Figure 6, and we show a video animation of these results in the supplemental webpage. This demonstrates that our predicted videos provide enough geometric consistency to be explained by a single 3D representation.

We conduct additional experiments to assess the performance of the model and the impact of architectural design choices on the generated videos. Specifically, we discuss our model's ability to capture the multi-modality of generating videos from a single motion-blurred image (Section 7.1), maintain consistency (Section 7.2), and control the generated frames' exposure intervals (Section 7.3).

Motion blur in an input image could potentially be explained by an infinite number of generated videos. We probe the distribution of videos learned by the model by sampling and comparing multiple output videos. Qualitatively, we find the motion in the output videos is consistent with the motion behavior expected from a number of different object categories (see Figures 7 and 8 and the supplemental webpage). For example, humans, cars, and animals are generally predicted to be moving in the forward-facing direction. However, in cases where motion direction is more ambiguous (e.g., a person shaking their head; see Figure 7), the output videos sample multiple plausible motion directions. Hence, the model does not exactly recover the motion that occurred during the moment of capture but rather generates samples of what might plausibly have occurred.

Fig. 9. We assess 3D consistency for an image with motion blur due to camera movement. We visualize the epipolar lines, the residual movement of keypoints after applying a 2D homography that best aligns the two frames, and the absolute difference of the first frame with the homography-warped version of the last frame. We find the epipolar lines to be consistent with the forward (left-to-right) camera motion. Additionally, we observe parallax between the traffic sign and the background landscape; as a result, the homography cannot accurately model the traffic sign.
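A minimal sketch of this kind of epipolar/homography probe, using OpenCV as an assumed toolchain (the paper specifies SIFT and RANSAC but not a particular implementation):

```python
import cv2
import numpy as np

def geometric_consistency(frame_first, frame_last):
    """Match SIFT keypoints between the first and last generated frames, fit a
    fundamental matrix (epipolar geometry) and a 2D homography with RANSAC,
    and form the homography-warped absolute-difference image."""
    g1 = cv2.cvtColor(frame_first, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame_last, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(g1, None)
    k2, d2 = sift.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)

    # Large residuals in the difference image indicate parallax (3D structure)
    # that a single homography cannot explain.
    warped = cv2.warpPerspective(frame_first, H, (frame_last.shape[1], frame_last.shape[0]))
    return F, H, cv2.absdiff(warped, frame_last)
```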

We provide further assessment of the model based on (1) the overall 3D consistency of generated videos and (2) the consistency of the output frames with the input motion-blurred image.

3D consistency. We evaluate the 3D consistency of the model output by analyzing generated videos from scenes with motion blur due to viewpoint movement. We apply SIFT [Lowe 2004] and RANSAC [Fischler and Bolles 1981] to the first and last generated frames to detect feature correspondences and compute the fundamental matrix [Hartley and Zisserman 2003]. Additionally, we use RANSAC [Fischler and Bolles 1981] to find the 2D homography that best explains the keypoint correspondences between both images. We visualize the epipolar lines and the homography applied to both the keypoints and images in Figures 9 and 10. We find that in the scene with the truck (Figure 9), the epipolar lines are consistent with the forward camera motion. Additionally, we observe that while the homography fits the background, it produces large errors on the traffic sign, seen in the absolute difference map, highlighting parallax caused by the scene's 3D structure. We observe a similar effect in Supp. Figure S1, where the foreground bushes closer to the camera exhibit noticeable parallax. Finally, in Figure 10, we observe that for a scene with a panning camera (i.e., no viewpoint change), the motion is accurately modeled by a 2D homography, as expected. These visualizations appear subjectively consistent with the scene geometry, suggesting that our model produces geometrically-plausible outputs with a reasonable degree of 3D consistency, even in the absence of explicit 3D supervision.

Fig. 10. We assess geometric consistency for an image from the GoPro dataset [Nah et al. 2017] with motion blur from a panning camera. We visualize the movement of keypoints after applying the 2D homography that best aligns them, and the absolute difference of the first frame with the homography-warped version of the last frame. In this case, apparent motion in the video can be accurately modeled with a homography, as the viewpoint did not change. Note that in this case our approach reveals the presence of independent scene motion in the blurry photo (the person's outline in the difference image).

Consistency with input motion-blurred image. We compare the motion-blurred image to the image created by averaging together the generated video frames. Table 3 reports quantitative results of this comparison using PSNR, and Figure 11 provides qualitative visualizations. We note that this metric is not informative on its own because the trivial solution of repeatedly outputting the input motion-blurred image achieves a perfect score.
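The reported consistency check can be sketched directly: average the generated frames and compare to the input with PSNR. Averaging in linear space with a gamma of 2.2 is an assumption carried over from the blur-simulation sketch above.

```python
import numpy as np

def reblur_consistency_psnr(blurred_input, generated_frames, gamma_val=2.2):
    """PSNR between the input motion-blurred image and the average of the
    generated frames (uint8 inputs of identical spatial size assumed)."""
    frames = np.asarray(generated_frames, dtype=np.float32) / 255.0
    reblurred = (frames ** gamma_val).mean(axis=0) ** (1.0 / gamma_val)
    target = np.asarray(blurred_input, dtype=np.float32) / 255.0
    mse = float(np.mean((reblurred - target) ** 2))
    return 10 * np.log10(1.0 / max(mse, 1e-12))

# Toy call with random data:
score = reblur_consistency_psnr(
    np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8),
    np.random.randint(0, 256, (16, 720, 1280, 3), dtype=np.uint8))
```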

We note that other methods, such as diffusion posterior sampling (DPS) [Chung et al. 2023], can explicitly enforce consistency with the motion-blurred image in their objective function. In contrast, our method achieves consistency as a natural result of our fine-tuning process on motion-blurred images synthesized from videos, i.e., the model is fine-tuned to generate frames that reproduce the input motion-blurred image when averaged. We attempted to apply PSLD [Rout et al. 2023], a DPS-based method designed for latent-space models, to enforce consistency for our 1280 × 720 generated videos. However, we found it would require a prohibitive 380 GB of memory to backpropagate through the variational autoencoder (VAE) due to its high parameter count.

Finally, we evaluate how well our model controls the exposure interval of each generated video frame using the exposure interval embedding introduced in Section 4. We consider two evaluation cases. In the first case, the blurred input is held fixed while we vary the number, and hence the duration, of the generated frames; in what follows, $([f_s \ldots f_e])$ denotes the average of frames $f_s, \ldots, f_e$. This setup evaluates how well our method adapts to different frame durations under the same blurred input.

For the second case, we introduce dead time between frames. Here, we average 32 consecutive 1920 FPS ground-truth frames $[f_1 \ldots f_{32}]$ to form the input motion-blurred image and then generate 16 output frames, each corresponding to a disjoint exposure interval with a one-frame gap. Specifically, the outputs are $([f_1]), ([f_3]), ([f_5]), \ldots, ([f_{31}])$.

This setting tests whether our model performs well when exposure intervals are non-consecutive.

Our model maintains strong performance, demonstrating robustness both to longer exposure durations with stronger blur and to disjoint exposure intervals. Specifically, Table 4 and qualitative comparisons in our supplemental webpage demonstrate that our approach provides fine-grained control over exposure intervals and remains effective under various sampling conditions. As the number of output frames increases (i.e., shorter exposures), our model successfully reconstructs high-quality outputs, with only modest degradation in PSNR, SSIM, and perceptual metrics. When the exposure intervals are separated by dead time, performance decreases due to the increased ambiguity and blur size. Nevertheless, the model still generates plausible videos, showing that it generalizes to non-consecutive start/end times for the generated frames.

Alternative exposure control scheme. We also compare against an alternative exposure interval encoding scheme. Instead of explicitly providing the per-frame exposure intervals as input, we fine-tuned a new model whose temporal conditioning signal consists of (1) the start time of the first output frame, (2) the end time of the last output frame, (3) the (uniform) duration of individual frames, and (4) the number of frames. This scheme contains equivalent information but encodes the intervals implicitly. All other details of the model (sinusoidal encoding, linear layer, and addition to latent patches) are kept the same. We repeat the two evaluations discussed above for this model as well. Comparing Tables 4 and 5, it is clear that our choice of per-frame exposure interval encoding is superior to this alternative encoding scheme. Please refer to the supplemental webpage for additional qualitative comparisons.

Sinusoidal embedding ablation. We ablate the sinusoidal projection used in our exposure interval encoding. Specifically, we train a model on the GoPro [Nah et al. 2017] dataset under the same settings as in the main paper, but replace the sinusoidal projection with a simple linear layer applied directly to the exposure intervals. Comparing Tables 4 and 6 suggests that the sinusoidal embedding plays a crucial role in the effectiveness of our method.

A surprising amount of information can be recovered from a single motion-blurred image. Our approach offers a deeper glimpse into the moment of capture, and gestures toward the past and future that surround it. Below, we highlight a few limitations of our approach and exciting directions for future work.

As shown in Figure 12 (row 1), the model struggles to recover videos from images that deviate from our assumed model of motion blur, such as photos created by compositing together images captured at separate exposure times. Similarly, scenes with extreme motion blur, including time-lapse images and images with a combination of fast camera panning and complex scene motion, often fail to yield plausible videos (Figure 12, rows 2-3). These failure cases could potentially be addressed by fine-tuning the model on more diverse training data, including examples with time-lapse effects or other types of motion blur.

Our work has only scratched the surface of how large, pre-trained video diffusion models might be used to recover scene information from image degradations. For instance, defocus blur and chromatic aberration may serve as additional cues for inferring scene geometry, and other optical effects could be similarly exploited. Moreover, our findings raise the intriguing possibility of end-to-end design [Sitzmann et al. 2018; Tseng et al. 2021] of optical degradations in tandem with large pre-trained models, to purposefully encode information into an image for a downstream task (such as video generation). With ongoing advances in generative modeling and access to large-scale video datasets, we anticipate many new opportunities at the intersection of image or video restoration and video generation.


