A New Pipeline for 3D Trajectory and Spin Estimation of Table Tennis Balls from Monocular Video
📝 Abstract
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
📄 Content
Table tennis is a dynamic sport demanding exceptional precision, speed, and strategic thinking. For athletes, coaches, and sports scientists, understanding the intricate 3D trajectory and spin of the ball is paramount for in-depth performance analysis and technique refinement. Such detailed insights, however, are notoriously difficult to obtain from conventional broadcast or monocular video footage, which only provides 2D observations. The rapid motion of the ball, coupled with occlusions, varying lighting conditions, and diverse camera angles, poses significant challenges for accurate 3D reconstruction.

Figure 1. Qualitative example prediction of the full pipeline for a serve trajectory. The green dots represent the front-end detections for 2D ball positions and table keypoints. The magenta dots represent the predicted 3D ball trajectory from the back-end.

Prior research has demonstrated the feasibility of training neural networks on synthetically generated data to reconstruct 3D trajectories and spin [8,22,31]. However, a significant gap remains in transitioning these models from clean, synthetic inputs to the noisy, sparse, and imperfect detections found in real-world videos. This discrepancy severely limits their practical deployment.

This paper addresses these limitations by presenting a comprehensive pipeline designed to robustly infer the 3D trajectory and initial spin of a table tennis ball directly from monocular video footage. To the best of our knowledge, this work constitutes the first learning-based application of a complete pipeline for 3D table tennis analysis. Our solution is built on a novel two-stage framework that solves the fundamental problem of missing 3D ground truth in real-world broadcast videos. We achieve this by dividing the problem into a front-end perception stage and a back-end uplifting stage, which allows us to train each component with different, readily available supervision.
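The two-stage split described above can be sketched as a simple data flow. This is a minimal illustration of the interface between the stages, not the authors' actual API; all function names, array shapes, and the dummy return values are assumptions.

```python
import numpy as np

def front_end(frames):
    """Perception stage (hypothetical sketch): per-frame 2D ball position
    and table keypoints. A real implementation would run heatmap networks
    (e.g. Segformer++); here we return dummy arrays. Missing detections
    are encoded as NaN so the back-end can mask them.
    """
    n = len(frames)
    ball_2d = np.zeros((n, 2))        # (x, y) pixel coordinates per frame
    ball_2d[3] = np.nan               # simulate one missed detection
    table_kps = np.zeros((n, 4, 2))   # e.g. four table corner points
    return ball_2d, table_kps

def back_end(ball_2d, table_kps, fps):
    """Uplifting stage (hypothetical sketch): 2D observations -> 3D
    trajectory and initial spin. Trained on synthetic data only, so it
    must tolerate NaN gaps and varying fps. Here we return zeros of the
    expected shapes.
    """
    valid = ~np.isnan(ball_2d).any(axis=1)   # mask of usable detections
    traj_3d = np.zeros((len(ball_2d), 3))    # (x, y, z) in table coords
    spin = np.zeros(3)                        # initial angular velocity
    return traj_3d, spin, valid

frames = [None] * 10                          # stand-in for video frames
ball_2d, table_kps = front_end(frames)
traj_3d, spin, valid = back_end(ball_2d, table_kps, fps=30.0)
print(traj_3d.shape, spin.shape, int(valid.sum()))  # (10, 3) (3,) 9
```

The key design point is that the two stages only communicate through 2D detections, which is exactly what lets each stage be trained with its own supervision source.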
We also introduce a new dataset, essential for training our models. Our main contributions are:

• We develop a state-of-the-art 2D ball detector leveraging the efficient Segformer++ architecture [20], specifically optimized for processing high-resolution images.
• We introduce a novel 2D table keypoint detector, enabling precise localization of the table boundaries within diverse video frames. These keypoints provide essential contextual information for the 2D-to-3D uplifting model.
• We present a 2D-to-3D uplifting model that takes the detected 2D ball trajectory and table keypoints as input and outputs the 3D trajectory and initial spin of the ball. Even though the model is trained solely on physically-correct synthetic data, it achieves zero-shot generalization to real-world scenarios. We specifically adapt this model to robustly handle real-world detection noise, missing detections, and varying frame rates, making it compatible with our presented detectors.
• We introduce the TTHQ dataset, a novel high-quality, high-resolution dataset featuring meticulously annotated 2D ball trajectories, table keypoints, spin information, and comprehensive meta information, all sourced from publicly available YouTube videos. This dataset is instrumental for training and evaluating our 2D detectors and for benchmarking the integrated pipeline.
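One way to make an uplifting network robust to missing detections and varying frame rates, as the contributions above require, is to feed it explicit per-frame time deltas and a validity flag instead of assuming a fixed fps and complete tracks. The encoding below is an illustrative assumption, not the paper's documented input format.

```python
import numpy as np

def encode_inputs(ball_2d, timestamps):
    """Hypothetical input encoding for a frame-rate-robust uplifting net.

    Each row becomes (x, y, dt, valid): the 2D detection (zero-filled if
    missing), the time since the previous frame, and a validity flag.
    """
    dt = np.diff(timestamps, prepend=timestamps[0])       # seconds per step
    valid = (~np.isnan(ball_2d).any(axis=1)).astype(float)
    filled = np.nan_to_num(ball_2d, nan=0.0)              # zero-fill gaps
    return np.concatenate([filled, dt[:, None], valid[:, None]], axis=1)

# Four frames at 30 fps with one detection dropout in frame 2
ball = np.array([[100., 200.], [110., 195.], [np.nan, np.nan], [130., 185.]])
t = np.array([0.0, 1 / 30, 2 / 30, 3 / 30])
feats = encode_inputs(ball, t)
print(feats.shape)   # (4, 4)
```

Because the time delta is an explicit input feature, the same network can in principle process 25, 30, or 60 fps footage without retraining, and the validity flag lets it learn to ignore zero-filled gaps.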
General object detectors [4,32,34] can be adapted for ball detection, but heatmap-based methods, common in 2D pose estimation [25,29,42,44], have become the de facto standard for ball detection. The TrackNet model family [6,19,38] and the state-of-the-art WASB [39] demonstrate strong performance. In addition to the ball position, we also want to detect specific table keypoints in the image, a task sometimes performed for camera calibration in sports analytics [16,26,35]. However, existing methods in table tennis [10,15] often lack the precision and the comprehensive set of points required for direct integration into our pipeline. We address these limitations by leveraging the Segformer++ architecture [20]. This modern, transformer-based approach is uniquely suited to our task due to its efficiency in processing high-resolution images, which is crucial for capturing tiny ball details and thin table edges. It is trained using a heatmap-based approach, which proves exceptionally effective for precise ball localization.
Reconstructing 3D trajectories from monocular video is challenging. While controlled multi-camera setups offer high accuracy via triangulation [27,30,33,43], they are impractical for broadcast footage. Monocular methods often rely on physics-based model fitting to observed 2D trajectories [5,10,15,18,24] or on single-frame estimates using cues such as the observed ball size or height [2,3,21,41], but these approaches are susceptible to errors from inaccurate identification of key events, the need for explicit camera calibration, or insufficient video quality. The use of deep learning has shown great promise in overcoming these issues by directly predicting 3D trajectories.
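The "observed ball size" cue mentioned above works because a table tennis ball has a standardized 40 mm diameter, so its apparent size in pixels yields depth under a pinhole camera model. The snippet below is a generic illustration of that cue, not the cited methods' exact formulation; the focal length and pixel measurements are made-up example values.

```python
# ITTF regulation ball diameter in metres
BALL_DIAMETER_M = 0.040

def depth_from_ball_size(diameter_px, focal_px):
    """Pinhole model: depth Z = f * D_real / d_pixels.

    diameter_px: apparent ball diameter in pixels
    focal_px:    camera focal length expressed in pixels
    """
    return focal_px * BALL_DIAMETER_M / diameter_px

# Example: a ball appearing 16 px wide under a 2000 px focal length
z = depth_from_ball_size(diameter_px=16.0, focal_px=2000.0)
print(z)  # 5.0 (metres)
```

This also makes the fragility of the cue concrete: at 5 m depth a one-pixel error in the measured diameter shifts the estimate by roughly 0.3 m, which is one reason learning-based uplifting from whole trajectories is attractive.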
This content is AI-processed based on ArXiv data.