LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting

Yicheng Rui 1,†, Xiao-Wei Duan 1, Licai Deng 2, Fan Yang 2, Zhengming Dang 1, Zhengjun Du 3, Junhao Peng 3, Wenhao Chu 3, Umut Mahmut 1, Kexin Li 1, Yiyun Wu 1, Fabo Feng 1

1 State Key Laboratory of Dark Matter Physics, Tsung-Dao Lee Institute & School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai 201210, China
2 National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China
3 School of Computer Technology and Application, Qinghai University, Xining 810016, China
ruiyicheng@sjtu.edu.cn

Abstract

Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018–2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 512×512 frames with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude–azimuth (alt–az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% ± 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt–az coordinates and measure calibration uncertainties of ≈0.37° at zenith and ≈1.34° at 30° altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution.
We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt–az maps to support research in segmentation, nowcasting, and autonomous observatory operations.

1. Introduction

Ground-based time-domain surveys such as the Zwicky Transient Facility (ZTF) [2], the Vera C. Rubin Observatory [17], and the Tianyu Project [11] are transforming our understanding of the dynamic Universe. To maximize scientific return, these facilities rely on cloud-aware schedulers that continuously adapt pointing and exposure plans to local, rapidly evolving weather. Achieving this requires a realistic, high-resolution cloud model that ingests recent observations and supports short-horizon predictions suitable for real-time decision making [25]. Building cloud models with such granularity demands long-duration, all-sky imaging datasets collected at fixed sites, with accurate geometric calibration and night-time cloud annotations.

However, existing all-sky fisheye datasets exhibit at least one of the following limitations: (i) short temporal coverage (months rather than years), preventing seasonal modeling; (ii) manual masking that does not scale; (iii) daytime bias, limiting astronomical use; (iv) missing astrometric calibration, making it difficult to map pixels to altitude–azimuth coordinates for effective telescope scheduling; (v) selection of easy-to-annotate images that fail to capture the complexity of real-world conditions.

In this work, we introduce LenghuSky-8, an eight-year, all-sky cloud dataset with star-aware masks and alt–az calibration, collected at Lenghu, Qinghai, China (a premier astronomical site [5]) and spanning 2018–2025. The dataset contains 429,620 images at a resolution of 512×512, covering nights and days across 8 years, with 81.2% night-time frames.
For segmentation, we use DINOv3 local features with a linear probe, enabling robust, label-efficient separation of cloud and sky under diverse illumination, including moonlit conditions, at an overall accuracy of 0.933 (+0.011/−0.011) on a small manually annotated dataset containing 1,111 images. Night-time star fields are used for astrometric calibration, yielding per-pixel altitude–azimuth coordinates with an uncertainty of 0.37° at zenith. Samples from the dataset are shown in Fig. 1.

Figure 1. Samples from the dataset. The first column shows the raw images augmented for cloud segmentation; the second column shows annotations of clear sky (blue), cloudy regions (orange), and contamination regions (pink); the third column shows the background mask; the fourth column overlays the first three columns; the last two columns show the astrometric calibration results for the altitude and azimuth of the image.

Our contributions are twofold:
1. Dataset: an eight-year, all-sky, day–night imaging dataset with star-aware cloud masks, background annotation, and alt–az calibration, suitable both for cloud nowcasting and for domain-specific model pretraining, together with a manually labeled set of 1,111 images for evaluating cloud segmentation algorithms.
2. Tools: a DINOv3 linear-probe segmenter for all-sky cameras, an all-sky camera calibrator based on star fields, and an open evaluation toolkit with loaders, calibration maps, and scripts.

The workflow of this paper is shown in Fig. 2. Code, data, and documentation are available at https://github.com/ruiyicheng/LenghuSky-8. The remainder of the paper covers related work (Section 2), dataset establishment (Section 3), benchmarks and baselines (Section 4), and conclusion and discussion (Section 5).

2. Related Work

2.1. All-sky cloud datasets

Despite growing adoption, publicly available all-sky fisheye image datasets remain limited in both duration and scope. For example, Dev et al. introduced the SWIMSEG daytime and nighttime cloud segmentation databases, comprising approximately 1,000 and 100 images respectively, captured in the tropical urban region of Singapore [7, 8]. These small-scale collections lack sufficient temporal coverage to represent seasonal or interannual cloud variability. Similarly, Li et al. [22] collected about 5,000 images with an all-sky camera during a site-testing campaign for the Thirty Meter Telescope (TMT) in Xinjiang. In 2019, Dev et al. released SWINySEG, a dataset of 6,768 daytime and nighttime images annotated by human experts [9]. More recently, the Eye2Sky dataset [26] provided continuous all-sky imagery from 11 stations in northwestern Germany, including one site with observations spanning from April 2022 to March 2023. Nevertheless, even these larger efforts generally cover only a few months to a year and often omit detailed per-pixel cloud masks or precise geometric calibration. Consequently, the field still lacks the large-scale, long-term, well-annotated all-sky datasets necessary for robust modeling and generalizable cloud characterization.

2.2. Cloud segmentation in all-sky camera images

Ground-based all-sky cameras have been studied for cloud segmentation for over two decades. Early work relied on simple color heuristics that exploit the different scattering behavior of air molecules and cloud droplets. Fixed or adaptive thresholds on red–blue ratios, their normalized variants, and saturation/difference cues were widely used to separate cloud from clear sky in daytime scenes [13, 21, 23].

Figure 2. Workflow of this paper. Solid arrows denote dependencies among product data; dashed arrows denote potential dependencies not considered in our experiments.

These methods are attractive for their robustness and real-time efficiency but can be sensitive to camera calibration, aerosol load, circumsolar saturation, thin clouds near boundaries, and star fields.

Learning-based approaches reduced the need for hand-tuned thresholds by modeling sky/cloud appearance across color spaces. Dev et al. introduced a supervised framework using partial least squares, which catalyzed reproducible evaluation across cameras and conditions [7]. Deep convolutional networks now dominate all-sky cloud segmentation. Encoder–decoder architectures (e.g., CloudSegNet) improved accuracy, especially in challenging regions near the Sun and horizon [9]. U-Net variants tailored to sky imagery (CloudU-Net and SegCloud) extended segmentation across the full day and night using specialized attention modules and training on mixed day/night corpora such as SWINySEG [27, 32]. General-purpose segmentation architectures such as SegMAN [12] are also applicable to cloud segmentation tasks.

In recent years, large-scale self-supervised pre-training has become a dominant paradigm for visual representation learning, offering powerful features that generalize across diverse visual domains. Methods such as MAE [16], MoCo v3 [4], iBOT [36], and DINOv2 [24] have demonstrated strong transferability to downstream tasks ranging from semantic segmentation to fine-grained recognition. These models exploit masked image modeling, contrastive learning, or teacher–student distillation to produce representations that capture both global semantics and local spatial structure. Building on these foundations, DINOv3 [29] introduces improved patch-level alignment and scalable ViT backbones, enabling state-of-the-art performance on dense prediction tasks.

2.3. Short-Term Cloud Forecasting

Ground-based all-sky cameras are widely used to nowcast cloud fields at site scale, typically within the 5–15 min horizon that is most relevant for robotic observatories. Early approaches advected segmented cloud masks using optical flow to extrapolate motion, demonstrating useful skill up to about 5 min in tropical convection [6]. Hamill et al. [15] used a cross-correlation, optical-flow-based method to generate nowcasts. Learning-based frame-to-frame warping and sky-image prediction further improved short-horizon forecasts by jointly modeling motion and deformation [18].

Beyond optical-flow extrapolation, recurrent convolutional architectures such as ConvLSTM [28] formulate nowcasting as spatiotemporal sequence prediction by replacing fully connected gates with convolutions. In parallel, discrete-latent generative models such as VideoGPT (combining VQ-VAE encoders with Transformer decoders) autoregress over video tokens and offer a flexible path to learn cloud-evolution priors directly from all-sky image sequences [33].

3. Dataset

3.1. Raw data collection

The all-sky camera is installed at the Lenghu site (longitude 93.8961°, latitude 38.6068°), located on a local summit of the Saishiteng Mountain in Qinghai, China. The altitudes of the potential observing sites range from 4,200 m to 4,500 m. In contrast, the surrounding 100,000 km² area near Lenghu Town lies at a relatively lower elevation (below 3,000 m). Both during the day and at night, the site experiences an extremely dry climate and predominantly clear skies. Such stable and arid atmospheric conditions lead to excellent seeing and low precipitable water vapor, making it an ideal site for astronomical observations.

Figure 3. Daily number of captured frames in the dataset.
The photographs were taken using fisheye-lens cameras. A fisheye lens is a specialized optical component designed to capture extremely wide fields of view, typically around 180°. Such lenses introduce strong visual distortion, producing wide panoramic or hemispherical images. In this work, we use a Sigma 4.5 mm f/2.8 fisheye lens mounted on Canon 600D, 750D, and 800D camera bodies, producing raw images with a resolution of 4000×6000.

The dataset can be divided into two distinct parts: data collected before 27 September 2023 18:09:48 (Part I) and data collected thereafter (Part II). Part I contains images with fewer obstructions from surrounding structures or background objects, but the optical surfaces were poorly maintained, leading to a large proportion of frames affected by mud or dew on the lens. In contrast, Part II benefits from frequent manual cleaning, which significantly improves image clarity, yet the field of view is often partially blocked by nearby objects. A representative example from Part II is shown in the fourth row of Fig. 1.

The distribution of the number of frames captured per day is illustrated in Fig. 3. The capture interval is set to 5 minutes during nighttime and 20 minutes during daytime, determined by the solar elevation angle. Consequently, the average number of frames per day is approximately 150 in summer and 200 in winter. The exposure time is dynamically adjusted according to ambient brightness, resulting in more images being captured during full-moon nights and fewer during new-moon periods. This introduces a noticeable monthly fluctuation in the total number of frames. In addition, the camera cadence is manually shortened during meteor-shower events or for system testing purposes.

Continuous coverage is particularly important for tracking the evolution of cloud structures, evaluating diurnal variations, and calibrating moonlight scattering models.
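Hour-level continuity of this kind can be quantified by splitting the frame timestamps wherever the gap between consecutive frames exceeds a threshold (the paper adopts 60 minutes). A minimal sketch on a toy cadence; `split_sequences` is an illustrative helper, not part of the released toolkit:

```python
from datetime import datetime, timedelta

def split_sequences(timestamps, max_gap=timedelta(minutes=60)):
    """Split sorted frame timestamps into continuous observation
    sequences, breaking wherever the gap exceeds max_gap."""
    sequences = [[timestamps[0]]]
    for t in timestamps[1:]:
        if t - sequences[-1][-1] > max_gap:
            sequences.append([])
        sequences[-1].append(t)
    return sequences

# Toy cadence: ten frames every 5 min, a 3-hour gap, then five more frames.
t0 = datetime(2018, 5, 1, 0, 2, 44)
night1 = [t0 + timedelta(minutes=5 * i) for i in range(10)]
night2 = [night1[-1] + timedelta(hours=3) + timedelta(minutes=5 * i) for i in range(5)]
runs = split_sequences(night1 + night2)
print([len(r) for r in runs])  # [10, 5]: the 3-hour gap marks a break
```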
To quantify hour-level persistence in observations, a threshold of 60 minutes is adopted to mark a break in observation sequences. A statistical summary of the persistence characteristics is provided in Table 1. Among all observation periods, we identify 16 instances that maintain hour-level continuity for more than one month. These long-duration sequences are particularly valuable for modeling phase-dependent lunar illumination effects and evaluating temporal trends in sky conditions. The detailed list of these persistent sequences is presented in Table 2.

3.2. Segmentation using DINOv3

For cloud segmentation, we resize the center part of each raw image to a resolution of 512×512 and normalize the pixel values to the range [mean − std, mean + 3×std] to make clouds more salient for annotation. Examples of the pre-processing results are shown in the first column of Fig. 1.

As clouds are amorphous objects without well-defined boundaries, constructing a reliable dataset for cloud detection poses a significant challenge. To ensure labeling accuracy, only regions with high confidence are annotated. In this work, we consider three categories: cloud, sky, and contamination. Because telescopes are highly sensitive to even thin clouds, all regions unsuitable for astronomical observation are labeled as "cloud," while regions that are clearly transparent are labeled as "sky." Regions where classification is ambiguous (for example, regions covered by snow, affected by dew, saturated by sunlight, moonlight, or artificial light sources, or obscured by scattered dust) are labeled as "contamination." An example of such cases is illustrated in the third row of Fig. 1. We annotate a total of 1,111 images using LabelMe [30].
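The [mean − std, mean + 3×std] normalization described above can be sketched as follows; this is a minimal illustration on synthetic pixel values, not the released pre-processing code:

```python
import numpy as np

def stretch(img):
    """Clip pixel values to [mean - std, mean + 3*std] and rescale to
    [0, 1], making cloud structure more salient for annotation."""
    mean, std = img.mean(), img.std()
    lo, hi = mean - std, mean + 3.0 * std
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

# Synthetic stand-in for the 512x512 centre crop of a raw frame.
rng = np.random.default_rng(0)
frame = rng.normal(100.0, 20.0, size=(512, 512))
out = stretch(frame)
print(out.shape, float(out.min()), float(out.max()))  # values lie in [0, 1]
```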
The dataset is evenly distributed across moon phases (new moon, first quarter, full moon, and last quarter), times of day (02:00, 06:00, 10:00, 14:00, 18:00, and 22:00 UTC+8, which is approximately two hours ahead of local solar time), seasons (spring/autumn, summer, and winter), and cloud levels (overcast, partially cloudy, and clear). The dataset is also balanced between Part I and Part II of the observation. Examples of the annotated data are shown in Fig. 5 of the supplementary material. Given the inherently fuzzy and indistinct boundaries between sky, cloud, and contamination, we adopt a conservative labeling strategy, including only regions that unambiguously belong to a specific class. This approach minimizes the human bias that may arise from inconsistently labeling thin cloud regions, e.g., classifying them as "sky" in cloudy skies or as "cloud" in clear skies.

DINOv3 (ViT-L/16) is applied to images super-resolved to 1024×1024, producing 64×64 local feature vectors and 5 global feature vectors (1 CLS token and 4 register tokens). A linear probe on the local features is used for segmentation. Experimental results using the global features are shown in Section 4.1. This annotator is applied to all 429,620 images.

Table 1. Summary of persistence statistics for continuous observations.

Duration of persistence [days]   Total time [days]   Total frames   Proportion of time   Proportion of frames
5                                2176.20             388586         0.7996               0.9045
10                               1540.03             273491         0.5658               0.6366
30                               1000.95             179256         0.3678               0.4172
60                               625.85              117792         0.2300               0.2742

Table 2. Observations with hour-level persistence for more than one month.

Start                 End                   Duration [days]   Frames
2018-12-10 13:32:44   2019-04-19 10:02:24   129.85            24747
2020-12-06 07:22:28   2021-03-04 01:50:38   87.77             18034
2020-01-26 19:52:42   2020-04-12 06:07:00   76.43             14002
2021-05-31 22:22:44   2021-08-14 16:32:27   74.76             12206
2023-01-20 06:53:42   2023-03-27 16:43:51   66.41             12267
2019-11-20 17:12:31   2020-01-25 17:26:05   66.01             13083
2019-08-20 21:29:09   2019-10-22 14:16:30   62.70             11180
2020-10-03 21:41:58   2020-12-04 19:52:38   61.92             12273
2018-05-01 00:02:44   2018-06-27 03:55:38   57.16             9152
2021-04-04 19:40:00   2021-05-30 15:43:07   55.84             9221
2022-07-25 22:04:45   2022-09-18 23:40:07   55.07             9200
2020-05-10 11:27:52   2020-06-27 05:17:47   47.74             7514
2018-08-03 05:21:52   2018-09-18 17:11:49   46.49             7788
2023-03-28 15:04:24   2023-05-08 09:45:02   40.78             6930
2022-04-17 16:52:51   2022-05-26 06:01:39   38.55             6312
2018-06-30 16:15:53   2018-08-03 03:35:23   33.47             5347

3.3. Background annotation

The background remains mostly stable throughout the dataset. To improve segmentation accuracy, we manually identify frames in which the background changes. In most cases, the foreground varies while the camera remains fixed, owing to its stable mounting and the steady terrain. However, in Part II of the dataset, a nearby building with a movable flat dome was constructed close to the all-sky camera. The frequent movement of the roof makes manual background labeling infeasible. Fortunately, the roof has two distinct and stable configurations: raised (the 22:00, full-moon sample in Fig. 5 of the supplementary material) and lowered (the 10:00, first-quarter sample in Fig. 5 of the supplementary material).
A summary of the manual annotation results is provided in Table 3. The background is manually obtained for 14 distinct periods, six of which correspond to changes in the camera setup that required new astrometric solutions. The procedure for deriving these astrometric solutions is described in Section 3.4, and the background annotation results are given in the supplementary material.

To automatically determine the roof configuration in each image, we train a linear classifier on the DINOv3 CLS token using 3,731 images, achieving 100% accuracy on a held-out test set of 373 images. We then combine the predictions of this classifier with the fixed manual mask to automatically annotate the background in Part II of the dataset.

3.4. Astrometric calibration using stars

ASTROMETRY.NET [20] is a widely used tool for astrometric calibration. Given star positions on an image, ASTROMETRY.NET returns a world coordinate system (WCS), an invertible map from pixel space to spherical coordinate space. It uses geometric hash matching based on the assumption of a conformal transformation between detected stars and template catalogs. Typical astronomical photography has a small field of view (FoV) with limited geometric distortion; in an all-sky fisheye camera, however, the significant distortion makes the algorithm infeasible. A method similar to Jia et al. [34] is used to mitigate this issue.

Images are resized to a resolution of 4096×4096 for astrometric calibration, which corresponds to 64×64 DINOv3 patches of size 64×64. The raw pixel values are preserved for star recognition.

Table 3. Changes of background during the observation. Background proportion is the ratio of the background-mask area to the spherical FoV.

Start time (UTC+8)    New astrometry   Roof position   Images   Duration [day]   Background proportion
2018-05-01 00:02:44   Yes              –               22300    140.7            0.1917
2018-09-27 19:19:49   Yes              –               34772    203.6            0.1357
2019-04-24 15:39:36   Yes              –               9493     60.9             0.1702
2019-06-26 18:23:18   Yes              –               1343     8.7              0.1875
2019-07-05 11:59:14   Yes              –               70477    418.2            0.1902
2020-08-26 17:24:23   No               –               3124     20.9             0.2093
2020-09-16 15:30:23   No               –               1178     6.9              0.2376
2020-09-23 12:18:53   No               –               4554     25.1             0.1795
2020-10-18 14:31:56   No               –               38258    204.0            0.2160
2021-05-10 14:39:39   No               –               51789    380.6            0.2255
2022-06-01 19:41:30   No               –               47589    298.6            0.2355
2023-03-27 11:15:06   No               –               28107    184.0            0.2324
2023-09-27 18:09:48   Yes              Lowered         71976    543.9            0.3367
2023-09-27 20:47:06   No               Raised          44660    202.0            0.3674

The positions of stars on the image (u, v) are extracted using a matched-filtering algorithm implemented in SEXTRACTOR [3]. Taking the center of the FoV (u_0, v_0) as the origin, we obtain the polar coordinates (r, φ) of each resolved star in the image plane. The direction of the incoming light θ can be modeled by the inverse of an optical projection r = r(θ) from Table 8. The stars are grouped by the HEALPix [14] patch corresponding to their (θ, φ). For each group, we convert (θ, φ) into Cartesian coordinates x⃗ = (x, y, z)^T and rotate the group to θ = 0, where the distortion is minimal, using

R_z = [  cos φ_c   sin φ_c   0
        −sin φ_c   cos φ_c   0
         0         0         1 ],

R_y = [  cos θ_c   0   −sin θ_c
         0         1   0
         sin θ_c   0   cos θ_c ],

x⃗' = R_y R_z x⃗,   (1)

where (θ_c, φ_c) is the center coordinate of the HEALPix patch of the group, and x⃗' is the Cartesian coordinate of a star in that patch after rotation. Projecting the rotated stars back to the camera plane using r = r(θ), we obtain the undistorted star distribution, from which the WCS of this HEALPix patch can be obtained.
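Eq. (1) can be checked numerically: rotating the patch centre (θ_c, φ_c) by R_y R_z must place it on the optical axis (θ = 0). A minimal sketch; the (θ, φ) → Cartesian convention, with θ measured from the optical axis, is our assumption, and the helper names are illustrative:

```python
import numpy as np

def to_cartesian(theta, phi):
    # theta: angle from the optical axis; phi: azimuthal angle
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def rotate_to_axis(x, theta_c, phi_c):
    """Apply Eq. (1): rotate a star direction so that the patch centre
    (theta_c, phi_c) lands on theta = 0, where distortion is minimal."""
    Rz = np.array([[ np.cos(phi_c), np.sin(phi_c), 0.0],
                   [-np.sin(phi_c), np.cos(phi_c), 0.0],
                   [0.0,            0.0,           1.0]])
    Ry = np.array([[np.cos(theta_c), 0.0, -np.sin(theta_c)],
                   [0.0,             1.0,  0.0],
                   [np.sin(theta_c), 0.0,  np.cos(theta_c)]])
    return Ry @ Rz @ x

theta_c, phi_c = np.deg2rad(40.0), np.deg2rad(110.0)
centre = rotate_to_axis(to_cartesian(theta_c, phi_c), theta_c, phi_c)
print(np.allclose(centre, [0.0, 0.0, 1.0]))  # True: centre is on the axis
```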
When performing inference with these results, we first decide which HEALPix patch the target lands on and apply the same transformation as above. Applying the WCS to the transformed points yields the right ascension (R.A.; α) and declination (Dec.; δ) of a given pixel. Because R.A. and Dec. change with time due to the Earth's motion, altitude (Alt.; h) and azimuth (Az.; a) are used to present the astrometric calibration results. The transformation between (α, δ) and (h, a) is performed using ASTROPY [1]. For each time slot requiring an independent astrometric solution, as listed in Table 3, we apply this procedure to multiple images. The astrometric calibration for most HEALPix patches can be obtained in this way. To obtain the positions of the DINOv3 patches, we first apply the same transformation to rotate them to the center of the FoV, then use the corresponding WCS to obtain the results. Some HEALPix patches are still not resolved by ASTROMETRY.NET; these are shown in the supplementary material. We fit a radially symmetric model to interpolate the remaining DINOv3 patches. The model is

r = Σ_{i=0}^{4} w_{2i+1} z^{2i+1},   a = φ + c,   (2)

where z = π/2 − h is the zenith distance, and r and φ are centered on the zenith, which is obtained by enumeration to find the highest altitude. The uncertainty of the altitude σ_h is evaluated using error propagation:

σ_h² = σ_z² = (dz/dr)² σ_r².   (3)

The fitting residuals are summarized in Table 4. For the camera used in this study, the assumption of an orthographic projection allows most HEALPix cells in the image to be individually resolved. The effective pixel resolution at the zenith is dz/dr |_{z=0} ≈ 3 arcmin/pixel, which can be used to evaluate the altitude uncertainty via Eq. 3. Under the orthographic-projection approximation, the average calibration uncertainties are 0.37° in the zenith region and 1.34° at an altitude corresponding to z = 60°. These values are smaller than the typical field of view (FoV) of modern wide-field optical survey telescopes such as the Vera C. Rubin Observatory, ZTF, and Tianyu. Detailed residual distributions from the fitting results are provided in the supplementary material.

Table 4. Residuals of the astrometric calibration. σ_a is the residual in azimuth; σ_r is the radial residual, measured in pixels.

Start time (UTC+8)    σ_a [deg]   σ_r [pixel]
2018-05-01 00:02:44   0.569       4.494
2018-09-27 19:19:49   1.382       9.016
2019-04-24 15:39:36   0.774       4.298
2019-06-26 18:23:18   0.814       5.239
2019-07-05 11:59:14   1.978       5.958
2023-09-27 18:09:48   2.256       15.086
Average               1.295       7.349

4. Benchmark and baseline

4.1. Ambiguity-Aware Sky/Cloud Segmentation Benchmark

Because only confident regions are annotated manually, we compute metrics only on the annotated regions. We calculate Accuracy, Precision, Recall, and F1 score on the test set over the regions we have annotated. Because a conservative annotation strategy is applied, as described in Section 3.2, classical object-segmentation metrics such as mIoU are not applicable to this task. Train, validation, and test sets are split at a ratio of 8:1:1, i.e., 889 samples for training, 111 for validation, and 111 for testing. For each baseline, we bootstrap the train–validation–test split 20 times to estimate the uncertainty of the models' performance. The following models are used as baselines in this work.

Linear probe of DINOv3: As described in Section 3.2, a linear probe of the DINOv3 (ViT-L/16) model is applied to the patch-level output embeddings. Besides the local embeddings, DINOv3 also provides 1 CLS token and 4 register tokens that represent global features.
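Concretely, such a linear probe is a single softmax layer over frozen patch embeddings. The sketch below trains one with plain gradient descent on synthetic stand-in tokens (Gaussian clusters in place of real DINOv3 features); the dimensions and data are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for DINOv3 local features: N patch tokens of dimension D,
# drawn from three well-separated Gaussians (sky / cloud / contamination).
N, D, C = 600, 32, 3
labels = rng.integers(0, C, size=N)
centers = rng.normal(0.0, 3.0, size=(C, D))
feats = centers[labels] + rng.normal(0.0, 1.0, size=(N, D))

# Linear probe: one softmax layer trained by full-batch gradient descent.
W = np.zeros((D, C))
b = np.zeros(C)
onehot = np.eye(C)[labels]
for _ in range(300):
    logits = feats @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / N          # gradient of mean cross-entropy
    W -= 0.5 * feats.T @ grad
    b -= 0.5 * grad.sum(axis=0)

acc = ((feats @ W + b).argmax(axis=1) == labels).mean()
print(f"probe accuracy on synthetic tokens: {acc:.3f}")
```

In the real pipeline each frame contributes a 64×64 grid of local tokens, and the trained probe emits per-patch three-class logits.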
We consider four setups for utilizing the DINOv3 model: (1) local feature tokens only; (2) local features concatenated with the CLS token; (3) local features concatenated with the mean of the CLS and register tokens; (4) local features concatenated with the CLS and all register tokens.

Encoder-decoder (CloudSegNet-like): As mentioned in Section 2.2, CloudSegNet is a network dedicated to cloud segmentation using an encoder-decoder design. In this work, an encoder-decoder is implemented as a baseline. We adapt the network to the 512×512 image size, and the final output layer produces 512×512×3 logits to enable three-class segmentation. Details of the implementation are given in the supplementary material.

U-Net (CloudU-Net-like): As a widely used image segmentation architecture, U-Net is also implemented in this work as a baseline for cloud segmentation, with the output shape adapted accordingly. Details of the implementation are given in the supplementary material.

SegMAN: As mentioned in Section 2.2, SegMAN is a state-of-the-art image segmentation architecture. We adapt the input shape for our experiments. Tiny, small, base, and large models are implemented for comparison.

A comparison of the results is shown in Table 5. A DINOv3 linear probe using only local features achieves an overall test accuracy of 0.933 (+0.011/−0.011), while concatenating local features with the mean of the CLS and register tokens yields 0.934 (+0.010/−0.012). The two methods show no statistically significant difference on the test set. We therefore adopt the simpler approach, linear probing of DINOv3 local-feature embeddings, as our cloud segmentation method. Per-class segmentation metrics for these baselines are provided in the supplementary material.

Table 5. Overall metrics for cloud segmentation (median with 16th-83rd percentiles).

Baseline                               Accuracy                 Macro Precision          Macro Recall             Macro F1-score
DINOv3 local                           0.933 (+0.011/−0.011)    0.894 (+0.020/−0.028)    0.867 (+0.026/−0.037)    0.873 (+0.028/−0.019)
DINOv3 local + CLS                     0.930 (+0.013/−0.011)    0.890 (+0.013/−0.021)    0.853 (+0.045/−0.029)    0.868 (+0.022/−0.021)
DINOv3 local + CLS + register          0.924 (+0.020/−0.010)    0.897 (+0.008/−0.030)    0.841 (+0.046/−0.031)    0.860 (+0.033/−0.023)
DINOv3 local + mean(CLS + register)    0.934 (+0.010/−0.012)    0.892 (+0.017/−0.020)    0.866 (+0.036/−0.036)    0.872 (+0.020/−0.014)
Encoder-decoder                        0.804 (+0.014/−0.027)    0.771 (+0.037/−0.025)    0.655 (+0.033/−0.016)    0.686 (+0.022/−0.033)
U-Net                                  0.851 (+0.031/−0.020)    0.839 (+0.030/−0.080)    0.699 (+0.022/−0.034)    0.728 (+0.015/−0.048)
SegMAN Tiny                            0.885 (+0.015/−0.023)    0.801 (+0.049/−0.031)    0.760 (+0.038/−0.026)    0.772 (+0.038/−0.018)
SegMAN Small                           0.881 (+0.013/−0.027)    0.812 (+0.037/−0.036)    0.748 (+0.031/−0.023)    0.767 (+0.015/−0.020)
SegMAN Base                            0.882 (+0.008/−0.019)    0.817 (+0.025/−0.017)    0.735 (+0.033/−0.028)    0.761 (+0.024/−0.032)
SegMAN Large                           0.872 (+0.014/−0.021)    0.804 (+0.038/−0.033)    0.725 (+0.023/−0.017)    0.748 (+0.016/−0.022)

4.2. Weather Nowcast Benchmark

The input to this task consists of consecutive frames of logits generated by the DINOv3 linear probe applied to local patches. The objective is to predict the segmentation of the subsequent frame from the preceding n frames. The model provides a three-class prediction: clear, cloud, and contamination. The ground truth is derived from the inference-based tri-label map. As in Section 4.1, only non-background pixels are considered in the scoring. Unlike direct image prediction, forecasting logits is a more meaningful approach for cloud modeling. The training dataset spans May 1st, 2018 to January 1st, 2023, together with data from September 28th, 2023 to January 1st, 2025. The remaining data are used for testing. The following models are considered as baselines in this work.

Trivial baseline: For comparison, we set up a trivial baseline: an identity map from the previous frame to the next frame. This baseline is expected to have moderate performance, because the prediction is correct whenever the whole sky remains clear (sky), cloudy (cloud), or covered by snow/ice (contamination) for a long time.

Optical flow extrapolation: We implement the optical-flow-based algorithm introduced by Hamill et al. [15] as a baseline. This algorithm predicts future frames by extrapolating motion patterns from historical data. The core algorithm uses Farneback's dense optical flow method [10] to compute motion vectors between the last two input frames. For prediction, it assumes motion continuity, applying the computed flow field to warp the most recent frame forward, effectively propagating each pixel along its estimated trajectory.

ConvLSTM: ConvLSTM treats the past logit maps as a spatiotemporal tensor and uses convolutional gating to retain localized motion/morphology cues, predicting the next tri-label (sky/cloud/contamination) logits end-to-end; training and scoring follow the background masking used throughout the benchmark (only judgeable, non-background pixels contribute to the loss/metrics).

VideoGPT: The VideoGPT-style generative baseline tokenizes each 3-channel logit map with a VQ-VAE, then applies a causal transformer that autoregresses over space-time token sequences to synthesize the next frame's tokens, which are decoded back to logits.

Results of these baselines are shown in Table 6. Per-class segmentation metrics for these baselines are given in the supplementary material.
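The persistence ("trivial") baseline and the benchmark's background-masked scoring can be sketched on synthetic tri-label maps; the toy maps and rates below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy three-class label maps (0 = sky, 1 = cloud, 2 = contamination)
# on a 64x64 grid, plus a background mask excluded from scoring.
prev_frame = rng.integers(0, 3, size=(64, 64))
next_frame = prev_frame.copy()
change = rng.random((64, 64)) < 0.1          # 10% of pixels change class
next_frame[change] = rng.integers(0, 3, size=change.sum())
background = rng.random((64, 64)) < 0.3

# Persistence baseline: the prediction is simply the previous frame.
pred = prev_frame

# Score only judgeable, non-background pixels, as in the benchmark.
valid = ~background
acc = (pred[valid] == next_frame[valid]).mean()
print(f"persistence accuracy on non-background pixels: {acc:.3f}")
```

Because cloud fields often change slowly between consecutive frames, this baseline already scores highly, which is why the learned models in Table 6 show only limited gains over it.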
It is interesting that ConvLSTM achieves the highest prediction accuracy while the VideoGPT models perform worst. Meanwhile, the number of past frames used does not significantly affect the prediction accuracy. Details of the model implementations are given in the supplementary material.

Table 6. Overall metrics for weather nowcast. VideoGPT-n refers to using VideoGPT to predict the next frame from the previous n frames. Accuracy, precision, recall, and F1-score are measured by the macro average over the sky, cloud, and contamination classes.

Baseline | Accuracy | Precision | Recall | F1-score
Trivial | 0.888 | 0.859 | 0.858 | 0.858
Optical Flow | 0.888 | 0.858 | 0.860 | 0.859
ConvLSTM | 0.890 | 0.871 | 0.849 | 0.859
VideoGPT-1 | 0.871 | 0.841 | 0.823 | 0.831
VideoGPT-2 | 0.872 | 0.840 | 0.825 | 0.831
VideoGPT-7 | 0.870 | 0.841 | 0.818 | 0.827

5. Conclusion and Discussion

This work presents LenghuSky-8, an 8-year all-sky cloud dataset with star-aware masks and alt-az calibration for segmentation and nowcasting. The dataset offers long temporal coverage, fully automatic cloud annotation at 93.3% median accuracy, coverage of both daytime and nighttime, astrometric calibration that maps pixels to local altitude and azimuth, and a complete sample covering various moon phases and conditions. The dataset is important for modeling the local cloud environment, which is useful for developing schedulers for automatic ground-based survey telescopes; it is also a valuable dataset for environmental studies.

In addition, different configurations of DINOv3, an encoder-decoder, a U-Net, and various sizes of SegMAN are tested for cloud segmentation. A linear probe on DINOv3 local features achieves the highest accuracy, which shows the potential of adapting pre-trained models to diverse fields. We note that the metrics used for this task exclude ambiguous areas, which may inflate performance and obscure boundary mistakes.
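The macro-averaged, background-masked scoring used in Tables 5 and 6 (per-class precision, recall, and F1 computed over judgeable, non-background pixels only, then averaged over the three classes) can be sketched as follows; this is an illustrative implementation, not the authors' evaluation code.

```python
import numpy as np

def macro_scores(pred, truth, valid, n_classes=3):
    """Accuracy and macro precision/recall/F1 over valid pixels.

    pred, truth: integer label maps (0=sky, 1=cloud, 2=contamination);
    valid: boolean mask of judgeable pixels, so background pixels are
    excluded from every count, as in the benchmark.
    """
    p, t = pred[valid], truth[valid]
    prec, rec, f1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((p == c) & (t == c))
        fp = np.sum((p == c) & (t != c))
        fn = np.sum((p != c) & (t == c))
        pr = tp / (tp + fp) if tp + fp else 0.0
        rc = tp / (tp + fn) if tp + fn else 0.0
        prec.append(pr)
        rec.append(rc)
        f1.append(2 * pr * rc / (pr + rc) if pr + rc else 0.0)
    acc = np.mean(p == t)
    return acc, np.mean(prec), np.mean(rec), np.mean(f1)

truth = np.array([[0, 0], [1, 2]])
pred = np.array([[0, 1], [1, 2]])
valid = np.ones_like(truth, dtype=bool)
acc, P, R, F = macro_scores(pred, truth, valid)
assert acc == 0.75
```

Because each class contributes equally to the macro average, a rare class such as contamination weighs as much as the dominant sky class, which is why macro recall in Table 6 sits below overall accuracy.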
Furthermore, we test the trivial baseline, optical flow extrapolation, ConvLSTM, and VideoGPT on the weather nowcast task. Interestingly, VideoGPT achieves the worst performance among these models. In future work, we will collect more annotated images from diverse observation sites and instruments to enhance the generalization of the trained model.

In the cloud nowcasting task, optical flow extrapolation and ConvLSTM do not show a significant advantage over the trivial baseline of copying the previous frame to the next, which is a common phenomenon in time-series prediction tasks [31, 35]. Future research is required to devise more accurate approaches to cloud nowcasting. As shown in Fig. 2, incorporating astrometric calibration to register frames into a common sky coordinate system may reduce spurious motion and stabilize training. Likewise, injecting physically motivated structure, e.g. advection consistency, non-negativity and boundedness of radiometric quantities, or weak mass-continuity priors, can regularize solutions and encourage physically plausible evolution.

Acknowledgement

This work is supported by the National Key R&D Program of China, Nos. 2024YFA1611801 and 2024YFC2207700, the National Natural Science Foundation of China (NSFC) under grants No. 12473066, No. 12233009, and No. 62562052, the Shanghai Jiao Tong University 2030 Initiative, the Basic Resources Investigation Program of the Ministry of Science and Technology of China (Grant No. 2023FY101100), and Yuanqi Observatory. We also sincerely thank Yansong Wang, Jinfang Zhang, Zixuan Yang, Baolong Ma, Qin Ma, and Xiangyu Ma for their valuable work in data labeling.

References

[1] Astropy Collaboration, Adrian M. Price-Whelan, Pey Lian Lim, et al. The Astropy Project: Sustaining and Growing a Community-oriented Open-source Project and the Latest Major Release (v5.0) of the Core Package. The Astrophysical Journal, 935(2):167, 2022.
[2] Eric C. Bellm, Shrinivas R. Kulkarni, Matthew J. Graham, et al. The Zwicky Transient Facility: System Overview, Performance, and First Results. Publications of the Astronomical Society of the Pacific, 131(995):018002, 2019.
[3] E. Bertin and S. Arnouts. SExtractor: Software for source extraction. Astronomy and Astrophysics Supplement Series, 117:393-404, 1996.
[4] Xinlei Chen, Saining Xie, and Kaiming He. An Empirical Study of Training Self-Supervised Vision Transformers. arXiv e-prints, arXiv:2104.02057, 2021.
[5] Licai Deng, Fan Yang, Xiaodian Chen, et al. Lenghu on the Tibetan Plateau as an astronomical observing site. Nature, 596(7872):353-356, 2021.
[6] Soumyabrata Dev, Florian Savoy, and Yee Hui Lee. Short-term prediction of localized cloud motion using ground-based sky imagers. pages 2563-2566, 2016.
[7] Soumyabrata Dev, Yee Hui Lee, and Stefan Winkler. Color-Based Segmentation of Sky/Cloud Images From Ground-Based Cameras. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1):231-242, 2017.
[8] Soumyabrata Dev, Florian M. Savoy, Yee Hui Lee, and Stefan Winkler. Nighttime sky/cloud image segmentation. arXiv e-prints, arXiv:1705.10583, 2017.
[9] Soumyabrata Dev, Atul Nautiyal, Yee Hui Lee, and Stefan Winkler. CloudSegNet: A Deep Network for Nychthemeron Cloud Image Segmentation. IEEE Geoscience and Remote Sensing Letters, 16(12):1814-1818, 2019.
[10] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Image Analysis, pages 363-370, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
[11] F. B. Feng, Y. C. Rui, Z. M. Du, et al. Tianyu: Search for the Second Solar System and Explore the Dynamic Universe. Acta Astronomica Sinica, 65(4):34, 2024.
[12] Yunxiang Fu, Meng Lou, and Yizhou Yu. SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation. arXiv e-prints, arXiv:2412.11890, 2024.
[13] M. S. Ghonima, B. Urquhart, C. W. Chow, J. E. Shields, A. Cazorla, and J. Kleissl. A method for cloud detection and opacity classification based on ground based sky imagery. Atmospheric Measurement Techniques, 5(11):2881-2892, 2012.
[14] K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A Framework for High-Resolution Discretization and Fast Analysis of Data Distributed on the Sphere. The Astrophysical Journal, 622(2):759-771, 2005.
[15] Thomas Hamill and T. Nehrkorn. A short-term cloud forecast scheme using cross correlations. Weather and Forecasting, 8:401-411, 1993.
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked Autoencoders Are Scalable Vision Learners. arXiv e-prints, 2021.
[17] Željko Ivezić, Steven M. Kahn, J. Anthony Tyson, et al. LSST: From Science Drivers to Reference Design and Anticipated Data Products. The Astrophysical Journal, 873(2):111, 2019.
[18] Leron Julian and Aswin C. Sankaranarayanan. Precise Forecasting of Sky Images Using Spatial Warping. arXiv e-prints, arXiv:2409.12162, 2024.
[19] J. Kannala and S. S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1335-1340, 2006.
[20] Dustin Lang, David W. Hogg, Keir Mierle, Michael Blanton, and Sam Roweis. Astrometry.net: Blind Astrometric Calibration of Arbitrary Astronomical Images. The Astronomical Journal, 139(5):1782-1800, 2010.
[21] Qingyong Li, Weitao Lu, and Jun Yang. A Hybrid Thresholding Algorithm for Cloud Detection on Ground-Based Color Images. Journal of Atmospheric and Oceanic Technology, 28(10):1286-1296, 2011.
[22] X. Li, B. Wang, B. Qiu, and C. Wu. An all-sky camera image classification method using cloud cover features. Atmospheric Measurement Techniques, 15(11):3629-3639, 2022.
[23] C. N. Long, J. M. Sabburg, J. Calbó, and D. Pagès. Retrieving Cloud Characteristics from Ground-Based Daytime Color All-Sky Images. Journal of Atmospheric and Oceanic Technology, 23(5):633, 2006.
[24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv e-prints, 2023.
[25] Yicheng Rui, Yifan Xuan, Shuyue Zheng, et al. Architecture of the Tianyu Software: Relative Photometry as a Case Study. Publications of the Astronomical Society of the Pacific, 137(6):064501, 2025.
[26] Thomas Schmidt, Jonas Stührenberg, Niklas Blum, et al. Eye2Sky: a network of all-sky imagers and meteorological measurement stations for high-resolution nowcasting of solar irradiance. Meteorologische Zeitschrift, 34(1):35-55, 2025.
[27] Chaojun Shi, Yatong Zhou, and Bo Qiu. CloudU-Netv2: A cloud segmentation method for ground-based cloud images based on deep learning. Neural Processing Letters, 53:1-14, 2021.
[28] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. arXiv e-prints, arXiv:1506.04214, 2015.
[29] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, et al. DINOv3. arXiv e-prints, arXiv:2508.10104, 2025.
[30] Kentaro Wada, mpitid, Martijn Buijs, et al. wkentaro/labelme: v4.6.0, 2021.
[31] Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Chen Wang, Mingsheng Long, and Jianmin Wang. Deep Time Series Models: A Comprehensive Survey and Benchmark. arXiv e-prints, arXiv:2407.13278, 2024.
[32] Wanyi Xie, Dong Liu, Ming Yang, et al. SegCloud: a novel cloud image segmentation model using a deep convolutional neural network for ground-based all-sky-view camera observation. Atmospheric Measurement Techniques, 13(4):1953-1961, 2020.
[33] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video Generation using VQ-VAE and Transformers. arXiv e-prints, arXiv:2104.10157, 2021.
[34] Jia Yin, Yongqiang Yao, Xuan Qian, Liyong Liu, Xu Chen, and Liuming Zhai. Calibration and applications of the all-sky camera at the Ali Observatory in Tibet. Monthly Notices of the Royal Astronomical Society, 537(1):617-627, 2025.
[35] Yuanzhao Zhang and William Gilpin. Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning. arXiv e-prints, arXiv:2505.11349, 2025.
[36] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT Pre-Training with Online Tokenizer. arXiv e-prints, arXiv:2111.07832, 2021.

LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting
Supplementary Material

Table 7. Statistics of failure cases of the all-sky camera. Events that occur within 12 hours of one another are merged into a single event.

Reason | # events | # frames | Down time [days]
cover | 181 | 30111 | 169.79
strong light | 388 | 5275 | 23.54
camera down | 56 | 2167 | 20.75
object | 40 | 735 | 4.66
Total | 665 | 38288 | 218.75

Table 8. Optional projection types of the full-sky camera, as provided by [19].

Name | Formula
Gnomonic projection | r = f tan θ
Stereographic projection | r = 2f tan(θ/2)
Equidistant projection | r = f θ
Equisolid angle projection | r = 2f sin(θ/2)
Orthographic projection | r = f sin θ

A. Failure cases of the all-sky camera

A failure of the all-sky camera refers to a frame in which a large proportion of the sky condition cannot be recognized manually. Typical failure cases are shown in Fig. 4. We manually inspected the whole dataset and marked the failure cases: Cases 1-4 in Fig. 4 are marked as cover; Case 5 is marked as object; Cases 6 and 7 are marked as strong light; Case 8 is marked as camera malfunction. In total, 665 failures were identified by human inspection, and the start and end of each event are also recorded.
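The event-merging rule behind Table 7 (failure frames occurring within 12 hours of one another are counted as a single event) can be sketched as follows; the function name and toy timestamps are illustrative, not taken from the paper's pipeline.

```python
from datetime import datetime, timedelta

def merge_events(timestamps, gap=timedelta(hours=12)):
    """Group sorted failure timestamps into events: a frame closer
    than `gap` to the previous frame joins the same event,
    otherwise it starts a new one."""
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= gap:
            events[-1].append(t)      # extend the current event
        else:
            events.append([t])        # start a new event
    return events

# Two frames 5 h apart merge; a frame two days later is a new event.
frames = [datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 5),
          datetime(2020, 1, 3, 0)]
assert len(merge_events(frames)) == 2
```

Counting each event's frames and its first-to-last span would then yield the "# frames" and "Down time" columns of Table 7.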
In the segmentation task, regions for which the presence of cloud cannot be decided are annotated as the "contamination" class. An example of "cover" is shown in the third column of Fig. 1. Statistics of the failure time are given in Table 7. Mud and dew are the most frequent causes of failure, particularly from November through May. The complete table is available in the online repository. More detailed weather monitoring and camera failure information is available at https://huggingface.co/datasets/ruiyicheng/LenghuSky-8/tree/main/data. Regions that would affect the determination of the local weather, e.g. regions covered by dew, are annotated as contamination.

B. Projection type of the all-sky camera

In this work, several projection types are used for distortion correction. These types are listed in Table 8.

C. Manual annotation samples and annotation process details for the cloud segmentation task

Manual annotation samples for the cloud segmentation task are shown in Fig. 5. The 1,111-image reference set was annotated by 9 trained students and engineers with astronomy backgrounds under the lead of an astronomer. We stratified the sampling by time of day, season, moon phase, and cloud coverage to ensure balanced coverage; rare cases largely determine the final size. Each image is labeled in LabelMe following written guidelines with conservative partial labeling: only high-confidence regions are labeled, while ambiguous pixels are left unlabeled and ignored in the loss/metrics. Every image then undergoes a second-pass expert review/edit to enforce cross-annotator consistency; disagreements are resolved by adjudication in this pass. Some bright structures in the examples of Fig. 5 arise from scattering or stray light (dust/dew/scratches/saturation) and can resemble thin clouds in a single frame.
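The projection models of Table 8 each map the incidence angle θ to a radial distance r on the sensor. A minimal sketch of the standard fisheye models from Kannala and Brandt [19] follows; the focal length value and dictionary layout are our own illustration.

```python
import math

# Radial projection models r(θ) for a fisheye lens of focal length f
# (in pixels), as listed in Table 8.
PROJECTIONS = {
    "gnomonic":      lambda f, t: f * math.tan(t),
    "stereographic": lambda f, t: 2 * f * math.tan(t / 2),
    "equidistant":   lambda f, t: f * t,
    "equisolid":     lambda f, t: 2 * f * math.sin(t / 2),
    "orthographic":  lambda f, t: f * math.sin(t),
}

f = 1000.0                       # illustrative focal length in pixels
theta = math.radians(30)         # incidence angle from the optical axis
r = {name: proj(f, theta) for name, proj in PROJECTIONS.items()}

# All models agree to first order for small θ (r ≈ f·θ); away from the
# axis, gnomonic stretches the sky edge and orthographic compresses it.
assert abs(r["equidistant"] - f * theta) < 1e-9
assert r["gnomonic"] > r["equidistant"] > r["orthographic"]
```

Fitting which r(θ) curve (plus distortion terms) best matches the stellar astrometry is what allows each pixel to be assigned an altitude and azimuth.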
Our guideline treats these as contamination when they obstruct sky/cloud attribution; otherwise they are annotated as sky/cloud, or left unlabeled if ambiguous. Annotators also consult adjacent frames to distinguish evolving clouds from static artifacts.

D. Fitting and residual of astrometric calibration

The altitude-azimuth fitting map for each time slot is shown in Fig. 6. Residuals of the fitting results are shown in Fig. 7.

E. Annotation of background

The annotation of the background classes listed in Table 3 is shown in Fig. 8.

F. Experimental details of baseline models

F.1. Encoder-decoder for cloud segmentation

We use polygon annotations produced in LabelMe JSON format. For each sample, the RGB image is recovered from the embedded base64-encoded payload and converted to a three-channel array. Semantic masks are rasterized by filling the annotated polygons per class on an empty canvas that shares the image resolution. Pixels not covered by any polygon are assigned an "ignore" label. The class mapping is sky → 0, cloud → 1, contamination → 2, and ignore → 3.

Figure 4. Failure cases of the all-sky camera. Top row (left to right): (1) covered by dust or sand; (2) covered by dew or ice; (3) scattered light caused by mud coverage; (4) covered by snow. Bottom row (left to right): (5) obstruction by an external object; (6) strong nearby light source; (7) strong distant light source; (8) camera malfunction.

Figure 5. Manual annotation for cloud segmentation. Blue represents cloud regions; orange represents sky regions; pink represents contamination regions. Columns correspond to images taken around the given time in UTC+8; rows correspond to images taken under different moon-phase conditions. Nearby frames are consulted by the annotator to determine whether some regions are scattered light or cloud.
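The polygon-to-mask rasterization described in F.1 (fill class polygons on an empty canvas; uncovered pixels keep the ignore label 3) can be sketched as follows. This is a simple even-odd point-in-polygon scan for illustration, not LabelMe's own rasterization utilities, and the function names are ours.

```python
import numpy as np

IGNORE = 3  # class mapping: sky=0, cloud=1, contamination=2, ignore=3

def point_in_poly(px, py, poly):
    """Even-odd ray-casting test for a single point against a polygon
    given as a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def rasterize(shape, polygons):
    """polygons: list of (class_id, [(x, y), ...]) in draw order.
    Pixels covered by no polygon keep the IGNORE label."""
    mask = np.full(shape, IGNORE, dtype=np.uint8)
    for class_id, poly in polygons:
        for y in range(shape[0]):
            for x in range(shape[1]):
                if point_in_poly(x + 0.5, y + 0.5, poly):
                    mask[y, x] = class_id
    return mask

# A 4x4 canvas with a square 'cloud' polygon covering pixels (1..2, 1..2).
mask = rasterize((4, 4), [(1, [(1, 1), (3, 1), (3, 3), (1, 3)])])
assert mask[1, 1] == 1 and mask[2, 2] == 1
assert mask[0, 0] == IGNORE
```

The per-pixel loop is deliberately naive for clarity; production code would use a scanline fill or a library rasterizer at 512 × 512 resolution.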
[Figure 6 panels (a)-(l): altitude (0-90°) and azimuth (0-360°) fitting maps over the 4096 × 4096 pixel grid for 2018-05-01 00:02:44, 2018-09-27 19:19:49, 2019-04-24 15:39:36, 2019-06-26 18:23:18, 2019-07-05 11:59:14, and 2023-09-27 18:09:48.]

Figure 6. Altitude-azimuth fitting results for different time slots. Red boxes indicate that the WCS of the corresponding HEALPix cell is not resolved by ASTROMETRY.NET for any image in the ensemble; the altitude and azimuth there are obtained from the fitting results.

Polygons are rounded to integer coordinates and clipped to image bounds before rasterization to avoid off-by-one artifacts.

CloudSegNet is a compact encoder-decoder CNN with two downsampling stages and two symmetric upsampling stages. A layer-wise specification of this model is shown in Table 9. We train with pixel-wise cross-entropy using an ignore index and uniform class weights. The optimizer is Adam with learning rate 1e-4. We train for up to 500 epochs and apply early stopping on the validation loss with a patience of 5 epochs, retaining the checkpoint with the best validation loss.

F.2. U-Net (CloudU-Net) for cloud segmentation

The pre-processing steps and training parameters of the U-Net are similar to those of the encoder-decoder above. The layer-wise specification of the model applied in this paper is shown in Table 10.
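The masked loss described in F.1 skips ignore-labeled pixels, as in PyTorch's `CrossEntropyLoss(ignore_index=...)`. A NumPy sketch of such a pixel-wise cross-entropy (our own illustrative implementation, not the training code itself):

```python
import numpy as np

IGNORE_INDEX = 3

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over pixels whose label != ignore_index.

    logits: (N, C) per-pixel class scores; labels: (N,) integer classes.
    Ignored pixels contribute neither to the loss nor to the mean.
    """
    keep = labels != ignore_index
    z = logits[keep]
    y = labels[keep]
    # log-softmax with max subtraction for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

logits = np.array([[5.0, 0.0, 0.0],    # confident 'sky'
                   [0.0, 5.0, 0.0],    # confident 'cloud'
                   [9.9, 9.9, 9.9]])   # this pixel is ignored
labels = np.array([0, 1, IGNORE_INDEX])
loss = masked_cross_entropy(logits, labels)
assert loss < 0.05                     # low loss on confident pixels
```

Without the mask, the uninformative third pixel would pull the average loss up; excluding it mirrors how unlabeled ambiguous regions are kept out of both training and evaluation.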
[Figure 7 panels (a)-(l): radial residuals r(z) (RMSE 4.298-15.086 px) and azimuth residuals (circular RMSE 0.569°-2.256°) for the same six epochs from 2018-05-01 00:02:44 to 2023-09-27 18:09:48.]

Figure 7. Astrometric calibration fitting residuals for different time slots in the radial and azimuthal directions.

Table 9. Layer-wise specification of the encoder-decoder (CloudSegNet) for a 3 × 512 × 512 input. "k/s/p" denotes kernel/stride/padding.

Stage | Layer (in → out) | Operation | k/s/p | Channel out | Size (H × W)
Enc-1 | 3 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Enc-1 | 64 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Enc-1 ↓ | – | MaxPool2d | 2/2/0 | 64 | 256 × 256
Enc-2 | 64 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 256 × 256
Enc-2 | 128 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 256 × 256
Enc-2 ↓ | – | MaxPool2d | 2/2/0 | 128 | 128 × 128
Dec-1 ↑ | 128 → 64 | ConvTranspose2d + ReLU | 2/2/0 | 64 | 256 × 256
Dec-1 | 64 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 256 × 256
Dec-2 ↑ | 64 → 32 | ConvTranspose2d + ReLU | 2/2/0 | 32 | 512 × 512
Head | 32 → 3 | Conv2d (logits) | 1/1/0 | 3 | 512 × 512

F.3. SegMAN for cloud segmentation

The pre-processing steps and training parameters of SegMAN are similar to those of the encoder-decoder above. SegMAN is an encoder-decoder segmentation network with scalable variants (Tiny/Small/Base/Large). The parameters used in this work for these scales are exactly the same as in the original paper [12]. All variants share the same data interface and segmentation head: the model outputs dense logits that are bilinearly resized to the mask resolution when needed. Variants differ only in encoder capacity and decoder width; we keep training and evaluation identical across variants.

[Figure 8 panels: background annotations for 14 epochs from 2018-05-01 00:02:44 to 2023-09-27 20:47:06.]

Figure 8. Annotation of all backgrounds.

F.4.
F.4. ConvLSTM for cloud nowcast

We implement a next-frame ConvLSTM baseline on sequences of logits with per-frame masks. Each training sample comes from CloudLogitsDataset, which yields a video tensor x ∈ R^(C×T×H×W) together with a binary mask stack m ∈ {0,1}^(T×H×W); frames are built from timestamped files and normalised per sequence by (x − µ)/(σ · 6) before batching. For learning, we enforce T = n_input + 1, feed the first n_input frames to the ConvLSTM, and regress the last frame with a masked MSE on the target mask m_T; optimisation uses Adam. The configuration used in our experiments sets n_input = 2, hidden dimensions [64, 64], kernel size 3, and dropout 0, with data sampled at 60-min intervals from user-specified time ranges. The layer-wise specification of the ConvLSTM used in this paper is shown in Table 11.

F.5. VideoGPT for cloud nowcast

Our VideoGPT nowcaster follows a two-stage discrete latent modelling pipeline. First, a VQ-VAE encodes videos by time-slicing and then vector-quantises features with a codebook of size K; training minimises masked reconstruction MSE plus the standard commitment/codebook losses and tracks codebook perplexity. After training the VQ-VAE, we freeze the best checkpoint and train a causal Transformer (GPT) as an autoregressive language model over the flattened VQ indices, using cross-entropy on next-token prediction. Key hyperparameters include K = 512 codes for the VQ-VAE and a GPT with d_model = 512, n_head = 8, and 6 layers. The architecture of the VideoGPT used in this work is shown in Table 12.

G. Per-class experimental results

The per-class nowcasting results are shown in Table 13; the per-class cloud segmentation results are shown in Table 14.

Table 10. Layer-wise specification of the U-Net used in our experiments (bilinear upsampling variant) for a 3 × 512 × 512 input.

Stage | Layer (in → out) | Operation | k/s/p | Ch. out | Size (H × W)
Enc-0 | 3 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Enc-0 | 64 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Down-1 ↓ | – | MaxPool2d | 2/2/0 | 64 | 256 × 256
Down-1 | 64 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 256 × 256
Down-1 | 128 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 256 × 256
Down-2 ↓ | – | MaxPool2d | 2/2/0 | 128 | 128 × 128
Down-2 | 128 → 256 | Conv2d + ReLU + BN | 3/1/1 | 256 | 128 × 128
Down-2 | 256 → 256 | Conv2d + ReLU + BN | 3/1/1 | 256 | 128 × 128
Down-3 ↓ | – | MaxPool2d | 2/2/0 | 256 | 64 × 64
Down-3 | 256 → 512 | Conv2d + ReLU + BN | 3/1/1 | 512 | 64 × 64
Down-3 | 512 → 512 | Conv2d + ReLU + BN | 3/1/1 | 512 | 64 × 64
Bottleneck ↓ | – | MaxPool2d | 2/2/0 | 512 | 32 × 32
Bottleneck | 512 → 512 | Conv2d + ReLU + BN | 3/1/1 | 512 | 32 × 32
Bottleneck | 512 → 512 | Conv2d + ReLU + BN | 3/1/1 | 512 | 32 × 32
Up-1 ↑ | 512 → 512 | Upsample (bilinear) | 2/–/– | 512 | 64 × 64
Up-1 | 512 + 512 | Concat (skip from Down-3) | – | 1024 | 64 × 64
Up-1 | 1024 → 256 | Conv2d + ReLU + BN | 3/1/1 | 256 | 64 × 64
Up-1 | 256 → 256 | Conv2d + ReLU + BN | 3/1/1 | 256 | 64 × 64
Up-2 ↑ | 256 → 256 | Upsample (bilinear) | 2/–/– | 256 | 128 × 128
Up-2 | 256 + 256 | Concat (skip from Down-2) | – | 512 | 128 × 128
Up-2 | 512 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 128 × 128
Up-2 | 128 → 128 | Conv2d + ReLU + BN | 3/1/1 | 128 | 128 × 128
Up-3 ↑ | 128 → 128 | Upsample (bilinear) | 2/–/– | 128 | 256 × 256
Up-3 | 128 + 128 | Concat (skip from Down-1) | – | 256 | 256 × 256
Up-3 | 256 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 256 × 256
Up-3 | 64 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 256 × 256
Up-4 ↑ | 64 → 64 | Upsample (bilinear) | 2/–/– | 64 | 512 × 512
Up-4 | 64 + 64 | Concat (skip from Enc-0) | – | 128 | 512 × 512
Up-4 | 128 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Up-4 | 64 → 64 | Conv2d + ReLU + BN | 3/1/1 | 64 | 512 × 512
Head | 64 → 3 | Conv2d (logits) | 1/1/0 | 3 | 512 × 512

Table 11. Layer-wise specification of the ConvLSTM next-frame baseline. Input is a [B, C=3, T_in=2, H, W] clip; output is a single frame [B, 3, H, W]. "k/s/p" denotes kernel/stride/padding.

Stage | Layer (in → out) | Operation | k/s/p | Ch. out | Size (H × W)
LSTM-1 | 3 → 64 | ConvLSTM2D + Dropout | 3/1/1 | 64 | H × W
LSTM-2 | 64 → 64 | ConvLSTM2D + Dropout | 3/1/1 | 64 | H × W
Head | 64 → 3 | Conv2d (readout) | 1/1/0 | 3 | H × W

Table 12. Architecture of the VideoGPT nowcaster (VQ-VAE + GPT). For a [B, 3, T, 64, 64] input, the encoder downsamples to [B, 256, T, 8, 8], which is vector-quantised (codebook size K). Tokens are modelled autoregressively by a Transformer and decoded back to frames.

VQ-VAE Encoder
Enc-1 | 3 → 64 | Conv2d + BN + ReLU + ResidualBlock | 4/2/1 | 64 | 32 × 32
Enc-2 | 64 → 128 | Conv2d + BN + ReLU + ResidualBlock | 4/2/1 | 128 | 16 × 16
Enc-3 | 128 → 256 | Conv2d + BN + ReLU + ResidualBlock | 4/2/1 | 256 | 8 × 8
Bottleneck | 256 → 256 | Conv2d + BN + ReLU | 3/1/1 | 256 | 8 × 8
Vector Quantiser
VQ | 256 → K | VectorQuantizer (K=512, D=256) | – | K | T × 8 × 8
VQ-VAE Decoder
Dec-3 ↑ | 256 → 128 | ConvTranspose2d + BN + ReLU + ResidualBlock | 4/2/1 | 128 | 16 × 16
Dec-2 ↑ | 128 → 64 | ConvTranspose2d + BN + ReLU + ResidualBlock | 4/2/1 | 64 | 32 × 32
Dec-1 ↑ | 64 → 64 | ConvTranspose2d + BN + ReLU + ResidualBlock | 4/2/1 | 64 | 64 × 64
Head | 64 → 3 | Conv2d + Tanh | 3/1/1 | 3 | 64 × 64
GPT (autoregressive over tokens)
TokEmb | K → d_model | Embedding | – | 512 | –
PosEmb | – | Learned positional embedding (length 20,000) | – | 512 | –
Transf | 512 → 512 | #Layers = 6, #Heads = 8 (TransformerEncoderLayer) | – | 512 | –
LN | 512 → 512 | LayerNorm | – | 512 | –
Head | 512 → K | Linear (projection to vocab) | – | K | –

Table 13. Per-class metrics for nowcasting.

Class | Baseline | Precision | Recall | F1-score
Sky | Trivial | 0.914 | 0.913 | 0.914
Sky | Optical Flow | 0.913 | 0.913 | 0.913
Sky | ConvLSTM | 0.922 | 0.905 | 0.913
Sky | VideoGPT-1 | 0.920 | 0.882 | 0.900
Sky | VideoGPT-2 | 0.915 | 0.887 | 0.901
Sky | VideoGPT-7 | 0.919 | 0.883 | 0.900
Cloud | Trivial | 0.876 | 0.878 | 0.877
Cloud | Optical Flow | 0.878 | 0.878 | 0.878
Cloud | ConvLSTM | 0.860 | 0.907 | 0.883
Cloud | VideoGPT-1 | 0.833 | 0.904 | 0.867
Cloud | VideoGPT-2 | 0.841 | 0.897 | 0.868
Cloud | VideoGPT-7 | 0.830 | 0.903 | 0.865
Contamination | Trivial | 0.787 | 0.782 | 0.785
Contamination | Optical Flow | 0.784 | 0.788 | 0.786
Contamination | ConvLSTM | 0.831 | 0.734 | 0.779
Contamination | VideoGPT-1 | 0.768 | 0.685 | 0.724
Contamination | VideoGPT-2 | 0.763 | 0.690 | 0.724
Contamination | VideoGPT-7 | 0.774 | 0.668 | 0.717

Table 14. Per-class metrics for cloud segmentation (median with 16th–83rd percentiles).

Class | Baseline | Precision | Recall | F1-score
Sky | DINOv3 local | 0.920 (+0.028/−0.022) | 0.945 (+0.018/−0.009) | 0.936 (+0.006/−0.015)
Sky | DINOv3 local + CLS | 0.917 (+0.037/−0.025) | 0.941 (+0.019/−0.014) | 0.932 (+0.012/−0.013)
Sky | DINOv3 local + CLS + register | 0.913 (+0.038/−0.031) | 0.942 (+0.018/−0.025) | 0.926 (+0.015/−0.012)
Sky | DINOv3 local + mean(CLS + register) | 0.921 (+0.031/−0.019) | 0.944 (+0.018/−0.019) | 0.936 (+0.009/−0.016)
Sky | Encoder-Decoder | 0.764 (+0.061/−0.027) | 0.868 (+0.056/−0.034) | 0.816 (+0.021/−0.009)
Sky | U-Net | 0.869 (+0.023/−0.039) | 0.912 (+0.025/−0.047) | 0.882 (+0.017/−0.016)
Sky | SegMAN Tiny | 0.881 (+0.027/−0.019) | 0.926 (+0.022/−0.027) | 0.906 (+0.008/−0.017)
Sky | SegMAN Small | 0.871 (+0.026/−0.031) | 0.929 (+0.031/−0.019) | 0.904 (+0.013/−0.010)
Sky | SegMAN Base | 0.874 (+0.017/−0.027) | 0.927 (+0.025/−0.017) | 0.901 (+0.010/−0.012)
Sky | SegMAN Large | 0.854 (+0.025/−0.014) | 0.931 (+0.021/−0.026) | 0.899 (+0.005/−0.021)
Cloud | DINOv3 local | 0.961 (+0.013/−0.018) | 0.966 (+0.011/−0.011) | 0.961 (+0.013/−0.010)
Cloud | DINOv3 local + CLS | 0.965 (+0.006/−0.024) | 0.969 (+0.007/−0.016) | 0.960 (+0.012/−0.012)
Cloud | DINOv3 local + CLS + register | 0.954 (+0.019/−0.016) | 0.965 (+0.013/−0.011) | 0.960 (+0.009/−0.013)
Cloud | DINOv3 local + mean(CLS + register) | 0.966 (+0.010/−0.021) | 0.970 (+0.007/−0.017) | 0.962 (+0.013/−0.010)
Cloud | Encoder-Decoder | 0.823 (+0.052/−0.034) | 0.823 (+0.050/−0.047) | 0.822 (+0.028/−0.035)
Cloud | U-Net | 0.863 (+0.035/−0.062) | 0.921 (+0.017/−0.031) | 0.880 (+0.020/−0.036)
Cloud | SegMAN Tiny | 0.902 (+0.031/−0.027) | 0.925 (+0.018/−0.042) | 0.910 (+0.017/−0.019)
Cloud | SegMAN Small | 0.905 (+0.031/−0.029) | 0.912 (+0.036/−0.047) | 0.904 (+0.020/−0.015)
Cloud | SegMAN Base | 0.900 (+0.028/−0.029) | 0.921 (+0.021/−0.038) | 0.913 (+0.008/−0.024)
Cloud | SegMAN Large | 0.899 (+0.025/−0.034) | 0.900 (+0.032/−0.024) | 0.899 (+0.018/−0.024)
Contamination | DINOv3 local | 0.808 (+0.056/−0.092) | 0.686 (+0.080/−0.121) | 0.726 (+0.059/−0.055)
Contamination | DINOv3 local + CLS | 0.783 (+0.068/−0.057) | 0.644 (+0.135/−0.068) | 0.715 (+0.053/−0.050)
Contamination | DINOv3 local + CLS + register | 0.806 (+0.060/−0.084) | 0.630 (+0.121/−0.113) | 0.707 (+0.067/−0.059)
Contamination | DINOv3 local + mean(CLS + register) | 0.793 (+0.073/−0.087) | 0.675 (+0.126/−0.114) | 0.715 (+0.047/−0.042)
Contamination | Encoder-Decoder | 0.730 (+0.108/−0.124) | 0.285 (+0.077/−0.071) | 0.396 (+0.077/−0.056)
Contamination | U-Net | 0.839 (+0.055/−0.278) | 0.280 (+0.076/−0.089) | 0.415 (+0.072/−0.130)
Contamination | SegMAN Tiny | 0.637 (+0.113/−0.126) | 0.437 (+0.112/−0.092) | 0.496 (+0.100/−0.052)
Contamination | SegMAN Small | 0.678 (+0.117/−0.153) | 0.383 (+0.121/−0.077) | 0.489 (+0.043/−0.057)
Contamination | SegMAN Base | 0.683 (+0.069/−0.065) | 0.376 (+0.089/−0.105) | 0.483 (+0.057/−0.095)
Contamination | SegMAN Large | 0.654 (+0.145/−0.112) | 0.359 (+0.066/−0.087) | 0.464 (+0.045/−0.065)
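The per-sequence normalisation and masked MSE objective used by the ConvLSTM baseline (Sec. F.4) amount to the following; a numpy sketch with toy shapes, where the loss averages squared error only over valid target pixels:

```python
import numpy as np

def normalise_sequence(x):
    # Per-sequence normalisation (x - mu) / (sigma * 6), as stated in Sec. F.4.
    mu, sigma = x.mean(), x.std()
    return (x - mu) / (sigma * 6)

def masked_mse(pred, target, mask):
    # Mean squared error restricted to valid (mask == 1) pixels.
    se = (pred - target) ** 2 * mask
    return se.sum() / mask.sum()

# Toy C x T x H x W clip; the real tensors are 3-class logits at 512 x 512.
x = np.random.default_rng(1).normal(size=(3, 3, 8, 8))
xn = normalise_sequence(x)

pred = np.ones((8, 8))
target = np.zeros((8, 8))
mask = np.zeros((8, 8)); mask[:4] = 1.0  # only the top half is valid
print(masked_mse(pred, target, mask))  # 1.0: error is 1 at every valid pixel
```

Pixels outside the mask (e.g. the horizon ring of the all-sky frame) contribute neither to the loss sum nor to its denominator.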
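The per-class scores in Tables 13 and 14 are standard one-vs-rest precision, recall, and F1 computed per class label (sky/cloud/contamination); a self-contained sketch of that computation on toy label maps:

```python
import numpy as np

def per_class_prf(pred, truth, cls):
    # One-vs-rest precision, recall, and F1 for a single class label.
    tp = np.sum((pred == cls) & (truth == cls))
    fp = np.sum((pred == cls) & (truth != cls))
    fn = np.sum((pred != cls) & (truth == cls))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-class maps: 0 = sky, 1 = cloud, 2 = contamination.
truth = np.array([0, 0, 1, 1, 2, 2])
pred  = np.array([0, 1, 1, 1, 2, 0])
p, r, f1 = per_class_prf(pred, truth, cls=1)
print(p, r, f1)  # cloud: precision 2/3, recall 1.0, F1 0.8
```

In Table 14 this is evaluated per test image, and the median with 16th–83rd percentiles is reported across images.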
