VK-LSVD: A Large-Scale Industrial Dataset for Short-Video Recommendation
Short-video recommendation presents unique challenges, such as modeling rapid user interest shifts from implicit feedback, but progress is constrained by a lack of large-scale open datasets that reflect real-world platform dynamics. To bridge this gap, we introduce the VK Large Short-Video Dataset (VK-LSVD), the largest publicly available industrial dataset of its kind. VK-LSVD offers an unprecedented scale of over 40 billion interactions from 10 million users and almost 20 million videos over six months, alongside rich features including content embeddings, diverse feedback signals, and contextual metadata. Our analysis supports the dataset’s quality and diversity. The dataset’s immediate impact is confirmed by its central role in the live VK RecSys Challenge 2025. VK-LSVD provides a vital open resource for building realistic benchmarks that accelerate research in sequential recommendation, cold-start scenarios, and next-generation recommender systems.
💡 Research Summary
The paper presents VK‑LSVD, a publicly released industrial dataset designed to fill a critical gap in short‑video recommendation research. Existing public datasets such as Netflix Prize or MovieLens lack the scale, temporal depth, and multimodal feedback that characterize modern short‑video platforms, while the few large‑scale industrial datasets (e.g., KuaiRand, KuaiRec, KuaiSAR) are limited to tens of thousands of users and short observation windows. VK‑LSVD addresses these limitations by providing six months of interaction logs from VK’s short‑video service, encompassing 40.8 billion interaction events generated by 10 million users on 19.6 million videos.
Key characteristics of the dataset include:
- Rich feedback signals – The data captures both implicit and explicit user actions: watch time (capped at 255 seconds but cumulative across re‑views), likes, dislikes, shares, bookmarks, author clicks, and comment opens. This variety enables researchers to explore nuanced user engagement beyond binary clicks.
- Extensive contextual metadata – For each exposure, the dataset records the consumption context (feed vs. search), platform type (Android, Web, iOS), client agent, and geographic region (80 distinct zones). Such context is essential for building situation‑aware recommendation models.
- Static user and item attributes – User demographics (age 18‑70, gender, most frequent location) and item attributes (author ID, video duration) are provided, supporting fairness, bias, and cold‑start studies.
- Content embeddings – A 64‑dimensional float16 embedding vector is supplied for every video. These embeddings were generated by a proprietary multimodal model and then compressed using truncated SVD, preserving the most informative dimensions while allowing flexible dimensionality reduction.
- Global Temporal Split (GTS) – Interaction records are stored as weekly Parquet files, with the first 25 weeks designated for training, the 26th week for validation, and the 27th week for testing. This strict chronological split mirrors real‑world deployment scenarios and enables robust evaluation of sequential and session‑based recommendation algorithms.
- Data quality validation – The authors demonstrate that user activity and item popularity follow classic power‑law distributions, confirming realism. They further compute cosine similarity of iALS latent factors across demographic groups (age, gender, location) and content attributes (author, duration, embedding clusters). Results show strong intra‑group similarity, indicating that the dataset captures meaningful user and item relationships.
- Baseline experiments – Four simple models are evaluated: Random, Global Popularity, a Conversion‑based click‑through rate predictor, and Implicit Alternating Least Squares (iALS) using watch time > 10 seconds as a positive signal. iALS achieves the highest NDCG@20 (0.0655 on the random split), illustrating the benefit of leveraging implicit feedback at this scale.
- Real‑world impact via VK RecSys Challenge 2025 – The dataset serves as the core of a major competition where participants must rank the top 100 most relevant users for each newly uploaded video (a cold‑start, user‑to‑item matching task). Evaluation uses NDCG@100 per clip, encouraging the development of models that can handle new‑item recommendation under strict per‑user recommendation caps.
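Because the released 64‑dimensional embeddings are the leading components of a truncated SVD, further dimensionality reduction amounts to keeping a prefix of the columns. A minimal sketch (the array here is a random stand‑in for the real embedding file, which the dataset card documents):

```python
import numpy as np

# Stand-in for the released 64-d float16 video embeddings, one row per video.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float16)

# Truncated-SVD components are ordered by explained variance, so a coarser
# k-dimensional representation is simply the first k columns.
k = 16
emb_k = emb[:, :k].astype(np.float32)  # upcast for numerically stable downstream math
print(emb_k.shape)  # (1000, 16)
```

This is what "flexible dimensionality reduction" buys: no re-fitting is needed to trade accuracy for memory.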
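The Global Temporal Split maps directly onto the weekly Parquet layout: train on the first 25 weekly files, validate on week 26, test on week 27. A sketch of the partitioning (the `week_XX.parquet` naming is an assumption for illustration; check the dataset card for the actual file names):

```python
# Hypothetical weekly file names; the real names on Hugging Face Hub may differ.
weeks = [f"week_{i:02d}.parquet" for i in range(1, 28)]

train_files = weeks[:25]    # weeks 1-25: training
val_files = weeks[25:26]    # week 26: validation
test_files = weeks[26:]     # week 27: test

# With pandas/pyarrow installed, a split can then be loaded (memory permitting) via
# pd.concat(pd.read_parquet(f) for f in train_files).
print(len(train_files), len(val_files), len(test_files))  # 25 1 1
```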
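The iALS baseline treats watch time above 10 seconds as a positive signal, which in practice means binarizing the log into a sparse user–item matrix before factorization (e.g. with an iALS solver such as the `implicit` library). A sketch using SciPy, with a toy interaction log standing in for the real data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction log: (user_idx, item_idx, watch_seconds).
log = [(0, 0, 42), (0, 1, 3), (1, 1, 17), (2, 0, 255), (2, 2, 9)]

users, items, secs = (np.array(x) for x in zip(*log))
positive = secs > 10  # the baseline's positivity threshold

# Binary CSR matrix of positives -- the usual input to an iALS solver.
ui = csr_matrix(
    (np.ones(positive.sum()), (users[positive], items[positive])),
    shape=(3, 3),
)
print(ui.nnz)  # 3 of the 5 interactions survive the threshold
```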
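The challenge metric, NDCG@100 per clip, rewards placing the truly relevant users near the top of each video's 100‑user list. A minimal self‑contained implementation of binary NDCG@k (my own sketch, not the official evaluation code):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=100):
    """Binary NDCG@k: ranked_ids is the model's ordering for one clip,
    relevant_ids the ground-truth positive user set."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, uid in enumerate(ranked_ids[:k])
        if uid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# A miss at rank 2 pushes the score below the perfect 1.0.
print(ndcg_at_k(["u1", "u9", "u3"], {"u1", "u3"}, k=100))
```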
From an ethical standpoint, all identifiers (user_id, item_id, author_id, and categorical context fields) are irreversibly hashed, and no raw video, audio, or textual content is released. The provided embeddings are designed such that original media cannot be reconstructed, thereby respecting privacy and copyright constraints. The dataset is released under the Apache License 2.0 on Hugging Face Hub, allowing unrestricted academic and commercial use.
The authors summarize their contributions as follows: (1) introducing the largest publicly available short‑video recommendation dataset, (2) providing a thorough statistical and technical validation of its quality and diversity, and (3) demonstrating its practical relevance through a high‑profile competition. By offering a combination of massive scale, multimodal feedback, rich contextual metadata, and content embeddings, VK‑LSVD establishes a new benchmark for research in sequential recommendation, cold‑start handling, long‑tail exploration, and fairness‑aware recommender systems. Researchers can start with provided subsets (e.g., 1 % random users or popular items) on modest hardware and scale up to the full dataset for large‑scale distributed training.
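Beyond the provided subsets, comparable samples can be drawn deterministically so that every interaction of a sampled user is kept, preserving full histories. A sketch of 1% user sampling via a stable hash (the IDs here are hypothetical; the dataset's `user_id` values are already anonymized hashes):

```python
import hashlib

def keep_user(user_id: str, percent: int = 1) -> bool:
    """Deterministically keep roughly `percent`% of users by hashing
    their ID into 100 buckets; stable across runs and machines."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

sample = [u for u in (f"u{i}" for i in range(10_000)) if keep_user(u)]
print(len(sample))  # roughly 100 of the 10,000 toy user IDs
```

Hash-based sampling, unlike random row sampling, never splits a user's history across the kept and dropped portions.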
In conclusion, VK‑LSVD fills a crucial void in the ecosystem of recommendation datasets. Its comprehensive design enables realistic simulation of industrial short‑video environments, fostering advances that are more likely to transfer to production systems. The dataset is expected to become a standard reference for future work on rapid user interest shifts, multimodal content integration, and responsible AI in the short‑video domain.