From Street View to Visibility Network: Mapping Urban Visual Relationships with Vision-Language Models
Visibility analysis is a fundamental analytical method in urban planning and landscape research, traditionally conducted through computational simulations based on the Line-of-Sight (LoS) principle. However, when assessing the visibility of named urban objects such as landmarks, geometric intersection alone fails to capture the contextual and perceptual dimensions of visibility as experienced in the real world. This study challenges traditional LoS-based approaches by introducing a new, image-based visibility analysis method. Specifically, a Vision-Language Model (VLM) is applied to detect the target object within a direction-zoomed Street View Image (SVI). Successful detection indicates the object's visibility at the corresponding SVI location. Further, a heterogeneous visibility graph is constructed to capture the complex interactions between observers and target objects. In the first case study, the method proves reliable in detecting the visibility of six tall landmark structures in global cities, with an overall accuracy of 87%. Furthermore, it reveals broader contextual differences in how the landmarks are perceived and experienced. In the second case, the proposed visibility graph uncovers the form and strength of connections among multiple landmarks along the River Thames in London, as well as the places where these connections occur. Notably, bridges on the River Thames account for approximately 30% of total connections. Our method complements and enhances traditional LoS-based visibility analysis, and demonstrates the potential to reveal visual connections between arbitrary objects in the urban environment. It opens up new research perspectives for urban planning, heritage conservation, and computational social science.
💡 Research Summary
The paper revisits the foundations of urban visibility analysis, which has traditionally relied on line‑of‑sight (LoS) simulations that require detailed 3D data and treat visibility as a purely geometric relationship. Recognizing the scarcity of high‑resolution DSM/DEM data and the inability of LoS to capture contextual cues such as lighting, vegetation, and the co‑occurrence of multiple objects, the authors propose an image‑based alternative that leverages the ubiquity of street‑view imagery (SVI) and recent advances in vision‑language models (VLMs). The workflow consists of three main steps: (1) a target landmark is described with a textual prompt; (2) direction‑ and zoom‑controlled SVI images are fed to a VLM (e.g., CLIP, OWL‑ViT, Grounding DINO), which determines whether the landmark is detectable in each image, a successful detection being interpreted as a binary visibility signal for that viewpoint; (3) all observer‑landmark pairs are encoded in a heterogeneous visibility graph whose nodes represent viewpoints and landmarks, and whose edges carry weights derived from detection confidence, distance, and viewing angle. This graph enables the quantification of connection strength, identification of shared viewpoints, and analysis of how urban infrastructure mediates visual relationships.
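The three-step workflow above can be sketched in outline. Note that this is an illustrative sketch, not the authors' implementation: the `detect_landmark` stub stands in for the actual VLM call (the paper mentions models such as CLIP, OWL‑ViT, and Grounding DINO), its canned confidence scores are invented, and the threshold and edge-weight formula combining confidence, distance, and viewing angle are assumptions about plausible choices.

```python
import math

def detect_landmark(image_id: str, prompt: str) -> float:
    """Placeholder for the VLM detection call (step 2). In practice a
    model such as CLIP or Grounding DINO would score a direction- and
    zoom-controlled SVI crop against the textual prompt; here canned
    confidences keep the sketch runnable without a model."""
    fake_scores = {"svi_001": 0.92, "svi_002": 0.35, "svi_003": 0.78}
    return fake_scores.get(image_id, 0.0)

def edge_weight(confidence: float, distance_m: float, angle_deg: float) -> float:
    """Illustrative edge weight. The paper lists confidence, distance,
    and viewing angle as inputs; this particular combination is an
    assumption, not the published formula."""
    return confidence * math.exp(-distance_m / 1000.0) * math.cos(math.radians(angle_deg))

def build_visibility_graph(viewpoints, landmark, prompt, threshold=0.5):
    """Heterogeneous graph as an adjacency dict: a viewpoint node is
    linked to the landmark node only when detection succeeds, i.e. the
    binary visibility signal of step 2 becomes an edge in step 3."""
    graph = {landmark: []}
    for image_id, distance_m, angle_deg in viewpoints:
        conf = detect_landmark(image_id, prompt)   # step 2: VLM detection
        if conf >= threshold:                      # detection == visible
            w = edge_weight(conf, distance_m, angle_deg)
            graph[landmark].append((image_id, round(w, 3)))
    return graph

# Step 1: the landmark is described with a textual prompt.
viewpoints = [("svi_001", 250.0, 10.0), ("svi_002", 900.0, 45.0), ("svi_003", 600.0, 20.0)]
g = build_visibility_graph(viewpoints, "The Shard",
                           "a tall glass skyscraper with a pointed top")
print(g)
```

Only two of the three toy viewpoints clear the detection threshold, so the landmark node ends up with two weighted edges; aggregating such edges across many viewpoints and landmarks yields the connection-strength measures discussed next.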
Two case studies validate the approach. The first examines six iconic tall structures across global cities (e.g., The Shard, Burj Khalifa). Using a manually curated ground‑truth set, the VLM‑driven detection achieves an overall accuracy of 87%, and the resulting visibility maps reveal substantial contextual variation—different streetscapes, surrounding building densities, and lighting conditions lead to markedly different perceived visibility despite identical geometric LoS outcomes. The second case focuses on multiple landmarks along London’s River Thames. The constructed visibility graph uncovers a dense network of inter‑landmark connections, with bridges acting as critical visual conduits; roughly 30% of all edges involve a bridge, highlighting the role of physical infrastructure in shaping visual experience.
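The bridge statistic is, in graph terms, a simple edge count over location types. The miniature below illustrates the computation on an invented toy edge list; the viewpoint identifiers, landmark names, and `location_type` tags are placeholder data, not the paper's Thames dataset.

```python
# Toy edge list for a visibility graph: (viewpoint, landmark, location_type).
# Invented data standing in for the Thames case study's observer-landmark edges.
edges = [
    ("svi_a", "St Paul's Cathedral", "bridge"),
    ("svi_b", "The Shard",           "embankment"),
    ("svi_c", "The Shard",           "bridge"),
    ("svi_d", "Tower Bridge",        "street"),
    ("svi_e", "London Eye",          "bridge"),
    ("svi_f", "London Eye",          "street"),
    ("svi_g", "St Paul's Cathedral", "embankment"),
    ("svi_h", "Tower Bridge",        "embankment"),
    ("svi_i", "The Shard",           "street"),
    ("svi_j", "London Eye",          "park"),
]

# Share of connections mediated by bridges (the paper reports ~30%
# for the real Thames graph; the toy data is constructed to match).
bridge_share = sum(e[2] == "bridge" for e in edges) / len(edges)
print(f"bridge share: {bridge_share:.0%}")
```

The same one-liner applied to the full heterogeneous graph is what surfaces bridges as disproportionate visual conduits.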
The authors discuss strengths—scalability to any city with SVI, direct incorporation of real‑world visual context, and the ability to study multi‑object visual relationships—alongside limitations such as VLM sensitivity to image quality, weather, and distance, and the fact that SVI samples only road‑aligned viewpoints, leaving pedestrian‑only zones under‑represented. Nonetheless, the study demonstrates that image‑based visibility analysis can complement, and in data‑poor settings even replace, traditional LoS methods, opening new avenues for urban planning, heritage conservation, and computational social science research.