Neural Turtle Graphics for Modeling City Road Layouts

Reading time: 26 minute
...

📝 Original Paper Info

- Title: Neural Turtle Graphics for Modeling City Road Layouts
- ArXiv ID: 1910.02055
- Date: 2019-10-07
- Authors: Hang Chu, Daiqing Li, David Acuna, Amlan Kar, Maria Shugrina, Xinkai Wei, Ming-Yu Liu, Antonio Torralba, Sanja Fidler

📝 Abstract

We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show that it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch parts of the city road layout to be synthesized. In addition to synthesis, the proposed NTG finds uses in an analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.

💡 Summary & Analysis

This paper introduces Neural Turtle Graphics (NTG), a novel generative model for spatial graphs used to model city road layouts. The key idea is to represent the road layout as a graph where nodes are control points and edges are road segments. NTG, parameterized by a neural network, iteratively generates new nodes and edges connected to existing ones based on the current state of the graph. Starting with an initial root node and its connections, it extends this structure by adding new nodes and edges in each step.

The problem addressed here is the generation of plausible city road layouts for applications such as urban planning simulations, video game environments, and autonomous vehicle testing scenarios. Traditional methods relied heavily on procedural modeling techniques that required manual tuning of attributes to achieve desired results, which was inflexible and labor-intensive.

NTG solves this by learning from existing map data using a graph-based approach. The encoder-decoder architecture allows it to capture local statistics and generate new road segments iteratively. This method can be conditioned on specific attributes or partial sketches provided by users, offering flexibility in the generation process.

The results show that NTG outperforms other methods across various performance metrics. It not only generates plausible city layouts but also supports interactive synthesis where users can control styles and sketch parts of the layout. Additionally, it performs well in aerial road parsing tasks, demonstrating its versatility as both a generative model and an analytical tool.

The significance lies in NTG’s ability to provide a flexible solution for generating realistic city road structures without extensive manual intervention. This has potential applications in urban planning simulations, game development, and autonomous vehicle testing environments. Moreover, it serves as an effective prior for aerial map parsing tasks, enhancing the accuracy of road detection from satellite imagery.

📄 Full Paper Content (ArXiv Source)

# Neural Turtle Graphics

We formulate the city road layout generation problem as a planar graph generation problem. We first introduce the notation in Sec. 8.1 and then describe our NTG model in Sec. 8.2. Aerial parsing, implementation details, training and inference are given in Sec. 8.3-Sec. 8.5, respectively.

image image
(a) (b)
Illustration of the Neural Turtle Graphics (NTG) model. (a) depicts acyclic incoming paths {$\mb{s}^{in}$} of an active node $\mb{v}_i$, each of which is encoded using an RNN encoder. NTG decoder then predicts a set of outgoing nodes {$\mb{v}^{out}$}. (b) shows the NTG’s neural network architecture. First, the encoder GRU consumes the motion trajectory $\Delta\mb{x}^{in}$ of each incoming path. We produce an order-invariant representation by summing up the last-state hidden vectors across all paths. Next, the decoder produces “commands” to advance the turtle and produces new nodes. An optional attribute vector can be further added to the decoder depending on the task.

Notation

Road Layout. We represent a city road layout using an undirected graph $`{G}=\{{V},{E}\}`$, with nodes $`V`$ and edges $`{E}`$. A node $`\mb{v}_i\in {V}`$ encodes its spatial location $`[x_i,y_i]^T`$, while an edge $`e_{\mb{v}_{i},\mb{v}_{j}}\in\{0,1\}`$ denotes whether a road segment connecting nodes $`\mb{v}_i`$ and $`\mb{v}_j`$ exists. City road graphs are planar since all intersections are present in $`V`$. We assume it is connected, there is a path in $`G`$ between any two nodes in $`V`$. The coordinates $`x_i`$ and $`y_i`$ are measured in meters, relative to the city’s world location.

Incoming Paths. For a node $`\mb{v}_i`$, we define an Acyclic Incoming Path as an ordered sequence of unique, connected nodes which terminates at $`\mb{v}_i`$: $`\mb{s}^{in}`$= {$`\mb{v}_{i,1},\mb{v}_{i,2},...,\mb{v}_{i,L}, \mb{v}_{i}`$} where $`e_{\mb{v}_{i,t},\mb{v}_{i,t+1}}=1`$ for each $`1\leq t < L`$, and $`e_{\mb{v}_{i,L},\mb{v}_{i}}=1`$, with $`L`$ representing the length of the path. Since multiple different acyclic paths can terminate at $`\mb{v}_i`$, the set of these paths is denoted as $`S_i^{in}:= \{\mb{s}_k^{in}\}`$.

Outgoing Nodes. We define $`{V}^{out}_{i} := \{\mb{v}_j: \mb{v}_j \in V \wedge e_{\mb{v}_{i},\mb{v}_{j}}=1\}`$, as the set of nodes with an edge to $`\mb{v}_i`$.

Graph Generation

We learn to generate graphs in an iterative manner. The graph is initialized with a root node and a few nodes connected to it, which are used to initialize a queue $`Q`$ of unvisited nodes. In every iteration, an unvisited node from $`Q`$ is picked to be expanded (called active node). Based on its current local topology, an encoder model generates a latent representation, which is used to generate a set of neighboring nodes using a decoder model. These generated nodes are pushed to $`Q`$. The node to be expanded next is picked by popping from $`Q`$, until it is empty.

By construction, an active node $`\mb{v}_i`$ has at least one neighbor node in the graph. NTG extracts a representation of its local topology by encoding incoming paths $`S_{i}^{in}`$ (of maximum length L) and uses the representation to generate a set of outgoing nodes $`V_{i}^{out}`$ (if any) with edges to $`\mb{v}_i`$. These paths are encoded in an order-invariant manner and the resulting latent representation is used to generate a set of outgoing nodes $`V_i^{out}`$. NTG performs the encoding and decoding to generate the graph as described above with an encoder-decoder neural network. Fig. 1(a) visualizes the process, while Fig. 1(b) illustrates the encoder-decoder neural architecture, which is described in detail in the following sections.

NTG Encoder. We encode a single incoming path $`\mb{s}^{in}`$ into node $`\mb{v}_i`$ with a zero-initialized, bidirectional GRU . The input to the GRU while processing the $`t^{th}`$ node in the path is the motion vector between the nodes $`\mb{v}_{i,t}^{in}\in\mb{s}^{in}`$ and $`\mb{v}_{i,t+1}^{in}\in\mb{s}^{in}`$ in the path; , $`[\Delta x_{i,t}^{in},\Delta y_{i,t}^{in}]^T=[x_{i,t+1}^{in},y_{i,t+1}^{in}]^T-[x_{i,t}^{in},y_{i,t}^{in}]^T`$. This offset could be encoded as a discrete or continuous value, as discussed in Sec. 8.4. The final latent representation $`\mb{h}_\mathrm{enc}`$ for all paths is computed by summing the last hidden states of each path. Optionally, we append an attribute vector $`\mb{h}_\mathrm{attr}`$ to the latent representation. For example, the attribute could be an embedding of a one-hot vector, encoding the city identity. This enables NTG to learn an embedding of city, enabling conditional generation. The final encoding is their concatenation $`[\mb{h}_\mathrm{enc}, \mb{h}_\mathrm{attr}]`$.

Sampling Incoming Paths. During training, for an active node $`\mb{v}_i`$ we use a subset of $`{S}_{i}^{in}`$ by sampling $`K`$ random walks (without repetition) starting from $`\mb{v}_i`$, such that each random walk visits at most $`L`$ different nodes. We find this random sampling to lead to a more robust model as it learns to generate from incomplete and diverse input representations. Optionally, we can also feed disconnected adjacent nodes as additional input. We found this to perform similarly in the task of road modeling due to high connectivity in data.

Decoder. We decode the outgoing nodes $`{V}^{out}_i`$ with a decoder GRU. The recurrent structure of the decoder enables capturing local dependencies between roads such as orthogonality at road intersections. We independently predict $`\Delta x_{t'}^{out}`$ and $`\Delta y_{t'}^{out}`$ for an outgoing node $`\mb{v}_{t'}^{out}`$, indicating a new node’s relative location w.r.t. $`\mb{v}_i`$. Additionally, we predict a binary variable which indicates whether another node should be generated. At generation time we check overlap between the new node and existing graph with a 5$`m`$ threshold to produce loops. Optionally, we predict the edge type between $`(\mb{v}_i, \mb{v}_{t'}^{out})`$, minor or major road, using a categorical variable. The hidden state $`\mb{h}_{t'}`$ of the decoder is updated as:

MATH
\begin{equation}
    \mb{h}_{t'+1}=\mathrm{GRU}(\mb{h}_{t'}\mid \mb{h}_\mathrm{enc}, \mb{h}_\mathrm{attr}, \Delta\mb{x}_{t'}^{out})
\end{equation}
Click to expand and view more

Aerial Road Parsing

Parsing with Map Prior. The dominant approaches to parse roads from aerial imagery have trained CNNs to produce a probability (likelihood) map, followed by thresholding and thinning. This approach typically results in maps with holes or false positive road segments, and heuristics are applied to postprocess the topology. We view a NTG model trained to generate city graphs (not trained for road parsing) as a prior, and use it to postprocess the likelihood coming from the CNN. Starting from the most confident intersection node as the root node, we push all its neighbors into the queue of unvisited nodes, and then use NTG to expand the graph. At each decoding step, we multiply the likelihood from CNN with the prior produced by NTG, and sample output nodes from this joint distribution. The end of sequence is simply determined by checking whether the maximum probability of a new node falls below a threshold (0.05 in our paper).

Image-based NTG. We also explore explicitly training NTG for aerial parsing. We condition on image features by including predictions from a CNN trained to parse aerial images in the attribute vector $`\mb{h}_{attr}`$. In practice, we initialize the graph obtained by thresholding and thinning the ouputs of the CNN, and use the trained image-based NTG on top.

Implementation Details

We exploit the same NTG model in both tasks of city generation and road detection. Depending on the task, we empirically find that the best parameterization strategy varies. For city generation, we use discrete $`\Delta x`$, $`\Delta y`$ with resolution of 1m for both encoder and decoder, where $`x`$ points to east and $`y`$ points to north. Here, $`\Delta x`$ and $`\Delta y`$ are limited to $`[-100:100]`$, indicating that the largest offset in either direction is 100m. The discrete $`\Delta x`$ and $`\Delta y`$ values are given associated embeddings (resembling words in language models), which are concatenated to generate the input to the encoder at every step. For road detection, we use continuous polar coordinates in the encoder, where the axis is rotated to align with the edge from the previous to the current node. This forms rotation invariant motion trajectories that help detecting roads with arbitrary orientation. The decoder always uses a discrete representation. We encode and decode the coordinates $`x`$ and $`y`$ independently. We find that this yields similar results compared to joint prediction, while significantly saving training memory and model capacity. 500 hidden units are used in encoder and decoder GRUs.

Learning & Inference

Inference. At each inference step, we pop a node from the queue $`Q`$, encode its existing incoming paths, and generate a set of new nodes. For each new node, we check if it is in the close proximity of an existing node in the graph. If the distance to an existing node is below a threshold $`\epsilon`$ (5m in our paper), we do not add the new node to the queue. Instead, an edge is included to connect the current node to the existing node. This enables the generation of cycles in the graph. We also find the maximum node degree, maximum node density, and minimum angle between two edges in the training set, and ensure our generation does not exceed these limits. We refer to supplemental material for their effects.

Learning. At training time, $`K`$ incoming paths for each $`\mb{v}_i`$ are sampled, and we learn to predict all of its neighboring nodes. We enforce an order in decoding the nodes, where we sort nodes counter-clockwise to form a sequence. The ordering saves having to solve an assignment problem to compute the loss function. Our model is trained using ground truth map data with teacher-forcing , using a cross entropy loss for each of the output nodes. The networks are optimized using Adam  with a learning rate of 1e-3, weight decay of 1e-4, and gradient clipping of 1.0.

Experiments

  Country City Node Edge Area Length
RoadNet 13 17 233.6k 262.1k 170.0km$`^2`$ 7410.7km
SpaceNet 4 4 115.8k 106.9k 122.3km$`^2`$ 2058.4km

Dataset statistics of RoadNet and SpaceNet .

l||L1.0cmL1.0cmL1.0cmL1.0cm|L0.8cm||L1.0cmL1.0cmL1.0cmL1.0cm|L0.8cm||l & & &
& mp1`` & ``mp2`` & ``pa`` & ``fc`` & & ``densi.`` & ``conne.`` & ``reach`` & ``conve.`` & &
& $`10^{-1}`$&$`10^{0}`$&$`10^{-1}`$&$`10^{1}`$& &$`10^{1}`$&$`10^{-2}`$&$`10^{5}`$&$`10^{-3}`$ & &
GraphRNN-2D `` & 7.12 & 6.35 & 8.45 & 16.15 & 25.0 & 51.58 & 4.61 & 45.11 & 6.72 & 43.7 & 44.26
PGGAN `` & 1.98 & 2.15 & 5.34 & 10.51 & 63.2 & 45.77 & 19.48 & 4.33 & 2.94 & 58.9 & 5.95
CityEngine-5k `` & 2.74 & 2.71 & 8.34 & 14.78 & 47.1 & 13.59 & 21.66 & 7.61 & 16.66 & 51.7 & 45.86
CityEngine-10k `` & 2.55 & 2.56 & 8.23 & 14.17 & 48.9 & 12.43 & 21.79 & 7.05 & 16.82 & 52.1 & 46.00
NTG-vanilla & 2.63 & 2.33 & 4.05 & 9.17 & 66.0 & 8.69 & 1.87 & 8.99 & 3.06 & 86.5 & 41.27
NTG-enhance & 1.52&1.34&2.83&6.76&77.3&3.76& 1.97 &4.13&1.86&92.4 & 42.09

  GraphRNN-2D  PGGAN  CityEngine  NTG GT
image image image image image
image image image image image
image image image image image
Qualitative examples of city road layout generation. GraphRNN-2D generates unnatural structures and fails to capture city style. PGGAN is unable to create new cities by either severely overfitting, or producing artifacts. CityEngine produces less style richness due to its fixed rule-based synthesis algorithm. NTG is able to both capture the city style and creating new cities.

We demonstrate NTG on three tasks: city road layout generation, satellite road parsing, and environment simulation.

City Road Layout Generation

RoadNet Dataset. We collected a real-world road dataset from OpenStreetMap (OSM) to facilitate this task. In particular, we selected 17 unique cities across continents and gathered all road markers. OSM, being crowd-sourced, often has incomplete markers in underpopulated areas. To alleviate this, we manually select the most densely annotated 10km$`^2`$ region within each city. These constitute our final RoadNet dataset. Table 1 shows the statistics.

Metrics

The goals of road layout generation are to create road networks that are: a) Perceptually plausible and preserve a meaningful city style, and b) Diverse. We use three broad categories of automatic metrics to evaluate city generation:

Perceptual: For every node, we render the graph in a 300m neighborhood centered around it on a canvas. With their perceptual features extracted from an InceptionV3 network , we compute the Fréchet Inception Distance (FID)  between the synthesized roads and ground truth maps for each city. To ensure a meaningful FID, we adapt InceptionV3, which has originally been trained on natural images, to road drawings, by finetuning it to predict city ids on our dataset. This yields a 90.27% accuracy, indicating effective capture of style across cities in the network.

Urban Planning : We measure four common urban planning features reflecting city style: 1. Node density within a neighborhood of 100m, 200m, and 300m. 2. Connectivity as reflected by the degrees of nodes. 3. Reach as the total length of accessible roads within a distance of 100m, 200m, and 300m. 4.) Transportation convenience as the ratio of the euclidean distance over the Dijkstra shortest path for node pairs that are more than 500m away. We also compute the Fréchet distance between normal distributions computed from the concatenation of the Urban Planning features of real and generated maps.

Diversity metric: We measure the ability to create novel cities by computing the overlap between a real and generated city as the percentage of road in one graph falling outside the 10m vicinity of the road in the other graph, and vice versa. We compare this Chamfer-like distance against all ground truth maps and report the average lowest value.

Results

We compare the following methods:

$`\bullet`$ **GraphRNN-2D **: We enhance the GraphRNN model by introducing extra branches to encode/decode node coordinates and city id. We add a CNN that takes into account local rendering of existing graph as in  and add checks to avoid invalid edge crossing during inference.

$`\bullet`$ **PGGAN **: We train to generate images of road layouts at a resolution of 256$`\times`$256. We use our trained InceptionV3 network to classify samples into cities to compute city-wise metrics. For computing the graph-related metrics we convert images to graphs by thresholding and thinning.

$`\bullet`$ **CityEngine **: CityEngine is a state-of-the-art software for synthesizing cities based on an optimized, complex rule-based L-system . By default, it only offers limited templates and is incapable of generating new cities. To enhance CityEngine, we use its provided control interface and exhaustively search over its attribute space by enumerating combinations of important control parameters such as angles, bending specifications, and crossing ratios. We then predict city probabilities using the InceptionV3 network, and select the highest ranking 10km$`^{2}`$ as the result for each city.

City-wise FID (fc) of different methods.

$`\bullet`$ NTG: NTG begins with a root node with its edges. We evaluate NTG with a random root (NTG-vanilla), as well as with a pre-stored high connectivity root (NTG-enhance).

Tab. [tab:cityresult] and Fig. 2 show quantitative and qualitative results, respectively. Quantitatively, NTG outperforms baselines across all metrics. GraphRNN-2D fails to capture various city styles and frequently produces unnatural structures. This is due to its sequential generative nature that depends on Breadth First Search. The RNN that encodes the full history fails to capture coherent structures since consecutive nodes may not be spatially close due to BFS. PGGAN  produces sharp images with occasional artifacts that are difficult to convert into meaningful graphs. Samples from PGGAN are severely overfit as reflected by the Diversity metric, indicating its inability to create new cities. Moreover, PGGAN also suffers from mode-collapse and memorizes a portion of data. This imbalance of style distribution leads to worse perceptual FIDs. With our enhancement (exhaustive search), CityEngine  is able to capture certain cities’ style elements: especially the node density. However, it has less topological variance and style richness due to its fixed set of rules. Expanding its search from 5000km$`^{2}`$ (CityEngine-5k) to 10000km$`^{2}`$ (CityEngine-10k) of generated maps does not lead to significant improvements, while requiring double the amount of computation. NTG is able to create new cities, while better capturing style in most cities as shown in Fig. 3.

Fig. 4 digs into NTG’s generated maps, showing that NTG learns to remember local road patterns. As the graph grows, different local topologies are organically intertwined to produce new cities. Fig. 5 shows the effect of two important hyper-parameters: maximum number of paths $`K`$ and maximum incoming path length $`L`$. Results show that reconstruction quality is determined by $`L`$, while $`K`$ and $`L`$ both affect inference time. Training with longer and more paths does not necessarily improve perceptual quality, since generation starts from a single root node without long paths.

We further demonstrate our approach by having NTG predict two types of roads, major and minor roads. Results are shown in Fig. 7, showing that NTG easily generalizes to a more complex modeling problem.

GT NTG
image image
NTG creates new city road layouts in a combinatorial manner. Local patterns as shown by orange boxes are remembered, and then intertwined to create novel structures.
image image image
Effect of sampled paths K and maximum path length L on reconstruction quality in meters (red), inference time in seconds per km2 (green), and FID-fc (blue).

Interactive Synthesis. We showcase an application for interactive road layout generation where a user chooses from a palette of cities and provides local topology priors by sketching. We match the user’s input with pre-stored node templates to form the root node. To allow generating multiple components on the same canvas, we simply modify the NTG inference procedure to iterate through multiple queues in parallel. Fig. 6 shows examples of the generation process. Beyond Road Layouts. In Appendix, we show results on using NTG’s multipath paradigm for learning effective representation of complex shapes, such as multi-stroke hand drawings. This shows potential as a general purpose spatial graph generative model beyond the city road modeling.

user input step 0 step 20 step 40 step 60 step 80 final
image image image image image image image
image image image image image image image
Examples of interactive city road layout generation via user sketching and local style selection.
Brussels Toronto
image image
NTG can be easily extended to generate road type.

Introduction

City road layout modeling is an important problem with applications in various fields. In urban planning, extensive simulation of city layouts are required for ensuring that the final construction leads to effective traffic flow and connectivity. Further demand comes from the gaming industry where on-the-fly generation of new environments enhances user interest and engagement. Road layout generation also plays an important role for self-driving cars, where diverse virtual city blocks are created for testing autonomous agents.

Although the data-driven end-to-end learning paradigm has revolutionized various computer vision fields, the leading approaches  (e.g., the foundation piece in the commercially available CityEngine software) for city layout generation are still largely based on procedural modeling with hand-designed features. While these methods guarantee valid road topologies with user specified attribute inputs, the attributes are all hand-engineered and inflexible to use. For example, if one wishes to generate a synthetic city that resembles e.g. London, tedious manual tuning of the attributes is required in order to get plausible results. Moreover, these methods cannot trivially be used in aerial road parsing.

In this paper, we propose a novel generative model for city road layouts that learns from available map data. Our model, referred to as Neural Turtle Graphics (NTG) is inspired by the classical turtle graphics methodology 1 that progressively grows road graphs based on local statistics. We model the city road layout using a graph. A node in the graph represents a spatial control point of the road layout, while the edge represents a road segment. The proposed NTG is realized as an encoder-decoder architecture where the encoder is an RNN that encodes local incoming paths into a node and the decoder is another RNN that generates outgoing nodes and edges connecting an existing node to the newly generated nodes. Generation is done iteratively, by pushing newly predicted nodes onto a queue, and finished once all nodes are visited. Our NTG can generate road layouts by additionally conditioning on a set of attributes, thus giving control to the user in generating the content. It can also take a user specified partial sketch of the roads as input for generating a complete city road layout. Experiments with a comparison to strong baselines show that our method achieves better road layout generation performance in a diverse set of performance metrics. We further show that the proposed NTG can be used as an effective prior for aerial map parsing, particularly in cases when the imagery varies in appearance from that used in training. Fine-tuning the model jointly with CNN image feature extraction further improves results, outperforming all existing work on the Spacenet benchmark.

Related Work

Classical Work. A large body of literature exists on procedural modeling of streets. The seminal early work of  proposed an L-system which iteratively generates the map while adjusting parameters to conform to user guidance. This method became the foundation of the commercially available state-of-the-art CityEngine  software. Several approaches followed this line of work, exploiting user-created tensor fields , domain splitting , constraints stemming from the terrain , and blending of retrieved examplars . Methods that evolve a road network using constraints driven by crowd behaviour simulation have also been extensively studied .

Generative Models of Graphs. Graph generation with neural networks has only recently gained attention .  uses an RNN to generate a graph as a sequence of nodes sorted by breadth-first order, and predict edges to previous nodes as the new node is added.  uses a variational autoencoder to predict the adjacency and attribute matrices of small graphs. trains recurrent neural network that passes messages between nodes of a graph, and generates new nodes and edges using the propagated node representation. Most of these approaches only predict graph topology, while in our work we address generation of spatial graphs. Producing valid geometry and topology makes our problem particularly challenging. Our encoder shares similarities with node2vec  which learns node embeddings by encoding local connectivities using random walks. Our work focuses on spatial graph generation and particularly on road layouts, thus different in scope and application.

Graph-based Aerial Parsing. Several work formulated road parsing as a graph prediction problem. The typical approach relies on CNN road segmentation followed by thinning . To deal with errors in parsing, proposes to reason about plausible topologies on an augmented graph as a shortest path problem. In , the authors treat local city patches as a simply connected maze which allows them to define the road as a closed polygon. Road detection then follows Polygon-RNN  which uses an RNN to predict vertices of a polygon.  performs lane detection by predicting polylines in a top-down LIDAR view using a hierarchical RNN. Here, one RNN decides on adding new lanes, while the second RNN predicts the vertices along the lane. In our work, we predict the graph directly. Since our approach is local, it is able to grow large graphs which is typically harder to handle with a single RNN. Related to our work,  annotate building footprints with a graph generating neural network. However, these works are only able to handle single cycle polygons.

Most related to our work is RoadTracer , which iteratively grows a graph based on image evidence and local geometry of the already predicted graph. At each step, RoadTracer predicts a neighboring node to the current active node. Local graph topology is encoded using a CNN that takes as input a rendering of the existing graph to avoid falling back. Our method differs in the encoder which in our case operates directly on the graph, and the decoder which outputs several outgoing nodes using an RNN which may better capture more complex road intersection topologies. Furthermore, while  relied on carefully designed dynamic label creation during training to mimic their test time graph prediction, our training regime is simple and robust to test time inference.

We also note that with some effort many of these work could be turned into generative models, however, ours is the first that showcases generative and interactive modeling of roads. Importantly, we show that NTG trained only on map data serves as an efficient prior for aerial road parsing. This cannot easily be done with existing work  which all train a joint image and geometry representation.

Conclusion

In this paper, we proposed Neural Turtle Graphics for generating large spatial graphs. NTG takes the form of an encoder-decoder neural network which operates on graphs locally. We showcased NTG on generating plausible new versions of cities, interactive generation of city road layouts, as well as aerial road parsing. Furthermore, we combined the two tasks of aerial parsing and generation, and highlighted NTG to automatically simulate new cities for which it has not seen any part of the map during training time. In future work, we aim to tackle generation of other city elements such as buildings and vegetation.

Satellite Road Parsing

SpaceNet Dataset. While several datasets have been presented for road detection , we use SpaceNet  for its large scale, image quality, and open license. To facilitate consistent future benchmarking, we reformat the raw data into an easy-to-use version with consistent tile size in metric space. Tab. 1 shows its statistics. We split tiles of each city into train-validation-test with a 4-1-1 ratio.

  DRM  RoadExtractor  RoadTracer  FCN  DLA+STEAL  NTG GT
image image image image image image image
image image image image image image image
image image image image image image image
image image image image image image image
Qualitative examples of SpaceNet road parsing.
  satellite detection simulation 1 simulation 2 simulation 3 simulation 4
image image image image image image
image image image image image image
Sat2Sim: converting satellite image into a series of simulated environments. Buildings and vegetation added via .

Metrics

Average Path Length Similarity (APLS) has been shown to be the best metric to reflect routing properties . Between two graphs, APLS is defined as

MATH
\begin{equation*}
\vspace{-3mm}
\mathrm{APLS}=1-\frac{1}{N_{p}}\sum_{p_{\mb{v}_1\mb{v}_2}<\infty}min\left\{1,\frac{|p_{\mb{v}_1\mb{v}_2}-p_{\mb{v}_1^{\prime}\mb{v}_2^{\prime}}|}{p_{\mb{v}_1\mb{v}_2}}\right\}
\end{equation*}
Click to expand and view more

where $`\mb{v}`$ denotes a source graph node, $`\mb{v}^{\prime}`$ as its closest on-road point in the target graph if such a point exists within a buffer range (5m), $`N_p`$ number of paths. Here, $`p_{\mb{v}_1\mb{v}_2}`$ denotes the Dijkstra shortest path length between two nodes, and has infinite value if no path exists. We also exchange source and target graphs to establish metric symmetry. To ensure even node distribution, graphs are RDP-simplified  and uniformly subdivided with 30m maximum edge length. While we use APLS as our main metric, we also report conventional pixel-wise IOU and F1 score as references, even though they are less desirable as revealed in .

Results

 We compare three categories of methods:
$`\bullet`$ Prior art: We evaluate DeepRoadMapper , RoadExtractor , and RoadTracer . RoadTracer requires additional starting points as input. We use the most likely pixel predicted by their CNN, as well as 30 points randomly selected from ground truth (RoadTracer-30).

$`\bullet`$ Stronger CNNs: We explore more powerful CNN architectures. We train an FCN with a ResNet backbone , as well as a CNN using DLA  with STEAL . To obtain the graph we use standard thinning and geodesic sorting.

$`\bullet`$ NTG: We evaluate both the parsing NTG (NTG-P) that is only trained on RoadNet and acts as a topological prior and image-based NTG (NTG-I) that is trained on SpaceNet.

  IOU F1 APLS
DeepRoadMapper  45.02 62.08 51.49
RoadExtractor  52.91 69.20 57.38
RoadTracer  10.23 18.56 48.55
RoadTracer-30  48.29 65.13 42.94
FCN  51.09 67.63 56.56
DLA+STEAL  58.96 74.18 71.04
NTG-P (’s CNN) 50.58 67.18 55.87
NTG-P (’s CNN) 51.62 68.09 58.79
NTG-P (DLA+STEAL) 59.29 74.44 70.99
NTG-I (DLA+STEAL) 63.15 77.42 74.97

SpaceNet evaluation on unseen city by holding one city out in training. Without finetuning, the RoadNet pretrained NTG-P is able to improve over DLA+STEAL.

  IOU F1 APLS
RoadExtractor  20.51 34.03 43.06
DLA+STEAL  33.94 50.68 56.15
NTG-P (DLA+STEAL) 35.16 52.02 57.89

SpaceNet evaluation on unseen city by holding one city out in training. Without finetuning, the RoadNet pretrained NTG-P is able to improve over DLA+STEAL.

Table [tab:spaceresult1] and Figure 8 present SpaceNet results. It can be seen that our method outperforms baselines in all metrics. The DLA+STEAL CNN produces cleaner predictions that focus on road. NTG-P trained only with RoadNet is able to successfully parse graph structure. Using NTG-I that further takes CNN output as input achieves the best result. We also experiment the RoadNet trained NTG-P with CNNs from prior art . It can be seen that the city topology knowledge of NTG makes it a better graph extractor compared to hand-crafted postprocessings in , especially in terms of APLS. For NTG-P with DLA+STEAL we notice it has similar performance as standard graph extraction. This is because DLA+STEAL prediction has high confidence as it is trained and tested with same cities that have similar visual appearance. We therefore further experiment with one city held-out to simulate road parsing in unseen cities. Results are presented in Table 3. It can be seen that NTG-P is able to further improve the result, demonstrating the effectiveness of generative road layout knowledge learnt from RoadNet. We conduct 4-fold evaluation holding out each city per fold, and report the average result.

Environment Simulation

We further showcase a novel application that combines our two tasks in Figure 9. We propose to directly convert a satellite image into simulation-ready environments, which may be important for testing autonomous vehicles in the future. First, we detect roads in the satellite image with NTG, giving us an initial graph. Then, we exploit our generative model to propose plausible variations. This is done by pushing all single-connection nodes in the parsed graph into our generative queue, and running the NTG generative process to expand the graph. We directly make use of the NTG model trained for city generation and choose a random city id for each run. This has two main advantages. First, it is fully automatic and only requires a low-cost satellite image as input. Second, it provides a set of plausible variations of the environment (city) instead of a static one, which could eventually enable training more robust agents. For visualization, we additionally add buildings and tree via , showing plausible and diverse simulation-ready cities.


📊 논문 시각자료 (Figures)

Figure 1



Figure 2



Figure 3



Figure 4



Figure 5



Figure 6



Figure 7



Figure 8



Figure 9



Figure 10



Figure 11



Figure 12



Figure 13



Figure 14



Figure 15



Figure 16



Figure 17



Figure 18



Figure 19



Figure 20



Figure 21



Figure 22



Figure 23



Figure 24



Figure 25



Figure 26



Figure 27



Figure 28



Figure 29



Figure 30



Figure 31



Figure 32



Figure 33



Figure 34



Figure 35



Figure 36



Figure 37



Figure 38



Figure 39



Figure 40



Figure 41



Figure 42



Figure 43



Figure 44



Figure 45



Figure 46



Figure 47



Figure 48



Figure 49



Figure 50



Figure 51



Figure 52



Figure 53



Figure 54



Figure 55



Figure 56



Figure 57



Figure 58



Figure 59



Figure 60



Figure 61



Figure 62



Figure 63



Figure 64



Figure 65



Figure 66



Figure 67



Figure 68



Figure 69



Figure 70



Figure 71



Figure 72



Figure 73



Figure 74



Figure 75



Figure 76



Figure 77



Figure 78



Figure 79



Figure 80



Figure 81



Figure 82



Figure 83



Figure 84



Figure 85



Figure 86



Figure 87



A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.

  1. Turtle graphics is a technique for vector drawing, where a relative cursor (turtle) receives motion commands and leave traces on the canvas. ↩︎

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut