-
Notifications
You must be signed in to change notification settings - Fork 1
/
paper.qmd
446 lines (302 loc) · 22.2 KB
/
paper.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
---
title: "ANIME"
subtitle: "Approximate Network Matching and Integration Enrichment"
author:
- name: Josiah Parry
affiliation: Environmental Systems Research Institute, Redlands, CA, USA
orcid: 0000-0001-9910-865X
- name: Robin Lovelace
affiliation: Leeds Institute for Transport Studies, University of Leeds, UK
orcid: 0000-0001-5679-6536
format: html
filters:
- pseudocode
bibliography: refs.bib
---
```{r, include = FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Abstract
Reconciling topologically different linestrings is a fundamental challenge in spatial data science, particularly when attempting to integrate network datasets from disparate sources. Existing methods often struggle with the absence of join keys and the need for wholesale joining of attributes, hindering effective data integration and analysis. In this paper, we propose a novel algorithm for matching two sets of linestrings, implemented in an open-source Rust library with bindings to R and Python. Our algorithm addresses the challenge by identifying topologically similar linestrings and estimating the shared length between each pair of matched linestrings. By leveraging R* spatial indices and angle-based matching criteria, our approach effectively reconciles linestrings with varying topologies. We demonstrate the utility of our algorithm through applications in spatial data analysis, including joins, weighted aggregations, and network subset identification based on shared characteristics. The proposed algorithm offers a robust solution for reconciling linestrings in spatial datasets, with implications for various domains in spatial data science.
<!--
- A common issue in spatial data science is the reconciliation of two sets of linestrings.
- linestrings may represent the same phenomenon but be topologically different
- joining data between these road networks is problematic for many reasons
- often there may be no join key present
- if there is a join key, there remains an issue of wholesale joining of attributes
- attributes of a linestring are intended be associated with that linestring, not another
- we need to do a join and provide a weight for future calculations
- in this paper we introduce a new algorithm to match two sets of linestrings
- it is implemented in an open source rust library with bindings to R and Python
-->
## Problem Statement
**Overarching Idea**:
- use the approximate network matching to create a list of _strong_ candidate matches with a known shared overlap amount.
- The result of the approximate network match can be taken and utilized for 'vertical conflation' using many different approaches
- in this paper we introduce a method of spatial interpolation of linear features
> "Due to the complexity and limitations of existing methods, planners and analysts often have to employ a heavily manual conflation process, which is time‐consuming and often prohibitively expensive." Lei, Ting; Lei, Zhen (2019).
- datasets are derived from remotely sensed imagery (samal et al., 2004)
- this is happening on a much larger scale for example google building footprints (https://sites.research.google/open-buildings/)
- features may not always be complete or entirely accurate but can be used to supplement missing data or address changes in a dataset
- matching generalized to more detailed geometry is tough and an unsolved problem , see "LoD" mentions (Ick-Hoi Kim, Chen-Chieh Feng & Yi-Chen Wang), (Mustiere & Devogele, 2008), (Zhang 2009)
- In a road database, several objects may represent different parts of the same real-world road, e.g., each lane in a highway could be represented by a different object,(Zhang 2009)
- some features can be have no corresponding features or many
- missing join keys between disparate datasets or missing semantic information requires a geometric only approach (Ick-Hoi Kim, Chen-Chieh Feng & Yi-Chen Wang, )
> "a simple overlay of the sources would not automatically reveal correspondence." (Samal et al, 2004)
## Definitions
The purpose of this section is to provide clear and concise definitions of terminologies used in the description of this algorithm.
We will use the simple feature access standards definitions of geometric primitives for the sake of consistency. In our algorithm we make extensive use of **Line** and **LineString** geometric primitives. A Line is defined by two **Point**s that are (x, y) coordinate pairs. A LineString is composed of 2 or more Points and "each consecutive pair of Points defines a Line segment." (@sfa FIXME cite). Each LineString is referred to as a **feature**. An array of features is referred to as a **FeatureCollection** (@geojson_spec).
The objective of this algorithm is to identify matches between two LineString FeatureCollections and measure the length of the match. A **match** is a correspondence between features of separate FeatureCollections (@lei_lei_19).
The terminology to refer to the two FeatureCollections is inconsistent in the literature. For example, Zhang (@zhang_2009) refers to the two FeatureCollections as the reference and target whereas the Java Conflation Suite (JCS) (@jcs) refers to these as the reference and subject.
Instead, we use define the two FeatureCollections as the **source** and the **target** as utilized by Comber and Zeng's work on areal interpolation (@comber_zeng_2019). The source and target terminology is also much more commonly used in relational database management systems (RDBMS) and in the wider data science ecosystem. The use of these terms is an intentional recognition in the shift towards a more general field of spatial data science. The objective of the matching algorithm is to match features _from_ the source _to_ the target. Often the target is a more detailed and known collection of features.
In the process matching features, **candidate**s are found. A candidate is a source feature that _may_ correspond to a target feature. If a candidate passes tests, it is then deemed a match. After matches have been found, attributes from the source feature are often transferred to the target feature. This process of attribute transfer is referred to as **integration** (FIXME CITE).
## Existing algorithms
- there is a vast literature dedicated to matching linestring features between datasets. These can be thought of a category of binary classification algorithms that identify if a feature in the source FeatureCollection is a match to the target.
- the classification is often based on geometric or semantic criteria, or both. The true challenge lies in the absence of semantic information. We therefore focus on matching strictly based on geometric criteria.
- most commonly these criteria are location (proximity), orientation (angle), length, and shape similarity (often measured by some type of distance e.g. freshet, hausdorff or avg. euclidean from linestring vertices)
Some choice algorithms. By far not exhaustive. Demonstrate some common unique strategies that are employed
Goodchild and Hunter 1997 introduce a buffer based algorithm which calculates the proportion of length covered by a target and source feature. They state that
> we suggest that a similar approach based on the percentage of the reference source length rather than the tested source might be more related to generalization...
- Our algorithm, due to its reporting of shared overlap permits us to adapt the Goodchild algorithm and use % source length overlap as a cut off.
- The issue here is that there may be complete overlap from the source but the source may continue well past. we must address this. This is why we will lean on areal weighted interpolation
- Java Conflation Suite (JCS, 2003)
- delimited strokes algorithm 2009 zhang
- Kim et al., 2017 creates an algorithm to conflate historic road network with new ones
- uses a number of spatial similarity measures with c4.5 decision tree to classify
- "linear directional mean" of each line segment in a LineString
- challenging for longer linestrings with many segments
- shorter line median Hausdorff distance (SMHD)
- absolute value of cosine similarity (aCS)
- note we do something very similar
- (Chehreghan & Abbaspour, 2018) use a genetic algorithm. These are computational expensive and do not sacle well
- Esri conflation toolset: buffer analysis and similarity
- overline approach (morgan and lovelace)
### Spatial Criteria
our algorithm can be thought of as an extension of buffer and orientation analysis
- location / proximity measured as a distance buffer (of sorts)
- orientation (angle)
- where we differ is in the use of length
- most existing algorithms perform matching between the source and target based on a measure of length of each LineString. Common are average euclidean distance, hausdorff, and frechet distance.
- each of these evaluates the linestrings in their entirety
- the algorithms then use the spatial criteria to perform a binary classification of either a match or not
- we do
## Algorithm overview
- "a good matching algorithm should be able to create high quality results at a high speed." (Zhang, 2009)
- many algorithms are focused on 1:1 correspondence. ours supports `m:n` correspondence. Each feature in the target FeatureCollection can be matched as few as 0 times or to every feature in the source FeatureCollection.
- Our approach considers all relationships "including one-to-null, null-to-one, one-to-one, one-to-many, many-to-one, and many-to many." (Chehreghan & Abbaspour, 2017, 10.1080/15481603.2017.1338390)
- ^ related to above, we need to document the cardinality of our approach one-to-many there is no 1:1 matching of linestrings (Lei, Ting; Lei, Zhen (2019).)
- > Beyond the one‐to‐one case, however, matching becomes more complicated and less well‐defined.
- consider our algorithm as a way of identifying potential matches that can be pruned if desired
- we do not provide a binary classification but a way to identify correspondence between a target and source and measure the amount of correspondence.
The proposed algorithm aims to match elements of two sets of LineStrings that are topologically similar and estimate the amount of shared length between each pair of matched line strings.
Each LineString is composed of one or more Lines which is comprised of a single start or end point. The approximate network matching algorithm constructs two R* spatial indices over the component lines in $A$ and $B$. Intersection candidates between the two trees are used to limit the search space. For each candidate pair, the angle of the slopes are compared to determine if they are approximately parallel (parallelish). If the slopes are approximately parallel and the lines are within a minimum separable distance of each other, they are considered to match. The overlapping region between the matched lines is used to compute the shared length.
The result of the matching algorithm is a B-tree which can be used to generate a row-compressed sparse matrix.
### Identify match candidates
To identify matches between $A$ and $B$ we do not look at the LineStrings in their totality, but rather, by their individual components. $A$ and $B$ are comprised of one or more LineStrings index by $i$ and $j$ respectively. Each linestring is composed of one or more lines indexed as $k$. Matches are found between elements of $A_{ik}$ and $B_{jk}$ using two R-trees.
We create an empty R-tree, $Tree_A$. For each line $A_{ik}$ we compute the slope of the line and insert the geometry, slope, and index into the tree.
Next we create another empty R-tree, $Tree_B$, in which we will store each line in $B_{jk}$. However, instead of using the axis-aligned bounding box of $B_{jk}$, we create a newer, larger one, based on a distance tolerance, $DT$. The distance tolerance is used to expand the search for matches. We compute the AABB of $B_{jk}$, then expand the AABB by $DT$ in both the x and y directions. After doing so, we insert the geometry, slope, and index into $Tree_B$
```{r}
#| include: false
devtools::load_all("r")
```
```{r message = FALSE, echo = FALSE}
#| layout-ncol: 2
suppressMessages(library(sf))
library(rsgeo)
library(ggplot2)
library(patchwork)
conflicted::conflict_prefer("ggplot2", "rsgeo")
# box to crop geometry to
crop_box <- st_bbox(c("xmin" = 427200, xmax = 427500, ymin = 433550, ymax = 433700))
rnet_x <- "https://raw.githubusercontent.com/nptscot/networkmerge/main/data/rnet_armley.geojson" |>
read_sf() |>
st_geometry() |>
st_transform(27700) |>
st_crop(crop_box)
rnet_y <- "https://raw.githubusercontent.com/nptscot/networkmerge/main/data/rnet_armley_line.geojson" |>
read_sf() |>
st_crop(crop_box)
x <- as_rsgeo(sf::st_transform(rnet_x, 27700))
y <- as_rsgeo(st_transform(rnet_y, 27700))
# axis-aligned-bounding-box for x
xbb <- bounding_rect(explode_lines(x))
# creating bounding rects for y
# need to expand them
ybb <- bounding_rect(explode_lines(y))
# define function to expand the AABB
expand_aabb <- function(x, DT) {
crds <- coords(x)
# xmin, max, max, min, min
# ymin min max max min
crds[,1] <- crds[,1] + (c(-1, 1, 1, -1, -1) * DT)
crds[,2] <- crds[,2] + (c(-1, -1, 1, 1, -1) * DT)
rsgeo::geom_polygon(crds$x, crds$y, crds$polygon_id)
}
DT <- 2.5
xbb_sf <- st_as_sfc(xbb) |> st_set_crs(27700)
ybb_sf <- st_as_sfc(expand_aabb(ybb, DT)) |> st_set_crs(27700)
p1 <- ggplot() +
geom_sf(
data = xbb_sf,
fill = "#76b5c5", alpha = 0.25, lwd = 0.1
) +
geom_sf(data = rnet_x, lwd = 0.2) +
labs(title = "Axis-aligned-bounding-boxes of A") +
theme_void()
p2 <- ggplot() +
geom_sf(
data = ybb_sf,
fill = "#e28743", alpha = 0.25,lwd = 0.1
) +
geom_sf(data = rnet_y, alpha = 0.5) +
labs(
title = "Axis-aligned-bounding-boxes of B",
subtitle = paste0(DT, "meter distance tolerance")
) +
theme_void()
p3 <- ggplot() +
geom_sf(
data = xbb_sf,
fill = "#76b5c5", alpha = 0.25, lwd = 0.1
) +
geom_sf(
data = ybb_sf,
fill = "#e28743", alpha = 0.25, lwd = 0.1
) +
geom_sf(data = rnet_x, alpha = 0.5, lwd = 0.15) +
geom_sf(data = rnet_y, alpha = 0.5) +
labs(title = "AABB and Networks Overlay") +
theme_void()
p1
p2
```
If AABBs between $Tree_A$ and $Tree_B$ are intersecting, it means that that the lines $A_{ik}$ and $B_{jk}$ might be within $DT$ of each other and should be checked to see if they are considered matches.
```{r}
p3
```
### Matching Criteria
Candidate matches as determined by intersecting AABBs must then be further evaluated. Lines $A_{ik}$ and $B_{jk}$ must be approximately parallel (parallelish) to be considered a match. To this end, an angle tolerance $AT$ is defined. We take the inverse tangent of the slopes of lines $A_{ik}$ and $B_{jk}$ to find their angle. If the difference between these two angles are less than or equal to $AT$, we deem them tolerant or, parallelish.
```{r}
#| fig-cap: "Matched lines with 15° angle tolerance and 2.5 meter distance tolerance."
mtx <- rnetmatch::rnet_match(rnet_x, rnet_y, 2.5, 15)
i <- 4
ydx <- which(mtx$i == i)
xx <- rnet_x[i]
yy <- rnet_y$geometry[mtx$j[ydx]]
# plot the lines that we will measure
plot(xx, lty = 3)
plot(yy, add = TRUE)
```
Being confident that the $A_{ik}$ and $B_{jk}$ are parallelish, we next need to determine if they are within the distance tolerance determined by $DT$. This is done by measuring the minimum separable distance between $A_{ik}$ and $B_{jk}$. If both conditions are satisfied, then the lines are matched. Following, the shared segment length must be calculated.
### Caclulating segment overlap
Once two lines $A_{ik}$ and $B_{jk}$ have been determined to be matches, we need to evaluate how much overlap exists between the two lines. This overlap is defined by the segment length of $A_{ik}$ contained in the overlap in the x or y dimension between $A_{ik}$ or $B_{jk}$.
Based on the angle of the line $A_{ik}$, $\theta_{A_{ik}}$, we either calculate the overlap in the line segments in either the x or y dimension.
![](assets/line-seg-overlap-top.png)
If $\theta_{A_{ik}} \le 45^{\circ}$, we calculate the overlap between the range of x values of $A_{ik}$ and $B_{jk}$, $(x_{min}, x_{max})$. Using the slope of $A_{ik}$, solve for the values of y in the equation of the line. Using the calculated values of y, calculate the length of the line segment. If $\theta_{A_{ik}} \gt 45^{\circ}$, we instead calculate the overlap in the range of y values and subsequently solve for x, then calculate the length of the line segment. Note that if there is no overlap in the x or y dimension and even if both matching criteria were met, there will be no shared length, and the resultant value with be 0.
## Algorithm Implementation
```pseudocode
#| label: alg-approx-net-matching
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Approximate Network Matching}
\begin{algorithmic}
\State // Initialize R-trees for LineString components in sets A and B
\Procedure{ApproxNetworkMatch}{$A, B, DT, AT$}
\State $Tree_A \gets$ InitializeEmptyRTree()
\For{each $A_{ik} \in A$}
\State $slope_{A_{ik}} \gets$ ComputeSlope($A_{ik}$)
\State InsertIntoRTree($Tree_A, i, A_{ik}, slope_{A_{ik}}$)
\EndFor
\State $Tree_B \gets$ InitializeEmptyRTree()
\For{each $B_{jk} \in B$}
\State $expandedAABB_{B_{jk}} \gets$ ExpandAABB($B_{jk}, DT$)
\State InsertIntoRTree($Tree_B, j, B_{jk}, expandedAABB_{B_{jk}}$)
\EndFor
\State // Identify potential match candidates
\For{each pair $(A_{ik}, B_{jk})$ with intersecting AABBs}
\If{IsParallelish($slope_{A_{ik}}, slope_{B_{jk}}, AT$) and IsWithinDistance($A_{ik}, B_{jk}, DT$)}
\State // Calculate shared segment length
\State $overlapLength \gets$ CalculateOverlapLength($A_{ik}, B_{jk}$)
\State // Store matched pair and overlap length
\State StoreOrUpdateMatchedPair($A_{ik}, B_{jk}, overlapLength$)
\EndIf
\EndFor
\State \Return MatchedPairs
\EndProcedure
\State // Helper functions
\Function{IsParallelish}{$slope_{A}, slope_{B}, AT$}
\State $angle_A \gets \arctan(slope_{A})$
\State $angle_B \gets \arctan(slope_{B})$
\State \Return $(|angle_A - angle_B| \le AT)$
\EndFunction
\Function{IsWithinDistance}{$A_{ik}, B_{jk}, DT$}
\State $minDistance \gets$ ComputeMinSeparableDistance($A_{ik}, B_{jk}$)
\State \Return $(minDistance \le DT)$
\EndFunction
\Function{CalculateOverlapLength}{$A_{ik}, B_{jk}$}
\State $\theta_{A_{ik}} \gets$ ComputeAngle($A_{ik}$)
\If{$\theta_{A_{ik}} \le 45^\circ$}
\State $overlapLength \gets$ CalculateXOverlap($A_{ik}, B_{jk}$)
\Else
\State $overlapLength \gets$ CalculateYOverlap($A_{ik}, B_{jk}$)
\EndIf
\State \Return $overlapLength$
\EndFunction
\end{algorithmic}
\end{algorithm}
```
## Integration of Numeric Attributes
- we integrate principles from areal weighted interpolation and geometric conflation to provide an algorithm that matches geometries and performs attribute interpolation for complete and partial matches
- by de-emphasizing the importance of exacty 1-1 matches and relying on principles of spatial interpolation we can integrate datasets that are not representational of identical phenomena but rather similar ones. We remove the emphasis on measures of fit essentially
- in the cases where a complete match is identified this ammounts to a 1:1 attribute transfer
- we rely on intensive and extensive attribute transfer.
- TODO CITE r-spatial book and Tobler pysal library
- the shared length between i and j may exceed the total length of i or j. This occurs when multiple matches are made between components most often in parallel ways
### Extensive Numeric Attribute Integration
sum of shared length / length of y *
let the shared length between target i and source j be the variable $SL_{ij}$
$$
\hat{Y}_i = \sum_{j} \frac{SL_{ij}}{length(j)} \times Y_j
$$
### Intensive Numeric Attribute Integration
$$
\hat{Y}_i = \text{mean}\left(\frac{SL_{ij}}{length(i)} \times Y_j\right)
$$
## Integration of Categorical Attributes
Similar to the the approach taken for numeric attributes, we can also apply these to categorical variables. Say we have a categorical variable $Y$ with $k$ unique values. For each $j$ we calculate a frequency table for each $Y_{jk}$. We create new variable for each unique $Y_k$. $Y_{ik}$ becomes a numeric variable of the frequency. $Y_{jk}$ is a dummy variable. We now apply the same approach for extensive and intensive attribute integration for each new dummy variable.
### Extensive Categorical Attribute Integration
$\hat{Y_{ik}}$ is equal to the shared length divided by the length of j times $Y_{jk}$ summed for all $ij$ pairs. Note that $Y_{jk}$ can take on only the values of 1 or 0.
$$
\hat{Y_{ik}} = \sum_{j}{\frac{SL_{ij}}{length(j)}} \times Y_{jk}
$$
### Intensive Categorical Attribute Integration
$$
\hat{Y_{ik}} = \text{mean}\left(\frac{SL_{ij}}{length(i)} \times Y_{jk}\right)
$$
## Adaptability of Approximate Network Matching
- our algorithm is incredible flexible
- for example, you can match features that have no overlap if they are within distance tolerance and match the orientation critera to find abutting neighbors
- use results to efficiently narrow down matching candidates for manual verification
- can be used as a way to very quickly narrow down matches. Shared overlap can be used to provide a binary classification
## Discussion
- This algorithm is limited in that it is designed in $\Bbb{R}^2$ space. As such, there is no support for 3-dimensional or spherical coordinates. It is, however, conceivable to apply the same principles to these scenarios. The challenge, then, is to determine overlap regions, lengths, and lines-segments.
- 3D could be important though not many network datasets keep track of the height of points. Otherwise, measuring distance in 3 dimension for candidates could be useful.
- limited in that the geometries must be in the same coordinate system and must be proximal to eachother. Does not work for arbitrarily scaled geometries
- does not handle geometry shift (Lei, Ting; Lei, Zhen (2019))
- we make no affine transformations to handle shifts in geometries
------
### References
[@morgan_travel_2021]
[@rawlingson_overlaying_2015]
[@chehreghan_new_2017]
[@goodchild_simple_1997]
[@zhang_methods_nodate]
[@lei_optimal_2019]
::: {#refs}
:::