The goal of postmastr
is to provide consistent, tidy parsing of street
address data. The package is currently oriented towards American street
addresses, e.g. “123 East Main Street”. It contains functions for both
standardizing address elements (e.g. converting street names like
“Second” to “2nd” or converting “AV” to “Ave”) and for parsing out
input strings into separate variables for each input element.
We’ve at a point where all major functionality except for the ability to
work with unit types and numbers is ready for testing. If you work with
American street addresses regularly and have the time to take the
package for a spin, we’d love feedback before we submit to CRAN. We want
to make sure the workflow works, and can handle whatever addresses we
throw at it. Also, postmastr
is only set-up for American street
addresses right now but the functions have been built for expansion. If
you work with international street addresses and want to contribute,
please open a feature request
issue and
introduce yourself!
As of March 27, 2019, There is now a workflow for parsing intersections
build into pm_parse
. There are two breaking changes to be aware of:
pm_identify()
is no longer part ofpm_parse()
, and therefore must be run ahead of time by the end userpm_prep()
now has a requiredtype
parameter that should either be"street"
or"intersection"
- there have been changes to the arguments for
pm_replace()
,pm_rebuild()
, andpm_parse()
as well
The intersection workflow is very similar to the street address workflow
except that intersections must be prepared with pm_intersect_longer()
,
than parsed, then put back together with pm_intersect_wider()
before
replacing and rebuilding. The intersection workflow supports both short
(i.e. Main St at First Ave
) and long (i.e. Main St at First Ave, St. Louis MO 63110
) forms.
Street addresses can be notoriously difficult to work with. In the
United States, the U.S. Postal Service has standards for their
composition. There is so
much variety, however, that anticipating all of the possible
permutations of addresses is a significant task. When the inaccuracy of
human data entry is added, the challenge of parsing addresses becomes
monumental. The goal of postmastr
is to provide a uniform workflow for
parsing street address data that allows for sufficient flexibility.
This flexibility is provided in two ways. First, we utilize
“dictionaries” for a number of the key functions that allow users to
provide vectors of data to base parsing on. This enables postmastr
to
parse potential misspellings and colloquial terms that are hard (or
impossible) to predict. Second, not all aspects of the workflow are
mandatory - if street address data do not contain postal codes, states,
or cities, for example, those functions can be skipped.
postmastr
is not available from CRAN yet. In the meantime, you can
install the development version of postmastr
from Github with
remotes
:
# install.packages("remotes")
remotes::install_github("slu-openGIS/postmastr")
To illustrate the core components of the postmastr
workflow, we’ll use
some data included in the package on sushi restaurants in the St. Louis,
Missouri region. These are “long” data - some resturants appear multiple
times. Here is a quick preview of the data:
> sushi1
# A tibble: 30 x 3
name address visit
<chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18
4 Drunken Fish - Ballpark Village 601 Clark Ave #104, St. Louis, MO 63102-1719 4/28/18
5 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 5/10/18
6 Drunken Fish - Ballpark Village 601 Clark Ave Suite 104, St. Louis, MO 63102-1719 8/7/18
7 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18
8 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18
9 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18
10 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18
# … with 20 more rows
For the sushi1
data, the required dictionaries
are:
> mo <- pm_dictionary(type = "state", filter = "MO", case = c("title", "upper"), locale = "us")
> cities <- pm_append(type = "city",
+ input = c("Brentwood", "Clayton", "CLAYTON", "Maplewood",
+ "St. Louis", "SAINT LOUIS", "Webster Groves"),
+ output = c(NA, NA, "Clayton", NA, NA, "St. Louis", NA))
The sushi1
data are small, and the dictionaries could be developed
simply by looking through the data. However, a typical data set will
require significant exploration to develop these dictionaries.
postmastr
provides a full-feature
workflow
for working manually through the parsing process to both develop
dictionaries and troubleshoot issues with your data.
Once dictionaries have been developed, the pm_parse()
function can be
used to fully prep, parse, and reconstruct address strings:
> postmastr::sushi1 %>%
+ dplyr::filter(name != "Drunken Fish - Ballpark Village") %>%
+ pm_parse(input = "full",
+ address = address,
+ output = "short",
+ keep_parsed = "no",
+ city_dict = cities,
+ state_dict = mo)
# A tibble: 27 x 4
name address visit pm.address
<chr> <chr> <chr> <chr>
1 BaiKu Sushi Lounge 3407 Olive St, St. Louis, Missouri 63103 3/20/18 3407 Olive St
2 Blue Ocean Restaurant 6335 Delmar Blvd, St. Louis, MO 63112 10/26/18 6335 Delmar Blvd
3 Cafe Mochi 3221 S Grand Boulevard, St. Louis, MO 63118 10/10/18 3221 S Grand Blvd
4 Drunken Fish - Central West End 1 Maryland Plaza, St. Louis, MO 63108 12/2/18 1 Maryland Plz
5 I Love Mr Sushi 9443 Olive Blvd, St. Louis, Missouri 63132 1/1/18 9443 Olive Blvd
6 Kampai Sushi Bar 4949 W Pine Blvd, St. Louis, MO 63108 2/13/18 4949 W Pine Blvd
7 Midtown Sushi & Ramen 3674 Forest Park Ave, St. Louis, MO 63108 3/4/18 3674 Forest Park Ave
8 Mizu Sushi Bar 1013 Washington Avenue, St. Louis, MO 63101 9/12/18 1013 Washington Ave
9 Robata Maplewood 7260 Manchester Road, Maplewood, MO 63143 11/1/18 7260 Manchester Rd
10 SanSai Japanese Grill Maplewood 1803 Maplewood Commons Dr, St. Louis, MO 63143 2/14/18 1803 Maplewood Commons Dr
# … with 17 more rows
The postmastr
functions all contain a locale
argument that is only
enabled for American (i.e. locale = "us"
) addresses. Assistance with
expanding postmastr
functionality to other countries would be most
welcome. If you work with street address data in another country and
would like to contribute to postmastr
by extending its functionality,
please open a feature request
issue and
introduce yourself!
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.