Parse unusual addresses #81

debseidner · 2015-04-01T21:16:50Z

As a geocoding system, I need to be able to parse unusual addresses in the geocoder so I can find the (x,y) for addresses like 32 1/2 Main Street, 18-D Main Street, etc.

hkeeler · 2015-05-18T16:42:50Z

I'm starting to do a bit of deeper testing on the results of the address parser. usaddress has two parsing methods, "parse" (the default) and "tag" (generally better results). The following addresses parse as follows:

32 1/2 Main Street (method: "parse" and "tag")

"parts": {
    "AddressNumber": "32",
    "AddressNumberSuffix": "1/2",
    "StreetName": "Main",
    "StreetNamePostType": "Street"
}

Notes: These are the results we'd expect.

18-D Main Street (method: "parse" and "tag")
```
"parts": {
    "AddressNumber": "18-D",
    "StreetName": "Main",
    "StreetNamePostType": "Street"
}
```
Notes: Not as good. It would be more consistent to get "AddressNumber": "18" and "AddressNumberSuffix": "D". I'm not sure whether having "18-D" as our address number poses a problem for us.

18 D Main Street (method: "parse")
```
"parts": {
    "AddressNumber": "18",
    "StreetName": "Main", 
    "StreetNamePostType": "Street"
}
```
Notes: D is dropped altogether, not making it into any of the address parts. This is a bit of a surprise.

18 D Main Street (method: "tag")
```
"parts": {
    "AddressNumber": "18", 
    "StreetName": "D Main", 
    "StreetNamePostType": "Street"
}
```
Notes: D Main as StreetName? That seems like a real problem. This is especially surprising considering "tag" generally provides better results than "parse".

Curious to get everyone's take on this. Is it time to start learning more about how to "train" the parser?
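
As a stopgap short of retraining, the 18-D case could in principle be normalized after parsing. The following is only a sketch: `split_address_number` is a hypothetical helper, not part of usaddress, and it assumes the suffix is a single letter after a hyphen.

```python
import re

# Hypothetical post-processing step, not part of usaddress.
# Assumption: the suffix is exactly one letter following a hyphen.
_HYPHENATED = re.compile(r"^(\d+)-([A-Za-z])$")


def split_address_number(parts):
    """Return a copy of `parts` with an AddressNumber like "18-D" split in two."""
    result = dict(parts)
    match = _HYPHENATED.match(result.get("AddressNumber", ""))
    # Only split when the parser has not already assigned a suffix.
    if match and "AddressNumberSuffix" not in result:
        result["AddressNumber"] = match.group(1)
        result["AddressNumberSuffix"] = match.group(2)
    return result


parts = {"AddressNumber": "18-D", "StreetName": "Main", "StreetNamePostType": "Street"}
print(split_address_number(parts))
```

Purely numeric hyphenated numbers (for example 12-34) are deliberately left untouched, since splitting them would change their meaning.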

hkeeler · 2015-05-18T18:29:12Z

Rural routes were brought up at today's standup. Below are a few addresses given as examples in the Postal Service's Pub 28 section on Rural Route Addresses:

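| Original | Type | USPSBoxGroupType | USPSBoxGroupID | USPSBoxType | USPSBoxID |
| --- | --- | --- | --- | --- | --- |
| RR+2+BOX+152 | PO Box | RR | 2 | BOX | 152 |
| RR+9+BOX+23A | PO Box | RR | 9 | BOX | 23A |
| RR03+BOX+98D | PO Box | | | RR03 BOX | 98D |
| RR+4+BOX+19-1A | PO Box | RR | 4 | BOX | 19-1A |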

Overall, this looks "pretty good". The parser clearly doesn't split up USPSBoxID if there are sub-parts, but maybe that's fine. The Leading Zero address (RR03), which doesn't parse correctly, is considered "Acceptable", but not "Preferred".

There are other "Incorrect" formats such as Designations RFD and RD and Additional Designations. These formats are not parsed well, but I suppose that is to be expected considering the USPS doesn't consider them to be valid in the first place.

hkeeler · 2015-05-18T18:52:54Z

Also, if anyone else would like to test the parser's default behavior, usaddress has a site for single and bulk address parsing.

debseidner · 2015-05-19T18:08:39Z

@hkeeler In addition to RR, I also found that there are highway contract routes or star routes (HC 68 BOX 23A or HC 68 BOX 19-2B). We should also take into account suites for business addresses (12 E MAIN AVE STE 209) and addresses like those in DC with the directional after the street (1275 First Street NE).

hkeeler · 2015-05-19T20:56:55Z

Similar to RR, the HC addresses parse as follows:

| Original | Type | USPSBoxGroupType | USPSBoxGroupID | USPSBoxType | USPSBoxID |
| --- | --- | --- | --- | --- | --- |
| HC+68+BOX+23A | PO Box | HC | 68 | BOX | 23A |
| HC+68+BOX+19-2B | PO Box | HC | 68 | BOX | 19-2B |

Suite and DC addresses in their standard form also parse as expected:

| Original | Type | AddressNumber | StreetNamePreDirectional | StreetName | StreetNamePostType | OccupancyType | OccupancyIdentifier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12+E+MAIN+AVE+STE+209 | Street Address | 12 | E | MAIN | AVE | STE | 209 |

| Original | Type | AddressNumber | StreetName | StreetNamePostType | StreetNamePostDirectional |
| --- | --- | --- | --- | --- | --- |
| 1275+First+Street+NE | Street Address | 1275 | First | Street | NE |

It's interesting that the E in the Suite address is considered a StreetNamePreDirectional, while the DC address's NE is a StreetNamePostDirectional.

hkeeler · 2015-05-20T16:31:28Z

I've been playing with training the parser a bit. It's pretty easy. The instructions are laid out pretty well here:

https://github.com/datamade/usaddress#for-the-nerds

I've been able to add to the training data (defaults to training/labeled.xml) and see the parse results change. It's also interesting that they have other training files available in the training directory. It's not yet clear whether the packaged version of the library uses just labeled.xml, all of the files, or some combination. I suspect it is not all of them, since one of the files fails to load when I try to train with it.

Another interesting discovery is that it doesn't seem possible to train the parser to parse addresses with concatenated parts, such as:

RR03 BOX 98D

This is due to the parser's hard-coded set of tokens used to split the address parts. To get around this, we (or they) would need to change the tokenize function to include other tokens. I'm not quite ready to go down this road just yet.
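
An alternative that leaves the tokenizer alone would be to normalize the input before it reaches the parser. A minimal sketch, assuming concatenated designators always look like a known prefix immediately followed by digits; `split_route_designator` is a hypothetical helper, and whether usaddress then labels the result correctly would still need to be checked:

```python
import re

# Assumption: only the RR and HC prefixes discussed in this thread need
# handling, and the route number directly follows the prefix (e.g. "RR03").
_ROUTE = re.compile(r"\b(RR|HC)(\d+)\b", re.IGNORECASE)


def split_route_designator(address):
    """Insert a space between a route prefix and its number: "RR03" -> "RR 03"."""
    return _ROUTE.sub(r"\1 \2", address)


print(split_route_designator("RR03 BOX 98D"))
```

The leading zero is kept as-is; already separated forms such as "RR 2 BOX 152" pass through unchanged.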

And finally, I did a bit of testing for Salt Lake City with their unique addressing scheme.

>>> import usaddress
>>> address = '5268 S 2200 E # 12'
>>> usaddress.tag(address)
(OrderedDict([
 ('AddressNumber', u'5268'), 
 ('StreetNamePreDirectional', u'S'), 
 ('StreetName', u'2200'), 
 ('StreetNamePostDirectional', u'E'), 
 ('OccupancyIdentifier', u'# 12')]),
 'Street Address')

According to the SLC's example from their Standardization page, this address should parse to:

5268 S: Frontage Number
S: Directional
2200: Street Number
E: Street Directional
12: Sub-structure Suffix

These don't exactly line up, and I'm not exactly sure how these data parts could be made to align with usaddress's address parts. Perhaps this is good enough? Thoughts?
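
One possible way to line the two up is a simple relabeling of the tag result. The correspondence below is a guess based only on the two listings above, not something usaddress or SLC defines, and `USADDRESS_TO_SLC` and `to_slc_parts` are hypothetical names:

```python
from collections import OrderedDict

# Assumed correspondence between usaddress labels and the SLC part names
# quoted above. This mapping is a guess, not defined by either source.
USADDRESS_TO_SLC = OrderedDict([
    ("AddressNumber", "Frontage Number"),
    ("StreetNamePreDirectional", "Directional"),
    ("StreetName", "Street Number"),
    ("StreetNamePostDirectional", "Street Directional"),
    ("OccupancyIdentifier", "Sub-structure Suffix"),
])


def to_slc_parts(tagged):
    """Relabel a dict of usaddress parts using the assumed SLC names."""
    slc = OrderedDict()
    for label, value in tagged.items():
        if label not in USADDRESS_TO_SLC:
            continue  # skip labels with no assumed SLC counterpart
        if label == "OccupancyIdentifier":
            value = value.lstrip("# ")  # "# 12" -> "12"
        slc[USADDRESS_TO_SLC[label]] = value
    return slc


# The parts returned by usaddress.tag() in the session above.
tagged = OrderedDict([
    ("AddressNumber", "5268"),
    ("StreetNamePreDirectional", "S"),
    ("StreetName", "2200"),
    ("StreetNamePostDirectional", "E"),
    ("OccupancyIdentifier", "# 12"),
])
for name, value in to_slc_parts(tagged).items():
    print("{}: {}".format(name, value))
```

This only works when the street name is itself a grid number, so it would not be a general replacement for usaddress's labels.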

hkeeler · 2015-05-20T22:26:05Z

Since the question came up at today's standup, usaddress's address parts are based on the United States Thoroughfare, Landmark, and Postal Address Data Standard. This standard does reference USPS Pub 28 throughout, so the two are not mutually exclusive. One of their objectives is:

Build on USPS Publication 28, the Census Bureau TIGER files, the FGDC
Content Standard for Digital Geospatial Metadata, the FGDC's National Spatial
Data Infrastructure (NSDI) Framework Data Content Standard, and previous
FGDC address standard efforts.

hkeeler · 2015-06-02T21:58:25Z

There were recently questions about how the parser handles Puerto Rico and Overseas Military addresses. By default, usaddress behaves as follows:

Military

Below are the sample addresses taken from Pub 28 Military Addresses. The results are...not so good.

| Original | Type | OccupancyType | OccupancyIdentifier | USPSBoxType | USPSBoxID | PlaceName | StateName | ZipCode | Unable to parse | LandmarkName |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| unit+2050+box+4190+apo+ap+96278-2050 | PO Box | unit | 2050 | box | 4190 | apo | ap | 96278-2050 | | |
| psc+802+box+74+apo+ae+09499-2050 | Unparsed | | | | | | | | psc 802 box 74 apo ae 09499-2050 | |
| uscgc+hamilton+fpo+ap+96667-3931 | Ambiguous | | | | | fpo | ap | 96667-3931 | | uscgc hamilton |

Puerto Rico

Below are sample addresses taken from Pub 28 Puerto Rico Addresses. They clearly parse much better than the military addresses. The one at the end that does not parse is considered an "exception".

Certain condominiums are not located on a named street or have an assigned number to the building. The name of the condominium is substituted for the street name.

| Original | Type | AddressNumber | StreetName | OccupancyType | OccupancyIdentifier | PlaceName | StateName | ZipCode | Recipient | StreetNamePostType | Unable to parse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1234+ave+ashford+apt+1a+san+juan+pr+00907-1021 | Street Address | 1234 | ave ashford | apt | 1a | san juan | pr | 00907-1021 | | | |
| 1230+calle+amapolas+apt+103+carolina+pr+00979-1126 | Street Address | 1230 | calle amapolas | apt | 103 | carolina | pr | 00979-1126 | | | |
| 1234+urb+los+olmos+ponce+pr+00731-1235 | Street Address | 1234 | urb | | | los olmos ponce | pr | 00731-1235 | | | |
| urb+las+gladiolas+150+calle+a san+juan+pr+00926-0221 | Street Address | 150 | calle a | | | san juan | pr | 00926-0221 | urb las gladiolas | | |
| 1234+calle+aurora+mayagues+pr+00680-1234 | Street Address | 1234 | calle | | | mayagues | pr | 00680-1234 | | aurora | |
| res+las+margaritas+edif+1+apt+104+caguas+pr+00725-1103 | Unparsed | | | | | | | | | | res las margaritas edif 1 apt 104 caguas pr 00725-1103 |

...and the good news is that these types of addresses seem very "trainable", so if we do feel like we need to improve these results, we have options.
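
If we do go down the training route, a first step could be collecting the inputs that fail outright. A minimal sketch, assuming each result is available as an (original, type) pair using the Type values shown in the tables above; `NEEDS_REVIEW` and `training_candidates` are hypothetical names, not part of usaddress:

```python
# Type values that signal a failed or uncertain parse in the tables above.
NEEDS_REVIEW = {"Unparsed", "Ambiguous"}


def training_candidates(rows):
    """Return the original addresses whose Type suggests the parse failed."""
    return [original for original, parse_type in rows if parse_type in NEEDS_REVIEW]


# (original, Type) pairs copied from the military and Puerto Rico results.
rows = [
    ("unit 2050 box 4190 apo ap 96278-2050", "PO Box"),
    ("psc 802 box 74 apo ae 09499-2050", "Unparsed"),
    ("uscgc hamilton fpo ap 96667-3931", "Ambiguous"),
    ("res las margaritas edif 1 apt 104 caguas pr 00725-1103", "Unparsed"),
]
print(training_candidates(rows))
```

This only catches failures the parser flags itself. Rows that parse but are mislabeled, such as 1234 urb los olmos, would still need a manual pass.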

jmarin mentioned this issue Apr 1, 2015

Data model specification #5

Closed

9 tasks

jmarin added this to the M2 milestone May 4, 2015

jmarin modified the milestones: M3, M2 Jun 2, 2015

jmarin modified the milestones: M4, M3 Jul 30, 2015

jmarin modified the milestones: M5 - TIGER Geocoder Improvements, M4 Aug 17, 2015

hkeeler removed this from the M5 - TIGER Geocoder Improvements milestone Aug 26, 2015
