Parse unusual addresses #81
Comments
Starting to do a bit of deeper testing on the results of the address parser. usaddress has two parsing methods, "parse" (default) and "tag" (generally better results). The following addresses parse as follows:
Curious to get everyone's take on this. Is it time to start learning more about how to "train" the parser?
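For anyone comparing the two methods: "parse" returns one (token, label) pair per token, while "tag" returns an ordered mapping with adjacent same-label tokens joined, plus an address type. A rough sketch of that merging step is below. This is not usaddress's actual implementation; `merge_tokens` and `RepeatedLabel` are hypothetical names, and the `pairs` list is written by hand in the shape that "parse" returns rather than copied from real output.

```python
from collections import OrderedDict


class RepeatedLabel(Exception):
    """Raised when a label appears in two non-adjacent places (hypothetical name)."""


def merge_tokens(labeled_tokens):
    """Join adjacent tokens that share a label into a single ordered mapping."""
    merged = OrderedDict()
    previous = None
    for token, label in labeled_tokens:
        if label == previous:
            # Same label as the token just before: extend the existing part.
            merged[label] += " " + token
        elif label in merged:
            # Label seen earlier but not adjacent: the parts cannot be merged cleanly.
            raise RepeatedLabel(label)
        else:
            merged[label] = token
        previous = label
    return merged


# Hand-written pairs in the (token, label) shape; an assumption, not real parser output.
pairs = [
    ("5268", "AddressNumber"),
    ("S", "StreetNamePreDirectional"),
    ("2200", "StreetName"),
    ("E", "StreetNamePostDirectional"),
    ("#", "OccupancyIdentifier"),
    ("12", "OccupancyIdentifier"),
]
merged = merge_tokens(pairs)
```

The merged form is easier to feed into a geocoder, at the cost of failing outright on addresses where a label repeats in separate places.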
Rural routes were brought up at today's standup. Below are a few addresses given as examples in the Postal Service's Pub 28 on Rural Route Addresses:
Overall, this looks "pretty good". The parser clearly doesn't split up ... There are other "Incorrect" formats, such as the Designations RFD and RD and Additional Designations. These formats are not parsed well, but I suppose that is to be expected considering the USPS doesn't consider them to be valid in the first place.
@hkeeler In addition to RR, I also found that there are highway contract routes or star routes (HC 68 BOX 23A or HC 68 BOX 19-2B). We should also take into account suites for business addresses (12 E MAIN AVE STE 209) and addresses like those in DC, with the directional after the street (1275 First Street NE).
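The addresses above could be checked in one pass with a small harness like the following. It is only a sketch: `tag_all` is a hypothetical helper, and it takes the tagging function as an argument so it can be tried with a stand-in before pointing it at `usaddress.tag`.

```python
def tag_all(addresses, tagger):
    """Run `tagger` over each address, keeping either its result or the error raised."""
    results = {}
    for address in addresses:
        try:
            results[address] = tagger(address)
        except Exception as exc:  # e.g. a tagger that refuses ambiguous input
            results[address] = exc
    return results


# The sample addresses mentioned in this comment.
ADDRESSES = [
    "HC 68 BOX 23A",
    "HC 68 BOX 19-2B",
    "12 E MAIN AVE STE 209",
    "1275 First Street NE",
]

# With usaddress installed, the call would look like:
#     import usaddress
#     results = tag_all(ADDRESSES, usaddress.tag)
```

Keeping the exceptions in the result, rather than stopping at the first one, makes it easier to see which address formats need attention.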
Similar to RR, the HC addresses parse as follows:
Suite and DC addresses in their standard form also parse as expected:
It's interesting that the
I've been playing with training the parser a bit. It's pretty easy. The instructions are laid out pretty well here: ... I've been able to add to the training data (defaults to ...). Another interesting discovery is that it doesn't seem possible to train the parser to parse addresses with concatenated parts, such as:
This is due to the parser's hard-coded set of tokens used to split the address parts. To get around this, we (or they) would need to change the ...

And finally, I did a bit of testing for Salt Lake City with their unique addressing scheme.

>>> import usaddress
>>> address = '5268 S 2200 E # 12'
>>> usaddress.tag(address)
(OrderedDict([
('AddressNumber', u'5268'),
('StreetNamePreDirectional', u'S'),
('StreetName', u'2200'),
('StreetNamePostDirectional', u'E'),
('OccupancyIdentifier', u'# 12')]),
'Street Address')

According to SLC's example from their Standardization page, this address should parse to:
These don't exactly line up, and I'm not sure how these parts could be made to align with usaddress's address parts. Perhaps this is good enough? Thoughts?
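On the concatenated-parts limitation mentioned earlier in this comment: one alternative to changing the parser's tokenizer might be to pre-process the string before it reaches the parser. The sketch below is an assumption on my part, not anything usaddress provides. `split_concatenated` is a hypothetical helper that only handles one kind of concatenation (a number run directly followed by a word of two or more letters), and it deliberately leaves ordinals and unit-style values such as 23A alone.

```python
import re

# Ordinal suffixes (1ST, 2ND, 3RD, 4TH) must stay attached to their number.
_ORDINAL = r"(?:st|nd|rd|th)\b"

# Zero-width position: a digit behind, two or more letters ahead, not an ordinal.
_BOUNDARY = re.compile(
    r"(?<=\d)(?!" + _ORDINAL + r")(?=[A-Za-z]{2,})",
    re.IGNORECASE,
)


def split_concatenated(address):
    """Insert a space where a number runs straight into a word (hypothetical helper)."""
    return _BOUNDARY.sub(" ", address)
```

For example, `split_concatenated("123MAIN ST")` gives `"123 MAIN ST"`, while `"HC 68 BOX 23A"` and `"21ST ST"` pass through unchanged. Other kinds of concatenation would need their own rules, so this is a stopgap rather than a fix.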
Since the question came up at today's standup, usaddress's address parts are based on the United States Thoroughfare, Landmark, and Postal Address Data Standard. This standard does reference USPS Pub 28 throughout, so the two are not mutually exclusive. One of their objectives is:
There were recently questions about how the parser handles Puerto Rico and overseas military addresses. usaddress behaves as follows by default:

Military

Below are the sample addresses taken from Pub 28 Military Addresses. The results are... not so good.

Puerto Rico

Below are sample addresses taken from Pub 28 Puerto Rico Addresses. They clearly parse much better than the military addresses. The one at the end that does not parse is considered an "exception".
...and the good news is that these types of addresses seem very "trainable", so if we do feel like we need to improve these results, we have options.
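If we do go the training route, hand-labeled examples have to be written out as XML before retraining. Below is a rough sketch of generating that XML from (token, label) pairs. The `AddressCollection` / `AddressString` layout is my understanding of the training file format and should be checked against the existing training file in the repo; `to_training_xml` is a hypothetical helper, and the labels chosen for the sample military-style address are my own guess at a reasonable labeling, not a verified one.

```python
import xml.etree.ElementTree as ET


def to_training_xml(labeled_addresses):
    """Serialize lists of (token, label) pairs as one XML training collection."""
    root = ET.Element("AddressCollection")  # assumed root element name
    for pairs in labeled_addresses:
        address = ET.SubElement(root, "AddressString")  # assumed element name
        for index, (token, label) in enumerate(pairs):
            part = ET.SubElement(address, label)
            part.text = token
            if index < len(pairs) - 1:
                # A space between parts keeps the original address text readable.
                part.tail = " "
    return ET.tostring(root, encoding="unicode")


# Illustrative military-style address; the label choices are assumptions.
sample = [
    ("PSC", "USPSBoxGroupType"),
    ("802", "USPSBoxGroupID"),
    ("BOX", "USPSBoxType"),
    ("74", "USPSBoxID"),
    ("APO", "PlaceName"),
    ("AE", "StateName"),
    ("09499", "ZipCode"),
]
training_xml = to_training_xml([sample])
```

Generating the file from pairs keeps the labeling reviewable in plain Python before anything is fed to the trainer.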
As a geocoding system, I need to be able to parse unusual addresses in the geocoder so I can handle finding the (x,y) for addresses like 32 1/2 Main Street, 18-D Main Street, etc.