Code to create a Pairwise Zip code Distance Matrix and use an index lookup. This is a tiny utility I built, for use in my work, but I figured it might be helpful to anyone else dealing with zip/postal code distances. This can be a huge and efficient time saver if you're working with large data and need to keep hitting some sort of a zip/latlong service API multiple times.
The idea is to form a (41483,41483) matrix where the cell values are the haversine distances
between two zips that are indexed. I save the index look up and the matrix file as .npz, .json or .pkl to reload and use in different projects. I hope this saves your effort when you're working with data streams or pandas dataframes and want to avoid iterating, using pypi package or a custom apply function to calculate distance between 2 zipcodes multiple times.
Use this matrix and vectorize your operations !
Install the folllowing python packages
pip install pandas
pip install numpy
pip install mpu
Since the the matrix is too huge to upload on git, you can try running the .py script in the code folder or go through the Jupter Notebook. The source file has been added in the Data folder. Running the Zip_distance.py file will generate and save the matrix on your machine and you can load it using
zip_dist = np.load(os.path.join(walk_up_folder(os.getcwd(), 2), "zip_dist.npz"))['arr_0']
# Load the saved objects
f = open(os.path.join(walk_up_folder(os.getcwd(), 3), "Data/zips_indexer.json"))
zips_indexer = json.load(f)
zip_dist = np.load(os.path.join(walk_up_folder(os.getcwd(), 3), "Data/zip_dist.npz"))['arr_0']
a='95035' # Zipcode as String
b='94085'
# Extract that cell value from the distance matrix
distance = zip_dist[zips_indexer[a],zips_indexer[b]]
To run the code fresh or make any changes, you can follow the steps in the Notebook
What optimizations did I make in my code?
- Converted the matrix from 'float64' to 'float32' to reduce disk space
- Tried using Numba.jit to speed up the iterrows and numpy loops
mpu.haversine
gives result in kms. Convert to miles 1km = 0.621371 miles- Did not normalize the matrix before saving since i do need the raw distance values and since it is pairwise the max distance will be too high
- numba -
- Per the deprecation recommendations, it's very reasonable that code which doesn't compile with @jit(nopython=True) could be faster without the decorator.
- The available libraries that can be used with numba jit in nopython is fairly limited (pretty much only to numpy arrays and certain python builtin libraries).
- Don't use python's Pickle, don't use any database, don't use any big data system to store your data into hard disk, if you could use np.save() and np.load(). These two functions are the fastest solution to transfer data between harddisk and memory so far.
- Using list(zip(a,b)) is much faster if yopu want top create a new pandas column that is a tuple of 2 other columns
- To make a symmetrix pairwise matrix, initialze using
np.zeros
so that you won't need to worrty about the diagonal elements. Extrack the upper triangular indices. Add the transpose to the matrix to fill lower as well and set diagonal to 0 again
Please use the following bibtex, when you refer
@software{Samala_Pairwise_Zip-code_Haversine_2021,
author = {Samala, Nitin Kishore Sai},
month = {11},
title = {{Pairwise Zip-code Haversine distance lookup matrix}},
url = {https://github.com/snknitin/us-zipcode-distance},
year = {2021}
}
- Haversine Distance Calculation
- Pairwise matrix from 2 arrays
- Array Storage Benchmark
- Compressed save/load NPZ
Here are some related projects
No. You can change the source file of the zip details from here and run the code for any country
Answer 2
If you have any feedback, please reach out to us at [email protected]