-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017
base: master
Are you sure you want to change the base?
Conversation
def haversine(lon1, lat1, lon2, lat2): | ||
earth_radius = 6371000 # meters | ||
lat1, lat2 = np.radians(lat1), np.radians(lat2) | ||
lon1, lon2 = np.radians(lon1), np.radians(lon2) | ||
|
||
dlat = lat2 - lat1 | ||
dlon = lon2 - lon1 | ||
a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2 | ||
return 2 * earth_radius * np.arcsin(np.sqrt(a)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGreenlee Why don't we modify the existing:
def calDistance(point1, point2, coordinates=False):
"""haversine distance
:param point1: a coordinate in degrees WGS84
:param point2: another coordinate in degrees WGS84
:param coordinates: if false, expect a list of coordinates, defaults to False
:return: distance approximately in meters
"""
earthRadius = 6371000 # meters
# SHANKARI: Why do we have two calDistance() functions?
# Need to combine into one
# points are now in geojson format (lng,lat)
if coordinates:
dLat = math.radians(point1.lat-point2.lat)
dLon = math.radians(point1.lon-point2.lon)
lat1 = math.radians(point1.lat)
lat2 = math.radians(point2.lat)
else:
dLat = math.radians(point1[1]-point2[1])
dLon = math.radians(point1[0]-point2[0])
lat1 = math.radians(point1[1])
lat2 = math.radians(point2[1])
a = (math.sin(dLat/2) ** 2) + ((math.sin(dLon/2) ** 2) * math.cos(lat1) * math.cos(lat2))
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
d = earthRadius * c
return d
in emission core?
If we want points vs lat long explicit why dont we do something like:
def haversine(*args, coordinates=False):
"""
Calculate the Haversine distance between two points.
There are two usage modes:
1. Points mode: Supply two points.
- If `coordinates=True`, the points are expected to have attributes `lat` and `lon`.
Example: haversine(point1, point2, coordinates=True)
- Otherwise, the points are expected to be sequences (e.g. lists or tuples)
Example: haversine([lon, lat], [lon, lat])
2. LatLong mode: Supply four numeric values:
lon1, lat1, lon2, lat2
Example: haversine(lon1, lat1, lon2, lat2)
:returns: Distance in meters
"""
earth_radius = 6371000 # meters
if len(args) == 2:
# Points mode
p1, p2 = args
if coordinates:
lat1, lon1 = p1.latitude, p1.lon
lat2, lon2 = p2.lat, p2.lon
else:
lon1, lat1 = p1[0], p1[1]
lon2, lat2 = p2[0], p2[1]
elif len(args) == 4:
# LatLong mode
lon1, lat1, lon2, lat2 = args
else:
raise ValueError("Invalid arguments. Provide either two points or four coordinate values (lon1, lat1, lon2, lat2).")
# Convert from degrees to radians
lat1, lat2 = np.radians(lat1), np.radians(lat2)
lon1, lon2 = np.radians(lon1), np.radians(lon2)
# Haversine formula
dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
return earth_radius * c
Rather than restating it in time filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just put it there for now because we might switch to using other libraries (shapely and/or geopandas) for the distance calculations, depending on whether they are faster
I spent much of today addressing edge cases to get this optimized implementation to match I am assigning @TeachMeTW to create a similarly-optimized |
Only a few tests are failing now (compared to dozens before), and as far as I can tell the only remaining difference in behavior is that Seeking clarification on when and why the zero-duration places occur. Does it have something to do with untracked time? |
Test are passing after a trivial off-by-one fix! |
- Replace row-by-row iterative processing with vectorized computation of time, distance, and speed. - Compute candidate segmentation points in bulk using thresholds (time, distance, speed) and flags for tracking restarts and motion activity. - Retain a targeted row-by-row loop to handle huge invalid timestamp offsets on candidate points. - Partition filtered points into trip segments based on computed candidate indices, and force trip termination based on transition events. - Note: Tests are failing; will debug issues in subsequent commits.
- Replace row-by-row iterative processing with vectorized computation of time, distance, and speed. - Compute candidate segmentation points in bulk using thresholds (time, distance, speed) and flags for tracking restarts and motion activity. - Retain a targeted row-by-row loop to handle huge invalid timestamp offsets on candidate points. - Partition filtered points into trip segments based on computed candidate indices, and force trip termination based on transition events. - Note: Tests are failing; will debug issues in subsequent commits.
68df978
to
0e8d0d7
Compare
@TeachMeTW I updated your commit and have gotten the tests to pass, but I am not sure that all the of comments/documentation is still accurate. Can you clean it up? |
We attempted to unify the existing implementation of We can either:
Since we are trying to get this out the door to see our improvements on prod next week I am proposing (2). |
957a9bb
to
53e18c9
Compare
Keeping separate implementations of the haversine formula for now, since the expected values in all the tests rely on python math calculations and the numpy calculations are slightly off in floating point precision (< 0.000001 difference) Added timing for /segment_into_trips_dist/loop Co-authored-by: TeachMeTW <[email protected]>
Instead of DwellSegmentationTimeFilter and DwellSegmentationDistFilter accepting the timeseries and timequery as parameters, and each retrieving transition_df and motion_df that way, we can retrieve those dataframes upstream (in trip_segmentation.py) and pass them through. This will work exactly the same but is a bit simpler in structure
53e18c9
to
911c0f9
Compare
@TeachMeTW