🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

JGreenlee · 2025-02-07T19:11:54Z

TeachMeTW · 2025-02-07T20:47:27Z

...ion/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py

+def haversine(lon1, lat1, lon2, lat2):
+    earth_radius = 6371000  # meters
+    lat1, lat2 = np.radians(lat1), np.radians(lat2)
+    lon1, lon2 = np.radians(lon1), np.radians(lon2)
+
+    dlat = lat2 - lat1
+    dlon = lon2 - lon1
+    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
+    return 2 * earth_radius * np.arcsin(np.sqrt(a))


@JGreenlee Why don't we modify the existing:

def calDistance(point1, point2, coordinates=False): """haversine distance :param point1: a coordinate in degrees WGS84 :param point2: another coordinate in degrees WGS84 :param coordinates: if false, expect a list of coordinates, defaults to False :return: distance approximately in meters """ earthRadius = 6371000 # meters # SHANKARI: Why do we have two calDistance() functions? # Need to combine into one # points are now in geojson format (lng,lat) if coordinates: dLat = math.radians(point1.lat-point2.lat) dLon = math.radians(point1.lon-point2.lon) lat1 = math.radians(point1.lat) lat2 = math.radians(point2.lat) else: dLat = math.radians(point1[1]-point2[1]) dLon = math.radians(point1[0]-point2[0]) lat1 = math.radians(point1[1]) lat2 = math.radians(point2[1]) a = (math.sin(dLat/2) ** 2) + ((math.sin(dLon/2) ** 2) * math.cos(lat1) * math.cos(lat2)) c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a)) d = earthRadius * c return d

in emission core?

If we want points vs lat long explicit why dont we do something like:

def haversine(*args, coordinates=False): """ Calculate the Haversine distance between two points. There are two usage modes: 1. Points mode: Supply two points. - If `coordinates=True`, the points are expected to have attributes `lat` and `lon`. Example: haversine(point1, point2, coordinates=True) - Otherwise, the points are expected to be sequences (e.g. lists or tuples) Example: haversine([lon, lat], [lon, lat]) 2. LatLong mode: Supply four numeric values: lon1, lat1, lon2, lat2 Example: haversine(lon1, lat1, lon2, lat2) :returns: Distance in meters """ earth_radius = 6371000 # meters if len(args) == 2: # Points mode p1, p2 = args if coordinates: lat1, lon1 = p1.latitude, p1.lon lat2, lon2 = p2.lat, p2.lon else: lon1, lat1 = p1[0], p1[1] lon2, lat2 = p2[0], p2[1] elif len(args) == 4: # LatLong mode lon1, lat1, lon2, lat2 = args else: raise ValueError("Invalid arguments. Provide either two points or four coordinate values (lon1, lat1, lon2, lat2).") # Convert from degrees to radians lat1, lat2 = np.radians(lat1), np.radians(lat2) lon1, lon2 = np.radians(lon1), np.radians(lon2) # Haversine formula dlat = lat2 - lat1 dlon = lon2 - lon1 a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2 c = 2 * np.arcsin(np.sqrt(a)) return earth_radius * c

Rather than restating it in time filter?

I just put it there for now because we might switch to using other libraries (shapely and/or geopandas) for the distance calculations, depending on whether they are faster

JGreenlee · 2025-02-07T21:32:50Z

I spent much of today addressing edge cases to get this optimized implementation to match master. Left it here for @TeachMeTW to review and will clean it up later

I am assigning @TeachMeTW to create a similarly-optimized DwellSegmentationDistFilter by using this one as an example

JGreenlee · 2025-02-07T21:32:55Z

Only a few tests are failing now (compared to dozens before), and as far as I can tell the only remaining difference in behavior is that master sometimes generates zero-duration places, which are expected in a few of the tests in TestPipelineRealData (testZeroDurationPlaceInterpolationSingleSync, testZeroDurationPlaceInterpolationMultiSync, testJun21)
TestMetricsConfirmedTrips also fails because it also uses shankari_2016-06-21

Seeking clarification on when and why the zero-duration places occur. Does it have something to do with untracked time?

JGreenlee · 2025-02-07T21:55:43Z

I was worried that by adding the complex logic to handle the many edge cases, the performance benefit would start to disappear.

But it still looks pretty good. Trip segmentation consistently runs in about 1/3 the time, regardless of which day I use

And this time, both implementations actually generate the same trips:

	start_ts	end_ts
0	1440688739.67	1440689408.30
1	1440689540.70	1440690076.78
2	1440690676.91	1440694424.89
3	1440694936.47	1440699266.67
4	1440699298.74	1440700070.13
5	1440716367.38	1440719699.47
6	1440719879.47	1440723334.90
7	1440728457.49	1440729142.71
8	1469618232.77	1469621795.00
9	1469624459.54	1469624904.51
10	1469625084.51	1469647206.80
11	1469647386.80	1469647720.02
12	1469649282.12	1469654463.95
13	1469654643.95	1469655890.25
14	1469657170.00	1469660118.71
15	1469660263.68	1469660991.59
16	1469666731.00	1469666936.39
17	1469667346.75	1469668756.00
18	1469669000.04	1469670920.00
19	1469670932.17	1469671579.88
20	1469672049.20	1469672977.54
21	1469673888.75	1469674377.54
22	1469674452.53	1469674797.00
23	1469674977.00	1469675656.00
24	1469675860.40	1469677096.61
25	1469677276.61	1469677679.00
26	1469678006.73	1469678966.00
27	1469678994.68	1469679381.00
28	1469679519.26	1469680163.00
29	1469682469.19	1469684686.18
30	1469686277.51	1469686987.01
31	1470341031.23	1470342912.00
32	1470343292.14	1470351348.72
33	1470352238.74	1470352844.00
34	1470354036.96	1470354172.71
35	1470354386.56	1470355132.00
36	1470355612.59	1470355924.25
37	1470356315.84	1470357252.57
38	1470357578.29	1470363534.71
39	1470364485.74	1470364718.35

I think there are still a number of further optimizations that can be considered:

Using shapely / geopandas instead of a numpy-based haversine implementation
Using one recent_dists instead of having two separate last_10_dists and last_5min_dists
- ~~not possible without changing the current behavior (I tried)~~
Computing the max distances in batches (for example, by only looking the first 100 non-segemented points) so we waste less time on irrelevant points
polars? (now that we are computing more things column-wise)

But we may choose to save those for later if these changes, as well as the last batch of changes, already prove effective at reducing costs

JGreenlee · 2025-02-08T03:18:27Z

Test are passing after a trivial off-by-one fix!

JGreenlee · 2025-02-10T16:44:45Z

Using one recent_dists instead of having two separate last_10_dists and last_5min_dists

I was able to accomplish this, which cut execution time by ~1/2 again.

Charts below show the cumulative improvement since we began optimizing trip segmentation.
Given that trip segmentation does not appear to be a bottleneck anymore and much of the remaining execution time is just from DB calls, I do not think we need to consider any further optimizations here.
Let's just get DwellSegmentationDistFilter to the same level as DwellSegmentationTimeFilter and then push this out

- Replace row-by-row iterative processing with vectorized computation of time, distance, and speed. - Compute candidate segmentation points in bulk using thresholds (time, distance, speed) and flags for tracking restarts and motion activity. - Retain a targeted row-by-row loop to handle huge invalid timestamp offsets on candidate points. - Partition filtered points into trip segments based on computed candidate indices, and force trip termination based on transition events. - Note: Tests are failing; will debug issues in subsequent commits.

JGreenlee · 2025-02-13T02:31:44Z

@TeachMeTW I updated your commit and have gotten the tests to pass, but I am not sure that all the of comments/documentation is still accurate. Can you clean it up?

JGreenlee · 2025-02-14T17:39:40Z

We attempted to unify the existing implementation of calDistance with the NumPy haversine implementation used by this PR.
However, this causes several tests to fail due to floating-point rounding issues (the differences are < 0.000001)

We can either:

update the expected values
keep 2 separate implementations for now

Since we are trying to get this out the door to see our improvements on prod next week I am proposing (2).
Updating the last commit accordingly

Keeping separate implementations of the haversine formula for now, since the expected values in all the tests rely on python math calculations and the numpy calculations are slightly off in floating point precision (< 0.000001 difference) Added timing for /segment_into_trips_dist/loop Co-authored-by: TeachMeTW <[email protected]>

Instead of DwellSegmentationTimeFilter and DwellSegmentationDistFilter accepting the timeseries and timequery as parameters, and each retrieving transition_df and motion_df that way, we can retrieve those dataframes upstream (in trip_segmentation.py) and pass them through. This will work exactly the same but is a bit simpler in structure

JGreenlee added 2 commits February 7, 2025 14:11

draft of new DwellSegmentationTimeFilter

ae1b00c

fix bugs in draft of new DwellSegmentationTimeFilter

939d576

TeachMeTW reviewed Feb 7, 2025

View reviewed changes

JGreenlee mentioned this pull request Feb 7, 2025

Pipeline Optimization Strategies e-mission/e-mission-docs#1105

Open

JGreenlee added 2 commits February 7, 2025 22:02

fix to allow zero-duration places

36a974e

rename variables for clarity

fe6035a

JGreenlee added 2 commits February 10, 2025 00:43

unify 'recent_points_diffs' into one column

adef057

update log statements

266a5d1

JGreenlee changed the title ~~draft of new DwellSegmentationTimeFilter~~ 🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation Feb 11, 2025

TeachMeTW added 2 commits February 12, 2025 10:42

JGreenlee force-pushed the vectorized_segmentation branch from 68df978 to 0e8d0d7 Compare February 13, 2025 02:17

Modified comments and removed unused variable -- tests passing

d8392fe

shankari mentioned this pull request Feb 14, 2025

"End Trip + Sync" not working on iOS e-mission/e-mission-docs#1112

Open

JGreenlee force-pushed the vectorized_segmentation branch from 957a9bb to 53e18c9 Compare February 14, 2025 18:01

JGreenlee and others added 2 commits February 14, 2025 13:24

JGreenlee force-pushed the vectorized_segmentation branch from 53e18c9 to 911c0f9 Compare February 14, 2025 18:31

JGreenlee marked this pull request as ready for review February 14, 2025 18:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

JGreenlee commented Feb 7, 2025

TeachMeTW Feb 7, 2025

JGreenlee Feb 7, 2025

JGreenlee commented Feb 7, 2025

JGreenlee commented Feb 7, 2025 •

edited

Loading

JGreenlee commented Feb 7, 2025 •

edited

Loading

JGreenlee commented Feb 8, 2025

JGreenlee commented Feb 10, 2025 •

edited

Loading

JGreenlee commented Feb 13, 2025

JGreenlee commented Feb 14, 2025

🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

Are you sure you want to change the base?

🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

Conversation

JGreenlee commented Feb 7, 2025

TeachMeTW Feb 7, 2025

Choose a reason for hiding this comment

JGreenlee Feb 7, 2025

Choose a reason for hiding this comment

JGreenlee commented Feb 7, 2025

JGreenlee commented Feb 7, 2025 • edited Loading

JGreenlee commented Feb 7, 2025 • edited Loading

JGreenlee commented Feb 8, 2025

JGreenlee commented Feb 10, 2025 • edited Loading

JGreenlee commented Feb 13, 2025

JGreenlee commented Feb 14, 2025

JGreenlee commented Feb 7, 2025 •

edited

Loading

JGreenlee commented Feb 7, 2025 •

edited

Loading

JGreenlee commented Feb 10, 2025 •

edited

Loading