Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation #1017

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

JGreenlee
Copy link
Contributor

Comment on lines 32 to 40
def haversine(lon1, lat1, lon2, lat2):
earth_radius = 6371000 # meters
lat1, lat2 = np.radians(lat1), np.radians(lat2)
lon1, lon2 = np.radians(lon1), np.radians(lon2)

dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
return 2 * earth_radius * np.arcsin(np.sqrt(a))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JGreenlee Why don't we modify the existing:

def calDistance(point1, point2, coordinates=False):
    """haversine distance

    :param point1: a coordinate in degrees WGS84
    :param point2: another coordinate in degrees WGS84
    :param coordinates: if false, expect a list of coordinates, defaults to False
    :return: distance approximately in meters
    """
    earthRadius = 6371000  # meters
    # SHANKARI: Why do we have two calDistance() functions?
    # Need to combine into one
    # points are now in geojson format (lng,lat)
    if coordinates:
        dLat = math.radians(point1.lat-point2.lat)
        dLon = math.radians(point1.lon-point2.lon)
        lat1 = math.radians(point1.lat)
        lat2 = math.radians(point2.lat)
    else:
        dLat = math.radians(point1[1]-point2[1])
        dLon = math.radians(point1[0]-point2[0])
        lat1 = math.radians(point1[1])
        lat2 = math.radians(point2[1])


    a = (math.sin(dLat/2) ** 2) + ((math.sin(dLon/2) ** 2) * math.cos(lat1) * math.cos(lat2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = earthRadius * c

    return d

in emission core?

If we want points vs lat long explicit why dont we do something like:

def haversine(*args, coordinates=False):
    """
    Calculate the Haversine distance between two points.

    There are two usage modes:

    1. Points mode: Supply two points.
       - If `coordinates=True`, the points are expected to have attributes `lat` and `lon`.
         Example: haversine(point1, point2, coordinates=True)
       - Otherwise, the points are expected to be sequences (e.g. lists or tuples)
         Example: haversine([lon, lat], [lon, lat])

    2. LatLong mode: Supply four numeric values:
         lon1, lat1, lon2, lat2
       Example: haversine(lon1, lat1, lon2, lat2)

    :returns: Distance in meters
    """
    earth_radius = 6371000  # meters

    if len(args) == 2:
        # Points mode
        p1, p2 = args
        if coordinates:
            lat1, lon1 = p1.latitude, p1.lon
            lat2, lon2 = p2.lat, p2.lon
        else:
            lon1, lat1 = p1[0], p1[1]
            lon2, lat2 = p2[0], p2[1]
    elif len(args) == 4:
        # LatLong mode
        lon1, lat1, lon2, lat2 = args
    else:
        raise ValueError("Invalid arguments. Provide either two points or four coordinate values (lon1, lat1, lon2, lat2).")

    # Convert from degrees to radians
    lat1, lat2 = np.radians(lat1), np.radians(lat2)
    lon1, lon2 = np.radians(lon1), np.radians(lon2)

    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return earth_radius * c

Rather than restating it in time filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just put it there for now because we might switch to using other libraries (shapely and/or geopandas) for the distance calculations, depending on whether they are faster

@JGreenlee
Copy link
Contributor Author

I spent much of today addressing edge cases to get this optimized implementation to match master. Left it here for @TeachMeTW to review and will clean it up later

I am assigning @TeachMeTW to create a similarly-optimized DwellSegmentationDistFilter by using this one as an example

@JGreenlee
Copy link
Contributor Author

JGreenlee commented Feb 7, 2025

Only a few tests are failing now (compared to dozens before), and as far as I can tell the only remaining difference in behavior is that master sometimes generates zero-duration places, which are expected in a few of the tests in TestPipelineRealData (testZeroDurationPlaceInterpolationSingleSync, testZeroDurationPlaceInterpolationMultiSync, testJun21)
TestMetricsConfirmedTrips also fails because it also uses shankari_2016-06-21

Seeking clarification on when and why the zero-duration places occur. Does it have something to do with untracked time?

@JGreenlee
Copy link
Contributor Author

JGreenlee commented Feb 7, 2025

I was worried that by adding the complex logic to handle the many edge cases, the performance benefit would start to disappear.

But it still looks pretty good. Trip segmentation consistently runs in about 1/3 the time, regardless of which day I use

image

And this time, both implementations actually generate the same trips:

	start_ts	end_ts
0	1440688739.67	1440689408.30
1	1440689540.70	1440690076.78
2	1440690676.91	1440694424.89
3	1440694936.47	1440699266.67
4	1440699298.74	1440700070.13
5	1440716367.38	1440719699.47
6	1440719879.47	1440723334.90
7	1440728457.49	1440729142.71
8	1469618232.77	1469621795.00
9	1469624459.54	1469624904.51
10	1469625084.51	1469647206.80
11	1469647386.80	1469647720.02
12	1469649282.12	1469654463.95
13	1469654643.95	1469655890.25
14	1469657170.00	1469660118.71
15	1469660263.68	1469660991.59
16	1469666731.00	1469666936.39
17	1469667346.75	1469668756.00
18	1469669000.04	1469670920.00
19	1469670932.17	1469671579.88
20	1469672049.20	1469672977.54
21	1469673888.75	1469674377.54
22	1469674452.53	1469674797.00
23	1469674977.00	1469675656.00
24	1469675860.40	1469677096.61
25	1469677276.61	1469677679.00
26	1469678006.73	1469678966.00
27	1469678994.68	1469679381.00
28	1469679519.26	1469680163.00
29	1469682469.19	1469684686.18
30	1469686277.51	1469686987.01
31	1470341031.23	1470342912.00
32	1470343292.14	1470351348.72
33	1470352238.74	1470352844.00
34	1470354036.96	1470354172.71
35	1470354386.56	1470355132.00
36	1470355612.59	1470355924.25
37	1470356315.84	1470357252.57
38	1470357578.29	1470363534.71
39	1470364485.74	1470364718.35

I think there are still a number of further optimizations that can be considered:

  • Using shapely / geopandas instead of a numpy-based haversine implementation
  • Using one recent_dists instead of having two separate last_10_dists and last_5min_dists
    • not possible without changing the current behavior (I tried)
  • Computing the max distances in batches (for example, by only looking the first 100 non-segemented points) so we waste less time on irrelevant points
  • polars? (now that we are computing more things column-wise)

But we may choose to save those for later if these changes, as well as the last batch of changes, already prove effective at reducing costs

@JGreenlee
Copy link
Contributor Author

Test are passing after a trivial off-by-one fix!

@JGreenlee
Copy link
Contributor Author

JGreenlee commented Feb 10, 2025

Using one recent_dists instead of having two separate last_10_dists and last_5min_dists

I was able to accomplish this, which cut execution time by ~1/2 again.


Charts below show the cumulative improvement since we began optimizing trip segmentation.
Given that trip segmentation does not appear to be a bottleneck anymore and much of the remaining execution time is just from DB calls, I do not think we need to consider any further optimizations here.
Let's just get DwellSegmentationDistFilter to the same level as DwellSegmentationTimeFilter and then push this out

Image Image Image Image

@JGreenlee JGreenlee changed the title draft of new DwellSegmentationTimeFilter 🚀 Vectorized Operations and Efficient Iteration During Trip Segmentation Feb 11, 2025
- Replace row-by-row iterative processing with vectorized computation of time, distance, and speed.
- Compute candidate segmentation points in bulk using thresholds (time, distance, speed) and flags for tracking restarts and motion activity.
- Retain a targeted row-by-row loop to handle huge invalid timestamp offsets on candidate points.
- Partition filtered points into trip segments based on computed candidate indices, and force trip termination based on transition events.
- Note: Tests are failing; will debug issues in subsequent commits.
- Replace row-by-row iterative processing with vectorized computation of time, distance, and speed.
- Compute candidate segmentation points in bulk using thresholds (time, distance, speed) and flags for tracking restarts and motion activity.
- Retain a targeted row-by-row loop to handle huge invalid timestamp offsets on candidate points.
- Partition filtered points into trip segments based on computed candidate indices, and force trip termination based on transition events.
- Note: Tests are failing; will debug issues in subsequent commits.
@JGreenlee JGreenlee force-pushed the vectorized_segmentation branch from 68df978 to 0e8d0d7 Compare February 13, 2025 02:17
@JGreenlee
Copy link
Contributor Author

@TeachMeTW I updated your commit and have gotten the tests to pass, but I am not sure that all the of comments/documentation is still accurate. Can you clean it up?

@JGreenlee
Copy link
Contributor Author

We attempted to unify the existing implementation of calDistance with the NumPy haversine implementation used by this PR.
However, this causes several tests to fail due to floating-point rounding issues (the differences are < 0.000001)

We can either:

  1. update the expected values
  2. keep 2 separate implementations for now

Since we are trying to get this out the door to see our improvements on prod next week I am proposing (2).
Updating the last commit accordingly

@JGreenlee JGreenlee force-pushed the vectorized_segmentation branch from 957a9bb to 53e18c9 Compare February 14, 2025 18:01
JGreenlee and others added 2 commits February 14, 2025 13:24
Keeping separate implementations of the haversine formula for now, since the expected values in all the tests rely on python math calculations and the numpy calculations are slightly off in floating point precision (< 0.000001 difference)

Added timing for /segment_into_trips_dist/loop

Co-authored-by: TeachMeTW <[email protected]>
Instead of DwellSegmentationTimeFilter and DwellSegmentationDistFilter accepting the timeseries and timequery as parameters, and each retrieving transition_df and motion_df that way, we can retrieve those dataframes upstream (in trip_segmentation.py) and pass them through.
This will work exactly the same but is a bit simpler in structure
@JGreenlee JGreenlee force-pushed the vectorized_segmentation branch from 53e18c9 to 911c0f9 Compare February 14, 2025 18:31
@JGreenlee JGreenlee marked this pull request as ready for review February 14, 2025 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready for review by Shankari
Development

Successfully merging this pull request may close these issues.

2 participants