Add options for custom aggregation frequencies #21

Merged
merged 3 commits into chihacknight:main from custom-agg-frequencies on Oct 11, 2022

Conversation

dcjohnson24 (Collaborator)

Description

This pull request adds the option of resampling by various frequencies. This will enable different ways of aggregating the data for studying trends over time. It might also help in understanding why some routes have ratios of completed to scheduled trips greater than 1 (#19).

Resolves #12

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Informal test

Daily aggregation using the string 'D' in pandas resampling equals the original groupby method:

In [58]: orig = rt.groupby(by=['date', 'route_id'])['trip_count'].sum().reset_index()

In [59]: new = (rt.set_index(['date', 'route_id'])
    ...:             .groupby(
    ...:                 [pd.Grouper(level='date', freq='D'),
    ...:                  pd.Grouper(level='route_id')])['trip_count']
    ...:             .sum().reset_index()
    ...:         )

In [60]: orig.equals(new)
Out[60]: True
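
The same pd.Grouper pattern extends to the other frequencies; a minimal sketch, assuming the same rt DataFrame as above and the freq strings used in the findings below:

import pandas as pd

# Weekly (weeks ending Monday) and monthly trip counts per route,
# built exactly like the daily check above but with a different freq.
weekly = (rt.set_index(['date', 'route_id'])
            .groupby([pd.Grouper(level='date', freq='W-MON'),
                      pd.Grouper(level='route_id')])['trip_count']
            .sum().reset_index())

monthly = (rt.set_index(['date', 'route_id'])
             .groupby([pd.Grouper(level='date', freq='M'),
                       pd.Grouper(level='route_id')])['trip_count']
             .sum().reset_index())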

Findings

Some findings from the reaggregated data

schedule_feeds = [
        {
            "schedule_version": "20220507",
            "feed_start_date": "2022-05-20",
            "feed_end_date": "2022-06-02",
        },
        {
            "schedule_version": "20220603",
            "feed_start_date": "2022-06-04",
            "feed_end_date": "2022-06-07",
        },
        {
            "schedule_version": "20220608",
            "feed_start_date": "2022-06-09",
            "feed_end_date": "2022-07-08",
        },
        {
            "schedule_version": "20220709",
            "feed_start_date": "2022-07-10",
            "feed_end_date": "2022-07-17",
        },
        {
            "schedule_version": "20220718",
            "feed_start_date": "2022-07-19",
            "feed_end_date": "2022-07-20",
        },
    ]
In [15]: daily_info = AggInfo(freq='D')

In [16]: weekly_info = AggInfo(freq='W-MON')

In [17]: monthly_info = AggInfo(freq='M')

In [18]: output_list = []

In [19]: for info in [daily_info, weekly_info, monthly_info]:
    ...:     output_list.append(combine_real_time_rt_comparison(schedule_feeds, agg_info=info, save=False))
    ...: 
(tqdm progress bars and INFO:root:Processing log lines for schedule versions 20220507, 20220603, 20220608, 20220709, and 20220718, repeated once per aggregation frequency, omitted)

Here are the distributions of ratio in the combined DataFrame of the various schedule versions.

# Daily
In [22]: for df in output_list:
    ...:     print(df.ratio.describe())
    ...: 
count    1530.000000
mean        0.806123
std         0.142894
min         0.210526
25%         0.705431
50%         0.809052
75%         0.916225
max         1.225806
Name: ratio, dtype: float64

# Weekly
count    738.000000
mean       0.493180
std        0.228549
min        0.096552
25%        0.278797
50%        0.477403
75%        0.662779
max        1.051724
Name: ratio, dtype: float64

# Monthly
count    861.000000
mean       0.242437
std        0.205531
min        0.042088
25%        0.101590
50%        0.188032
75%        0.261278
max        0.920404
Name: ratio, dtype: float64

And the summary across all schedule versions

In [23]: summary_output = []

In [24]: for o in output_list:
    ...:     summary_output.append(build_summary(o, save=False))
    ...: 

In [25]: for df in summary_output:
    ...:     print(df.ratio.describe())
    ...: 
# Daily
count    423.000000
mean       0.799969
std        0.134382
min        0.210526
25%        0.711802
50%        0.798822
75%        0.901876
max        1.134615
Name: ratio, dtype: float64

# Weekly
count    246.000000
mean       0.655958
std        0.196121
min        0.289908
25%        0.486671
50%        0.593121
75%        0.824934
max        1.051724
Name: ratio, dtype: float64

# Monthly
count    369.000000
mean       0.249274
std        0.058417
min        0.118730
25%        0.204863
50%        0.247737
75%        0.292998
max        0.406403
Name: ratio, dtype: float64

Observations

For the monthly aggregation, the max ratio in the summary across schedule versions is notably lower than the max ratio of the combined individual schedule versions (0.41 vs. 0.92).

The median ratio in the summary across schedule versions drops from 0.8 in the daily aggregation to 0.59 in the weekly aggregation and 0.25 in the monthly aggregation.

Rows with ratio > 1 remain in the daily and weekly aggregations but disappear in the monthly aggregation.

The routes with ratio > 1 are mostly on weekends and holidays.

In [30]: for s in summary_output:
    ...:     print(s.day_type.loc[s.ratio > 1].value_counts(normalize=True))
    ...: 
# Daily
hol    0.380952
sun    0.285714
wk     0.190476
sat    0.142857
Name: day_type, dtype: float64

# Weekly
hol    1.0
Name: day_type, dtype: float64

# Monthly
Series([], Name: day_type, dtype: float64)

The weekday percentage of ratio > 1 is slightly higher for the individual schedule versions but still in the minority.

In [31]: for o in output_list:
    ...:     print(o.day_type.loc[o.ratio > 1].value_counts(normalize=True))
    ...: 
# Daily
wk     0.320388
sun    0.291262
sat    0.223301
hol    0.165049
Name: day_type, dtype: float64

# Weekly
hol    1.0
Name: day_type, dtype: float64

# Monthly
Series([], Name: day_type, dtype: float64)

In each of the aggregations in the summary across schedule versions, the proportion of rows with ratio > 1 is less than 5 percent.

In [32]: for s in summary_output:
    ...:     print((s.ratio > 1).sum() / len(s))
    ...: 
# Daily
0.04964539007092199

# Weekly
0.028455284552845527

# Monthly
0.0

In the aggregations of individual schedule versions, the proportion is less than 7 percent.

In [33]: for o in output_list:
    ...:     print((o.ratio > 1).sum() / len(o))
    ...: 
# Daily
0.0673202614379085

# Weekly
0.009485094850948509

# Monthly
0.0

@dcjohnson24 (Collaborator, Author)

Using bus_full_day_data instead of bus_hourly_summary makes the problem of trip ratios greater than one disappear.

my_summary = compare_scheduled_and_rt.main(freq='D')
In [15]: my_summary.ratio.describe()
Out[15]: 
count    423.000000
mean       0.476454
std        0.133097
min        0.210526
25%        0.379220
50%        0.447073
75%        0.557457
max        0.833333
Name: ratio, dtype: float64

Evidence of double counting of trips during the hourly aggregation:

import pandas as pd

bucket = 's3://chn-ghost-buses-public'
full_day = 'bus_full_day_data_v2/2022-05-20.csv'

df = pd.read_csv(f"{bucket}/{full_day}")

# Unique trips, blocks, and vehicles per route over the whole day...
daily_summary = df.groupby(['data_date', 'rt', 'des']).agg(
    {'tatripid': set, 'tablockid': set, 'vid': set}).reset_index()

# ...and per hour.
hourly_summary = (
    df.groupby(["data_date", "data_hour", "rt", "des"])
    .agg({"vid": set, "tatripid": set, "tablockid": set})
    .reset_index()
)

daily_summary = create_cols(daily_summary)
hourly_summary = create_cols(hourly_summary)

# Check whether hourly summaries have larger totals
daily_from_hourly = hourly_summary.groupby(['data_date', 'rt', 'des']).sum().reset_index()
compare_cols = ['rt', 'des', 'vh_count', 'trip_count', 'block_count']
num_rts_less = (daily_summary[compare_cols] < daily_from_hourly[compare_cols])['trip_count'].sum()
print(num_rts_less / daily_summary.shape[0])
0.9193954659949622
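
The mechanism: a trip that spans more than one hour shows up in several hourly groups, so summing the per-hour unique counts overstates the daily total. A toy illustration with made-up data (hypothetical, not from the bucket):

import pandas as pd

# Trip T1 is observed in two different hours; trip T2 in only one.
toy = pd.DataFrame({
    'data_date': ['2022-05-20'] * 3,
    'data_hour': [8, 9, 9],
    'rt': ['55'] * 3,
    'tatripid': ['T1', 'T1', 'T2'],
})

daily = toy.groupby('data_date')['tatripid'].nunique()
hourly = toy.groupby(['data_date', 'data_hour'])['tatripid'].nunique()
daily_from_hourly = hourly.groupby(level='data_date').sum()

print(daily.iloc[0], daily_from_hourly.iloc[0])  # 2 3 -> T1 is counted twice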


Review comment on this hunk of the diff:

if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
@dcjohnson24 (Collaborator, Author) on Oct 7, 2022

Is it worth it to start saving the daily equivalent of these files to s3? Something like f"s3://{BUCKET}/schedule_summaries/route_dir_level/schedule_route_dir_daily_summary_v{VERSION_ID}.csv"?

Member:

Yes, I think we should, but I can tackle that (or I can look into giving you write access to the S3 bucket). Either way I'd be inclined to merge this PR as-is and then handle.
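
For reference, the proposed save could follow the same pattern as the hunk quoted above; a sketch only, using the BUCKET and VERSION_ID placeholders from the comment and assuming s3fs is installed so pandas can write to s3:// paths directly:

if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
        f"schedule_route_dir_daily_summary_v{VERSION_ID}.csv",
        index=False,
    )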

@dcjohnson24 dcjohnson24 merged commit 165d830 into chihacknight:main Oct 11, 2022
@dcjohnson24 dcjohnson24 deleted the custom-agg-frequencies branch October 11, 2022 20:19
haileyplusplus pushed a commit to haileyplusplus/chn-ghost-buses that referenced this pull request on Apr 1, 2024: Add options for custom aggregation frequencies