Add options for custom aggregation frequencies #21

Merged
merged 3 commits into chihacknight:main from custom-agg-frequencies on Oct 11, 2022

Conversation

dcjohnson24 (Collaborator)

Description

This pull request adds the option of resampling by various frequencies. This will enable different ways of aggregating the data for studying trends over time. It might also help in understanding why some routes have ratios of completed to scheduled trips greater than 1 (#19).

Resolves #12

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Informal test

Daily aggregation using the string 'D' in pandas resampling equals the original groupby method:

In [58]: orig = rt.groupby(by=['date', 'route_id'])['trip_count'].sum().reset_index()

In [59]: new = (rt.set_index(['date', 'route_id'])
    ...:             .groupby(
    ...:                 [pd.Grouper(level='date', freq='D'),
    ...:                  pd.Grouper(level='route_id')])['trip_count']
    ...:             .sum().reset_index()
    ...:         )

In [60]: orig.equals(new)
Out[60]: True
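
The same pd.Grouper pattern extends to the other frequencies; a minimal sketch, assuming the same rt DataFrame as above and the freq strings used in the findings below:

import pandas as pd

# Weekly (weeks ending Monday) and monthly trip counts per route,
# built exactly like the daily check above but with a different freq.
weekly = (rt.set_index(['date', 'route_id'])
            .groupby([pd.Grouper(level='date', freq='W-MON'),
                      pd.Grouper(level='route_id')])['trip_count']
            .sum().reset_index())

monthly = (rt.set_index(['date', 'route_id'])
             .groupby([pd.Grouper(level='date', freq='M'),
                       pd.Grouper(level='route_id')])['trip_count']
             .sum().reset_index())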

Findings

Some findings from the reaggregated data

schedule_feeds = [
        {
            "schedule_version": "20220507",
            "feed_start_date": "2022-05-20",
            "feed_end_date": "2022-06-02",
        },
        {
            "schedule_version": "20220603",
            "feed_start_date": "2022-06-04",
            "feed_end_date": "2022-06-07",
        },
        {
            "schedule_version": "20220608",
            "feed_start_date": "2022-06-09",
            "feed_end_date": "2022-07-08",
        },
        {
            "schedule_version": "20220709",
            "feed_start_date": "2022-07-10",
            "feed_end_date": "2022-07-17",
        },
        {
            "schedule_version": "20220718",
            "feed_start_date": "2022-07-19",
            "feed_end_date": "2022-07-20",
        },
    ]
In [15]: daily_info = AggInfo(freq='D')

In [16]: weekly_info = AggInfo(freq='W-MON')

In [17]: monthly_info = AggInfo(freq='M')

In [18]: output_list = []

In [19]: for info in [daily_info, weekly_info, monthly_info]:
    ...:     output_list.append(combine_real_time_rt_comparison(schedule_feeds, agg_info=info, save=False))
    ...: 
(tqdm progress bars and INFO:root:Processing log lines for schedule versions 20220507, 20220603, 20220608, 20220709, and 20220718, repeated once per aggregation frequency, omitted)

Here are the distributions of ratio in the combined DataFrame of the various schedule versions.

# Daily
In [22]: for df in output_list:
    ...:     print(df.ratio.describe())
    ...: 
count    1530.000000
mean        0.806123
std         0.142894
min         0.210526
25%         0.705431
50%         0.809052
75%         0.916225
max         1.225806
Name: ratio, dtype: float64

# Weekly
count    738.000000
mean       0.493180
std        0.228549
min        0.096552
25%        0.278797
50%        0.477403
75%        0.662779
max        1.051724
Name: ratio, dtype: float64

# Monthly
count    861.000000
mean       0.242437
std        0.205531
min        0.042088
25%        0.101590
50%        0.188032
75%        0.261278
max        0.920404
Name: ratio, dtype: float64

And the summary across all schedule versions

In [23]: summary_output = []

In [24]: for o in output_list:
    ...:     summary_output.append(build_summary(o, save=False))
    ...: 

In [25]: for df in summary_output:
    ...:     print(df.ratio.describe())
    ...: 
# Daily
count    423.000000
mean       0.799969
std        0.134382
min        0.210526
25%        0.711802
50%        0.798822
75%        0.901876
max        1.134615
Name: ratio, dtype: float64

# Weekly
count    246.000000
mean       0.655958
std        0.196121
min        0.289908
25%        0.486671
50%        0.593121
75%        0.824934
max        1.051724
Name: ratio, dtype: float64

# Monthly
count    369.000000
mean       0.249274
std        0.058417
min        0.118730
25%        0.204863
50%        0.247737
75%        0.292998
max        0.406403
Name: ratio, dtype: float64

Observations

For the monthly aggregation, the max ratio in the summary across schedule versions is notably lower than the max ratio of the combined individual schedule versions (0.41 vs. 0.92).

The median ratio in the summary across schedule versions drops from 0.8 in the daily aggregation to 0.59 in the weekly aggregation and 0.25 in the monthly aggregation.

Rows with ratio > 1 remain in the daily and weekly aggregations but disappear in the monthly aggregation.

The routes with ratio > 1 are mostly on weekends and holidays.

In [30]: for s in summary_output:
    ...:     print(s.day_type.loc[s.ratio > 1].value_counts(normalize=True))
    ...: 
# Daily
hol    0.380952
sun    0.285714
wk     0.190476
sat    0.142857
Name: day_type, dtype: float64

# Weekly
hol    1.0
Name: day_type, dtype: float64

# Monthly
Series([], Name: day_type, dtype: float64)

The weekday percentage of ratio > 1 is slightly higher for the individual schedule versions but still in the minority.

In [31]: for o in output_list:
    ...:     print(o.day_type.loc[o.ratio > 1].value_counts(normalize=True))
    ...: 
# Daily
wk     0.320388
sun    0.291262
sat    0.223301
hol    0.165049
Name: day_type, dtype: float64

# Weekly
hol    1.0
Name: day_type, dtype: float64

# Monthly
Series([], Name: day_type, dtype: float64)

In each of the aggregations in the summary across schedule versions, the proportion of rows with ratio > 1 is less than 5 percent.

In [32]: for s in summary_output:
    ...:     print((s.ratio > 1).sum() / len(s))
    ...: 
# Daily
0.04964539007092199

# Weekly
0.028455284552845527

# Monthly
0.0

In the aggregations of individual schedule versions, the proportion is less than 7 percent.

In [33]: for o in output_list:
    ...:     print((o.ratio > 1).sum() / len(o))
    ...: 
# Daily
0.0673202614379085

# Weekly
0.009485094850948509

# Monthly
0.0

@dcjohnson24 (Collaborator, Author)

Using bus_full_day_data instead of bus_hourly_summary makes the problem of trip ratios greater than one disappear.

my_summary = compare_scheduled_and_rt.main(freq='D')
In [15]: my_summary.ratio.describe()
Out[15]: 
count    423.000000
mean       0.476454
std        0.133097
min        0.210526
25%        0.379220
50%        0.447073
75%        0.557457
max        0.833333
Name: ratio, dtype: float64

Evidence of double counting of trips during the hourly aggregation:

import pandas as pd

bucket = 's3://chn-ghost-buses-public'
full_day = 'bus_full_day_data_v2/2022-05-20.csv'

df = pd.read_csv(f"{bucket}/{full_day}")

# Unique trips, blocks, and vehicles per route over the whole day...
daily_summary = df.groupby(['data_date', 'rt', 'des']).agg(
    {'tatripid': set, 'tablockid': set, 'vid': set}).reset_index()

# ...and per hour.
hourly_summary = (
    df.groupby(["data_date", "data_hour", "rt", "des"])
    .agg({"vid": set, "tatripid": set, "tablockid": set})
    .reset_index()
)

daily_summary = create_cols(daily_summary)
hourly_summary = create_cols(hourly_summary)

# Check whether hourly summaries have larger totals
daily_from_hourly = hourly_summary.groupby(['data_date', 'rt', 'des']).sum().reset_index()
compare_cols = ['rt', 'des', 'vh_count', 'trip_count', 'block_count']
num_rts_less = (daily_summary[compare_cols] < daily_from_hourly[compare_cols])['trip_count'].sum()
print(num_rts_less / daily_summary.shape[0])
0.9193954659949622
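
The mechanism: a trip that spans more than one hour shows up in several hourly groups, so summing the per-hour unique counts overstates the daily total. A toy illustration with made-up data (hypothetical, not from the bucket):

import pandas as pd

# Trip T1 is observed in two different hours; trip T2 in only one.
toy = pd.DataFrame({
    'data_date': ['2022-05-20'] * 3,
    'data_hour': [8, 9, 9],
    'rt': ['55'] * 3,
    'tatripid': ['T1', 'T1', 'T2'],
})

daily = toy.groupby('data_date')['tatripid'].nunique()
hourly = toy.groupby(['data_date', 'data_hour'])['tatripid'].nunique()
daily_from_hourly = hourly.groupby(level='data_date').sum()

print(daily.iloc[0], daily_from_hourly.iloc[0])  # 2 3 -> T1 is counted twice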


Review comment on this hunk of the diff:

if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
@dcjohnson24 (Collaborator, Author) on Oct 7, 2022

Is it worth it to start saving the daily equivalent of these files to s3? Something like f"s3://{BUCKET}/schedule_summaries/route_dir_level/schedule_route_dir_daily_summary_v{VERSION_ID}.csv"?

Member:

Yes, I think we should, but I can tackle that (or I can look into giving you write access to the S3 bucket). Either way I'd be inclined to merge this PR as-is and then handle.
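
For reference, the proposed save could follow the same pattern as the hunk quoted above; a sketch only, using the BUCKET and VERSION_ID placeholders from the comment and assuming s3fs is installed so pandas can write to s3:// paths directly:

if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
        f"schedule_route_dir_daily_summary_v{VERSION_ID}.csv",
        index=False,
    )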

@dcjohnson24 dcjohnson24 merged commit 165d830 into chihacknight:main Oct 11, 2022
@dcjohnson24 dcjohnson24 deleted the custom-agg-frequencies branch October 11, 2022 20:19
haileyplusplus pushed a commit to haileyplusplus/chn-ghost-buses that referenced this pull request on Apr 1, 2024: Add options for custom aggregation frequencies