Add options for custom aggregation frequencies #21
Conversation
Using my_summary = compare_scheduled_and_rt.main(freq='D')
In [15]: my_summary.ratio.describe()
Out[15]:
count 423.000000
mean 0.476454
std 0.133097
min 0.210526
25% 0.379220
50% 0.447073
75% 0.557457
max 0.833333
Name: ratio, dtype: float64

Evidence of double counting of trips during the hourly aggregation:

import pandas as pd  # reading directly from s3 paths also requires s3fs

bucket = 's3://chn-ghost-buses-public'
full_day = 'bus_full_day_data_v2/2022-05-20.csv'
df = pd.read_csv(f"{bucket}/{full_day}")
# Unique trips, blocks, and vehicles per route/destination, by day and by hour
daily_summary = df.groupby(['data_date', 'rt', 'des']).agg(
    {'tatripid': set, 'tablockid': set, 'vid': set}).reset_index()
hourly_summary = (
    df.groupby(["data_date", "data_hour", "rt", "des"])
    .agg({"vid": set, "tatripid": set, "tablockid": set})
    .reset_index()
)
# create_cols is a project helper that (presumably) derives vh_count,
# trip_count, and block_count from the aggregated sets
daily_summary = create_cols(daily_summary)
hourly_summary = create_cols(hourly_summary)
# Check whether hourly summaries have larger totals than the daily summary
daily_from_hourly = hourly_summary.groupby(['data_date', 'rt', 'des']).sum().reset_index()
compare_cols = ['rt', 'des', 'vh_count', 'trip_count', 'block_count']
num_rts_less = (daily_summary[compare_cols] < daily_from_hourly[compare_cols])['trip_count'].sum()
print(num_rts_less / daily_summary.shape[0])
0.9193954659949622
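This is consistent with trips, blocks, and vehicles that are active in more than one hour being counted once per hour when the hourly counts are summed. A minimal sketch of an alternative (illustrative only, not part of this PR), assuming `hourly_summary` still carries the raw set columns from the aggregation above, is to union the hourly sets before counting:

```python
def union_sets(series):
    """Union a group of per-hour sets into a single per-day set."""
    return set().union(*series)

daily_from_hourly_sets = (
    hourly_summary.groupby(['data_date', 'rt', 'des'])
    .agg({'vid': union_sets, 'tatripid': union_sets, 'tablockid': union_sets})
    .reset_index()
)
# Unique daily counts derived from the unioned sets (column names assumed)
daily_from_hourly_sets['trip_count'] = daily_from_hourly_sets['tatripid'].apply(len)
daily_from_hourly_sets['block_count'] = daily_from_hourly_sets['tablockid'].apply(len)
daily_from_hourly_sets['vh_count'] = daily_from_hourly_sets['vid'].apply(len)
```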
if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
Is it worth it to start saving the daily equivalent of these files to s3? Something like f"s3://{BUCKET}/schedule_summaries/route_dir_level/schedule_route_dir_daily_summary_v{VERSION_ID}.csv"?
Yes, I think we should, but I can tackle that (or I can look into giving you write access to the S3 bucket). Either way I'd be inclined to merge this PR as-is and then handle it afterward.
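For reference, a minimal sketch of what that save could look like (illustrative only; it mirrors the save pattern in the diff above, uses the key name suggested by the reviewer, and assumes pandas has s3fs available for writing to s3:// paths):

```python
# Illustrative sketch, not the PR's code: save the daily route/direction
# summary next to the existing files, using the suggested key name.
if save:
    route_dir_daily_summary.to_csv(
        f"s3://{BUCKET}/schedule_summaries/route_dir_level/"
        f"schedule_route_dir_daily_summary_v{VERSION_ID}.csv",
        index=False,
    )
```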
Description
This pull request adds the option of resampling by various frequencies. This will enable different ways of aggregating the data for studying trends over time. It might also help in understanding why some routes have completed-to-scheduled trip ratios greater than 1 (#19).
Resolves #12
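As an illustration of the general approach (not the PR's actual implementation; the function and column names below are assumptions), pandas frequency aliases such as 'D', 'W', and 'M' can drive the aggregation through pd.Grouper:

```python
import pandas as pd

def summarize_by_freq(df: pd.DataFrame, freq: str = 'D') -> pd.DataFrame:
    """Sum trip counts per route at the requested frequency.

    Illustrative sketch only; assumes a datetime 'date' column, a 'rt'
    (route) column, and a 'trip_count' column.
    """
    return (
        df.groupby([pd.Grouper(key='date', freq=freq), 'rt'])['trip_count']
        .sum()
        .reset_index()
    )

# e.g. daily, weekly, and monthly aggregations
daily = summarize_by_freq(df, 'D')
weekly = summarize_by_freq(df, 'W')
monthly = summarize_by_freq(df, 'M')
```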
Type of change
How has this been tested?
Informal test
Daily aggregation using the string 'D' in Pandas resampling equals the original groupby method.
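A sketch of how such an equivalence check might look (hypothetical column names; this is not the repo's actual test code):

```python
import pandas as pd

# Illustrative only: check that daily frequency aggregation matches the
# original groupby. Assumes df has a datetime 'date' column plus 'rt' and
# 'trip_count' columns, and that every calendar day in the range has data
# (so both paths produce the same index).
by_freq = (
    df.groupby([pd.Grouper(key='date', freq='D'), 'rt'])['trip_count']
    .sum()
    .sort_index()
)
by_day = (
    df.groupby([df['date'].dt.normalize(), 'rt'])['trip_count']
    .sum()
    .sort_index()
)
pd.testing.assert_series_equal(by_freq, by_day, check_names=False)
```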
Findings
Some findings from the reaggregated data.
Here are the distributions of `ratio` in the combined DataFrame of the various schedule versions, and the summary across all schedule versions.
Observations
- For the monthly aggregation, the max `ratio` in the summary across schedule versions is notably lower than the max `ratio` of the combined individual schedules, e.g. 0.41 vs 0.92.
- The median `ratio` in the summary across schedule versions drops from 0.8 in the daily aggregation to 0.59 in the weekly aggregation and 0.25 in the monthly aggregation.
- `ratio` > 1 remains in the aggregations, except for the monthly aggregation.
- The routes with `ratio` > 1 are mostly on weekends and holidays.
- The weekday percentage of `ratio` > 1 is slightly higher for the individual schedule versions but still in the minority.
- In each of the aggregations in the summary across schedule versions, the proportion of rows with `ratio` > 1 is less than 5 percent; in the aggregations of individual schedule versions, it is less than 7 percent (a small sketch of this computation follows the list).
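A hedged sketch of how such a proportion would be computed (`combined` and the `ratio` column name are assumptions for illustration):

```python
# Share of rows where completed trips exceed scheduled trips (ratio > 1)
share_over_one = (combined['ratio'] > 1).mean()
print(f"{share_over_one:.1%}")
```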