plot_utils.bin_and_mean

plot_utils.bin_and_mean(xdata, ydata, bins=10, distribution='normal', show_fig=True, fig=None, ax=None, figsize=(4,4), dpi=100, show_bins=True, raw_data_label='raw', mean_data_label='average', xlabel='x', ylabel='y', logx=False, logy=False, grid_on=True, show_legend=True):

Calculates bin-and-mean results and shows the bin-and-mean plot (optional).

A "bin-and-mean" plot is a clearer way to show how ydata depends on xdata. The data points (xdata, ydata) are divided into groups according to their x values (via the "bins" argument). Within each group, the mean values of x and y are calculated and taken as the representative x and y values of that group.
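The grouping step described above can be sketched with plain NumPy. This is a minimal illustration, not the plot_utils implementation: the helper name `bin_and_mean_sketch`, the use of equal-width bins, and the choice to skip empty bins are all assumptions made for the example.

```python
import numpy as np

def bin_and_mean_sketch(x, y, n_bins=10):
    """Hypothetical helper: group points by x into equal-width bins
    and return the per-bin arithmetic means of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Digitize against interior edges only, so bin indices run from
    # 0 to n_bins - 1 and the maximum x falls into the last bin.
    idx = np.digitize(x, edges[1:-1])
    x_mean, y_mean = [], []
    for k in range(n_bins):
        mask = idx == k
        if mask.any():  # skip empty bins (an assumption of this sketch)
            x_mean.append(x[mask].mean())
            y_mean.append(y[mask].mean())
    return np.array(x_mean), np.array(y_mean)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
x_mean, y_mean = bin_and_mean_sketch(x, y, n_bins=2)
```

With two bins, the six points split into the groups x = 1, 2, 3 and x = 4, 5, 6, and each group is replaced by a single representative point.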

"Bin-and-mean" is most useful when the data points are highly skewed (e.g., many data points where x is small, but very few where x is large). The data points at large x are usually not noise, and can be even more valuable (think of the case where x is earthquake magnitude and y is the related economic loss). To study the relationship between economic loss and earthquake magnitude, we bin-and-mean the raw data and draw conclusions from the mean data points.

The method rests on the assumption that data points with similar x values follow the same distribution. By default, we assume the y values within a bin are normally distributed, so y_mean is their arithmetic mean. Another common assumption is that they follow a log-normal distribution (appropriate when all y values are positive), in which case y_mean is the expected value of the log-normal distribution. In either case, x_mean for each bin is simply the arithmetic mean.

Note: For log-normal, the expected value of y is:
                E(Y) = exp(mu + (1/2)*sigma^2)
      where mu and sigma are the two parameters of the distribution, i.e.,
      the mean and standard deviation of log(Y).
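As a worked illustration of the formula above, the sketch below estimates mu and sigma from the logarithms of the y values in one bin and then evaluates E(Y). The function name `lognormal_mean` is hypothetical, and using the population standard deviation (ddof=0) is an assumption of this example; the source does not say which estimator plot_utils uses.

```python
import numpy as np

def lognormal_mean(y):
    """Hypothetical helper: estimate E(Y) = exp(mu + sigma^2 / 2),
    with mu and sigma taken from log(y). Requires all y > 0."""
    log_y = np.log(np.asarray(y, dtype=float))
    mu = log_y.mean()
    sigma = log_y.std()  # population std (ddof=0); an assumption here
    return np.exp(mu + 0.5 * sigma**2)

# One bin whose log-values are exactly 0, 1 and 2
y_bin = np.array([1.0, np.e, np.e**2])
y_mean_lognormal = lognormal_mean(y_bin)   # exp(1 + (2/3)/2) = exp(4/3)
y_mean_arithmetic = y_bin.mean()           # plain arithmetic mean, for comparison
```

For this bin the log-normal estimate differs from the arithmetic mean, which is why the choice of `distribution` affects y_mean but not x_mean.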

[Parameters]

xdata, ydata : <array_like>
    Raw x and y data points (with the same length). Can be pandas Series or
    numpy arrays.
bins : <int> or <array_like>
    Number of bins (an integer), or an array representing the actual bin
    edges. Note that the binning is done according to the x values.
distribution : <str>
    Specifies which distribution the y values within a bin follow. Use
    'lognormal' if you want to assert all positive y values. Only supports
    normal and log-normal distributions at this time.
show_fig : <bool>
    Whether or not to show a bin-and-mean plot
fig, ax : <mpl.figure.Figure>, <mpl.axes._subplots.AxesSubplot>
    Figure and axes objects.
    If provided, the graph is plotted on the provided figure and
    axes. If not, a new figure and new axes are created.
figsize : tuple of two scalars
    Size (width, height) of figure in inches. (A fig object passed via "fig"
    will override this parameter.)
dpi : scalar
    Screen resolution. (A fig object passed via "fig" will override
    this parameter.)
show_bins : <bool>
    Whether or not to show the bin edges as vertical lines on the plots
raw_data_label, mean_data_label : <str>
    Two strings that specify the names of the raw data and the averaged
    data, respectively, such as "raw data" and "averaged data". Ignored
    if show_legend is False.
xlabel, ylabel : <str>
    Labels for x and y axes of the plot
logx, logy : <bool>
    Whether or not to adjust the scales of x and/or y axes to log
grid_on : <bool>
    Whether or not to show the grids
error_bars_on : <bool>
    Whether or not to show error bars (of y values) of each bin
error_shades_on : <bool>
    Whether or not to show error shades (of y values) of each bin; this
    argument overrides error_bars_on
legend_on : <bool>
    Whether or not to show the legend
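
The two accepted forms of the `bins` parameter can be illustrated with NumPy's own binning helpers. This is a sketch only: whether plot_utils builds equal-width bins from an integer count is an assumption here, not something stated in the parameter description.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 8.0, 10.0])

# Integer form: a bin count. Equal-width bins over the range of x
# are assumed in this sketch.
edges_from_count = np.histogram_bin_edges(x, bins=5)

# Array form: the values are used directly as bin edges, which
# allows bins of unequal width (useful for skewed x data).
edges_explicit = np.histogram_bin_edges(x, bins=[0.0, 1.5, 3.0, 10.0])

# Number of x values falling into each explicitly defined bin
counts, _ = np.histogram(x, bins=edges_explicit)
```

Explicit edges let sparse regions at large x share one wide bin, so each bin mean is based on more than a single point.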

[Returns]

fig, ax :
    Figure and axes objects
x_mean, y_mean : <np.ndarray>
    Mean values of x and y for each data group (i.e., "bin")
y_std : <np.ndarray>
    Standard deviation of y for each data group (i.e., "bin")