Description
First of all, I think TensorBoard is a really great and important UI for real-time feedback on model training, but I ran into a few issues with the accuracy of the displayed information.
There are two related "sub-issues"; the first is much more important:
- It seems there are cases of distributions for which the resulting histogram bins no longer show any discriminating information. Example:
True distribution, computed offline:
TensorBoard histogram:
The visualization is exactly flat/uniform, which would indicate a serious issue with training the model. However, the actual distribution is more or less Gaussian around 0.97, so the TensorBoard histogram is deceptive and not showing any useful information. I could imagine that this issue is caused by the variance being small compared to the mean, but a reasonable scaling/binning method should be robust against that; it's not as if the absolute values in this example are completely ill-posed.
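A possible explanation, sketched below under the assumption that the summaries go through TensorFlow's exponential histogram bucketing (growth factor 1.1, as in `tensorflow/core/lib/histogram/histogram.cc`): near 0.97 each bucket is roughly 0.1 wide, an order of magnitude wider than the spread of the data, so virtually all mass collapses into a single bucket and no shape information survives.

```python
import numpy as np

def exponential_bucket_edges(limit=10.0, growth=1.1, start=1e-12):
    # Mimics TF's exponentially growing bucket boundaries (assumption:
    # growth factor 1.1 starting from a tiny positive value).
    edges = [start]
    while edges[-1] < limit:
        edges.append(edges[-1] * growth)
    return np.array(edges)

rng = np.random.default_rng(0)
# A distribution like the one in the screenshot: Gaussian around 0.97
# with small variance relative to the mean.
values = rng.normal(loc=0.97, scale=0.01, size=100_000)

counts, _ = np.histogram(values, bins=exponential_bucket_edges())
occupied = np.nonzero(counts)[0]
print(len(occupied), counts[occupied])  # almost all samples fall in one bucket
```

If this is indeed the mechanism, any downstream re-binning of these buckets for display has essentially nothing left to work with, which would explain the flat rendering.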
- This issue concerns the "style" of the visualization: if it's supposed to be a histogram, why is it rendered as piece-wise linear connected lines and not as a bar plot?
If a continuous representation is preferred for aesthetic or other reasons, the right thing would be a kernel density estimate (which for scalar data should not be expensive). A piece-wise linear visualization, by contrast, is confusing and introduces strange artifacts. One of these artifacts can be seen when using the "overlay" mode to show the time evolution of histograms:
The alternation of constant/horizontal sections and oblique interpolation lines creates a lot of visual clutter that makes it hard to see the actual evolution from one histogram to the next. A smooth KDE visualization would presumably be much easier to parse visually, but a plain bar plot might also be easier to understand due to its simplicity.
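To illustrate how cheap the proposed alternative is for scalar data, here is a minimal KDE sketch using `scipy.stats.gaussian_kde` (names and sample counts are illustrative, not taken from TensorBoard's internals):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Scalar summary data similar to the example above.
values = rng.normal(loc=0.97, scale=0.01, size=10_000)

kde = gaussian_kde(values)               # bandwidth via Scott's rule
xs = np.linspace(0.90, 1.04, 200)        # evaluation grid for the plot
density = kde(xs)                        # smooth curve, no piece-wise
                                         # linear bin interpolation
```

Evaluating the KDE on a few hundred grid points per histogram is negligible work compared to rendering, and the resulting curve preserves the location and shape of the distribution.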
A side note, unrelated to histograms:
The moving-average smoothing function for scalar data should convert the user-provided raw smoothingWeight into a per-time-series parameter that takes the sampling rate of each displayed time series into account. The training loss is often sampled at a rate orders of magnitude higher than that of evaluator jobs. It is absolutely counter-intuitive to apply the "point-to-point" moving average with the same smoothingWeight to such different time series.
The result of doing that is that when smoothingWeight is set to a value that sufficiently suppresses noise/variance in the training loss, the evaluation loss lags so far behind that the final value of the smoothed evaluation loss is way off its true value:
The smoothed final value of the evaluation loss at 20M training steps corresponds to the "instantaneous" value of the evaluation loss at ~6M steps. That is not the information one wants to see at 20M steps.
Context:
Tensorboard version: 1.13.0a0
TensorFlow version: built from CL/231841759