remove epochs and only use batches #689
Would it be possible to still support the concept of epochs somehow? |
even though "epoch" is used throughout libraries, I don't think it is really important for training a model.
this changes quite fundamentally the concept of batch: it would now be a fixed-size random extract of a dataset. I'll use sample from now on as I find it clearer.
const sampleCount = epochCount * dataset.size / sampleSize
this way, we can also avoid having both implementations in discojs and only have to do the computation outside of discojs. |
if you want, it can be possible to offer the best of both worlds. it would
only affect the UI, not functionality: the user can specify their round
duration either in epochs (but that should allow fractional values such as
0.2) or in batches=steps.
in the code we'd always use batches afterwards
this relies on the assumption that the dataset overall size is known (or
also specified in the UI)
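A minimal sketch of that "epochs or batches in the UI, batches in the code" idea, assuming the dataset size is known up front. The names `RoundDuration`, `toBatches` and `toEpochs` are hypothetical and not part of discojs; rounding up to a whole batch is also an assumption of this sketch.

```typescript
// Hypothetical type: what the user typed in the UI.
type RoundDuration =
  | { unit: "batches"; value: number }
  | { unit: "epochs"; value: number }; // may be fractional, e.g. 0.2

// Normalize the user's choice to a batch count, which is all the code uses afterwards.
function toBatches(
  duration: RoundDuration,
  datasetSize: number,
  batchSize: number,
): number {
  if (duration.value <= 0) throw new RangeError("duration must be positive");
  if (duration.unit === "batches") return Math.ceil(duration.value);
  if (datasetSize <= 0 || batchSize <= 0) {
    throw new RangeError("datasetSize and batchSize must be positive");
  }
  // Same formula as in the thread (epochCount * dataset.size / sampleSize),
  // rounded up so that a partial last batch still counts (assumption).
  return Math.ceil((duration.value * datasetSize) / batchSize);
}

// Inverse direction, e.g. to show the user how many passes a batch count represents.
function toEpochs(batches: number, datasetSize: number, batchSize: number): number {
  return (batches * batchSize) / datasetSize;
}
```

With something like this, the UI could also display the equivalent number of epochs next to a batch count, which addresses the "I will not know how batches translate into epochs" concern without keeping epochs in the training code.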
…On Mon, Jul 1, 2024 at 2:31 PM Valérian Rousset ***@***.***> wrote:
Would it be possible to still support the concept of epochs somehow? If
I'm going to train a model on a dataset I will reflect in terms of epochs
rather than batches (or round) for sure, so I would find it confusing and
limiting
even though "epoch" is used throughout libraries, I don't think it is
really important for training a model.
from a network perspective, we only need the clients to train for a
certain amount of time on their data, not a specific number of epochs (nor
batches but that's for another time).
I have the feeling that I'm missing some deeper ML knowledge here, why do
you find it limiting? does the model need to know that it has now seen "all
the dataset" (which is the meaning of epoch for me)?
not be able to know how the nb of batches I have to choose translates into
epochs. What about allowing to specify either batches or epochs? (annoying
from an implementation standpoint but could be nice as UX?)
this changes quite fundamentally the concept of batch: it would now be a
fixed-size random extract of a dataset. I'll use sample from now on as I
find it clearer.
there is not really a translation of samples to epochs, as it is random
now. to have a probable (>=50%) epoch of the dataset, one could use
const sampleCount = epochCount * dataset.size / sampleSize
this way, we can also avoid having both implementations in discojs and only
have to do the computation outside of discojs.
|
Yes! That's exactly what I meant
As a user, how I would choose what is a "certain amount of time" would depend on the concept of epochs. Ideally I have a sizeable and manageable amount of data and I will want to train for exactly one epoch: I take advantage of all the data available and the model only sees each data point once, so less overfitting. If I can only choose a number of batches (= samples), I will not know whether the number of batches I choose represents more or less than one pass over the dataset.
In practice there's usually not enough data and I will want to do multiple passes, or I may have too much data and then I would like to do a fraction of an epoch (in which case specifying a number of samples would be useful).
Essentially, when I think about how much data I want the model to see, I reason in terms of number of passes over the dataset (=epochs) and not in terms of samples (=batches). That may be very personal and that's why I think being able to choose would be nice |
okay, so we need support for both partial dataset (sample based) and full dataset (one epoch). so when someone asks to train for
that does require that we change discojs itself, as we will in fact have two types. is that what you had in mind? |
Yes! I expect that most cases would either be a fraction less than one or an integer number of epochs though |
just a comment on random sampling: either it should be done in both cases (full epochs and fractional ones), or not at all. in the latter case this means that we'd assume the dataset is shuffled already. (if that's an assumption, it would be good to state it in the readmes and code.) btw if it's shuffled, you don't need sampling but can just go with the first 20% of that ordered dataset. so maybe it's easiest to do dataset shuffling in the preprocessing, or else not do any sampling/shuffling ever? in terms of terminology, i'd say batch size is clearer than sample/sample size (more robust in meaning in all scenarios) |
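The "shuffle once in preprocessing, then take the first 20%" option could look roughly like the sketch below. It assumes the dataset fits in memory as an array; `shuffled` and `takeFraction` are hypothetical helper names, not discojs functions.

```typescript
// Fisher-Yates shuffle returning a new array; `random` is injectable for testing.
function shuffled<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Fractional epoch on an already-shuffled dataset: just take a prefix.
function takeFraction<T>(items: readonly T[], fraction: number): T[] {
  if (fraction < 0 || fraction > 1) {
    throw new RangeError("fraction must be between 0 and 1");
  }
  return items.slice(0, Math.round(items.length * fraction));
}

// Usage: shuffle once during preprocessing, then no sampling is needed later.
const preprocessed = shuffled(Array.from({ length: 10 }, (_, i) => i));
const firstFifth = takeFraction(preprocessed, 0.2); // two elements, no repeats
```

Note that a fixed prefix means every round trains on the same 20% unless the dataset is reshuffled between rounds.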
in my understanding, sampling can potentially return a previously seen element in the same iteration (it might even return the same element twice in a single batch, with very low probability). so that's incompatible with a full epoch (all lines once).
that means that the model will always train on the same part of the dataset, is that an issue? FWIW, a full shuffle is a bit costly memory-wise, as we have to keep track of the remaining elements.
yep, I agree, batch makes more sense now. |
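To make the difference concrete, here is a small sketch contrasting the two strategies on indices only. Both function names are hypothetical. Sampling with replacement keeps no state but can repeat indices; a full pass without replacement yields every index exactly once but has to hold the list of remaining indices, which is the memory cost mentioned above.

```typescript
// With replacement: stateless, but the same index can come up more than once.
function sampleWithReplacement(
  size: number,
  count: number,
  random: () => number = Math.random,
): number[] {
  return Array.from({ length: count }, () => Math.floor(random() * size));
}

// Without replacement: each index exactly once per pass (a true epoch),
// at the cost of an O(size) array of remaining indices.
function* epochOrder(
  size: number,
  random: () => number = Math.random,
): Generator<number> {
  const remaining = Array.from({ length: size }, (_, i) => i);
  while (remaining.length > 0) {
    const pick = Math.floor(random() * remaining.length);
    const last = remaining.length - 1;
    const chosen = remaining[pick] as number;
    // Move the last remaining index into the picked slot, then shrink the array.
    remaining[pick] = remaining[last] as number;
    remaining.pop();
    yield chosen;
  }
}
```

Drawing 50 indices with replacement from a dataset of 10 necessarily repeats some of them, whereas `Array.from(epochOrder(10))` is always a permutation of 0 to 9.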
after discussion, it looks like epochs are not really needed, we can directly use batches. so going from "round -> epoch -> batch" to "round -> batch". that would give more direct control on
- TrainingInformation.epochs & EpochLogs
- Task, use rounds as the top level count of run, then batchesPerRound (renamed from roundDuration)
- {Trainer,Model}.fit
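As an illustration of the "round -> batch" structure, a training loop could be sketched as below. Only `rounds` and `batchesPerRound` are names taken from this issue; `TrainingPlan`, `train` and the callback parameters are hypothetical and do not reflect the actual discojs `Trainer` or `Model` API.

```typescript
// Hypothetical shape; only the two field names come from the issue.
interface TrainingPlan {
  rounds: number;
  batchesPerRound: number;
}

// Runs `rounds` rounds of `batchesPerRound` batches each, with no epoch level.
// Returns the total number of batches consumed.
function train<B>(
  plan: TrainingPlan,
  nextBatch: () => B,
  trainOnBatch: (batch: B) => void,
  onRoundEnd: (round: number) => void,
): number {
  let seen = 0;
  for (let round = 0; round < plan.rounds; round++) {
    for (let b = 0; b < plan.batchesPerRound; b++) {
      trainOnBatch(nextBatch());
      seen++;
    }
    // End of round, e.g. the point where weights would be exchanged.
    onRoundEnd(round);
  }
  return seen;
}
```

In this sketch the dataset is only ever seen through `nextBatch`, so whether batches come from a shuffled pass or from random sampling is decided outside the loop.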