finetune.cpp command-line arg #13873
base: master
@@ -3376,5 +3376,56 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
        }
    ).set_examples({LLAMA_EXAMPLE_SERVER}));

    add_opt(common_arg({ "-save", "--opt-save-model-to" }, "ALPHA",
        string_format(
            "adamw or sgd optimizer alpha (default: %s); note: sgd alpha recommended ~10x (no momentum)",
Comment: Forgot to update this string?
Reply: yes ty :)
            params.opt_save_model_to.c_str()),
        [](common_params & params, const std::string & value) { params.opt_save_model_to = value; })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(
        common_arg({ "-lr", "--learning-rate" }, "ALPHA",
            string_format(
                "adamw or sgd optimizer alpha (default: %.2g); note: sgd alpha recommended ~10x (no momentum)",
                (double) params.lr.lr),
            [](common_params & params, const std::string & value) { params.lr.lr = std::stof(value); })
            .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(common_arg(
        { "-lr-half", "--learning-rate-halflife-epochs" }, "N",
        string_format("reduce lr in half every N epochs (default: %.3g)", (double) params.lr.halflife_epochs),
        [](common_params & params, const std::string & value) { params.lr.halflife_epochs = std::stof(value); })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(common_arg({ "-lr-halvings", "--learning-rate-halvings" }, "N",
        string_format("max N lr halvings (default: %.3g)", (double) params.lr.halvings),
        [](common_params & params, const std::string & value) { params.lr.halvings = std::stof(value); })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
Comment on lines +3393 to +3400: To me the more intuitive parameterization of a decaying learning rate would be to set a minimum value for the learning rate rather than a maximum number of times the learning rate is halved.
Reply: Sure, but then you need to distinguish one being explicitly specified vs. not, i.e. mine can be left at default while tweaking just -lr.
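To illustrate the parameterization the reviewer describes, a hedged sketch (not part of this PR; the helper name is hypothetical): a user-facing minimum learning rate maps onto the existing halvings cap by solving lr * 0.5^h = lr_min for h.

```cpp
// Hypothetical helper, not in this PR: derive the halvings cap used by
// lr_decay from a user-specified minimum learning rate, so -lr can be
// tweaked independently while the floor stays where the user put it.
#include <cmath>

static float halvings_from_min_lr(float lr, float lr_min) {
    // lr * 0.5^h == lr_min  =>  h = log2(lr / lr_min)
    return std::log2(lr / lr_min);
    // e.g. halvings_from_min_lr(1e-4f, 1e-6f) == log2(100) ~= 6.64
}
```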
    add_opt(common_arg(
        { "-wd", "--weight-decay" }, "WD",
        string_format(
            "adamw or sgd optimizer weight decay (0 is off; recommend very small e.g. 1e-9) (default: %.2g).",
            (double) params.lr.wd),
        [](common_params & params, const std::string & value) { params.lr.wd = std::stof(value); })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(common_arg({ "-val", "--val-split" }, "FRACTION",
        string_format("portion of data to use as validation when optimizing (default: %.2g).",
            (double) params.val_split),
        [](common_params & params, const std::string & value) { params.val_split = std::stof(value); })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(common_arg({ "-epochs", "--epochs" }, "N",
        string_format("optimizer max # of epochs (default: %d)", params.lr.epochs),
        [](common_params & params, int epochs) { params.lr.epochs = epochs; })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
    add_opt(common_arg({ "-period", "--opt-period" }, "N",
        string_format("make logical batch this multiple of physical batch - needs more memory for accumulation if >1 (default: %d)", params.opt_period),
        [](common_params & params, int opt_period) { params.opt_period = opt_period; })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));
Comment on lines +3417 to +3420: This should be parametrized by setting the logical and physical batch sizes.
Reply: I believe it was, but (correct me if I'm wrong) other opt init code in master adjusts the batch sizes to the minimum of logical, physical. I didn't have the confidence to mess with that. We can drop it and leave it for someone else to investigate if you prefer.
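For readers unfamiliar with the flag, a small sketch of the arithmetic the help string describes (the batch numbers are illustrative, not defaults from this PR): the logical batch over which one optimizer step is taken is the physical batch times --opt-period, with gradients accumulated across the intervening passes.

```cpp
// Illustrative numbers only: how --opt-period relates physical and logical batch.
#include <cstdint>
#include <cstdio>

int main() {
    const int32_t n_batch_physical = 512; // hypothetical physical batch size
    const int32_t opt_period       = 4;   // --opt-period 4
    // gradients from 4 physical batches are accumulated before one optimizer
    // step, which is why opt_period > 1 needs extra memory for accumulation
    const int32_t n_batch_logical  = n_batch_physical * opt_period;
    printf("logical batch = %d\n", n_batch_logical); // 2048
    return 0;
}
```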
    add_opt(common_arg({ "-opt", "--optimizer" }, "sgd|adamw", "adamw or sgd",
        [](common_params & params, const std::string & name) {
            params.optimizer = ggml_opt_get_optimizer(name.c_str());
            if (params.optimizer == GGML_OPT_OPTIMIZER_TYPE_COUNT) {
                throw std::invalid_argument("invalid --optimizer, valid options: adamw, sgd");
            }
        })
        .set_examples({ LLAMA_EXAMPLE_FINETUNE }));

    return ctx_arg;
}
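As a quick illustration of the helpers the -opt handler above relies on, a sketch of the name/enum round trip (it assumes the ggml-opt.h from this PR is on the include path and does not assert the exact casing returned by ggml_opt_optimizer_name):

```cpp
// Sketch: round trip between optimizer name and enum, mirroring the validation
// done by the --optimizer handler above (GGML_OPT_OPTIMIZER_TYPE_COUNT == invalid).
#include <cstdio>

#include "ggml-opt.h" // header changed by this PR, assumed to be on the include path

int main() {
    const enum ggml_opt_optimizer_type t = ggml_opt_get_optimizer("SGD"); // case insensitive
    if (t == GGML_OPT_OPTIMIZER_TYPE_COUNT) {
        fprintf(stderr, "invalid optimizer name\n");
        return 1;
    }
    printf("optimizer: %s\n", ggml_opt_optimizer_name(t));
    return 0;
}
```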
@@ -2,13 +2,15 @@

#pragma once

#include "llama-cpp.h"

#include <set>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>
#include <sstream>
#include <cmath>

#include "ggml-opt.h"
#include "llama-cpp.h"

#ifdef _WIN32
#define DIRECTORY_SEPARATOR '\\'
@@ -80,6 +82,7 @@ enum llama_example {
    LLAMA_EXAMPLE_LOOKUP,
    LLAMA_EXAMPLE_PARALLEL,
    LLAMA_EXAMPLE_TTS,
    LLAMA_EXAMPLE_FINETUNE,

    LLAMA_EXAMPLE_COUNT,
};

@@ -219,6 +222,25 @@ enum common_reasoning_format {
    COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
};
struct lr_decay {
    float lr = 1e-5;
    float halflife_epochs = 100;
    float halvings = 10;

    float decayed(float epoch) const {
        float maxepoch = halvings * halflife_epochs;
        return lr * std::pow(.5, (epoch > maxepoch ? maxepoch : epoch) / halflife_epochs);
    }
};

struct lr_opt : lr_decay {
    float epoch = 0;
    float wd = 0;
    unsigned epochs = 2;
};

struct ggml_opt_optimizer_params common_lr_opt_pars(void * userdata);
Comment on lines +225 to +242: I think a decaying learning rate is a common enough use case in machine learning that it would make sense to implement in [...]. The way you've implemented it the learning rate will be scaled down by discrete factors of 2 rather than a smooth decay. Is this intentional?
Reply: It was intentional to have a step only at the end of a full epoch; I realize this is not how everyone does it.
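To make the exchange above concrete, a standalone sketch of the lr_decay math from this diff (the halflife is shortened here purely for illustration): the formula itself is a smooth exponential, but because epoch only advances at the end of each full epoch, the effective learning rate moves in per-epoch steps.

```cpp
// Standalone copy of the decay formula from lr_decay::decayed() above,
// evaluated at whole epochs only (which is what produces the stepping).
#include <cmath>
#include <cstdio>

int main() {
    const float lr = 1e-5f;
    const float halflife_epochs = 2.0f; // shortened from the default of 100 for illustration
    const float halvings = 10.0f;
    for (int epoch = 0; epoch <= 4; ++epoch) {
        const float maxepoch = halvings * halflife_epochs;
        const float e = (float) epoch > maxepoch ? maxepoch : (float) epoch;
        printf("epoch %d: lr = %g\n", epoch, lr * std::pow(0.5f, e / halflife_epochs));
    }
    // prints 1e-05, ~7.1e-06, 5e-06, ~3.5e-06, 2.5e-06: halved every two epochs,
    // and constant within an epoch because `epoch` is only incremented per epoch.
    return 0;
}
```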
struct common_params {
    int32_t n_predict = -1; // new tokens to predict
    int32_t n_ctx = 4096; // context size

@@ -350,6 +372,13 @@ struct common_params {
    bool no_mmproj = false; // explicitly disable multimodal model
    std::vector<std::string> image; // path to image file(s)

    // finetune
    struct lr_opt lr;
    enum ggml_opt_optimizer_type optimizer = GGML_OPT_OPTIMIZER_TYPE_ADAMW;
    float val_split = 0.05f; // fraction of data used for validation when optimizing
    int32_t opt_period = 1;
    std::string opt_save_model_to = "finetuned-model.gguf";

    // embedding
    bool embedding = false; // get only sentence embedding
    int32_t embd_normalize = 2; // normalisation for embeddings (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
@@ -74,16 +74,30 @@ extern "C" {
        GGML_OPT_BUILD_TYPE_OPT = 30,
    };

    enum ggml_opt_optimizer_type {
        GGML_OPT_OPTIMIZER_TYPE_ADAMW,
        GGML_OPT_OPTIMIZER_TYPE_SGD,

        GGML_OPT_OPTIMIZER_TYPE_COUNT
    };

    // "adamw" or "sgd" (case insensitive)
    GGML_API const char * ggml_opt_optimizer_name(enum ggml_opt_optimizer_type);
    GGML_API enum ggml_opt_optimizer_type ggml_opt_get_optimizer(const char *);

    // parameters that control which optimizer is used and how said optimizer tries to find the minimal loss
    struct ggml_opt_optimizer_params {
        // AdamW optimizer parameters
        struct {
            float alpha; // learning rate
            float beta1;
            float beta2;
            float eps; // epsilon for numerical stability
            float wd; // weight decay for AdamW, use 0.0f to disable
            float alpha; // learning rate
            float beta1; // adamw
            float beta2; // adamw
Comment on lines +92 to +93: Suggested change.
            float eps; // epsilon for numerical stability
            float wd; // weight decay - 0.0f to disable
        } adamw;
        struct {
            float alpha; // learning rate
            float wd; // weight decay
        } sgd;
    };

    // callback to calculate optimizer parameters prior to a backward pass
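For context, a hedged sketch of what a callback of this type might look like when driven by the lr_opt struct added in the common header above; this is not the PR's actual common_lr_opt_pars implementation (which is not shown in this diff), and the include path and AdamW beta/epsilon constants are assumptions.

```cpp
// Hypothetical callback of type ggml_opt_get_optimizer_params; the real
// common_lr_opt_pars in this PR may differ. Fills both sub-structs so either
// optimizer can read its parameters.
#include "common.h"   // lr_opt (assumed location of the struct added in this PR)
#include "ggml-opt.h" // ggml_opt_optimizer_params

static struct ggml_opt_optimizer_params example_opt_pars(void * userdata) {
    const lr_opt & lr = *(const lr_opt *) userdata;
    struct ggml_opt_optimizer_params p = {};
    const float alpha = lr.decayed(lr.epoch); // learning rate after per-epoch decay
    p.adamw.alpha = alpha;
    p.adamw.beta1 = 0.9f;   // assumed defaults, not taken from this PR
    p.adamw.beta2 = 0.999f;
    p.adamw.eps   = 1e-8f;
    p.adamw.wd    = lr.wd;
    p.sgd.alpha   = alpha;
    p.sgd.wd      = lr.wd;
    return p;
}
```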
@@ -113,7 +127,10 @@ extern "C" {
        int32_t opt_period; // after how many gradient accumulation steps an optimizer step should be done

        ggml_opt_get_optimizer_params get_opt_pars; // callback for calculating optimizer parameters
        void * get_opt_pars_ud; // userdata for calculating optimizer parameters
        void * get_opt_pars_ud; // userdata for calculating optimizer parameters

        // only GGML_OPT_OPTIMIZER_TYPE_ADAMW allocates m, v per parameter
        enum ggml_opt_optimizer_type optimizer;
    };

    // get parameters for an optimization context with defaults set where possible
@@ -186,7 +203,7 @@ extern "C" {
    // The second context should contain all other tensors and will be (re)allocated automatically.
    // Due to this automated allocation the data of the second context is not defined when accessed in user code.
    // Note that the second dimension of the inputs/outputs are interpreted as the number of datapoints in those tensors.
    // 4. Call ggml_opt_fit. If you need more control you can use ggml_opt_epoch instead.
    // 4. Call ggml_opt_fit. If you need more control (e.g. optimizer sgd) you can use ggml_opt_epoch instead.
Comment (suggested change): There is SGD support in [...]
    // signature for a callback while evaluating opt_ctx on dataset, called after an evaluation
    typedef void (*ggml_opt_epoch_callback)(

@@ -226,12 +243,14 @@ extern "C" {
            struct ggml_tensor * outputs, // output tensor, must have shape [ne_label, ndata_batch] if labels are used
            ggml_opt_dataset_t dataset, // dataset with data and optionally also labels
            enum ggml_opt_loss_type loss_type, // loss to minimize
            enum ggml_opt_optimizer_type optimizer, // sgd or adamw
            ggml_opt_get_optimizer_params get_opt_pars, // callback to get optimizer params, userdata is pointer to epoch (of type int64_t)
            int64_t nepoch, // how many times the dataset should be iterated over
            int64_t nbatch_logical, // datapoints optimizer step, must be a multiple of ndata_batch in inputs/outputs
            float val_split, // fraction of the dataset to use for validation, must be in [0.0f, 1.0f)
            bool silent); // whether or not info prints to stderr should be suppressed

    GGML_API enum ggml_opt_optimizer_type ggml_opt_context_optimizer_type(ggml_opt_context_t);
Comment: Move this declaration upwards so that it's in the same place as the other getters for [...]
#ifdef __cplusplus
}
#endif
Comment: Do not autoformat files in the same PR where you make functional changes. It creates a lot of unnecessary work for maintainers. As I said, please fix your environment to avoid doing this.
Reply: Copy, re: resolve. As I said, the intention is to autoformat only the new code I add. If I accidentally changed other lines and they were affected, I'm happy to revert.