-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding API to fetch tokenizer config for model #1052
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some comments. Also, could you please add some unit tests as well?
ads/aqua/app.py
Outdated
@@ -328,6 +328,55 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict: | |||
|
|||
return config | |||
|
|||
def get_chat_template(self, model_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we do similar to deployment and fine-tuning?
AQUA_MODEL_TOKENIZER_CONFIG = "tokenizer_config.json"
def get_hf_tokenizer_config(self, model_id: str) -> Dict:
config = self.get_config(model_id, AQUA_MODEL_TOKENIZER_CONFIG)
if not config:
logger.debug(
f"Tokenizer config for model: {model_id} is not available. Use defaults."
)
return config
Relevant to this, for verified models, we want to load the config from the service model bucket instead of user bucket. PR- #1053
If we use get_config directly, the above PR should take care of it.
ads/aqua/app.py
Outdated
@@ -328,6 +328,55 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict: | |||
|
|||
return config | |||
|
|||
def get_chat_template(self, model_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add @telemetry decorator.
ads/aqua/extension/model_handler.py
Outdated
@@ -316,8 +317,24 @@ def post(self, *args, **kwargs): # noqa: ARG002 | |||
) | |||
|
|||
|
|||
class AquaModelChatTemplateHandler(AquaAPIhandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be good to return the entire config instead of only the chat template field since there might be use cases down the line where other values might be needed from the tokenizer. UI can parse the json and get the chat_template.
We can name this AquaModelTokenizerConfigHandler accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look good. requesting one update to the endpoint name to match what we're returning.
ads/aqua/extension/model_handler.py
Outdated
__handlers__ = [ | ||
("model/?([^/]*)", AquaModelHandler), | ||
("model/?([^/]*)/license", AquaModelLicenseHandler), | ||
("model/?([^/]*)/chat_template", AquaModelTokenizerConfigHandler), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change the endpoint name to /tokenizer instead of /chat_template since we'll be returning the entire json file contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…m/oracle/accelerated-data-science into feature/default-chat-template-api
ads/aqua/app.py
Outdated
@@ -306,20 +314,22 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict: | |||
) | |||
base_model = self.ds_client.get_model(base_model_ocid).data | |||
artifact_path = get_artifact_path(base_model.custom_metadata_list) | |||
if config_folder == "artifact": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to move strings to constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
ads/aqua/app.py
Outdated
self, | ||
model_id: str, | ||
config_file_name: str, | ||
config_folder: Optional[str] = "config", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if to pass None
as a config_folder
? I guess in the code we should add something like
config_folder = config_folder or DEFAULT_MODEL_CONFIG_FOLDER
Let's also move the strings to the constants.
DEFAULT_MODEL_CONFIG_FOLDER = "config"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
ads/aqua/app.py
Outdated
@@ -277,6 +282,9 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict: | |||
The OCID of the Aqua model. | |||
config_file_name: str | |||
name of the config file | |||
config_folder: Optional[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a default value to the docstring as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
ads/aqua/app.py
Outdated
@@ -306,20 +314,22 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict: | |||
) | |||
base_model = self.ds_client.get_model(base_model_ocid).data | |||
artifact_path = get_artifact_path(base_model.custom_metadata_list) | |||
if config_folder == "artifact": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to convert the config_folder
to the lowercase
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config_folder values comes from constant enums which is always lowercase. so probably no need to convert explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partially true, however we allow to pass a custom config_folder
to the function?
ads/aqua/app.py
Outdated
if not artifact_path: | ||
logger.debug( | ||
f"Failed to get artifact path from custom metadata for the model: {model_id}" | ||
) | ||
return config | ||
|
||
config_path = f"{os.path.dirname(artifact_path)}/config/" | ||
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the config_folder
will have /
at the end? It might be better to use os.path.join
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config_folder is parameter that we pass to get_config function.
This determines whether we need to search for a particular file in artifact folder (like tokenizer_config.json) or config folder (like ft_config.json).
The value of the config folder will always be either "config" (default) or "artifact".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)
?
paths = url_parse.path.strip("/") | ||
path_list = paths.split("/") | ||
if ( | ||
len(path_list) == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is 4
and 3
in this code? :) Let's move them either to constants or add some meaningful comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The request path here is: /aqua/models/ocid1.iad.ahdxxx/tokenizer
path_list=['aqua','models','ocid1.iad.ahdxxx','tokenizer']
Added comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kumar-shivam-ranjan, could you also change the title and description for the PR, since the scope of the PR has been adjusted.
Updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, rest looks good.
ads/aqua/app.py
Outdated
if not artifact_path: | ||
logger.debug( | ||
f"Failed to get artifact path from custom metadata for the model: {model_id}" | ||
) | ||
return config | ||
|
||
config_path = f"{os.path.dirname(artifact_path)}/config/" | ||
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍
ads/aqua/extension/model_handler.py
Outdated
def get(self, model_id): | ||
url_parse = urlparse(self.request.path) | ||
paths = url_parse.path.strip("/") | ||
path_list = paths.split("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Looks like we only use path_list
further in this method, then probably can do
path_list = urlparse(self.request.path).path.strip("/").split("/")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -316,8 +317,26 @@ def post(self, *args, **kwargs): # noqa: ARG002 | |||
) | |||
|
|||
|
|||
class AquaModelTokenizerConfigHandler(AquaAPIhandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the pydocs to the class?
Handles requests for retrieving the Hugging Face tokenizer configuration
of a specified model.
Expected request format:
GET /aqua/models/<model-ocid>/tokenizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
ads/aqua/extension/model_handler.py
Outdated
and path_list[3] == "tokenizer" | ||
): | ||
return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: looks like else is not required here. I think it would be more clear to do something like this:
is_valid_path = (
len(path_list) == 4
and path_list[3] == "tokenizer"
and is_valid_ocid(path_list[2])
)
if not is_valid_path:
raise HTTPError(400, f"The request {self.request.path} is invalid.")
return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
c66bdd6
This PR intends to add additional model API to fetch tokenizer config of the model. It reads the tokenizer_config.json file from the model artifact location and returns the whole json as response body. The tokenizer config of the model contains chat template of the model which is useful when we deploy the model with chat completions endpoint.
Chat templates specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.
For models which don't have chat_templates, it will return empty json.
Unit tests