Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding API to fetch tokenizer config for model #1052

Merged
merged 21 commits into from
Feb 6, 2025

Conversation

kumar-shivam-ranjan
Copy link
Member

@kumar-shivam-ranjan kumar-shivam-ranjan commented Feb 3, 2025

This PR intends to add additional model API to fetch tokenizer config of the model. It reads the tokenizer_config.json file from the model artifact location and returns the whole json as response body. The tokenizer config of the model contains chat template of the model which is useful when we deploy the model with chat completions endpoint.

Chat templates specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

For models which don't have chat_templates, it will return empty json.

Request

curl --location 'http://localhost:8888/aqua/model/ocid1.datasciencemodel.oc1.iad.xxxx/tokenizer'

Response

{
    "bos_token": "<s>",
    "clean_up_tokenization_spaces": true,
    "cls_token": "<s>",
    "eos_token": "</s>",
    "mask_token": {
        "__type": "AddedToken",
        "content": "<mask>",
        "lstrip": true,
        "normalized": true,
        "rstrip": false,
        "single_word": false
    },
    "model_max_length": 8192,
    "pad_token": "<pad>",
    "sep_token": "</s>",
    "sp_model_kwargs": {},
    "tokenizer_class": "XLMRobertaTokenizer",
    "unk_token": "<unk>"
}

Unit tests


Screenshot 2025-02-05 at 12 47 47 AM

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Feb 3, 2025
Copy link

github-actions bot commented Feb 3, 2025

📌 Cov diff with main:

Coverage-11%

📌 Overall coverage:

Coverage-56.71%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments. Also, could you please add some unit tests as well?

ads/aqua/app.py Outdated
@@ -328,6 +328,55 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict:

return config

def get_chat_template(self, model_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we do similar to deployment and fine-tuning?

AQUA_MODEL_TOKENIZER_CONFIG = "tokenizer_config.json"

def get_hf_tokenizer_config(self, model_id: str) -> Dict:
        config = self.get_config(model_id, AQUA_MODEL_TOKENIZER_CONFIG)
        if not config:
            logger.debug(
                f"Tokenizer config for model: {model_id} is not available. Use defaults."
            )
        return config

Relevant to this, for verified models, we want to load the config from the service model bucket instead of user bucket. PR- #1053

If we use get_config directly, the above PR should take care of it.

ads/aqua/app.py Outdated
@@ -328,6 +328,55 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict:

return config

def get_chat_template(self, model_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @telemetry decorator.

@@ -316,8 +317,24 @@ def post(self, *args, **kwargs): # noqa: ARG002
)


class AquaModelChatTemplateHandler(AquaAPIhandler):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be good to return the entire config instead of only the chat template field since there might be use cases down the line where other values might be needed from the tokenizer. UI can parse the json and get the chat_template.
We can name this AquaModelTokenizerConfigHandler accordingly.

Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-76%

📌 Overall coverage:

Coverage-56.75%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good. requesting one update to the endpoint name to match what we're returning.

__handlers__ = [
("model/?([^/]*)", AquaModelHandler),
("model/?([^/]*)/license", AquaModelLicenseHandler),
("model/?([^/]*)/chat_template", AquaModelTokenizerConfigHandler),
Copy link
Member

@VipulMascarenhas VipulMascarenhas Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change the endpoint name to /tokenizer instead of /chat_template since we'll be returning the entire json file contents.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-77%

📌 Overall coverage:

Coverage-56.76%

Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-77%

📌 Overall coverage:

Coverage-56.76%

Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-76%

📌 Overall coverage:

Coverage-56.77%

ads/aqua/app.py Outdated
@@ -306,20 +314,22 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict:
)
base_model = self.ds_client.get_model(base_model_ocid).data
artifact_path = get_artifact_path(base_model.custom_metadata_list)
if config_folder == "artifact":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to move strings to constants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

ads/aqua/app.py Outdated
self,
model_id: str,
config_file_name: str,
config_folder: Optional[str] = "config",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if to pass None as a config_folder? I guess in the code we should add something like

config_folder = config_folder or DEFAULT_MODEL_CONFIG_FOLDER

Let's also move the strings to the constants.

DEFAULT_MODEL_CONFIG_FOLDER = "config"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

ads/aqua/app.py Outdated
@@ -277,6 +282,9 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict:
The OCID of the Aqua model.
config_file_name: str
name of the config file
config_folder: Optional[str]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a default value to the docstring as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

ads/aqua/app.py Outdated
@@ -306,20 +314,22 @@ def get_config(self, model_id: str, config_file_name: str) -> Dict:
)
base_model = self.ds_client.get_model(base_model_ocid).data
artifact_path = get_artifact_path(base_model.custom_metadata_list)
if config_folder == "artifact":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to convert the config_folder to the lowercase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_folder values comes from constant enums which is always lowercase. so probably no need to convert explicitly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially true, however we allow to pass a custom config_folder to the function?

ads/aqua/app.py Outdated
if not artifact_path:
logger.debug(
f"Failed to get artifact path from custom metadata for the model: {model_id}"
)
return config

config_path = f"{os.path.dirname(artifact_path)}/config/"
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the config_folder will have / at the end? It might be better to use os.path.join?

Copy link
Member Author

@kumar-shivam-ranjan kumar-shivam-ranjan Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_folder is parameter that we pass to get_config function.
This determines whether we need to search for a particular file in artifact folder (like tokenizer_config.json) or config folder (like ft_config.json).
The value of the config folder will always be either "config" (default) or "artifact".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)?

paths = url_parse.path.strip("/")
path_list = paths.split("/")
if (
len(path_list) == 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is 4 and 3 in this code? :) Let's move them either to constants or add some meaningful comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request path here is: /aqua/models/ocid1.iad.ahdxxx/tokenizer
path_list=['aqua','models','ocid1.iad.ahdxxx','tokenizer']

Added comments

Copy link
Member

@mrDzurb mrDzurb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kumar-shivam-ranjan, could you also change the title and description for the PR, since the scope of the PR has been adjusted.

@kumar-shivam-ranjan kumar-shivam-ranjan changed the title Adding API to fetch default chat template for models Adding API to fetch tokenizer config for model Feb 5, 2025
@kumar-shivam-ranjan
Copy link
Member Author

Hi @kumar-shivam-ranjan, could you also change the title and description for the PR, since the scope of the PR has been adjusted.

Updated

Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-0%

📌 Overall coverage:

Coverage-19.50%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, rest looks good.

ads/aqua/app.py Outdated
if not artifact_path:
logger.debug(
f"Failed to get artifact path from custom metadata for the model: {model_id}"
)
return config

config_path = f"{os.path.dirname(artifact_path)}/config/"
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)?

Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-0%

📌 Overall coverage:

Coverage-19.50%

Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-82%

📌 Overall coverage:

Coverage-56.67%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@kumar-shivam-ranjan kumar-shivam-ranjan self-assigned this Feb 5, 2025
mrDzurb
mrDzurb previously approved these changes Feb 5, 2025
def get(self, model_id):
url_parse = urlparse(self.request.path)
paths = url_parse.path.strip("/")
path_list = paths.split("/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Looks like we only use path_list further in this method, then probably can do

path_list = urlparse(self.request.path).path.strip("/").split("/")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -316,8 +317,26 @@ def post(self, *args, **kwargs): # noqa: ARG002
)


class AquaModelTokenizerConfigHandler(AquaAPIhandler):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the pydocs to the class?

Handles requests for retrieving the Hugging Face tokenizer configuration 
    of a specified model.
    
    Expected request format:
        GET /aqua/models/<model-ocid>/tokenizer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

and path_list[3] == "tokenizer"
):
return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id))
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: looks like else is not required here. I think it would be more clear to do something like this:

 is_valid_path = (
    len(path_list) == 4
    and path_list[3] == "tokenizer"
    and is_valid_ocid(path_list[2])
)

if not is_valid_path:
       raise HTTPError(400, f"The request {self.request.path} is invalid.")

return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link

github-actions bot commented Feb 6, 2025

📌 Cov diff with main:

Coverage-80%

📌 Overall coverage:

Coverage-56.67%

Copy link

github-actions bot commented Feb 6, 2025

📌 Cov diff with main:

Coverage-82%

📌 Overall coverage:

Coverage-56.68%

@kumar-shivam-ranjan kumar-shivam-ranjan merged commit 0db36f2 into main Feb 6, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants