Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchemaObject changes to support multiple vector configs #1499

Merged
merged 20 commits into from
Oct 7, 2024

Conversation

maheshrajamani
Copy link
Contributor

@maheshrajamani maheshrajamani commented Oct 5, 2024

What this PR does:
Made the SchemaObject store and return List of VectorConfig. This is needed because tables need to support multiple vector config.
Changed the vectorize serialization to be based on VectorizeConfig object used in SchemaObject.

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CLA Signed: DataStax CLA

…needed because tables need multiple vector config.
@maheshrajamani maheshrajamani requested a review from a team as a code owner October 5, 2024 03:14
@@ -198,15 +198,15 @@ public Uni<RestResponse<CommandResult>> postCommand(
// TODO: refactor this code to be cleaner so it assigns on one line
EmbeddingProvider embeddingProvider = null;
final VectorConfig.VectorizeConfig vectorizeConfig =
schemaObject.vectorConfig().vectorizeConfig();
schemaObject.vectorConfigs().get(0).vectorizeConfig();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changed to get(0) because collections have one config. Will need to change this when we work on vectorize support for tables.

@@ -39,7 +40,7 @@ public Operation resolveKeyspaceCommand(
.collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().getApiDataType()));
List<String> partitionKeys = Arrays.stream(command.definition().primaryKey().keys()).toList();

Map<String, VectorizeConfig> vectorizeConfigMap =
Map<String, VectorConfig.VectorizeConfig> vectorizeConfigMap =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the serialization to use the object used in schema cache

Copy link
Contributor

@tatu-at-datastax tatu-at-datastax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so, I understand that we will need to allow multiple vector definitions for API Tables (and possibly eventually for Collections). But I am bit worried about couple of aspects:

  1. Use of "vector-disabled" value for List<VectorConfig> instead of main-level enabled/disabled -- along with special VectorConfig value
  2. Ordering of VectorConfig within List as it's not clear how ordering would be defined/preserved for multi-vector use case (which we don't yet have). Should we use JSON Object instead of JSON Array, keyed by column name?

I think there are good reasons to do above but would be good to outline reasoning.

}

// convert a vector jsonNode from table comment to vectorConfig
// convert a vector jsonNode from table comment to vectorConfig, used for collection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... not used for tables? So only collection? (if so, comment should state that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it means "used when reading config for Collections (not Tables)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table there denotes cql table. fwiw I will just say comment to avoid confusion

@@ -51,17 +81,8 @@ public static VectorConfig fromJson(JsonNode jsonNode, ObjectMapper objectMapper
vectorizeServiceParameterNode == null
? null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use Map.of() unless we want to use null as marker of some kind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't understand this recommendation.

@@ -49,6 +51,9 @@ public DataVectorizer(
this.nodeFactory = nodeFactory;
this.embeddingCredentials = embeddingCredentials;
this.schemaObject = schemaObject;
// This is getting element at 0 since only one vector is stored in the schema.
// This logic needs to be changed to handle multiple vectors columns for tables,
vectorConfig = schemaObject.vectorConfigs().get(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should class Javadoc probably mention if it is only being used for Collections?
I assume it is because only single vector is defined as per comment here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has all field names hard coded for collections, this needs to be adapted for api tables.

int vectorSize,
SimilarityFunction similarityFunction,
VectorizeConfig vectorizeConfig) {

// TODO: this is an immutable record, this can be singleton
// TODO: Remove the use of NULL for the objects like vectorizeConfig
public static VectorConfig notEnabledVectorConfig() {
return new VectorConfig(false, -1, null, null);
return new VectorConfig(false, null, -1, null, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics of null vectorField? I assume we need a placeholder of some kind... ?
Put another way, how are we using this placeholder (how is it checked and so on).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave null if the there is no vector field set for the table or collection

@maheshrajamani
Copy link
Contributor Author

  1. Use of "vector-disabled" value for List<VectorConfig> instead of main-level enabled/disabled -- along with special VectorConfig value

Refactored this to keep the flag outside vector definitions.

2. Ordering of VectorConfig within List as it's not clear how ordering would be defined/preserved for multi-vector use case (which we don't yet have). Should we use JSON Object instead of JSON Array, keyed by column name?

We don't need to maintain ordering for the vectorize config, since they have the column name stored with it, in case of collection field name will be $vectorize

entry.getValue(), VectorConfig.ColumnVectorDefinition.VectorizeConfig.class);
vectorizeConfigMap.put(entry.getKey(), vectorizeConfig);
}
} catch (JsonProcessingException | IllegalArgumentException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, one last request: was hoping for separate catch blocks -- so for readTree() generic "Bad JSON" (because that'd be causing it) -- and then for treeToValue() can indicate specific vector field that was bad.

SimilarityFunction similarityFunction =
SimilarityFunction.fromString(jsonNode.get("metric").asText());
/**
* Configuration for a column, In case of collection this will be of size one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "size one"? Does it mean there will just be one such definition or ... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently collection has only one vector field and its hard coded to $vectorize in the application. These code need to be refactored when we do table vectorize support. Added the comment since I did get(0)

Copy link
Contributor

@tatu-at-datastax tatu-at-datastax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok looks better -- added some optional suggestions, approved.

@maheshrajamani maheshrajamani merged commit 647012c into main Oct 7, 2024
3 checks passed
@maheshrajamani maheshrajamani deleted the schema-outject-multiple-vectorize branch October 7, 2024 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants