-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SchemaObject changes to support multiple vector configs #1499
Conversation
…needed because tables need multiple vector config.
…TOR_EMBEDDING_TEXT_FIELD`
…TOR_EMBEDDING_TEXT_FIELD`
…TOR_EMBEDDING_TEXT_FIELD`
@@ -198,15 +198,15 @@ public Uni<RestResponse<CommandResult>> postCommand( | |||
// TODO: refactor this code to be cleaner so it assigns on one line | |||
EmbeddingProvider embeddingProvider = null; | |||
final VectorConfig.VectorizeConfig vectorizeConfig = | |||
schemaObject.vectorConfig().vectorizeConfig(); | |||
schemaObject.vectorConfigs().get(0).vectorizeConfig(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changed to get(0) because collections have one config. Will need to change this when we work on vectorize support for tables.
@@ -39,7 +40,7 @@ public Operation resolveKeyspaceCommand( | |||
.collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().getApiDataType())); | |||
List<String> partitionKeys = Arrays.stream(command.definition().primaryKey().keys()).toList(); | |||
|
|||
Map<String, VectorizeConfig> vectorizeConfigMap = | |||
Map<String, VectorConfig.VectorizeConfig> vectorizeConfigMap = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the serialization to use the object used in schema cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so, I understand that we will need to allow multiple vector definitions for API Tables (and possibly eventually for Collections). But I am bit worried about couple of aspects:
- Use of "vector-disabled" value for
List<VectorConfig>
instead of main-level enabled/disabled -- along with specialVectorConfig
value - Ordering of
VectorConfig
withinList
as it's not clear how ordering would be defined/preserved for multi-vector use case (which we don't yet have). Should we use JSON Object instead of JSON Array, keyed by column name?
I think there are good reasons to do above but would be good to outline reasoning.
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Outdated
Show resolved
Hide resolved
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Outdated
Show resolved
Hide resolved
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Outdated
Show resolved
Hide resolved
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
// convert a vector jsonNode from table comment to vectorConfig | ||
// convert a vector jsonNode from table comment to vectorConfig, used for collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... not used for tables? So only collection? (if so, comment should state that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it means "used when reading config for Collections (not Tables)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table there denotes cql table. fwiw I will just say comment to avoid confusion
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/VectorConfig.java
Outdated
Show resolved
Hide resolved
@@ -51,17 +81,8 @@ public static VectorConfig fromJson(JsonNode jsonNode, ObjectMapper objectMapper | |||
vectorizeServiceParameterNode == null | |||
? null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use Map.of()
unless we want to use null
as marker of some kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't understand this recommendation.
@@ -49,6 +51,9 @@ public DataVectorizer( | |||
this.nodeFactory = nodeFactory; | |||
this.embeddingCredentials = embeddingCredentials; | |||
this.schemaObject = schemaObject; | |||
// This is getting element at 0 since only one vector is stored in the schema. | |||
// This logic needs to be changed to handle multiple vectors columns for tables, | |||
vectorConfig = schemaObject.vectorConfigs().get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should class Javadoc probably mention if it is only being used for Collections?
I assume it is because only single vector is defined as per comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class has all field names hard coded for collections, this needs to be adapted for api tables.
int vectorSize, | ||
SimilarityFunction similarityFunction, | ||
VectorizeConfig vectorizeConfig) { | ||
|
||
// TODO: this is an immutable record, this can be singleton | ||
// TODO: Remove the use of NULL for the objects like vectorizeConfig | ||
public static VectorConfig notEnabledVectorConfig() { | ||
return new VectorConfig(false, -1, null, null); | ||
return new VectorConfig(false, null, -1, null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the semantics of null
vectorField? I assume we need a placeholder of some kind... ?
Put another way, how are we using this placeholder (how is it checked and so on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave null if the there is no vector field set for the table or collection
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/VectorConfig.java
Outdated
Show resolved
Hide resolved
Refactored this to keep the flag outside vector definitions.
We don't need to maintain ordering for the vectorize config, since they have the column name stored with it, in case of collection field name will be |
…ct-multiple-vectorize
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/VectorConfig.java
Show resolved
Hide resolved
entry.getValue(), VectorConfig.ColumnVectorDefinition.VectorizeConfig.class); | ||
vectorizeConfigMap.put(entry.getKey(), vectorizeConfig); | ||
} | ||
} catch (JsonProcessingException | IllegalArgumentException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, one last request: was hoping for separate catch blocks -- so for readTree()
generic "Bad JSON" (because that'd be causing it) -- and then for treeToValue()
can indicate specific vector field that was bad.
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Show resolved
Hide resolved
SimilarityFunction similarityFunction = | ||
SimilarityFunction.fromString(jsonNode.get("metric").asText()); | ||
/** | ||
* Configuration for a column, In case of collection this will be of size one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "size one"? Does it mean there will just be one such definition or ... ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently collection has only one vector field and its hard coded to $vectorize
in the application. These code need to be refactored when we do table vectorize support. Added the comment since I did get(0)
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/VectorConfig.java
Outdated
Show resolved
Hide resolved
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/NamespaceCache.java
Outdated
Show resolved
Hide resolved
src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok looks better -- added some optional suggestions, approved.
…te/data-api into schema-outject-multiple-vectorize
What this PR does:
Made the SchemaObject store and return List of VectorConfig. This is needed because tables need to support multiple vector config.
Changed the vectorize serialization to be based on VectorizeConfig object used in SchemaObject.
Checklist