Incompatible with the official pinecone-text library #26
Looks like the murmurhash function used in pinecone-text has a flag for signed / unsigned. I had thought that a potential solution would be to make that a configurable flag, but it appears that pinecone itself really is expecting an unsigned integer:
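To illustrate the signed/unsigned distinction: the flag only changes how the same 32 hash bits are interpreted, not the bits themselves. A minimal sketch (the index value below is hypothetical, not an actual murmurhash output):

```python
import struct

def as_signed32(u: int) -> int:
    # Reinterpret an unsigned 32-bit value as the signed 32-bit value
    # with the same bit pattern (what a signed hash variant would return).
    return struct.unpack("<i", struct.pack("<I", u))[0]

unsigned_index = 3_465_447_255        # hypothetical pinecone-text index, > 2**31 - 1
print(as_signed32(unsigned_index))    # -829520041: same bits, signed interpretation
```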
So it looks like the solution has to be a fix on the spark-pinecone side here, changing the sparse index type from IntegerType to LongType.
@rohanshah18 Curious if you have any thoughts here. We're blocked on hybrid search until we can get this working on the Spark connector.
Hey @mdagost, thanks for flagging me. I'll take a look on Monday and try to get back to you with an exact timeline to fix this issue. I'll make sure to prioritize it during this sprint.
@mdagost I have started an internal discussion and am trying to prioritize this. I'll reply back with the timeline for the solution once it's clear.
Thanks! Really appreciate that! For large document sets, it becomes pretty intractable to fit and encode with the pinecone-text BM25 encoder unless you parallelize on Spark. I've got some code for that which I'd be happy to contribute somewhere. It's all running; we just need to be able to write the sparse vectors to Pinecone with this connector :)
@mdagost I'm actively working on this and trying to reach a consensus internally. But what I want to understand is how you're seeing this error, since the Spark connector accepts IntegerType.
Thanks!
So that error came from a test I ran to demonstrate the problem. I did that test to show that an IntegerType column overflows. The actual error I see in my real code is that pinecone-text uses an unsigned integer for the sparse index value; under the hood it is calling murmurhash with the unsigned flag. So either you define the sparse index field as IntegerType, in which case the values from pinecone-text overflow the integer and you get an error, or you define it as LongType so that it's big enough for the unsigned integers from pinecone-text. But then the spark-pinecone connector complains on the upsert because that doesn't match the schema it's expecting. Does that make sense?
Yes, that makes so much sense. The Spark connector is built on the Java SDK, and Java's int is the corresponding type for protobuf's uint32 (source: https://protobuf.dev/programming-guides/proto3/#scalar). But I hear you, and I will be pushing for uniform data types across different tools at Pinecone so you won't have to worry about which SDK you're using with the Spark connector (i.e. Java or Python SDK). Thanks for the information.
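Given that uint32 ↔ Java int mapping, one possible client-side workaround (just a sketch under my own assumptions; I haven't confirmed the Pinecone backend reinterprets the values this way, and the index value is hypothetical) would be to reinterpret each unsigned index as the signed 32-bit int with the same bit pattern, so it fits Spark's IntegerType while the protobuf uint32 wire bits are unchanged:

```python
def uint32_to_int32(u: int) -> int:
    """Map a uint32 index onto the signed 32-bit int with the same bit pattern."""
    assert 0 <= u <= 0xFFFFFFFF
    return u - 0x1_0000_0000 if u >= 0x8000_0000 else u

def int32_to_uint32(i: int) -> int:
    """Inverse mapping, analogous to Java's Integer.toUnsignedLong."""
    return i & 0xFFFFFFFF

u = 3_465_447_255                  # hypothetical pinecone-text index
i = uint32_to_int32(u)             # fits in a signed 32-bit IntegerType
assert int32_to_uint32(i) == u     # round-trips back to the original uint32
```

Whether the round trip is transparent end-to-end depends on the backend treating the field strictly as uint32, which is why this is only a sketch.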
Hello @rohanshah18, we are experiencing the same issue in our work. Could you please provide an estimated time of resolution for this issue? |
Thanks! I'll take a look! |
See pinecone-io/pinecone-text#71.
From here, the indices array in pinecone-text is a 32-bit unsigned integer. However, the sparse vectors in this official pinecone Spark connector are expected to be Spark IntegerType. Spark's integers are 32-bit signed. That means that pinecone-text produces indices which overflow Spark's integer type and therefore are incompatible with the pinecone spark connector. I've verified this.
Since these are both official pinecone repos I'd expect them to be compatible.
Any ideas on what to do here? Seems like you might want to change that schema from IntegerType to LongType?