Incompatible with the official pinecone-text library #26

Closed · mdagost opened this issue Jan 28, 2024 · 11 comments
mdagost commented Jan 28, 2024

See pinecone-io/pinecone-text#71.

From here, the indices array in pinecone-text holds 32-bit unsigned integers. However, this official Pinecone Spark connector expects sparse vector indices to be Spark IntegerType, and Spark's integers are 32-bit signed. That means pinecone-text produces indices that overflow Spark's integer type and are therefore incompatible with the Pinecone Spark connector. I've verified this.
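
For illustration, a minimal sketch (assuming mmh3, which pinecone-text uses for token hashing, is installed) showing tokens whose unsigned hashes exceed Spark's signed 32-bit range:

import mmh3

INT32_MAX = 2**31 - 1  # upper bound of Spark's signed IntegerType

# pinecone-text hashes tokens to unsigned 32-bit integers under the hood
for token in "This is the way the world ends.".split():
    h = mmh3.hash(token, signed=False)
    print(token, h, "overflows IntegerType" if h > INT32_MAX else "fits")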

Since these are both official pinecone repos I'd expect them to be compatible.

Any ideas on what to do here? Seems like you might want to change that schema from IntegerType to LongType?

mdagost commented Jan 30, 2024

Looks like the murmurhash function used in pinecone-text has a flag for signed / unsigned. I had thought that a potential solution would be to make that a configurable flag, but it appears that pinecone itself really is expecting an unsigned integer:

HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32

So it looks like the solution has to be a fix on the spark-pinecone side to change from IntegerType to LongType.
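
For what it's worth, the two mmh3 modes are just different interpretations of the same 32 bits, so the unsigned indices genuinely need more than a signed 32-bit slot (a quick sketch, assuming mmh3 is installed):

import mmh3

token = "ends."
signed = mmh3.hash(token, signed=True)     # range [-2**31, 2**31 - 1]
unsigned = mmh3.hash(token, signed=False)  # range [0, 2**32 - 1]

# Same underlying 32 bits, reinterpreted via two's complement:
assert unsigned == signed % 2**32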

mdagost commented Feb 3, 2024

@rohanshah18 Curious if you have any thoughts here. We're blocked on hybrid search until we can get this working on the Spark connector.

rohanshah18 commented:
Hey @mdagost, thanks for flagging me. I'll take a look on Monday and try to get back to you with an exact timeline to fix this issue. I'll make sure to prioritize it during this sprint.

rohanshah18 self-assigned this Feb 4, 2024

rohanshah18 commented:
@mdagost I have started an internal discussion and am trying to prioritize this. I'll reply back with a timeline for the solution once it's clear.

mdagost commented Feb 7, 2024

Thanks! Really appreciate that! For large document sets, it becomes pretty intractable to fit and encode with the pinecone-text BM25 encoder unless you parallelize on Spark. I've got some code for doing that which I'd be happy to contribute somewhere. It's all running; we just need to be able to write the sparse vectors to pinecone with this connector :)
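
(For context, a hypothetical sketch of the kind of parallelization I mean: the BM25Encoder API is pinecone-text's, but everything else here, including the column names and the toy corpus, is made up for illustration.)

from pinecone_text.sparse import BM25Encoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fit the encoder once (e.g. on a sample of the corpus), then broadcast it
# to the executors so each partition can encode locally.
bm25 = BM25Encoder()
bm25.fit(["this is the way", "the world ends"])
bm25_bc = spark.sparkContext.broadcast(bm25)

def encode_partition(rows):
    encoder = bm25_bc.value
    for row in rows:
        # encode_documents returns {"indices": [...], "values": [...]}
        yield (row["id"], encoder.encode_documents(row["text"]))

df = spark.createDataFrame([("doc-1", "this is the way the world ends.")], ["id", "text"])
encoded = df.rdd.map(lambda r: r.asDict()).mapPartitions(encode_partition)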

rohanshah18 commented Feb 8, 2024

@mdagost I'm actively working on this and trying to reach a consensus internally. But what I want to understand is how you're seeing this error, since the Spark connector accepts IntegerType and -74040069 is within IntegerType's range. If I pass that number as a sparse index, the code shouldn't throw an error; in fact, protobuf will adjust accordingly, since Java doesn't offer uint32 and int is the Java equivalent of uint32 (source: https://protobuf.dev/programming-guides/proto3/#scalar).

HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32
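
For concreteness, the uint32 bit pattern that -74040069 maps to under two's complement (plain Python arithmetic):

print(-74040069 % 2**32)  # 4220927227, the uint32 with the same 32-bit pattern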

Thanks!

mdagost commented Feb 8, 2024

So that error (HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32) comes from Pinecone itself when you pass a negative integer as a sparse index value, just with the regular Python client. For example:

import uuid

import mmh3
import pinecone


pinecone.init(api_key="xxxx", environment="us-east-1-aws")
index = pinecone.Index("some-hybrid-index")

sentence = "This is the way the world ends."

# Signed hashes can be negative, which Pinecone rejects for sparse indices.
indices = [mmh3.hash(t, signed=True) for t in sentence.split()]
values = [0.5 for _ in range(len(indices))]
sparse = {"indices": indices, "values": values}
dense = [0.1] * 384

to_upsert = [{
    "id": str(uuid.uuid4()),
    "sparse_values": sparse,
    "values": dense,  # the dense vector
    }]

index.upsert(to_upsert)

I did that test to show that, by using Spark's IntegerType, which is signed, we only get half of the full range, because Pinecone itself won't accept the negative half of the IntegerType. If it did, I could have put in a PR to pinecone-text to make mmh3.hash(t, signed=True) an option.
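
To put numbers on the "half the range" point (plain arithmetic, nothing library-specific):

INT32_MAX = 2**31 - 1   # 2,147,483,647: the largest index Spark's IntegerType can hold
UINT32_MAX = 2**32 - 1  # 4,294,967,295: the largest index pinecone-text can emit
# Indices in (INT32_MAX, UINT32_MAX], half of the unsigned space, don't fit in IntegerType.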

The actual error I see in my real code stems from pinecone-text using an unsigned integer for the sparse index value: under the hood it calls mmh3.hash(t, signed=False). So one of two things happens. Either you use the schema that spark-pinecone expects:

from pyspark.sql.types import ArrayType, FloatType, IntegerType, StructField, StructType

json_schema = StructType(
    [
        StructField("indices", ArrayType(IntegerType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

in which case the values from pinecone-text overflow the integer and you get an error. Or you define it as LongType so that it's big enough for the unsigned integers from pinecone-text:

from pyspark.sql.types import ArrayType, FloatType, LongType, StructField, StructType

json_schema = StructType(
    [
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

But then the spark-pinecone connector complains on the upsert because that doesn't match the schema it's expecting.

Does that make sense?

rohanshah18 commented Feb 8, 2024

Yes, that makes sense. The Spark connector is built on the Java SDK, and Java's integer type is protobuf's corresponding uint32 (source: https://protobuf.dev/programming-guides/proto3/#scalar). But I hear you, and I'll be pushing for uniform data types across the different tools at Pinecone, so you won't have to worry about which SDK you're using with the Spark connector (i.e. Java or Python). Thanks for the information.

edwinpgm commented:
Hello @rohanshah18, we are experiencing the same issue in our work. Could you please provide an estimated timeline for a resolution?

rohanshah18 commented:
@mdagost @edwinpgm We had to update the underlying Java SDK as well, so we have just released spark connector v1, which now accepts LongType as the datatype for sparse indices.
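
For anyone landing here later: the LongType schema from earlier in this thread is what v1 now accepts, and a signed 64-bit long covers the full uint32 index space (a minimal sanity check):

from pyspark.sql.types import ArrayType, FloatType, LongType, StructField, StructType

sparse_schema = StructType(
    [
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

assert 2**32 - 1 <= 2**63 - 1  # every uint32 index from pinecone-text fits in a LongType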

mdagost commented May 22, 2024

Thanks! I'll take a look!
