Incompatible with the official pinecone-text library #26

Closed · mdagost opened this issue Jan 28, 2024 · 11 comments
mdagost commented Jan 28, 2024

See pinecone-io/pinecone-text#71.

From here, the indices array in pinecone-text holds 32-bit unsigned integers. However, this official Pinecone Spark connector expects sparse vector indices to be Spark IntegerType, and Spark's integers are 32-bit signed. That means pinecone-text produces indices that overflow Spark's integer type and are therefore incompatible with the Pinecone Spark connector. I've verified this.
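
For illustration, a minimal sketch (assuming mmh3, which pinecone-text uses for token hashing, is installed) showing tokens whose unsigned hashes exceed Spark's signed 32-bit range:

import mmh3

INT32_MAX = 2**31 - 1  # upper bound of Spark's signed IntegerType

# pinecone-text hashes tokens to unsigned 32-bit integers under the hood
for token in "This is the way the world ends.".split():
    h = mmh3.hash(token, signed=False)
    print(token, h, "overflows IntegerType" if h > INT32_MAX else "fits")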

Since these are both official pinecone repos I'd expect them to be compatible.

Any ideas on what to do here? Seems like you might want to change that schema from IntegerType to LongType?

mdagost commented Jan 30, 2024

Looks like the murmurhash function used in pinecone-text has a flag for signed / unsigned. I had thought that a potential solution would be to make that a configurable flag, but it appears that pinecone itself really is expecting an unsigned integer:

HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32

So it looks like the solution has to be a fix on the spark-pinecone side to change from IntegerType to LongType.
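
For what it's worth, the two mmh3 modes are just different interpretations of the same 32 bits, so the unsigned indices genuinely need more than a signed 32-bit slot (a quick sketch, assuming mmh3 is installed):

import mmh3

token = "ends."
signed = mmh3.hash(token, signed=True)     # range [-2**31, 2**31 - 1]
unsigned = mmh3.hash(token, signed=False)  # range [0, 2**32 - 1]

# Same underlying 32 bits, reinterpreted via two's complement:
assert unsigned == signed % 2**32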

mdagost commented Feb 3, 2024

@rohanshah18 Curious if you have any thoughts here. We're blocked on hybrid search until we can get this working on the Spark connector.

rohanshah18 commented:
Hey @mdagost, thanks for flagging me. I'll take a look on Monday and try to get back to you with an exact timeline to fix this issue. I'll make sure to prioritize it during this sprint.

rohanshah18 self-assigned this Feb 4, 2024

rohanshah18 commented:
@mdagost I have started an internal discussion and am trying to prioritize this. I'll reply back with a timeline for the solution once it's clear.

mdagost commented Feb 7, 2024

Thanks! Really appreciate that! For large document sets, it becomes pretty intractable to fit and encode with the pinecone-text BM25 encoder unless you parallelize on Spark. I've got some code for doing that which I'd be happy to contribute somewhere. It's all running; we just need to be able to write the sparse vectors to pinecone with this connector :)
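
(For context, a hypothetical sketch of the kind of parallelization I mean: the BM25Encoder API is pinecone-text's, but everything else here, including the column names and the toy corpus, is made up for illustration.)

from pinecone_text.sparse import BM25Encoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fit the encoder once (e.g. on a sample of the corpus), then broadcast it
# to the executors so each partition can encode locally.
bm25 = BM25Encoder()
bm25.fit(["this is the way", "the world ends"])
bm25_bc = spark.sparkContext.broadcast(bm25)

def encode_partition(rows):
    encoder = bm25_bc.value
    for row in rows:
        # encode_documents returns {"indices": [...], "values": [...]}
        yield (row["id"], encoder.encode_documents(row["text"]))

df = spark.createDataFrame([("doc-1", "this is the way the world ends.")], ["id", "text"])
encoded = df.rdd.map(lambda r: r.asDict()).mapPartitions(encode_partition)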

rohanshah18 commented Feb 8, 2024

@mdagost I'm actively working on this and trying to reach a consensus internally. But what I want to understand is how you're seeing this error, since the Spark connector accepts IntegerType and -74040069 is within IntegerType's range. If I pass that number as a sparse index, the code shouldn't throw an error; in fact, protobuf will adjust accordingly, since Java doesn't offer uint32 and int is the Java equivalent of uint32 (source: https://protobuf.dev/programming-guides/proto3/#scalar).

HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32
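
For concreteness, the uint32 bit pattern that -74040069 maps to under two's complement (plain Python arithmetic):

print(-74040069 % 2**32)  # 4220927227, the uint32 with the same 32-bit pattern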

Thanks!

mdagost commented Feb 8, 2024

So that error (HTTP response body: vectors[0].sparse_values.indices[5]: invalid value -74040069 for type TYPE_UINT32) comes from Pinecone itself when you pass a negative integer as a sparse index value, just with the regular Python client. For example:

import uuid

import mmh3
import pinecone


pinecone.init(api_key="xxxx", environment="us-east-1-aws")
index = pinecone.Index("some-hybrid-index")

sentence = "This is the way the world ends."

# Signed hashes can be negative, which Pinecone rejects for sparse indices.
indices = [mmh3.hash(t, signed=True) for t in sentence.split()]
values = [0.5 for _ in range(len(indices))]
sparse = {"indices": indices, "values": values}
dense = [0.1] * 384

to_upsert = [{
    "id": str(uuid.uuid4()),
    "sparse_values": sparse,
    "values": dense,  # the dense vector
    }]

index.upsert(to_upsert)

I did that test to show that, by using Spark's IntegerType, which is signed, we only get half of the full range, because Pinecone itself won't accept the negative half of the IntegerType. If it did, I could have put in a PR to pinecone-text to make mmh3.hash(t, signed=True) an option.
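
To put numbers on the "half the range" point (plain arithmetic, nothing library-specific):

INT32_MAX = 2**31 - 1   # 2,147,483,647: the largest index Spark's IntegerType can hold
UINT32_MAX = 2**32 - 1  # 4,294,967,295: the largest index pinecone-text can emit
# Indices in (INT32_MAX, UINT32_MAX], half of the unsigned space, don't fit in IntegerType.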

The actual error I see in my real code stems from pinecone-text using an unsigned integer for the sparse index value: under the hood it calls mmh3.hash(t, signed=False). So one of two things happens. Either you use the schema that spark-pinecone expects:

from pyspark.sql.types import ArrayType, FloatType, IntegerType, StructField, StructType

json_schema = StructType(
    [
        StructField("indices", ArrayType(IntegerType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

in which case the values from pinecone-text overflow the integer and you get an error. Or you define it as LongType so that it's big enough for the unsigned integers from pinecone-text:

from pyspark.sql.types import ArrayType, FloatType, LongType, StructField, StructType

json_schema = StructType(
    [
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

But then the spark-pinecone connector complains on the upsert because that doesn't match the schema it's expecting.

Does that make sense?

rohanshah18 commented Feb 8, 2024

Yes, that makes sense. The Spark connector is built on the Java SDK, and Java's integer type is protobuf's corresponding uint32 (source: https://protobuf.dev/programming-guides/proto3/#scalar). But I hear you, and I'll be pushing for uniform data types across the different tools at Pinecone, so you won't have to worry about which SDK you're using with the Spark connector (i.e. Java or Python). Thanks for the information.

edwinpgm commented:
Hello @rohanshah18, we are experiencing the same issue in our work. Could you please provide an estimated timeline for a resolution?

rohanshah18 commented:
@mdagost @edwinpgm We had to update the underlying Java SDK as well, so we have just released spark connector v1, which now accepts LongType as the datatype for sparse indices.
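
For anyone landing here later: the LongType schema from earlier in this thread is what v1 now accepts, and a signed 64-bit long covers the full uint32 index space (a minimal sanity check):

from pyspark.sql.types import ArrayType, FloatType, LongType, StructField, StructType

sparse_schema = StructType(
    [
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False),
    ]
)

assert 2**32 - 1 <= 2**63 - 1  # every uint32 index from pinecone-text fits in a LongType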

mdagost commented May 22, 2024

Thanks! I'll take a look!
