Add string coercion when decoding json #7453

Open
cht42 opened this issue Apr 29, 2025 · 7 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@cht42

cht42 commented Apr 29, 2025

When Spark decodes JSON data where values don't match the expected schema type, it silently coerces incompatible values to strings instead of failing or preserving the original structure.

When decoding the following JSON with a schema of map<string,string>: {"hello": [1,2,3]}, Spark automatically converts the array value to its string representation, resulting in {"hello": "[1,2,3]"}
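To make the requested behavior concrete, here is a minimal, std-only sketch of coercing every value of a top-level JSON object to a string. The helper names (`coerce_object_to_string_map`, `scan_value`, `skip_ws`) are hypothetical and are not part of arrow-json or Spark. The sketch assumes well-formed input and keys without escape sequences, and it keeps the raw input text of non-string values, which may differ in whitespace from what Spark actually emits.

```rust
/// Advances past ASCII whitespace starting at `i`.
fn skip_ws(bytes: &[u8], mut i: usize) -> usize {
    while i < bytes.len() && bytes[i].is_ascii_whitespace() {
        i += 1;
    }
    i
}

/// Returns the index just past the JSON value that starts at `start`.
/// Tracks nesting depth and string state; it does not validate the value.
fn scan_value(bytes: &[u8], start: usize) -> Option<usize> {
    let mut depth = 0usize;
    let mut in_string = false;
    let mut i = start;
    while i < bytes.len() {
        let b = bytes[i];
        if in_string {
            match b {
                b'\\' => i += 1, // skip the escaped byte
                b'"' => {
                    in_string = false;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {}
            }
        } else {
            match b {
                b'"' => in_string = true,
                b'{' | b'[' => depth += 1,
                b'}' | b']' => {
                    if depth == 0 {
                        return Some(i); // end of a bare scalar
                    }
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                b',' | b' ' | b'\n' | b'\t' | b'\r' if depth == 0 => return Some(i),
                _ => {}
            }
        }
        i += 1;
    }
    None
}

/// Hypothetical helper: reads a top-level JSON object and returns each key
/// with its value coerced to a string. String values lose their quotes;
/// every other value keeps its raw JSON text.
fn coerce_object_to_string_map(input: &str) -> Option<Vec<(String, String)>> {
    let bytes = input.as_bytes();
    let mut i = skip_ws(bytes, 0);
    if bytes.get(i) != Some(&b'{') {
        return None;
    }
    i = skip_ws(bytes, i + 1);
    let mut out = Vec::new();
    if bytes.get(i) == Some(&b'}') {
        return Some(out);
    }
    loop {
        if bytes.get(i) != Some(&b'"') {
            return None;
        }
        let key_end = scan_value(bytes, i)?;
        let key = input[i + 1..key_end - 1].to_string();
        i = skip_ws(bytes, key_end);
        if bytes.get(i) != Some(&b':') {
            return None;
        }
        i = skip_ws(bytes, i + 1);
        let val_end = scan_value(bytes, i)?;
        if val_end == i {
            return None; // missing value
        }
        let raw = &input[i..val_end];
        let value = if raw.starts_with('"') {
            raw[1..raw.len() - 1].to_string()
        } else {
            raw.to_string()
        };
        out.push((key, value));
        i = skip_ws(bytes, val_end);
        match bytes.get(i) {
            Some(b',') => i = skip_ws(bytes, i + 1),
            Some(b'}') => return Some(out),
            _ => return None,
        }
    }
}
```

Calling `coerce_object_to_string_map` on `{"hello": [1,2,3]}` yields the single pair `("hello", "[1,2,3]")`, matching the example above.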

cht42 added the enhancement label Apr 29, 2025
@tustvold
Contributor

tustvold commented Apr 29, 2025

Does it preserve the JSON data as is? If so, this is incompatible with the destructive nature of parsing to a tape.

Whilst one could serialize the tape back to JSON the code to do this doesn't exist, would be complicated, and wouldn't reproduce the actual input...

I wonder if true spark compatibility might just require a custom JSON reader...

@alamb
Contributor

alamb commented Apr 29, 2025

> I wonder if true spark compatibility might just require a custom JSON reader...

I agree -- if the goal is to have a json reader compatible with spark, and we can't do that with straightforward API extensions, making a custom JSON reader makes the most sense to me

@cht42
Author

cht42 commented Apr 29, 2025

FYI, I'm using the following code to achieve this

// Appends the JSON text of the tape element at `pos` to `s`. Used when a value's
// type does not match the schema, to replicate Spark's behavior of coercing such
// values to strings when decoding JSON.
fn decode_any(s: &mut String, tape: &Tape<'_>, pos: u32) -> Result<(), ArrowError> {
    match tape.get(pos) {
        TapeElement::StartObject(end) => {
            s.push('{');
            let mut cur_idx = pos + 1;
            let mut key = true;
            while cur_idx < end {
                decode_any(s, tape, cur_idx)?;
                cur_idx = tape.next(cur_idx, "json")?;
                if cur_idx < end {
                    if key {
                        s.push(':');
                    } else {
                        s.push(',');
                    }
                    key = !key;
                }
            }

            s.push('}');
        }
        TapeElement::StartList(end) => {
            s.push('[');

            let mut cur_idx = pos + 1;
            while cur_idx < end {
                decode_any(s, tape, cur_idx)?;
                cur_idx = tape.next(cur_idx, "json")?;
                if cur_idx < end {
                    s.push(',');
                }
            }

            s.push(']');
        }
        TapeElement::String(idx) => {
            s.push('"');
            s.push_str(tape.get_string(idx));
            s.push('"');
        }
        TapeElement::Number(idx) => s.push_str(tape.get_string(idx)),
        TapeElement::I64(high) => match tape.get(pos + 1) {
            TapeElement::I32(low) => {
                let val = ((high as i64) << 32) | (low as u32) as i64;
                s.push_str(&val.to_string());
            }
            _ => unreachable!(),
        },
        TapeElement::I32(n) => s.push_str(&n.to_string()),
        // F32 carries the raw bit pattern of the float, not its numeric value.
        TapeElement::F32(n) => s.push_str(&f32::from_bits(n).to_string()),
        TapeElement::F64(high) => match tape.get(pos + 1) {
            TapeElement::F32(low) => {
                let val = f64::from_bits(((high as u64) << 32) | low as u64);
                s.push_str(&val.to_string());
            }
            _ => unreachable!(),
        },
        TapeElement::True => s.push_str("true"),
        TapeElement::False => s.push_str("false"),
        TapeElement::Null => s.push_str("null"),
        el => unreachable!("unexpected {:?}", el),
    }

    Ok(())
}
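The `I64`/`I32` and `F64`/`F32` arms above rebuild a 64-bit value from two consecutive 32-bit tape entries. The following std-only sketch isolates that bit manipulation. The `split_*`/`join_*` names are hypothetical, and the layout (signed high half for integers, raw bit patterns for floats) is an assumption read off the snippet rather than a statement about arrow-json's internals.

```rust
/// Splits an i64 into (high, low) 32-bit halves.
fn split_i64(v: i64) -> (i32, i32) {
    ((v >> 32) as i32, v as i32)
}

/// Rejoins the halves. The low half goes through `u32` first so that a set
/// top bit is not sign-extended into the high half.
fn join_i64(high: i32, low: i32) -> i64 {
    ((high as i64) << 32) | (low as u32) as i64
}

/// Splits an f64 into the (high, low) halves of its IEEE 754 bit pattern.
fn split_f64(v: f64) -> (u32, u32) {
    let bits = v.to_bits();
    ((bits >> 32) as u32, bits as u32)
}

/// Rejoins the two halves of the bit pattern into an f64.
fn join_f64(high: u32, low: u32) -> f64 {
    f64::from_bits(((high as u64) << 32) | low as u64)
}
```

The cast through `u32` in `join_i64` matters: for `0x1_8000_0000` the low half is `i32::MIN`, and casting it straight to `i64` would sign-extend and overwrite the high bits.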

@tustvold
Contributor

That doesn't handle escaping correctly. Correctly serializing the tape back to JSON is of similar complexity to parsing it in the first place.
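To illustrate the escaping concern: if the tape stores string contents already unescaped (an assumption here, not something the thread confirms), then writing them back between bare quotes produces invalid JSON whenever a string contains a quote, a backslash, or a control character. A minimal sketch of the missing step follows, using a hypothetical `push_json_string` helper that follows the escaping rules of RFC 8259.

```rust
/// Appends `raw` to `out` as a quoted JSON string, escaping the characters
/// that RFC 8259 requires to be escaped. Hypothetical helper, std only.
fn push_json_string(out: &mut String, raw: &str) {
    out.push('"');
    for c in raw.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{08}' => out.push_str("\\b"),
            '\u{0C}' => out.push_str("\\f"),
            // Remaining control characters need the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
}
```

Even with this in place, the output is only equivalent JSON, not the original bytes: an input that spelled a character as a `\u` escape, or that contained insignificant whitespace, would come back in a different form.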

@scovich
Contributor

scovich commented May 1, 2025

I'm not sure what the correct solution is, but we're hitting this problem in delta-io/delta-kernel-rs#501. Since the bad entry is one field of one row of a potentially very large JSON parse, it's pretty painful. And implementing a full-blown custom JSON parser for everything, just to deal with one field that might be bad sometimes, is a super unpleasant workaround.

So far I'm aware of three alternatives to handle bad values without just walking away from arrow-json parsing:

@scovich
Contributor

scovich commented May 1, 2025

Forgot a fourth option:

Variant is probably the best (most flexible with least knobs) solution, but could be a long ways off.

@alamb
Contributor

alamb commented May 5, 2025

> Variant is probably the best (most flexible with least knobs) solution, but could be a long ways off.

You must be helping us fish for help 🎉

I am pretty stoked that we just landed some example variant data

So I think we can now proceed testing a Rust variant implementation

4 participants