Add string coercion when decoding json #7453
Does it preserve the JSON data as is? If so, this is incompatible with the destructive nature of parsing to a tape. Whilst one could serialize the tape back to JSON, the code to do this doesn't exist, would be complicated, and wouldn't reproduce the actual input... I wonder if true Spark compatibility might just require a custom JSON reader.

I agree -- if the goal is to have a JSON reader compatible with Spark, and we can't do that with straightforward API extensions, making a custom JSON reader makes the most sense to me.
FYI, I'm using the following code to achieve this:

```rust
// Returns the raw string value of the JSON value. Used when incorrect types are
// encountered in the JSON input. This is used to replicate Spark's behavior
// when decoding JSON.
fn decode_any(s: &mut String, tape: &Tape<'_>, pos: u32) -> Result<(), ArrowError> {
match tape.get(pos) {
TapeElement::StartObject(end) => {
s.push('{');
let mut cur_idx = pos + 1;
let mut key = true;
while cur_idx < end {
decode_any(s, tape, cur_idx)?;
cur_idx = tape.next(cur_idx, "json")?;
if cur_idx < end {
if key {
s.push(':');
} else {
s.push(',');
}
key = !key;
}
}
s.push('}');
}
TapeElement::StartList(end) => {
s.push('[');
let mut cur_idx = pos + 1;
while cur_idx < end {
decode_any(s, tape, cur_idx)?;
cur_idx = tape.next(cur_idx, "json")?;
if cur_idx < end {
s.push(',');
}
}
s.push(']');
}
TapeElement::String(idx) => {
s.push('"');
s.push_str(tape.get_string(idx));
s.push('"');
}
TapeElement::Number(idx) => s.push_str(tape.get_string(idx)),
TapeElement::I64(high) => match tape.get(pos + 1) {
TapeElement::I32(low) => {
let val = ((high as i64) << 32) | (low as u32) as i64;
s.push_str(&val.to_string());
}
_ => unreachable!(),
},
TapeElement::I32(n) => s.push_str(&n.to_string()),
TapeElement::F32(n) => s.push_str(&n.to_string()),
TapeElement::F64(high) => match tape.get(pos + 1) {
TapeElement::F32(low) => {
let val = f64::from_bits(((high as u64) << 32) | low as u64);
s.push_str(&val.to_string());
}
_ => unreachable!(),
},
TapeElement::True => s.push_str("true"),
TapeElement::False => s.push_str("false"),
TapeElement::Null => s.push_str("null"),
el => unreachable!("unexpected {:?}", el),
}
Ok(())
}
```
That doesn't handle escaping correctly; correctly serializing the tape back to JSON is of similar complexity to parsing it in the first place.
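To make the escaping concern concrete: a string value that contains quotes or control characters cannot simply be wrapped in `"…"` on the way back out. Below is a minimal, hypothetical sketch (the helper name is mine, not an arrow-rs API) of the re-escaping a tape-to-JSON serializer would need to do:

```rust
// Escape a string's contents for inclusion in JSON output and wrap it in
// quotes. This is a hypothetical illustration of the escaping that the
// `decode_any` snippet above skips; it is not an arrow-rs API.
fn escape_json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // JSON requires all other control characters to use \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    // A value containing a quote must be re-escaped when serialized back,
    // otherwise the output is no longer valid JSON.
    assert_eq!(escape_json_string("a\"b"), "\"a\\\"b\"");
    println!("{}", escape_json_string("line1\nline2"));
}
```

Note that this only covers strings; faithfully reproducing numbers (e.g. preserving the original formatting of `1e3` vs `1000.0`) is a separate problem, which is part of why round-tripping the tape is comparable in complexity to parsing.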
I'm not sure of the correct solution, but we're hitting this problem in delta-io/delta-kernel-rs#501. Since the bad entry is one field of one row of a potentially very large JSON parse, it's pretty painful. And implementing a full-blown custom JSON parser for everything -- just to deal with one field that might be bad sometimes -- is a super unpleasant workaround. So far I'm aware of three alternatives to handle bad values without just walking away from arrow-json parsing:
|
Forgot a fourth option:
Variant is probably the best (most flexible, with the fewest knobs) solution, but could be a long way off.
You must be helping us fish for help 🎉 I am pretty stoked that we just landed some example variant data, so I think we can now proceed with testing a Rust variant implementation.
When Spark decodes JSON data where values don't match the expected schema type, it silently coerces incompatible values to strings instead of failing or preserving the original structure.
When decoding the JSON `{"hello": [1,2,3]}` with a schema of `map<string,string>`, Spark automatically converts the array value to its string representation, resulting in `{"hello": "[1,2,3]"}`.
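A self-contained sketch of that coercion behavior, using a toy JSON value type (illustrative only; the type and function names are mine, and arrow-json parses to a tape rather than a tree like this):

```rust
// A tiny stand-in for a parsed JSON value, used to sketch Spark-style
// string coercion. Not an arrow-json type.
#[derive(Debug)]
enum Json {
    Str(String),
    Num(f64),
    Bool(bool),
    Array(Vec<Json>),
}

// Render a value as JSON text (assumes strings need no escaping, for brevity).
fn to_json_text(v: &Json) -> String {
    match v {
        Json::Str(s) => format!("\"{}\"", s),
        Json::Num(n) => n.to_string(),
        Json::Bool(b) => b.to_string(),
        Json::Array(items) => {
            let parts: Vec<String> = items.iter().map(to_json_text).collect();
            format!("[{}]", parts.join(","))
        }
    }
}

// Spark-style coercion for a string-typed field: a string value is kept
// as-is, while any other value is stringified instead of raising an error.
fn coerce_to_string(v: &Json) -> String {
    match v {
        Json::Str(s) => s.clone(),
        other => to_json_text(other),
    }
}

fn main() {
    // The array value from {"hello": [1,2,3]} becomes the string "[1,2,3]".
    let v = Json::Array(vec![Json::Num(1.0), Json::Num(2.0), Json::Num(3.0)]);
    assert_eq!(coerce_to_string(&v), "[1,2,3]");
    println!("{}", coerce_to_string(&v));
}
```

This is the behavior the issue asks arrow-json to optionally replicate: rather than failing the whole batch on a type mismatch, the mismatched value is rendered back to JSON text and stored in the string column.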