DRILL-8450: Add Data Type Inference to XML Format Plugin #2819

cgivre · 2023-08-08T14:15:31Z

DRILL-8450: Add Data Type Inference to XML Format Plugin

Description

This PR adds data type inference to the XML format plugin. Like other plugins, it adds a new configuration parameter, allTextMode, which, when set to true, reads all data as strings. The default is true.
Note that the inference is limited to doubles, dates, timestamps, booleans, and strings.
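
To make the description concrete, here is a minimal sketch of inference restricted to those five types. It is not Drill's `Typifier`: the class `InferenceSketch`, the enum `InferredType`, and the method `infer` are hypothetical names, and the accepted date and timestamp formats (ISO-8601 only) are an assumption made for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;

// Illustrative only: NOT Drill's Typifier. All names here are hypothetical.
final class InferenceSketch {

  enum InferredType { BOOLEAN, DOUBLE, DATE, TIMESTAMP, VARCHAR }

  // Tries each candidate type in turn and falls back to VARCHAR.
  static InferredType infer(String raw, boolean allTextMode) {
    if (allTextMode || raw == null) {
      return InferredType.VARCHAR;          // all-text mode: everything is a string
    }
    String s = raw.trim();
    if (s.isEmpty()) {
      return InferredType.VARCHAR;          // nothing to infer from
    }
    if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
      return InferredType.BOOLEAN;
    }
    try {
      // Every numeric value becomes a DOUBLE, so "0" and "1.5" agree.
      // Note: parseDouble is lenient and also accepts inputs such as "NaN".
      Double.parseDouble(s);
      return InferredType.DOUBLE;
    } catch (NumberFormatException e) {
      // not a number; keep trying
    }
    try {
      LocalDate.parse(s);                   // ISO-8601 date, e.g. 2023-08-08
      return InferredType.DATE;
    } catch (DateTimeParseException e) {
      // not a date; keep trying
    }
    try {
      LocalDateTime.parse(s);               // ISO-8601 local date-time
      return InferredType.TIMESTAMP;
    } catch (DateTimeParseException e) {
      // not a timestamp; fall through
    }
    return InferredType.VARCHAR;            // fallback accepts anything
  }
}
```

Because every number is treated as a double in this sketch, a column holding `1.5` in one row and `0` in another keeps a single inferred type.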

Documentation

Updated README

Testing

Added unit test.

mbeckerle

+1 One comment to fix.

This was simpler than I expected. You already had the typify() method which does the real work.

I learned how to add a new config property to this thing. Very useful.

mbeckerle · 2023-08-08T14:58:16Z

contrib/format-xml/README.md

-All fields are read as strings.  Nested fields are read as maps.  Future functionality could include support for lists.
+The XML reader has an `allTextMode` which, when set to `true` reads all data fields as strings.
+When set to `false`, Drill will attempt to infer data types.
+Nested fields are read as maps.  Future functionality could include support for lists.


Not really part of this change set, but I don't know what you are suggesting by "future functionality could include support for lists." I'd like to understand that plan/idea just as part of grokking all of this XML mapping.

mbeckerle · 2023-08-08T15:02:21Z

common/src/main/java/org/apache/drill/common/Typifier.java

+    Entry<Class, String> result = Typifier.typify(data);
+    String dataType = result.getKey().getSimpleName();
+
+    // If the string is empty, return UNKNOWN


The next line of code contradicts this comment by returning VARCHAR.
(Unless VARCHAR == UNKNOWN, which is news to me.)

cgivre · 2023-08-08

@mbeckerle Drill doesn't really have an UNKNOWN data type. The way the typifier works is that if it can't determine the data type, it falls back to string, which can basically accept anything.

Regarding the lists... The issue is that to create a list, you have to set the data mode to REPEATED. The problem with XML is that there's no real way to know if a field is repeated or not. Consider this:

<book>
  <author>a</author>
</book>
<book>
  <author>a1</author>
  <author>a2</author>
</book>

Since Drill uses the streaming reader, when it first encounters the author field, it would add an entry for a VARCHAR field. However, when it gets to the next author record, it should be a list, but there's no way to really know that w/o a schema.

With JSON we don't have this problem because it uses [ to denote lists.

Does that make sense?
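
The streaming problem described in this comment can be sketched with the JDK's StAX reader. This is not Drill's XML reader: the class `RepeatedFieldSketch` and the method `authorCountsPerBook` are hypothetical, and the sample input used with it is assumed to be wrapped in a `<books>` root element so that it is well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: NOT Drill's XML reader. All names here are hypothetical.
final class RepeatedFieldSketch {

  // Streams the document once and returns, for each <book>,
  // how many <author> children were seen before </book>.
  static List<Integer> authorCountsPerBook(String xml) {
    List<Integer> counts = new ArrayList<>();
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      int current = 0;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          if ("book".equals(reader.getLocalName())) {
            current = 0;                      // new record starts
          } else if ("author".equals(reader.getLocalName())) {
            current++;                        // only now do we learn of a repeat
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
            && "book".equals(reader.getLocalName())) {
          counts.add(current);                // cardinality known only at record end
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Malformed XML", e);
    }
    return counts;
  }
}
```

For the two books in the example above, the counts are 1 and then 2, so a reader that commits to a scalar column after the first book has already chosen the wrong data mode by the time the second `<author>` arrives.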

mbeckerle · 2023-08-08

Makes perfect sense.

For XML you need XSD to know what's potentially repeating.

Sometimes that is easy because of minOccurs/maxOccurs.

But there's also these "implied arrays".

<element name="a" type="xs:int"/>
<element name="b" type="xs:int"/>
<element name="a" type="xs:int"/>

That's allowed in both XSD and DFDL schemas (though I want to change Daffodil to issue a warning if you do this, because it is such a bad idea when representing structured data.)

The element 'a' looks like an array, in that you can index it.

I think for drill there are just 2 columns: 'a', 'b', but as there is more than one declaration for 'a', it is an implied array.

Even just detecting this (and disallowing it for now) requires a more sophisticated metadata builder which is what I'm working on now.
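
Detecting an implied array amounts to noticing that one element name is declared more than once in the same sequence. The sketch below shows only that check; it is not the metadata builder mentioned above, and `ImpliedArraySketch` and `impliedArrays` are hypothetical names that operate on a plain list of declared element names rather than on a parsed schema.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: works on a list of names, not a real XSD or DFDL schema.
final class ImpliedArraySketch {

  // Returns the element names declared more than once, in first-repeat order.
  static Set<String> impliedArrays(List<String> declaredNames) {
    Set<String> seen = new HashSet<>();
    Set<String> repeated = new LinkedHashSet<>();
    for (String name : declaredNames) {
      if (!seen.add(name)) {      // add() returns false if the name was already seen
        repeated.add(name);
      }
    }
    return repeated;
  }
}
```

For the declarations `a`, `b`, `a`, this reports `a`, which a caller could then reject or warn about.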

cgivre · 2023-08-10T15:12:12Z

Converting to draft. There's a unit test failing in the HTTP plugin.

cgivre · 2023-08-14T02:11:10Z

@mbeckerle Unit tests fixed. I also added the data type inference for APIs that generate XML.
@jnturton, The CI is still failing with that Kerberos issue.

cgivre · 2023-08-18T14:34:09Z

@mbeckerle Could you please take another look? I had to fix a few things for a unit test. Thx!

mbeckerle

+1 last UT fix change looks fine, just a question.

mbeckerle · 2023-08-18T19:46:57Z

contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpXmlOptions.java

@@ -111,7 +111,7 @@ public String toString() {
  public static class HttpXmlOptionsBuilder {

    private int dataLevel;
-    private boolean allTextMode;
+    private Boolean allTextMode;


I thought there were 3 modes: allTextMode, allNumbersAreDouble mode, and infer-types mode.

So why is this a boolean vs. an enum?

cgivre · 2023-08-20

@mbeckerle
In the JSON reader there are two parameters: allTextMode and readAllNumbersAsDouble. Both are boolean. For the XML reader, I chose not to implement the readAllNumbersAsDouble parameter because in practice, it requires very clean data. From using Drill with clients, I can tell you from a lot of personal experience that this was one of the biggest data challenges. For instance, you'd get data where there was a DOUBLE field and then there would be a row with zero denoted as 0. This would then cause schema change exceptions.

We have actually made significant improvements in Drill's implicit casting rules which do prevent a lot of schema change exceptions and as a result, IMHO, it makes distinguishing between INTs and DOUBLEs a lot less important. So... out of laziness I decided it wasn't worth it. I can be convinced otherwise.
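
The thread does not say why the builder field changed from `boolean` to `Boolean`. One plausible reading, offered here only as an assumption, is that a boxed value lets "not specified" (null) be told apart from an explicit `false`, so the documented default of `true` can be applied. The class `XmlOptionsSketch` below is a hypothetical stand-in, not the real `HttpXmlOptions`.

```java
// Illustrative only: a hypothetical stand-in, NOT the real HttpXmlOptions.
final class XmlOptionsSketch {
  private final int dataLevel;
  private final boolean allTextMode;

  private XmlOptionsSketch(Builder b) {
    this.dataLevel = b.dataLevel;
    // Assumption: null means "not configured", so use the default of true.
    this.allTextMode = b.allTextMode == null ? true : b.allTextMode;
  }

  int dataLevel() { return dataLevel; }

  boolean allTextMode() { return allTextMode; }

  static Builder builder() { return new Builder(); }

  static final class Builder {
    private int dataLevel;
    private Boolean allTextMode;   // boxed so that "unset" is representable

    Builder dataLevel(int value) { this.dataLevel = value; return this; }

    Builder allTextMode(Boolean value) { this.allTextMode = value; return this; }

    XmlOptionsSketch build() { return new XmlOptionsSketch(this); }
  }
}
```

With a primitive `boolean` field, an unset value would silently read as `false`, which would turn inference on for configurations that never mentioned the option.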

cgivre · 2023-08-21T14:58:44Z

@mbeckerle @jnturton Are we ok to merge this? I'll add support for arrays in a separate PR.

jnturton · 2023-08-21T15:36:17Z

LGTM

cgivre self-assigned this Aug 8, 2023

cgivre added the enhancement (PRs that add a new functionality to Drill) and doc-impacting (PRs that affect the documentation) labels Aug 8, 2023

mbeckerle approved these changes Aug 8, 2023

cgivre marked this pull request as draft August 10, 2023 15:11

cgivre force-pushed the xml_data_types branch from 35a3e36 to 5b12d30 Compare August 13, 2023 05:21

cgivre marked this pull request as ready for review August 14, 2023 02:09

cgivre added 3 commits August 17, 2023 08:41

DRILL-8450: Add Data Type Inference to XML Format Plugin

c3e7c69

Fix checkstyle

fc1eeeb

Fixed XML Reader for HTTP Plugin

f0dc478

cgivre force-pushed the xml_data_types branch from ac636e9 to f0dc478 Compare August 17, 2023 12:41

Fixed Unit Tests

bf9bbfd

Fixed HTTP UT

8bc72e9

mbeckerle approved these changes Aug 18, 2023

Fixed UT, for real this time. No really, it should work.

149f5af

jnturton approved these changes Aug 21, 2023

cgivre merged commit ee1cfeb into apache:master Aug 22, 2023
8 checks passed

cgivre deleted the xml_data_types branch August 22, 2023 00:48

cgivre added a commit to cgivre/drill that referenced this pull request Nov 2, 2023

DRILL-8450: Add Data Type Inference to XML Format Plugin (apache#2819)

5d4b9cd
