
Refactoring to expose clean metadata and infoset walkers #1112

Merged 1 commit into apache:main on Nov 14, 2023

Conversation

@mbeckerle (Contributor) commented Nov 9, 2023

(Supersedes PR #1092, which remains open because comments in this PR link back to comments there.)

Version 3.7.0-SNAPSHOT

This was designed to support integrating Daffodil directly (without intermediate XML, JSON, or even EXI) with other data handling tools, specifically Apache Drill.

InfosetElement and related InfosetNode traits are now in runtime1.api package.

The InfosetOutputter handler methods now take the InfosetElement and InfosetArray traits as their arguments. This improves the API over making DIArray, DISimple, and DIComplex visible.
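
For illustration, a minimal sketch of what an outputter written against the new traits could look like from Java. The handler-method names follow the existing InfosetOutputter naming (startSimple, startComplex, endComplex, startArray, endArray), but the exact signatures, the metadata()/name() accessors, and the wildcard import are assumptions, not confirmed API:

```java
import org.apache.daffodil.runtime1.api.*; // package name per this PR; contents assumed

// Hypothetical sketch: prints simple values and complex-element boundaries.
// Other required callbacks (startDocument, endDocument, endSimple, ...) are
// omitted for brevity.
public class PrintingOutputter implements InfosetOutputter {
  @Override
  public void startSimple(InfosetSimpleElement simple) {
    // Handler arguments are the new traits, not DISimple/DIComplex/DIArray.
    System.out.println(simple.metadata().name() + " = " + simple.getText());
  }

  @Override
  public void startComplex(InfosetComplexElement complex) {
    System.out.println("start " + complex.metadata().name());
  }

  @Override
  public void endComplex(InfosetComplexElement complex) {
    System.out.println("end " + complex.metadata().name());
  }

  @Override
  public void startArray(InfosetArray array) { /* nothing needed for printing */ }

  @Override
  public void endArray(InfosetArray array) { /* nothing needed for printing */ }
}
```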

Added Metadata, ElementMetadata, etc. (also runtime1.api package) which provide limited exposure to the RuntimeData and CompilerInfo information.

Added MetadataHandler, which is driven by a MetadataWalker that can be invoked from the DataProcessor.
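
As a rough sketch of how the metadata walk might be consumed: the callback names, the name() accessor on the metadata objects, and a walkMetadata entry point on DataProcessor are assumptions inferred from this description, not confirmed signatures.

```java
import org.apache.daffodil.runtime1.api.*; // package name per this PR; contents assumed

// Hypothetical handler: prints one line per element declaration encountered
// during the metadata walk. Callbacks for sequences, choices, arrays, etc.
// are omitted here.
public class SchemaPrinter implements MetadataHandler {
  public void simpleElementMetadata(SimpleElementMetadata md) {
    System.out.println("simple:  " + md.name());
  }

  public void startComplexElementMetadata(ComplexElementMetadata md) {
    System.out.println("complex: " + md.name());
  }

  public void endComplexElementMetadata(ComplexElementMetadata md) {
    // nothing to do at the end of a complex element
  }
}

// Usage (hypothetical entry point): dataProcessor.walkMetadata(new SchemaPrinter());
```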

Walking the runtime1 metadata is easier than walking the DSOM tree. Data fabrics like Apache Drill interface with the runtime, not the compiler, so it is natural for the runtime1 metadata and data structures to be the ones driving the integration.

The InfosetNode types were always supposed to be the API, with the DINodes as the implementation. This settles which classes should show through to SAPI and JAPI for infoset nodes: the InfosetNode types, not the DINode types.

Furthermore, the InfosetNode types can have methods to access the runtime metadata needed by walkers, InfosetOutputters, etc.

These hide our infoset implementation and runtime metadata (RuntimeData classes) implementations.

Added unit test for metadata and data walking to core module

Note: Nothing has changed with InfosetInputters, as those are not needed for Apache Drill integration - which is parse-only.

Refactored InfosetOutputter to enable sharing InfosetOutputterImpl.

BlobMethodImpl is now a class, so in Java declarations it must be extended (first), while InfosetOutputter is a trait and so is implemented (second).
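
Concretely, such a Java declaration would look roughly like this (MyOutputter is a made-up name; the point is only the extends/implements ordering):

```java
// The class (BlobMethodImpl) goes in the extends clause; the trait
// (InfosetOutputter), seen as an interface from Java, goes in implements.
public class MyOutputter extends BlobMethodImpl implements InfosetOutputter {
  // ... InfosetOutputter handler methods overridden here ...
}
```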

Removed use of DIComplex, DISimple from SAPI and JAPI tests

NodeInfo.PrimType is no good (n.g.) for Java access, so PrimTypeNode, which is a class, is used instead.

TODO: maybe just use type name strings for this.

Added features to SchemaUtils to avoid defining the "tns" prefix (which is now officially frowned upon).

Added isHidden to SequenceRuntimeData. Needed to avoid walking hidden elements that appear in metadata structures, but aren't relevant as they do not appear in InfosetOutputter events.

TODO: ensure these are not shared between hidden/non-hidden groups. (PropEnv stuff)
That is: isHidden on SequenceRuntimeData may be broken in the case where a sequence definition is shared both hidden and non-hidden.

Added child access to InfosetComplexElement. This is used by tests.

It is questionable whether this should be allowed, because we don't have fast O(1) hash-based access to all children, only to children referenced by expressions. For now this is implemented as a linear search.
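
For example (the accessor name getChild and the child name "amount" are purely illustrative; the PR only states that some child access exists and that it is a linear search):

```java
// Hypothetical usage: looks up one named child of a complex element.
// Per the note above this is a linear scan over the children, so it should
// not be treated as O(1) access in performance-sensitive code.
static InfosetElement findAmount(InfosetComplexElement record) {
  return record.getChild("amount"); // assumed accessor name
}
```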

Added the QName.toPrefix_localName method. Why allocate these on the fly when they're static compile-time constants?

DAFFODIL-2832

@stevedlawrence (Member) left a comment

Really nice changes! My main questions are about whether it would be possible to make the API even smaller.

@mbeckerle (Contributor, Author)

@stevedlawrence This is ready for another review pass. I believe I've addressed all the prior comments and cleaned up several other things.

Once merged, I want to run a large schema regression test with as many schemas as possible to verify this hasn't broken things.

@mbeckerle requested a review from tuxji on November 12, 2023 at 05:17
@stevedlawrence (Member) left a comment

Looks really good. A number of comments with fresh eyes.

I think my main question is what this all looks like from a Javadoc and Java API perspective, so that weird Scala stuff doesn't leak into it and all the members are visible in the doc. I think the new Metadata and Infoset traits aren't defined anywhere that Javadoc is generated, so we need to figure out how that will work. I don't really want more proxy classes.

Maybe we can plan to drop the japi/sapi stuff, merge that and the new stuff into the api package, and finally get rid of the proxies. 4.0.0 is nearing, so maybe it's time to discuss fixing our API.

@mbeckerle (Contributor, Author)

@stevedlawrence @tuxji This should be very close to ready now.
Please give this a review again.

@tuxji (Contributor) commented Nov 14, 2023

> @stevedlawrence @tuxji This should be very close to ready now. Please give this a review again.

I'll start reviewing this PR. Quick question, though. Is this new walker API intended to be used by the C code generator too, instead of the C generator walking the compiled schema by hand?

@mbeckerle (Contributor, Author) commented Nov 14, 2023 via email

@tuxji (Contributor) left a comment

+1

Looks good overall.

build.sbt Outdated
@@ -219,7 +219,7 @@ val minSupportedJavaVersion: String =

 lazy val commonSettings = Seq(
   organization := "org.apache.daffodil",
-  version := "3.6.0",
+  version := "3.7.0-SNAPSHOT",
Comment (Contributor):

This change doesn't belong in this PR. PR #1113 which hasn't been merged yet will make this change for us.


assertEquals(4, a.dataValue.getAnyRef)
assertEquals(infoset, a.parent)
assertEquals(4, a.getUnsignedLong.longValue().toInt)
Comment (Contributor):

What is the reason for calling .longValue() when it seems possible to call a.getUnsignedLong.toInt directly?

Reply (Contributor Author):

These return Java types, and java.math.BigInteger didn't seem to have a direct conversion to Int. Hence this.

@@ -1209,7 +1209,6 @@ public void testJavaAPI25() throws IOException, ClassNotFoundException, External
         assertEquals(expectedData, bos.toString());
     }

-    @Test
Comment (Contributor):

Is this change intentional and why is the test being disabled?

Reply (Contributor Author):

Looks like a mistake. I'll add this back.


/**
 * This is the supportable API for access to the RuntimeData structures
 * which provide access to static information about a given .
Comment (Contributor):

About a given compiled schema object, or is there a better description than that?

Reply (Contributor Author):

will fix.

@stevedlawrence (Member) left a comment

+1

I looked at the javadoc and it doesn't include any of the new stuff since it's not in the japi project. That's expected, and doesn't have to be fixed for this PR, but we should make sure that is addressed at some point, maybe once the drill integration is complete and we are confident that the new API is sufficient.

@mbeckerle (Contributor, Author)

> +1
>
> I looked at the javadoc and it doesn't include any of the new stuff since it's not in the japi project. That's expected, and doesn't have to be fixed for this PR, but we should make sure that is addressed at some point, maybe once the drill integration is complete and we are confident that the new API is sufficient.

Yeah, we have to look into how to make the doc "hang together" despite some of the code being defined in Java and some (most really) in Scala, and some being in the japi/sapi packages and some in the runtime1/api package.

We may have to put in our own relative hyperlinks between the two bodies of documentation.

We also really need a way to document and exclude some methods/members from the API without having to have a sharp separation between things in the sapi/japi packages which are all API, and things not. We need to avoid the overhead of this proxy japi/sapi structure. It's too hard to maintain. I'd be fine with just adding to the java/scaladoc a standard comment and naming convention indicating that a method is internal not to be considered part of the API. These methods do need to be public. Even just naming them with prefix "_" and always putting them last in the trait/class/object would be ok.

This was designed to support integration of Daffodil directly
(without intermediate XML or JSON or even EXI) to other
data handling tools, specifically, Apache Drill.

InfosetElement and related InfosetNode traits are now in
runtime1.api package.

InfosetOutputter now has methods which use InfosetElement and
InfosetArray traits as the objects passed to the handler methods. This
improves the API over making DIArray, DISimple, and DIComplex visible.
(Though it is backward incompatible.) Also removed use of DIComplex,
DISimple from most SAPI and JAPI tests. The SAX InfosetOutputter still
downcasts somewhat to the DINode classes.

Added Metadata, ElementMetadata, etc. (also runtime1.api package) which
provide limited exposure to the RuntimeData and CompilerInfo information.

Added MetadataHandler - which is walked by MetadataWalker which can be
called from DataProcessor. Added unit test for metadata and data
walking to core module

Walking the runtime1 metadata is easier than walking the DSOM tree.
Data fabrics like Apache Drill interface with runtime1 specifically,
not the schema compiler, so it is natural for the runtime1 metadata
and data structures to be the ones driving the integration.

The InfosetNode types were always supposed to be the API, the DINodes
the implementation. This solves the issue of what classes should show
through to SAPI and JAPI about the infoset nodes. It should be the
InfosetNode types, not the DINode types.

Furthermore, the InfosetNode types can have methods to access the
runtime metadata needed by walkers, InfosetOutputters, etc.

These hide our infoset implementation and runtime metadata
(RuntimeData classes) implementations.

Note: Nothing has changed with InfosetInputters, as those are not
needed for Apache Drill integration - which is parse-only.

BlobMethodMixin factored out of InfosetOutputter as a shared
implementation trait for the basic blob implementation.

Added features to SchemaUtils to avoid defining the "tns" prefix
(which is now officially frowned upon).

Added isHidden to SequenceRuntimeData. Needed to avoid walking
hidden elements that appear in metadata structures, but aren't
relevant as they do not appear in InfosetOutputter events.

Improved Infoset API access to simple type methods

They now use the DFDL type names, not the underlying implementation
type names.

So a decimal is accessed via getDecimal, not getBigDecimal. Improved
the Javadoc accordingly. These accessors throw a predictable exception
on conversion issues.
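
A hedged sketch of what this looks like from calling code: the accessor
names (getDecimal, getInteger, getText) are the ones listed in this
message, while the return types and the exception behavior noted in
comments are assumptions.

```java
// Illustrative only: reads values from a simple element using the
// DFDL-type-named accessors introduced by this change.
static void show(InfosetSimpleElement e) {
  java.math.BigDecimal dec = e.getDecimal(); // DFDL decimal (was getBigDecimal)
  java.math.BigInteger n   = e.getInteger(); // DFDL integer (was getBigInt)
  String text = e.getText();                 // was dataValueAsString
  System.out.println(text + ": " + dec + ", " + n);
  // A value of the wrong DFDL type surfaces as a predictable, documented
  // exception rather than an implementation-specific cast failure.
}
```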

DEPRECATION/COMPATIBILITY

The InfosetOutputter trait methods have changed signatures.

The types of the arguments have been replaced:

DIArray -> InfosetArray
DISimple -> InfosetSimpleElement
DIComplex -> InfosetComplexElement

This was done to hide the "DIxxx" types as they are
internal and subject to change.

Methods of DISimple that were named for implementation types (like
getBigInt, getBigDecimal, etc.) have been replaced by methods named
for the DFDL types.

These methods are replaced:

getBigDecimal -> getDecimal
getBigInt -> getInteger

These methods also changed names:

dataValueAsString -> getText

This method is new:

getNonNegativeInteger

Some methods have been removed (getStatus).

DAFFODIL-2832
@mbeckerle changed the title from "Draft: Refactoring to expose clean metadata and infoset walkers" to "Refactoring to expose clean metadata and infoset walkers" on Nov 14, 2023
@mbeckerle merged commit d26a582 into apache:main on Nov 14, 2023 (8 of 10 checks passed)
@mbeckerle deleted the drill-exp3 branch on November 14, 2023 at 18:27