Prepare for 2.0 release #43

anthonyvdotbe · 2020-08-17T05:07:09Z

Closes #26, closes #25, closes #14.

To build, JDK 11+ is required. To build the Javadoc, JDK 15+ is required; otherwise it fails on the xom module. (There's already an RC available at http://jdk.java.net/15/, and the final release is due next month.)

The commits should be self-explanatory.

W.r.t. the cleanup: which of these commits must be reverted, and what can effectively be removed (also considering that it's trivial to resurrect anything from a Git repo)?

Edit: to clarify: before starting out with this PR, I deleted everything that isn't required to build the final JARs, purely for my own convenience. Obviously, I'm aware that some of it needs to stay. However, given the reorganization of the project's structure, I believe it would be useful to go over these commits, and decide what to do with each of them:

- retain them, given that anything that's deleted can trivially be resurrected (e.g. gwt-src seems to have been added at the time to enable HTML parsing in env.js; however, env.js is effectively dead, and GWT itself is mostly irrelevant nowadays as well)
- revert them, either as is, or by additionally moving the affected files. Due to the separation into multiple Maven modules, there are now 3 top-level directories containing sources (htmlparser, saxtree, xom). So to avoid clutter in the top-level directory, I believe it would be best to move any remaining top-level directories elsewhere, typically into the new htmlparser module's directory (e.g. translator-src would move to htmlparser/src/translator).

So for the record: I'm not putting any of this up for debate: if you say "revert all commits as is", then I'll do so without further ado.

W.r.t. "Make jchardet & ICU4J heuristics a no-op": jchardet is problematic. When relying on its automatic module name, Maven gives the following warning:

[WARNING] ****************************************************************************************************************************************
[WARNING] * Required filename-based automodules detected: [jchardet-1.0.jar]. Please don't publish this project to a public artifact repository! *
[WARNING] ****************************************************************************************************************************************

So I really believe we should eliminate the dependency altogether. The easiest way to do so, is by simply making the related heuristic a no-op. In fact, I believe the same should be done with the ICU4J dependency. The usage of optional Maven dependencies and requires static clauses are indications that the current situation isn't quite right.

The current commit is the minimal change required to eliminate the dependencies.

Now there are 3 options:

- remove Heuristics altogether, given that
  - it's only used if: it's explicitly enabled, and the given InputSource doesn't wrap a Reader, and both of the basic sniffers (Bom & Meta) fail
  - it's trivial for users to do the encoding detection themselves, using any library of their choosing
  - none of the current sniffers are state of the art
- use an SPI: htmlparser should introduce an interface EncodingSniffer and its module-info.java should say uses EncodingSniffer;. Then the current sniffers should be moved into their own Maven module(s), implement EncodingSniffer, and be made available as services to htmlparser
- merge as is: this is no more than a behavioral regression (except for the removal of the sniffers, but I don't see how direct usage of these classes could possibly be justified), so we could just wait for someone to report it. Then once (if ever) it happens, one of the former 2 options can be implemented
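
To make the SPI option more concrete, here is a rough sketch of how it might look. Only the interface name EncodingSniffer and the uses clause come from the proposal above; the sniff method signature, the Utf8BomSniffer provider, and the firstMatch and sniffWithServices helpers are hypothetical and exist only for illustration. Providers would be discovered through java.util.ServiceLoader.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical service interface; the method signature is an assumption. */
interface EncodingSniffer {
    /** Returns the detected charset, or empty if this sniffer cannot tell. */
    Optional<Charset> sniff(byte[] prefix);
}

/** Toy provider (not one of the existing sniffers): recognizes only a UTF-8 BOM. */
final class Utf8BomSniffer implements EncodingSniffer {
    @Override
    public Optional<Charset> sniff(byte[] prefix) {
        boolean bom = prefix.length >= 3
                && (prefix[0] & 0xFF) == 0xEF
                && (prefix[1] & 0xFF) == 0xBB
                && (prefix[2] & 0xFF) == 0xBF;
        return bom ? Optional.of(StandardCharsets.UTF_8) : Optional.empty();
    }
}

final class SnifferSpiSketch {
    /** Asks each sniffer in turn and returns the first answer, if any. */
    static Optional<Charset> firstMatch(Iterable<EncodingSniffer> sniffers, byte[] prefix) {
        for (EncodingSniffer sniffer : sniffers) {
            Optional<Charset> result = sniffer.sniff(prefix);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    /** Uses whatever providers are registered; with none, the result is empty. */
    static Optional<Charset> sniffWithServices(byte[] prefix) {
        return firstMatch(ServiceLoader.load(EncodingSniffer.class), prefix);
    }
}
```

With this shape, a build that ships no provider module behaves like the no-op heuristic, while users who want jchardet- or ICU4J-based detection would add the corresponding provider module themselves.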

What's your opinion?

(Edit: @carlosame I edited this post to clarify the "Clean up"-commits. And contrary to what you're saying, normalization checking no longer requires ICU4J. So please remove any comments that are not/no longer relevant, or at least bundle all your comments into a single one.)
(Edit2: @carlosame ICU4J is over 30x the size of htmlparser, and both its encoding detection (which "isn't very good") and normalization checking are disabled by default. Anyway, I get your point. And again: 3 of your comments are irrelevant at this point, and the other 2 could easily be merged into 1. So please clean up your existing comments & stop adding new comments (just to be clear: add as much feedback as you want, but do so by editing an existing comment, not adding new ones all the time).)

src/nu/validator/saxtree/DocumentFragment.java

carlosame · 2020-08-20T19:03:28Z

> The easiest way to do so, is by simply making the related heuristic a no-op

I thought that there was already an agreement about not introducing backward incompatibilities at this point.

> The usage of optional Maven dependencies and requires static clauses are indications that the current situation isn't quite right.

On ICU4J, the requires need not be static, and the Maven dependency could be marked as non-optional. Having to know "I'm going to use setCheckingNormalization(boolean), so I need to add ICU4J to the path" isn't exactly obvious or straightforward.

carlosame

That looks like something used by the C++ translator (d6df8ad).

carlosame

The htmlparser project without the C++ translation? Have you talked with @hsivonen about that?

pom.xml

carlosame

Too bad that you did this (2928550) after changing the repository layout, otherwise this could have been a good candidate to cherry-pick just now.

saxtree/src/main/java/module-info.java

carlosame · 2020-08-21T16:30:09Z

> normalization checking no longer requires ICU4J.

There is no compelling need to remove ICU4J, and now you are using an ICU4J snapshot in COMPOSING_CHARACTERS which may be out of sync with the JDK's Unicode data. The Unicode data used by the JDK changes across releases.

Not a huge issue, but I do not see the need to remove ICU4J.
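
For reference, the JDK has its own normalization support in java.text.Normalizer, which follows the Unicode version of the running JDK. The sketch below is not the project's NormalizationChecker; the class and method names are hypothetical, and it only shows what a JDK-only NFC check could look like.

```java
import java.text.Normalizer;

/** Hypothetical helper, not part of htmlparser. */
final class NfcCheckSketch {
    /** True if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Precomposed U+00E9 is NFC; "e" followed by U+0301 is not.
        System.out.println(isNfc("\u00e9"));
        System.out.println(isNfc("e\u0301"));
    }
}
```

Because the answer depends on the JDK's own tables, results for recently assigned characters may differ between JDK releases, which is the kind of drift relative to a fixed snapshot described above.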

anthonyvdotbe added 30 commits August 15, 2020 09:32

Break cyclic dependency

777c1f2

2f61c94 introduced a cyclic dependency between saxtree & htmlparser.

Remove unnecessary superinterface Locator

52edf70

2f61c94 merely appended Locator2, without removing its now-unnecessary superinterface.

Clean up: doc

223c87c

Clean up: gwt-src et al.

78674ec

Clean up: mozilla-export-scripts

db75e90

Clean up: ruby-gcj

0a18810

Clean up: super

d6df8ad

Clean up: tools

d459ad0

Clean up: cpptranslate et al.

429bfe2

Clean up: GenerateNamedCharacters.java

6f9916f

Remove RPM package support

6d00055

Remove OSGi bundle support

1f70f9b

Update URLs

1c484d4

Format POM

31888b3

Adopt Maven directory layout

ea60638

Fix Maven build

44b25b6

Split into separate Maven modules

c7c30d7

Note that the saxtree test is in the htmlparser module due to its dependence on XmlSerializer.

Fix Maven build

7c8b11f

Fix doclint errors

2928550

Fix typos

88184e1

Fix some lint warnings

5304a2c

Suppress remaining lint warnings

653a769

Remove unused imports

7511367

Upgrade to a module-aware Java SE version

eabf3dd

This sets the default language level to Java SE 11. However, the main sources (except for module-info.java) are still compiled with Java SE 8.

Modularize

f2c3126

Rename package.html to package-info.java

283955b

Fix package Javadoc after rename

9376c4f

Make NormalizationChecker independent of ICU4J

2ce23df

Make jchardet & ICU4J heuristics a no-op

bd6aae9

Bump version number to 2.0

51b5c92

carlosame reviewed Aug 20, 2020

src/nu/validator/saxtree/DocumentFragment.java

carlosame reviewed Aug 20, 2020

pom.xml (outdated)

carlosame reviewed Aug 20, 2020

carlosame reviewed Aug 21, 2020

saxtree/src/main/java/module-info.java

anthonyvdotbe mentioned this pull request Aug 22, 2020

Help wanted: review Maven pom.xml changes for test automation #45

Closed

carlosame mentioned this pull request Sep 4, 2020

Add test automation #44

Closed
