-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare for 2.0 release #43
base: master
Are you sure you want to change the base?
Conversation
2f61c94 introduced a cyclic dependency between saxtree & htmlparser.
2f61c94 merely appended Locator2, without removing its now-unnecessary superinterface.
Note that the saxtree test is in the htmlparser module due to its dependence on XmlSerializer.
This sets the default language level to Java SE 11. However, the main sources (except for module-info.java) are still compiled with Java SE 8.
I thought that there was already an agreement about not introducing backward incompatibilities at this point.
On ICU4J, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks like something used by the C++ translator (d6df8ad).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The htmlparser
project without the C++ translation? Have you talked with @hsivonen about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too bad that you did this (2928550) after changing the repository layout, otherwise this could have been a good candidate to cherry-pick just now.
There is no compelling need to remove ICU4J, and now you are using an ICU4J snapshot in Not a huge issue, but I do not see the need to remove ICU4J. |
Closes #26, closes #25, closes #14.
To build, a JDK 11+ is required. To build the Javadoc, a JDK 15+ is required (otherwise it fails on the
xom
module. There's already an RC available at http://jdk.java.net/15/, and the final release is due next month).The commits should be self-explanatory.
W.r.t. the cleanup: which of these commits must be reverted, and what can effectively be removed (also considering that it's trivial to resurrect anything from a Git repo)?
Edit: to clarify: before starting out with this PR, I deleted everything that isn't required to build the final JARs, purely for my own convenience. Obviously, I'm aware that some of it needs to stay. However, given the reorganization of the project's structure, I believe it would be useful to go over these commits, and decide what to do with each of them:
gwt-src
seems to have been added at the time in order to enable HTML parsing in env.js. However,env.js
is effectively dead, and GWT itself is mostly irrelevant as well nowadays)htmlparser
,saxtree
,xom
). So to avoid clutter in the top-level directory, I believe it would be best to move any other remaining top-level directories elsewhere, typically into the newhtmlparser
module's directory (e.g.translator-src
would move tohtmlparser/src/translator
).So for the record: I'm not putting any of this up for debate: if you say "revert all commits as is", then I'll do so without further ado.
W.r.t. "Make jchardet & ICU4J heuristics a no-op":
jchardet
is problematic. When relying on its automatic module name, Maven gives the following warning:So I really believe we should eliminate the dependency altogether. The easiest way to do so, is by simply making the related heuristic a no-op. In fact, I believe the same should be done with the
ICU4J
dependency. The usage of optional Maven dependencies andrequires static
clauses are indications that the current situation isn't quite right.The current commit is the minimal change required to eliminate the dependencies.
Now there are 3 options:
remove
Heuristics
altogether, given thatInputSource
doesn't wrap aReader
, and both of the basic sniffers (Bom
&Meta
) failuse an SPI:
htmlparser
should introduce an interfaceEncodingSniffer
and itsmodule-info.java
should sayuses EncodingSniffer;
. Then the current sniffers should be moved into their own Maven module(s), implementEncodingSniffer
, and be made available as services tohtmlparser
merge as is: this is no more than a behavioral regression (except for the removal of the sniffers, but I don't see how direct usage of these classes could possibly be justified), so we could just wait for someone to report it. Then once (if ever) it happens, one of the former 2 options can be implemented
What's your opinion?
(Edit: @carlosame I edited this post to clarify the "Clean up"-commits. And contrary to what you're saying, normalization checking no longer requires ICU4J. So please remove any comments that are not/no longer relevant, or at least bundle all your comments into a single one.)
(Edit2: @carlosame ICU4J is over 30x the size of htmlparser, and both its encoding detection (which "isn't very good") and normalization checking are disabled by default. Anyway, I get your point. And again: 3 of your comments are irrelevant at this point, and the other 2 could easily be merged into 1. So please clean up your existing comments & stop adding new comments (just to be clear: add as much feedback as you want, but do so by editing an existing comment, not adding new ones all the time).)