Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented first version of the asReader interface on IString #273

Merged
merged 14 commits into from
Sep 22, 2024

Conversation

DavyLandman
Copy link
Member

@DavyLandman DavyLandman commented Sep 19, 2024

This implements the addition discussed in #71

I've added some tests, hope they are enough.

Copy link

github-actions bot commented Sep 19, 2024

Test Results

     99 files  ±0       99 suites  ±0   5m 29s ⏱️ -4s
242 304 tests +2  242 303 ✅ +2  1 💤 ±0  0 ❌ ±0 
726 999 runs  +6  726 996 ✅ +6  3 💤 ±0  0 ❌ ±0 

Results for commit f7d591e. ± Comparison against base commit 38556b4.

♻️ This comment has been updated with latest results.

@jurgenvinju
Copy link
Member

The tests are very good because many random values will be used.

Copy link
Member

@jurgenvinju jurgenvinju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have progress on this! Thanks. I think this demonstrates exactly what is needed.

The reader for indented values requires rethinking without the use of the IntStream. The reader should at least be able to stream entire lines, up too and including the newline character. But larger buffers are preferable. One typical use case here is large json objects serialized as strings with indentation. So if we can cram several lines into buffers up to 8k that will surely help with server response times.

With respect to tests; I'm pretty sure the random string generator does not play with indent yet, or with deep concatenation of long strings. To test this better we could either generate some examples ourselves, or extend the default random generator for strings..

@Override
public int read(char[] cbuf, int off, int len) throws IOException {
int result = currentReader.read(cbuf, off, len);
if (result == -1 && !readingRight) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use some comments to explain that read is allowed to return fewer chars than Len.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, I didn't want to repeat the manual of read, but indeed, at least 1 char is enough.

@@ -1380,6 +1438,72 @@ public void write(Writer w) throws IOException {
assert indents.isEmpty();
}

@Override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct, although I see we need to work hard with the surrogate pairs. Maybe If we would always read until len - 1 unless the last character is a newline, and the int iterator always provides codepoints, there are no further special cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But before we merge we should find an implemention that can say least stream line by line. I suspect that the current experiment with the int iterator will slow down the stream to the point that it doesn't make sense to stream anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we need to do work with surrogate pairs since every int the iterator returns can be either 1 or 2 chars. I initially had some code that would write 2 chars if there was room, but that just lead to duplication of logic with the queedLowSurrogate field. So I rewrote that to be a bit less wordy.

@jurgenvinju
Copy link
Member

To make indented write easier i distributed indentedWrite methods over all implementation classes. Possibly an indentedRead method would help as well in this case. For larger buffers than 1.

@@ -1146,6 +1163,47 @@ public void write(Writer w) throws IOException {
right.write(w);
}

@Override
public Reader asReader() {
return new Reader() {
Copy link
Member

@jurgenvinju jurgenvinju Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we allocate a reader at every level in the tree and every node in every level. For the writer this design didn't hold up because of the enormous amounts of concat nodes in some applications. Typically we have at least as many concat nodes as lines in the output.

So let's have a look at this and also use the benchmarks in the main method so we can fine-tune it. Typically a recursion is ok for the concat nodes and maximally log(n) allocations (a spine of reader objects that moves over the tree).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we have to keep state, the writer was easier as it's push, while the reader is pull. I tried to do it in a similar way as the writer, but could not in a reasonable timeframe figure it out. We need co-routintes ;)

@jurgenvinju
Copy link
Member

This algorithm is what we ended up with to avoid allocation of iterator or stream objects for every node in the balanced binary tree:

@Override
        public OfInt iterator() {
            return new OfInt() {
                final Deque<AbstractString> todo = new ArrayDeque<>(depth);
                OfInt currentLeaf = leftmostLeafIterator(todo, LazyConcatString.this);

                @Override
                public boolean hasNext() {
                    return currentLeaf.hasNext(); /* || !todo.isEmpty() is unnecessary due to post-condition of nextInt() */
                }

                @Override
                public int nextInt() {
                    int next = currentLeaf.nextInt();

                    if (!currentLeaf.hasNext() && !todo.isEmpty()) {
                        // now we back track to the previous node we went left from,
                        // take the right branch and continue with its first leaf:
                        currentLeaf = leftmostLeafIterator(todo, todo.pop());
                    }

                    assert currentLeaf.hasNext() || todo.isEmpty();
                    return next;
                }
            };
        }

If we turn this into an Iterator<AbstractString> where each AbstractString is guaranteed to be a leaf node, then that would work. Only have to check what the indented nodes do in this context.

@DavyLandman
Copy link
Member Author

@jurgenvinju the new implementation uses the exact same style as the int iterators.

Copy link
Member

@jurgenvinju jurgenvinju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fast and useful. Hope we can apply it in the webservers to improve visuals and other features like salix diffs.

@DavyLandman DavyLandman merged commit 8621d34 into main Sep 22, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants