Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix new range index field report #243

Merged
merged 8 commits into from
Oct 27, 2024

Conversation

amclark42
Copy link
Contributor

Fixes #191. Tested on eXist v6.2.0 using previously-created range indexes on elements, attributes, and fields (also, Lucene indexes by node).

This PR ensures that Monex can report on fields in the new range index, and that the report is accurate across a collection.

Monex doesn't have a develop branch. Maintainers, please feel free to let me know if I should adjust this PR at all. Does the app need a changelog entry or a version increase, for instance?

The global variable is still useful without the context of
`$indexes:collection`, so it's left as a test of what eXist can do
with the new range index.
If the new range index is used, `indexes:show-index-keys()` checks the
output of `range:index-keys-for-field()` for a mismatch in the number of
terms, and more terms than asked for in `$indexes:max-number-returned`.
If the range function is acting on a document-by-document basis, Monex
cleans up the output for accuracy across the collection.

Also made it easier to tell what's going on when processing `$sorted-keys`.
@joewiz
Copy link
Member

joewiz commented Nov 23, 2023

@amclark42 Thank you for the PR! It is correct in targeting the "master" branch.

While the CI tests are failing, the issues causing those failures preceded your PR are entirely unrelated to it. I checked out your branch tonight and installed it, and in fact the tests intermittently pass; I hope to get to the bottom of that, separate from this PR.

On the topic of sample data to confirm that the Indexes panel is behaving as expected, I had forgotten that we already have a basis for this: the "indexes-test" subcollection, which can be browsed at http://localhost:8080/exist/apps/monex/collection.html?collection=/db/apps/monex/indexes-test and whose source is https://github.com/eXist-db/monex/blob/master/src/main/xar-resources/indexes-test. You'll see a sample index configurations with a range index for a few cases of elements and attributes with and without namespaces. Would you consider adding some sample data & associated index configurations that, when browsed in Monex's Indexes panel, demonstrates the functionality you added?

@amclark42
Copy link
Contributor Author

@joewiz Absolutely! Thanks for pointing to the test indexes; that'll help a good deal.

Thanks also for investigating the Cypress tests! I tried to get them running locally, but couldn't figure out how to get my environment set up for them. Is there any documentation you can share on how the eXist devs run the tests locally?

@amclark42
Copy link
Contributor Author

I'm still working on pulling together a good test dataset, but I figure I can lay out some of the criteria that I'm trying to capture.

When I tested Monex with my organization's data, here's what I looked for:

  • The range field report returns data.
    • This is the issue identified in my bug report.
    • This can be tested with a minimal sample, as the existing index tests have done.
  • The range field report does not repeat terms across rows.
    • This is an issue that came up as part of my fix. The range function returns one row per term per document, not compiled across the corpus.
    • To test this, we need more than one XML document.
    • The test corpus should have at least some values repeated across documents.
  • The range field report honors the number of items to be returned.
    • This also arose as part of my fix. Because the range function works document-by-document, it capped the number of terms per document, not across the corpus.
    • I suspect that, with my fix in place, Monex may report a lower-than-actual frequency for some terms, because the range function would not have reported all uses of the term for all documents. I was not able to test this suspicion, but I will be able to with a constrained corpus.
      • It's worth noting that, if my suspicions are correct, that's fuel for a new bug report on eXist itself and the implemented range function.
      • From Monex's point of view, it should be able to compile the per-document results and return no more table rows than were requested. Having an accurate frequency would be ideal but it was not as pressing as getting useful results at all.
    • Ideally, to test this, we'd have more than 100 terms in one or more documents.

@amclark42
Copy link
Contributor Author

With commit d7ae9ca, I've added three new files to Monex's collection of test data. These files record information about the XPath functions available in the functions specs (2.0, 3.0, 3.1), e.g.:

<function fn="idref" group="sequence">
  <return quantifier="*">node()</return>
  <property>deterministic</property>
  <args num="1">
    <property>context-dependent</property>
    <property>focus-dependent</property>
    <error>XPDY0002</error>
    <error>XPTY0004</error>
  </args>
  <args num="2">
    <property>context-independent</property>
    <property>focus-independent</property>
  </args>
  <error>FODC0001</error>
</function>

To demonstrate fields in the new range index, I've added the following to Monex's collection.xconf:

  <create qname="function">
      <field name="function-name" match="@fn" type="xs:string"/>
      <field name="function-group" match="@group" type="xs:string"/>
  </create>
  <create qname="return">
      <field name="return-type" type="xs:string"/>
      <field name="return-quantifier" match="@quantifier" type="xs:string"/>
  </create>
  <create qname="property" type="xs:string"/>
  <create qname="error" type="xs:string"/>

Of the new indexes, only function-name has more than 100 keys. I haven't yet done a thorough check to make sure that the numbers reported are accurate.

With the new data and indexes in place, I'm satisfied that all of the criteria I listed in my last comment have been met. It'd probably be useful to add new tests to the Cypress suite so that there's a backup to an eyeball test. Since I'm not able to run tests locally, I'll have to leave that for someone else.

@amclark42
Copy link
Contributor Author

I added language to each data file that releases it into the public domain. This way I and others can use the dataset for other purposes. Let me know if this is a problem!

@amclark42
Copy link
Contributor Author

Hi @joewiz! Just curious, what's the status of this PR?

@joewiz
Copy link
Member

joewiz commented Oct 24, 2024

@amclark42 I'm so sorry for the letting so much time pass! Thanks for the reminder! I will review the PR ASAP.

@joewiz
Copy link
Member

joewiz commented Oct 26, 2024

@amclark42 I did a before and after comparison of monex's reports on the test data you added to this PR, and I can confirm that with the old code, monex returned 0 results for the 4 fields, and with the new code, it returns the expected number of results for each. This is really promising!

In my next round of the review, I'll see if I can get the tests running locally and to see if there's something I can add related to the new data & corrected reports.

But before I do, I wanted to comment on the observation in your Dec. 3 post above, where you wrote:

The range function returns one row per term per document, not compiled across the corpus... It's worth noting that, if my suspicions are correct, that's fuel for a new bug report on eXist itself and the implemented range function.

As I read your report and the workaround you added to monex to account for this behavior, I agree!

Please consider opening an issue to report this bug, if/when you have the chance. Thank you!

@joewiz
Copy link
Member

joewiz commented Oct 27, 2024

The tests pass locally for me (both via mvn verify - which runs the tests in a local Docker instance; or one can also run npm run cypress after building the package via mvn package and installing it on localhost). The tests already passed in CI. The tests mostly just check that each of the app's main pages open. I could imagine extending the tests to check that the expected values appear, but I think the tests in the eXist core should really be responsible for checking this behavior. Having your test data in place allows us to confirm that the monex report works.

So I think this PR is good to go! Thank you so much for the fixes!

(And sorry again for the delay in reviewing it.)

@joewiz joewiz merged commit 45851ec into eXist-db:master Oct 27, 2024
2 checks passed
@amclark42
Copy link
Contributor Author

@joewiz Thank you for your review and encouragement!

Please consider opening an issue to report this bug, if/when you have the chance. Thank you!

I will try to do so by the end of the year!

@joewiz
Copy link
Member

joewiz commented Oct 28, 2024

@amclark42 Thank you! I think a natural starting point would be the existing tests for the range index fields:

Notably, these tests only store one test document. Adding a 2nd document (e.g., Act 2) could help illustrate what you observed here.

One other note: My links are to the develop branch (i.e., 7.0.0-SNAPSHOT), which includes eXist-db/exist#4910 - a PR that altered the format of data returned by eXist's internal performance statistics functions. So depending on which version of eXist you're running, be sure to use the tests drawn from your version of eXist to avoid unexpected test failures relating to this PR in develop.

I'll be happy to help in any way if you have questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index for new range field doesn’t show keys [BUG]
2 participants