
Analyse Function #545

Closed
wants to merge 1 commit into from

Conversation

@yua5 (Contributor) commented Feb 7, 2025

Add the "Analyse" function, including LDA topic analyze(topic page), collocation word extract(collocation page), cooccur word extract(cooccur page), high-frequency wordlist(wordlist page), keyword extract(keyword), cooccur semantic network(network page). The analyse function get the result from blacklab-server and analyze it in "src\main\java\nl\inl\corpuswebsite\utils\analyseUtils".

Add the "Analyse" function, including LDA topic analyze(topic page), collocation word extract(collocation page), cooccur word extract(cooccur page), high-frequency wordlist(wordlist page), keyword extract(keyword), cooccur
semantic network(network page). The analyse function get the result from blacklab-server and analyze it in "src\main\java\nl\inl\corpuswebsite\utils\analyseUtils".
@KCMertens (Member)

Wow, this is an impressive amount of work! First off, I appreciate the effort and time you've put into this. Given the size and scope of the changes, we'll need some time to properly evaluate everything to ensure it aligns with our own plans for the future.

A few initial thoughts:

  • Some backend functions attempt to download entire corpora, which isn't feasible for larger datasets. We have corpora with millions of documents, meaning such operations could take hours or even days. Unfortunately, this is a fundamental issue that prevents us from integrating these features in their current form.
  • That being said, some functionality would probably be better suited for BlackLab itself, as handling it at that level would be much more efficient. I'll have to further discuss this with @jan-niestadt, the primary developer of BlackLab. However, we’re being cautious about adding more complexity to BlackLab. If we can find a good way to implement these as extensions, that could be an option, but no guarantees.
  • The current implementation assumes document IDs are numeric and sequential, but in reality, PIDs are usually strings. Fortunately, this should be an easy fix; a minimal sketch of string-PID handling follows this list.
  • The build currently includes a full NLP model, resulting in an enormous .war file exceeding 2GB. I couldn’t find any references to this code being used. Could you clarify whether this is necessary, or if it’s an unused dependency that can be removed?
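
For illustration, here is a minimal sketch of treating document PIDs as opaque strings when talking to blacklab-server. The base URL, corpus name, and example PIDs are placeholders, and the `/docs/{pid}` endpoint and Java 11 `HttpClient` are assumptions for the sketch, not code from this PR:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Set;

public class StringPidExample {

    // Placeholder base URL; point this at your own blacklab-server instance.
    private static final String BLS = "http://localhost:8080/blacklab-server";

    /** Fetch document info by PID, treating the PID as an opaque string. */
    static String fetchDocInfo(HttpClient client, String corpus, String pid)
            throws Exception {
        // PIDs can contain slashes, spaces, etc., so always percent-encode them.
        String encoded = URLEncoder.encode(pid, StandardCharsets.UTF_8)
                .replace("+", "%20"); // URLEncoder does form encoding; fix spaces for a path segment
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BLS + "/" + corpus + "/docs/" + encoded + "?outputformat=json"))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // PIDs gathered from a hits response are arbitrary strings, never a numeric range.
        Set<String> pids = Set.of("doc-0001", "letters/1890/abc.xml");
        for (String pid : pids)
            System.out.println(fetchDocInfo(client, "mycorpus", pid));
    }
}
```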

There’s a lot of interesting work here; we'll take a deeper look and follow up!

@KCMertens (Member)

First, thank you for sharing your work with us; we really appreciate the effort and dedication that went into this.
Unfortunately, after reviewing it in detail and discussing with the team, we've come to the difficult conclusion that we won't be able to merge this feature in its current state.

There is cool stuff in here - we do see value! But the decision mainly comes down to technical concerns.
The engineering effort it would require to bring it to a state where we could confidently integrate it is beyond what we can commit to at this time.

That said, we'd be happy to provide some guidance or discuss potential next steps in more detail if you're interested. You can reach out via email (see the README), or open issues here on GitHub if you have specific questions.

Why We Can't Merge This

Architectural Fit

We feel that the Corpus-Frontend isn't the best place for this feature. A better approach might be to integrate the analysis implementations as a BlackLab plugin or as a separate tool. Philosophically, the Corpus-Frontend is mostly a fancy query builder and result viewer - a publishing tool - and it doesn't really concern itself with data processing.

Usability & Performance Concerns

Apart from the above, there are some major usability and performance issues that would need to be addressed:

  • The current implementation allows users to trigger extremely resource-intensive operations with a single button click, likely overwhelming the server. There's no way to disable this feature for large datasets, which would cause problems in production.
  • Additional safeguards would be necessary to prevent excessive memory usage or excessively long-running jobs.
  • A more efficient approach might involve precomputing certain operations during indexing, or providing admin-controlled processing with caching (for example, Topic Analysis could be performed once, then cached forever; a small caching sketch follows this list).
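
To make the caching idea concrete, here is a minimal sketch using a `ConcurrentHashMap`. The `runExpensiveTopicAnalysis` method is a hypothetical stand-in for the real LDA computation, not code from this PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AnalysisCache {

    private final Map<String, String> topicCache = new ConcurrentHashMap<>();

    /** Return cached topics for a corpus, computing them at most once. */
    public String topicAnalysis(String corpus) {
        // computeIfAbsent runs the expensive job at most once per key,
        // even when several requests arrive concurrently.
        return topicCache.computeIfAbsent(corpus, this::runExpensiveTopicAnalysis);
    }

    private String runExpensiveTopicAnalysis(String corpus) {
        // Hypothetical stand-in for the real LDA computation. In practice this
        // would be triggered by an admin or at indexing time, not by an
        // end-user button click.
        return "topics for " + corpus;
    }
}
```

For very long-running jobs, the next refinement would be caching a Future instead of the result, so the map isn't held locked while the computation runs.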

Technical Considerations

Beyond these, we also found some more specific issues:

  • Page state isn't properly restored on reload.
  • Resetting the search form causes errors in the console.
  • The hardcoded word/lemma switch needs to be dynamic, as word and lemma are not guaranteed to have those exact names (or even exist).
  • Document IDs are assumed to be numeric and contiguous, but they are actually arbitrary strings.
  • Queries will fail on noisy data due to incorrect/missing URL and regex escaping (e.g., handling " or ' correctly).
  • Queries should use POST requests, or they can become too long and fail on large datasets due to GET request size limits (this happens in co-occurrence analysis). A sketch of query escaping and POSTing follows this list.
  • The implementation pulls entire corpora over the network, relies on inefficient string operations, and uses arrays where sets would be significantly faster - meaning substantial optimization would still be required before we could offer this to a wider audience.
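
To illustrate the escaping and POST points, here is a minimal sketch. The server URL and corpus name are placeholders; `patt` is blacklab-server's pattern parameter, and the backslash escaping for CQL string literals is an assumption based on CorpusQL syntax:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PostQueryExample {

    // Placeholder server and corpus; adjust to your installation.
    private static final String HITS_URL =
            "http://localhost:8080/blacklab-server/mycorpus/hits";

    /**
     * Escape a literal value for use inside a double-quoted CQL string.
     * Assumption: backslash escaping, as in BlackLab's CorpusQL. Regex
     * metacharacters would also need escaping for a purely literal match.
     */
    static String cqlEscape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) throws Exception {
        // Noisy input containing quote characters that would otherwise
        // break a naively concatenated query.
        String word = "o'clock \"said\"";
        String cql = "[word=\"" + cqlEscape(word) + "\"]";

        // Send the (potentially very long) query as a form-encoded POST body
        // instead of a GET query string, avoiding URL length limits.
        String body = "patt=" + URLEncoder.encode(cql, StandardCharsets.UTF_8)
                + "&outputformat=json";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(HITS_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```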

Moving Forward

One of the biggest takeaways here is the importance of discussion with maintainers before and during development.
Open collaboration early on could have helped align expectations, and prevented some of these challenges.
I realise this takes time that could also be spent on writing code, but for anything nontrivial, communication often makes the difference between a feature that gets merged and one that's difficult to integrate.

Having said all that, we have discussed building some of these things in the past (but more in a "perhaps one day" fashion), and this is a good concrete example of what that could look like! We will certainly look at this for inspiration for some time to come.

Best,
Koen

@KCMertens closed this Feb 14, 2025
@jan-niestadt (Member)

Impressive work, but I have to agree with @KCMertens; this approach unfortunately won't work for many users.

If you have the time and are interested in seeing how some of the processing could be done either in a separate data enrichment step before indexing, or perhaps during the BlackLab indexing or search process, we'd be happy to discuss that.
