
Systematic test units for fit_transform() #2206

Open
PipaFlores opened this issue Nov 5, 2024 · 9 comments

Comments

@PipaFlores
Contributor

PipaFlores commented Nov 5, 2024

Feature request

As identified in PR #2191, the current test units do not cover the process of fitting a model; in other words, they do not test the implementation of fit_transform(). Consequently, current and future features that are performed at the fit_transform() level are not tested in any systematic way. We realized this when debugging topic reduction by declaring nr_topics for the model before fitting. However, this issue might involve all of the core features and most of the optional ones.

Currently, in conftest.py the tests define the models and fit them for further testing in the other units.
https://github.com/MaartenGr/BERTopic/blob/c3ec85dec8eeac704b30812dfed4ac8cd7d13561/tests/conftest.py#L50C1-L55C17
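For illustration, the fixtures there are roughly of the following shape; this is only a sketch, and the fixture names, dataset slice, and parameters are assumptions rather than the exact contents of conftest.py:

```python
# Sketch of a conftest.py-style setup: models are built and fitted in fixtures,
# so no test exercises fit_transform() directly.
import pytest
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from bertopic import BERTopic


@pytest.fixture(scope="session")
def documents():
    newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    return newsgroups.data[:1000]


@pytest.fixture(scope="session")
def base_topic_model(documents):
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
    model = BERTopic(umap_model=umap_model, calculate_probabilities=True)
    model.fit(documents)  # fitting happens here, outside any dedicated fit_transform() test
    return model
```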

As such, some improvement is required for the tests to cover the fit_transform() method, the core of the library.
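A dedicated test could instead call fit_transform() directly and assert on its output, for example for the nr_topics reduction scenario described above. The sketch below assumes a documents fixture of raw strings like the one sketched earlier; it is an illustration, not a proposed final test:

```python
from umap import UMAP
from bertopic import BERTopic


def test_fit_transform_with_nr_topics(documents):
    """Sketch of a direct fit_transform() test exercising the nr_topics reduction path."""
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
    topic_model = BERTopic(umap_model=umap_model, nr_topics=5)

    topics, probs = topic_model.fit_transform(documents)

    # One prediction per document, and no more topics than requested
    # (allowing one extra for the -1 outlier topic).
    assert len(topics) == len(documents)
    assert len(set(topics)) <= 5 + 1
```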

Motivation

This is required to systematically test the internal consistency of all features and the overall work pipeline.

Your contribution

I can't tackle this issue yet due to time availability, since I will first need to familiarize myself with the pytest framework. I will come back to this in the future, but I leave the issue open as a reminder, and in case someone else is up for the challenge.

@PipaFlores
Contributor Author

PipaFlores commented Nov 5, 2024

It might be less dramatic than I portray it to be. In the coverage HTML report for _bertopic.py (htmlcov zip file below), most of the pipeline is green, while the _extract_topics() part is red (lines 479-510). I have no clue how the red/blue lines are actually defined, but it might be that only that part is not covered, which involves topic reduction (if needed) -> vectorizer/c-TF-IDF -> representation models.

htmlcov.zip

@MaartenGr
Owner

Thanks for opening up a separate issue and detailing the underlying problem. Coverage is difficult with a modular framework, but this really should have been captured by the tests since it is core functionality. I have limited time currently (most of my time is spent answering issues and reviewing PRs) but might shift the balance in the future.

@SSivakumar12
Contributor

Hi, I would be interested in picking this up, if that is okay? I have experience with the pytest framework, which should help with this issue.

@SSivakumar12
Contributor

@PipaFlores @MaartenGr I started doing some investigation into what needs to be done. Running pytest -vv tests/ --cov=bertopic --cov-report term-missing (shared for visibility) shows which lines in the codebase haven't been tested (see image below). For the most part, _bertopic.py is well tested. That being said, it is evident that there are quite a few lines that don't have an equivalent test, primarily due to if/else logic which in some cases is nested. If we focus on the fit_transform() method, notably lines 431-437, 446, 470-471, 479-488, 496, and 510, you can see that those areas aren't covered by an explicit test, but this can be done through mocking and patching with realistic values (see here for a high-level introduction to mocking functions in pytest) and testing that the correct arguments are passed in, for instance.
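As a rough illustration of the mocking idea, a test could spy on one of the private steps (for example _extract_topics(), mentioned earlier in this thread) and check that fit_transform() reaches it; this sketch assumes the private method keeps its current name and that a documents fixture of raw strings is available:

```python
from unittest import mock

from bertopic import BERTopic


def test_fit_transform_reaches_extract_topics(documents):
    topic_model = BERTopic()

    # autospec + side_effect lets the real method run while recording the call,
    # so the rest of fit_transform() still behaves normally.
    with mock.patch.object(
        BERTopic, "_extract_topics", autospec=True, side_effect=BERTopic._extract_topics
    ) as spy:
        topics, _ = topic_model.fit_transform(documents)

    spy.assert_called_once()
    assert len(topics) == len(documents)
```

The same pattern can inspect spy.call_args to check which arguments were passed in.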

What I am proposing as success criteria is to improve the code coverage of _bertopic.py only, notably within the fit_transform() method. Is this a reasonable statement and within the spirit of the original issue?

While I would like to say that we could use the PR for this issue to improve the coverage of the other files as well, I am aware of scope creep, so I will save that for another PR.

Let me know if these success criteria are reasonable and, if so, I will put together an implementation 😊

[Image: terminal output of the pytest coverage report (term-missing) for bertopic]

@MaartenGr
Owner

@SSivakumar12 Thank you for considering picking this up!

What I am proposing as success criteria is to improve the code coverage of _bertopic.py only, notably within the fit_transform() method. Is this a reasonable statement and within the spirit of the original issue?

That sounds good to me. The codebase is growing rather quickly, but I think the core of the code and potential bugs are found within .fit_transform.

One thing to note is that unit tests in topic modeling can be a bit more complex since there is no ground truth nor an easy objective way to identify if the output is "correct". We might have to check things like whether the topic names belong to the correct IDs by looking into the documents themselves, which could complicate things.
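As a rough sketch of such a check (reusing the hypothetical fixtures from the conftest.py sketch above), one could at least verify that some of each topic's top words actually occur in the documents assigned to that topic:

```python
def test_topic_words_occur_in_assigned_documents(base_topic_model, documents):
    """Soft correctness check: a topic's top words should come from its own documents."""
    topics = base_topic_model.topics_
    for topic_id in set(topics) - {-1}:
        top_words = [word for word, _ in base_topic_model.get_topic(topic_id)]
        topic_docs = [doc.lower() for doc, topic in zip(documents, topics) if topic == topic_id]
        # At least one of the c-TF-IDF top words should appear in the topic's documents.
        assert any(word in doc for word in top_words for doc in topic_docs)
```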

Also, I have been caching as much as I could to increase the speed of the tests but they are still slower than I'd like. Perhaps there is some optimization there that might be interesting to look into, but that's of course up to you.

@PipaFlores
Contributor Author

For the 'standard' procedure of fit_transform(), couldn't we just use the same sub-sample of 20newsgroups and use a seed for UMAP?
From my understanding, UMAP is the only non-deterministic algorithm in the pipeline. I might be wrong, of course.

In my head, I was thinking whether we could have a seeded topic model with the current BERTopic version (a sort of snapshot) and use that as a baseline for comparison. Features that are not aimed at changing the outcomes (but only fixing bugs, optimizing, or adding new parallel methods) should align with this snapshot, and features that purposely aim to improve outcomes will have a standard baseline for testing.

The 20newsgroups dataset is already used in the paper as a benchmark and could remain as this baseline (as the closest, best-validated outcome we have), unless there is another, easier benchmark to include (but in my head this would be unnecessary effort).
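A minimal sketch of that snapshot idea, assuming the seeded fixtures sketched earlier and a hypothetical baseline file captured once from a known-good BERTopic version:

```python
import json
from pathlib import Path

# Hypothetical baseline captured once from a seeded, known-good run.
SNAPSHOT_PATH = Path(__file__).parent / "snapshots" / "newsgroups_topics.json"


def test_topics_match_snapshot(base_topic_model):
    expected = json.loads(SNAPSHOT_PATH.read_text())
    assert base_topic_model.topics_ == expected["topics"]
```

One caveat: such exact comparisons can be fragile when dependency versions change, so they would likely need to be paired with looser checks.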

@SSivakumar12
Contributor

Hi @PipaFlores, to answer your question:

couldn't we just use the same sub-sample of 20newsgroups and use a seed for UMAP?

Looking at the conftest.py file (the file pytest loads before the tests are executed) for the various versions of BERTopic models that are tested, it looks like UMAP has a seed set (through the random_state parameter). You may also notice that the HDBSCAN models have a random_state parameter set as well, since HDBSCAN is partially non-deterministic.

I was thinking whether we could have a seeded topic model with the current BERTopic version (a sort of snapshot) and use that as a baseline for comparison

Yeah, that is definitely an approach I could consider. My initial plan was to create individual tests for each method called within .fit_transform (such as an individual test for self._extract_embeddings() only; a test already exists for this, but I'm using it as an example), since I thought this would make it easier to isolate functions and therefore develop tests by breaking the unit tests down. However, I am also aware that we would potentially need to create additional models (guided, zero-shot topic modelling) which aren't currently tested.
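As an illustration of that isolation idea, the sketch below swaps the real embedding model for a stub backend so no model download is needed; it assumes the private method keeps its current name and accepts a method argument, and that a documents fixture of raw strings exists:

```python
import numpy as np
from bertopic import BERTopic
from bertopic.backend import BaseEmbedder


class RandomEmbedder(BaseEmbedder):
    """Deterministic stand-in backend so the test never downloads a real model."""

    def embed(self, documents, verbose=False):
        rng = np.random.default_rng(42)
        return rng.random((len(documents), 384))


def test_extract_embeddings_shape(documents):
    topic_model = BERTopic(embedding_model=RandomEmbedder())
    embeddings = topic_model._extract_embeddings(documents, method="document")
    assert embeddings.shape == (len(documents), 384)
```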

@SSivakumar12
Contributor

@MaartenGr Regarding this point:

Also, I have been caching as much as I could to increase the speed of the tests but they are still slower than I'd like. Perhaps there is some optimization there that might be interesting to look into, but that's of course up to you.

This was actually something we considered in my previous project at work, where we were playing around with a pytest plug-in known as pytest-xdist. The idea with this package is that it parallelises tests by distributing them across the available CPUs. We found that it does take a bit of time to set up, but once it gets going it is very quick. In the end we elected not to use it, since our testing suite without pytest-xdist took about 35 seconds, so it wasn't going to make a massive difference anyway. However, with the bertopic tests I think it may have some potential, since you could do things like manually group similar tests together, and I think this could help squeeze some performance out.

However, I think it is one of those things that would be better to test in the GitHub Actions pipeline, since resources on a local machine can vary, so it would be difficult to determine whether it produces a tangible benefit, if that makes sense. Looking at GitHub Actions and the image you are using to run the workflows, I think the specs are 4 CPUs + 16 GB RAM, and the average runtime in GitHub Actions is 10-11 minutes.
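For reference, the grouping idea could look roughly like the sketch below, relying on pytest-xdist's xdist_group marker together with its loadgroup scheduler; test and group names are illustrative:

```python
# Run with:  pytest -n auto --dist loadgroup tests/
# Tests sharing an expensive fitted-model fixture are pinned to the same worker,
# so the per-worker session fixture is only built once for the group.
import pytest


@pytest.mark.xdist_group(name="base_model")
def test_topics_match_snapshot(base_topic_model):
    ...  # body as sketched earlier in the thread


@pytest.mark.xdist_group(name="base_model")
def test_topic_words_occur_in_assigned_documents(base_topic_model, documents):
    ...  # body as sketched earlier in the thread
```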

@MaartenGr
Owner

@SSivakumar12

This was actually something we considered in my previous project at work, where we were playing around with a pytest plug-in known as pytest-xdist. The idea with this package is that it parallelises tests by distributing them across the available CPUs. We found that it does take a bit of time to set up, but once it gets going it is very quick. In the end we elected not to use it, since our testing suite without pytest-xdist took about 35 seconds, so it wasn't going to make a massive difference anyway. However, with the bertopic tests I think it may have some potential, since you could do things like manually group similar tests together, and I think this could help squeeze some performance out.

However, I think it is one of those things that would be better to test in the GitHub Actions pipeline, since resources on a local machine can vary, so it would be difficult to determine whether it produces a tangible benefit, if that makes sense. Looking at GitHub Actions and the image you are using to run the workflows, I think the specs are 4 CPUs + 16 GB RAM, and the average runtime in GitHub Actions is 10-11 minutes.

That seems like an interesting approach and might indeed speed things up here and there. I'm a bit hesitant to add more overhead with respect to maintenance of the testing pipeline. I'm okay if things are a bit slower as long as they are not unreasonably slow. If there are simple and easy tricks we can use, I want to go for them, but the moment things get more complex we have to think about how we (typically me) are going to maintain things.

For the 'standard' procedure of fit_transform(), couldn't we just use the same sub-sample of 20newsgroups and use a seed for UMAP?
From my understanding, UMAP is the only non-deterministic algorithm in the pipeline. I might be wrong, of course.

Note that with updating Python versions and dependencies, it is unlikely that we would always get the exact same results. Just by updating numpy, you might unfortunately get different results, in my experience. That's what makes testing so difficult: reproducing results can only be done in the exact same environment, whilst we instead opt for the latest versions of dependencies (which update continuously).
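One way to hedge against that, sketched below with the hypothetical fixtures from earlier in the thread, is to assert structural invariants that should hold regardless of dependency versions, rather than exact topic assignments:

```python
def test_fit_transform_structural_invariants(base_topic_model, documents):
    """Checks that should survive dependency upgrades, unlike exact snapshot comparisons."""
    topics = base_topic_model.topics_

    # One topic assignment per document.
    assert len(topics) == len(documents)

    # The topic info table agrees with the per-document assignments.
    info = base_topic_model.get_topic_info()
    assert set(topics) == set(info.Topic)
    assert info.Count.sum() == len(documents)
```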
