Improvements to support naive Bayes classification of text documents #401

ablaom · 2021-09-12T22:34:19Z

Currently we wrap NaiveBayes.jl, but only allow tabular input (internally converted to matrix) which limits application to NLP and elsewhere. However NaiveBayes.jl itself supports dictionary input.

There is also a text-specific NaiveBayes classifier in TextAnalysis, which accepts dictionaries (keyed on abstract strings).

I have never used either package seriously but my gut feeling is that there would be little difference in performance when using dictionaries, and would suggest we simply enhance the existing interface at MLJNaiveBayesInterface.jl rather than write a new interface. Also, there would be no reason I see to restrict to abstract string keys - any abstract dictionary with Integer values should be supportable. Such objects have the scientific type Multiset{S} where S is the scitype of the keys. So we could support any Multiset as input.

Another possibility might be to add support for sparse matrices, probably adjoints of julia's SparseArrayCSC matrices (input to MLJ model is n x p by convention). However this requires a small generalisation of the NaiveBayes package (needed anyway) which at the moment only supports concrete Matrix types.

Very happy to hear some different suggestions for improving naive bayes support.

@storopoli Perhaps this is something you would be interested in helping out with?

cc @pazzo83

The text was updated successfully, but these errors were encountered:

storopoli · 2021-09-12T22:53:30Z

Yes, I am definitely interested. I am in the middle of the final push of the Julia Data Science 1st edition final chapter. I cannot help with authoring PRs now. Maybe in 1 month I will have time. But I can give insight and help review some PRs.

One thing that might be interested to discuss is Dict over multi-threading parallel applications. I was fooling around a couple months ago with multi-threaded async webscraping in Julia and I was saving stuff to a Dict to find out that they are not thread-safe. There is https://github.com/wherrera10/ThreadSafeDicts.jl which I don't know how mature was but was a potential solution to my problem. Do we need to worry with thread-safe data representations for NLP stuff either to read or write data?

pazzo83 · 2021-09-14T02:19:10Z

I'd definitely like to support this! I agree the changes we'd need to make the existing interface would be pretty minimal. I can take a look at it.

ablaom · 2021-09-14T23:48:00Z

@storopoli I guess algorithms that use dictionaries as read-only should be fine. So supplying dictionaries to NaiveBayes.jl should be okay.

However, I see that the TextAnalysis version does write to dictionaries, and it seems the dictionaries are used as a way to allow for incremental training (BTW, not yet supported by MLJ). I'm not familiar enough with NLP to know how important this feature is. But it's unlikely this implementation would be thread-safe.

@pazzo83 Be great if you could give NaiveBayes.jl some TLC. The author is not actively maintaining but has been quick to review outside PR's and I have some sort of write access there too, I think. Adding sparse array support is low hanging fruit. But perhaps a quick review to determine if a complete re-design is the better path, if that's something you can assess.

ablaom changed the title ~~Improvements to support naive bayes classification of text documents~~ Improvements to support naive Bayes classification of text documents Sep 12, 2021

ablaom added the Discussion label Sep 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements to support naive Bayes classification of text documents #401

Improvements to support naive Bayes classification of text documents #401

ablaom commented Sep 12, 2021

storopoli commented Sep 12, 2021

pazzo83 commented Sep 14, 2021

ablaom commented Sep 14, 2021 •

edited

Loading

Improvements to support naive Bayes classification of text documents #401

Improvements to support naive Bayes classification of text documents #401

Comments

ablaom commented Sep 12, 2021

storopoli commented Sep 12, 2021

pazzo83 commented Sep 14, 2021

ablaom commented Sep 14, 2021 • edited Loading

ablaom commented Sep 14, 2021 •

edited

Loading