Big refactorization #43

Merged
aleixalcacer merged 18 commits into main from dev on Nov 8, 2024

Conversation

aleixalcacer (Owner)

No description provided.

wangzcl and others added 5 commits October 8, 2024 11:09

+ User API changes: 1, 2, 3, 4, 15, 19
+ Developer level changes: 2, 5, 6, 9, 18
+ Behavior changes: 3, 8, 10, 11, 12
+ Sklearn Compatibility: 2, 3, 4
+ Performance Improvement: 8, 13, 14, 15
+ Orthogonal projection onto unit simplex: 15
+ Cython: 15, 16
+ Docstrings: 7, 8
+ Pytest: 17

1. All the parameters of `AA` have to be provided as keyword arguments except `n_archetypes`, following sklearn practice.
2. Add `AA` attributes:
    + `n_archetypes_` (actual number of archetypes after fitting)
    + `n_iter_` (actual number of iterations performed)
    + `rss_` (or `reconstruction_error_`), available after fitting.
3. All the parameter type checks are moved to `AA`'s class attribute `_parameter_constraints`. Parameters are now saved to attributes as is and are automatically checked at fit time (with sklearn's `@_fit_context`) instead of at instance initialization. For reproducibility, `random_state` is saved as is, and a new random generator is created at fitting (see the parameter-constraints sketch after this list).
4. Functions are no longer allowed as the `init` kwarg. Only strings are accepted, as passing a function as a kwarg is uncommon for sklearn estimators.
5. Deleted: `AA._compute_archetypes` and `AA._loss`. These computations are implemented more efficiently by using in-place matrix calculations in `fit_transform` and other functions.
6. Changed `nnls` to match what is in `AA`: the default kwarg name is `max_iter_optimizer` and its default value is 100.
7. Match the docstring of `AA` with the actual code, e.g. the mention of sparse matrix support is removed.
8. Match the actual code with the docstring: a copy is created if the data array is not C-contiguous (also for performance).
9. As in `sklearn.decomposition.NMF`, although all the optimizing methods have to do alternating optimization, we do not constrain `A_optimize` and `B_optimize` to be class methods. Instead, we do the alternating optimization manually in `nnls_fit_transform` and `pgd_fit_transform`.
To implement a new optimizing method xx (see the method-registration skeleton after this list):
    + Implement `xx_fit_transform(X, A, B, archetypes, *, max_iter, tol, verbose, **kwargs)`, where `A`, `B` and `archetypes` are the initialized $A_0$, $B_0$ and $B_{0}X$. This function should return the optimized A, B and archetypes, the number of iterations `i`, a list of loss values, and a bool convergence flag.
    + Implement `xx_transform(X, archetypes, *, max_iter, tol, **kwargs)` that returns a matrix A.
    + Register the method in `_parameter_constraints` and in the if-else branches of `AA.transform` and `AA.fit_transform`: `if self.method == "xx": ...`
    The benefits of doing so:
    + Avoids complicated inheritance in the `AABase_3` and `AA_3` classes
    + Separates optimization from initialization
    + Allows flexibility and customization across optimizing methods
    + Makes it easier to put the alternating minimization in a tight loop to improve performance (e.g. easier to do in-place calculations)
    + Introduces fewer unnecessary attributes to `AA`.
10. Verbose mode starts printing the RSS before the first iteration step is done.
11. For PGD-like iterations, if there is no improvement in the loss, no descent step is performed, not even a single step of optimizing B.
12. For PGD-like `transform`, multiple iterations are performed so that A converges sufficiently to a local minimum.
13. Memory allocation improvement: all the matrix calculations (product, plus and minus) are performed in place, which saves 5%~30% of the time spent in these calculations on my device. `np.vdot(a, b)` is used instead of `np.sum(a * b)` (see the in-place snippet after this list).
14. In PGD-like optimization, the objective function is evaluated directly as $\|X-ABX\|^2$, not as in the M&H paper, as I found that when `n_samples` is large the latter becomes slower than the former (both are sufficiently optimized).
15. Implemented fast projection onto the unit simplex (`unit_simplex_proj`, Condat 2016 algorithm) in Cython (and also the normalization from the M&H paper, `l1_normalize_proj`). These two functions project the input array in place and return None. I renamed the M&H method "pseudo_pgd", as their normalization is not an "authentic" projection (see the simplex-projection sketch after this list).
16. Modify `pyproject.toml` to make it compatible with Cython. Run `hatch build` to compile `.pyx` to `.c` and `.so` and build the wheel.
17. Add pytest for projection functions.
18. Remove `AABase`. Base classes may be needed later if we implement variants of AA, but until then we are not really sure what should go into the base class and what into the subclass.
19. Multiple inits for `numpy.AA` (return the best one).
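
A minimal parameter-constraints sketch for item 3, assuming sklearn's `Interval`/`StrOptions` helpers and the `@_fit_context` decorator (sklearn internals). The constraint entries, parameter names and defaults below are illustrative assumptions, not the PR's actual `AA` code:

```python
# Illustrative sketch only: constraint entries, parameter names and defaults
# are assumptions, not the actual AA implementation in this PR.
from numbers import Integral, Real

from sklearn.base import BaseEstimator, TransformerMixin, _fit_context
from sklearn.utils import check_random_state
from sklearn.utils._param_validation import Interval, StrOptions


class AA(TransformerMixin, BaseEstimator):
    # Declarative type/range checks, validated by sklearn at fit time.
    _parameter_constraints: dict = {
        "n_archetypes": [Interval(Integral, 1, None, closed="left")],
        "max_iter": [Interval(Integral, 1, None, closed="left")],
        "tol": [Interval(Real, 0, None, closed="left")],
        "init": [StrOptions({"uniform", "furthest_sum"})],  # assumed option names
        "random_state": ["random_state"],
    }

    def __init__(self, n_archetypes=4, *, max_iter=300, tol=1e-4,
                 init="uniform", random_state=None):
        # Parameters are stored as-is; no validation or RNG creation here.
        self.n_archetypes = n_archetypes
        self.max_iter = max_iter
        self.tol = tol
        self.init = init
        self.random_state = random_state

    @_fit_context(prefer_skip_nested_validation=True)
    def fit_transform(self, X, y=None):
        # _fit_context validates the parameters against _parameter_constraints,
        # then a fresh generator is derived from the stored random_state.
        rng = check_random_state(self.random_state)
        ...
```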
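
A method-registration skeleton for item 9. Only the signatures and return values follow the convention described above; the update steps are placeholders, not the PR's actual `nnls`/`pgd` code:

```python
import numpy as np

# Skeleton only: the update steps are placeholders.
def xx_fit_transform(X, A, B, archetypes, *, max_iter, tol, verbose=False, **kwargs):
    losses = []
    converged = False
    for i in range(1, max_iter + 1):
        # ... alternating minimization: update B, recompute archetypes = B @ X,
        # then update A, preferably with in-place operations ...
        rss = np.linalg.norm(X - A @ archetypes) ** 2  # evaluated directly (item 14)
        losses.append(rss)
        if verbose:
            print(f"Iteration {i}, RSS = {rss:.6f}")
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            converged = True
            break
    return A, B, archetypes, i, losses, converged


def xx_transform(X, archetypes, *, max_iter, tol, **kwargs):
    # ... optimize A for the fixed, fitted archetypes ...
    n_archetypes = archetypes.shape[0]
    A = np.full((X.shape[0], n_archetypes), 1.0 / n_archetypes)
    return A


# Registration then amounts to adding "xx" to the method options in
# _parameter_constraints and a new branch in AA.fit_transform / AA.transform:
#     if self.method == "xx":
#         A, B, archetypes, n_iter, losses, converged = xx_fit_transform(...)
```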
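
The in-place snippet referenced in item 13: a small illustration (with made-up shapes) of reusing preallocated buffers via `out=` and computing the RSS with `np.vdot`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))   # data
A = rng.random((100, 5))    # sample -> archetype weights
B = rng.random((5, 100))    # archetype -> sample weights

archetypes = np.empty((5, 20))
X_hat = np.empty_like(X)
residual = np.empty_like(X)

np.matmul(B, X, out=archetypes)      # archetypes = B @ X, no fresh allocation
np.matmul(A, archetypes, out=X_hat)  # X_hat = A @ B @ X
np.subtract(X, X_hat, out=residual)  # residual = X - X_hat, in place
rss = np.vdot(residual, residual)    # same value as np.sum(residual * residual)
```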
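
The simplex-projection sketch referenced in item 15: a plain NumPy reference for projecting each row onto the unit simplex. It uses the classic sorting-based rule rather than Condat's algorithm and returns a new array instead of working in place, so it only illustrates the operation that `unit_simplex_proj` performs:

```python
import numpy as np

def simplex_proj_rows(V):
    """Project each row of V onto the probability simplex {a >= 0, sum(a) = 1}.

    Sorting-based reference; the PR's `unit_simplex_proj` uses Condat's (2016)
    algorithm in Cython and modifies its input in place.
    """
    n_rows, n_cols = V.shape
    U = np.sort(V, axis=1)[:, ::-1]                  # each row sorted descending
    css = np.cumsum(U, axis=1) - 1.0                 # cumulative sums minus the target 1
    k = np.arange(1, n_cols + 1)
    rho = np.count_nonzero(U - css / k > 0, axis=1)  # last index where condition holds
    theta = css[np.arange(n_rows), rho - 1] / rho    # per-row threshold
    return np.maximum(V - theta[:, None], 0.0)


# Example: every projected row is non-negative and sums to 1.
A = np.random.default_rng(0).normal(size=(5, 3))
A_proj = simplex_proj_rows(A)
assert np.all(A_proj >= 0) and np.allclose(A_proj.sum(axis=1), 1.0)
```
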
@aleixalcacer changed the title from "New features" to "Big refactorization" on Oct 8, 2024
@aleixalcacer (Owner, Author)

@wangzcl can you take a look at it to see what the problem is? Modify whatever is necessary, don't worry :)

@wangzcl (Collaborator) commented Oct 8, 2024

Let me have a look

@aleixalcacer (Owner, Author)

It seems that when you run `hatch env create dev`, it tries to compile the package dependencies from source. Maybe we could change the GitHub workflows to avoid that.

@wangzcl (Collaborator) commented Oct 8, 2024

I just checked it out. We are good with the workflow. I just need to move numpy 2.0 to the build dependency list

@wangzcl (Collaborator) commented Oct 8, 2024

I noticed a new bug. Let me figure it out.

@aleixalcacer (Owner, Author)

I think it's because of the old version of scipy, which requires numpy<2.0. However, it seems that the `nnls` error has been fixed in scipy's main branch on GitHub.

See: https://github.com/scipy/scipy/blob/main/scipy/optimize/_nnls.py

Maybe we could install scipy from Github. What do you think?

@aleixalcacer (Owner, Author) commented Oct 8, 2024

> I found some new conflicts in my fork. Does GitHub allow plain merge instead of rebase when merging a pull request?

I had bad experiences with merging without doing a rebase first. How do you update your fork?

The problem is definitely scipy. What would you think about using https://github.com/stefanopalmieri/lsqnonneg/blob/master/lsqnonneg.py or https://gist.github.com/jdelafon/b7fdc7a0bae42af56366fc7786cc5d54#file-lsqnonneg-py instead of `scipy.optimize.nnls`?

@wangzcl (Collaborator) commented Oct 8, 2024

Let me think about that. SciPy should be working as I remember that NNLS did pass the test on my own computer. I will double check.

@aleixalcacer (Owner, Author) commented Oct 8, 2024

> Let me think about that. SciPy should be working as I remember that NNLS did pass the test on my own computer. I will double check.

I've updated it to 1.13 (known issue) to confirm that the issue with numpy was related to scipy. That's why the tests are failing.

@wangzcl (Collaborator) commented Oct 8, 2024

Cheers! Fixed with #44 and #45; however, I have noticed another issue that may prevent verbose mode from printing the right iteration step `i`.

@wangzcl (Collaborator) commented Oct 9, 2024

One more thing on the todo list is to update the README for Python 3.9~3.12.

@aleixalcacer (Owner, Author)

> One more thing on the todo list is to update the README for Python 3.9~3.12.

That badge is created automatically from PyPI, using:

"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",

Therefore, when a release is uploaded to PyPI with these new classifiers, the badge will automatically update.

Don't worry about that :)

@aleixalcacer (Owner, Author)

@wangzcl I noticed an error in the refactoring yesterday. When `transform` is run on an already fitted model, it fails. I've improved the tests to cover this. Previously, `fit_transform` was `fit().transform()`. I think it would be good to keep it that way if possible. I'm not sure if it affects performance. Could you take a look?

@wangzcl (Collaborator) commented Oct 16, 2024

I'll take a look at the bug. I did the refactoring following the sklearn NMF source code. They don't do `fit().transform()` for `fit_transform`, so the additional transform step is avoided, making it faster. I think it is a good design pattern, though it does make things look a little bit weird.
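
For reference, a rough sketch of the NMF-style layout being described, where `fit` delegates to `fit_transform` rather than `fit_transform` calling `fit().transform()`. The helper name `_fit_transform_impl` is made up and the bodies are placeholders:

```python
# Sketch of the NMF-style pattern (illustrative; _fit_transform_impl is hypothetical).
class AA:
    def fit(self, X, y=None):
        # compute everything once and discard the returned A
        self.fit_transform(X)
        return self

    def fit_transform(self, X, y=None):
        A, B, archetypes, n_iter, losses, converged = self._fit_transform_impl(X)
        self.archetypes_ = archetypes
        self.n_iter_ = n_iter
        self.rss_ = losses[-1]
        return A  # no extra transform pass over the already-optimized A

    def transform(self, X):
        # only re-optimizes A against the fitted archetypes_
        ...
```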

@aleixalcacer (Owner, Author)

Ready to merge?

@wangzcl (Collaborator) commented Oct 18, 2024

I guess so. Though there is always something we can improve, I think it's good for now at least.

@aleixalcacer merged commit 1fecf52 into main on Nov 8, 2024
12 checks passed
@aleixalcacer deleted the dev branch on November 8, 2024 at 22:25