Big refactorization #43

Merged
aleixalcacer merged 18 commits into main from dev on Nov 8, 2024

Conversation

aleixalcacer (Owner)

No description provided.

wangzcl and others added 5 commits October 8, 2024 11:09

+ User API changes: 1, 2, 3, 4, 15, 19
+ Developer level changes: 2, 5, 6, 9, 18
+ Behavior changes: 3, 8, 10, 11, 12
+ Sklearn Compatibility: 2, 3, 4
+ Performance Improvement: 8, 13, 14, 15
+ Orthogonal projection onto unit simplex: 15
+ Cython: 15, 16
+ Docstrings: 7, 8
+ Pytest: 17

1. All the parameters of `AA` have to be provided as keyword arguments except `n_archetypes`, following sklearn practice.
2. Add `AA` attributes:
    + `n_archetypes_` (actual number of archetypes after fitting)
    + `n_iter_` (actual number of iterations performed)
    + `rss_` (or `reconstruction_error_`), available after fitting.
3. All the parameter type checks are moved to `AA`'s class attribute `_parameter_constraints`. Parameters are now saved to attributes as is and are automatically checked at fit time (with sklearn's `@_fit_context`) instead of at instance initialization. For reproducibility, `random_state` is saved as is, and a new random generator is created at fitting (see the parameter-constraints sketch after this list).
4. Functions are no longer allowed as the `init` kwarg. Only strings are accepted, as passing a function as a kwarg is uncommon for sklearn estimators.
5. Deleted: `AA._compute_archetypes` and `AA._loss`. These computations are implemented more efficiently by using in-place matrix calculations in `fit_transform` and other functions.
6. Changed `nnls` to match what is in `AA`: the default kwarg name is `max_iter_optimizer` and its default value is 100.
7. Match the docstring of `AA` with the actual code, e.g. the mention of sparse matrix support is removed.
8. Match the actual code with the docstring: a copy is created if the data array is not C-contiguous (also for performance).
9. As in `sklearn.decomposition.NMF`, although all the optimizing methods have to do alternating optimization, we do not constrain `A_optimize` and `B_optimize` to be class methods. Instead, we do the alternating optimization manually in `nnls_fit_transform` and `pgd_fit_transform`.
To implement a new optimizing method xx (see the method-registration skeleton after this list):
    + Implement `xx_fit_transform(X, A, B, archetypes, *, max_iter, tol, verbose, **kwargs)`, where `A`, `B` and `archetypes` are the initialized $A_0$, $B_0$ and $B_{0}X$. This function should return the optimized A, B and archetypes, the number of iterations `i`, a list of loss values, and a bool convergence flag.
    + Implement `xx_transform(X, archetypes, *, max_iter, tol, **kwargs)` that returns a matrix A.
    + Register the method in `_parameter_constraints` and in the if-else branches of `AA.transform` and `AA.fit_transform`: `if self.method == "xx": ...`
    The benefits of doing so:
    + Avoids complicated inheritance in the `AABase_3` and `AA_3` classes
    + Separates optimization from initialization
    + Allows flexibility and customization across optimizing methods
    + Makes it easier to put the alternating minimization in a tight loop to improve performance (e.g. easier to do in-place calculations)
    + Introduces fewer unnecessary attributes to `AA`.
10. Verbose mode starts printing the RSS before the first iteration step is done.
11. For PGD-like iterations, if there is no improvement in the loss, no descent step is performed, not even a single step of optimizing B.
12. For PGD-like `transform`, multiple iterations are performed so that A converges sufficiently to a local minimum.
13. Memory allocation improvement: all the matrix calculations (product, plus and minus) are performed in place, which saves 5%~30% of the time spent in these calculations on my device. `np.vdot(a, b)` is used instead of `np.sum(a * b)` (see the in-place snippet after this list).
14. In PGD-like optimization, the objective function is evaluated directly as $\|X-ABX\|^2$, not as in the M&H paper, as I found that when `n_samples` is large the latter becomes slower than the former (both are sufficiently optimized).
15. Implemented fast projection onto the unit simplex (`unit_simplex_proj`, Condat 2016 algorithm) in Cython (and also the normalization from the M&H paper, `l1_normalize_proj`). These two functions project the input array in place and return None. I renamed the M&H method "pseudo_pgd", as their normalization is not an "authentic" projection (see the simplex-projection sketch after this list).
16. Modify `pyproject.toml` to make it compatible with Cython. Run `hatch build` to compile `.pyx` to `.c` and `.so` and build the wheel.
17. Add pytest for projection functions.
18. Remove `AABase`. Base classes may be needed later if we implement variants of AA, but until then we are not really sure what should go into the base class and what into the subclass.
19. Multiple inits for `numpy.AA` (return the best one).
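
A minimal parameter-constraints sketch for item 3, assuming sklearn's `Interval`/`StrOptions` helpers and the `@_fit_context` decorator (sklearn internals). The constraint entries, parameter names and defaults below are illustrative assumptions, not the PR's actual `AA` code:

```python
# Illustrative sketch only: constraint entries, parameter names and defaults
# are assumptions, not the actual AA implementation in this PR.
from numbers import Integral, Real

from sklearn.base import BaseEstimator, TransformerMixin, _fit_context
from sklearn.utils import check_random_state
from sklearn.utils._param_validation import Interval, StrOptions


class AA(TransformerMixin, BaseEstimator):
    # Declarative type/range checks, validated by sklearn at fit time.
    _parameter_constraints: dict = {
        "n_archetypes": [Interval(Integral, 1, None, closed="left")],
        "max_iter": [Interval(Integral, 1, None, closed="left")],
        "tol": [Interval(Real, 0, None, closed="left")],
        "init": [StrOptions({"uniform", "furthest_sum"})],  # assumed option names
        "random_state": ["random_state"],
    }

    def __init__(self, n_archetypes=4, *, max_iter=300, tol=1e-4,
                 init="uniform", random_state=None):
        # Parameters are stored as-is; no validation or RNG creation here.
        self.n_archetypes = n_archetypes
        self.max_iter = max_iter
        self.tol = tol
        self.init = init
        self.random_state = random_state

    @_fit_context(prefer_skip_nested_validation=True)
    def fit_transform(self, X, y=None):
        # _fit_context validates the parameters against _parameter_constraints,
        # then a fresh generator is derived from the stored random_state.
        rng = check_random_state(self.random_state)
        ...
```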
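
A method-registration skeleton for item 9. Only the signatures and return values follow the convention described above; the update steps are placeholders, not the PR's actual `nnls`/`pgd` code:

```python
import numpy as np

# Skeleton only: the update steps are placeholders.
def xx_fit_transform(X, A, B, archetypes, *, max_iter, tol, verbose=False, **kwargs):
    losses = []
    converged = False
    for i in range(1, max_iter + 1):
        # ... alternating minimization: update B, recompute archetypes = B @ X,
        # then update A, preferably with in-place operations ...
        rss = np.linalg.norm(X - A @ archetypes) ** 2  # evaluated directly (item 14)
        losses.append(rss)
        if verbose:
            print(f"Iteration {i}, RSS = {rss:.6f}")
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            converged = True
            break
    return A, B, archetypes, i, losses, converged


def xx_transform(X, archetypes, *, max_iter, tol, **kwargs):
    # ... optimize A for the fixed, fitted archetypes ...
    n_archetypes = archetypes.shape[0]
    A = np.full((X.shape[0], n_archetypes), 1.0 / n_archetypes)
    return A


# Registration then amounts to adding "xx" to the method options in
# _parameter_constraints and a new branch in AA.fit_transform / AA.transform:
#     if self.method == "xx":
#         A, B, archetypes, n_iter, losses, converged = xx_fit_transform(...)
```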
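
The in-place snippet referenced in item 13: a small illustration (with made-up shapes) of reusing preallocated buffers via `out=` and computing the RSS with `np.vdot`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))   # data
A = rng.random((100, 5))    # sample -> archetype weights
B = rng.random((5, 100))    # archetype -> sample weights

archetypes = np.empty((5, 20))
X_hat = np.empty_like(X)
residual = np.empty_like(X)

np.matmul(B, X, out=archetypes)      # archetypes = B @ X, no fresh allocation
np.matmul(A, archetypes, out=X_hat)  # X_hat = A @ B @ X
np.subtract(X, X_hat, out=residual)  # residual = X - X_hat, in place
rss = np.vdot(residual, residual)    # same value as np.sum(residual * residual)
```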
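
The simplex-projection sketch referenced in item 15: a plain NumPy reference for projecting each row onto the unit simplex. It uses the classic sorting-based rule rather than Condat's algorithm and returns a new array instead of working in place, so it only illustrates the operation that `unit_simplex_proj` performs:

```python
import numpy as np

def simplex_proj_rows(V):
    """Project each row of V onto the probability simplex {a >= 0, sum(a) = 1}.

    Sorting-based reference; the PR's `unit_simplex_proj` uses Condat's (2016)
    algorithm in Cython and modifies its input in place.
    """
    n_rows, n_cols = V.shape
    U = np.sort(V, axis=1)[:, ::-1]                  # each row sorted descending
    css = np.cumsum(U, axis=1) - 1.0                 # cumulative sums minus the target 1
    k = np.arange(1, n_cols + 1)
    rho = np.count_nonzero(U - css / k > 0, axis=1)  # last index where condition holds
    theta = css[np.arange(n_rows), rho - 1] / rho    # per-row threshold
    return np.maximum(V - theta[:, None], 0.0)


# Example: every projected row is non-negative and sums to 1.
A = np.random.default_rng(0).normal(size=(5, 3))
A_proj = simplex_proj_rows(A)
assert np.all(A_proj >= 0) and np.allclose(A_proj.sum(axis=1), 1.0)
```
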
@aleixalcacer changed the title from "New features" to "Big refactorization" on Oct 8, 2024
@aleixalcacer (Owner, Author)

@wangzcl can you take a look at it to see what the problem is? Modify whatever is necessary, don't worry :)

@wangzcl (Collaborator) commented Oct 8, 2024

Let me have a look

@aleixalcacer (Owner, Author)

It seems that when you run `hatch env create dev`, it tries to compile the package dependencies from source. Maybe we could change the GitHub workflows to avoid that.

@wangzcl (Collaborator) commented Oct 8, 2024

I just checked it out. We are good with the workflow. I just need to move numpy 2.0 to the build dependency list

@wangzcl (Collaborator) commented Oct 8, 2024

I noticed a new bug. Let me figure it out.

@aleixalcacer (Owner, Author)

I think it's because of the old version of scipy, which requires numpy<2.0. However, it seems that the `nnls` error has been fixed in scipy's main branch on GitHub.

See: https://github.com/scipy/scipy/blob/main/scipy/optimize/_nnls.py

Maybe we could install scipy from Github. What do you think?

@aleixalcacer (Owner, Author) commented Oct 8, 2024

> I found some new conflicts in my fork. Does GitHub allow plain merge instead of rebase when merging a pull request?

I had bad experiences with merging without doing a rebase first. How do you update your fork?

The problem is definitely scipy. What would you think about using https://github.com/stefanopalmieri/lsqnonneg/blob/master/lsqnonneg.py or https://gist.github.com/jdelafon/b7fdc7a0bae42af56366fc7786cc5d54#file-lsqnonneg-py instead of `scipy.optimize.nnls`?

@wangzcl (Collaborator) commented Oct 8, 2024

Let me think about that. SciPy should be working as I remember that NNLS did pass the test on my own computer. I will double check.

@aleixalcacer (Owner, Author) commented Oct 8, 2024

> Let me think about that. SciPy should be working as I remember that NNLS did pass the test on my own computer. I will double check.

I've updated it to 1.13 (known issue) to confirm that the issue with numpy was related to scipy. That's why the tests are failing.

@wangzcl (Collaborator) commented Oct 8, 2024

Cheers! Fixed with #44 and #45; however, I have noticed another issue that may prevent verbose mode from printing the right iteration step `i`.

@wangzcl (Collaborator) commented Oct 9, 2024

One more thing on the todo list is to update the README for Python 3.9~3.12.

@aleixalcacer (Owner, Author)

> One more thing on the todo list is to update the README for Python 3.9~3.12.

That badge is created automatically from PyPI, using:

"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",

Therefore, when a release is uploaded to PyPI with these new classifiers, the badge will automatically update.

Don't worry about that :)

@aleixalcacer (Owner, Author)

@wangzcl I noticed an error in the refactoring yesterday. When `transform` is run on an already fitted model, it fails. I've improved the tests to cover this. Previously, `fit_transform` was `fit().transform()`. I think it would be good to keep it that way if possible. I'm not sure if it affects performance. Could you take a look?

@wangzcl (Collaborator) commented Oct 16, 2024

I'll take a look at the bug. I did the refactoring following the sklearn NMF source code. They don't do `fit().transform()` for `fit_transform`, so the additional transform step is avoided, making it faster. I think it is a good design pattern, though it does make things look a little bit weird.
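
For reference, a rough sketch of the NMF-style layout being described, where `fit` delegates to `fit_transform` rather than `fit_transform` calling `fit().transform()`. The helper name `_fit_transform_impl` is made up and the bodies are placeholders:

```python
# Sketch of the NMF-style pattern (illustrative; _fit_transform_impl is hypothetical).
class AA:
    def fit(self, X, y=None):
        # compute everything once and discard the returned A
        self.fit_transform(X)
        return self

    def fit_transform(self, X, y=None):
        A, B, archetypes, n_iter, losses, converged = self._fit_transform_impl(X)
        self.archetypes_ = archetypes
        self.n_iter_ = n_iter
        self.rss_ = losses[-1]
        return A  # no extra transform pass over the already-optimized A

    def transform(self, X):
        # only re-optimizes A against the fitted archetypes_
        ...
```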

@aleixalcacer (Owner, Author)

Ready to merge?

@wangzcl (Collaborator) commented Oct 18, 2024

I guess so. Though there is always something we can improve, I think it's good for now at least.

@aleixalcacer merged commit 1fecf52 into main on Nov 8, 2024
12 checks passed
@aleixalcacer deleted the dev branch on November 8, 2024 at 22:25