rm_corr error LinAlgError: SVD did not converge #440

Open
shoffm opened this issue Sep 10, 2024 · 7 comments
Labels
question 🙋 Further information is requested

Comments

@shoffm

shoffm commented Sep 10, 2024

Hello,

I am trying to run rm_corr on multiple columns of a dataframe (gene expression data). The function works well on many column pairs, and its output matches that of the R rmcorr function, but one pair of columns throws the error LinAlgError: SVD did not converge. The same pair runs without trouble in the R rmcorr implementation. I am therefore curious what the difference between the two implementations is, and whether it is possible to get this to converge in Python. I would prefer to keep using your implementation, as it seems to be much faster (I am in the midst of benchmarking how both implementations scale, so as a side note, if you have any data on that I would very much appreciate it!).

I am using Pingouin v0.5.5. I have attached a minimal dataframe to recreate my error, along with the code below:

# Load the attached CSV into a DataFrame
import pandas as pd
import pingouin as pg

dataframe = pd.read_csv("df_pingouin_fail.csv")
pg.rm_corr(data=dataframe, x="Gene1", y="Gene2", subject="Subject")

Dataframe:
df_pingouin_fail.csv

Thanks so much for your help!
Best,
Sophie

@raphaelvallat
Owner

Hi Sophie,

Thanks for opening the issue. The Pingouin implementation is based on an ANCOVA model that is implemented with statsmodels.
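
For readers comparing the two implementations, here is a minimal sketch of the quantity that such an ANCOVA (one intercept per subject, one common slope) yields. It is not Pingouin's code: the function name `rm_corr_sketch` is hypothetical, and it assumes the standard repeated-measures correlation definition, under which the coefficient equals the Pearson correlation of the within-subject-centered values. Because it uses only sums and no matrix decomposition, it may be useful as a cross-check on a column pair that fails.

```python
import numpy as np


def rm_corr_sketch(x, y, subject):
    """Repeated-measures correlation via within-subject centering.

    Hypothetical helper for illustration, not part of Pingouin.
    Returns (r, dof), where dof = n_obs - n_subjects - 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    subject = np.asarray(subject)

    xc = x.copy()
    yc = y.copy()
    subjects = np.unique(subject)
    for s in subjects:
        mask = subject == s
        # Remove each subject's own mean (the per-subject intercept)
        xc[mask] -= x[mask].mean()
        yc[mask] -= y[mask].mean()

    # Pearson correlation of the centered values (their means are zero)
    r = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    dof = len(x) - len(subjects) - 1
    return r, dof
```

Subjects with a single observation contribute only zeros after centering, so they affect the degrees of freedom but not the coefficient.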

I am not able to reproduce the error on my machine:

image

I am using Python 3.9 with statsmodels 0.14. What versions of Python, pingouin, and statsmodels are you using?

Second, I noticed that your dataset includes many subjects with only 1 or 2 observations. Do you still get the error if you remove these participants from the dataset?
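
One way to run that check is sketched below. The helper name `drop_sparse_subjects` and the threshold of 3 observations are assumptions for illustration; the column name "Subject" is taken from the example call in this thread.

```python
import pandas as pd


def drop_sparse_subjects(data, subject="Subject", min_obs=3):
    """Keep only rows whose subject has at least `min_obs` observations.

    Hypothetical helper; the min_obs=3 default is an assumption.
    """
    # Count rows per subject, then map each row to its subject's count
    counts = data[subject].map(data[subject].value_counts())
    return data[counts >= min_obs]


# Usage (assumes `dataframe` is already loaded):
# filtered = drop_sparse_subjects(dataframe)
# pg.rm_corr(data=filtered, x="Gene1", y="Gene2", subject="Subject")
```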

image

Thanks,
Raphael

@raphaelvallat raphaelvallat added the question 🙋 Further information is requested label Sep 14, 2024
@shoffm
Author

shoffm commented Sep 16, 2024

Hi Raphael,

Thanks so much for your reply. I am using the following versions (on Linux):

statsmodels 0.14.0
python 3.9.0
pingouin 0.5.5

Which version of pingouin are you using when you don't reproduce the error?

Thanks so much!
Sophie

@raphaelvallat
Owner

Hi,

I am using pingouin 0.5.5 (on Mac), pandas 2.2.2, numpy 1.26.4, statsmodels 0.14.0

@Eric-Kobayashi

Hi Raphael,
I was wondering if the LAPACK dependency of numpy could cause the issue. Would you mind running numpy.show_config() and sharing the results?
Many thanks,
Eric
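
For anyone else comparing environments in this thread, a small sketch that gathers the version details requested so far in one place. The helper name `report_versions` and the package list are assumptions; it uses only the standard library, apart from the optional `numpy.show_config()` call at the end.

```python
import platform
from importlib import metadata


def report_versions(packages=("pingouin", "statsmodels", "pandas", "numpy", "scipy")):
    """Return one line per component: Python first, then each package.

    Hypothetical helper; the default package list is an assumption.
    """
    lines = [f"python {platform.python_version()} ({platform.system()})"]
    for name in packages:
        try:
            lines.append(f"{name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} not installed")
    return lines


if __name__ == "__main__":
    print("\n".join(report_versions()))
    try:
        import numpy
        # Prints the BLAS/LAPACK build details asked about above
        numpy.show_config()
    except ImportError:
        print("numpy is not installed; skipping show_config()")
```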

@raphaelvallat
Owner

Sure thing @Eric-Kobayashi:

Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas64
    openblas configuration: USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.23.dev
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4548835888
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -fno-strict-aliasing
    commands: clang
    linker: ld64
    linker args: -fno-strict-aliasing
    name: clang
    version: 14.0.0
  c++:
    commands: clang++
    linker: ld64
    name: clang
    version: 14.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
Python Information:
  path: /private/var/folders/kx/gw6dssyn19d9qjs9mvh4hkz80000gn/T/cibw-run-5dj358b1/cp39-macosx_x86_64/build/venv/bin/python
  version: '3.9'
SIMD Extensions:
  baseline:
  - SSE
  - SSE2
  - SSE3
  found:
  - SSSE3
  - SSE41
  - POPCNT
  - SSE42
  - AVX
  - F16C
  - FMA3
  - AVX2
  not found:
  - AVX512F
  - AVX512CD
  - AVX512_KNL
  - AVX512_SKX
  - AVX512_CLX
  - AVX512_CNL
  - AVX512_ICL

@Eric-Kobayashi

Hi Raphael,

Thanks for providing this information. It turns out not to be a version issue, but it might be a deeper problem with the numpy.linalg.pinv function.

I was able to run the rm_corr function successfully after shuffling the dataframe. I simulated the reshuffling many times and found that about 8.12% of runs fail to converge. Would you mind running the same test to see whether you can replicate the issue?

pg.rm_corr(data = dataframe.sample(len(dataframe)), x = "Gene1", y = "Gene2", subject = "Subject")
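
The reshuffling test described above can be sketched as a small helper that shuffles the rows repeatedly and counts how often the call raises LinAlgError. The name `shuffle_failure_rate`, the number of runs, and the seed are assumptions for illustration, not the exact simulation used for the figure quoted here.

```python
import numpy as np
import pandas as pd


def shuffle_failure_rate(func, data, n_runs=100, seed=0):
    """Fraction of runs in which func(shuffled_data) raises LinAlgError.

    Hypothetical helper; `func` is any callable taking a DataFrame.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(n_runs):
        # Draw a fresh integer seed so each shuffle is different but reproducible
        state = int(rng.integers(0, 2**32 - 1))
        shuffled = data.sample(frac=1, random_state=state)
        try:
            func(shuffled)
        except np.linalg.LinAlgError:
            failures += 1
    return failures / n_runs


# Usage (assumes pingouin is imported as pg and `dataframe` is loaded):
# rate = shuffle_failure_rate(
#     lambda d: pg.rm_corr(data=d, x="Gene1", y="Gene2", subject="Subject"),
#     dataframe,
# )
```

Only LinAlgError is caught, so any other exception still surfaces instead of being counted as a convergence failure.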

@raphaelvallat
Owner

raphaelvallat commented Oct 4, 2024

Hmm, very strange behavior indeed. I can replicate the error: 5 out of 100 runs of the function on resampled data failed (5% failure rate).
