
error upgrading to 0.0.25 in windows 10 #214

Open · qodesign opened this issue Jan 30, 2025 · 53 comments · Fixed by #246

Comments
@qodesign

I tried to upgrade to 0.0.25 and got this error:

Building wheels for collected packages: qldpc
Building wheel for qldpc (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for qldpc (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [69 lines of output]
Compiling qldpc\codes\_distance.pyx because it changed.
[1/1] Cythonizing qldpc\codes\_distance.pyx
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
_distance.c
C:\Conda\envs\qldpc-env\include\pyconfig.h(59): fatal error C1083: Cannot open include file: 'io.h': No such file or directory
...
subprocess.CalledProcessError: Command '['C:/Conda/envs/qldpc-env/python.exe', 'build-cython.py']' returned non-zero exit status 1.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for qldpc
Failed to build qldpc
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (qldpc)

I'm running on Windows 10. MinGW is my primary compiler, and -O3 is a valid flag for it, but maybe Cython doesn't pick that up (I get the same error if Visual C++ is in the path).

@perlinm
Collaborator

perlinm commented Jan 31, 2025

Interesting. The error log says that -O3 is not recognized, so maybe pip/setuptools is trying to use a different compiler.

We have two lines of attack:

  1. Installing a compiler + libraries that pip/setuptools will automatically recognize.
  2. Trying to tell pip to use MinGW if it is available.

Trying (1) first: can you try installing Microsoft Visual C++ 14.x with Visual Studio 2022 and see if doing so makes pip install qldpc==0.0.25 work?

Incidentally, the link above does tell me that I should try to use setuptools rather than distutils, and I do use distutils in build-cython.py. So let me also look into that and see if it changes anything.

@perlinm
Collaborator

perlinm commented Jan 31, 2025

I switched to setuptools instead of distutils on the main branch of the repo (now also deployed in v0.0.26), though I doubt that will fix this issue.

@qodesign
Author

I installed Microsoft Visual C++ and got the same error. I then tried to "tell pip" where the C compiler is.
According to Gemini, this is how to do it:

set CC=C:\path\to\gcc.exe
pip install <package_name>

or

On Windows, if you are using Visual Studio, the easiest way is usually to open a "Developer Command Prompt for VS"

They both failed the same way.

I think you might have hard-coded "CC" to "cl", so none of the options know what to do with "cl"...
but I'm not sure.

If there's no fix for this, maybe you can make the Cython part optional so that other parts of the package can be upgraded and used. I think you use Cython to speed up the distance calculations, which is not that critical compared to other functions.

@perlinm
Collaborator

perlinm commented Feb 1, 2025

I presume you didn't literally type set CC=C:\path\to\gcc.exe, but put in the actual path to the gcc compiler? Also, did you follow the instructions here?

I'm willing to skip the fast distance calculations for Windows if necessary, but I want to try a bit more to see if we can get this working. If we can, I'll add documentation to the readme.

@qodesign
Author

qodesign commented Feb 1, 2025

Yes, of course I didn't literally type that (you're forgiven for asking).
I think I've tried everything at this point.

@qodesign
Author

v0.0.27 still has the same problem. Any chance of adding the distance-calculation bypass for Windows?
Did anyone manage to install under Windows?
There's good overlap between the package and what I have for BB codes and other calculations; I'd hate for this Cython issue to be stalling that.

@perlinm
Collaborator

perlinm commented Feb 13, 2025

Sorry for the delay on this. I don't work in Windows, so I haven't gotten around to debugging this issue. If you're okay installing from source, as a temporary workaround I added a no-cython branch of the repo, which removes the dependency on cython. You can install with something like

git clone --branch no-cython [email protected]:Infleqtion/qLDPC.git
pip install -e qLDPC

@qodesign
Author

Thanks, this installed without problems, and the find_bbcode_layouts example ran without issues.

I don't think my software skills are enough to debug the Cython problem, so hopefully you can also create a no-cython version for future releases.

@perlinm
Collaborator

perlinm commented Feb 13, 2025

My intention is to make everything work in Windows, with cython. It may just take some time 🙂

In the meantime, I'll periodically merge main into no-cython to keep it up-to-date.

Note, by the way, that there is a breaking change in a recent PR (#234) that was merged into main (and is included in no-cython).

@perlinm
Collaborator

perlinm commented Feb 14, 2025

Shot in the dark: @vtomole, have you ever had to get cython code to compile in Windows?

@vtomole
Member

vtomole commented Feb 14, 2025

I have not.

  1. @qodesign are you able to use the Linux subsystem?
  2. @perlinm Why are we using cython?

@qodesign
Author

I don't use the Linux subsystem. I do have CentOS virtual machines, but my main development is under Windows, unfortunately, so I prefer to stay there for now.

@perlinm
Collaborator

perlinm commented Feb 15, 2025

Why are we using cython?

PR #211 added cython code to compute exact code distance by brute force. It's compute-heavy, so much better to dispatch to C (via cython) if possible.

@vtomole
Member

vtomole commented Feb 15, 2025

How slow is the pure Python code compared to the Cython? How much time was spent optimizing the pure Python code before you threw your hands up and used Cython? What were the bottlenecks of the pure Python, etc.?

I'm trying to confirm that we haven't moved to C/C++ bindings too early.

@perlinm
Collaborator

perlinm commented Feb 17, 2025

For "pure python", the biggest bottleneck is essentially that for loops are slow in python. But also the cython code is optimized at the level of minimizing bitwise operations and memory allocations, so the bar for performance is high. Having said that, perhaps jax.jit or numba.jit would be competitive (though I would be surprised). We could test that by re-implementing get_distance_quantum_64 with JAX or numba (ignoring the homogeneous argument, and setting weight_func to a Callable[[int], int] that computes the Hamming weight of a 64-bit integer). Here's an example test for performance:

import time

import numpy as np

from qldpc import codes
from qldpc.objects import Pauli

num_trials = 1000
code = codes.ToricCode(6)

stabilizers = code.get_code(Pauli.X).canonicalized().matrix.view(np.ndarray).astype(np.uint8)
logical_ops = code.get_logical_ops(Pauli.X).view(np.ndarray).astype(np.uint8)

start = time.time()
for _ in range(num_trials):
    codes.common.get_distance_quantum(logical_ops, stabilizers, homogeneous=True)
print((time.time() - start) / num_trials, "sec/trial")

which for me prints 0.003938684940338135 sec/trial. Another data point: setting num_trials = 1 and code = codes.ToricCode(8) runs in 57.39978384971619 sec/trial.

@qodesign
Author

Just a side note : GAP has a package (QDistRnd) that calculates the distance of a quantum code and from my experience it looks fast and accurate. Maybe you can consider it as an option.

@perlinm
Collaborator

perlinm commented Feb 17, 2025

  1. I believe QDistRnd computes an upper bound on code distance, not the exact code distance.
  2. Either way, distance calculation is a capability that I would like to have in qldpc even when GAP is unavailable.

Having said that, I certainly think QDistRnd would be nice to add as an alternative to QuditCode.get_distance_bound. It may even perform better than the current method for upper-bounding code distance in qldpc (and it would certainly work for a broader class of codes). This is not a high priority of mine, but external contributions are always welcome 🙂. I would also be happy to meet and chat with somebody about how to implement this using the existing GAP interface in qldpc.

@vtomole
Member

vtomole commented Feb 18, 2025

Looking at get_distance_quantum_64, it looks like more functionality can leverage numpy, a package that has been optimized to heck.

This is what the AI is telling me. Let's try it?

[screenshot of the AI-suggested code omitted]

@perlinm
Collaborator

perlinm commented Feb 18, 2025

I am generally bullish on numpy vectorization, but I am skeptical that it would help here.

One immediate issue is memory footprint. To compute the distance of qldpc.codes.ToricCode(8) in my example above, I need to iterate over ~3 items in the outer for loop here, and 2^31 items in the inner for loop here. If each logical_op in those loops is represented by a 64-bit integer, I need ~51 gigabytes to store all logical operators in memory. Repeat this calculation for a 72-qubit code whose distance I had to compute recently, and the memory footprint blows up to ~18 terabytes.
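As a quick check of the ~51 GB figure (assuming the 3 outer-loop items, 2^31 inner-loop items, and 8 bytes per 64-bit word stated above):

```python
# Back-of-the-envelope check of the memory estimate quoted above.
outer_items = 3      # items in the outer for loop
inner_items = 2**31  # items in the inner for loop
word_bytes = 8       # one 64-bit integer per logical_op
footprint_gb = outer_items * inner_items * word_bytes / 1e9
print(f"{footprint_gb:.1f} GB")  # 51.5 GB
```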

A second issue is that even if you have the required memory (in RAM), I am still skeptical that this would be faster. At the moment, each new logical_op that is looped over takes one bitwise XOR to compute, because it "updates" the value of previous logical_op. If I were to compute a logical_op from scratch, it would take tens of bitwise operations. So the total number of bitwise operations in the calculation grows by a factor of >10x (possibly even >100x).
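The one-XOR-per-update scheme described above is a Gray-code enumeration; here is a standalone sketch with toy 4-bit generator masks (made-up values, not qldpc's data):

```python
def ctz(n: int) -> int:
    """Index of the lowest set bit of n (n > 0)."""
    return (n & -n).bit_length() - 1

generators = [0b0011, 0b0110, 0b1100]  # toy, linearly independent bitmasks

current = 0
seen = {current}
for step in range(1, 2 ** len(generators)):
    current ^= generators[ctz(step)]  # exactly one XOR per new combination
    seen.add(current)

print(len(seen))  # 8: all 2**3 XOR-combinations, each reached with one XOR
```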

@vtomole
Member

vtomole commented Feb 18, 2025

Does the memory footprint issue apply to cython as well?

@perlinm
Collaborator

perlinm commented Feb 18, 2025

No it does not. That memory footprint is the cost of doing everything in the for loop at once (e.g., in a vectorized manner), rather than incrementally.

@vtomole
Member

vtomole commented Feb 18, 2025

I'm at a loss then. Looks like we need to dev with cython on Windows. A good first step is adding a Windows check to this repo.

@vtomole
Member

vtomole commented Feb 18, 2025

JAX or numba

Let's see how they do.

@qodesign
Author

Another side note: see this issue for another C implementation of deterministic distance calculation:
QEC-pages/QDistRnd#49
Also, consider adding support for subsystem codes (it should be an incremental addition if you have a solution for subspace codes).

@perlinm
Collaborator

perlinm commented Feb 18, 2025

Re: trying JAX or numba, I made a quick attempt in the distance-test branch, inside the [repo_root]/testing directory. Numba seems to be competitive for SurfaceCode(6), but is slower by a factor of ~2 for SurfaceCode(8) (with num_trials = 1)... Maybe the numba implementation can be optimized further.

Re: subsystem code support, this is something I am currently working on. Unfortunately it's not just an incremental addition, but it's in the works. The plan is to define codes by a matrix of gauge group generators (as opposed to stabilizer group generators). This turns out to require a bunch of minor changes in various places, but more importantly it requires (a) an added method to compute and separate out stabilizer generators from "gauge logicals", and (b) a calculation of logical Paulis from the gauge generator matrix. And this needs to be done for both the QuditCode and the CSSCode class.

@vtomole
Member

vtomole commented Feb 20, 2025

One immediate issue is memory footprint.

From @richrines1: Batching will reduce the memory footprint.

@perlinm
Collaborator

perlinm commented Feb 26, 2025

@qodesign could I ask you to again try git pull on no-cython, and share the output you get from running python build-cython.py in the root directory of the project? I think you should be able to share a text file on github.

@qodesign
Author

qodesign commented Feb 26, 2025

git clone --branch no-cython [email protected]:Infleqtion/qLDPC.git
pip install -e qLDPC

doesn't seem to include cython; I installed it and ran the command python build-cython.py
log attached

log.txt

@perlinm
Collaborator

perlinm commented Feb 26, 2025

Oh! It looks like the cython code was compiled! Can you now switch to main branch and try to compute code distance? Something like

import qldpc

code = qldpc.codes.RepetitionCode(3)
code._exact_distance = None
print(code.get_code_params())

Regarding

doesn't seem to include cython

That is correct behavior: cython is only installed by qldpc temporarily for a pip install, and subsequently removed. I will add cython explicitly to qldpc[dev] in #244.

I am still confused, though, about why the log you shared suggests that the cl compiler is used, yet it is still being passed the -O3 -march=native arguments. So this is not working properly.

@qodesign
Author

v0.0.28 installs OK, so the problem seems to be fixed.
I tried the example and it ran without a problem.
What would be a "stress test" for the distance calculation: the largest code you can get the distance for in ~5 minutes?

@perlinm
Collaborator

perlinm commented Feb 26, 2025

I'm not sure of a good stress test. The runtime is exponential in block length and logical qubit number, so runtime goes from < 1 sec to > 1 hour pretty quickly. You could try the tests here I suppose.

I would still like to close the loop on identifying the cl compiler and passing it correct arguments, though. What do you get from running the following?

import sysconfig
print(sysconfig.get_config_var("CC"))

@qodesign
Author

qodesign commented Feb 26, 2025

I get "None" returned.
(I have this in the batch file)
set PATH=C:\Conda\Scripts;C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin;C:\msys\bin;C:\CMake\bin;
set CC=C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin\gcc.exe

As a check, echo %CC% returns the expected value, so the variable is defined:
C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin\gcc.exe

@perlinm
Collaborator

perlinm commented Feb 26, 2025

Okay, so there are two issues:

  1. sysconfig.get_config_var("CC") is not retrieving the "CC" variable.
  2. cython is using cl instead of gcc (or MinGW), even though you explicitly tell it to.

For issue (1), what if you try the following?

import os
print(os.environ.get("CC"))

@perlinm
Collaborator

perlinm commented Feb 26, 2025

One more thing to try:

import distutils.ccompiler
distutils.ccompiler.new_compiler().compiler_type

@qodesign
Author

(I added a print statement.) It says msvc.
So it's picking up the Microsoft compiler?

@perlinm
Collaborator

perlinm commented Feb 26, 2025

Which one says msvc? distutils.ccompiler or os.environ.get?

@qodesign
Author

I made "CC" a global environment variable instead of setting it in the batch file. This seems to make a difference; now I get:

import os
print(os.environ.get("CC"))
import distutils.ccompiler
print(distutils.ccompiler.new_compiler().compiler_type)

C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin\gcc.exe
msvc

@perlinm
Collaborator

perlinm commented Feb 26, 2025

Hmm... so you can set CC, but it does not seem to be used by distutils.ccompiler.new_compiler()...

What does distutils.ccompiler.new_compiler().compiler print?

Also, now that you can set the CC environment variable, what do you get from python build-cython.py? Does the output still suggest that it is using cl instead of gcc?

@qodesign
Author

Now python build-cython.py gives a new error:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\q1.py", line 41, in <module>
    build_cython()
  File "C:\Users\user\Desktop\q1.py", line 25, in build_cython
    ext_modules = cythonize(extension, compiler_directives={"language_level": "3"})
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Conda\envs\qldpc-v27a\Lib\site-packages\Cython\Build\Dependencies.py", line 1010, in cythonize
    module_list, module_metadata = create_extension_list(
    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Conda\envs\qldpc-v27a\Lib\site-packages\Cython\Build\Dependencies.py", line 845, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Conda\envs\qldpc-v27a\Lib\site-packages\Cython\Build\Dependencies.py", line 117, in nonempty
    raise ValueError(error_msg)
ValueError: '/**/.pyx' doesn't match any files

v0.0.28 also produced the same error.
Maybe it's strange that I had to install cython for v0.0.28.

@perlinm
Collaborator

perlinm commented Feb 26, 2025

Change of plans. Rather than trying to meticulously detect the compiler to set its arguments, I will just add the option to override the C compiler arguments. As a workaround for possible issues, users can re-build with

python build-cython.py --rebuild /O2

See #245

@perlinm
Collaborator

perlinm commented Feb 26, 2025

As explained here, it is not strange (and is in fact expected) that you had to install cython.

@qodesign
Author

Now build-cython doesn't run at all.
I removed the CC environment variable, and it fails the same way.
The difference between now and what ran before is that I did a pip install cython.
Anyway, I'm confused now; I've tried too many variations.
The main thing is that v0.0.28 installed without problems and seems to be running.
I'll come back to this a bit later.

@perlinm
Collaborator

perlinm commented Feb 26, 2025

I added this line, which I think should fix the last error you shared.

If I could make one final request: can you try pip install . again on the compile-args branch? I want to merge #245 into main, but I don't want it to break installations in Windows... If Windows installation on that branch works, then I think we can close this issue.

@qodesign
Author

A new error running the updated build_cython:

(qldpc-v28) C:\Users\user\Desktop>python build_cython.py
Traceback (most recent call last):
  File "C:\Users\user\Desktop\build_cython.py", line 53, in <module>
    build_cython(*extra_compile_args, rebuild=rebuild)
  File "C:\Users\user\Desktop\build_cython.py", line 27, in build_cython
    cmd.run()
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\command\build_ext.py", line 99, in run
    _build_ext.run(self)
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 365, in run
    self.build_extensions()
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 481, in build_extensions
    self._build_extensions_serial()
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 507, in _build_extensions_serial
    self.build_extension(ext)
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\command\build_ext.py", line 264, in build_extension
    _build_ext.build_extension(self, ext)
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\Cython\Distutils\build_ext.py", line 135, in build_extension
    super(build_ext, self).build_extension(ext)
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 586, in build_extension
    self.compiler.link_shared_object(
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\ccompiler.py", line 759, in link_shared_object
    self.link(
  File "C:\Conda\envs\qldpc-v28\Lib\site-packages\setuptools\_distutils\_msvccompiler.py", line 534, in link
    build_temp = os.path.dirname(objects[0])
                 ~~~~~~~^^^
IndexError: list index out of range

(qldpc-v28) C:\Users\user\Desktop>

...so I can't get to the compiler options

@perlinm
Collaborator

perlinm commented Feb 27, 2025

I have confirmed that qLDPC works in Windows.

Starting with a Windows machine with no software development history (read: clean slate, no development tools installed), I installed:

  1. Visual Studio Code
  2. Visual Studio
  3. Anaconda (I told it to modify my PATH).

Starting from the main branch of qldpc (commit ad5ae849b4facf127b72496d1bfe8a21d87b9e1a), I then ran (in a VSCode terminal)

conda create --name test python=3.12 -y
conda activate test
pip install .

I then tried running the python script

import qldpc
code = qldpc.codes.RepetitionCode(3)
code._exact_distance = None
code.get_code_params()

which gave the expected output, (3, 1, 3). I even tried making local changes to qldpc/codes/_distance.pyx (namely, adding a print statement) and running python build-cython.py --rebuild (after a pip install .[dev] to get cython). Running the code distance calculation test script above then behaved as expected with the changes to _distance.pyx.

My conclusion is that Windows is supported, and that any issues you are having @qodesign are likely issues with your development environment. I am happy to help you resolve those over time (we can continue talking on this issue page on github), but as far as qldpc is concerned I think this issue is closed. Thanks for the debugging help!

@perlinm perlinm closed this as completed Feb 27, 2025
@perlinm
Collaborator

perlinm commented Feb 27, 2025

Actually, maybe we should add a Windows CI test, as @vtomole I think suggested. Would somebody on the Infleqtion end be able to help with that?

@perlinm
Collaborator

perlinm commented Feb 27, 2025

#246 adds a workflow to install qldpc and run all tests in a Windows environment. This workflow ran successfully.

@perlinm
Collaborator

perlinm commented Feb 27, 2025

On further reflection, it doesn't hurt to keep this issue open until @qodesign considers their problem solved, so I'll re-open.

@perlinm perlinm reopened this Feb 27, 2025
@richrines1
Contributor

richrines1 commented Feb 27, 2025

which for me prints 0.003938684940338135 sec/trial. Another data point: setting num_trials = 1 and code = codes.ToricCode(8) runs in 57.39978384971619 sec/trial.

@perlinm are these numbers with the current (cython) implementation?

if so, here's a python/numpy-only implementation that seems to run in ~0.0016 and 23.0 sec/trial for those two experiments on my system:

import numpy as np
import numpy.typing as npt


def rows_to_ints(array, dtype: npt.DTypeLike = np.uint64):  # uint64 default: np.uint is 32-bit on Windows
    array = np.asarray(array, dtype=dtype)
    tsize = array.itemsize * 8
    num_words = int(np.ceil(array.shape[-1] / tsize))

    columns = [
        np.bitwise_or.reduce(
            [
                array[..., -1 - i - tsize * j] << i
                # cap the shift at the word size so low words can't absorb higher bits
                for i in range(min(tsize, array.shape[-1] - tsize * j))
            ]
        )
        for j in range(num_words)
    ]
    return np.vstack(columns[::-1]).T


def popcnt64(
    arr: np.uint | npt.NDArray[np.uint],
    buf: np.uint | npt.NDArray[np.uint] | None = None,
    out: np.uint | npt.NDArray[np.uint] | None = None,
) -> npt.NDArray[np.uint]:
    """Somewhat efficient (vectorized) Hamming weight calculation. Assumes 64-bit (u)ints.

    This is the weak point in the python/numpy implementation (unfortunately numpy<2.0.0 doesn't
    expose processors' `popcnt` instruction).
    """
    buf = np.right_shift(arr, 1, out=buf)
    buf &= 0x5555555555555555
    out = np.subtract(arr, buf, out=out)

    buf = np.right_shift(out, 2, out=buf)
    buf &= 0x3333333333333333
    out &= 0x3333333333333333
    out += buf

    buf = np.right_shift(out, 4, out=buf)
    out += buf
    out &= 0x0F0F0F0F0F0F0F0F

    out *= 0x0101010101010101
    out >>= 56
    return out


def ctz(num: int):
    return int(np.log2(num ^ (num - 1)))


def get_distance_quantum(
    logical_ops,
    stabilizers,
    block_size: int = 14,
    dtype: npt.DTypeLike = np.uint64,
):
    """Distance of a binary quantum code. Accepts any number of bits (though block_size may need to
    be adjusted for multiples of 64 bits)."""

    num_bits = np.shape(logical_ops)[-1]
    int_logical_ops = rows_to_ints(logical_ops, dtype=dtype)
    int_stabilizers = rows_to_ints(stabilizers, dtype=dtype)

    assert int_logical_ops.itemsize == 8

    block_size = min(block_size, len(int_logical_ops) + len(int_stabilizers))

    # Precompute all combinations of first `block_size` operators
    blocked = np.zeros((1, int_logical_ops.shape[-1]), dtype=dtype)
    for op in int_logical_ops[:block_size]:
        blocked = np.vstack([blocked, blocked ^ op])

    # TODO: probably don't need to treat these as separate cases
    if block_size >= len(int_logical_ops):
        # drop the all-zero entry (so we don't consider products of stabilizers with no logical ops)
        blocked = blocked[1:]

        # fill out block with products of some stabilizers
        for op in int_stabilizers[: block_size - len(int_logical_ops)]:
            blocked = np.vstack([blocked, blocked ^ op])

        int_stabilizers = int_stabilizers[block_size - len(int_logical_ops) :]

        buf1, buf2 = np.empty((2, *blocked.shape), dtype=dtype)

        weights = popcnt64(blocked, buf=buf1, out=buf2).sum(-1)
        min_weight = weights.min(initial=num_bits)

        # iterate over all products of remaining stabilizers
        for ll in range(1, 2 ** len(int_stabilizers)):
            blocked ^= int_stabilizers[ctz(ll)]
            weights = popcnt64(blocked, buf=buf1, out=buf2).sum(-1, out=weights)
            min_weight = min(min_weight, weights.min())

    else:
        blocked = np.zeros((1, int_logical_ops.shape[-1]), dtype=dtype)
        for op in int_logical_ops[:block_size]:
            blocked = np.vstack([blocked, blocked ^ op])

        int_logical_ops = int_logical_ops[block_size:]

        buf1, buf2 = np.empty((2, *blocked.shape), dtype=dtype)

        # initially ignore blocked[0] to avoid products of stabilizers with no logical ops
        weights = popcnt64(blocked[1:], buf=buf1[1:], out=buf2[1:]).sum(-1)
        min_weight = weights.min(initial=num_bits)

        for si in range(1, 2 ** len(int_stabilizers)):
            blocked ^= int_stabilizers[ctz(si)]
            weights = popcnt64(blocked[1:], buf=buf1[1:], out=buf2[1:]).sum(-1, out=weights)
            min_weight = min(min_weight, weights.min())

        for ll in range(1, 2 ** len(int_logical_ops)):
            blocked ^= int_logical_ops[ctz(ll)]
            weights = popcnt64(blocked, buf=buf1, out=buf2).sum(-1)
            min_weight = min(min_weight, weights.min())

            for si in range(1, 2 ** len(int_stabilizers)):
                blocked ^= int_stabilizers[ctz(si)]
                weights = popcnt64(blocked, buf=buf1, out=buf2).sum(-1, out=weights)
                min_weight = min(min_weight, weights.min())

    return min_weight
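For intuition about the quantity being minimized above (minimum Hamming weight over all products of stabilizers with at least one logical operator, per the comments in the implementation), here is a tiny self-contained brute-force reference; the 3-bit operators are hypothetical toy values, not qldpc data:

```python
from itertools import product

logical_ops = [0b111]         # one logical operator, as a bitmask
stabilizers = [0b110, 0b011]  # two stabilizer generators, as bitmasks

num_bits = 3
min_weight = num_bits
for lbits in product([0, 1], repeat=len(logical_ops)):
    if not any(lbits):
        continue  # skip products containing no logical ops (the all-zero entry)
    for sbits in product([0, 1], repeat=len(stabilizers)):
        word = 0
        for bit, op in zip(lbits + sbits, logical_ops + stabilizers):
            if bit:
                word ^= op
        min_weight = min(min_weight, bin(word).count("1"))

print(min_weight)  # 1: the coset {111, 001, 100, 010} has a weight-1 element
```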

@perlinm
Collaborator

perlinm commented Feb 28, 2025

@richrines1 Yes, those numbers are pretty much what I get on my laptop with the current cython implementation. I don't think I fully understand your implementation, but it certainly seems nice to have a python/numpy-only version!

Questions:

  1. Am I correct that the memory footprint here is essentially 2**block_size * sizeof(int)?
  2. How is it possible for a pythonic loop like for si in range(1, 2 ** len(int_stabilizers)) to be competitive in terms of speed? This is without any jitting? Maybe I am underestimating what python can do...

Regarding numpy >= 2.0.0, at the moment this is blocked by (I think only) sinter. That should be fixed as soon as stim/sinter have a new release, at which point we can use popcnt.

@perlinm
Collaborator

perlinm commented Feb 28, 2025

This runs on my laptop in ~0.0015 sec/trial and ~24.5 sec/trial respectively on ToricCode(6) and ToricCode(8). Amazing! Also it worked fine with numpy==1.26.4. (nevermind; I misunderstood the comment about numpy<2.0.0) I'm curious now whether this implementation could be made even faster "for free" by, say, jitting with numba.

@richrines1
Contributor

richrines1 commented Feb 28, 2025

Am I correct that the memory footprint here is essentially 2**block_size * sizeof(int)?

basically, though there's another factor of 3 due to the two buffers (we can probably drop these with numpy 2.* though)

How is it possible for a pythonic loop like for si in range(1, 2 ** len(int_stabilizers)) to be competitive in terms of speed? This is without any jitting? Maybe I am underestimating what python can do...

every step in the loop is really doing 2**block_size iterations, so the python overhead is basically negligible - the "real" innermost loop is whatever's going on within the numpy functions, which are mostly compiled/well-optimized C code. python mostly just adds overhead to how fast we step between the numpy calls, which doesn't matter once each call is doing enough work
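that amortization can be seen in a standalone toy (made-up arrays, not the real blocked data): the interpreter executes only a handful of loop steps while numpy does thousands of XORs per step.

```python
import numpy as np

# Toy demo of the amortization described above: 8 interpreter-level steps,
# each XOR-ing a 4096-entry array inside numpy's compiled inner loop.
rng = np.random.default_rng(seed=1)
blocked = rng.integers(0, 2**32, size=2**12, dtype=np.uint64)
ops = rng.integers(0, 2**32, size=8, dtype=np.uint64)

reference = blocked.copy()
for op in ops:      # python overhead: 8 iterations
    blocked ^= op   # compiled work: 4096 XORs per iteration

# Net effect equals a single XOR with the fold of all ops.
assert np.array_equal(blocked, reference ^ np.bitwise_xor.reduce(ops))
print("ok")
```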

Regarding numpy >= 2.0.0, at the moment this is blocked by (I think only) sinter. That should be fixed as soon as stim/sinter have a new release, at which point we can use popcnt.

unfortunately i think we have a few other blockers internally, but it seems like it'd be relatively straightforward to have it use np.bitwise_count if available and otherwise fall back on this hacky implementation?

I'm curious now whether this implementation could be made even faster "for free" by, say, jitting with numba.

definitely curious to see but tbh i'm not sure how much of a difference i'd expect. this mostly works because every numpy call is doing a lot of work (and generally does so pretty efficiently), so if the only difference is that we step between them faster it might not make a huge difference (though it'd maybe decrease the ideal block_size slightly, which could e.g. improve cache efficiency?)

i think the biggest "easy" improvement would come from replacing the popcnt function with a builtin (or hopefully just np.bitwise_count). beyond that, there may be some weird algorithmic optimizations we could make but otherwise i'd suspect we'd have to start replacing the numpy functions to see much improvement - e.g. using loops that do better cache blocking/taking advantage of any available SIMD instructions on the processor/etc (not sure how much of this numpy already does)

@perlinm
Collaborator

perlinm commented Feb 28, 2025

Hmm, I think that makes sense.

I think we can just use the code inside the else block, right? The for ll in range(1, 2 ** len(int_logical_ops)) loop will be empty because if initially block_size >= len(int_logical_ops), then after setting int_logical_ops = int_logical_ops[block_size:] we'll be left with len(int_logical_ops) == 0.

UPDATE: nvm, I missed the "fill out block with products of some stabilizers" part. So maybe it's still worth being clever.

UPDATE 2: oh but we could maybe initially do something like

blocked = np.zeros((1, int_logical_ops.shape[-1]), dtype=dtype)
for op in int_logical_ops[:block_size]:
    blocked = np.vstack([blocked, blocked ^ op])

for op in int_stabilizers[: max(block_size - len(int_logical_ops), 0)]:
    blocked = np.vstack([blocked, blocked ^ op])

int_stabilizers = int_stabilizers[max(block_size - len(int_logical_ops), 0) :]
int_logical_ops = int_logical_ops[block_size:]

but I need to think more carefully about how to avoid the logical identity operator...

In any case, let's move this discussion to #248
