Sequential CTMRG is slow compared to Python (with PyTorch) #81

Open
Yue-Zhengyuan opened this issue Oct 29, 2024 · 6 comments

@Yue-Zhengyuan

I use the CTMRG algorithm to measure the Heisenberg model ground state obtained from simple update. The algorithm settings are

trscheme = truncerr(1e-8) & truncdim(12)
ctm_alg = CTMRG(; tol=1e-12, miniter=4, maxiter=100, verbosity=3, trscheme=trscheme)
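
For reference, a minimal sketch of how such a run is driven (assuming the standard PEPSKit entry points InfinitePEPS, CTMRGEnv and leading_boundary; exact signatures may differ between versions):

using TensorKit, PEPSKit

# PEPS with physical dimension 2 and bond dimension D = 6
# (random here; in the actual run it is the simple-update ground state)
psi = InfinitePEPS(2, 6)

# random environment with bond dimension χ = 12
env0 = CTMRGEnv(psi, ComplexSpace(12))

# same settings as above
trscheme = truncerr(1e-8) & truncdim(12)
ctm_alg = CTMRG(; tol=1e-12, miniter=4, maxiter=100, verbosity=3, trscheme=trscheme)

env = leading_boundary(env0, psi, ctm_alg)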

The PEPS bond dimension is D = 6, and the environment bond dimension is χ = 12. Starting from a random CTMRGEnv, it takes about 1.3s to perform one CTMRG step:

[ Info: CTMRG init:     obj = +2.310478936455e-09       err = 1.0000e+00
[ Info: CTMRG   1:      obj = +2.532703514106e-01       err = 3.4933131838e-01  time = 4.43 sec
[ Info: CTMRG   1:      obj = +2.532703514106e-01       err = 3.4933131838e-01  time = 0.18 sec
[ Info: CTMRG   2:      obj = +7.808860223845e-01       err = 1.5017305263e-01  time = 1.46 sec
[ Info: CTMRG   2:      obj = +7.808860223845e-01       err = 1.5017305263e-01  time = 0.00 sec
[ Info: CTMRG   3:      obj = +9.380692547858e-01       err = 4.9406274229e-02  time = 1.21 sec
[ Info: CTMRG   3:      obj = +9.380692547858e-01       err = 4.9406274229e-02  time = 0.00 sec
[ Info: CTMRG   4:      obj = +9.836562322063e-01       err = 2.2671232670e-02  time = 1.19 sec
[ Info: CTMRG   4:      obj = +9.836562322063e-01       err = 2.2671232670e-02  time = 0.00 sec
[ Info: CTMRG   5:      obj = +9.991752105702e-01       err = 9.6623034220e-03  time = 1.35 sec
[ Info: CTMRG   5:      obj = +9.991752105702e-01       err = 9.6623034220e-03  time = 0.00 sec
[ Info: CTMRG   6:      obj = +1.004775679484e+00       err = 3.8899763085e-03  time = 1.34 sec
[ Info: CTMRG   6:      obj = +1.004775679484e+00       err = 3.8899763085e-03  time = 0.00 sec
...

However, my own Python implementation (using PyTorch; the projectors are also found from the half-infinite environment) takes only about 0.7s per step, roughly twice the speed of PEPSKit:

iter      svd_diff    time/s
0       2.2900e+01      0.64
1       2.6927e-01      0.64
2       4.4606e-02      0.65
3       8.5458e-03      0.65
4       1.3304e-03      0.66
5       1.8990e-04      0.65
6       2.6252e-05      0.75
...

Here svd_diff is the convergence criterion, calculated as follows (slightly different from the err of PEPSKit; a sketch follows the list):

  • Calculate the singular value spectrum of each CTM tensor before and after the RG step
  • Calculate the 2-norm of the spectrum difference for each CTM tensor
  • Sum them up and divide by 8 * N_row * N_col
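
A minimal sketch of this criterion in plain Julia (LinearAlgebra only; collecting the 8 * N_row * N_col corner and edge tensors into the two lists is assumed to happen elsewhere and is not shown):

using LinearAlgebra

# `old_tensors` and `new_tensors`: matching lists of the 8 * N_row * N_col
# corner and edge tensors (as plain arrays) before and after one RG step
function svd_diff(old_tensors, new_tensors)
    total = 0.0
    for (t_old, t_new) in zip(old_tensors, new_tensors)
        # singular value spectrum of each CTM tensor, reshaped to a matrix
        s_old = svdvals(reshape(t_old, size(t_old, 1), :))
        s_new = svdvals(reshape(t_new, size(t_new, 1), :))
        # pad with zeros in case the two spectra have different lengths
        n = max(length(s_old), length(s_new))
        d = vcat(s_old, zeros(n - length(s_old))) - vcat(s_new, zeros(n - length(s_new)))
        total += norm(d)                    # 2-norm of the spectrum difference
    end
    return total / length(old_tensors)      # divide by 8 * N_row * N_col
end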

I tried to use the functions in PEPSKit to write a simpler version without the fancy autodiff stuff; this improves the speed to about 0.9s per RG step, but it is still slower than PyTorch:

iter      svd_diff      time
1         4.5357e-01    11.248 s
2         2.0233e-01     0.984 s
3         3.5182e-02     0.769 s
4         7.0871e-03     0.952 s
5         1.2966e-03     0.757 s
6         2.1065e-04     0.953 s

So my concern is that the auto-diff machinery from Zygote etc. may cause too much performance overhead for applications that do not use auto-diff of CTMRG.

@Yue-Zhengyuan (Author) commented Oct 29, 2024

For χ = 24 (with χ = 12 as initialization) the difference is more significant:

  • PEPSKit: about 9.7s per step
  • PEPSKit without AD: about 4.5s per step
  • PyTorch: about 3s per step

I wonder how I can look for the bottleneck...

@lkdvos (Member) commented Oct 29, 2024

This is not entirely surprising. We developed this package with a mindset of "let's make it work first", mostly because Zygote poses a (rather large) number of restrictions on the optimizations that you would typically do to make a Julia algorithm faster. Now we are indeed at the stage of thinking about what to optimise, but it is not so straightforward given these restrictions.

I'm more than happy to have a look at your implementation and think about more ways to speed it up, but it's hard to give any kind of answer without further information: I don't know what the other implementations are, what the setup is, whether you ran this in a multithreaded environment or not, ... If you don't feel comfortable sharing publicly, we can also continue the discussion via email.

@Yue-Zhengyuan (Author) commented Oct 29, 2024

For better control of the setup, here I attach only the PEPSKit CTMRG with all the AD stuff removed:

ctmrg_column.txt

The basic idea is still to implement only the left move; the other moves are done by rotating the network by 90 degrees. One more change is that I write the functions to update only one column at a time (which will be used in the full update algorithm), instead of handling all columns at once (schematically as in the sketch below). I haven't figured out yet whether there are some Python settings that may affect performance.
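
Schematically, one step of the sequential scheme looks like this (an illustrative sketch only; left_move_col! and rotate90 are hypothetical placeholders for the routines in the attached file, not PEPSKit functions):

# One sequential CTMRG step: only the left move is implemented, and the other
# three directions reuse it by rotating the whole network by 90 degrees.
function ctmrg_step!(env, network)
    for _ in 1:4                          # four directions
        for col in 1:size(network, 2)     # update one column at a time
            left_move_col!(env, network, col)
        end
        env = rotate90(env)               # rotate environment and network so the
        network = rotate90(network)       # next direction becomes a left move
    end
    return env
end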

@lkdvos (Member) commented Oct 29, 2024

I don't really have enough information to tell you why the Python implementation should have different performance (since I don't really know what you are using in Python). Did you check whether both implementations use one iteration to denote a move in every direction? Otherwise, I would advise running a profiler and seeing if anything stands out.
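
For example, something along these lines with the Profile stdlib (a sketch; env0, psi and ctm_alg stand for the objects from your setup, and ProfileView.jl is an optional extra for a flame graph):

using Profile

leading_boundary(env0, psi, ctm_alg)            # run once so compilation is excluded
Profile.clear()
@profile leading_boundary(env0, psi, ctm_alg)   # profile the CTMRG run itself
Profile.print(; mincount=10)                    # text report; ProfileView.@profview gives a flame graph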

@Yue-Zhengyuan (Author)

Yes, a move includes all four directions for all rows and columns. Actually, I haven't fully removed the Zygote overhead in the Julia version (in the function ctmrg_renormalize_col! I'm still using PEPSKit.renormalize_bottom_corner etc., which first creates a Zygote Buffer); I'll first see if it can be further improved.

@lkdvos (Member) commented Oct 29, 2024

It might be any of a large variety of things; I would refrain from changing anything before you have a profiler view. The bottlenecks are very often not where you expect them to be. In this case, I don't think the Zygote buffers really do a lot, and I would expect the impact of not using in-place operations (because of the need to be AD compatible) to be a much larger factor.
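
To illustrate the constraint (a toy sketch, not PEPSKit code): Zygote cannot differentiate through mutation of ordinary arrays, so AD-compatible code either works out-of-place or goes through Zygote.Buffer, both of which allocate more than a genuinely in-place implementation.

using Zygote

# AD-compatible "in-place" style: write into a Zygote.Buffer and copy it back.
# This is differentiable, but it allocates the buffer and the final copy, whereas
# a truly in-place version would reuse existing storage (and break Zygote).
function scale_buffer(x, a)
    y = Zygote.Buffer(x)
    for i in eachindex(x)
        y[i] = a * x[i]
    end
    return copy(y)    # the Buffer must be copied back to a regular array
end

gradient(a -> sum(scale_buffer(ones(3), a)), 2.0)   # (3.0,)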
