Sequential CTMRG is slow compared to Python (with PyTorch) #81

Open
Yue-Zhengyuan opened this issue Oct 29, 2024 · 6 comments

@Yue-Zhengyuan

I use the CTMRG algorithm to measure the Heisenberg model ground state obtained from simple update. The algorithm settings are

trscheme = truncerr(1e-8) & truncdim(12)
ctm_alg = CTMRG(; tol=1e-12, miniter=4, maxiter=100, verbosity=3, trscheme=trscheme)
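
For reference, a minimal sketch of how such a run is driven (assuming the standard PEPSKit entry points InfinitePEPS, CTMRGEnv and leading_boundary; exact signatures may differ between versions):

using TensorKit, PEPSKit

# PEPS with physical dimension 2 and bond dimension D = 6
# (random here; in the actual run it is the simple-update ground state)
psi = InfinitePEPS(2, 6)

# random environment with bond dimension χ = 12
env0 = CTMRGEnv(psi, ComplexSpace(12))

# same settings as above
trscheme = truncerr(1e-8) & truncdim(12)
ctm_alg = CTMRG(; tol=1e-12, miniter=4, maxiter=100, verbosity=3, trscheme=trscheme)

env = leading_boundary(env0, psi, ctm_alg)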

The PEPS bond dimension is D = 6, and the environment bond dimension is χ = 12. Starting from a random CTMRGEnv, it takes about 1.3s to perform one CTMRG step:

[ Info: CTMRG init:     obj = +2.310478936455e-09       err = 1.0000e+00
[ Info: CTMRG   1:      obj = +2.532703514106e-01       err = 3.4933131838e-01  time = 4.43 sec
[ Info: CTMRG   1:      obj = +2.532703514106e-01       err = 3.4933131838e-01  time = 0.18 sec
[ Info: CTMRG   2:      obj = +7.808860223845e-01       err = 1.5017305263e-01  time = 1.46 sec
[ Info: CTMRG   2:      obj = +7.808860223845e-01       err = 1.5017305263e-01  time = 0.00 sec
[ Info: CTMRG   3:      obj = +9.380692547858e-01       err = 4.9406274229e-02  time = 1.21 sec
[ Info: CTMRG   3:      obj = +9.380692547858e-01       err = 4.9406274229e-02  time = 0.00 sec
[ Info: CTMRG   4:      obj = +9.836562322063e-01       err = 2.2671232670e-02  time = 1.19 sec
[ Info: CTMRG   4:      obj = +9.836562322063e-01       err = 2.2671232670e-02  time = 0.00 sec
[ Info: CTMRG   5:      obj = +9.991752105702e-01       err = 9.6623034220e-03  time = 1.35 sec
[ Info: CTMRG   5:      obj = +9.991752105702e-01       err = 9.6623034220e-03  time = 0.00 sec
[ Info: CTMRG   6:      obj = +1.004775679484e+00       err = 3.8899763085e-03  time = 1.34 sec
[ Info: CTMRG   6:      obj = +1.004775679484e+00       err = 3.8899763085e-03  time = 0.00 sec
...

However, my own Python implementation (using PyTorch; the projectors are also found from the half-infinite environment) takes only about 0.7s per step, roughly twice the speed of PEPSKit:

iter      svd_diff    time/s
0       2.2900e+01      0.64
1       2.6927e-01      0.64
2       4.4606e-02      0.65
3       8.5458e-03      0.65
4       1.3304e-03      0.66
5       1.8990e-04      0.65
6       2.6252e-05      0.75
...

Here svd_diff is the convergence criterion, calculated as follows (slightly different from the err of PEPSKit; a sketch follows the list):

  • Calculate the singular value spectrum of each CTM tensor before and after the RG step
  • Calculate the 2-norm of the spectrum difference for each CTM tensor
  • Sum them up and divide by 8 * N_row * N_col
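
A minimal sketch of this criterion in plain Julia (LinearAlgebra only; collecting the 8 * N_row * N_col corner and edge tensors into the two lists is assumed to happen elsewhere and is not shown):

using LinearAlgebra

# `old_tensors` and `new_tensors`: matching lists of the 8 * N_row * N_col
# corner and edge tensors (as plain arrays) before and after one RG step
function svd_diff(old_tensors, new_tensors)
    total = 0.0
    for (t_old, t_new) in zip(old_tensors, new_tensors)
        # singular value spectrum of each CTM tensor, reshaped to a matrix
        s_old = svdvals(reshape(t_old, size(t_old, 1), :))
        s_new = svdvals(reshape(t_new, size(t_new, 1), :))
        # pad with zeros in case the two spectra have different lengths
        n = max(length(s_old), length(s_new))
        d = vcat(s_old, zeros(n - length(s_old))) - vcat(s_new, zeros(n - length(s_new)))
        total += norm(d)                    # 2-norm of the spectrum difference
    end
    return total / length(old_tensors)      # divide by 8 * N_row * N_col
end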

I tried to use the functions in PEPSKit to write a simpler version without the fancy autodiff stuff; this improves the speed to about 0.9s per RG step, but it is still slower than PyTorch:

iter      svd_diff      time
1         4.5357e-01    11.248 s
2         2.0233e-01     0.984 s
3         3.5182e-02     0.769 s
4         7.0871e-03     0.952 s
5         1.2966e-03     0.757 s
6         2.1065e-04     0.953 s

So my concern is that the auto-diff machinery from Zygote etc. may cause too much performance overhead for applications that do not use auto-diff of CTMRG.

@Yue-Zhengyuan (Author) commented Oct 29, 2024

For χ = 24 (with χ = 12 as initialization) the difference is more significant:

  • PEPSKit: about 9.7s per step
  • PEPSKit without AD: about 4.5s per step
  • PyTorch: about 3s per step

I wonder how I can look for the bottleneck...

@lkdvos (Member) commented Oct 29, 2024

This is not entirely surprising. We developed this package with a mindset of "let's make it work first", mostly because Zygote poses a (rather large) number of restrictions on the optimizations that you would typically do to make a Julia algorithm faster. Now we are indeed at the stage of thinking about what to optimise, but it is not so straightforward given these restrictions.

I'm more than happy to have a look at your implementation and think about more ways to speed it up, but it's hard to give any kind of answer without further information: I don't know what the other implementations are, what the setup is, whether you ran this in a multithreaded environment or not, ... If you don't feel comfortable sharing publicly, we can also continue the discussion via email.

@Yue-Zhengyuan (Author) commented Oct 29, 2024

For better control of the setup, here I attach only the PEPSKit CTMRG with all the AD stuff removed:

ctmrg_column.txt

The basic idea is still to implement only the left move; the other moves are done by rotating the network by 90 degrees. One more change is that I write the functions to update only one column at a time (which will be used in the full update algorithm), instead of handling all columns at once (schematically as in the sketch below). I haven't figured out yet whether there are some Python settings that may affect performance.
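
Schematically, one step of the sequential scheme looks like this (an illustrative sketch only; left_move_col! and rotate90 are hypothetical placeholders for the routines in the attached file, not PEPSKit functions):

# One sequential CTMRG step: only the left move is implemented, and the other
# three directions reuse it by rotating the whole network by 90 degrees.
function ctmrg_step!(env, network)
    for _ in 1:4                          # four directions
        for col in 1:size(network, 2)     # update one column at a time
            left_move_col!(env, network, col)
        end
        env = rotate90(env)               # rotate environment and network so the
        network = rotate90(network)       # next direction becomes a left move
    end
    return env
end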

@lkdvos (Member) commented Oct 29, 2024

I don't really have enough information to tell you why the Python implementation should have different performance (since I don't really know what you are using in Python). Did you check whether both implementations use one iteration to denote a move in every direction? Otherwise, I would advise running a profiler and seeing if anything stands out.
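
For example, something along these lines with the Profile stdlib (a sketch; env0, psi and ctm_alg stand for the objects from your setup, and ProfileView.jl is an optional extra for a flame graph):

using Profile

leading_boundary(env0, psi, ctm_alg)            # run once so compilation is excluded
Profile.clear()
@profile leading_boundary(env0, psi, ctm_alg)   # profile the CTMRG run itself
Profile.print(; mincount=10)                    # text report; ProfileView.@profview gives a flame graph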

@Yue-Zhengyuan (Author)

Yes, a move includes all four directions for all rows and columns. Actually, I haven't fully removed the Zygote overhead in the Julia version (in the function ctmrg_renormalize_col! I'm still using PEPSKit.renormalize_bottom_corner etc., which first creates a Zygote Buffer); I'll first see if it can be further improved.

@lkdvos (Member) commented Oct 29, 2024

It might be any of a large variety of things; I would refrain from changing anything before you have a profiler view. The bottlenecks are very often not where you expect them to be. In this case, I don't think the Zygote buffers really do a lot, and I would expect the impact of not using in-place operations (because of the need to be AD compatible) to be a much larger factor.
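
To illustrate the constraint (a toy sketch, not PEPSKit code): Zygote cannot differentiate through mutation of ordinary arrays, so AD-compatible code either works out-of-place or goes through Zygote.Buffer, both of which allocate more than a genuinely in-place implementation.

using Zygote

# AD-compatible "in-place" style: write into a Zygote.Buffer and copy it back.
# This is differentiable, but it allocates the buffer and the final copy, whereas
# a truly in-place version would reuse existing storage (and break Zygote).
function scale_buffer(x, a)
    y = Zygote.Buffer(x)
    for i in eachindex(x)
        y[i] = a * x[i]
    end
    return copy(y)    # the Buffer must be copied back to a regular array
end

gradient(a -> sum(scale_buffer(ones(3), a)), 2.0)   # (3.0,)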
