Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The User can now register its determinant into trexio format #80

Merged
merged 16 commits into from
Nov 14, 2024

Conversation

NastaMauger
Copy link
Contributor

@NastaMauger NastaMauger commented Nov 10, 2024

Hello,

For a specific reason, I needed to register my determinants into TREXIO, so I modified the code to accomplish this.

Since the functions use the MCSCF object and mcscf_to_trexio has not been developed yet, it might be beneficial to register the determinants infos into a different TREXIO file compared to the one used for the mf part. Therefore, this final TREXIO file only contains:

  • Number of MOs
  • Number of alpha electrons
  • Number of beta electrons
  • Total number of electrons
  • Number of determinants
  • Coefficients
  • Determinants in bitfield format

The user can also set a threshold to register into the TREXIO file only the determinants with coefficients higher than the user-defined value. If no value is provided by the user, the full list of determinants generated by the PySCF object is registered.

As an improvement, one could develop a mcscf_to_trexio function (in progress here) as well as a converter from bitfield to a human-readable format. This would allow users to import a TREXIO file from other software and read it easily.

The different functions from this PR can be called as follow:

norb = 10
nelec = 4
ci_threshold = 1e-5
mcscf.kernel()
det_to_trexio(mcscf, norb, nelec, 'my_beautiful_determiants.hdf5', ci_threshold = ci_threshold)
num_det, coeff, det = read_det_trexio('my_beautiful_determiants.hdf5')

print(num_det)
print(coeff)
print(det) #In Bitfiled format

Best

Edit: As recommended here, two files are attached where all new functionalities are working on CASCI/sCI and FCI objects.
casci.txt
fci.txt

@NastaMauger
Copy link
Contributor Author

Following discussion from TREX-CoE/trexio_tools#45 (comment)

@@ -260,8 +260,6 @@ def det_to_trexio(mcscf, norb, nelec, filename, backend='h5'):

det_list = []
for a, b, coeff in zip(occsa, occsb, ci_values):
occsa_upshifted = [orb + 1 for orb in a]
occsb_upshifted = [orb + 1 for orb in b]
det_tmp = []
det_tmp += trexio_det.to_determinant_list(occsa_upshifted, int64_num)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NastaMauger you forgot to rename occsa_upshifted to occsa here (and same for occsb).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad ! It is done now

@sunqm
Copy link
Contributor

sunqm commented Nov 10, 2024

The code mixes 2-space and 4-space indention. Please use consistent indentions.

@q-posev I'm not very familiar with TrexIO. I'd appreciate your feedback on this new feature. Please let me know if you think this code is ready for merging. Thanks!

@NastaMauger
Copy link
Contributor Author

NastaMauger commented Nov 10, 2024

@q-posev I realized that the original pyscf-forge code (the part which I have not modified) do not register the ao shell and atomic normalization.
It does not prevent the pyscf-forge to work but this missing infos into trexio file might be an issue for some software import (such as qp2).
I was not able to extract this infos from original PySCF mol object

for i in range(min(len(selected_occslst), mcscf.ci.shape[0])):
for j in range(min(len(selected_occslst), mcscf.ci.shape[1])):
ci_coeff = mcscf.ci[i, j]
if np.abs(ci_coeff) > 1e-8: # Check if CI coefficient is significant
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this threshold for CI coefficients configurable as an argument to this function? I would trust the user to have the full control of the determinant filtering (we can keep your 1e-8 as default).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@q-posev I am not sure if setting a CI threshold is a good idea. For example, after a sCI run, my system produces the following (full) list of determinants with CI values using a threshold of 1e-8:

occsa    occsb       CI coeff
[0 1]       [0 1]    -0.99779049148461
[0 2]       [0 2]     0.03340264667473
[0 2]       [1 2]    -0.03321881541135
[1 2]       [0 2]    -0.03321881541135
[1 2]       [1 2]     0.03303662972152

Now, let’s say a user wants to keep only values higher than 0.1. In this case, the code would register only [0 1] [0 1] -0.9977904914846 and set the others to 0. However, this corresponds to truncating the wavefunctions, which is a different process, right? This would also imply that the one- and two-electron integrals would need to be renormalized to account for the new threshold.

I chose a threshold of 1e-8, as I believe it’s sufficient to filter out coefficients that are effectively zero in terms of bit precision.

Let me know your thoughts.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NastaMauger There is no problem in truncating the CI coefficients. It is a valid wave function that would potentially need to be renormalized, but trexio doesn't impose that the wave function is normalized.

This does not have any incidence on the integrals because they only depend on the Hamiltonian and not on the wave function. It would have an influence on the 1- and 2-body RDMs though.

If you truncate a wave function that is an eigenfunction of S^2 based on the determinant coefficients, the truncated wave function will not be necessarily be an eigenfunction of S^2 any more (it will be spin-contaminated), and some determinants may be removed of the determinant set. In that case, you will not be able to recover the S^2 eigenstate even when you rediagonalize H. Truncating the CSF coefficients keeps the S^2-eigenstate property for the truncated wavefunction.

I think it is a good idea to let the user enter a threshold as a parameter, with a default of zero (no threshold).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
I decided to place the CI threshold argument right after the user-defined nelec value for consistency. I thought it might be counterintuitive for the user to place this threshold after the trexio_file argument.

Let me know your thoughts.



def read_det_trexio(filename, backend='h5'):
with trexio.File(filename, 'r', back_end=_mode(backend)) as tf:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading, you can safely use back_end=trexio.TREXIO_AUTO. TREXIO_AUTO allows the library to automatically detect the backend for a given TREXIO file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set back_end=_mode(backend) to maintain consistency with how it was previously written in the code. For example, the same back_end is called on line 95.

I can certainly change it to back_end=trexio.TREXIO_AUTO as you advised, but I thought it might be better to stay consistent.

We could either keep it as it is or update all calls to use back_end=trexio.TREXIO_AUTO. Let me know which option would be best.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would advise to use that but only for reading TREXIO files. The reason is that when one wants to write/create TREXIO file - the I/O back end has to be provided explicitly, but for reading - our library is smart enough and can detect the I/O back end that was used to produce the TREXIO file. So using TREXIO_AUTO allows you to stay compatible with whatever new back end we will decide to implement in the future.

If @sunqm is OK with that, I would recommend to modify also the line 95 accordingly.

Thanks!

Copy link
Contributor Author

@NastaMauger NastaMauger Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this modification in all functions that use trexio.read. It’s done in a separate commit so that if @sunqm disagree, I can easily roll it back.

Copy link

@q-posev q-posev Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NastaMauger Sorry, i should have precised that: you need to use back_end=trexio.TREXIO_AUTO in the calls to trexio.File with 'r' mode. And then you can remove the backend argument in the functions with *_from_trexio and read_eri. I am not sure your commit works as expected because _mode function doesn't handle the TREXIO_AUTO case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@q-posev My apologies—I am a bit confused about the backend. I have modified the code based on your last comment. Could you please check it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would advise to use that but only for reading TREXIO files. The reason is that when one wants to write/create TREXIO file - the I/O back end has to be provided explicitly, but for reading - our library is smart enough and can detect the I/O back end that was used to produce the TREXIO file. So using TREXIO_AUTO allows you to stay compatible with whatever new back end we will decide to implement in the future.

If @sunqm is OK with that, I would recommend to modify also the line 95 accordingly.

Thanks!

sure. Please update the module as you like. I implemented the version based on the paper or the documents. Many places are likely sub-optimal.

with trexio.File(filename, 'r', back_end=_mode(backend)) as tf:
offset_file = 0

num_det = trexio.read_determinant_num(tf)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a suggestion (I do not insist for it to be implemented here): If one day num_det becomes huge - the user can run out of memory while reading the CI info. This is why we allow to read/write determinants in buffers (or chunks) of fixed size. It might be a good idea to define some relatively big buffer_size (e.g. 100k) and if num_det is larger than that value - read the data in chunks. Example of writing in chunks (using offset_file to advance the data pointers) can be found in trexio-tutorials.

Copy link
Contributor Author

@NastaMauger NastaMauger Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done
Let me know if it meets your requirement.

@q-posev
Copy link

q-posev commented Nov 11, 2024

@NastaMauger I left a few comments but overall the code looks great, thank you for your contribution! Does your code work for both selected CI and FCI objects of PySCF?

@sunqm Sure, with pleasure! I left a few comments on this PR. I would like to also take this opportunity to give you kudos for all the hard work that you and the team do on PySCF - it's a great software!

@q-posev
Copy link

q-posev commented Nov 11, 2024

@sunqm if I may: we have been recently discussing a few issues reported by the TREXIO users. I can summarize them as follows:

  1. Failures to produce TREXIO files from PySCF when using Dunning cc-pvXz basis sets due to the contraction scheme being used, reported here by @scemama , @NastaMauger and @kousuke-nakano . @scemama proposed a solution here which would basically "decontract"/expand the basis set. Do you think it is feasible? Are there any internal PySCF objects which hold the necessary pieces?
  2. As mentioned by @NastaMauger here: the current TREXIO interface implemented in pyscf-forge PR Trexio interface #60 does not produce some details about the atomic orbitals (we are only missing shell and their normalization arrays from the ao group in TREXIO format), which prevents importing of the PySCF-produced TREXIO files in some other codes. Would it be possible to add this information to the TREXIO file produced by the _mol_to_trexio function?

@kousuke-nakano
Copy link
Contributor

kousuke-nakano commented Nov 11, 2024

@sunqm and @q-posev, I have already implemented the second point in my forked pyscf-forge branch, i.e., AO normalization. I have made a pull request. see #68. I have not found a solution for the first point.

@sunqm
Copy link
Contributor

sunqm commented Nov 13, 2024

@sunqm if I may: we have been recently discussing a few issues reported by the TREXIO users. I can summarize them as follows:

  1. Failures to produce TREXIO files from PySCF when using Dunning cc-pvXz basis sets due to the contraction scheme being used, reported here by @scemama , @NastaMauger and @kousuke-nakano . @scemama proposed a solution here which would basically "decontract"/expand the basis set. Do you think it is feasible? Are there any internal PySCF objects which hold the necessary pieces?
  2. As mentioned by @NastaMauger here: the current TREXIO interface implemented in pyscf-forge PR Trexio interface #60 does not produce some details about the atomic orbitals (we are only missing shell and their normalization arrays from the ao group in TREXIO format), which prevents importing of the PySCF-produced TREXIO files in some other codes. Would it be possible to add this information to the TREXIO file produced by the _mol_to_trexio function?
  1. Is it an issue the mixing of segment contraction and general contraction in pyscf? Does basis sets like ANO work using this the interface? The cc-pv*Z basis in pyscf was obtained as an "optimized" format, which contains from basis set exchange. The zeros in the contraction coefficients are squeezed, leading to , e.g. multiple S-shell blocks, P-shell blocks. Does it work if I aggregate all S shells into one block, and P into one block, so on so forth?

  2. Sure. I'm not clear how to build the index and normalization. An example could explain easily I guess. What do these variables look like for STO-3G of Carbon?

@kousuke-nakano
Copy link
Contributor

Dear @sunqm,

Sure. I'm not clear how to build the index and normalization. An example could explain easily I guess. What do these variables look like for STO-3G of Carbon?

I have already implemented it in #68.

@kousuke-nakano
Copy link
Contributor

kousuke-nakano commented Nov 13, 2024

Dear @sunqm

Does basis sets like ANO work using this the interface?

No, I do not think the format of the ANO basis set is compatible with the TREXIO format.

Roos Augmented Double Zeta ANO, taken from BSE [https://www.basissetexchange.org].
F S
103109.46 0.0000637 -0.0000148 0.0000151 -0.0000154
15281.007 0.0005028 -0.0001172 0.0001187 -0.0001160
3441.5392 0.0026677 -0.0006239 0.0006363 -0.0006596
967.09483 0.0112003 -0.0026280 0.0026599 -0.0025500
314.03534 0.0390980 -0.0093503 0.0096068 -0.0102361
113.44230 0.1122657 -0.0278490 0.0284336 -0.0267629
44.644727 0.2472042 -0.0676880 0.0719831 -0.0829345
18.942874 0.3680345 -0.1230542 0.1310393 -0.1141649
8.5327430 0.2908617 -0.1522180 0.2031617 -0.3160866
3.9194010 0.0781024 -0.0075794 -0.0915990 0.4487239
1.5681570 0.0035403 0.3759313 -1.0070250 1.1605351
0.6232900 0.0009157 0.5438482 0.0174627 -1.3695700
0.2408610 0.0000765 0.2127746 0.7490027 -0.6041401
0.0843010 0.0000205 0.0066422 0.3447692 1.2967102

Here, the problem is that one exponent is associated with multiple contraction coefficients,
which TREXIO does not expect.

Instead, for instance, the ccecp-cc-pv*Z basis set implemented in PySCF works with this interface.

ccecp-ccpvdz, taken from Pseudopotential Library [https://pseudopotentiallibrary.org].
F S
84.048652 -0.00000280
41.094617 -0.00035430
20.092738 0.01840880
9.824112 -0.12776610
4.803386 -0.07028720
2.348560 0.15716040
1.148301 0.37690120
0.561449 0.40951650
0.274514 0.19708930
0.134220 0.02434820

because one exponent is associated with one contraction coefficient (i.e., no general contraction).

Copy link

@q-posev q-posev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NastaMauger I left a few comments to facilitate the code modifications. Right now the CI tests fail because of a typo (trexio.File class does not support backend but we use back_end instead).
Sorry for the trouble with the modifications, I would push the necessary changes myself to make things easier but this PR is from your fork and I cannot push there.

@@ -91,9 +93,10 @@ def _cc_to_trexio(cc_obj, trexio_file):
def _mcscf_to_trexio(cas_obj, trexio_file):
raise NotImplementedError


def mol_from_trexio(filename, backend='h5'):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backend='h5' can be removed here

Copy link

@q-posev q-posev Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now use TREXIO_AUTO for reading, that's why the backend option does not need to be provided here

def mol_from_trexio(filename, backend='h5'):
mol = gto.Mole()
with trexio.File(filename, 'r', back_end=_mode(backend)) as tf:
with trexio.File(filename, 'r', backend=trexio.TREXIO_AUTO) as tf:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

back_end instead of backend (tests failed in the CI because of that)

@@ -132,7 +135,7 @@ def mol_from_trexio(filename, backend='h5'):

def scf_from_trexio(filename, backend='h5'):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (backend option can be removed)

@@ -174,7 +177,7 @@ def write_eri(eri, filename, backend='h5'):

def read_eri(filename, backend='h5'):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (backend option can be removed)

else:
n_chunks = 1

with trexio.File(filename, 'w', back_end=_mode(backend)) as tf:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need unsafe 'u' mode to delete things from TREXIO file (e.g. determinant group as here)

@@ -174,7 +177,7 @@ def write_eri(eri, filename, backend='h5'):

def read_eri(filename, backend='h5'):
'''Read ERIs in AO basis, 8-fold symmetry is assumed'''
with trexio.File(filename, 'r', back_end=_mode(backend)) as tf:
with trexio.File(filename, 'r', backend=trexio.TREXIO_AUTO) as tf:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

back_end instead of backend



def read_det_trexio(filename, backend='h5'):
with trexio.File(filename, 'r', backend=trexio.TREXIO_AUTO) as tf:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

back_end instead of backend

pyscf/tools/trexio.py Outdated Show resolved Hide resolved
pyscf/tools/trexio.py Outdated Show resolved Hide resolved
pyscf/tools/trexio.py Outdated Show resolved Hide resolved
@sunqm
Copy link
Contributor

sunqm commented Nov 13, 2024

Dear @sunqm

Does basis sets like ANO work using this the interface?

No, I do not think the format of the ANO basis set is compatible with the TREXIO format.

Roos Augmented Double Zeta ANO, taken from BSE [https://www.basissetexchange.org].
F S
103109.46 0.0000637 -0.0000148 0.0000151 -0.0000154
15281.007 0.0005028 -0.0001172 0.0001187 -0.0001160
3441.5392 0.0026677 -0.0006239 0.0006363 -0.0006596
967.09483 0.0112003 -0.0026280 0.0026599 -0.0025500
314.03534 0.0390980 -0.0093503 0.0096068 -0.0102361
113.44230 0.1122657 -0.0278490 0.0284336 -0.0267629
44.644727 0.2472042 -0.0676880 0.0719831 -0.0829345
18.942874 0.3680345 -0.1230542 0.1310393 -0.1141649
8.5327430 0.2908617 -0.1522180 0.2031617 -0.3160866
3.9194010 0.0781024 -0.0075794 -0.0915990 0.4487239
1.5681570 0.0035403 0.3759313 -1.0070250 1.1605351
0.6232900 0.0009157 0.5438482 0.0174627 -1.3695700
0.2408610 0.0000765 0.2127746 0.7490027 -0.6041401
0.0843010 0.0000205 0.0066422 0.3447692 1.2967102

Here, the problem is that one exponent is associated with multiple contraction coefficients, which TREXIO does not expect.

Instead, for instance, the ccecp-cc-pv*Z basis set implemented in PySCF works with this interface.

ccecp-ccpvdz, taken from Pseudopotential Library [https://pseudopotentiallibrary.org].
F S
84.048652 -0.00000280
41.094617 -0.00035430
20.092738 0.01840880
9.824112 -0.12776610
4.803386 -0.07028720
2.348560 0.15716040
1.148301 0.37690120
0.561449 0.40951650
0.274514 0.19708930
0.134220 0.02434820

because one exponent is associated with one contraction coefficient (i.e., no general contraction).

I see. Let's complete this PR as it is. I can make a new PR to address the general contraction.

@NastaMauger
Copy link
Contributor Author

NastaMauger commented Nov 13, 2024

@q-posev I have updated the code following your recommendations. Let me know your thoughts!

@q-posev
Copy link

q-posev commented Nov 13, 2024

@NastaMauger thank you, it's perfect now!

@sunqm I think we are ready to merge this if the CI is green.

@sunqm sunqm merged commit ceceaaf into pyscf:master Nov 14, 2024
4 checks passed
@NastaMauger NastaMauger deleted the det branch November 17, 2024 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants