Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds script to import cctbx data to rs #264

Merged
merged 39 commits into from
Sep 20, 2024
Merged

adds script to import cctbx data to rs #264

merged 39 commits into from
Sep 20, 2024

Conversation

dermen
Copy link
Collaborator

@dermen dermen commented Aug 10, 2024

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Aug 10, 2024

Codecov Report

Attention: Patch coverage is 74.12935% with 52 lines in your changes missing coverage. Please review.

Project coverage is 88.95%. Comparing base (b78e571) to head (1d0b88d).
Report is 58 commits behind head on main.

Files with missing lines Patch % Lines
reciprocalspaceship/io/dials.py 80.68% 28 Missing ⚠️
reciprocalspaceship/io/dials_mpi.py 0.00% 20 Missing ⚠️
reciprocalspaceship/io/common.py 87.50% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #264      +/-   ##
==========================================
- Coverage   91.22%   88.95%   -2.27%     
==========================================
  Files          37       40       +3     
  Lines        2531     2835     +304     
==========================================
+ Hits         2309     2522     +213     
- Misses        222      313      +91     
Flag Coverage Δ
unittests 88.95% <74.12%> (-2.27%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dermen
Copy link
Collaborator Author

dermen commented Aug 10, 2024

Test this:

conda create -n cctbx -c conda-forge cctbx-base dials dxtbx python=3.9
conda activate cctbx
git clone https://github.com/dermen/reciprocalspaceship.git
pip install ray msgpack
cd reciprocalspaceship
python -m pip install -e .
mkdir test_cctbx
cd test_cctbx
wget https://smb.slac.stanford.edu/~dermen/rs_ints.tar
tar -xvf ts_ints.tar
rs.cctbx2rs  lys_si_8/ lys_si_8.mtz --ucell 78.3 78.3 36.9 90 90 90 --symbol P4 --nj 10
rs.mtzdump lys_si_8.mtz 

Produced:

$ rs.cctbx2rs  lys_si_8/ lys_si_8.mtz --ucell 78.3 78.3 36.9 90 90 90 --symbol P4 --nj 10
Found 59 files
2024-08-10 13:17:14,428	INFO worker.py:1781 -- Started a local Ray instance.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0041_integrated.refl.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0050_integrated.refl.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0024_integrated.refl.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0008_integrated.refl.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0005_integrated.refl.
(get_paths_inds pid=27221) Done finding paths/inds in lys_si_8/idx-0012_integrated.refl.
Found data for 460 unique images
(get_refl_data pid=27222) loading lys_si_8/idx-0041_integrated.expt (1/59)...
(get_refl_data pid=27222) loading lys_si_8/idx-0050_integrated.expt (11/59)...
(get_refl_data pid=27222) loading lys_si_8/idx-0024_integrated.expt (21/59)...
(get_refl_data pid=27222) loading lys_si_8/idx-0008_integrated.expt (31/59)...
(get_refl_data pid=27222) loading lys_si_8/idx-0005_integrated.expt (41/59)...
(get_refl_data pid=27222) loading lys_si_8/idx-0012_integrated.expt (51/59)...
Combining refl data from all processes (564192 total refls)
Done combining!
Wrote lys_si_8.mtz.
Done!
(cctbx) 🦕 ~/reciprocalspaceship/test_cctbx
$ rs.mtzdump lys_si_8.mtz 
Spacegroup: P4
Extended Hermann-Mauguin name: P 4
Unit cell dimensions: 78.300 78.300 36.900 90.000 90.000 90.000

mtz.head():

              dH     dK     dL  BATCH   DPSI  ...        Y     SX     SY     SZ  PARTIAL
H   K   L                                     ...                                       
-54 -26 0  0.046 -0.041  0.005      0  0.001  ...   58.526  0.457   0.51 -0.512     True
    -25 1 -0.053  0.132 -0.009      0 -0.002  ...   169.55  0.476  0.488 -0.517     True
        2  0.116   -0.2  0.022      0  0.004  ...  239.547  0.495   0.47 -0.515     True
-53 -24 1  0.027 -0.013  0.019      0  0.001  ...  262.958  0.469  0.478 -0.531     True
    -23 3  0.026 -0.008  0.012      0  0.001  ...  432.253  0.507  0.437 -0.532     True

[5 rows x 13 columns]

mtz.describe():

             dH        dK        dL     BATCH  ...         Y        SX        SY        SZ
count  564192.0  564192.0  564192.0  564192.0  ...  564192.0  564192.0  564192.0  564192.0
mean        0.0    -0.001       0.0   229.283  ...   2185.51      -0.0    -0.001    -0.694
std       0.082     0.061      0.02    132.71  ...  1185.691     0.343     0.356     0.083
min      -0.274    -0.282    -0.083       0.0  ...    10.438    -0.594     -0.61    -0.856
25%      -0.053    -0.035    -0.013     114.0  ...  1219.682     -0.31    -0.324    -0.761
50%         0.0    -0.002      -0.0     229.0  ...  2212.558    -0.005    -0.011    -0.691
75%       0.054     0.032     0.013     344.0  ...  3151.746     0.309     0.321    -0.628
max       0.284     0.257     0.091     459.0  ...  4353.981     0.594      0.61    -0.497

[8 rows x 12 columns]

mtz.dtypes:

dH           MTZReal
dK           MTZReal
dL           MTZReal
BATCH          Batch
DPSI         MTZReal
I          Intensity
SigI          Stddev
X            MTZReal
Y            MTZReal
SX           MTZReal
SY           MTZReal
SZ           MTZReal
PARTIAL         bool
dtype: object

@dermen
Copy link
Collaborator Author

dermen commented Aug 10, 2024

After thinking about it, the cctbx dependencies could be ignored, I think the refl tables can completely be accessed using msgpack


In [92]: import msgpack

In [93]: d = msgpack.load(open("idx-0040_integrated.refl", "rb"), strict_map_key=False)

In [94]: vals = d[2]['data']['intensity.sum.value'][1]

In [95]: np.frombuffer(vals[1])
Out[95]: 
array([ 80.41394043, 648.45092773,  21.55432129, ...,  49.80627441,
       -22.07312012, 432.39892578])

So a lightweight version of this code could be written and not need any fancy dials conda env

@dermen
Copy link
Collaborator Author

dermen commented Aug 11, 2024

Removed the CCTBX deps , so the code has slightly less functionality, but better performance. To test:

conda create -n cctbx2rs python=3.12
conda activate cctbx2rs
git clone https://github.com/dermen/reciprocalspaceship.git
cd reciprocalspaceship
pip install -e .

# new deps:
pip install ray msgpack

# test
wget https://smb.slac.stanford.edu/~dermen/rs_ints.tar
tar -xvf rs_ints.tar
rs.cctbx2rs  lys_si_8/ lys_si_8.mtz --ucell 78.3 78.3 36.9 90 90 90 --symbol P4 --nj 10
rs.mtzdump lys_si_8.mtz 

@kmdalton
Copy link
Member

this is some inspiring stuff, @dermen. as we discussed on slack, i think rs-booster is the place for command line apps like this. however, i'd like to see this refactored into an io module so we can just run something like

data_set = rs.read_cctbx( ... )

@JBGreisman
Copy link
Member

Thanks for putting this together! I agree with Kevin here -- I think it would make sense to include an rs.read_cctbx() (or should it be called rs.read_refl()?). I think generic format conversion scripts are better suited for rs-booster, although such a commandline tool could then just call to the rs.read....() method.

@dermen
Copy link
Collaborator Author

dermen commented Aug 12, 2024

Sounds good, ill rejig it. And yea, read_refl or even read_dials might be more suitable.. technically the refl files are produced by the program dials in most cases..

@kmdalton
Copy link
Member

yes -- but we probably want different methods for stills and rotation method

@JBGreisman JBGreisman linked an issue Aug 12, 2024 that may be closed by this pull request
@dermen
Copy link
Collaborator Author

dermen commented Aug 20, 2024

rs-station/rs-booster#50

@kmdalton
Copy link
Member

kmdalton commented Aug 20, 2024

I just want to point out that we are using this ContextManager for Ray in the CrystFEL parser. We're using this pattern. It's something that came up with @marinegor during review of that PR. It hopefully makes it more likely that the processes get cleaned up if the main thread crashes (for instance if the user kills it with ctrl-c). It's not necessarily a dealbreaker for this PR, but I'm making a note here that we should refactor that ContextManager into a shared submodule.

I made an issue for this (#267).

@dermen
Copy link
Collaborator Author

dermen commented Aug 20, 2024

@kmdalton , just saw your message.. I think I committed too quickly :)

@kmdalton
Copy link
Member

no problem, we can always refactor the ray stuff after this pr. i think we still haven't converged on a general policy for methods that rely on / benefit from optional dependencies.

Copy link
Member

@kmdalton kmdalton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dermen , this is really nice, readable code. Thank you! I have a few requests before we merge it though. Most importantly,

  1. the rs maintainers try to adhere to the classic unix philosophy of not printing extraneous info to console unless the user opts in. So, please put all your print statements behind some kind of verbose=False toggle.
  2. Your methods would be more flexible if they took filenames rather than directories. As written, it seems it would be impossible to only load a subset of the .refl files in a directory. This is an important feature to support. So, I'd request that you refactor all the "globbing" outside of the read_... methods.
  3. We have decorators (from reciprocalspaceship.decorators import cellify, spacegroupify) which take a range of possible input types and convert them to gemmi objects. Using these might simplify your code somewhat.

I hope these comments amount to some light refactoring. I'm happy to help if needed!

After we merge this PR, I can refactor it to use contextmanagers as we discussed.

reciprocalspaceship/io/dials.py Outdated Show resolved Hide resolved
reciprocalspaceship/io/dials.py Outdated Show resolved Hide resolved
reciprocalspaceship/io/dials.py Outdated Show resolved Hide resolved
reciprocalspaceship/io/dials.py Outdated Show resolved Hide resolved
reciprocalspaceship/io/dials.py Outdated Show resolved Hide resolved
reciprocalspaceship/io/dials_mpi.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@kmdalton
Copy link
Member

@dermen, I like the logging changes you made. I think the code doesn't suffer at all in readability, and it might be a good practice to deploy more widely in the rs codebase.

there are a few more things i'd like to request with this pr.

  1. We typically make io functions available at the rs.io.{function_name level using the module's __init__.py. It'd be great if rs.io.read_dials_stills was populated in there. because you have optional dependencies, i think it'd be best if you
  2. implement a wrapper function with the signature rs.io.read_dials_stills(fnames, unitcell, spacegroup, numjobs=10, backend="...") where backend takes a string like "mpi" or"ray". then make the default backend be "numpy" which doesn't require any additional libraries and runs in serial.
  3. Finally, it'd be good if ray and mpi4py were imported lazily and if the module could check whether they're available and fallback to numpy if not. i did something like this in the crystfel io code here.

let me know if you need a hand with any of this. i can probably help out this week. thanks!

@dermen
Copy link
Collaborator Author

dermen commented Aug 26, 2024

Cool - think its mostly there ! feel free to change things as you see fit! There is no real urgency for me with this.. I am actually anticipating we need to add more options for msgpack data columns, and try/excepts where a column name doesnt exist ...

@kmdalton
Copy link
Member

i did some refactoring to de-emphasize the mpi idioms and make it possible to test the mpi code serially. the branch lives here.

@dermen
Copy link
Collaborator Author

dermen commented Sep 17, 2024

NERSC test instructions..

local> ssh perlmutter.nersc.gov
login> module load PrgEnv-gnu 
login> cd $SCRATCH
login> wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh 
login> bash ./Miniforge3-Linux-x86_64.sh -b -u -p $PWD/rs
login> source rs/etc/profile.d/conda.sh 
login> conda activate base

# install reciprocalspaceship and rs-booster
login> git clone [email protected]:dermen/reciprocalspaceship.git
login> cd reciprocalspaceship
login> pip install -e .
login> pip install msgpack ray

login> cd ../
login> git clone [email protected]:dermen/rs-booster.git
login> cd rs-booster
login> pip install -e .


# install mpi4py
login> MPICC="cc -shared" pip install --verbose --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py

# go to nodes
login> salloc -N 4  --ntasks-per-node=64  -A m4326 -t 240 -q interactive -C cpu
salloc: Nodes nid[004268-004271] are ready for job

# test
nid004268> srun -c2 rs.from_dials_mpi $CFS/lcls/dermen/outrun86.[78][90][0-9] outrun86.mtz --ucell 78 78 235 90 90 120 --symbol P6 --extra-cols panel  num_pixels.valid background.mean num_pixels.background num_pixels.foreground xyzcal.px xyzcal.mm --verbose

@kmdalton kmdalton self-requested a review September 19, 2024 01:05
Copy link
Member

@kmdalton kmdalton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dermen please review my changes and if you find them acceptable go ahead and merge this into main.

Copy link
Collaborator Author

@dermen dermen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we still infer the stddev dtype ? Is that ok , and should we mention in docstring ?

@kmdalton
Copy link
Member

I guess we still infer the stddev dtype ? Is that ok , and should we mention in docstring ?

Ultimately that one is up to you and what you want the API to do. I would probably defer that, because it will downcast that column to float32. If we are going to infer the stddev dtype, we might as well infer things like HKL dtypes and BATCH dtype. Let me know what you think.

@kmdalton
Copy link
Member

kmdalton commented Sep 19, 2024

I added an option to specify whether to use mtz dtype inference.

but i forgot to push it 😅

@dermen
Copy link
Collaborator Author

dermen commented Sep 20, 2024

Ok, fixed one syntax error in rs-booster code, and added a unit cell / space group estimator to rs-booster. And tested, seems to work ok:

$ rs.print_refl ml_proc.lysozyme_ML_mode_9/idx-0024_integrated.refl 

Found 8 experiment identifiers in ml_proc.lysozyme_ML_mode_9/idx-0024_integrated.refl:
	0: 17da648c-b938-5107-f582-7f7da8b803e8
	1: 1dccff2f-c147-29c6-55e3-163752dfb8fc
	2: 831a07f3-1af2-c739-ef9a-3c249f58e653
	3: 65853d5f-8458-d5e7-43a7-5101c5c3ee09
	4: 229d0132-1cb9-00ff-536a-6f872fede4dc
	5: f619a467-0abc-a18b-cbdf-75eef2df1f3e
	6: cde7c534-73ee-0d7b-5a39-77db7e62ee03
	7: 5b700c75-fa32-a892-0fdb-e53294402c3c

Reflection contents:
             names                           dtypes            
        background.dispersion                        double    
              background.mean                        double    
               background.mse                        double    
         background.sum.value                        double    
      background.sum.variance                        double    
                         bbox                          int6    
                            d                        double    
                delpsical.rad                        double    
                     entering                          bool    
                        flags                   std::size_t    
                           id                           int    
          intensity.sum.value                        double    
       intensity.sum.variance                        double    
                 miller_index        cctbx::miller::index<>    
        num_pixels.background                           int    
   num_pixels.background_used                           int    
        num_pixels.foreground                           int    
             num_pixels.valid                           int    
                        panel                   std::size_t    
                   partial_id                   std::size_t    
                   partiality                        double    
                           s1                  vec3<double>    
                      shoebox                     Shoebox<>    
                    xyzcal.mm                  vec3<double>    
                    xyzcal.px                  vec3<double>    
              xyzobs.mm.value                  vec3<double>    
           xyzobs.mm.variance                  vec3<double>    
              xyzobs.px.value                  vec3<double>    
           xyzobs.px.variance                  vec3<double>    

Number of reflections: 10555 


$ rs.from_dials ml_proc.lysozyme_ML_mode_9/ test.mtz --extra-cols num_pixels.foreground background.mean --verbose
Found 63 reflection files.
Found 63 corresponding exptlist files. Will use 5 of them to estimate ucell and space group.
Consensus unit cell from 35 expts: <gemmi.UnitCell(78.3001, 78.3001, 36.8998, 90, 90, 90)>
Consensus space group from 35 expts: <gemmi.SpaceGroup("P 4/m")>
2024-09-20 09:36:25,931	INFO worker.py:1781 -- Started a local Ray instance.
(_get_refl_data pid=2408434) Loading ml_proc.lysozyme_ML_mode_9/idx-0021_integrated.refl
(_get_refl_data pid=2408434) ... Read in data for miller_index
(_get_refl_data pid=2408434) ... Read in data for intensity.sum.value
(_get_refl_data pid=2408434) ... Read in data for intensity.sum.variance
(_get_refl_data pid=2408434) ... Read in data for xyzcal.px
(_get_refl_data pid=2408434) ... Read in data for s1
(_get_refl_data pid=2408434) ... Read in data for delpsical.rad
(_get_refl_data pid=2408434) ... Read in data for id
(_get_refl_data pid=2408434) ... Read in data for num_pixels.foreground
(_get_refl_data pid=2408434) ... Read in data for background.mean
(_get_refl_data pid=2408430) Loading ml_proc.lysozyme_ML_mode_9/idx-0051_integrated.refl [repeated 62x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(_get_refl_data pid=2408444) ... Read in data for miller_index [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for intensity.sum.value [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for intensity.sum.variance [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for xyzcal.px [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for s1 [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for delpsical.rad [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for id [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for num_pixels.foreground [repeated 62x across cluster]
(_get_refl_data pid=2408444) ... Read in data for background.mean [repeated 62x across cluster]
Combining and formatting tables!
Found 592100 refls from 460 expts.
Mapping batch column.
Finished combining tables!
Converted column intensity.sum.variance to MTZ-Type Q, took sqrt of the values, and renamed to intensity.sum.sigma.
Writing MTZ test.mtz ...
Done writing MTZ.


$ rs.mtzdump  test.mtz 
Spacegroup: P4/m
Extended Hermann-Mauguin name: P 4/m
Unit cell dimensions: 78.300 78.300 36.900 90.000 90.000 90.000

mtz.head():

           background.mean  num_pixels.foreground  delpsical.rad  ...  intensity.sum.value  BATCH  PARTIAL
H   K  L                                                          ...                                     
-36 -6 21            2.146                    843         -0.002  ...               30.927     26     True
    -5 21            2.112                    842         -0.001  ...              -17.545     26     True
    -4 21            2.078                    837          0.001  ...               23.911     26     True
    -3 21            2.065                    838          0.002  ...                55.26     26     True
    -2 21            2.002                    827          0.002  ...              151.213     26     True

[5 rows x 13 columns]

mtz.describe():

       background.mean  num_pixels.foreground  delpsical.rad  ...  intensity.sum.sigma  intensity.sum.value     BATCH
count         592100.0               592100.0       592100.0  ...             592100.0             592100.0  592100.0
mean             10.05                404.062            0.0  ...               72.887             3568.129   230.097
std              7.908                143.466          0.003  ...               43.961            14724.827   133.145
min              0.017                  176.0         -0.046  ...                5.319            -7621.242       0.0
25%              3.714                  287.0         -0.002  ...               47.517               72.518     114.0
50%              6.843                  377.0            0.0  ...               59.611              293.854     231.0
75%             14.711                  494.0          0.002  ...               81.446             1530.024     345.0
max             50.248                 1106.0          0.043  ...              910.503           819365.438     459.0

[8 rows x 12 columns]

mtz.dtypes:

background.mean            MTZReal
num_pixels.foreground       MTZInt
delpsical.rad              MTZReal
s1.0                       MTZReal
s1.1                       MTZReal
s1.2                       MTZReal
xyzcal.px.0                MTZReal
xyzcal.px.1                MTZReal
xyzcal.px.2                MTZReal
intensity.sum.sigma         Stddev
intensity.sum.value      Intensity
BATCH                        Batch
PARTIAL                       bool
dtype: object



$ careless mono --kl-weight=0.001  --mc-samples=20  --mlp-layers=20  --image-layers=1 --dmin=2  --intensity-key=intensity.sum.value  --uncertainty-key=intensity.sum.sigma  dHKL,xyzcal.px.0,xyzcal.px.1,delpsical.rad,BATCH,background.mean,num_pixels.foreground   test.mtz  CL/test
Careless version 0.4.4
Training:   0%|                                                                                | 0/10000 [00:00<?, ?it/s]WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1726850289.048002  571008 service.cc:146] XLA service 0x7fd217b60350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1726850289.048049  571008 service.cc:154]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
I0000 00:00:1726850289.182732  571008 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
Training:  13%|▎ | 1338/10000 [00:32<02:30, 57.64it/s, Grad Norm=1.50e+03, loss=1.27e+03, F KLDiv=1.13e+00, NLL=1.27e+03]

@dermen dermen merged commit 1fb2b17 into rs-station:main Sep 20, 2024
8 checks passed
dermen added a commit that referenced this pull request Sep 20, 2024
dermen added a commit that referenced this pull request Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Support for Reading DIALS refl Files
4 participants