Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XYZ file support #13

Draft
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

joshkamm
Copy link

@joshkamm joshkamm commented Dec 18, 2023

Here are notes and questions that have come up so far. Most notably, going from 1 to 2 supported file formats means that a number of decisions came up regarding how the code and functionality associated with different file formats would be related.

  1. In adding a second file format, should there be some sort of shared data structure that stores molecular data consistently across different types of files? Is there already a cheminformatic standard available in rust analogous to rdkit, open babel, or stk? Does ruSTK have a molecule data structure yet?
  2. Should all the rust code continue to live in lib.rs with the XYZ file related rust functions in the same module / file as the Cube file related functions, or should they be distributed into separate modules and or files?
  3. I'm only starting to learn how borrowing and lifetimes work in rust. I clumsily messed with the XYZData class and usage of it until compilation succeeded. I don't know if I'm using types owned by python vs rust appropriately, and whether I should actually need to have the Python GIL when writing the file.
  4. I'm getting a warning to update maturin
  5. I wasn't easily able to get github actions to run on my fork and it seemed like they might be referring to secrets on this main repo. Compilation and pytest succeed locally. I still need to run the other code checks from the justfile because I was running into a couple issues trying to setup just.

Remaining todo list:

  • Edit based on Lukas's feedback
  • Update README
  • Run code checks and fixes

closes joshkamm#1

@lukasturcani
Copy link
Collaborator

Hi Josh -- thanks for this!

  • I don't think we want a shared data structure just yet . Different file formats will hold different information and trying to squeeze all of them into the same data format is likely to lead to difficulties in terms of ergonomics and performance. Once we have more supported data formats and a shared abstraction becomes obvious, I'm happy for there to be a refactor which removes duplication but I don't want to take this step too early.
  • There is https://github.com/rapodaca/chemcore but I think it is still early stage and I want this repo to be optimized for data reading and writing. By using a general purpose data structure we may become limited in how fast we can read and write to it.
  • ruSTK is still early stage so nothing I want to re-use from there.
  1. I think split things into separate files. maybe a cube.rs and xyz.rs. if theres functions you want to reuse you put those into their own file too and import them.
  2. I don't think you should need the Python GIL for writing a file -- I'll add comments to the code itself.
  3. Will update maturin!
  4. For PRs I think I need to allow the tests to run the first time. Done that now. Not sure if this will fix them not running your fork.

benchmark: typing.Any,
tmp_path: pathlib.Path,
) -> None:
xyz_structures = create_xyz_structures()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would create a fixture

@fixture
def structures() -> list[flour.XyzData]:
    num_atoms = 100
    rng = np.random.default_rng(11)
    return [
        flour.XyzData(
            comment="Test comment",
            elements=["C"]*num_atoms,
            positions=rng.random((num_atoms, 3)) * 100,
        )
        for _ in range(500)
    ]

which i would use in the benchmarks

def benchmark_write_xyz(
    benchmark: typing.Any,
    tmp_path: pathlib.Path,
    structures: list[flour.XyzData],
) -> None:
    ...

assert xyz_data_1.comment == xyz_structures[1].comment
assert np.all(np.equal(xyz_data_1.elements, xyz_structures[1].elements))
assert np.all(np.isclose(xyz_data_1.positions, xyz_structures[1].positions))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a newline



class XyzData:
elements: list[str]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a numpy array of uint8 called atoms

this will be more consistent with cube file interface

#[pyclass]
struct XyzData {
#[pyo3(get)]
elements: Vec<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

#[pyo3(get)]
atoms: Py<PyArray1<u8>>,

instead of elements

this is more consistent with the cube file interface

also using Vec<String> would be less performant because of redundant copies and could lead to surprising behvaiour since it creates copies each time it is accessed -- see https://pyo3.rs/v0.20.0/faq.html#pyo3get-clones-my-field

#[pyo3(get)]
elements: Vec<String>,
#[pyo3(get)]
comment: String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Py<PyString> https://pyo3.rs/main/doc/pyo3/types/struct.pystring

here is an example of how to use the type

use pyo3::prelude::*;
use pyo3::types::PyString;

#[pyclass]
struct Foo {
    #[pyo3(get)]
    x: Py<PyString>,
}

#[pymethods]
impl Foo {
    #[new]
    fn new(py: Python, x: &str) -> PyResult<Self> {
        Ok(Foo {
            x: PyString::new(py, x).into_py(py),
        })
    }
}

src/lib.rs Outdated
let mut multi_xyz_data: Vec<XyzData> = vec![];
let contents = fs::read_to_string(path)?;
let mut lines = contents.lines();
loop {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would re-write this using awhile-let loop

while let (Some(first_line), Some(second_line)) = (lines.next(), lines.next()) {
...
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add support for XYZ files
2 participants