Checkpointing - adds jld2 function #145

Merged
aelligp merged 39 commits into main from pa-restart
Jun 3, 2024

Conversation

aelligp
Collaborator

@aelligp aelligp commented May 3, 2024

This PR adds a new checkpointing routine, checkpointing_jld2(), which stores the entire stokes and thermal arrays as well as the particles, with their locations and phases, for restarting.

The older HDF5-based function checkpointing() has been updated to match the reworked package.

Also in this PR: load functions for both checkpoint formats

To be updated after JustPIC implementation:

  • array conversion of particles and phases
function checkpointing_jld2(dst, stokes, thermal, particles, phases, time, igg)
    !isdir(dst) && mkpath(dst) # create folder in case it does not exist
    fname = joinpath(dst, "checkpoint_rank_$(igg.me).jld2")
    return jldsave(
        fname;
        stokes=Array(stokes),
        thermal=Array(thermal),
        particles=particles,   
        phases=phases,        
        time=time,
    )
end
function checkpointing_hdf5(dst, stokes, T, time)
    !isdir(dst) && mkpath(dst) # create folder in case it does not exist
    fname = joinpath(dst, "checkpoint")
    h5open("$(fname).h5", "w") do file
        write(file, @namevar(time)...)
        write(file, @namevar(stokes.V.Vx)...)
        write(file, @namevar(stokes.V.Vy)...)
        write(file, @namevar(stokes.V.Vz)...)
        write(file, @namevar(stokes.P)...)
        write(file, @namevar(stokes.viscosity.η)...)
        write(file, @namevar(T)...)
    end
end
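
To illustrate the save and load round trip that the JLD2 routine relies on, here is a minimal sketch. It assumes JLD2 is installed and uses plain arrays as stand-ins for the stokes and thermal fields; the names P, T, P_restart, T_restart and t_restart are made up for the example and are not part of the PR.

```julia
using JLD2

# Plain arrays stand in for the stokes and thermal fields (assumption: the
# real structs hold several such arrays)
P = rand(4, 4)
T = rand(4, 4)

dst = mktempdir()
fname = joinpath(dst, "checkpoint_rank_0.jld2") # same naming pattern as above, rank 0

# Each keyword argument becomes a named entry in the file
jldsave(fname; P=P, T=T, time=1.5)

# load returns a Dict keyed by those entry names
restart = load(fname)
P_restart = restart["P"]
T_restart = restart["T"]
t_restart = restart["time"]
```

Because the entries are looked up by name, a loader only needs to agree with the writer on the keyword names, not on the order in which they were saved.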

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

@aelligp aelligp mentioned this pull request May 6, 2024
11 tasks
@aelligp aelligp marked this pull request as ready for review May 6, 2024 14:25
@aelligp
Collaborator Author

aelligp commented May 31, 2024

With the JustPIC PR #106, the particles and phases can now also be transferred to the CPU. However, this does not yet work with the Array(particles/phases) conversion that comes from JustPIC; we might need to include JustPIC in the Project.toml.

src/IO/H5.jl Outdated
Comment on lines 60 to 66
h5file = h5open(file_path, "r") # Open the file in read mode
P = read(h5file["P"]) # Read the stokes variable
T = read(h5file["T"]) # Read the thermal.T variable
Vx = read(h5file["Vx"]) # Read the stokes.V.Vx variable
Vy = read(h5file["Vy"]) # Read the stokes.V.Vy variable
η = read(h5file["η"]) # Read the stokes.viscosity.η variable
t = read(h5file["time"]) # Read the t variable
Collaborator

Suggested change
- h5file = h5open(file_path, "r") # Open the file in read mode
- P = read(h5file["P"]) # Read the stokes variable
- T = read(h5file["T"]) # Read the thermal.T variable
- Vx = read(h5file["Vx"]) # Read the stokes.V.Vx variable
- Vy = read(h5file["Vy"]) # Read the stokes.V.Vy variable
- η = read(h5file["η"]) # Read the stokes.viscosity.η variable
- t = read(h5file["time"]) # Read the t variable
+ h5file = h5open(file_path, "r") # Open the file in read mode
+ P = read(h5file["P"]) # Read the stokes variable
+ T = read(h5file["T"]) # Read the thermal.T variable
+ Vx = read(h5file["Vx"]) # Read the stokes.V.Vx variable
+ Vy = read(h5file["Vy"]) # Read the stokes.V.Vy variable
+ Vz = read(h5file["Vz"]) # Read the stokes.V.Vz variable
+ η = read(h5file["η"]) # Read the stokes.viscosity.η variable
+ t = read(h5file["time"]) # Read the t variable
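
One possible way to read these fields is sketched below, assuming HDF5.jl is installed. The function name read_checkpoint_sketch is hypothetical and is not the loader under review. The do-block form of h5open closes the file even if a read throws, and the haskey check is an assumption that some checkpoints (for example 2D runs) may have been written without Vz.

```julia
using HDF5

# Hypothetical loader, not the function under review
function read_checkpoint_sketch(file_path)
    h5open(file_path, "r") do file
        (
            P = read(file, "P"),
            T = read(file, "T"),
            Vx = read(file, "Vx"),
            Vy = read(file, "Vy"),
            # assumption: 2D checkpoints may not contain Vz
            Vz = haskey(file, "Vz") ? read(file, "Vz") : nothing,
            η = read(file, "η"),
            t = read(file, "time"),
        )
    end
end

# Write a small 2D-style file (no Vz) so there is something to read back
path = joinpath(mktempdir(), "checkpoint.h5")
h5open(path, "w") do file
    for name in ("P", "T", "Vx", "Vy", "η")
        write(file, name, rand(3, 3))
    end
    write(file, "time", 0.5)
end

fields = read_checkpoint_sketch(path)
```

Returning a named tuple keeps the call site short (fields.P, fields.t); whether a missing Vz should be nothing or an error is a design choice for the actual loader.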

src/IO/JLD2.jl Outdated
restart = load(file_path) # Load the file
stokes = restart["stokes"] # Read the stokes variable
thermal = restart["thermal"] # Read the thermal variable
particles = restart["particles"] # Read the particles variable
Collaborator

still need to remove the particles

@albert-de-montserrat
Collaborator

albert-de-montserrat commented Jun 1, 2024

Before merging, we need to add the time step and the model time to the jld2 file. I also think it would be nice to copy the scripts that are being run, plus the Project.toml and Manifest.toml (this only needs to be done once), so that we have a copy of everything in case one later forgets which model these files were run with, which can happen easily.

We could organize the files like this:

.
└── checkpointing
    ├── files
    │   └── checkpoint.jld2
    └── metadata
        ├── Manifest.toml
        ├── Project.toml
        ├── rheology.jl
        ├── script.jl
        └── setup.jl

The user would need to specify which .jl files are copied, while the .toml files are copied by default.

@boriskaus
Collaborator

From experience, on an HPC cluster the simulation may run out of allocated time exactly while the checkpoint file is being written. In that case the file will be corrupt, and if you were overwriting an earlier file you have to start the simulation from the beginning again.

A workaround is to save it all to a temporary checkpoint file and remove the original one only once the current one has been written completely.

@albert-de-montserrat
Collaborator

Sounds like the smart thing to do.

@aelligp
Collaborator Author

aelligp commented Jun 3, 2024

The checkpointing functions have now been updated to write to a temporary file that is moved to the real dst after completion @boriskaus

function checkpointing_jld2(dst, stokes, thermal, time, timestep, fname::String)
    !isdir(dst) && mkpath(dst) # create folder in case it does not exist

    # Create a temporary directory
    mktempdir() do tmpdir
        # Save the checkpoint file in the temporary directory
        tmpfname = joinpath(tmpdir, basename(fname))
        jldsave(
            tmpfname;
            stokes=Array(stokes),
            thermal=Array(thermal),
            time=time,
            timestep=timestep,
        )
        # Move the checkpoint file from the temporary directory to the destination directory
        mv(tmpfname, fname; force=true)
    end

    return nothing
end
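
One caveat worth keeping in mind: as far as I understand, mv falls back to copy and delete when the temporary directory and dst are on different filesystems, which is common on HPC clusters. A possible variant is to create the temporary file inside the destination folder so that the final step can be a plain rename. The sketch below is only that, a sketch: save_via_tempfile is a hypothetical helper, and a plain write stands in for jldsave so the example needs nothing beyond Base.

```julia
# Hypothetical helper, not part of the PR: write to a temporary file that
# lives next to the final file, then move it into place.
function save_via_tempfile(write_fn, fname::AbstractString)
    dst = dirname(abspath(fname))
    mkpath(dst) # no-op if the folder already exists
    # mktemp(dst) creates the temporary file inside dst, so the final mv
    # stays on one filesystem
    tmpfname, io = mktemp(dst)
    close(io) # write_fn opens the path itself
    try
        write_fn(tmpfname)
        # the previous checkpoint is only replaced after writing succeeded
        mv(tmpfname, fname; force=true)
    finally
        isfile(tmpfname) && rm(tmpfname) # clean up if writing failed
    end
    return nothing
end

# Stand-in for jldsave: any function that writes to the path it is given
dir = mktempdir()
fname = joinpath(dir, "checkpoint.txt")
save_via_tempfile(path -> write(path, "step 1"), fname)
save_via_tempfile(path -> write(path, "step 2"), fname) # replaces the previous checkpoint
```

With JLD2 the call could look like save_via_tempfile(p -> jldsave(p; time=time, timestep=timestep), fname), again assuming the keyword names used in this thread.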

The function metadata stores the Manifest.toml and the Project.toml, plus any files you provide (by name if they are in the same directory, otherwise by path), in a dst folder, which can be the same as the checkpointing one. @albert-de-montserrat

function metadata(src, dst, files...)
    @assert dst != pwd()
    if !ispath(dst)
        println("Created $dst folder")
        mkpath(dst)
    end
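    for f in vcat(collect(files), ["Manifest.toml", "Project.toml"])
        srcfile = joinpath(src, f) # an absolute path in f overrides src
        !isfile(srcfile) && continue # skip files that do not exist
        newfile = joinpath(dst, basename(f))
        isfile(newfile) && rm(newfile) # replace an existing copy
        cp(srcfile, newfile)
    end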
end
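
A usage sketch, under a few assumptions: the function is restated in a shortened form so the example runs on its own (the existence check looks inside src), and the folder and file names are made up to mirror the layout proposed earlier in the thread.

```julia
# Restated from the comment above so the example is self-contained;
# each file is looked up inside src
function metadata(src, dst, files...)
    @assert dst != pwd()
    !ispath(dst) && mkpath(dst)
    for f in vcat(collect(files), ["Manifest.toml", "Project.toml"])
        srcfile = joinpath(src, f)
        !isfile(srcfile) && continue # missing files are skipped silently
        cp(srcfile, joinpath(dst, basename(f)); force=true)
    end
end

# Hypothetical model folder with placeholder files
src = mktempdir()
for f in ("Project.toml", "Manifest.toml", "script.jl", "rheology.jl")
    write(joinpath(src, f), "# placeholder\n")
end
dst = joinpath(mktempdir(), "checkpointing", "metadata")

# setup.jl does not exist in src, so it is skipped
metadata(src, dst, "script.jl", "rheology.jl", "setup.jl")
copied = readdir(dst)
```

Skipping missing files silently is convenient for the default .toml files, but for user-supplied scripts a warning might be friendlier; that is a design choice left open here.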

@aelligp aelligp merged commit 900485d into main Jun 3, 2024
11 of 12 checks passed
@aelligp aelligp deleted the pa-restart branch June 3, 2024 14:17
4 participants