
Code Refactor #31

Open
makeclean opened this issue Jun 19, 2013 · 8 comments

Comments

@makeclean
Contributor

When the alpha version is released for use in the group, it would be beneficial to have an experienced Python user refactor and tidy the code from its current state.

  • Identify areas where refactoring would improve clarity
  • Refactor the code
@makeclean
Contributor Author

The refactored code can now launch Fluka jobs correctly. Due to the limitations of the CHTC system, the code was refactored based on the following assumptions:

  • we will nearly always be dealing with 'large data' in Condor terms (> 50 MB)
  • we should not treat different cases differently; there should be one consistent method that can launch both small and large jobs

These assumptions led to the following design changes (a sketch of the bundling step in point 2 follows this list):

  1. Preprocessing (as much as it exists) should be done by the user away from Condor, i.e. if the MCNP job must be split, this must be done away from Condor for several reasons:
    • We cannot transfer large amounts of cross section data; it is unwieldy.
    • Due to the advanced tally methodology in place, the output filename from the meshtally must be unique, otherwise data returned from one Condor job will be overwritten by another. This means the preprocessing stage must also set a unique filename for advanced tallies.
    • This preprocessing must produce unique input and output data for each calculation to allow recombination.
    • This is problematic for very large files, which can take several minutes (to hours) to initialize the runtpe files for running.
  2. Since large files must be transferred via squid/wget, the input data and other ancillary files should be bundled and transferred together. This has implications for recombination of the data.
  3. Since large data must be sent/received using squid/wget, we cannot flow directly from running all the calculations to producing the collected output data; the production of the averaged output dataset will therefore be done as a separate post-processing step.
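A minimal sketch of the bundling step from point 2, assuming hypothetical paths and a hypothetical helper name rather than the actual submit_job.py code:

    import os
    import shutil
    import tarfile

    def bundle_run_directory(run_dir, squid_dir, bundle_name="run_bundle.tar.gz"):
        """Tar.gz the run directory (inputs plus ancillary files) and stage it
        in the user's squid directory so execute nodes can fetch it with wget."""
        bundle_path = os.path.join("/tmp", bundle_name)
        with tarfile.open(bundle_path, "w:gz") as tar:
            # keep the run directory name as the top-level entry in the archive
            tar.add(run_dir, arcname=os.path.basename(os.path.abspath(run_dir)))
        shutil.copy(bundle_path, os.path.join(squid_dir, bundle_name))
        return os.path.join(squid_dir, bundle_name)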

@makeclean
Contributor Author

The script expects a certain directory structure within the run directory. It looks for the following subdirectories (a sketch of this check follows the list):

  1. input (contains all input decks to run)
  2. geometry (containing the h5m of the geometry to be run)
  3. mesh (containing the h5m of the advanced tally to use)
  4. ancillary (wwinp files, etc.)
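A hypothetical sketch of that directory check (not the actual submit_job.py logic):

    import os

    EXPECTED_DIRS = ["input", "geometry", "mesh", "ancillary"]

    def find_run_components(run_dir):
        """Return the expected subdirectories that actually exist, and their
        contents, so the script can decide what calculation to set up."""
        found = {}
        for name in EXPECTED_DIRS:
            path = os.path.join(run_dir, name)
            if os.path.isdir(path):
                found[name] = sorted(os.listdir(path))
        return found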

Based on what is passed, the script looks in the provided directories to determine what calculation should be performed, for example:

/home/davisa/condorht_tools/chtc_sub/submit_job.py 
--path /home/davisa/fng_str/ --job FLUKA --batch 10 

This tells the script to look in /home/davisa/fng_str/input for the input decks, that it's a Fluka calculation, and that each calculation should be run 10 times. The script tar.gz's everything within /home/davisa/fng_str and copies it to /squid/davisa, where the precompiled tar.gz's of the gcc and Fluka compilers exist.
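A sketch of the command-line interface implied by the example above, using argparse; the option names match the example call, but the help text and defaults are assumptions:

    import argparse

    def parse_arguments():
        parser = argparse.ArgumentParser(description="Submit jobs to CHTC")
        parser.add_argument("--path", required=True,
                            help="run directory containing input/, geometry/, mesh/, ancillary/")
        parser.add_argument("--job", required=True,
                            help="which code to run, e.g. MCNP or FLUKA")
        parser.add_argument("--batch", type=int, default=1,
                            help="number of times to run each calculation")
        return parser.parse_args()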

The script then builds the DAG to control the tasks using dag_manager. (Since we can no longer tag the post-processing on as a child of this run, the only benefit of using dag_manager is the resubmission of failed runs.)
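For reference, a minimal sketch of writing a DAGMan file whose main value here is automatic resubmission of failed runs via RETRY; the node and file names are made up:

    def write_dag_file(dag_path, submit_files, retries=3):
        """Write one JOB/RETRY pair per Condor submit file."""
        with open(dag_path, "w") as dag:
            for i, submit_file in enumerate(submit_files):
                dag.write("JOB run%d %s\n" % (i, submit_file))
                dag.write("RETRY run%d %d\n" % (i, retries))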

@gonuke
Member

gonuke commented Aug 12, 2013

Does any of this change if we have our own submit machine over which we have full control and disk access? I think that's what all the productive HTCondor users do.

@makeclean
Contributor Author

I don't know for sure, but I don't think so; it's the I/O of getting all the data to and from the compute nodes that is the issue, which is why we have to put things in squid and then wget them. If we had our own dedicated submit node it would make things easier in the sense that a lot of the processing could be done there.

However, at some point we have to deal with the issue of these large files. Take, for example, one of Tim's ITER FW models, which also has 2 or 3 advanced tallies in it: the model itself takes 10 minutes to read in cross sections, and then it's another 40 minutes to build the kd-tree. This preprocessing is done in serial on another machine, which means almost an hour just to build the runtpe file for one calculation, where we may consider splitting into 1000 sub-calculations. We can of course parallelise this (see the sketch below).
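One way to parallelise the runtpe-building step, sketched with multiprocessing; the mcnp5 invocation is only a placeholder for whatever command actually initialises a runtpe from one split input deck:

    import multiprocessing
    import subprocess

    def build_runtpe(input_deck):
        # placeholder invocation; the real command line and options will differ
        subprocess.call(["mcnp5", "ix", "i=" + input_deck])

    def build_all_runtpes(input_decks, workers=4):
        pool = multiprocessing.Pool(workers)
        pool.map(build_runtpe, input_decks)
        pool.close()
        pool.join()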

If instead we brought the cross sections along with the calculation, in effect abandoning the idea of a continue run and storing the xs on squid, then this would mean several hours of transferring xsdata before the run begins, which is not much use either since we would have several hours of dead time before any useful work is done.

An alternative is to pull out only the xs data that is needed for the calculation and build a custom xsdir and ace file for each calculation (sketched below).
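A hypothetical sketch of trimming an xsdir file down to only the ZAIDs a given input deck needs; the real xsdir format also has a datapath/atomic-weight header that would need more careful handling than shown here:

    def filter_xsdir(xsdir_in, xsdir_out, needed_zaids):
        """Copy only the directory lines whose first token (the ZAID) is needed."""
        with open(xsdir_in) as fin, open(xsdir_out, "w") as fout:
            for line in fin:
                tokens = line.split()
                if tokens and tokens[0] in needed_zaids:
                    fout.write(line)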

Another issue we have is one of storage: to even get a big ITER calculation onto Condor will take several tens or even hundreds of GB. From our perspective that isn't really a problem, but for the Condor folks, who winced when I asked for 30 GB, it is.

@gonuke
Member

gonuke commented Aug 12, 2013

I think having our own submit node will solve at least 2 problems:

1. We'll have a little bit more control over our environment for building our tools (although it will still have to be compatible with the execute machines).

2. We can put a big hard drive there as a launching/landing pad for the data as it comes and goes.

I think if we're clever, the initial costs of reading data and building the MOAB search trees (do we need both an OBB-tree for DAGMC and a KD-tree for mesh tallies?) are worth it if we can reuse the runtpe for each of the separate jobs.

We should perhaps try to do a 2- to 4-way replication by hand and see what the moving parts actually look like.

@makeclean
Contributor Author

The reuse of the runtpes is the key, and unfortunately I don't currently see how we can reuse them. In normal MCNP use we can; however, for advanced tallies we have to ensure the output mesh name is unique, and it cannot be reset after the runtpe has been written, hence the need for multiple runtpes. Unless we shift the meshtal setup routine?

@gonuke
Member

gonuke commented Aug 14, 2013

What about different subdirectories?

@makeclean
Contributor Author

Yeah, that would probably work; it just seemed a bit messy, but that's preferable to slow.
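A minimal sketch of the per-job subdirectory idea: each Condor job gets its own working directory, so a fixed meshtal or runtpe filename no longer collides between jobs (the names here are made up):

    import os
    import shutil

    def make_job_dir(base_dir, job_index, runtpe_path):
        """Create job_NNNN/ and give it its own copy of the shared runtpe."""
        job_dir = os.path.join(base_dir, "job_%04d" % job_index)
        os.makedirs(job_dir)
        shutil.copy(runtpe_path, job_dir)
        return job_dir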
