Tutorial: Constraining Galacticus Parameters
This tutorial guides you through constraining the parameters of a Galacticus model to achieve a good fit to an observational dataset. Galacticus has built-in MCMC functionality which can be used for this purpose.
Here we will use a very simple example - we'll constrain a single parameter of a Galacticus model to obtain a match to a single point in the stellar mass-halo mass relation of Leauthaud et al. (2012). Much more complex cases are possible (multiple parameters, coupled parameters, multiple target datasets etc.), but this simple example will illustrate the key features.
For this tutorial we require Galacticus to be compiled with MPI parallelism. To do this:
make -j8 GALACTICUS_BUILD_OPTION=MPI Galacticus.exe
Note that you must have MPI installed for this to work.
Once compilation is completed, to run the tutorial model:
export OMP_NUM_THREADS=1
mpirun -np 4 Galacticus.exe parameters/tutorials/mcmcConfig.xml
The export OMP_NUM_THREADS=1 command effectively switches off OpenMP parallelism (which we don't want to use for this tutorial). The mpirun -np 4 prefix launches 4 parallel Galacticus processes which will communicate via MPI.
Expect this example to run for around 15 minutes - it should output something like this:
##
#### # #
# # # #
# ### # ### ### ### ## ### ## ## ##
# # # # # # # # # # # # # # #
# ### ### # ### # # # # # # #
# # # # # # # # # # # # # #
#### #### ### #### ### ## ### ### #### ##
© 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
2017, 2018, 2019, 2020
- Andrew Benson
1:M: -> Begin task: posterior sampling
1:M: Process 0001 [PID: 13486] is running on host 'node081'
2:M: -> Begin task: posterior sampling
2:M: Process 0002 [PID: 13487] is running on host 'node081'
3:M: -> Begin task: posterior sampling
3:M: Process 0003 [PID: 13488] is running on host 'node081'
0:M: -> Begin task: posterior sampling
0:M: Process 0000 [PID: 13485] is running on host 'node081'
0:M: -> Load balancing report
0:M: Not performing load balancing - missing work cost data
0:M: <- done
0:M: Chain 0 has logℒ= -0.8306E+00
1:M: Chain 1 has logℒ= -0.4578E+01
2:M: Chain 2 has logℒ= -0.3024E+02
3:M: Chain 3 has logℒ= -0.1131E+01
0:M: -> Node work done vs. expected:
0:M: Node 0000: work (actual/estimated) = 10.65 / -1.00
0:M: Node 0001: work (actual/estimated) = 8.99 / -1.00
0:M: Node 0002: work (actual/estimated) = 8.70 / -1.00
0:M: Node 0003: work (actual/estimated) = 8.07 / -1.00
0:M: <- done
0:M: -> Load balancing report
0:M: -> Chain redistribution:
0:M: Chain 0000 -> process/node 0000/0001 (work = 10.65)
0:M: Chain 0001 -> process/node 0001/0001 (work = 8.99)
0:M: Chain 0002 -> process/node 0002/0001 (work = 8.70)
0:M: Chain 0003 -> process/node 0003/0001 (work = 8.07)
0:M: <- done
0:M: -> Node work loads:
0:M: Node 0001: work = 36.40
0:M: <- done
0:M: <- done
0:M: Chain 0 has logℒ= -0.1154E+01
1:M: Chain 1 has logℒ= -0.3044E+02
2:M: Chain 2 has logℒ= -0.4176E+01
3:M: Chain 3 has logℒ= -0.3042E+02
0:M: -> Node work done vs. expected:
0:M: Node 0000: work (actual/estimated) = 8.01 / 10.65
0:M: Node 0001: work (actual/estimated) = 7.93 / 8.99
0:M: Node 0002: work (actual/estimated) = 8.71 / 8.70
0:M: Node 0003: work (actual/estimated) = 7.70 / 8.07
0:M: <- done
.
.
.
0:M: Converged after 10 steps
0:M: -> Load balancing report
0:M: -> Chain redistribution:
0:M: Chain 0000 -> process/node 0002/0001 (work = 7.84)
0:M: Chain 0001 -> process/node 0003/0001 (work = 7.81)
0:M: Chain 0002 -> process/node 0001/0001 (work = 8.14)
0:M: Chain 0003 -> process/node 0000/0001 (work = 9.10)
0:M: <- done
0:M: -> Node work loads:
0:M: Node 0001: work = 32.87
0:M: <- done
0:M: <- done
0:M: Chain 3 has logℒ= -0.8918E+00
1:M: Chain 2 has logℒ= -0.3917E+00
2:M: Chain 0 has logℒ= -0.9008E+00
3:M: Chain 1 has logℒ= -0.2899E+01
0:M: -> Node work done vs. expected:
0:M: Node 0000: work (actual/estimated) = 8.41 / 7.84
0:M: Node 0001: work (actual/estimated) = 8.42 / 7.81
0:M: Node 0002: work (actual/estimated) = 8.53 / 8.14
0:M: Node 0003: work (actual/estimated) = 8.55 / 9.10
0:M: <- done
.
.
.
0:M: -> Load balancing report
0:M: -> Chain redistribution:
0:M: Chain 0000 -> process/node 0001/0001 (work = 8.85)
0:M: Chain 0001 -> process/node 0000/0001 (work = 13.86)
0:M: Chain 0002 -> process/node 0002/0001 (work = 8.22)
0:M: Chain 0003 -> process/node 0003/0001 (work = 8.06)
0:M: <- done
0:M: -> Node work loads:
0:M: Node 0001: work = 39.00
0:M: <- done
0:M: <- done
0:M: Chain 1 has logℒ= -0.3850E+01
1:M: Chain 0 has logℒ= -0.1265E+02
2:M: Chain 2 has logℒ= -0.1516E-01
3:M: Chain 3 has logℒ= -0.3263E+02
0:M: -> Node work done vs. expected:
0:M: Node 0000: work (actual/estimated) = 7.64 / 8.85
0:M: Node 0001: work (actual/estimated) = 9.02 / 13.86
0:M: Node 0002: work (actual/estimated) = 9.03 / 8.22
0:M: Node 0003: work (actual/estimated) = 8.44 / 8.06
0:M: <- done
0:M: <- Done task: posterior sampling
2:M: <- Done task: posterior sampling
1:M: <- Done task: posterior sampling
3:M: <- Done task: posterior sampling
where we have cut out some of the output of chain likelihoods for brevity.
The output starts by reporting on where each MPI process is running and lists the PID for that process (which can be useful for debugging). Since we're running four MPI processes we get 4 MCMC chains, which begin to evaluate our model for different combinations of parameters. After each evaluation you'll see a report on the likelihood of the model for that chain:
0:M: Chain 0 has logℒ= -0.8306E+00
The simulation will continue evaluating models until the convergence and stopping criteria (described below) are met. In between this chain output you'll see various reports, for example a report on run time for each chain:
0:M: -> Node work done vs. expected:
0:M: Node 0000: work (actual/estimated) = 8.41 / 7.84
0:M: Node 0001: work (actual/estimated) = 8.42 / 7.81
0:M: Node 0002: work (actual/estimated) = 8.53 / 8.14
0:M: Node 0003: work (actual/estimated) = 8.55 / 9.10
0:M: <- done
which reports how much CPU time was spent on each process versus how much it was estimated would be used. This information is used for load balancing, which is also reported:
0:M: -> Load balancing report
0:M: -> Chain redistribution:
0:M: Chain 0000 -> process/node 0002/0001 (work = 7.84)
0:M: Chain 0001 -> process/node 0003/0001 (work = 7.81)
0:M: Chain 0002 -> process/node 0001/0001 (work = 8.14)
0:M: Chain 0003 -> process/node 0000/0001 (work = 9.10)
0:M: <- done
0:M: -> Node work loads:
0:M: Node 0001: work = 32.87
0:M: <- done
0:M: <- done
which reports which chain was computed on which node, and the total time used by each node.
The complete input parameter file for this tutorial is parameters/tutorials/mcmcConfig.xml. Here we'll focus on each section of the parameter file and understand what it does:
<task value="posteriorSample">
<initializeNodeClassHierarchy value="false"/>
</task>
We begin by specifying the "task" to perform - we choose posteriorSample, which causes Galacticus to run a posterior sampling simulation. This will generate a set of parameters sampled from the posterior distribution given some constraining datasets.
<posteriorSampleLikelihood value="galaxyPopulation">
<baseParametersFileName value="parameters/tutorials/mcmcBase.xml"/>
<failedParametersFileName value="./failedParameters.xml" />
<randomize value="false" />
<evolveForestsVerbosity value="silent" />
</posteriorSampleLikelihood>
In this section we specify how the likelihood function for the model should be computed. Many different likelihood functions can be used, but here we use the galaxyPopulation likelihood function, which causes Galacticus to run a model generating a population of galaxies, compute some observables from that population, and then find the likelihood of that model given some target datasets. We specify a baseParametersFileName which gives the path to a parameter file describing the galaxy population model to run - this will be described in more detail below. We also specify a path for the output of failed parameter sets - sometimes while exploring the model parameter space a model will fail to complete, and its parameters will be output to this file for further investigation. The randomize option specifies whether or not to change the random number seed for the model on each evaluation. Finally, we set evolveForestsVerbosity to silent to switch off logging messages while running the galaxy population calculation (this just keeps the amount of output reasonable).
<!-- MCMC -->
<posteriorSampleSimulation value="differentialEvolution">
<stepsMaximum value="1000" />
<acceptanceAverageCount value=" 10" />
<stateSwapCount value=" 100" />
<logFileRoot value="mcmcChains"/>
<reportCount value=" 10" />
<sampleOutliers value="false" />
<logFlushCount value=" 1" />
This next section controls the type of posterior sampling simulation that we want to perform. Here we choose to use the differential evolution MCMC algorithm (which runs multiple MCMC chains in parallel). Note that this section is not closed at this point - the sections below are all therefore subsections of this section.
<posteriorSampleState value="correlation">
<acceptedStateCount value="100"/>
</posteriorSampleState>
Next we specify the type of object used to store the state of each chain - here we choose a type which can store and compute the correlation structure of each chain's state.
<posteriorSampleStateInitialize value="latinHypercube">
<maximinTrialCount value="100"/>
</posteriorSampleStateInitialize>
Each MCMC chain must be assigned some initial position in the model parameter space. Here we choose to assign those positions using a Latin hypercube design, in which we generate 100 trial Latin hypercubes and keep the hypercube with the maximum minimum ("maximin") distance between chains - this ensures good coverage of the parameter space.
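The maximin selection described here can be sketched in a few lines of Python. This is a hypothetical illustration, not Galacticus code - the function names and the use of Euclidean distance are assumptions:

```python
import random

def latin_hypercube(n_chains, n_dims, rng):
    """One Latin hypercube sample on the unit cube: each dimension has
    exactly one point per stratum, jittered within the stratum."""
    sample = []
    for _ in range(n_dims):
        column = [(i + rng.random()) / n_chains for i in range(n_chains)]
        rng.shuffle(column)
        sample.append(column)
    # Transpose the per-dimension columns into a list of points.
    return list(zip(*sample))

def maximin_latin_hypercube(n_chains, n_dims, n_trials=100, seed=219):
    """Of n_trials hypercubes, keep the one maximizing the minimum
    pairwise distance between chains."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(n_trials):
        cube = latin_hypercube(n_chains, n_dims, rng)
        score = min(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for i, p in enumerate(cube) for q in cube[i + 1:]
        )
        if score > best_score:
            best, best_score = cube, score
    return best
```

With four chains, each parameter dimension is divided into four strata and each chain lands in a different stratum, so the prior range is evenly covered from the start.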
<posteriorSampleConvergence value="gelmanRubin">
<thresholdHatR value=" 1.30"/>
<burnCount value="10" />
<testCount value="10" />
<outlierCountMaximum value=" 1" />
<outlierSignificance value=" 0.95"/>
<outlierLogLikelihoodOffset value="60" />
<reportCount value=" 1" />
<logFileName value="mcmcConvergence.log"/>
</posteriorSampleConvergence>
To judge the convergence of our simulation we choose here to use the Gelman-Rubin statistic - specifying the threshold value of R̂ below which convergence is declared, how many steps to discard as burn-in before checking convergence, and how often to test for convergence. There are also options to control detection of outlier chains.
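As an illustration, the Gelman-Rubin statistic R̂ for a single parameter can be computed as follows. This is a simplified sketch of the standard statistic; Galacticus' implementation may differ in detail (e.g. in its multivariate and outlier handling):

```python
def gelman_rubin_hat_r(chains):
    """Potential scale reduction factor R-hat for one parameter.

    chains: list of equal-length lists of parameter values, one per chain.
    Values near 1 indicate the chains have mixed; large values indicate
    the chains are still sampling different regions."""
    m = len(chains)     # number of chains
    n = len(chains[0])  # steps per chain
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance.
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # Mean within-chain variance.
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5
```

With thresholdHatR set to 1.30, convergence is declared once R̂ for every parameter falls below 1.30.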
<posteriorSampleStoppingCriterion value="stepCount">
<stopAfterCount value="10"/>
</posteriorSampleStoppingCriterion>
This option controls when the simulation will be stopped. Here we choose a simple example which stops the simulation after a specified number of steps post-convergence.
<posteriorSampleDffrntlEvltnRandomJump value="adaptive"/>
In generating proposed new states in the differential evolution MCMC method, a random component is added to each chain's proposal to ensure that the parameter space is fully explored. This option chooses the adaptive method for this random component, which scales the size of the random perturbation based on the current chain distribution.
<posteriorSampleDffrntlEvltnProposalSize value="adaptive" >
<gammaInitial value="0.500e+0"/>
<gammaAdjustFactor value="1.100e+0"/>
<gammaMinimum value="1.000e-4"/>
<gammaMaximum value="3.000e+0"/>
<acceptanceRateMinimum value="0.100e+0"/>
<acceptanceRateMaximum value="0.900e+0"/>
<updateCount value="10" />
</posteriorSampleDffrntlEvltnProposalSize>
In the differential evolution MCMC algorithm new proposed states are generated by taking some fraction, γ, of the difference between two randomly selected states and adding that to the current state. The choice of γ is controlled by the above section. We choose the adaptive option, which will adjust the size of γ to keep the acceptance rate of proposed states between a given minimum and maximum.
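The adaptive rule can be sketched as follows. This is a hypothetical illustration of the idea, with parameter names echoing the config above; the exact update used by Galacticus may differ:

```python
def update_gamma(gamma, acceptance_rate,
                 adjust_factor=1.1,
                 rate_minimum=0.1, rate_maximum=0.9,
                 gamma_minimum=1.0e-4, gamma_maximum=3.0):
    """If acceptance is too low, shrink gamma (smaller proposals are
    accepted more often); if too high, grow it. Clamp to the limits."""
    if acceptance_rate < rate_minimum:
        gamma /= adjust_factor
    elif acceptance_rate > rate_maximum:
        gamma *= adjust_factor
    return min(max(gamma, gamma_minimum), gamma_maximum)
```

Every updateCount steps the recent acceptance rate would be measured and γ nudged accordingly, starting from gammaInitial.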
<!-- Feedback -->
<modelParameter value="active">
<name value="nodeOperator::nodeOperator[2]::stellarFeedbackOutflows::stellarFeedbackOutflows::velocityCharacteristic"/>
<distributionFunction1DPrior value="uniform">
<limitLower value="25.0"/>
<limitUpper value="500.0"/>
</distributionFunction1DPrior>
<operatorUnaryMapper value="identity"/>
<distributionFunction1DPerturber value="cauchy">
<median value="0.0"/>
<scale value="1.0e-3"/>
</distributionFunction1DPerturber>
</modelParameter>
Most importantly, we have to specify the model parameters that we wish to constrain. Here we use just one, but any number can be included. We choose an active parameter (one which will be adjusted to obtain the best match to the target data). The name element specifies the parameter in the base parameter file to vary, as we will describe below. We must specify a prior for this parameter - here we choose a uniform prior between given lower and upper limits. We also specify a "mapping" for the parameter - here we use an identity mapping, but we could also use a logarithmic mapping such that the internal calculations would be performed with the log of the parameter value. Finally, we must specify a distribution function for the random perturbation added to each proposal generated for this parameter - here we use a Cauchy distribution, which typically produces a small perturbation but has tails extending to arbitrarily large perturbations.
</posteriorSampleSimulation>
Finally, we close the posteriorSampleSimulation section.
<!-- Random seed -->
<randomNumberGenerator value="GSL">
<seed value="219" />
<mpiRankOffset value="true"/>
</randomNumberGenerator>
One last thing - since the MCMC algorithm is a Monte Carlo algorithm it needs to generate random numbers. Here we choose a random number generator (we use the GSL standard algorithm), and explicitly request that the random number seeds be offset for each different MPI process rank - this ensures that each of our MPI processes generates a different sequence of random numbers.
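The effect of mpiRankOffset can be sketched as follows. This is a hypothetical illustration - the actual offsetting scheme inside Galacticus is an implementation detail:

```python
def effective_seed(base_seed, mpi_rank, mpi_rank_offset=True):
    """With rank offsetting enabled, each MPI process derives a distinct
    seed, so each chain draws a different random sequence. Without it,
    all processes would replay identical sequences."""
    return base_seed + mpi_rank if mpi_rank_offset else base_seed
```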
In addition to the mcmcConfig.xml parameter file that we examined above, a second parameter file is used in this tutorial. As explained above when discussing the posteriorSampleLikelihood section, we supply a parameter file which defines the "base" model, which will then have its parameters varied in an attempt to find a good fit to the target dataset. In this example that base parameter file is parameters/tutorials/mcmcBase.xml. This is just a normal Galacticus parameter file, using the evolveForests task, that you could run directly to generate a model if you wanted to. The only important feature is that it contains each parameter which will be varied, as specified in the mcmcConfig.xml file. In this case it contains a section:
<!-- Node evolution and physics -->
<nodeOperator value="multi">
<!-- Star formation options -->
<nodeOperator value="starFormationDisks" >
<luminositiesStellarInactive value="true"/>
</nodeOperator>
<nodeOperator value="starFormationSpheroids">
<luminositiesStellarInactive value="true"/>
</nodeOperator>
<!--Stellar feedback outflows-->
<nodeOperator value="stellarFeedbackDisks">
<stellarFeedbackOutflows value="rateLimit">
<timescaleOutflowFractionalMinimum value="0.001"/>
<stellarFeedbackOutflows value="powerLaw">
<velocityCharacteristic value="175.0"/> <!-- This is the parameter being varied -->
<exponent value=" 3.5"/>
</stellarFeedbackOutflows>
</stellarFeedbackOutflows>
</nodeOperator>
<nodeOperator value="stellarFeedbackSpheroids">
<stellarFeedbackOutflows value="rateLimit">
<timescaleOutflowFractionalMinimum value="0.001"/>
<stellarFeedbackOutflows value="powerLaw">
<velocityCharacteristic value=" 50.0"/>
<exponent value=" 3.5"/>
</stellarFeedbackOutflows>
</stellarFeedbackOutflows>
</nodeOperator>
</nodeOperator>
In the mcmcConfig.xml file we referred to a parameter "nodeOperator::nodeOperator[2]::stellarFeedbackOutflows::stellarFeedbackOutflows::velocityCharacteristic". This refers to the parameter indicated (by a comment) in the above - we first find the nodeOperator element in mcmcBase.xml. Next we look for the third nodeOperator element within that element (remember that nodeOperator[2] indexes the third nodeOperator element, as our indexing starts at 0). Next we find the stellarFeedbackOutflows element inside that element, then the stellarFeedbackOutflows element inside that, and finally locate the velocityCharacteristic element inside that. It is the value of this parameter which will be varied.
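This path lookup can be illustrated with a short Python sketch using the standard library's ElementTree. The helper is hypothetical, not part of Galacticus:

```python
import re
import xml.etree.ElementTree as ET

def resolve_parameter(root, path):
    """Follow a '::'-separated parameter path, with optional zero-based
    [i] indices, down the element tree and return the final element."""
    element = root
    for part in path.split("::"):
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        name, index = match.group(1), int(match.group(2) or 0)
        # findall matches direct children only; default to the first.
        element = element.findall(name)[index]
    return element
```

Resolving the path above against the parsed mcmcBase.xml selects the velocityCharacteristic element whose value is 175.0.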
Of course, you can specify any other parameters in the base parameter file also - their values will be fixed throughout the MCMC simulation.
The other key part of the base parameter file is in defining the target datasets to be used to constrain the model. These make use of Galacticus' ability to compute predictions for observables as it runs. The base parameter file contains a section:
<!-- Analyses -->
<outputAnalysis value="stellarVsHaloMassRelationLeauthaud2012" >
<redshiftInterval value="1" />
<computeScatter value="false" />
<systematicErrorPolynomialCoefficient value="0.0 0.0"/>
<likelihoodBins value="9" />
</outputAnalysis>
This tells Galacticus to compute its prediction for the stellar mass-halo mass relation of Leauthaud et al. (2012) - specifically in the first redshift interval (Leauthaud et al. (2012) compute the relation in three redshift intervals), and to compute the likelihood of the model given this dataset using only the single halo mass bin selected by the likelihoodBins parameter (since for this simple tutorial we only compute halos falling within that bin). The computeScatter option is set to false such that we compute the mean of the relation - if it were instead set to true the scatter in the relation would be computed instead. The systematicErrorPolynomialCoefficient parameter allows for the possibility of including and constraining a model of observational systematic errors, but we ignore it here (setting the values of this parameter to zero).
The results of the MCMC simulation are output to files named mcmcChains_????.log, where ???? corresponds to the MPI process number; each file holds the output of states from that chain. An example of the content of each file is:
1 0 8.65868378 F -7.3171034777503241 -1.1537886737156828 321.98163229607087
2 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
3 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
4 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
5 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
6 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
7 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
8 0 8.96963501 F -7.3564114151060496 -1.1930966110714085 322.62448285700106
9 0 8.95562744 F -7.3727669523445538 -1.2094521483099128 322.80007092398876
10 0 8.95562744 F -7.3727669523445538 -1.2094521483099128 322.80007092398876
11 0 8.36373901 T -7.0640684583132103 -0.90075365427856924 318.64017850185974
12 0 8.06176758 T -7.1758380406243720 -1.0125232365897303 324.84524606237727
13 0 8.06176758 T -7.1758380406243720 -1.0125232365897303 324.84524606237727
14 0 8.76565552 T -7.3388034666121991 -1.1754886625775576 322.42077843550794
15 0 8.08679199 T -6.7269143477283295 -0.56359954369368803 314.22563572091480
16 0 8.08679199 T -6.7269143477283295 -0.56359954369368803 314.22563572091480
17 0 7.77185059 T -7.1396858801635883 -0.97637107612894714 319.78935277370073
18 0 8.35467529 T -6.5888317014128344 -0.42551689737819348 306.82365204243695
19 0 7.88580322 T -6.2438500587941803 -8.0535254759539335E-002 292.61032842104959
20 0 7.88580322 T -6.2438500587941803 -8.0535254759539335E-002 292.61032842104959
21 0 7.88580322 T -6.2438500587941803 -8.0535254759539335E-002 292.61032842104959
Each line corresponds to one step in the MCMC chain. The columns are as follows:
- The step number;
- The chain number;
- The CPU time taken to evaluate this model;
- T/F to indicate whether the simulation is converged or not;
- The logarithm of the posterior probability of this state;
- The logarithm of the likelihood of this state;
- (and any further columns) The values of the model parameters in this state.
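A record like the above can be parsed with a few lines of Python. This is a sketch (the function name and dictionary keys are my own); the column layout follows the listing above, and Fortran-style exponents such as 8.05E-002 are handled by float():

```python
def parse_chain_log(lines):
    """Parse whitespace-separated mcmcChains log lines into records."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        records.append({
            "step":          int(fields[0]),
            "chain":         int(fields[1]),
            "cpuTime":       float(fields[2]),
            "converged":     fields[3] == "T",
            "logPosterior":  float(fields[4]),
            "logLikelihood": float(fields[5]),
            # Remaining columns: model parameter values for this state.
            "parameters":    [float(f) for f in fields[6:]],
        })
    return records
```

To build a posterior sample, one would typically keep only the post-convergence states (converged flag T) from each chain's file.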