Last updated: Nov. 2020
This tutorial walks the user through analyzing, validating and depositing integrative models produced by the Integrative Modeling Platform.
The tutorial contains code to perform a complete integrative modeling pipeline from model building through archiving the models and protocols in mmCIF format. Examples for a partially sampled and more extensively sampled protocols are provided.
Install the tutorial and IMP as described in the main page README
This step is optional. One may skip directly to analysis (Step 2) using the provided output files
-
Navigate to
./modeling
-
Change line 7 in
./run_rnapolii_modeling.sh
to point to your python installation -
run
./run_rnapolii_modeling.sh output_dir N n_steps
output_dir
is the prefix of the output directoryN
is the number of independent sampling runs to do (at least 2, up to the number of processors you are willing to sacrifice)n_steps
is the number of Monte-Carlo steps to run.
Each 1000 steps takes a few to 20 minutes depending on your machine. The script will produce a significant amount of output to screen and create directories output_dir0
, output_dir1
, ...
which contain the output that will be analyzed.
To sufficiently sample the system, at N>=4
and n_steps>=50000
should be performed. Analysis can still be done on smaller samples.
First, we do a quick assessment of the output for sampling mixing by clustering based on score values.
- Navigate to
./analysis
- run
python ./run_analysis_trajectories.py ../modeling/ output_dir
../modeling
is the path to the output directoriesoutput_dir
is the prefix of the output directories (same as above orexample
if you wish to analyze the supplied output.
- Look at
model_analysis/plot_clustering_scores.png
(below)
Here, we see the plots for the example
output with distinct, non-overlapping clusters of solutions. This can be indicative of insufficient sampling. Models can still be analyzed, however it is likely that sampling precision will be able to be improved. The plot for more complete sampling, showing a single major cluster, can be seen in the ./analysis_extensive
results:
The results of score-based clustering are tabulated in analysis/model_analysis/summary_hdbscan_clustering.dat
, which shows that cluster 1 is the largest and most diverse cluster (contains similar amounts of models in sample A and sample B). We continue by analyzing this cluster.
Additional information about the satisfaction of restraints, including a detailed breakdown of crosslinking satisfaction by cluster are contained in ./model_analysis
.
We extract the coordinates from the models of cluster 1 using the following command:
python ./run_extract_models.py ../modeling example 1
Model coordinates (A_models_clust1.rmf3
and B_models_clust1.rmf3
) and model scores (A_models_clust1.txt
, and B_models_clust1.txt
) for the A and B subsets of cluster 1 are written to ./model_analysis
.
The sampling precision of these models can then be computed as follows:
imp_sampcon exhaust -n rnapol \
--rmfA ./model_analysis/A_models_clust1.rmf3 \
--rmfB ./model_analysis/B_models_clust1.rmf3 \
--scoreA ./model_analysis/A_models_clust1.txt \
--scoreB ./model_analysis/B_models_clust1.txt \
-d density_rnapol.txt \
-m cpu_omp -c 4 \
-gp -g 2.0
Where:
-n
is the name of the system, used as the prefix for all output files--rmfA
/B
are the rmf files computed above--scoreA
/B
are the score files computed above-d
is the name of the file containing the localization density definitions-m
is the computation mode:cpu_omp
,cpu_serial
,cuda
-c
specifies the number of cores when-m
=cpu_omp
- -gp uses gnuplot to create output graphs of convergence statistics
- -g is the grid size for computing sampling precision. Larger values are faster, however are less precise.
The sampling precision for cluster 1 of the example
data is reported as 12.6 angstroms in rnapol.Sampling_Precision_Stats.txt
. The result of the three statistical tests for sampling covergence are plotted in rnapol.ChiSquare.pdf
The script will cluster the models at the estimated sampling precision. The centroid and localization densities for these cluster are output into cluster.0
, cluster.1
, etc...
The model precision for each cluster is found in rnapol.Cluster_Precision.txt
. We find here that our cluster precision is 8.615 angstroms.
Cluster precision is defined as the weighted average RMSD between each configuration in the cluster and the cluster centroid (or RMSF). Since our sampling precision is computed as an RMSD, to compare our model and sample precision, we multiply by the general relationship between RMSD and RMSF for globular systems (RMSD = 1.4 * RMSF). Our cluster precision of 8.6
Localization densities can be viewed in ChimeraX as well as RMF files with the associated plugin for viewing RMF files.
Finally, the protocols and models produced from integrative modeling should be archived and deposited in the public domain so that others may build upon the protocols by adding information, sampling, new representations or scoring functions.
Navigate to ./ihm_deposition
and follow the README.
The file density_rnapol.txt
contains sets of model components used to compute localization densities. Each entry in the {}
bracketed dictionary represents a localization density that is produced during the clustering analysis. Each entry is named in quotes, i.e., "Rbp4"
and assigned a list of components in []
brackets. Each component is defined by the molecule name defined in the modeling step, i.e., "Rbp7"
. Specific residues from a molecule can be defined in a ()
bracket, e.g., ("Rbp4", 1, 25)
, which defines residues 1-25 in molecule Rbp4.
For this example, we place each moving subunit into its own localization density and the balance of the complex in the fixed "Rigid_Body
complex.
density_custom_ranges={"Rbp4":["Rbp4"],"Rbp7":["Rbp7"], "Rigid_Body":["Rbp1", "Rbp2", "Rbp3", "Rbp5", "Rbp6", "Rbp8", "Rbp9","Rbp10","Rbp11","Rbp12"]}