A toolset to visualize basic statistical data extracted from VCF files

kutsukos/VCF-statistics

VCF statistics - psarema

uses-bash Python 2.7 R 3.6.0

Table of contents

  1. Step 1 - Analysis
  2. Step 2 - Visualization
  3. About VCF statistics
  4. Version Changelog
  5. Contact

Step 1 - Analysis

The first step for this analysis is to run psarema.py.

python3 psarema.py

This script needs 2 input files and generates 3 output files. Check the global variables in psarema.py to see which inputs it expects.

The tab-delimited input file contains 2 fixed, tab-separated fields per line:

  1. sample - an identifier to a sample
  2. population - the population this sample belongs to
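A minimal sketch of how a file in this layout could be read into a sample-to-population mapping. The function name `read_sample_populations` and the sample identifiers `S1` to `S3` are hypothetical and are not taken from psarema.py; they only illustrate the two-column format.

```python
import csv
import io


def read_sample_populations(handle):
    """Return {sample: population} from a two-column, tab-delimited file."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank, short, or comment lines (assumption: '#' marks a comment).
        if len(row) < 2 or row[0].startswith("#"):
            continue
        sample, population = row[0], row[1]
        mapping[sample] = population
    return mapping


# Illustrative content; a real run would open the .tab file instead.
example = io.StringIO("S1\tGBR\nS2\tGBR\nS3\tCHB\n")
sample_to_pop = read_sample_populations(example)
print(sample_to_pop)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.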

In the SupportData directory you can find 1KGP.sample.pop.tab, a sample file that we use in this tutorial.

This file is suitable for analyzing samples from the 1000 Genomes Project. If the samples in your VCF file belong to the 1000 Genomes Project, you will probably use 1KGP.sample.pop.tab. Otherwise, keep the same format in your own file and the output will be fine.

The script writes 3 output files:

  1. result.1.tab
  2. result.2.tab
  3. result.3.tab

The first file contains information about each line of your VCF file: the number of samples in each population that have the genotype 0/0, 0/1 or 1/1. We use this information later for the visualization.
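The per-line counting described here can be sketched as follows. This is an illustration under stated assumptions, not the code in psarema.py: the function name `genotype_counts_by_population` is hypothetical, phased genotypes (`0|1`) are treated like unphased ones, and `1/0` is folded into `0/1`.

```python
from collections import Counter, defaultdict


def genotype_counts_by_population(vcf_line, sample_names, sample_to_pop):
    """Count genotypes per population for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    # In VCF, the first 9 columns (CHROM ... FORMAT) precede the sample columns.
    genotype_columns = fields[9:]
    counts = defaultdict(Counter)
    for sample, column in zip(sample_names, genotype_columns):
        # Assumption: GT is the first FORMAT key; treat phased like unphased.
        genotype = column.split(":")[0].replace("|", "/")
        if genotype == "1/0":
            genotype = "0/1"
        counts[sample_to_pop[sample]][genotype] += 1
    return counts


# Illustrative data: three made-up samples in two populations.
samples = ["S1", "S2", "S3"]
pops = {"S1": "GBR", "S2": "GBR", "S3": "CHB"}
line = "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0|1\t1/1"
counts = genotype_counts_by_population(line, samples, pops)
for population, per_genotype in counts.items():
    print(population, dict(per_genotype))
```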

The second file lists, for each line of your VCF file, how many and which populations carry the SNP (or whatever variant that line describes).
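As a rough illustration of that summary, the sketch below takes per-population genotype counts for one VCF line and reports which populations have at least one carrier. The function name `populations_with_variant` and the input layout are assumptions made for this example and may differ from what psarema.py does internally.

```python
def populations_with_variant(counts_by_population):
    """Return (number of carrier populations, sorted list of their names)."""
    carriers = sorted(
        population
        for population, counts in counts_by_population.items()
        # A population carries the variant if any sample is 0/1 or 1/1.
        if counts.get("0/1", 0) + counts.get("1/1", 0) > 0
    )
    return len(carriers), carriers


# Illustrative counts for a single VCF line.
line_counts = {
    "GBR": {"0/0": 5, "0/1": 2},
    "CHB": {"0/0": 7},
    "YRI": {"0/0": 3, "1/1": 1},
}
n_populations, names = populations_with_variant(line_counts)
print(n_populations, names)
```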

The third file contains information per sample: how many insertions/SNPs the sample has and which population it belongs to.
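A per-sample tally of this kind could look like the sketch below. It assumes that a line counts once for a sample whenever the sample has at least one alternate allele (0/1 or 1/1); whether psarema.py counts this way is not stated in this README, and the name `variants_per_sample` is hypothetical.

```python
def variants_per_sample(genotype_rows, sample_names):
    """Count, per sample, the VCF lines where the sample carries the variant."""
    totals = {name: 0 for name in sample_names}
    for row in genotype_rows:
        for name, genotype in zip(sample_names, row):
            alleles = genotype.replace("|", "/").split("/")
            # Assumption: one alternate allele is enough to count the line.
            if "1" in alleles:
                totals[name] += 1
    return totals


# Illustrative genotypes: one inner list per VCF line, one entry per sample.
samples = ["S1", "S2", "S3"]
rows = [
    ["0/0", "0/1", "1/1"],
    ["0/1", "0/0", "1|1"],
]
totals = variants_per_sample(rows, samples)
print(totals)
```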

Step 2 - Visualization

For Step 2 we need Rscript and the tidyverse package. If you don't have tidyverse installed, run the commands below.

$ R
> install.packages("tidyverse")
> q()
Save workspace image? [y/n/c]: n

The script takes a single argument, the ID of the project. It also needs the file popSUPERpop.tab to be in the same directory as the psarema.plots.R script.

The file popSUPERpop.tab, provided in the SupportData directory, contains information about the populations of the 1000 Genomes Project.

In our example the ID is yourfile, because the VCF file is named yourfile.vcf.

Does it make sense? No! But we assumed that was your file's name.

$ Rscript psarema.plots.R yourfile 

The command above outputs 2 PDF files, 5 table files (one per super population) with statistics about each super population, and 1 .tab file with extra information needed for the visualization.

Plot 1 shows in how many populations each SNP/insertion exists.

Plot 2 shows, for each population, how many samples have 0, 1, 2, ... insertions/SNPs. What is counted depends on what each line of the VCF file describes and on the number of lines.
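To make the data behind the second plot concrete, here is a small Python sketch of the grouping it describes. The actual plotting is done in R by psarema.plots.R; this example only illustrates the idea, and the function name `samples_per_variant_count` and the input dictionaries are hypothetical.

```python
from collections import Counter, defaultdict


def samples_per_variant_count(totals, sample_to_pop):
    """For each population, count how many samples have 0, 1, 2, ... variants."""
    histogram = defaultdict(Counter)
    for sample, n_variants in totals.items():
        histogram[sample_to_pop[sample]][n_variants] += 1
    return histogram


# Illustrative per-sample totals and population assignments.
totals = {"S1": 1, "S2": 1, "S3": 2}
pops = {"S1": "GBR", "S2": "GBR", "S3": "CHB"}
histogram = samples_per_variant_count(totals, pops)
for population, per_count in histogram.items():
    print(population, dict(per_count))
```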

About VCF statistics

We created this toolset when we needed to visualize some of our data and run some basic statistical analysis.

Ψάρεμα (psarema) means fishing in Greek. The toolkit is named after it because we try to fish information out of a VCF file.

Contact

Contact me at [email protected] for reporting bugs or anything else! :)
