The ancestry estimation is based on correlating the sample variants with population-specific SNPs.
For each population (AFR, EUR, SAS, EAS) the 1000 most informative exonic SNPs were selected for this purpose.
A benchmark on the 1000 Genomes variant data assigned 99.81% of the samples to the correct population (2153 of 2157).
Due to different similarity between popultations, the expected scores differ depending on the ancestry of the sample of interest.
This plot shows the score distribution on the 1000 Genomes data:
This table shows the score median and median average deviation determined from the 1000 Genomes data and used internally to assign a population:
population | AFR median / mad | EUR median / mad | SAS median / mad | EAS median / mad |
---|---|---|---|---|
AFR | 0.5002 / 0.0291 | 0.0553 / 0.0280 | 0.1061 / 0.0267 | 0.0895 / 0.0274 |
EUR | 0.0727 / 0.0271 | 0.3251 / 0.0252 | 0.1922 / 0.0249 | 0.0603 / 0.0264 |
SAS | 0.0698 / 0.0264 | 0.1574 / 0.0295 | 0.3395 / 0.0291 | 0.1693 / 0.0288 |
EAS | 0.08415 / 0.0275 | 0.06725 / 0.0269 | 0.21495 / 0.0228 | 0.47035 / 0.0242 |
The SampleAncestry command-line help and changelog can be found here.