Genetic Simulation Tools (GST)¶
When developing new methods it is often easier to start with some simulated data. Here I provide some tools to simulate various phenotypes from imported or artificial genetic data.
Overall the library is separated into 3 components
- genotypes: facilities to import genetic data
- simtools: simulation of phenotypes
- tools: handy tools
Loading Genotypes¶
The genotype module provides two classes to read and sample genotypes for both plink and vcf files.
For plink file you simply can
import simtools.genotypes as gp
n = 100; p=1000
seedfile = 'plink_stem'
plink = gp.ReadPlink(seedfile)
genotypematrix = plink.sample(n, p)
This will randomly chose n subjects and p SNPs from a given plink file. Furthermore, sample also allows to write the samples subjects and variants to disk in the current working directory.
You can also read specific variants and subjects into a genotype-matrix.
import simtools.genotypes as gp
import numpy as np
seedfile = 'plink_stem'
plink = gp.ReadPlink(seedfile)
marker = np.random.choice(plink.bim.index.values, 5)
subjects = np.random.choice(plink.bim.index.values, 5)
genotypematrix = plink.read_bed(marker, subjects)
If read_plink is not provided with any marker or subjects it will attempted to read the entire bed file into memory.
Similar you can also you a vcf file
import simtools.genotypes as gp
n = 100; p=1000
vcf = gp.ReadVCF('example.vcf.gz')
genotypematrix = vcf.sample(n, p)
Similar to the plink class, vcf also has a read_vcf method to obtain the genotype-matrix of specific individuals and variants.
Furthermore, the vcf class also has a simple method to obtain the MAF of all variants in a vcf file
import simtools.genotypes as gp
output = 'path.to.output'
vcf = gp.ReadVCF('example.vcf.gz')
vcf.get_allele_frequency(output)
Phenotype Simulations¶
One of the main purposes of this library is to provide a simulation framework to simulate phenotypes. To start off one first needs a seed population in form of a plink file (support for vcf will be added).
import simtools.simtools as st
import numpy as np
plink_file = 'plink_stem'
sims = st.Simtools(plink_file)
sims.chunk_size = 100 # default
The chunk size sets how many variants simtools should process at a time. This might be useful if your seed population is large and cannot be loaded into memory. In addition, you can also select specific subjects within the seed population when initializing the class.
Single Phenotype Simulation¶
Multiple Phenotype Simulation¶
Use¶
It is relatively easy to simulate multiple related phenotypes from existing genetic data.
import simtools.simtools as st
import numpy as np
plink_file = 'plink_stem'
sims = st.Simtools(plink_file)
heratibiltiy = 0.4
num_causal_snps = [10, 10, 10] #for each phenotype
# adjacency matrix
lamb = np.eye(3)
lamb[1,0] = 0.1
lamb[2,0] = 0.1
# genetc effect matrix
b = np.eye(3); b = b * heratibiltiy
# simulate phenoytpes
phenos = sims.multi_phenotype(lamb, b, num_causal_snps)
The resulting matrix phenos is a matrix containing the phenotypes of all subjects. Please be aware that currently the multi_phenotype method does not produce binary traits.
Method¶
It is often desirable to simulate multiple related phenotypes.
Lets say we have three phenotypes
and we would like to simulate inter-causal relationship among those.
Then we can express the relationship between
in matrix form (also called a adjacency matrix) as

in which the effect of A on B is indicated by
, the effect of B on A as
, and so on.
The phenotype for subject i for t phenotypes is then
![y_i &= \Lambda y_i + \beta X_i + e_i \\
&= {(I - \Lambda)}^{-1} [\beta X_i+ \varepsilon_i]](_images/math/8ac4dae44384cddf2f055326d704b0d157638db9.png)
where
represents the vector of variants for a subject i
the corresponding effect sizes of size
is the vector of random effects
is a vector of the phenotypes for subjects i
is a
matrix of structural coefficients (see above);
Hence it is easy to see that this approach flexible and allows the simulation of
number of arbitrary phenotypes.
However, in the current form it is unable to specify specific genetic and environmental effects.
In order to do so one can generalize the previous equation to all subjects
![y &= {(I - \Lambda)}^{-1} [BG + DE] \\
&= {(I - \Lambda)}^{-1}BG + {(I - \Lambda)}^{-1}DE](_images/math/08378feeec73226b3fcf9265833f15d233fa2f36.png)
in which
is the genetic for the genotype matrix
and the effect sizes
the
matrix of phenotypes
be the
matrix of errors (
)
and
the diagonal scalar matrices of size
for the genetic risk and error term respectively.
In order to constrain the variance of each phenotype to
, let
and
then we can constrain

in which
,
and
are the variance-covariance matrices of y, G and E respectively.
Since we constrain the variance of all traits to be
the sum of the causal and genetic effect cannot be greater than
.
Hence the following constrain has to be respected when constructing both
and 

Contents:
represents the vector of variants for a subject i
the corresponding effect sizes of size 
is the vector of random effects
is a vector of the phenotypes for subjects i
matrix of structural coefficients (see above);
is the genetic for the genotype matrix
and the effect sizes 
the
matrix of phenotypes
be the
)
the diagonal scalar matrices of size
for the genetic risk and error term respectively.