Genetic Simulation Tools (GST)

When developing new methods it is often easier to start with some simulated data. Here I provide some tools to simulate various phenotypes from imported or artificial genetic data.

Overall the library is separated into 3 components

  • genotypes: facilities to import genetic data
  • simtools: simulation of phenotypes
  • tools: handy tools

Loading Genotypes

The genotype module provides two classes to read and sample genotypes for both plink and vcf files.

For plink file you simply can

import simtools.genotypes as gp
n = 100; p=1000
seedfile = 'plink_stem'
plink = gp.ReadPlink(seedfile)
genotypematrix = plink.sample(n, p)

This will randomly chose n subjects and p SNPs from a given plink file. Furthermore, sample also allows to write the samples subjects and variants to disk in the current working directory.

You can also read specific variants and subjects into a genotype-matrix.

import simtools.genotypes as gp
import numpy as np
seedfile = 'plink_stem'
plink = gp.ReadPlink(seedfile)
marker = np.random.choice(plink.bim.index.values, 5)
subjects = np.random.choice(plink.bim.index.values, 5)
genotypematrix = plink.read_bed(marker, subjects)

If read_plink is not provided with any marker or subjects it will attempted to read the entire bed file into memory.

Similar you can also you a vcf file

import simtools.genotypes as gp
n = 100; p=1000
vcf = gp.ReadVCF('example.vcf.gz')
genotypematrix = vcf.sample(n, p)

Similar to the plink class, vcf also has a read_vcf method to obtain the genotype-matrix of specific individuals and variants.

Furthermore, the vcf class also has a simple method to obtain the MAF of all variants in a vcf file

import simtools.genotypes as gp
output = 'path.to.output'
vcf = gp.ReadVCF('example.vcf.gz')
vcf.get_allele_frequency(output)

Phenotype Simulations

One of the main purposes of this library is to provide a simulation framework to simulate phenotypes. To start off one first needs a seed population in form of a plink file (support for vcf will be added).

import simtools.simtools as st
import numpy as np
plink_file = 'plink_stem'
sims = st.Simtools(plink_file)
sims.chunk_size = 100 # default

The chunk size sets how many variants simtools should process at a time. This might be useful if your seed population is large and cannot be loaded into memory. In addition, you can also select specific subjects within the seed population when initializing the class.

Single Phenotype Simulation

Multiple Phenotype Simulation

Use

It is relatively easy to simulate multiple related phenotypes from existing genetic data.

import simtools.simtools as st
import numpy as np
plink_file = 'plink_stem'
sims = st.Simtools(plink_file)
heratibiltiy = 0.4
num_causal_snps = [10, 10, 10] #for each phenotype
# adjacency matrix
lamb = np.eye(3)
lamb[1,0] = 0.1
lamb[2,0] = 0.1
# genetc effect matrix
b = np.eye(3); b = b * heratibiltiy

# simulate phenoytpes
phenos = sims.multi_phenotype(lamb, b, num_causal_snps)

The resulting matrix phenos is a matrix containing the phenotypes of all subjects. Please be aware that currently the multi_phenotype method does not produce binary traits.

Method

It is often desirable to simulate multiple related phenotypes. Lets say we have three phenotypes A, B, C and we would like to simulate inter-causal relationship among those. Then we can express the relationship between A, B, C in matrix form (also called a adjacency matrix) as

\Lambda =
\begin{bmatrix}
0 & \lambda_{A,B} & \lambda_{A,C} \\
\lambda_{B,A} & 0 & \lambda_{B,C} \\
\lambda_{C,A} & \lambda_{C,B} & 0
\end{bmatrix}

in which the effect of A on B is indicated by \lambda_{B,A}, the effect of B on A as \lambda_{A,B}, and so on. The phenotype for subject i for t phenotypes is then

y_i &= \Lambda y_i + \beta X_i  + e_i \\
 &= {(I - \Lambda)}^{-1} [\beta X_i+ \varepsilon_i]

where

  • X_i represents the vector of variants for a subject i
  • \beta the corresponding effect sizes of size t\times p
  • e_i is the vector of random effects
  • y_i is a vector of the phenotypes for subjects i
  • \Lambda is a t\times t matrix of structural coefficients (see above);

Hence it is easy to see that this approach flexible and allows the simulation of t number of arbitrary phenotypes. However, in the current form it is unable to specify specific genetic and environmental effects. In order to do so one can generalize the previous equation to all subjects

y &= {(I - \Lambda)}^{-1} [BG + DE] \\
&= {(I - \Lambda)}^{-1}BG + {(I - \Lambda)}^{-1}DE

in which

  • G = (X\beta')' is the genetic for the genotype matrix X_{n\times p} and the effect sizes \beta_{t\times p}
  • y the t \times n matrix of phenotypes
  • E be the t \times n matrix of errors (E = (\varepsilon_1, \ldots, \varepsilon_t) \quad with \quad \varepsilon \sim N(0,1))
  • B and D the diagonal scalar matrices of size t \times t for the genetic risk and error term respectively.

In order to constrain the variance of each phenotype to 1, let C = {(I - \Lambda)}^{-1} B and F = {(I - \Lambda)}^{-1} D then we can constrain

\Sigma_y = C\times \Sigma_G \times C^{T} + F\times \Sigma_E \times F^{T} =
\begin{bmatrix}
1 & \lambda_{1,2} & \lambda_{1,3} \\
\lambda_{2,1} & 1 & \lambda_{2,3} \\
\lambda_{3,1} & \lambda_{3,2} & 1
\end{bmatrix}

in which \Sigma_y, \Sigma_G and \Sigma_E are the variance-covariance matrices of y, G and E respectively.

Since we constrain the variance of all traits to be 1 the sum of the causal and genetic effect cannot be greater than 1. Hence the following constrain has to be respected when constructing both \Lambda and B

(\Lambda + B) \boldsymbol{1} \leq \boldsymbol{1}

Contents: