Permute? or no?

This post covers brief findings from exploring permutation tests when looking for differentially methylated genes from whole genome bisulfite sequencing.

The idea behind it: if we randomly shuffle the data between control and treatment samples and still see roughly the same number of genes coming out as differentially methylated, then we can’t say the genes we originally identified are a product of the treatment we applied. This idea is explained better here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5067136/.

What we did: to run a permutation test on our data, we shuffled the ‘methylation values’ between samples for each CpG position and re-ran the differential methylation analysis. This was repeated 1000 times. We only shuffled within each CpG and not across the entire dataset, because shuffling across positions would let extreme high/low values from one position land on another, causing a huge number of differentially methylated sites to be called. It is therefore important to keep the relative methylation information for each CpG.
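Until the real code is linked, here is a minimal sketch of the within-CpG shuffle described above. It is not our actual pipeline: it assumes the methylation values sit in a NumPy array with one row per CpG and one column per sample, and `count_dm_genes` is a hypothetical stand-in for whatever differential methylation analysis gets re-run on each shuffled dataset.

```python
import numpy as np


def shuffle_within_cpg(meth, rng):
    """Shuffle methylation values across samples, separately for each CpG.

    meth: 2D array, rows = CpG positions, columns = samples (assumed layout).
    Each row is permuted independently, so a value never moves to another CpG.
    """
    return rng.permuted(meth, axis=1)


def permutation_counts(meth, count_dm_genes, n_perm=1000, seed=0):
    """Re-run the analysis on n_perm shuffled datasets.

    count_dm_genes is a hypothetical callable: it takes a shuffled matrix
    and returns the number of differentially methylated genes it finds.
    """
    rng = np.random.default_rng(seed)
    counts = [count_dm_genes(shuffle_within_cpg(meth, rng)) for _ in range(n_perm)]
    return np.array(counts)
```

Calling `permutation_counts(meth, count_dm_genes)` would give 1000 counts to compare against the count from the unshuffled data. Because every row keeps exactly its own set of values, the relative methylation information for each CpG is preserved.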

(I will edit this post in the future and put a link to the code on GitHub when it’s available).

What the possible outcomes of the permutation could be: if the number of differentially methylated genes found between control and treatment was greater than the number found in more than 95% of the shuffled runs, then we would conclude the treatment is causing this increase in differentially methylated genes. See example histogram below.

[Figure: example histogram, observed number of genes in the upper tail of the permutation outcomes]

We could also find that the number of differentially methylated genes between control and treatment falls within 95% of the permutation outcomes, in which case we cannot say the number of differentially methylated genes originally found is caused by the treatment.

[Figure: example histogram, observed number of genes in the middle of the permutation outcomes]
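The comparison behind these outcomes can be sketched as below. This is only an illustration under assumptions: `observed` is the gene count from the real control/treatment labels, `perm_counts` holds the counts from the shuffled runs, and the 0.95 cut-off is applied to each tail separately. The function names are made up for this post.

```python
import numpy as np


def fraction_below(observed, perm_counts):
    """Fraction of shuffled runs that gave fewer genes than the real data."""
    return float(np.mean(np.asarray(perm_counts) < observed))


def describe_outcome(observed, perm_counts, level=0.95):
    """Say where the observed count sits relative to the shuffled runs."""
    perm_counts = np.asarray(perm_counts)
    if np.mean(perm_counts < observed) > level:
        return "above"   # observed beats more than 95% of shuffled runs
    if np.mean(perm_counts > observed) > level:
        return "below"   # observed is lower than more than 95% of shuffled runs
    return "within"      # observed is not unusual compared with shuffled runs
```

An "above" result corresponds to the first outcome and "within" to the second. A "below" result means the shuffled data produced more genes than the real data almost every time.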

Third outcome (and what we actually found): the number of differentially methylated genes between control and treatment was considerably lower than the permutation outcomes (see graph below).

[Figure: histogram of our real result, observed number of genes below the permutation outcomes]

What does this mean!?! After lengthy debate by Ben/Alun/Boris we figured out why we see this. The original differential methylation analysis between control and treatment takes ‘bee colony’ as a covariate. This information is lost, however, when we permute the data, because each CpG then contains methylation values drawn at random from treatment and control samples regardless of colony. The majority of the methylation differences are explained by colony (not treatment), and when we can no longer take colony into account we call many more differentially methylated sites.
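A toy simulation of this effect for a single CpG is sketched below. All numbers are made up, and a simple within-colony difference stands in for the colony covariate, so this is not our analysis method. There is no treatment effect in the simulated data at all; the only structure is that the two samples from the same colony have similar methylation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_colonies = 10

# Made-up baseline methylation per colony: colonies differ a lot from each other.
colony_effect = np.linspace(0.1, 0.9, n_colonies)
noise = 0.002  # small sample-to-sample noise, no treatment effect at all

control = colony_effect + rng.normal(0, noise, n_colonies)
treatment = colony_effect + rng.normal(0, noise, n_colonies)

# With colony accounted for (compare within colony), differences are tiny.
paired_diff = np.abs(treatment - control).mean()

# Shuffle the values across all samples: the colony/sample link is broken.
shuffled = rng.permutation(np.concatenate([control, treatment]))
shuf_control, shuf_treatment = shuffled[:n_colonies], shuffled[n_colonies:]

# Comparing the same sample slots now mixes colonies, so differences are large.
shuffled_diff = np.abs(shuf_treatment - shuf_control).mean()

print(f"within-colony difference, real data: {paired_diff:.4f}")
print(f"within-colony difference, shuffled:  {shuffled_diff:.4f}")
```

In this toy case the shuffled data shows much larger apparent control/treatment differences than the real data, purely because colony variation is no longer cancelled out.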

In conclusion: a permutation test is not appropriate for our dataset and chosen differential methylation analysis method.