Ensuring correct coverage in MethylKit

Ben (previous postdoc in the lab) recently discovered that MethylKit automatically filters data based on coverage when the files are first read into R. We usually destrand our data (i.e. merge the reads from the positive and negative DNA strand to increase coverage at each CpG site) after we have read the data in and before we filter on coverage. However, Ben found that the filtering is happening before we destrand the data. This is an issue because if we have a coverage of say 8 for the positive strand of a CpG and 8 for the negative strand, these data would be thrown out (the minimum coverage base filter is 10), when we destrand the data we would have ended up with a CpG position with 16 coverage but because it has already been filtered we never get to actually test it.

Ben’s workaround for this is to destrand the data BEFORE we read it into methylKit. Previously we would use .bam files for methylKit, now we run the following commands on the .bam files to destrand the data and then read in these new .txt files to methylKit.

for file in $(ls *_deduplicated.bam)
do
    bismark_methylation_extractor \
    -p --no_overlap --comprehensive --bedgraph --report --cytosine_report \
    --genome_folder bumblebee_genome \
    ${file}
done

for file in $(ls *cov.gz)
do
    base=$(basename ${file} "_reads.bismark.cov.gz")
    coverage2cytosine \
    -o ${base} --merge_CpGs \
    --genome_folder bumblebee_genome \
    ${file}
done

The final files needed for methylKit will end with “merged_CpG_evidence.cov”. To read these files into methylKit use the following command.

sample.list <- list("sample1_merged_CpG_evidence.cov", "sample2_merged_CpG_evidence.cov",
                    "sample3_merged_CpG_evidence.cov", "sample4_merged_CpG_evidence.cov")

CPGRaw <- methRead(sample.list, 
                   sample.id = list("sample1", "sample2","sample3","sample4"),
                   assembly="bter_1.0",
                   treatment=c(0,0,1,1),
                   context="CpG",
                   dbtype = NA,
                   pipeline = "bismarkCoverage",
                   header = T, sep = "\t", mincov=1,
                   dbdir= path)