Table 2 Filtering and QC procedures in Stage 2: calling genotypes in all 725 monkeys at the unequivocal segregating sites identified in Stage 1. Stage 2 started with 4,235,761 sites and ended with 3,369,989 sites

From: Sequencing strategies and characterization of 721 vervet monkey genomes for future genetic analyses of medically relevant traits

| QC filtering procedure | Number of variants removed |
| --- | --- |
| Not passing SAMtools filters (“mpileup -S -D -q 30 -Q 20”, “vcfutils.pl varFilter -w 10 -d 3 -D 12740 -e 0–2 0”) | 209,826 |
| Cumulative coverage outside of twofold range of global median coverage | 20,843 |
| MAF in 723 monkeys <10 % | 10,766 |
| Missing >50 % of data | 105 |
| Too few (<3) loci in 3Mb regions, not enough for TrioCaller to work. | 1,360 |
| Loci unmapped or not mapped uniquely during LiftOver | 32,419 |
| Filtered out by GATK’s FilterLiftedVariants | 4,094 |
| Whole contig removed for contigs with >1 chromosome switching events per 100 loci | 6,208 |
| LiftOver MapScore <0.5 | 61,721 |
| Loci mapped to the same coordinate in the new reference genome | 4 |
| Alignment: identified regions of poor alignment (mapping quality <2- or coverage >2-fold range of global median depth) and masked these genotypes as missing. Sites with >50 % missing in 4X and above monkeys are removed | 438,423 |
| Sex chromosome SNPs | 65,271 |
| >=5 Mendel errors in parent–child comparisons | 8,563 |
| >60 % heterozygous calls | 6,201 |
| Total | 865,772 |
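
The bookkeeping in this table can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the authors' pipeline. It compares the reported total with the difference between the starting and ending site counts given in the caption, and also sums the individual rows. The short row labels are abbreviations introduced here, not names used in the source.

```python
# Counts transcribed from Table 2 (variants removed by each Stage 2 filter).
# The dictionary keys are shorthand labels chosen for this sketch.
removed = {
    "samtools_filters": 209_826,
    "cumulative_coverage": 20_843,
    "maf_below_10pct": 10_766,
    "missing_over_50pct": 105,
    "too_few_loci_for_triocaller": 1_360,
    "liftover_unmapped_or_nonunique": 32_419,
    "gatk_filter_lifted_variants": 4_094,
    "contig_chromosome_switching": 6_208,
    "liftover_mapscore_below_0_5": 61_721,
    "same_coordinate_after_liftover": 4,
    "poor_alignment_masking": 438_423,
    "sex_chromosome_snps": 65_271,
    "mendel_errors": 8_563,
    "heterozygous_over_60pct": 6_201,
}

SITES_START = 4_235_761    # sites entering Stage 2 (from the caption)
SITES_END = 3_369_989      # sites remaining after Stage 2 (from the caption)
REPORTED_TOTAL = 865_772   # "Total" row of the table

net_removed = SITES_START - SITES_END
row_sum = sum(removed.values())

print(f"start - end     : {net_removed:,}")
print(f"reported total  : {REPORTED_TOTAL:,}")
print(f"sum of rows     : {row_sum:,}")
print(f"rows - reported : {row_sum - REPORTED_TOTAL:,}")
```

The reported total equals the difference between the starting and ending site counts. The individual rows as listed here add up to 865,804, which is 32 more than the reported total. The source does not explain the difference.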