jaycee-routledge
2
A basic question…
3
Factors to consider
• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense
4
The problem of power…
• Ideally want to cover every Cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
• Stats have to find a way to work round this.
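To put rough numbers on the problem, here is a minimal sketch. It assumes about 28 million CpGs (an approximate figure for a human genome, not taken from the slides), a Bonferroni correction at 0.05, and the most favourable case possible: every read methylated in one sample and none in the other. For that table the two-sided Fisher p-value is 2 / C(2·depth, depth). All names are illustrative.

```python
from math import comb

N_CPGS = 28_000_000  # assumed approximate CpG count for a human genome
ALPHA = 0.05         # assumed family-wise significance level


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test p-value cut-off after Bonferroni correction."""
    return alpha / n_tests


def best_case_fisher_p(depth):
    """Smallest two-sided Fisher p-value at a given read depth per sample:
    sample A has depth/0 meth/unmeth calls, sample B has 0/depth."""
    return 2 / comb(2 * depth, depth)


def min_depth_to_survive(n_tests, alpha=ALPHA):
    """Lowest per-sample depth at which the best case passes correction."""
    threshold = bonferroni_threshold(n_tests, alpha)
    depth = 1
    while best_case_fisher_p(depth) > threshold:
        depth += 1
    return depth


threshold = bonferroni_threshold(N_CPGS)
for depth in (5, 10, 20):
    p = best_case_fisher_p(depth)
    print(f"depth {depth:>2}: best-case p = {p:.2e}, survives = {p <= threshold}")
print("minimum depth:", min_depth_to_survive(N_CPGS))
```

Under these assumptions even a complete methylation switch needs 17 reads per sample at a single CpG to survive correction, and any realistic, smaller effect needs far more.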
5
Maximising power
• Options
– Analyse in windows
– Pre-filter
– Hierarchical or Adaptive filtering
6
Window sizes
Small windows
• Good resolution
• Specific biological effects
• High multiple testing correction (MTC) burden
• Few observations
• High p-values

Large windows
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging
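The trade-off can be seen by pooling per-CpG calls into fixed-width windows. This is a minimal sketch: the positions, counts, and window sizes are made up, and `window_counts` is an illustrative helper rather than a function from any package.

```python
from collections import defaultdict


def window_counts(cpgs, window_size):
    """Sum (meth, unmeth) calls for CpGs falling in fixed-width windows.

    cpgs: iterable of (position, meth_calls, unmeth_calls).
    Returns {window_start: (meth, unmeth)} ordered by window start.
    """
    windows = defaultdict(lambda: [0, 0])
    for pos, meth, unmeth in cpgs:
        start = (pos // window_size) * window_size
        windows[start][0] += meth
        windows[start][1] += unmeth
    return {start: tuple(counts) for start, counts in sorted(windows.items())}


# Made-up per-CpG calls: (position, meth, unmeth)
cpgs = [(105, 3, 1), (180, 2, 2), (460, 0, 4),
        (910, 5, 0), (1250, 1, 3), (1900, 4, 1)]

small = window_counts(cpgs, 500)    # 4 windows, few calls in each
large = window_counts(cpgs, 2000)   # 1 window, all calls pooled
print(small)
print(large)
```

The small windows give four tests with a handful of calls each. The large window gives one well-covered test, but the fully methylated stretch around position 910 is averaged away.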
7
Simple Statistical Approach
• Is the proportion of methylated calls different between two samples, given the number of observations?
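| Meth count A | Unmeth count A | Meth count B | Unmeth count B | % change | Significant? |
|--------------|----------------|--------------|----------------|----------|--------------|
| 2            | 0              | 0            | 2              | 100      | No           |
| 200          | 2              | 198          | 5              | 1.5      | No           |
| 100          | 50             | 75           | 60             | 11       | Probably     |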
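The three rows can be checked directly. The sketch below recomputes the percentage change and a two-sided Fisher's exact p-value for each row using only the standard library. The choice of Fisher's exact test and the function names are assumptions made for illustration, not something the slide specifies.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


def percent_change(meth_a, unmeth_a, meth_b, unmeth_b):
    """Absolute difference in percentage methylation between A and B."""
    pct_a = 100 * meth_a / (meth_a + unmeth_a)
    pct_b = 100 * meth_b / (meth_b + unmeth_b)
    return abs(pct_a - pct_b)


rows = [(2, 0, 0, 2), (200, 2, 198, 5), (100, 50, 75, 60)]
results = [(percent_change(*r), fisher_exact_2x2(*r)) for r in rows]
for row, (change, p) in zip(rows, results):
    print(row, f"change = {change:.1f}%, p = {p:.3f}")
```

The first row cannot do better than p = 1/3 however large the percentage change, because there are only two reads per sample. The second row has plenty of reads but almost no effect. The third sits close to the conventional 0.05 cut-off.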
8
Contingency tests
• Chi-square / G-test / Fisher’s exact test
– Differ only at low observation counts
– Significant changes require enough observations that any of these should give the same answer
• Operate on single replicates
• Technical measure of difference

2x2 table layout:
Meth A | Unmeth A
Meth B | Unmeth B
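The point that the tests only disagree at low counts can be illustrated by running two of them on the same 2x2 layout. This is a sketch using only the standard library: the chi-square version is the plain Pearson statistic with one degree of freedom and no continuity correction, and both example tables are made up.

```python
from math import comb, erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 d.f., no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty row or column: nothing to test
    stat = n * (a * d - b * c) ** 2 / denom
    return erfc(sqrt(stat / 2))  # chi-square survival function for 1 d.f.


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


low = (2, 0, 0, 2)           # Meth A, Unmeth A, Meth B, Unmeth B
high = (300, 100, 200, 200)  # made-up well-covered region

for name, table in (("low", low), ("high", high)):
    print(name, f"chi-square p = {chi_square_2x2(*table):.3g}",
          f"Fisher p = {fisher_exact_2x2(*table):.3g}")
```

With two reads per sample the uncorrected chi-square approximation gives about 0.046 while the exact test gives 1/3, so the two disagree about significance. On the well-covered table both are vanishingly small and the choice of test no longer matters.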
9
Chi-Square results
10
Biological considerations
• Minimum relevant effect size?
– Balance power vs change
– What makes biological sense (what would you follow up?)
• Minimum coverage worth testing
– No point testing poorly covered regions
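Both criteria can be applied before any test is run, which shrinks the number of tests to correct for. A minimal sketch follows. The thresholds (10 reads, 10 percentage points), the region names, and the counts are illustrative assumptions, not recommendations from the slides.

```python
def prefilter(regions, min_coverage=10, min_diff=10.0):
    """Return names of regions worth testing.

    regions: {name: (meth_a, unmeth_a, meth_b, unmeth_b)}
    A region is kept if both samples have at least min_coverage calls and
    the absolute methylation difference is at least min_diff percentage points.
    """
    min_coverage = max(1, min_coverage)  # avoid dividing by zero coverage
    kept = []
    for name, (meth_a, unmeth_a, meth_b, unmeth_b) in regions.items():
        cov_a, cov_b = meth_a + unmeth_a, meth_b + unmeth_b
        if cov_a < min_coverage or cov_b < min_coverage:
            continue  # too few observations to be worth a test
        diff = abs(100 * meth_a / cov_a - 100 * meth_b / cov_b)
        if diff >= min_diff:
            kept.append(name)
    return kept


regions = {
    "r1": (2, 0, 0, 2),        # large change, almost no coverage
    "r2": (200, 2, 198, 5),    # good coverage, negligible change
    "r3": (100, 50, 75, 60),
    "r4": (30, 10, 10, 30),
}
kept = prefilter(regions)
alpha = 0.05
print(kept)
print("Bonferroni cut-off before:", alpha / len(regions),
      "after:", alpha / len(kept))
```

Halving the number of tests doubles the per-test cut-off here. The filter uses the observed difference, so the choice of thresholds should be made on biological grounds before looking at the p-values.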
11
Effect of pre-filtering
12
Distribution of methylation
The chi-square test relies on a normal approximation, and methylation data isn’t normally distributed (values are bounded between 0 and 100%).
13
Beta binomial distribution
A more appropriate model for methylation counts than chi-square. A custom model needs to be fitted to the actual data.
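To show what the beta-binomial adds, the sketch below compares it with a plain binomial at the same mean methylation. It does not fit anything to data: the shape parameters `alpha = beta = 2` and the depth of 20 reads are arbitrary assumptions chosen only to make the extra spread visible.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated calls out of n) when the methylation rate itself
    varies between samples as Beta(alpha, beta)."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta)
                            - log_beta(alpha, beta))


def binomial_pmf(k, n, p):
    """P(k methylated calls out of n) at one fixed methylation rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def variance(pmf_values):
    """Variance of a distribution given as [P(0), P(1), ..., P(n)]."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))


n = 20                    # assumed read depth
alpha, beta = 2.0, 2.0    # assumed shape parameters, mean rate 0.5
rate = alpha / (alpha + beta)

bb = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
bi = [binomial_pmf(k, n, rate) for k in range(n + 1)]
print("binomial variance:     ", round(variance(bi), 3))
print("beta-binomial variance:", round(variance(bb), 3))
```

With these values the binomial variance is 5 and the beta-binomial variance is 24. A test that assumes the binomial spread will therefore treat ordinary sample-to-sample variation as evidence of a difference.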
14
Implications of a beta distribution
• Many summaries assume normality
– Mean
– Standard Deviation
– Boxplots
• None of these is strictly appropriate when looking at methylation data
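A small made-up example shows how a mean and standard deviation can mislead on values clustered at the two extremes. The ten percentages below are invented for illustration.

```python
from statistics import mean, stdev

# Made-up percentage methylation values for ten regions:
# five nearly unmethylated, five nearly fully methylated.
values = [2, 3, 5, 4, 1, 96, 98, 97, 95, 99]

centre = mean(values)
spread = stdev(values)
nearest = min(abs(v - centre) for v in values)

print(f"mean = {centre}, standard deviation = {spread:.1f}")
print(f"closest observation to the mean is {nearest} points away")
```

The mean is 50%, yet no region is within 45 percentage points of it, so the summary describes a methylation level that does not occur in the data.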
15
Dealing with replicates
• Simple approach
– Merge data from replicates together
– Single test, High power
– Post-hoc test for consistency
• Explicitly account for batch effects
– Logistic regression
– Measures batch effects and excludes them from final significance calculation
• Work with methylation values
– Normalise percentage methylation values
– Use conventional statistics (t-tests etc) for comparing groups
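The simple approach can be sketched as follows: pool the calls from each group's replicates into one pair of counts for a single test, then check afterwards that the individual replicates agree. The consistency rule used here (every replicate of one group lies on the same side of every replicate of the other) is one possible post-hoc check chosen for illustration, and all counts are made up.

```python
def merge_replicates(replicates):
    """Pool (meth, unmeth) calls from several replicates into one pair."""
    meth = sum(m for m, _ in replicates)
    unmeth = sum(u for _, u in replicates)
    return meth, unmeth


def percent(meth, unmeth):
    """Percentage methylation from call counts."""
    return 100 * meth / (meth + unmeth)


def consistent_direction(group_a, group_b):
    """True if all A replicates are above all B replicates, or all below."""
    a = [percent(*rep) for rep in group_a]
    b = [percent(*rep) for rep in group_b]
    return min(a) > max(b) or max(a) < min(b)


# Made-up (meth, unmeth) calls for one region, three replicates per group
group_a = [(40, 10), (35, 15), (45, 5)]    # 80%, 70%, 90%
group_b = [(20, 30), (25, 25), (15, 35)]   # 40%, 50%, 30%
outlier = [(49, 1), (10, 40), (10, 40)]    # 98%, 20%, 20%

print(merge_replicates(group_a), merge_replicates(group_b))
print(consistent_direction(group_a, group_b))
print(merge_replicates(outlier), consistent_direction(outlier, group_b))
```

The `outlier` group pools to 46% and so looks higher than group B at 40%, but the difference comes from a single replicate. That is the kind of hit the post-hoc check is meant to catch.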
16
Hierarchical testing
• Test larger regions
– Windows / Features etc.
• Take significant hits and subdivide
– Smaller windows
– Individual CpGs
– Correct only for these tests
• Assemble hits together to make up DMRs
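The two-stage procedure above can be sketched in a few lines. This version assumes Fisher's exact test at both stages and a Bonferroni correction counted separately for each stage; the region names and counts are made up, and the final step of assembling hits into DMRs is left out.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


def pooled(cpgs):
    """Sum per-CpG (meth_a, unmeth_a, meth_b, unmeth_b) counts."""
    return tuple(sum(column) for column in zip(*cpgs))


def hierarchical_test(regions, alpha=0.05):
    """Stage 1: test each region on pooled counts, corrected for the
    number of regions. Stage 2: test individual CpGs only inside the
    stage-1 hits, corrected only for the CpGs tested at stage 2.
    Returns {region: [indices of significant CpGs]}."""
    region_cut = alpha / len(regions)
    hits = {name: cpgs for name, cpgs in regions.items()
            if fisher_exact_2x2(*pooled(cpgs)) <= region_cut}
    n_stage2 = sum(len(cpgs) for cpgs in hits.values())
    return {name: [i for i, counts in enumerate(cpgs)
                   if fisher_exact_2x2(*counts) <= alpha / n_stage2]
            for name, cpgs in hits.items()}


# Made-up per-CpG counts: (meth_a, unmeth_a, meth_b, unmeth_b)
regions = {
    "CGI_1": [(20, 2, 3, 19), (18, 3, 4, 17), (10, 10, 9, 11)],
    "CGI_2": [(10, 10, 11, 9), (12, 9, 10, 10)],
    "CGI_3": [(15, 5, 14, 6)],
}
result = hierarchical_test(regions)
print(result)
```

Only the three CpGs inside the one significant region are tested at the second stage, so they are corrected for three tests rather than for all six CpGs in the data set.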
17
Hierarchical testing
[Figure: a genome track carrying six CpG islands (CGI), repeated across panels; in the later panels four positions are marked with an X]
Statistically ‘creative’ solution to not having enough data
18
Methylation statistics packages
• swDMR (Perl/R-package)
– Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 2)
• methylKit* (R-package by A. Akalin et al.)
– Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method.
• bsseq* (R/Bioconductor by K.D. Hansen)
– Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection
• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
– Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
• RnBeads* (R package by F. Mueller et al.)
– Works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
• DMAP* (C command line tool by P. Stockwell et al.)
– RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA
• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
– Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs
• MOABS* (C++ command line tool by D. Sun et al.)
– Beta binomial hierarchical model to capture sampling and biological variation. Credible Methylation Difference (CDIF): a single metric that combines biological and statistical significance
• ComMet (Y. Saito et al., 2014)
– Part of the Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates
• DSS (R/Bioconductor by Feng et al., 2014)
– Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci
• more appearing every other week…
* interface well with
19
| Tool | Statistical test | Suitable for | Implementation | Notes |
|------|------------------|--------------|----------------|-------|
| bsseq | Sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs) | WGBS; not designed for targeted BS-Seq or RRBS | R package/Bioconductor | Outperforms Fisher’s exact test; intended to compare 2 groups; replicates required |
| BiSeq | Define CpG clusters, smooth methylation data, model and test group effect (fitting beta regression model to smoothed methylation levels and testing for group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries | RRBS; targeted BS-Seq; impractical for WGBS | R package/Bioconductor | Very computationally intensive; not limited to 2 groups |
| methylKit | Models CpG methylation within a logistic regression. Sliding linear model (SLIM) to correct for multiple testing | (e)RRBS | R package | |

* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq
20
bsseq – for whole genome BS-Seq
• Smoothing of low coverage BS-Seq first to get reliable semi-local methylation estimates
• Not suitable for captured or restricted data
• After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs)
• Smoothing suitable for even a single sample
• Works for CpG context in humans, will probably not scale to 2x585M Cs in non-CG context
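The idea behind smoothing low-coverage data is to borrow information from neighbouring CpGs. The sketch below is not the BSmooth algorithm and does not use the bsseq package; it swaps in a much simpler coverage-weighted moving average with a tricube kernel, purely to illustrate the principle. The bandwidth and all counts are made up.

```python
def smooth_methylation(cpgs, bandwidth=500):
    """Kernel-weighted methylation estimate at each CpG.

    cpgs: list of (position, meth_calls, total_calls).
    Each estimate pools calls from CpGs closer than `bandwidth` bases,
    down-weighting distant ones with a tricube kernel.
    Returns a list of (position, estimate), with None where no calls exist.
    Simple O(n^2) loop: fine for a sketch, not for a whole genome.
    """
    smoothed = []
    for pos, _, _ in cpgs:
        weighted_meth = weighted_total = 0.0
        for other_pos, meth, total in cpgs:
            dist = abs(other_pos - pos) / bandwidth
            if dist < 1:
                weight = (1 - dist ** 3) ** 3  # tricube kernel
                weighted_meth += weight * meth
                weighted_total += weight * total
        estimate = weighted_meth / weighted_total if weighted_total else None
        smoothed.append((pos, estimate))
    return smoothed


# Made-up low-coverage calls: (position, meth, total)
cpgs = [(100, 1, 1), (150, 3, 4), (220, 0, 1), (300, 4, 5), (5000, 0, 3)]
smoothed = smooth_methylation(cpgs)
for (pos, meth, total), (_, estimate) in zip(cpgs, smoothed):
    print(pos, f"raw = {meth / total:.2f}", f"smoothed = {estimate:.2f}")
```

The single unmethylated read at position 220 gives a raw value of 0, but its well-covered neighbours pull the smoothed estimate up. The isolated CpG at position 5000 has no neighbours within the bandwidth and keeps its raw value.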
21
BSmooth algorithm
[Figure legend] black: 25x (Lister); pink: 4x (Lister)
22
BSmooth t-values