Predicting Cell Types from DNA Methylation Profiles

Version: 2016-09-27
Author:  Fabian Müller

Overview

We provide R scripts that allow for the training and interpretation of classifiers that accurately predict cell types from DNA methylation data.

Website

The source code and example data can be found at http://blueprint-methylomes.computational-epigenetics.org/

Prerequisites

Example data

The example data contain DNA methylation levels in putative regulatory regions for 446 replicates of hematopoietic cell types (methMatrix.tsv). A corresponding sample annotation table is also included (sampleAnnot.tsv). In the annotation table, 319 samples of hematopoietic progenitor cell types are labeled as training data.

Command line call

The classifiers can be trained and evaluated using the following command line call:

Rscript cellTypePredictor.R --features methMatrix.tsv --annot sampleAnnot.tsv --out predictCellCellTypes

This will

Note: The code has not been optimized for performance. With the example data, the script will run for a long time (approximately 22 hours on our computing infrastructure). The output is also large (~30GB) because the data matrices and full model specifications are stored for each model.

Input

Parameter Description
--features, -f Filename of the feature/methylation matrix (tab-separated, dimension: features X samples, includes a header).
--annot, -a Filename of the sample annotation matrix (tab-separated). Must contain the same number of rows as the feature matrix has columns and a header line. The column called ‘class’ is assumed to contain the class assignment. An optional ‘train’ column may contain TRUE/FALSE depending on whether a sample will be used in training of the predictor. An optional ‘color’ column may contain color values for each class (colors must be consistent for samples of the same class).
--out, -o Output directory. Must be non-existing and will be created.
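
Given the long runtime, it can be worth checking the inputs against the requirements above before starting the script. The following is a minimal sketch of such a check, not part of cellTypePredictor.R. The function name check_inputs, the toy file contents and the demo directory name are hypothetical, and the feature matrix is assumed to be readable with read.delim.

```r
# Sketch of a pre-flight check for the documented input requirements.
# check_inputs() is a hypothetical helper, not part of cellTypePredictor.R.
check_inputs <- function(feat_file, annot_file, out_dir) {
  features <- as.matrix(read.delim(feat_file, header = TRUE, check.names = FALSE))
  annot <- read.delim(annot_file, header = TRUE, stringsAsFactors = FALSE)
  problems <- character(0)

  # One annotation row per column (sample) of the feature matrix
  if (nrow(annot) != ncol(features)) {
    problems <- c(problems, "annotation rows do not match feature matrix columns")
  }
  # The 'class' column is required
  if (!"class" %in% colnames(annot)) {
    problems <- c(problems, "annotation table has no 'class' column")
  }
  # The optional 'train' column must hold TRUE/FALSE values
  if ("train" %in% colnames(annot) && !is.logical(annot$train)) {
    problems <- c(problems, "'train' column is not TRUE/FALSE")
  }
  # The optional 'color' column must be consistent within each class
  if (all(c("class", "color") %in% colnames(annot))) {
    n_colors <- tapply(annot$color, annot$class, function(x) length(unique(x)))
    if (any(n_colors > 1)) {
      problems <- c(problems, "inconsistent colors within a class")
    }
  }
  # The output directory must not exist yet
  if (file.exists(out_dir)) {
    problems <- c(problems, "output directory already exists")
  }
  problems
}

# Toy stand-ins for methMatrix.tsv and sampleAnnot.tsv (all values are made up)
feat_file <- tempfile(fileext = ".tsv")
annot_file <- tempfile(fileext = ".tsv")
meth <- matrix(c(0.1, 0.8, 0.5, 0.9, 0.2, 0.4), nrow = 2,
               dimnames = list(NULL, c("s1", "s2", "s3")))
write.table(meth, feat_file, sep = "\t", quote = FALSE, row.names = FALSE)
annot <- data.frame(class = c("HSC", "HSC", "MPP"),
                    train = c(TRUE, TRUE, FALSE),
                    color = c("darkgreen", "darkgreen", "orange"),
                    stringsAsFactors = FALSE)
write.table(annot, annot_file, sep = "\t", quote = FALSE, row.names = FALSE)

problems <- check_inputs(feat_file, annot_file,
                         file.path(tempdir(), "predictCellTypes_demo"))
print(problems)
```

An empty result means none of the checked requirements is violated. The check covers only the constraints listed in the table above and does not validate the methylation values themselves.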

Output

Several files will be contained in the output directory:

Filename pattern Description
confusion_(en|svm)_predict.pdf Plot of the confusion matrix showing which samples in the dataset are assigned to which class according to the respective classifier (en for elastic-net logistic regression, svm for linear SVM). Note: this is not the result for the 10-fold cross-validation but for the classifier trained on the entire training dataset. It will therefore be overoptimistic for the training data.
confusion_(en|svm)_cv_predict.pdf As above, but showing the class-confusion in the 10-fold cross-validation setting
confusion_(en|svm)_predict_loco_*.pdf Confusion matrices for the leave-one-class-out classifiers
hm_(en|svm)_classProbs_predict.pdf Heatmap of assigned class probabilities for each sample (not cross-validated).
hm_(en|svm)_cv_classProbs_predict.pdf Heatmap of assigned class probabilities for each sample (cross-validated).
hm_(en|svm)_classProbs_predict_loco_*.pdf Heatmap of assigned class probabilities for the leave-one-class-out classifiers
rd_(en|svm)_classProbs_predict.pdf Radar plot of assigned class probabilities for each sample (not cross-validated). In these plots, the size of the sectors corresponds to the assigned class probabilities to each class.
rd_(en|svm)_cv_classProbs_predict.pdf Radar plot of assigned class probabilities for each sample (cross-validated).
rd_(en|svm)_classProbs_predict_loco_*.pdf Radar plot of assigned class probabilities for the leave-one-class-out classifiers
roc_(en|svm)_cv_classProbs_predict.pdf Per-class ROC curves and AUC values (10-fold cross-validated)
predictionTable_(en|svm)_classProbs.tsv Table (tab-separated) containing class probabilities for each sample in the dataset
predictionTable_(en|svm)_signatureIndices.tsv List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection
predictionTable_loco_*_(en|svm)_classProbs.tsv Table (tab-separated) containing class probabilities for each sample in the dataset (leave-one-class-out classifiers)
predictionTable_loco_*_(en|svm)_signatureIndices.tsv List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection (leave-one-class-out classifiers)
predictionResult.rds Binary file containing the trained models, data and prediction results for further evaluation and custom scripting (see description below)
predictionResult_loco_*.rds As above, but for the different leave-one-class-out classifiers
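
The class probability tables can be post-processed in R, for example to derive the most probable class per sample and tabulate it against the annotated class. The sketch below assumes a layout that this README does not specify: one row per sample, a first column with sample identifiers, and one column per class. The toy table, its class names and the true labels are made up for illustration.

```r
# Toy stand-in for a predictionTable_(en|svm)_classProbs.tsv file.
# ASSUMPTION: rows are samples, the first column holds sample IDs and each
# remaining column holds the probability for one class.
probs_file <- tempfile(fileext = ".tsv")
toy <- data.frame(sample = c("s1", "s2", "s3"),
                  HSC = c(0.7, 0.2, 0.5),
                  MPP = c(0.3, 0.8, 0.5))
write.table(toy, probs_file, sep = "\t", quote = FALSE, row.names = FALSE)

probs <- read.delim(probs_file, row.names = 1, check.names = FALSE)
prob_mat <- as.matrix(probs)

# Most probable class per sample; ties go to the first class column
top_class <- colnames(prob_mat)[max.col(prob_mat, ties.method = "first")]
calls <- data.frame(sample = rownames(prob_mat),
                    predicted = top_class,
                    probability = apply(prob_mat, 1, max),
                    row.names = NULL,
                    stringsAsFactors = FALSE)

# Confusion table against (made-up) annotated classes
true_class <- c("HSC", "MPP", "MPP")
classes <- colnames(prob_mat)
conf <- table(true = factor(true_class, levels = classes),
              predicted = factor(calls$predicted, levels = classes))
print(calls)
print(conf)
```

If the actual files use a different orientation or lack an identifier column, the read.delim call needs to be adjusted accordingly.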

R object files containing prediction results

The predictionResult.rds files contain the prediction results of the trained classifiers as R objects. They can be loaded using readRDS("predictionResult.rds").
The object is a list with one element for each classifier. Each of those elements is again a list containing
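
The structure of a loaded result object can be explored with standard R functions. The sketch below writes a small stand-in object to a temporary file so that it runs without the real output. The element names en and svm are an assumption borrowed from the output filename patterns, and the placeholder contents are hypothetical.

```r
# Stand-in for predictionResult.rds.
# ASSUMPTION: one top-level element per classifier, named "en" and "svm";
# the nested contents here are placeholders, not the real structure.
res_file <- tempfile(fileext = ".rds")
saveRDS(list(en = list(placeholder = 1:3),
             svm = list(placeholder = letters[1:2])),
        res_file)

res <- readRDS(res_file)

# Names of the classifiers stored in the object
classifier_names <- names(res)
print(classifier_names)

# Names of the elements stored for each classifier
element_names <- lapply(res, names)
print(element_names)

# Compact overview, limited to two nesting levels to keep the output short
str(res, max.level = 2)
```

For a real result file, replace res_file with the path to predictionResult.rds. Because the full data matrices and model specifications are stored, limiting max.level keeps the printed overview manageable.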

Methods and implementation details

Reproducibility of the paper’s results

The example data provided is the dataset used for predicting cell types in the context of hematopoietic stem cell differentiation [Farlik, Halbritter, Müller et al. (2016)]. With the exception of stochastic differences due to missing value imputation and sampling of the cross-validation folds, the script reproduces the data for the results presented in the paper. Custom scripting was applied to produce the figures in the article from the inferred feature weights and prediction probabilities. The extended code is available upon request.

Citation

Please use the reference given on the following website when citing this code:
http://blueprint-methylomes.computational-epigenetics.org/