Predicting Cell Types from DNA Methylation Profiles
Version: 2016-09-27
Author: Fabian Müller
Overview
We provide R
scripts that allow for the training and interpretation of classifiers that accurately predict cell types from DNA methylation data.
Website
The source code and example data can be found at http://blueprint-methylomes.computational-epigenetics.org/
Prerequisites
- A working installation of
R
including the packagesRnBeads, argparse, matrixStats, glmnet, e1071, impute
Example data
The example data provided contains DNA methylation levels in putative regulatory regions for 446 replicates of hematopoietic cell types (methMatrix.tsv
). A corresponding sample annotation table is also included (sampleAnnot.tsv
). 319 hematopoietic progenitor cell types are labeled as training data in the annotation table.
Command line call
The classifiers can be trained and evaluated using the following command line call:
Rscript cellTypePredictor.R --features methMatrix.tsv --annot sampleAnnot.tsv --out predictCellCellTypes
This will
- impute missing values (if present) in the data matrix (
methMatrix.tsv
) - train a classifier based on the features and observations contained in
methMatrix.tsv
. The class labels are taken from the sample annotation file (sampleAnnot.tsv
) - perform 10-fold cross-validation
- train classifiers excluding observations for a particular class and then predict the observations belonging to that class to one of the other classes (“leave-one-class-out” classifiers)
- assess class probabilities for each observation in the methylation matrix
- identify signature features
- generate corresponding misclassification tables and plots, including distribution plots, ROC curves, heatmaps and radar plots of class probabilities
- store the trained classifiers, data matrices and class probabilities in a binary format for further evaluation and custom scripting
Note: The code has not been optimized for performance. So, with the example data, the script will run fur a while (approximately 22 hours on our computing infrastructure). The size of the output is also considerable (~30GB) since the data matrices and full model specifications are included for each model.
Input
Parameter | Description |
---|---|
--features, -f |
Filename of the feature/methylation matrix (tab-separated, dimension: features X samples, includes a header). |
--annot, -a |
Filename of the sample annotation matrix (tab-separated). Must contain the same number of rows as the feature matrix has columns and a header line. The column called ‘class’ is assumed to contain the class assignment. An optional ‘train’ column may contain TRUE/FALSE depending on whether a sample will be used in training of the predictor. An optional ‘color’ column may contain color values for each class (colors must be consistent for samples of the same class). |
--out, -o |
Output directory. Must be non-existing and will be created. |
Output
Several files will be contained in the output directory:
Filename pattern | Description |
---|---|
confusion_(en|svm)_predict.pdf |
Plot of the confusion matrix showing which samples in the dataset are assigned to which class according to the respective classifier (en for elastic-net logistic regression, svm for linear SVM). Note: this is not the result for the 10-fold cross-validation but for the classifier trained on the entire training dataset. It will therefore be overoptimistic for the training data. |
confusion_(en|svm)_cv_predict.pdf |
As above, but showing the class-confusion in the 10-fold cross-validation setting |
confusion_(en|svm)_predict_loco_*.pdf |
Confusion matrices for the leave-one-class-out classifiers |
hm_(en|svm)_classProbs_predict.pdf |
Heatmap of assigned class probabilities for each sample (not cross-validated). |
hm_(en|svm)_cv_classProbs_predict.pdf |
Heatmap of assigned class probabilities for each sample (cross-validated). |
hm_(en|svm)_classProbs_predict_loco_*.pdf |
Heatmap of assigned class probabilities for the leave-one-class-out classifiers |
rd_(en|svm)_classProbs_predict.pdf |
Radar plot of assigned class probabilities for each sample (not cross-validated). In these plots, the size of the sectors corresponds to the assigned class probabilities to each class. |
rd_(en|svm)_cv_classProbs_predict.pdf |
Radar plot of assigned class probabilities for each sample (cross-validated). |
rd_(en|svm)_classProbs_predict_loco_*.pdf |
Radar plot of assigned class probabilities for the leave-one-class-out classifiers |
roc_(en|svm)_cv_classProbs_predict.pdf |
Per-class ROC curves and AUC values (10-fold cross-validated) |
predictionTable_(en|svm)_classProbs.tsv |
Table (tab-separated) containing class probabilities for each sample in the dataset |
predictionTable_(en|svm)_signatureIndices.tsv |
List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection |
predictionTable_loco_*_(en|svm)_classProbs.tsv |
Table (tab-separated) containing class probabilities for each sample in the dataset (leave-one-class-out classifiers) |
predictionTable_loco_*_(en|svm)_signatureIndices.tsv |
List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection (leave-one-class-out classifiers) |
predictionResult.rds |
Binary file containing the trained models, data and prediction results for further evaluation and custom scripting (see description below) |
predictionResult_loco_*.rds |
As above, but for the different leave-one-class-out classifiers |
R
object files containing prediction results
The predictionResult.rds
files contain the prediction results of the trained classifier as R
objects. They can be loaded using readRDS("predictionResult.rds")
The object is a list with one element for each classifier. Each of those elements is again a list containing
- the fitted model object (element:
$fit
) - predicted class for each observation in the dataset (element:
$pred
) - predicted class probabilities for each observation in the dataset (element:
$prob
)
Methods and implementation details
- By default, elastic-net regularized logistic regression models and linear support vector machine classifiers will be trained. They are denoted by
en
andsvm
, respectively in the outputs. Theglmnet
ande1071
R
packages are used for model inference, respectively. - For elastic-net regularized logistic regression, the
alpha
parameter is set to 0.5 to reflect equal mixing of lasso and ridge penalties.lambda
is chosen by nested 10-fold cross-validation. - Missing value imputation will be performed using 5-nearest neighbor imputation (using the
impute
R
package).
Reproducibility of the paper’s results
The example data provided is the dataset used for predicting cell types in the context of hematopoietic stem cell differentiation [Farlik, Halbritter, Müller et al. (2016)]. With the exception of stochastic differences due to missing value imputation and sampling of the cross-validation folds, the script reproduces the data for the results presented in the paper. Custom scripting was applied to produce the figures in the article from the inferred feature weights and prediction probabilities. The extended code is available upon request.
Citation
Please use the reference given on the following website when citing this code:
http://blueprint-methylomes.computational-epigenetics.org/