Predicting Cell Types from DNA Methylation Profiles

Version: 2016-09-27
Author:  Fabian Müller

Overview

We provide R scripts that allow for the training and interpretation of classifiers that accurately predict cell types from DNA methylation data.

Website

The source code and example data can be found at http://blueprint-methylomes.computational-epigenetics.org/

Prerequisites

A working installation of R including the packages RnBeads, argparse, matrixStats, glmnet, e1071, impute

Example data

The example data provided contains DNA methylation levels in putative regulatory regions for 446 replicates of hematopoietic cell types (methMatrix.tsv). A corresponding sample annotation table is also included (sampleAnnot.tsv). 319 hematopoietic progenitor cell types are labeled as training data in the annotation table.

Command line call

The classifiers can be trained and evaluated using the following command line call:

Rscript cellTypePredictor.R --features methMatrix.tsv --annot sampleAnnot.tsv --out predictCellCellTypes

This will

impute missing values (if present) in the data matrix ( methMatrix.tsv)
train a classifier based on the features and observations contained in methMatrix.tsv. The class labels are taken from the sample annotation file (sampleAnnot.tsv)
perform 10-fold cross-validation
train classifiers excluding observations for a particular class and then predict the observations belonging to that class to one of the other classes (“leave-one-class-out” classifiers)
assess class probabilities for each observation in the methylation matrix
identify signature features
generate corresponding misclassification tables and plots, including distribution plots, ROC curves, heatmaps and radar plots of class probabilities
store the trained classifiers, data matrices and class probabilities in a binary format for further evaluation and custom scripting

Note: The code has not been optimized for performance. So, with the example data, the script will run fur a while (approximately 22 hours on our computing infrastructure). The size of the output is also considerable (~30GB) since the data matrices and full model specifications are included for each model.

Input

Parameter	Description
`--features, -f`	Filename of the feature/methylation matrix (tab-separated, dimension: features X samples, includes a header).
`--annot, -a`	Filename of the sample annotation matrix (tab-separated). Must contain the same number of rows as the feature matrix has columns and a header line. The column called ‘class’ is assumed to contain the class assignment. An optional ‘train’ column may contain TRUE/FALSE depending on whether a sample will be used in training of the predictor. An optional ‘color’ column may contain color values for each class (colors must be consistent for samples of the same class).
`--out, -o`	Output directory. Must be non-existing and will be created.

Output

Several files will be contained in the output directory:

Filename pattern	Description
`confusion_(en\|svm)_predict.pdf`	Plot of the confusion matrix showing which samples in the dataset are assigned to which class according to the respective classifier (`en` for elastic-net logistic regression, `svm` for linear SVM). Note: this is not the result for the 10-fold cross-validation but for the classifier trained on the entire training dataset. It will therefore be overoptimistic for the training data.
`confusion_(en\|svm)_cv_predict.pdf`	As above, but showing the class-confusion in the 10-fold cross-validation setting
`confusion_(en\|svm)_predict_loco_*.pdf`	Confusion matrices for the leave-one-class-out classifiers
`hm_(en\|svm)_classProbs_predict.pdf`	Heatmap of assigned class probabilities for each sample (not cross-validated).
`hm_(en\|svm)_cv_classProbs_predict.pdf`	Heatmap of assigned class probabilities for each sample (cross-validated).
`hm_(en\|svm)_classProbs_predict_loco_*.pdf`	Heatmap of assigned class probabilities for the leave-one-class-out classifiers
`rd_(en\|svm)_classProbs_predict.pdf`	Radar plot of assigned class probabilities for each sample (not cross-validated). In these plots, the size of the sectors corresponds to the assigned class probabilities to each class.
`rd_(en\|svm)_cv_classProbs_predict.pdf`	Radar plot of assigned class probabilities for each sample (cross-validated).
`rd_(en\|svm)_classProbs_predict_loco_*.pdf`	Radar plot of assigned class probabilities for the leave-one-class-out classifiers
`roc_(en\|svm)_cv_classProbs_predict.pdf`	Per-class ROC curves and AUC values (10-fold cross-validated)
`predictionTable_(en\|svm)_classProbs.tsv`	Table (tab-separated) containing class probabilities for each sample in the dataset
`predictionTable_(en\|svm)_signatureIndices.tsv`	List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection
`predictionTable_loco_*_(en\|svm)_classProbs.tsv`	Table (tab-separated) containing class probabilities for each sample in the dataset (leave-one-class-out classifiers)
`predictionTable_loco_*_(en\|svm)_signatureIndices.tsv`	List of feature indices (as in the input matrix) of signature regions obtained by the classifier feature selection (leave-one-class-out classifiers)
`predictionResult.rds`	Binary file containing the trained models, data and prediction results for further evaluation and custom scripting (see description below)
`predictionResult_loco_*.rds`	As above, but for the different leave-one-class-out classifiers

`R` object files containing prediction results

The predictionResult.rds files contain the prediction results of the trained classifier as R objects. They can be loaded using readRDS("predictionResult.rds")
The object is a list with one element for each classifier. Each of those elements is again a list containing

the fitted model object (element: $fit)
predicted class for each observation in the dataset (element: $pred)
predicted class probabilities for each observation in the dataset (element: $prob)

Methods and implementation details

By default, elastic-net regularized logistic regression models and linear support vector machine classifiers will be trained. They are denoted by en and svm, respectively in the outputs. The glmnet and e1071 R packages are used for model inference, respectively.
For elastic-net regularized logistic regression, the alpha parameter is set to 0.5 to reflect equal mixing of lasso and ridge penalties. lambda is chosen by nested 10-fold cross-validation.
Missing value imputation will be performed using 5-nearest neighbor imputation (using the impute R package).

Reproducibility of the paper’s results

The example data provided is the dataset used for predicting cell types in the context of hematopoietic stem cell differentiation [Farlik, Halbritter, Müller et al. (2016)]. With the exception of stochastic differences due to missing value imputation and sampling of the cross-validation folds, the script reproduces the data for the results presented in the paper. Custom scripting was applied to produce the figures in the article from the inferred feature weights and prediction probabilities. The extended code is available upon request.

Citation

Please use the reference given on the following website when citing this code:
http://blueprint-methylomes.computational-epigenetics.org/