Introduction

This is an R tutorial for the epigenetic age prediction tool MethylAger, integrated into the RnBeads software package (http://rnbeads.mpi-inf.mpg.de). It should guide you how to use the epigenetic age prediction for different data sets. In this specific case, we focus on an Infinium 450K BeadChip data set as a representative, but MethylAger also supports sequencing-based data such as RRBS or WGBS. In this tutorial, we show two common use cases of age prediction that can be performed by MethylAger. First, we present the integration into the RnBeads module Covariate Inference and describe the HTML report that is created by the RnBeads pipeline. Then we go into more detail about customizing MethylAger and introduce the integration into a complete DNA methylation analysis pipeline.

Installing and Loading RnBeads

RnBeads is a software package for the integrative, start-to-finish analysis of DNA methylation data. It is part of R/Bioconductor (http://bioconductor.org/packages/release/bioc/html/RnBeads.html) and installation, including dowloading of dependencies,can be achieved via:

source("http://rnbeads.mpi-inf.mpg.de/install.R")

Now you can load RnBeads into your current R session by:

library(RnBeads)

Default Analysis

For the analysis conducted in this tutorial, we collected a 450K BeadChip data set derived from two independent studies, one focused on different tissues (H. Y. Lee et al. 2015, GSE59505) and the other on head and neck cancer patients (GSE40005, publication not available). This data set was already imported and preprocessed with RnBeads and can be found here. Download the zip-file into your working directory and load it.

#Load the RnBSet
rnbSet <- load.rnb.set("analysis_set.zip")

MethylAger is situated in the Covariate Inference module of RnBeads. This module deals with the analysis of potentially confounding factors, for instance by performing Surrogate Variable Analysis. One of these unkown factors is the individual’s epigenetic age, which could largely influence the overall epigenetic pattern. MethylAger infers this epigenetic age information from DNA methylation data and provides the information for follow-up analysis. To run the Covariate Inference module, a single command has to be executed:

?rnb.run.inference

The parameters of this method are a RnBSet object on which the analysis should be conducted and dir.reports, a path to which a HTML report should be written out. This report contains the results of the analysis, for instance plots and tables.

#A new folder 'inference_report' is created in your working directory
dir.reports <- file.path(getwd(),"inference_report")
#Execute the Inference module
result <- rnb.run.inference(rnb.set=rnbSet, dir.reports=dir.reports)
## 2018-04-10 15:03:00     1.0  STATUS STARTED Covariate Inference
## 2018-04-10 15:03:00     1.0    INFO     Number of cores: 1
## opening ff /tmp/RtmpTbh3qD/ff6aef4b4d242b.ff
## 2018-04-10 15:03:03     1.2  STATUS     STARTED Age Prediction using predefined predictor
## 2018-04-10 15:03:14     1.9  STATUS     COMPLETED Age Prediction using predefined predictor
## 2018-04-10 15:03:14     1.9  STATUS     STARTED Adding Age Prediction Section to Report
## 2018-04-10 15:05:20     1.7  STATUS         Added Comparison Plot
## 2018-04-10 15:05:22     1.7  STATUS         Added Error Plot
## 2018-04-10 15:05:22     1.7  STATUS     COMPLETED Adding Age Prediction Section to Report
## 2018-04-10 15:05:23     1.7  STATUS     Calculated LUMP estimates
## 2018-04-10 15:05:23     1.7  STATUS COMPLETED Covariate Inference

The corresponding report consists of plots that describe the relationship between predicted and annotated age for all samples. For instance, MethylAger compares the annotated ages in the data set to the predicted epigenetic ages with the following scatterplot:

Figure 1: This scatterplot describes the connection between predicted (y-axis) and annotated ages (x-axis). In the case of perfect prediction, all points are located on the diagonal.

Figure 1: This scatterplot describes the connection between predicted (y-axis) and annotated ages (x-axis). In the case of perfect prediction, all points are located on the diagonal.

In the interactive HTML report you are able to select different sample characteristics to annotate the points with different shapes or colors.

Custom Analysis

Performing Age Prediction on a RnBSet

Furthermore, MethylAger can modify the input RnBSet object by adding additional columns to the phenotypic table. This is achieved by running MethylAger separately. Without customizing any further options, MethylAger will check the assay type of the input data set and select the corresponding pre-defined predictor. Epigenetic age prediction is then performed by selecting the methylation values at the CpG sites of interest.

#Refresh the data set
rnbSet <- load.rnb.set("analysis_set.zip")
#Execute MethylAger
?rnb.execute.age.prediction
rnbSet <- rnb.execute.age.prediction(rnbSet)
## 2018-04-10 15:05:25     1.7  STATUS STARTED Performing Age Prediction
## 2018-04-10 15:05:28     2.2  STATUS COMPLETED Performing Age Prediction

This command starts MethylAger to predict epigenetic age from the provided data set’s methylation data. MethylAger returns a modified RnBSet object, where the predicted epigenetic age resides in the column predicted_ages of the phenotypic table. The epigenetic ages could be linked to the donor’s health state or lifestyle.

#Show the phenotypic information about the data set
head(pheno(rnbSet)[,c("geo_accession","tissue","age","age_combined","predicted_ages","age_increase")])
##   geo_accession tissue  age age_combined predicted_ages age_increase
## 1    GSM1438496 saliva <NA>           20       33.74164           NA
## 2    GSM1438497 saliva <NA>           24       32.28709           NA
## 3    GSM1438498 saliva <NA>           27       22.48387           NA
## 4    GSM1438499 saliva <NA>           50       49.94989           NA
## 5    GSM1438500 saliva <NA>           59       57.01872           NA
## 6    GSM1438501 saliva <NA>           55       55.95207           NA

The column age_increase contains the difference between predicted, epigenetic and annotated, chronological age. For healthy samples, this difference should be close to zero, otherwise the individual’s epigenetic aging has been influenced by some, probably unkown, factor. Since the age is annotated in the column age_combined (and not in the default column age), MethylAger has to be told where to find the age information by:

#Set the option to the correct column
rnb.options(inference.age.column="age_combined")
#Refresh the data set
rnbSet <- load.rnb.set("analysis_set.zip")
rnbSet <- rnb.execute.age.prediction(rnbSet)
## 2018-04-10 15:05:29     2.2  STATUS STARTED Performing Age Prediction
## 2018-04-10 15:05:33     2.2  STATUS COMPLETED Performing Age Prediction
#Only show the relevant columns of the phenotypic table
head(pheno(rnbSet)[,c("age_combined","predicted_ages","age_increase")])
##   age_combined predicted_ages age_increase
## 1           20       33.74164  13.74164055
## 2           24       32.28709   8.28709313
## 3           27       22.48387  -4.51613149
## 4           50       49.94989  -0.05010714
## 5           59       57.01872  -1.98127925
## 6           55       55.95207   0.95207303

A high value of age_increase indicates issues with the corresponding sample For instance, the sample’s donor could have suffered from a disease, which changed the epigenetic pattern in a way that the sample became epigenetically older.

Training a new Predictor

In addition to using the pre-defined predictors of MethylAger, a custom predictor can be trained from the provided data set. For this, some parameters have to be set accordingly:

rnb.options(inference.age.column="age_combined",
            inference.age.prediction.training=TRUE,
            inference.age.prediction.cv=TRUE)

This activates both the training of a new predictor and the evaluation of its performance by applying 10-fold cross validation. Now start the Covariate Inference module to perform training:

#Refresh the data set
rnbSet <- load.rnb.set("analysis_set.zip")
#A new folder 'training_report' is created in your working directory
dir.reports <- file.path(getwd(),"training_report")
result <- rnb.run.inference(rnb.set=rnbSet,dir.reports=dir.reports)
## 2018-04-10 15:05:34     2.0  STATUS STARTED Covariate Inference
## 2018-04-10 15:05:34     2.0    INFO     Number of cores: 1
## opening ff /tmp/RtmpTbh3qD/ff6aef4b4d242b.ff
## 2018-04-10 15:05:36     2.1  STATUS     STARTED Training new age predictor
## 2018-04-10 15:06:49     3.3  STATUS     COMPLETED Training new age predictor
## 2018-04-10 15:06:49     3.3  STATUS     STARTED Age Prediction using specified predictor
## 2018-04-10 15:06:51     3.9  STATUS     COMPLETED Age Prediction using specified predictor
## 2018-04-10 15:06:51     3.9  STATUS     STARTED Adding Age Prediction Section to Report
## 2018-04-10 15:08:53     2.5  STATUS         Added Comparison Plot
## 2018-04-10 15:08:55     2.5  STATUS         Added Error Plot
## 2018-04-10 15:08:55     2.5  STATUS     COMPLETED Adding Age Prediction Section to Report
## 2018-04-10 15:08:55     2.5  STATUS     STARTED 10-fold Cross Validation
## 2018-04-10 15:09:18     3.4    INFO         1
## 2018-04-10 15:09:59     4.1    INFO         2
## 2018-04-10 15:10:38     4.3    INFO         3
## 2018-04-10 15:11:21     4.3    INFO         4
## 2018-04-10 15:12:07     4.3    INFO         5
## 2018-04-10 15:12:50     4.2    INFO         6
## 2018-04-10 15:13:31     2.9    INFO         7
## 2018-04-10 15:14:09     2.8    INFO         8
## 2018-04-10 15:14:46     4.5    INFO         9
## 2018-04-10 15:15:22     4.3    INFO         10
## 2018-04-10 15:16:02     4.3  STATUS     COMPLETED 10-fold Cross Validation
## 2018-04-10 15:16:03     4.0  STATUS     Calculated LUMP estimates
## 2018-04-10 15:16:03     4.0  STATUS COMPLETED Covariate Inference

The resulting HTML report additionally contains information about the outcome of the cross validation. In this report, the new predictor that was trained on the input data set was applied to the data set itself. Thus, the high prediction accuracy shown in Figure 2 can be explained.

Figure 2: This plot describes the differences between predicted and annotated age for all samples. In the upper part, the distribution of these differences is shown. The line in the middle is the mean and the two other lines correspond to the 1- and 99 % quantiles, respectively. The lower part shows the corresponding difference for each sample.

Figure 2: This plot describes the differences between predicted and annotated age for all samples. In the upper part, the distribution of these differences is shown. The line in the middle is the mean and the two other lines correspond to the 1- and 99 % quantiles, respectively. The lower part shows the corresponding difference for each sample.

If there had been a shift away from mean zero, we would have been conforted with a systematic error in our data set. On top, the trained predictor is available as a CSV file in the data folder of the report directory. Similar to perfoming prediction, training can be invoked by rnb.execute.training without report generation.

Using a previously trained Predictor

If you trained a predictor in a previous run of MethylAger, you could use this predictor for a follow-up analysis by setting the appropriate option:

#Set predictor to the CSV file created with the training function
rnb.options(inference.age.prediction.predictor=file.path(getwd(),"training_report","covariate_inference_data","trained_predictor.csv"))

After performing training, the option is automatically set to the new predictor for the current R session. MethylAger can also be integrated into a start-to-finish analysis of RnBeads. For instance, the command below executes parts of RnBeads’ pipeline and executes age prediction with the previously trained predictor. The arguments for rnb.run.analysis are a path to which the report should be written out (dir.reports), the RnBSet and data.type, which tells the method that the import into RnBeads was already done. We will deactivate all RnBeads modules except Covariate Inference and Exploratory Analysis to save time.

#Refresh the data set
rnbSet <- load.rnb.set("analysis_set.zip")
#Set some options for the RnBeads pipeline, so the analysis finishes in a reasonable time
rnb.options(inference=TRUE,
            qc=FALSE,
            preprocessing=FALSE,
            exploratory.principal.components=4,
            exploratory.columns=c("tissue","gender","disease state","age_increase","predicted_ages","age_combined"),
            exploratory.beta.distribution=FALSE,
            exploratory.correlation.qc=FALSE,
            exploratory.intersample=FALSE,
            exploratory.region.profiles=vector("character"),
            differential=FALSE,
            export.to.bed=FALSE,
            export.to.trackhub=NULL,
            inference.age.prediction.training=FALSE,
            inference.age.prediction.cv=FALSE)
#A new folder 'analysis_report' is created in your working directory
dir.reports <- file.path(getwd(),"analysis_report")
#running time ~50min
result <- rnb.run.analysis(data.source=rnbSet,dir.reports=dir.reports,data.type="rnb.set")

MethylAger was executed within the RnBeads pipeline and in this way, the epigenetic age is accounted for in the follow-up module Exploratory Analysis. The resulting HTML report consists of a separate report for each of the executed modules. For instance, RnBeads tests for associations between epigenetic age and other sample characteristics and creates the following figure in the report:

Figure 3: Testing the relationship between different sample characteristics.

Figure 3: Testing the relationship between different sample characteristics.

If we find a significant assocation to epigenetic aging, the p-value resulting from the performed test will be shown in the corresponding cell of Figure 3. We then could analyze the effect of the correspoding sample’s characteristic to increased epigenetic aging of the individual. In this case, the epigenetic age is significantly linked to the sample’s tissue indicating different epigenetic ages for the analyzed tissues

Remarks

MethylAger can be deactivated within the Covariate Inference module by:

rnb.options(inference.age.prediction=FALSE)

Furthermore, some functions of MethylAger can run in parallel, so please consider the parallel processing options of RnBeads when analyzing a large data set, but keep in mind that MethylAger uses nested cross validation to assess model performance. Therefore, for instance, setting the number of workers to 3 results in 9 threads.

?parallel.setup

We recommend to execute cross validation only on data sets with at least 30 samples, since the folds become too small otherwise.

Conclusion

In this tutorial we presented MethylAger, an easy-to-use epigenetic age prediction tool that was integrated into the RnBeads software package. It is able to derive epigenetic age of samples analyzed by different methylation assays, such as the Illumina BeadChips or sequencing-based approaches. Increased epigenetic aging could be linked to a sample’s disease history by additional modules of RnBeads. Due to the integration of MethylAger into the RnBeads package, MethylAger can easily be included into a methylation data analysis workflow.

References

Lee, Hwan Young, Ja Hyun An, Sang Eun Jung, Yu Na Oh, Eun Young Lee, Ajin Choi, Woo Ick Yang, and Kyoung Jin Shin. 2015. “Genome-wide methylation profiling and a multiplex construction for the identification of body fluids using epigenetic markers.” Forensic Science International: Genetics 17: 17–24. doi:10.1016/j.fsigen.2015.03.002.