Figure 5 - Fragment coverage at regions-of-interest (using LIQUORICE)

In order to run LIQUORICE on our cfDNA data, we first need several different inputs:

  • a reference genome file
  • chromosome lengths of that reference genome
  • fragment size information from Picard's InsertSizeMetrics
  • CNA information from ichorCNA (.correcteddepth files)
  • mappability information
  • .bed files containing the sites of interest
  • .bw files which give fragment-based coverage information. In order to make hosting of these large files feasable, they only contain coverage information at the regions that are contained in the region-sets used in the paper. In this example, we will just analyze a single sample to speed things up.

Let's download them:

In [ ]:
%%bash
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/fragment_size.tar.gz # as produced by Picard's CollectInsertSizeMetrics
tar -xzf fragment_size.tar.gz
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/ichorCNA_output.tar.gz
tar -xzf ichorCNA_output.tar.gz # as produced by ichorCNA, used as described in Fig. 3
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/hg38_mappability_75bp.bigwig.gz
gunzip hg38_mappability_75bp.bigwig.gz
samplename="EwS_6_1"
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/subsetted_bigwigs/${samplename}.bw

Note: ichorCNA's output was generated as shown in Figure 3, the .bigWig files were generated like so:

In [ ]:
%%bash

# REQUIRES .BAM FILES, OUTPUT PROVIDED FOR DOWNLOAD ABOVE
mkdir -p bigwigs
for BAMFILE in $BAMLIST
do
    SAMPLENAME=$(basename ${BAMFILE} $BAMSUFFIX)
    echo "Running bamCoverage on $SAMPLENAME"
    bamCoverage -b $BAMFILE -o bigwigs/${SAMPLENAME}.bw \
    --binSize 1 -e --effectiveGenomeSize 2747877777 \
    --normalizeUsing RPGC --minMappingQuality 20 --ignoreForNormalization chrX chrM chrY
done

Now, we are ready to start LIQUORICE:

In [1]:
%%bash
samplename="EwS_6_1" # Just an example, you can choose any sample here
BEDPATH="../tables/ews_specific_DHSs.bed" # Just an example, can be replaced by other region-sets from Suppl. Table 6
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences

# Please be patient, the calculations will take a while..
bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${BEDPATH} --bigwigpath ${samplename}.bw  \
            --fraglenpath fragment_size/${samplename}_*.txt --cnvpath ichorCNA_output/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
            --hg38path hg38.fa --mapppath hg38_mappability_75bp.bigwig --chromsizespath hg38.chrom.sizes \
            --binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 \
            --classifier RF

# Let's also plot the output and fit the gaussian models:
mkdir -p fitted_model
cd fitted_model/                                                    # size of flanking region , binsize, average size of core region,
                                                                    # and width restrictions for the 3 Gaussian functions.
                                                                    # The numbers used here are the same as in the paper and represent the median widths over 
                                                                    # all samples
python ../fit_model.py $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 148 149 851 852 5927 5928
Running on EwS_6_1
Fitting done - see the output table and .png in fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928

LIQUORICE generates folders for each analyzed sample. In this folder, you can find .csv files containing the aggregated, binned coverage profile (both bias-corrected as well as uncorrected). You can also find a complete table with the coverage values per bin per regionset, along with the associated bias factors. A table with the scaled feature importances of the bias-correction model is also included. The model-based fitting procedure produces a table with a set of parameters for each sample; the graphical output of the fitting procedure looks like this (Example: EwS_6_1 coverage at EwS-specific DHSs)

In [8]:
from IPython.display import Image
Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_fitted_gaussians.png') # generated by code above
Out[8]:
In [7]:
Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_normalized_coverage.png') # generated by code above
Out[7]:

If you want to run LIQUORICE for multiple region-sets and multiple samples, you can do that like so:

In [ ]:
%%bash
BEDFILESLIST="" # space-seperated list of region-sets
BWFILES="" # space-seperated list of .bigwig files to be analysed, or : "<folder containing bigwigs>/*.bw"
FRAGMENT_SIZE_FOLDER=fragment_size # folder containing Picards InsertSizeMetrics for every sample
CNV_FOLDER=ichorCNA_output # folder containing ichorCNA's .correctedDepth files for every sample
HG38_PATH=$(pwd)
HG38_SIZES_PATH=$(pwd)
HG38_MAPP_PATH=$(pwd)
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences
FIT_MODEL_SCRIPT=$(pwd/fit_model.py)

for bedpath in $BEDFILESLIST
do
    bedname=$(basename ${bedpath} .bed)
    echo "running on regionset $bedname......"
    mkdir -p $bedname
    cd $bedname
    cut -f1,2,3 $bedpath > ${bedname}.bed
    bedpath=$(pwd)/${bedname}.bed
    
    for bwfile in $BWFILES
        do
            samplename=$(basename ${bwfile} .bw)
            echo "running on sample $samplename.."
            
            bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${bedpath} --bigwigpath ${bwfile}  \
            --fraglenpath ${FRAGMENT_SIZE_FOLDER}/${samplename}_*.txt --cnvpath ${CNV_FOLDER}/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
            --hg38path ${HG38_PATH}/hg38.fa --mapppath ${HG38_MAPP_PATH}/hg38_mappability_75bp.bigwig --chromsizespath ${HG38_SIZES_PATH}/hg38.chrom.sizes \
            --binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 --classifier RF
    
    # to calculate the median sigma values (widths) of the gaussian functions over all samples:
    mkdir fitted_model
    cd fitted_model
    mkdir loose_params
    cd loose_params
    FITPARAMS=$(python $FIT_MODEL_SCRIPT $(pwd)/'../../*/corr_coverage_mean_per_bin.csv' 20000 500 150 | tail -n1)
    cd ..
    python $FIT_MODEL_SCRIPT $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 $FITPARAMS
    cp fitted_gaussians_sigmas*/parameter_summary_leastsq.csv ../$bedname.dipparams.csv
    cd ..
    
    cd ..

If you want to check GC bias before and after correction by LIQUORICE, you can run:

In [13]:
%%bash
python GCbias_before_after.py EwS_6_1/mastertable_w_corrections.csv.gz
(476935, 53)
Figure(640x480)
Figure(640x480)
In [9]:
Image(filename='EwS_6_1_GCcont_vs_coverage.png') # generated by code above
Out[9]:
In [10]:
Image(filename='EwS_6_1_GCcont_vs_corr_coverage.png') # generated by code above
Out[10]: