Figure 5 - Fragment coverage at regions-of-interest (using LIQUORICE)

In order to run LIQUORICE on our cfDNA data, we first need several different inputs:

a reference genome file
chromosome lengths of that reference genome
fragment size information from Picard's InsertSizeMetrics
CNA information from ichorCNA (.correcteddepth files)
mappability information
.bed files containing the sites of interest
.bw files which give fragment-based coverage information. In order to make hosting of these large files feasable, they only contain coverage information at the regions that are contained in the region-sets used in the paper. In this example, we will just analyze a single sample to speed things up.

Let's download them:

%%bash
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/fragment_size.tar.gz # as produced by Picard's CollectInsertSizeMetrics
tar -xzf fragment_size.tar.gz
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/ichorCNA_output.tar.gz
tar -xzf ichorCNA_output.tar.gz # as produced by ichorCNA, used as described in Fig. 3
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/hg38_mappability_75bp.bigwig.gz
gunzip hg38_mappability_75bp.bigwig.gz
samplename="EwS_6_1"
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/subsetted_bigwigs/${samplename}.bw

Note: ichorCNA's output was generated as shown in Figure 3, the .bigWig files were generated like so:

%%bash

# REQUIRES .BAM FILES, OUTPUT PROVIDED FOR DOWNLOAD ABOVE
mkdir -p bigwigs
for BAMFILE in $BAMLIST
do
    SAMPLENAME=$(basename ${BAMFILE} $BAMSUFFIX)
    echo "Running bamCoverage on $SAMPLENAME"
    bamCoverage -b $BAMFILE -o bigwigs/${SAMPLENAME}.bw \
    --binSize 1 -e --effectiveGenomeSize 2747877777 \
    --normalizeUsing RPGC --minMappingQuality 20 --ignoreForNormalization chrX chrM chrY
done

Now, we are ready to start LIQUORICE:

%%bash
samplename="EwS_6_1" # Just an example, you can choose any sample here
BEDPATH="../tables/ews_specific_DHSs.bed" # Just an example, can be replaced by other region-sets from Suppl. Table 6
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences

# Please be patient, the calculations will take a while..
bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${BEDPATH} --bigwigpath ${samplename}.bw  \
            --fraglenpath fragment_size/${samplename}_*.txt --cnvpath ichorCNA_output/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
            --hg38path hg38.fa --mapppath hg38_mappability_75bp.bigwig --chromsizespath hg38.chrom.sizes \
            --binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 \
            --classifier RF

# Let's also plot the output and fit the gaussian models:
mkdir -p fitted_model
cd fitted_model/                                                    # size of flanking region , binsize, average size of core region,
                                                                    # and width restrictions for the 3 Gaussian functions.
                                                                    # The numbers used here are the same as in the paper and represent the median widths over 
                                                                    # all samples
python ../fit_model.py $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 148 149 851 852 5927 5928

Running on EwS_6_1
Fitting done - see the output table and .png in fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928

LIQUORICE generates folders for each analyzed sample. In this folder, you can find .csv files containing the aggregated, binned coverage profile (both bias-corrected as well as uncorrected). You can also find a complete table with the coverage values per bin per regionset, along with the associated bias factors. A table with the scaled feature importances of the bias-correction model is also included. The model-based fitting procedure produces a table with a set of parameters for each sample; the graphical output of the fitting procedure looks like this (Example: EwS_6_1 coverage at EwS-specific DHSs)

from IPython.display import Image
Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_fitted_gaussians.png') # generated by code above

Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_normalized_coverage.png') # generated by code above

If you want to run LIQUORICE for multiple region-sets and multiple samples, you can do that like so:

%%bash
BEDFILESLIST="" # space-seperated list of region-sets
BWFILES="" # space-seperated list of .bigwig files to be analysed, or : "<folder containing bigwigs>/*.bw"
FRAGMENT_SIZE_FOLDER=fragment_size # folder containing Picards InsertSizeMetrics for every sample
CNV_FOLDER=ichorCNA_output # folder containing ichorCNA's .correctedDepth files for every sample
HG38_PATH=$(pwd)
HG38_SIZES_PATH=$(pwd)
HG38_MAPP_PATH=$(pwd)
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences
FIT_MODEL_SCRIPT=$(pwd/fit_model.py)

for bedpath in $BEDFILESLIST
do
    bedname=$(basename ${bedpath} .bed)
    echo "running on regionset $bedname......"
    mkdir -p $bedname
    cd $bedname
    cut -f1,2,3 $bedpath > ${bedname}.bed
    bedpath=$(pwd)/${bedname}.bed
    
    for bwfile in $BWFILES
        do
            samplename=$(basename ${bwfile} .bw)
            echo "running on sample $samplename.."
            
            bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${bedpath} --bigwigpath ${bwfile}  \
            --fraglenpath ${FRAGMENT_SIZE_FOLDER}/${samplename}_*.txt --cnvpath ${CNV_FOLDER}/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
            --hg38path ${HG38_PATH}/hg38.fa --mapppath ${HG38_MAPP_PATH}/hg38_mappability_75bp.bigwig --chromsizespath ${HG38_SIZES_PATH}/hg38.chrom.sizes \
            --binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 --classifier RF
    
    # to calculate the median sigma values (widths) of the gaussian functions over all samples:
    mkdir fitted_model
    cd fitted_model
    mkdir loose_params
    cd loose_params
    FITPARAMS=$(python $FIT_MODEL_SCRIPT $(pwd)/'../../*/corr_coverage_mean_per_bin.csv' 20000 500 150 | tail -n1)
    cd ..
    python $FIT_MODEL_SCRIPT $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 $FITPARAMS
    cp fitted_gaussians_sigmas*/parameter_summary_leastsq.csv ../$bedname.dipparams.csv
    cd ..
    
    cd ..

If you want to check GC bias before and after correction by LIQUORICE, you can run:

%%bash
python GCbias_before_after.py EwS_6_1/mastertable_w_corrections.csv.gz

(476935, 53)
Figure(640x480)
Figure(640x480)

Image(filename='EwS_6_1_GCcont_vs_coverage.png') # generated by code above

Image(filename='EwS_6_1_GCcont_vs_corr_coverage.png') # generated by code above