Figure 5 - Fragment coverage at regions-of-interest (using LIQUORICE)
In order to run LIQUORICE on our cfDNA data, we first need several different inputs:
Let's download them:
%%bash
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/fragment_size.tar.gz # as produced by Picard's CollectInsertSizeMetrics
tar -xzf fragment_size.tar.gz
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/ichorCNA_output.tar.gz
tar -xzf ichorCNA_output.tar.gz # as produced by ichorCNA, used as described in Fig. 3
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/hg38_mappability_75bp.bigwig.gz
gunzip hg38_mappability_75bp.bigwig.gz
samplename="EwS_6_1"
wget http://medical-epigenomics.org/papers/peneder2020_f17c4e3befc643ffbb31e69f43630748/data/subsetted_bigwigs/${samplename}.bw
Note: ichorCNA's output was generated as shown in Figure 3, the .bigWig files were generated like so:
%%bash
# REQUIRES .BAM FILES, OUTPUT PROVIDED FOR DOWNLOAD ABOVE
mkdir -p bigwigs
for BAMFILE in $BAMLIST
do
SAMPLENAME=$(basename ${BAMFILE} $BAMSUFFIX)
echo "Running bamCoverage on $SAMPLENAME"
bamCoverage -b $BAMFILE -o bigwigs/${SAMPLENAME}.bw \
--binSize 1 -e --effectiveGenomeSize 2747877777 \
--normalizeUsing RPGC --minMappingQuality 20 --ignoreForNormalization chrX chrM chrY
done
Now, we are ready to start LIQUORICE:
%%bash
samplename="EwS_6_1" # Just an example, you can choose any sample here
BEDPATH="../tables/ews_specific_DHSs.bed" # Just an example, can be replaced by other region-sets from Suppl. Table 6
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences
# Please be patient, the calculations will take a while..
bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${BEDPATH} --bigwigpath ${samplename}.bw \
--fraglenpath fragment_size/${samplename}_*.txt --cnvpath ichorCNA_output/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
--hg38path hg38.fa --mapppath hg38_mappability_75bp.bigwig --chromsizespath hg38.chrom.sizes \
--binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 \
--classifier RF
# Let's also plot the output and fit the gaussian models:
mkdir -p fitted_model
cd fitted_model/ # size of flanking region , binsize, average size of core region,
# and width restrictions for the 3 Gaussian functions.
# The numbers used here are the same as in the paper and represent the median widths over
# all samples
python ../fit_model.py $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 148 149 851 852 5927 5928
LIQUORICE generates folders for each analyzed sample. In this folder, you can find .csv files containing the aggregated, binned coverage profile (both bias-corrected as well as uncorrected). You can also find a complete table with the coverage values per bin per regionset, along with the associated bias factors. A table with the scaled feature importances of the bias-correction model is also included. The model-based fitting procedure produces a table with a set of parameters for each sample; the graphical output of the fitting procedure looks like this (Example: EwS_6_1 coverage at EwS-specific DHSs)
from IPython.display import Image
Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_fitted_gaussians.png') # generated by code above
Image(filename='fitted_model/fitted_gaussians_sigmas_148to149_851to852_5927to5928/EwS_6_1_normalized_coverage.png') # generated by code above
If you want to run LIQUORICE for multiple region-sets and multiple samples, you can do that like so:
%%bash
BEDFILESLIST="" # space-seperated list of region-sets
BWFILES="" # space-seperated list of .bigwig files to be analysed, or : "<folder containing bigwigs>/*.bw"
FRAGMENT_SIZE_FOLDER=fragment_size # folder containing Picards InsertSizeMetrics for every sample
CNV_FOLDER=ichorCNA_output # folder containing ichorCNA's .correctedDepth files for every sample
HG38_PATH=$(pwd)
HG38_SIZES_PATH=$(pwd)
HG38_MAPP_PATH=$(pwd)
NTHREADS=30 # set according to your preferences
MAXMEM="90G" # set according to your preferences
FIT_MODEL_SCRIPT=$(pwd/fit_model.py)
for bedpath in $BEDFILESLIST
do
bedname=$(basename ${bedpath} .bed)
echo "running on regionset $bedname......"
mkdir -p $bedname
cd $bedname
cut -f1,2,3 $bedpath > ${bedname}.bed
bedpath=$(pwd)/${bedname}.bed
for bwfile in $BWFILES
do
samplename=$(basename ${bwfile} .bw)
echo "running on sample $samplename.."
bash run_LIQUORICE_v0.3.1.sh --samplename ${samplename} --bedpath ${bedpath} --bigwigpath ${bwfile} \
--fraglenpath ${FRAGMENT_SIZE_FOLDER}/${samplename}_*.txt --cnvpath ${CNV_FOLDER}/ichorCNA_nonss_all_samples/${samplename}.correctedDepth.txt \
--hg38path ${HG38_PATH}/hg38.fa --mapppath ${HG38_MAPP_PATH}/hg38_mappability_75bp.bigwig --chromsizespath ${HG38_SIZES_PATH}/hg38.chrom.sizes \
--binsize 500 --extendto 20000 --readlength 100 --nthreads ${NTHREADS} --maxmem ${MAXMEM} --nroffragmentsdrawn=200 --classifier RF
# to calculate the median sigma values (widths) of the gaussian functions over all samples:
mkdir fitted_model
cd fitted_model
mkdir loose_params
cd loose_params
FITPARAMS=$(python $FIT_MODEL_SCRIPT $(pwd)/'../../*/corr_coverage_mean_per_bin.csv' 20000 500 150 | tail -n1)
cd ..
python $FIT_MODEL_SCRIPT $(pwd)/'../*/corr_coverage_mean_per_bin.csv' 20000 500 150 $FITPARAMS
cp fitted_gaussians_sigmas*/parameter_summary_leastsq.csv ../$bedname.dipparams.csv
cd ..
cd ..
If you want to check GC bias before and after correction by LIQUORICE, you can run:
%%bash
python GCbias_before_after.py EwS_6_1/mastertable_w_corrections.csv.gz
Image(filename='EwS_6_1_GCcont_vs_coverage.png') # generated by code above
Image(filename='EwS_6_1_GCcont_vs_corr_coverage.png') # generated by code above