<p>Blue Collar Bioinformatics: community built tools for biological data analysis, by Brad Chapman (http://bcb.io/feed.xml, updated 2019-02-11).</p>
<p>Post: http://bcb.io/2019/02/11/2017-03-25-heterogenity (2019-02-11)</p>
<h1 id="inputs">Inputs</h1>
<ul>
<li>Germline heterozygous SNPs, informative for purity/ploidy/clone estimation;
these need to be estimated for tumor-only samples</li>
<li>Copy number calls – GC corrected and normalized (to normal or process-matched
normal)</li>
<li>Split copy number calls into major/minor alleles, potentially with multiple
states</li>
<li>Somatic variant calls with allele frequencies, for tumor subclones</li>
<li>Estimate subclones from somatic calls + major/minor CNVs</li>
</ul>
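<p>These inputs combine through a standard expectation: the B-allele frequency (BAF) of a germline heterozygous SNP depends on tumor purity and the major/minor copy number of its segment. A minimal sketch of that relationship (function and variable names are illustrative, not taken from any particular caller):</p>

```python
def expected_baf(purity, major, minor):
    """Expected B-allele frequency of a germline heterozygous SNP.

    The tumor fraction contributes `minor` copies of the B allele out of
    `major + minor` total; the admixed diploid normal contributes 1 of 2.
    """
    tumor_b = purity * minor
    normal_b = (1 - purity) * 1
    total = purity * (major + minor) + (1 - purity) * 2
    return (tumor_b + normal_b) / total

# A copy-neutral heterozygous region stays balanced at 0.5 at any purity.
print(expected_baf(0.8, 1, 1))   # 0.5
# LOH (major=2, minor=0) in an 80% pure tumor shifts the BAF to 0.1.
print(expected_baf(0.8, 2, 0))   # 0.1
```

Purity/ploidy estimators fit purity and segment copy numbers so that the observed BAFs across segments match these expectations.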
<h2 id="challenges">Challenges</h2>
<ul>
<li>Heterogeneous input samples ranging from WGS tumor/normal to panel/capture
tumor-only, would like to have similar workflow to handle most cases</li>
<li>Lack of good truth sets, so it is hard to determine whether tools work well</li>
<li>Most tools not fully automated and require some decision making during the
process</li>
</ul>
<h2 id="example-figures">Example figures</h2>
<ul>
<li>
<p>Overview of problem
Figure 1: http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0602-8</p>
</li>
<li>
<p>sequencing levels required for reconstruction depending on clonal complexity
Figure 5: http://www.cell.com/action/showImagesData?pii=S2405-4712%2815%2900113-1
Figure 6: http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0602-8</p>
</li>
</ul>
<h1 id="tools">Tools</h1>
<h2 id="purity-cnv-allelic-copy-number">Purity, CNV allelic copy number</h2>
<p>PureCN – purity/ploidy, classify variants as germline/somatic clonal/subclonal,
panels/exome > 100x, support for no matching normals but needs process-matched normal BAM
http://bioconductor.org/packages/release/bioc/html/PureCN.html</p>
<p>BubbleTree: purity, LOH and subclonality WGS/exome, from heterozygous variants
and CNVs
http://www.bioconductor.org/packages/release/bioc/html/BubbleTree.html</p>
<p>TitanCNA – WGS/exome; heterozygous variants => purity, CNVs into major/minor
subclones, LOH
https://github.com/gavinha/TitanCNA</p>
<p>Battenberg – WGS tumor/normal; BAMs => purity, CNV caller into major/minor subclones
https://github.com/cancerit/cgpBattenberg</p>
<h2 id="subclonal-reconstruction">Subclonal reconstruction</h2>
<p>PhyloWGS – subclone and tumor evolution from Battenberg or TITAN output
(major/minor allele CN)+ VCFs
https://github.com/morrislab/phylowgs</p>
<p>SciClone – exome/WGS: somatic CNV calls + variants. Uses only variants in
CN=2 regions, so requires a relatively stable genome to have enough events.
https://github.com/genome/sciclone
https://github.com/hdng/clonevol</p>
<p>Canopy – exome/WGS: allele specific CNVs + variants. Recommend using Sequenza
as input segmented CNV calls.
https://github.com/yuchaojiang/Canopy
https://github.com/yuchaojiang/Canopy/blob/master/instruction/SNA_CNA_input.md</p>
<p>Guan Lab, U of M – Somatic variants and CNVs, SMC-Het winner but demonstration implementation only
https://www.synapse.org/#!Synapse:syn6087005/wiki/398911</p>
<p>THetA – integrated, CNVs only, academic only in latest version
https://github.com/raphael-group/THetA</p>
<p>PyClone – academic only
https://bitbucket.org/aroth85/pyclone/wiki/Home</p>
<h3 id="cnv-callers">CNV callers</h3>
<ul>
<li>CNVkit https://github.com/etal/cnvkit</li>
<li>Seq2C https://github.com/AstraZeneca-NGS/Seq2C</li>
<li>Canvas https://github.com/Illumina/Canvas</li>
</ul>
<h1 id="validation">Validation</h1>
<p>tHapMix simulation – WGS tumor/normal https://github.com/Illumina/tHapMix
ICGC-TCGA DREAM Tumor Heterogeneity Challenge (VCF + Battenberg stratified
output), no truth sets available
https://www.synapse.org/#!Synapse:syn2813581/wiki/303137</p>
<h2 id="remove-artifacts">Remove artifacts</h2>
<ul>
<li>small variants – Damage assessment</li>
<li>germline CNVs https://github.com/chapmanb/bcbio-nextgen/issues/963</li>
</ul>
<h2 id="cnv-benchmarking">CNV benchmarking</h2>
<p>HCC2218 truth set from Canvas https://github.com/Illumina/Canvas#demo-tumor-normal-enrichment-data
GiaB NA24385 CNVs http://biorxiv.org/content/early/2016/12/13/093526</p>
<h1 id="data-science-infrastructure-for-agricultural-sustainability-and-food-security">Data science infrastructure for agricultural sustainability and food security (2019-02-11)</h1>
<p>I’m excited to be joining <a href="https://www.ginkgobioworks.com/">Ginkgo Bioworks</a>.
Ginkgo and the synthetic biology community have an incredible amount of useful
data in intricate experimental designs, measured screening outcomes and
pre-existing biological knowledge. I’ll help organize, present, compute on and
do science with this data.</p>
<p>I hope to enable downstream applications that <a href="https://joynbio.com/">improve agricultural
sustainability and food security</a>. I’m <a href="https://twitter.com/eric_herring/status/1073982401364328450">motivated
to</a> help with
building an increasingly fair and <a href="https://twitter.com/BuildSoil/status/1081664237997481989">climate friendly agricultural
system</a> due to the
<a href="https://www.theguardian.com/environment/2018/oct/08/global-warming-must-not-exceed-15c-warns-landmark-un-report">dire warnings about the state of our
planet</a>.
There are many different ways to contribute on mitigating climate change, from
personal action to politics to your daily work; see
<a href="https://www.drawdown.org/">Drawdown</a>, <a href="https://sliced.org/">Sliced</a> and
<a href="https://www.mattermore.io/">Mattermore</a> for some comprehensive lists. Ginkgo
and their downstream applications provided a unique opportunity to make use of
my programming, genomics and synthetic biology background to work towards
<a href="https://blog.cabi.org/2019/02/05/new-goal-post-agriculture-nutrition/">healthy and sustainable food production</a>.</p>
<p>The incredibly difficult part of this change is moving away from the amazing
community of researchers I’ve had the privilege to work with.
<a href="http://bioinformatics.sph.harvard.edu/">Our incredible team at Harvard Chan School</a>
continues to teach me new things every day and makes supporting science a joy.
Rewarding collaborations with <a href="https://www.astrazeneca.com/our-focus-areas/oncology.html">AstraZeneca</a>,
<a href="https://research.unimelb.edu.au/centre-for-cancer-research/home">the University of Melbourne Centre for Cancer Research</a> and
<a href="https://www.veritasgenetics.com/">Veritas Genetics</a> provide a continuous
source of scientific challenges alongside the inspiration of seeing smart teams answer
difficult questions.</p>
<h1 id="data-science-infrastructure-in-synthetic-biology">Data science infrastructure in synthetic biology</h1>
<p>At Ginkgo my initial focus will be on providing infrastructure and scientific
support to enable application specific data science teams. The data challenges at
Ginkgo are similar to those faced across multiple companies and large
communities. They have large amounts of valuable process and analysis data in
custom internal formats, such that querying requires knowing multiple data
models. Additionally, there is high level biological knowledge about experiments
and biological systems that gets captured in ad-hoc ways and requires structure
to improve interoperability. The challenge will be to build the smallest and
leanest set of processes, tools and interfaces that organize this data to
increase scientific productivity.</p>
<p>My rough technical plan is semi-automated data organization. First, use
<a href="https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API">OpenRefine reconciliation
services</a>
through <a href="https://github.com/SciGraph/SciGraph#whats-included">SciGraph</a> to map
unstructured data into ontologies like <a href="http://sbolstandard.org/">SBOL</a>. With
this in place, store the data as RDF-like triples of entities with structured
attributes and values. Ideally this uses the time-based approach within
<a href="https://docs.datomic.com/cloud/whatis/data-model.html">Datomic</a> and models
high level entity relationships with <a href="https://github.com/luchiniatwork/hodur-engine">Hodur</a>.
Finally, expose these heterogeneous backends in a standardized way for external
access using a
<a href="https://graphql.org/">GraphQL</a>/<a href="https://github.com/walmartlabs/lacinia">Lacinia</a>
or custom Datalog interface. I’m open to any and all suggestions on alternative
approaches to explore. <a href="http://rgalen.com/agile-training-news/2017/10/15/agile-planning-getting-punched-every-day">Planning</a>
and <a href="https://quoteinvestigator.com/2013/10/20/no-predict/">predictions</a> are
always perilous in both science and computing.</p>
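<p>As a toy sketch of the entity/attribute/value triple model described above (all entity and attribute names here are hypothetical, and this in-memory list stands in for what Datomic provides natively):</p>

```python
# Store facts as (entity, attribute, value) triples so heterogeneous
# experiment records share one queryable shape.
triples = [
    ("strain-42", "organism", "S. cerevisiae"),
    ("strain-42", "parent", "strain-7"),
    ("run-9", "strain", "strain-42"),
    ("run-9", "titer-g-l", 1.8),
]

def query(triples, attribute):
    """All (entity, value) pairs carrying one attribute."""
    return [(e, v) for e, a, v in triples if a == attribute]

print(query(triples, "titer-g-l"))   # [('run-9', 1.8)]
```

New attributes can be added per record without schema migrations, which is the main appeal of the triple model for ad-hoc experimental metadata.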
<p>This work will provide a template and documentation of approaches for data cleaning
and organizing, along with practical tools to help with this process. The
outcome – modeled and organized data – will not only help scientists work more
effectively on Ginkgo’s synthetic biology platform, but also help coordinate
with other researchers and the general public. I want to work collaboratively
to connect and enable the diverse synthetic biology community through practical
data science. I’d love to discuss with anyone working on similar problems.</p>
<h1 id="future-of-bcbio">Future of bcbio</h1>
<p>The unfortunate downside of this change is moving away from personalized
medicine and cancer variant analysis. I spent a majority of my time over the
past eight years at Harvard working with the <a href="https://bcbio-nextgen.readthedocs.io/en/latest/">bcbio
community</a> on high throughput
sequencing analysis pipelines. bcbio supports a large number of researchers and
we’re committed to helping maintain and continue its development. <a href="http://bioinformatics.sph.harvard.edu/">Our team at
Harvard Chan</a> has active contributors
like <a href="https://twitter.com/RoryKirchner">Rory Kirchner</a> and <a href="https://github.com/lpantano">Lorena
Pantano</a> who plan to continue to develop bcbio.
We’re also hiring at the Harvard Chan core and would love to hear from bcbio
community developers with an interest in working more directly on it. Similarly,
if you depend on bcbio and want to collaborate please also <a href="https://github.com/chapmanb/">get in
touch</a>. I’ll continue to work part time on bcbio
over the next few months to improve documentation and stability and make
transitions as painless as possible.</p>
<p>I’m so appreciative to the bcbio and open bioinformatics community and all the
groups I’ve worked with, and welcome feedback and thoughts.</p>
<h1 id="developing-low-frequency-filters-for-cancer-variant-calling-using-vardict">Developing low frequency filters for cancer variant calling using VarDict (2016-04-04)</h1>
<h1 id="overview">Overview</h1>
<p>The underlying work that goes into preparing good filters for variant calling is
not always exposed to the biologists who use variants for making research and
clinical decisions. Filter sets like
<a href="https://www.broadinstitute.org/gatk/guide/article?id=2806">the GATK best practice pipeline hard filters</a>
<a href="http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/">perform well</a> and are
widely used, but lack detailed background on the underlying truth sets and
methods used to derive them.</p>
<p>As a result, some scientists treat variant calling and filtering as a solved
problem. This creates a disconnect between researchers who work on the
underlying algorithms and understand the filtering trade offs and imperfectness
of our current analysis tools, and users who take variants as inputs to
downstream analysis. Additionally, there is often room for improvement in
filters, even widely used ones. We recently
<a href="http://imgur.com/a/oHRVB">tweaked mapping quality filters for GATK</a> from the
defaults to provide an improvement in sensitivity without loss of precision.</p>
<p>To provide more clarity into work we do as part of
<a href="https://github.com/chapmanb/bcbio-nextgen">bcbio</a> development, this post describes
improved filters for
<a href="https://github.com/AstraZeneca-NGS/VarDictJava">VarDict tumor/normal cancer variant calling</a>.
VarDict (v1.4.5) is a freely available small variant (SNP and insertions/deletions) caller from
<a href="https://twitter.com/ZhongwuL">Zhongwu Lai</a> and
<a href="https://twitter.com/DrySci">Jonathan Dry’s</a> group in
<a href="http://www.astrazeneca.com/Home">AstraZeneca Oncology</a>. The new set of filters
use the
<a href="https://www.synapse.org/#!Synapse:syn312572/wiki/62018">synthetic 4 dataset from the ICGC-TCGA DREAM challenge</a>
truth set for validation. This dataset is more difficult than the
<a href="http://bcb.io/2015/03/05/cancerval/">synthetic 3 dataset used in previous validations</a>
since it represents a heterogeneous tumor with 20% normal contamination and two
lower frequency sub-clones.</p>
<p>The filters originally derived from the synthetic 3 dataset analysis result in
VarDict underperforming on SNP sensitivity and precision, relative to
<a href="https://www.broadinstitute.org/cancer/cga/mutect">MuTect (v1.1.5)</a>:</p>
<p><img src="http://i.imgur.com/Mn9DyZP.png" alt="Original sensitivity and precision" /></p>
<p>After adjustment with a new filtering strategy, VarDict is now on par with
MuTect for SNP sensitivity and precision, and better than
<a href="http://scalpel.sourceforge.net/">Scalpel (v0.5.1)</a> on sensitivity for
insertions and deletions. It also performs well in comparison with
<a href="http://dkoboldt.github.io/varscan/">VarScan (2.4.1)</a>,
<a href="https://github.com/ekg/freebayes">FreeBayes (v0.9.21-26)</a> and
<a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php">MuTect2 (3.5-0)</a>:</p>
<p><img src="http://i.imgur.com/jFW45NO.png" alt="Final sensitivity and precision" /></p>
<p>The improvement in VarDict comes from better filtering of lower frequency
mutations. Detecting these is critical to understanding tumor heterogeneity, but
is also more challenging. This is analogous to the way that filters for germline
calling often primarily benefit detection of heterozygote calls.</p>
<p>We’ll describe the new filters that led to the improvement, showing plots used
to identify the new linear combinations of filtering parameters.
Cross-validation with a second evaluation shows that these are general
improvements and provide similar benefits on other datasets. We’ll discuss the
visual approach used here versus machine learning methods, and how we hope to
avoid over-training and devise generally useful filters for detecting low
frequency variants.</p>
<h1 id="cross-validation">Cross validation</h1>
<p>To avoid over-optimizing to the DREAM datasets, we also compared caller accuracy
against a mixture of two <a href="http://genomeinabottle.org/">Genome in a Bottle</a>
datasets,
<a href="https://github.com/genome-in-a-bottle/giab_data_indexes">NA12878 and NA24385</a>,
prepared by researchers at the <a href="http://www.hartwigmedicalfoundation.nl/en/">Hartwig Medical Foundation</a>.
Using a mixture of the two samples as the tumor and NA24385 alone as a normal
allows
<a href="https://github.com/hbc/projects/tree/master/giab_somatic">identification of NA12878 variants as somatic calls</a>.
For a mixture of the two samples with 30% NA12878 and 70% NA24385, the expected
allele frequencies are 15% for NA12878 heterozygotes not present in NA24385 and
30% for NA12878 homozygotes not in NA24385. While not a real tumor, this
approach exercises many of the features of somatic callers.</p>
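<p>The expected frequencies follow directly from the mixture fraction; a quick sanity check of the arithmetic:</p>

```python
def expected_somatic_af(mixture_fraction, genotype_af):
    """Expected allele frequency in the mixture for an NA12878-only variant.

    genotype_af is 0.5 for heterozygotes and 1.0 for homozygotes; NA24385,
    used as the pseudo-normal, contributes no alternate alleles.
    """
    return mixture_fraction * genotype_af

print(expected_somatic_af(0.3, 0.5))  # 0.15 -- NA12878 heterozygote
print(expected_somatic_af(0.3, 1.0))  # 0.3  -- NA12878 homozygote
```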
<p>Our VarDict filter improvements carry over to this dataset and provide good
resolution of both SNPs and Indels relative to the other callers:</p>
<p><img src="http://i.imgur.com/hPoFH14.png" alt="GiaB mixture cross-validation" /></p>
<p>While most callers have similar patterns of sensitivity and specificity in both
the DREAM synthetic 4 and NA12878/NA24385 evaluations, MuTect2 (3.5-0) behaves
quite differently between the two. MuTect2 had the best resolution of SNPs and
Indels in the DREAM dataset, but performs the worst in the Genome in a Bottle
mixture validation. This suggests that MuTect2, which is still under development
at the <a href="https://www.broadinstitute.org/gatk/">Broad Institute</a> and released as
beta software for research use only, may currently be over optimized to features
of the DREAM datasets.</p>
<h1 id="filter-development">Filter development</h1>
<p>The key insight for improving VarDict sensitivity was that providing a more
lenient p-value filter (<code class="highlighter-rouge">-P</code>) to VarDict results in
<a href="http://i.imgur.com/LZ3eCop.png">detection of many of the initially missing variants</a>.
However, setting a lenient p-value filter also results in poor precision.
By using this as a starting point and introducing two new filters for low
frequency variants, we were able to recover good precision without losing the
sensitivity gains.</p>
<h2 id="low-frequency-with-low-depth">Low frequency with low depth</h2>
<p>The developed filter provides better discrimination of low frequency variants
with low depth, starting with the
<a href="http://www.ncbi.nlm.nih.gov/pubmed/23773188">the recommendation of 13X coverage to accurately detect heterozygote calls</a>.
Generalizing heterozygote allele frequency (0.5) and depth (13) to any allele
frequency gives us a recommended coverage where variant allele frequency times
depth at a position is six or more. We apply the new filters only in lower
frequency (AF) and lower depth (DP) calls that fail to meet this threshold.
Within this set of calls, we filter variants failing any of three criteria: a
combination of low mapping quality (MQ) and number of read mismatches (NM), low
depth (DP), or low quality (QUAL). In pseudo-code, the new filter is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((AF * DP < 6) &&
((MQ < 55.0 && NM > 1.0) ||
(MQ < 60.0 && NM > 2.0) ||
(DP < 10) ||
(QUAL < 45)))
</code></pre></div></div>
<p>Breaking this down into each component:</p>
<ul>
<li>
<p>Remove variants supported by reads with low mapping quality and multiple
mismatches. Poorly mapped reads with multiple errors contribute to
false positive low frequency calls. VarDict reports two useful metrics:
the mean number of mismatches (NM) and mean mapping quality (MQ) for the reads
covering a position in the genome. For bwa mapping qualities, we filter calls
with (MQ < 55.0 and NM > 1.0 or MQ < 60.0 and NM > 2.0). Plotting the
relationship between mapping quality (MQ) and number of mismatches (NM) shows
the distributions of these metrics. In true positives they cluster
near the edges of high mapping quality and a single mismatch:</p>
<p><img src="http://i.imgur.com/lzPFRaA.png" alt="AFDP: mq nm true" /></p>
<p>In false positives there is a more even distribution of the
values amongst the ranges of mapping quality and mismatches:</p>
<p><img src="http://i.imgur.com/Aus76Z7.png" alt="AFDP: mq nm false" /></p>
<p>This filter is currently tuned to the
<a href="https://github.com/lh3/bwa">bwa aligner (v0.7.12)</a>, which uses 60 as the
maximum mapping quality. In bcbio, we only use this filter when using the bwa
aligner, but we’d also like to generalize by testing with other alignment
methods and associating poor mapping quality scores with genome mappability
tracks.</p>
</li>
<li>
<p>Low depth (DP < 10) – Low depth calls, coupled with AF * DP < 6, contain a
disproportionate number of false positives.</p>
</li>
</ul>
<p><img src="http://i.imgur.com/OipIRil.png" alt="AFDP: low depth" /></p>
<ul>
<li>Low quality (QUAL < 45) – Similarly, low quality values correlate with false
positive calls.</li>
</ul>
<p><img src="http://i.imgur.com/W63W872.png" alt="AFDP: low quality" /></p>
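<p>Putting the components above together, a runnable Python version of the low frequency/low depth filter (a sketch with VarDict's annotations passed as plain numbers; extracting them from the VCF is left out):</p>

```python
def fails_low_af_low_depth(af, dp, mq, nm, qual):
    """True when a call fails the low frequency/low depth filter.

    af: allele frequency, dp: depth, mq: mean mapping quality,
    nm: mean number of read mismatches, qual: call quality.
    """
    if af * dp >= 6:  # enough supporting observations: leave the call alone
        return False
    return ((mq < 55.0 and nm > 1.0)
            or (mq < 60.0 and nm > 2.0)
            or dp < 10
            or qual < 45)

# A 5% frequency call at 80x on poorly mapped, mismatch-heavy reads fails.
print(fails_low_af_low_depth(af=0.05, dp=80, mq=50.0, nm=2.5, qual=60))  # True
# The same call supported by clean, well-mapped reads passes.
print(fails_low_af_low_depth(af=0.05, dp=80, mq=60.0, nm=1.0, qual=60))  # False
```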
<h2 id="low-frequency-with-poor-quality">Low frequency with poor quality</h2>
<p>The first set of filters improved precision but still retained 3x more false
positives compared with MuTect. To improve precision further, we looked at the
remaining true and false positive calls and found false positives had a
combination of low frequency (AF), low quality (QUAL) and high p-values (SSF).
The pseudo-code for this filter is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AF < 0.2 && QUAL < 55 && SSF > 0.06
</code></pre></div></div>
<p>Breaking down the distribution differences for each of these filters:</p>
<ul>
<li>Allele frequency < 0.2</li>
</ul>
<p><img src="http://i.imgur.com/1AeuVoc.png" alt="low freq" /></p>
<ul>
<li>Quality < 55</li>
</ul>
<p><img src="http://i.imgur.com/H2G5OYM.png" alt="low freq quality" /></p>
<ul>
<li>P-value (SSF) > 0.06</li>
</ul>
<p><img src="http://i.imgur.com/XhlBQCu.png" alt="low freq p-value" /></p>
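<p>As with the first filter, a runnable sketch of this second filter, again with the annotations passed as plain numbers rather than parsed from the VCF:</p>

```python
def fails_low_freq_poor_quality(af, qual, ssf):
    """True when a low frequency call also has low quality and a weak
    somatic p-value (VarDict's SSF annotation)."""
    return af < 0.2 and qual < 55 and ssf > 0.06

# Low frequency, low quality and a weak p-value together: filtered.
print(fails_low_freq_poor_quality(af=0.1, qual=40, ssf=0.2))   # True
# A strong somatic p-value rescues an otherwise marginal call.
print(fails_low_freq_poor_quality(af=0.1, qual=40, ssf=0.01))  # False
```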
<h1 id="filter-development-1">Filter development methodology</h1>
<p>We identified these filters by manual inspection of metrics graphs. We
run automated calling and validation with bcbio, then plot metrics and look for
differences that discriminate true and false positive calls. Practically, we
extract metrics from VCF files into tables using
<a href="https://github.com/pjotrp/bioruby-vcf">bio-vcf</a>, then plot with
<a href="http://pandas.pydata.org/">pandas</a>,
<a href="http://stanford.edu/~mwaskom/software/seaborn/index.html">seaborn</a> and
<a href="http://matplotlib.org/">matplotlib</a>. Since the base set of calls already
optimize single metric filters, we start with linear combinations of metrics,
subsetting and re-plotting as we identify sets of filters that do a good job of
separating signal and noise.</p>
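<p>A minimal sketch of that exploration step, assuming the metrics have already been exported from the VCF into a table (the values below are invented for illustration):</p>

```python
import pandas as pd

# One row per call, with metrics exported from the VCF (e.g. via bio-vcf)
# and a truth status assigned from the validation run.
calls = pd.DataFrame({
    "MQ": [60.0, 58.5, 44.0, 52.0],
    "NM": [1.0, 1.2, 3.5, 2.8],
    "status": ["tp", "tp", "fp", "fp"],
})

# Group by truth status and summarize: metrics whose distributions separate
# (here MQ and NM) are candidates for linear combination filters, which we
# then inspect visually with seaborn/matplotlib scatter plots.
summary = calls.groupby("status")[["MQ", "NM"]].median()
print(summary)
```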
<p>The updated filters for VarDict improve sensitivity and precision on low
frequency variants. SNP call rates are on par with MuTect, and VarDict is more
sensitive and precise than Scalpel and other callers for insertions and
deletions.</p>
<p>Our goal with using the DREAM synthetic truth sets is to discover generally
useful metrics, rather than performing well on this particular dataset. To avoid
over-training we use our best intuition to select metrics based on depth and
quality which will apply generally across other samples. By cross-validating
against a second dataset, a mixture of two Genome in a Bottle samples, we show
that the results do generalize. Ideally, we’d also include a panel of additional
test sets, including real tumors. A
<a href="http://www.cell.com/cell-systems/abstract/S2405-4712%2815%2900113-1">recent paper from Malachi Griffith, Chris Miller and collaborators at the McDonnell Genome Institute</a>
has an <a href="http://aml31.genome.wustl.edu/">extensively characterized AML31 case</a>
with validated variants. The ICGC also published
<a href="http://www.nature.com/ncomms/2015/151209/ncomms10001/full/ncomms10001.html">a well characterized set of validated calls on a high depth medulloblastoma sample</a>.
We plan to cross-validate on these datasets to continue
to improve filtering approaches for heterogeneous, low frequency variants.</p>
<p>This approach is analogous to applying machine learning tools to discriminate
true and false positives on the training set, and we’d like to incorporate
additional automated approaches as we continue to improve filters. In the past
we’ve used
<a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines (SVMs)</a>
to learn and apply filtering metrics. This approach worked well, but tended to
over-train to specific datasets. Because it was difficult to extract the
underlying relationships in the filter, we didn’t have a way to reliably
estimate when a filter was general versus over-trained. We’d ideally like to use
a hybrid approach where we use machine learning to identify potentially useful
linear combinations of metrics, then refine these with manual inspection.
We’d love feedback from others with experience doing this type of
filtering, and would especially welcome suggestions for practical methods for
applying machine learning methods to the initial metric selection steps.</p>
<p>This filter improvement methodology is general and can apply to other callers,
although it’s not always easy to derive metrics that will effectively separate
true and false calls. For instance, we also wanted to improve MuTect and Scalpel
calling on this dataset. MuTect doesn’t offer quality or other annotation
metrics to provide a way to build additional filters beyond their implemented
heuristics. Scalpel does provide additional metrics and has additional true
positive calls filtered by the default calling approach, so we could improve in
sensitivity. However,
<a href="http://imgur.com/a/7Dzd3">we were unable to improve sensitivity without a large decrease in precision</a>.
Unfortunately improving callers is still often a laborious process done on a
case-by-case basis. We hope to continue to improve the process by automating
validation work and incorporating
<a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#validation">a large number of diverse validation sets</a>.</p>
<h1 id="running-the-analysis">Running the analysis</h1>
<p>Full details for running this comparison are available in
<a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal">the bcbio example pipeline documentation</a>.
The DREAM synthetic 4 dataset requires
<a href="https://www.synapse.org/#!Synapse:syn312572/wiki/60702">obtaining access through ICGC-DACO</a>
since the input data comes from real human tumors. It does challenge variant
callers more than the unrestricted synthetic dataset 3 due to the added
heterogeneity and lower frequency variants.</p>
<p>Once you have access to the input data, bcbio automatically runs alignment,
variant calling and validation, producing the validation graphs shown above. The
synthetic 4 example includes alignments for both build 37 and 38 as well as
structural variant calling, so is a full tumor characterization you can use as a
template for calling on your cancer samples. By combining validation and
analysis tools we provide an integrated approach that both improves algorithms
and runs real research samples.</p>
<h1 id="validating-small-rna-analysis-with-mirqc">Validating small RNA analysis with miRQC (2016-03-24)</h1>
<h1 id="small-rna-seq-with-bcbio-nextgen">Small RNA-seq with bcbio-nextgen</h1>
<p>The study of small RNA helps to understand part of the gene regulation of a
cell. There are different types of small RNAs, the most important in mammals are
miRNA, tRNA fragments and piRNAs.
An advantage of small RNA-seq analysis is that we can study all small RNA types
simultaneously, with the possibility to also detect novel small RNAs.
<a href="http://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> is a Python framework
supported by a big scientific community that implements best practices for
next-generation sequencing data and uses gold standard data to validate its
analyses. It is well known for its variant calling and RNA-seq pipeline. Now
bcbio has a small RNA-seq pipeline that handles quality control, adapter removal
of fastq files, annotation of miRNA, isomiRs and tRNAs, and genome-wide
characterization of other types of small RNAs. Here, I show the capabilities of
the pipeline and its validation using data from the
<a href="http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html">miRQC project</a>.</p>
<ul>
<li><a href="http://seqcluster.readthedocs.org/example_pipeline.html">reproducible code</a></li>
<li><a href="https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.rmd">R code</a></li>
</ul>
<h2 id="mirqc-project">miRQC project</h2>
<p>miRQC provides samples with known relative amounts of small RNAs, enabling
comparison of quantitation and detection of miRNAs. The main goal was to test
different platforms for miRNA detection, but these are also great samples for
benchmarking tools.</p>
<p>The samples in miRQC are: one Universal Human miRNA reference RNA (Agilent
Technologies, #750700), one human brain total RNA (Life Technologies, #AM6050),
several human liver total RNA (LifeTechnologies, #AM7960) and several MS2-phage
RNA (Roche, #10165948001). Moreover, two more samples were created using
different concentrations of UHmiRR and HBR. And finally, two miRNA families were
spiked in human liver and MS2-phage samples.</p>
<p><img src="https://github.com/lpantano/mypubs/raw/master/srnaseq/mirqc/figure/nmeth.3014-F1.jpg" alt="samples" /></p>
<h2 id="pipeline">Pipeline</h2>
<p>There are five main steps in the small RNA-seq pipeline:</p>
<ul>
<li>adapter removal</li>
<li>miRNA annotation</li>
<li>de-novo miRNA detection</li>
<li>other small RNAs detection</li>
<li>quality control metrics</li>
</ul>
<h3 id="adapter-removal">Adapter removal</h3>
<p>We integrated <a href="http://cutadapt.readthedocs.org/en/latest/guide.html">cutadapt</a>
allowing a minimum read size of 17 nts and removing the adapter if it is at
least 8 nts long.</p>
<h3 id="mirna-annotation">miRNA annotation</h3>
<p><code class="highlighter-rouge">bcbio</code> uses
<a href="http://seqcluster.readthedocs.org/mirna_annotation.html">seqbuster</a> for this
step. It has been used by
<a href="http://www.nature.com/nature/journal/v501/n7468/full/nature12531.html">GEUVADIS consortium</a>
for miRNA quantification and allows
<a href="https://en.wikipedia.org/wiki/IsomiR">isomiRs</a> analysis as well. Read more
about why <a href="http://seqcluster.readthedocs.org/literature.html">isomiRs are important</a>.</p>
<p>Although not covered here, it is important to mention the pipeline uses <a href="https://www.mdc-berlin.de/8551903/en/">miRDeep2</a> for de-novo miRNA discovery. In this case, it uses all samples for this analysis and then seqbuster for quantification of only the novel miRNAs.</p>
<h3 id="other-smallrnas-detection">Other smallRNAs detection</h3>
<p><code class="highlighter-rouge">bcbio</code> uses <a href="http://github.com/lpantano/seqcluster">seqcluster</a> to detect
unique units of transcription over the genome, allowing resolution of small
RNAs found in multiple genomic locations. Normally
these small RNAs are dropped because they map multiple times on the genome and
require special analysis to avoid bias in the quantification. Read more about
why
<a href="http://seqcluster.readthedocs.org/literature.html">other small RNAs are important</a>.
This tool produces a sqlite3 database for visualization; an example image is <a href="http://seqcluster.readthedocs.org/more_outputs.html#report">here</a>.</p>
<h3 id="quality-control-metrics">Quality control metrics</h3>
<p><code class="highlighter-rouge">bcbio</code> summarizes <code class="highlighter-rouge">Fastqc</code> metrics for each sample. Together with different
metrics from the previous steps, the user has an idea of the quality of the
samples and the overall project. It includes <code class="highlighter-rouge">fastqc</code> results, size
distribution after adapter removal and amount of small RNAs mapped to miRNAs,
tRNA, rRNA and repeats, among others. Other metrics, such as the amount of data
retained through the analysis or warning flags when the data is noisy, are provided by
<code class="highlighter-rouge">seqcluster</code> and included in the final R markdown <a href="https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.md">template report</a>.</p>
<h3 id="reporting">Reporting</h3>
<p><code class="highlighter-rouge">bcbio</code> generates a
<a href="https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.md">R markdown template report</a>
to visualize results from each of the steps. It
is inside the <code class="highlighter-rouge">report</code> folder in the working directory or final folder after the analysis.</p>
<h2 id="results">Results</h2>
<p>The miRQC samples allow us to measure quantification and detection
accuracy for specific miRs with the tools integrated in <code class="highlighter-rouge">bcbio</code>.</p>
<h3 id="size-distribution">Size distribution</h3>
<p>The size distribution gives a first estimate of the quality of your data. In a typical small
RNA sample we should see a peak at 22/23 nt and perhaps another at 26 or 31 nt,
depending on the biology of the samples.</p>
<p><img src="https://github.com/lpantano/mypubs/raw/master/srnaseq/mirqc/figure/adapter-2.png" alt="size-distribution" /></p>
<blockquote>
<p><strong>note</strong>: there may be specific cases where this assumption is not true, but
it applies to the majority of projects.</p>
</blockquote>
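<p>As a minimal sketch of the idea (hypothetical code, not part of the pipeline), the size distribution can be computed directly from an adapter-trimmed FASTQ:</p>

```python
from collections import Counter

def read_length_distribution(fastq_lines):
    """Count read lengths from FASTQ records (4 lines per record)."""
    sizes = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each record
            sizes[len(line.strip())] += 1
    return sizes

# Tiny in-memory example; a real run would stream the trimmed FASTQ file.
fastq = """\
@r1
TGAGGTAGTAGGTTGTATAGTT
+
IIIIIIIIIIIIIIIIIIIIII
@r2
TGAGGTAGTAGGTTGTATAGTT
+
IIIIIIIIIIIIIIIIIIIIII
@r3
TACCCTGTAGATCCGAATTTGTG
+
IIIIIIIIIIIIIIIIIIIIIII
""".splitlines()

dist = read_length_distribution(fastq)  # {22: 2, 23: 1}
```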
<h3 id="mirna-abundance-detection-of-mirqc-samples">miRNA abundance detection of miRQC samples</h3>
<p>Four samples help validate the quantification analysis:</p>
<ul>
<li>A: universal human RNA sample</li>
<li>B: human brain sample</li>
<li>C: 25% of A + 75% of B</li>
<li>D: 25% of B + 75% of A</li>
</ul>
<p>So we can assume:</p>
<ul>
<li>If A > B then A > D > C > B</li>
<li>If B > A then A < D < C < B</li>
</ul>
<p>Note that the C/D samples are swapped between the paper and the GEO website. The text from the paper says:</p>
<blockquote>
<p>These samples (termed miRQC A–D) consist of 100% Universal Human miRNA
Reference RNA (UHmiRR; A), 100% human brain RNA (HBR; B) and two titrations
thereof (C = 0.75A + 0.25B and D = 0.25A + 0.75B).</p>
</blockquote>
<p>While the text on the GEO website says:</p>
<blockquote>
<p>Source name miRQC C
Organism Homo sapiens
Characteristics biomaterial: Mixture of 25% UHRR and 75% HuBr Total RNA</p>
</blockquote>
<blockquote>
<p>Source name miRQC D
Organism Homo sapiens
Characteristics biomaterial: Mixture of 75% UHRR and 25% HuBr Total RNA</p>
</blockquote>
<p>After the analysis with <code class="highlighter-rouge">bcbio</code>, we can count the miRs that
follow the relative abundance rule. To measure this, I took the average of
the replicates and kept only the miRs with a median > 5 counts after
upper-quartile normalization across samples.</p>
<p>There are 111 miRNAs where A > B, and all of them follow A > D > C.</p>
<p>There are 181 miRNAs where B > A, and 174 of them follow B > C > D.</p>
<p>That is more than 95% accuracy for miRs with more than 5 counts.</p>
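<p>The check can be sketched as follows (a simplified, hypothetical illustration; replicate averaging and upper-quartile normalization are assumed to have been done upstream):</p>

```python
def follows_titration_rule(counts):
    """Check whether a miR's averaged counts follow the expected
    titration ordering: if A > B then A > D > C > B; if B > A then
    B > C > D > A."""
    a, b, c, d = counts["A"], counts["B"], counts["C"], counts["D"]
    if a > b:
        return a > d > c > b
    return b > c > d > a

# Hypothetical normalized counts for two miRs
mirs = {
    "miR-x": {"A": 120.0, "B": 10.0, "C": 40.0, "D": 90.0},
    "miR-y": {"A": 15.0, "B": 200.0, "C": 150.0, "D": 60.0},
}
passing = {name: follows_titration_rule(c) for name, c in mirs.items()}
```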
<h3 id="specificity">Specificity</h3>
<p>To evaluate specificity we used samples spiked with specific miRNAs that are not
normally expressed in the source material. These samples were analyzed in a different
<a href="https://github.com/lpantano/seqcluster/blob/master/data/pipeline_example/mirqc/non_mirqc_bcbio.csv">run</a>.</p>
<p>This analysis uses samples from 4 human liver RNAs and 4 MS2-phage RNAs with 8 synthetic miRNAs from two families (miR-302a/b/c/d and let-7a/b/c/d).</p>
<p>We should only see those miRNAs (let-7 and miR-302 families) in those samples.
Only sequences that appear > 2 times in the fastq files are considered.</p>
<p><img src="https://raw.githubusercontent.com/lpantano/mypubs/master/srnaseq/mirqc/figure/detection-serum-liver-1.png" alt="heatmap-mir302" /></p>
<p><img src="https://raw.githubusercontent.com/lpantano/mypubs/master/srnaseq/mirqc/figure/detection-serum-ms2-1.png" alt="heatmap-mirlet7" /></p>
<p>hsa-let-7a appears in all samples. If you check the processed data, this is not due to
incorrect alignments. It would be interesting to figure out whether this signal is
due to contamination, or to errors during sequencing and/or amplification.</p>
<h3 id="cluster-abundance-detection-of-mirqc-samples">Cluster abundance detection of miRQC samples</h3>
<p>As with miRs, we can validate the other small RNAs detected by
<code class="highlighter-rouge">seqcluster</code>. The results were very similar:</p>
<ul>
<li>147 clusters have A > B, of which 139 (75 known miRNAs) follow D > C</li>
<li>230 clusters have B > A, of which 222 (129 known miRNAs) follow C > D</li>
</ul>
<h2 id="timing-and-resources">Timing and Resources</h2>
<p>The running time for 8 samples with 6 million reads each was 3 hours and 19 minutes.</p>
<table>
<thead>
<tr>
<th>Step</th>
<th style="text-align: center">Time (total 3:19)</th>
<th style="text-align: center">Cores</th>
<th style="text-align: center">Memory (GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>organize samples</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td>trimming &amp; miRNA</td>
<td style="text-align: center">0:21</td>
<td style="text-align: center">8</td>
<td style="text-align: center">20</td>
</tr>
<tr>
<td>prepare</td>
<td style="text-align: center">0:01</td>
<td style="text-align: center">1</td>
<td style="text-align: center">8</td>
</tr>
<tr>
<td>alignment</td>
<td style="text-align: center">0:07</td>
<td style="text-align: center">6</td>
<td style="text-align: center">42.1</td>
</tr>
<tr>
<td>cluster</td>
<td style="text-align: center">2:49</td>
<td style="text-align: center">1</td>
<td style="text-align: center">8</td>
</tr>
<tr>
<td>quality control</td>
<td style="text-align: center">0:01</td>
<td style="text-align: center">8</td>
<td style="text-align: center">20</td>
</tr>
<tr>
<td>report</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
</tr>
</tbody>
</table>
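<p>As a quick sanity check, the per-step timings in the table do sum to the reported total:</p>

```python
def to_minutes(timing):
    """Parse an 'H:MM' timing (or a bare '0') into minutes."""
    if ":" not in timing:
        return int(timing) * 60
    hours, minutes = timing.split(":")
    return int(hours) * 60 + int(minutes)

# Step timings from the table above
steps = ["0", "0:21", "0:01", "0:07", "2:49", "0:01", "0"]
total = sum(to_minutes(s) for s in steps)
hours, minutes = divmod(total, 60)  # 3 hours, 19 minutes
```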
<h2 id="conclusion">Conclusion</h2>
<p>We conclude that the current analysis provides reliable quantification and
specificity for miRNAs and other small RNA molecules. What’s more, it helps with
downstream analysis by creating a complete R markdown template that covers the
most important sections of small RNA studies.</p>
<p>I am currently implementing post-processing of the
<a href="https://github.com/sararselitsky/tDRmapper">tDRmapper</a> (tRNA analysis)
output to enable easy differential expression or clustering analyses
downstream. In the future, I would like to integrate
<a href="http://sourceforge.net/p/protrac/home/Home/">proTRAC</a> for the analysis of
piRNAs.</p>
<h1 id="thanks">Thanks</h1>
<ul>
<li><a href="http://bioinformatics.sph.harvard.edu">Harvard T.H. Chan School of Public Health</a>
for supporting the integration of the small RNA-seq pipeline into
<a href="http://github.com/chapman/bcbio-nextgen">bcbio</a>. Special thanks to @roryk and
@chapmanb.</li>
<li><a href="https://rc.hms.harvard.edu/#people">Research Computing at Harvard Medical School</a>:
Chris Botka, Director of Research Computing and all the people in the team.</li>
<li>Special thanks to the authors for making their validation data available. I
encourage use of this data for any tool that analyzes small RNAs.</li>
</ul>Lorena PantanoSmall RNA-seq with bcbio-nextgenValidated variant calling with human genome build 382015-09-17T08:53:00+00:002015-09-17T08:53:00+00:00http://bcb.io/2015/09/17/hg38-validation<p>This post describes a ready to run
<a href="https://github.com/chapmanb/bcbio-nextgen">bcbio implementation</a> of variant
calling and validation on human genome build 38, demonstrating its utility for
improving variant detection.</p>
<p><a href="http://genomeref.blogspot.com/2013/12/announcing-grch38.html">Human genome build 38</a> (hg38, GRCh38) offers a major upgrade over the previous
build, 37 (hg19, GRCh37). The new genome reflects our increased understanding of
the heterogeneity within human sub-populations and contains a large number of
alternative genomic loci that better capture our knowledge of genome structure.
Better genomic representation improves mapping, avoiding a source of hard to
remove false positives during variant calling, and moves towards more accurate
graph-based representations of the genome.</p>
<p>Practically, the community has been slow to move to build 38 in the year and a
half since its official release. There was a
<a href="https://www.reddit.com/r/genome/comments/3b3s3t/switch_from_hg19build37_to_hg20build38/">nice discussion in the bioinformatics subreddit</a>
about adoption challenges. Our experience has been that researchers want the
improvements in 38 but two areas hold them back:</p>
<ul>
<li>
<p>Lack of validated truth sets for build 38. We make extensive use of the
NA12878 truth sets from
<a href="http://genomeinabottle.org/">the Genome in a Bottle consortium</a> and
<a href="http://www.illumina.com/platinumgenomes/">Illumina’s platinum genomes</a> for
germline calling, and the
<a href="https://www.synapse.org/#!Synapse:syn312572/wiki/">ICGC-TCGA DREAM challenge</a>
for cancer calling. Not having these ground truths for build 38 makes it
difficult to validate and improve analysis tools.</p>
</li>
<li>
<p>Lack of resources for build 38. Tools like
<a href="http://gemini.readthedocs.org/en/latest/">GEMINI</a> provide a large number of
<a href="http://gemini.readthedocs.org/en/latest/content/database_schema.html#the-variants-table">external annotations associated with variations</a>,
many of which are not yet prepared for build 38. Over time, these resources
will become increasingly available, but we need a stopgap solution to
convert 37-only resources to build 38.
<a href="http://genome.ucsc.edu/cgi-bin/hgLiftOver">liftOver</a>,
<a href="http://crossmap.sourceforge.net/">CrossMap</a> and
<a href="http://www.ncbi.nlm.nih.gov/genome/tools/remap">NCBI remap</a> enable
coordinate conversion, but we lack validations of the positives and
negatives of each method.</p>
</li>
</ul>
<p>Both of these transition challenges result in fewer tested and validated tools
for 38, which makes migration for new projects challenging. The goal of this
post is to enable use of build 38 by:</p>
<ul>
<li>
<p>Providing support for variant calling on build 38 in
<a href="https://github.com/chapmanb/bcbio-nextgen">bcbio</a>. The
<a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/installation.html">bcbio automated installation</a>
now includes two build 38 genome targets: hg38 –
<a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/">the full 38 build from 1000 genomes with alternative contigs, including HLA</a>;
and hg38-noalt –
<a href="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/">build 38 with no alternative loci, from NCBI</a>.</p>
</li>
<li>
<p>Validating build 38 variant calling against two truth sets:
<a href="http://genomeinabottle.org/">Genome in a Bottle NA12878 build 37 calls</a>
converted to 38 coordinates, and
<a href="http://www.illumina.com/platinumgenomes/">an Illumina Platinum Genome NA12878 truth set</a>
natively prepared with alignment and calling using the build 38 primary
assembly. The native Platinum Genomes truth set avoids the complications of
converting Genome in a Bottle coordinates, at the expense of not having
orthogonal validation of variants from non-Illumina technologies.</p>
</li>
<li>
<p>Comparing coordinate conversion methods. We used
<a href="http://www.ncbi.nlm.nih.gov/genome/tools/remap">NCBI’s Remap</a> and
<a href="http://crossmap.sourceforge.net/">CrossMap</a> with a
<a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/">UCSC chain file</a>
to convert the Genome in a Bottle reference prior to validating, allowing
comparison of the effectiveness of both methods.</p>
</li>
</ul>
<p>The results show improved variant detection for build 38 compared
with 37, along with a reduction in false positive calls using the full genome
build with alternative contigs.</p>
<p>Both coordinate conversion methods are a reasonable alternative when full
realignment and reanalysis is impractical. Validation results with NCBI Remap
and CrossMap are comparable to results using fully reanalyzed reference genomes.
The remaining challenges are in the last 1% of difficult to convert variants.</p>
<p>This work is possible thanks to our collaborations with the
<a href="http://www.gla.ac.uk/researchinstitutes/cancersciences/ics/facilities/wwcrc/">Wolfson Wohl Cancer Research Centre</a>
and <a href="http://www.astrazeneca.com/Home">AstraZeneca Oncology</a>. It is also due to
help from members of the bcbio community, thanks to
<a href="https://github.com/chapmanb/bcbio-nextgen/issues/817">the extensive discussion around build 38 support</a>.
I’d like to especially thank <a href="https://twitter.com/deannachurch">Deanna Church</a>
for writing the original scripts for remapping Genome in a Bottle to build 38;
<a href="https://www.ebi.ac.uk/about/people/laura-clarke">Laura Clarke</a> for preparing a
number of hg38 resources from the 1000 genomes project; and
<a href="https://www.linkedin.com/pub/alison-meynert/25/b19/556">Alison Meynert</a> for
using CrossMap to generate the resource files used by GATK variant calling. Many
of the tools and truth sets used here grew out of the work of the
<a href="http://ga4gh.org/#/benchmarking-team">Global Alliance for Genomics and Health (GA4GH) Benchmarking Team</a>.
I’m so grateful for the community and collaborations that make this work
possible.</p>
<h1 id="why-support-build-38">Why support build 38?</h1>
<p>Moving to build 38 requires an investment in validation, resource preparation
and tool integration. This effort is worthwhile because of the improvements
which help eliminate errors due to the genome build itself. Deanna Church has a
<a href="http://www.slideshare.net/GenomeRef/transitioning-to-grch38">great slide deck</a>
that summarizes the advantages of 38:</p>
<ul>
<li>
<p>Removal of false positives due to additional genomic regions in
build 38. If a region is missing from the reference build but a nearly identical
segment is present, the read from the missing region will map to the incorrect
segment. This results in hard to remove false positive variants in that segment.</p>
</li>
<li>
<p>Better capture of the diversity of the human genome by inclusion of
alternative haplotypes. A reference assembly is a useful construct for
creating a shared coordinate space for collaborating, but sub-populations need
to have representations of their unique sequences to ensure correct analysis.
This is especially true for medically important genes where we’ll soon be
making clinical decisions.</p>
</li>
<li>
<p>The inclusion of
<a href="https://twitter.com/erikgarrison/status/636089957979942912">highly diverse</a>
<a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">human leukocyte antigen (HLA) regions</a>
allows
<a href="https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing">HLA typing</a>,
enabling insight into these important immune system genes.</p>
</li>
<li>
<p>Moving to a
<a href="http://www.genomebiology.com/2015/16/1/13">graph based representation of the human genome</a>.
Capturing the heterogeneity of human populations requires
<a href="http://news.ucsc.edu/2015/01/genome-variation.html">cataloging and representing</a>
all potential variants, especially larger structural variations present only
in sub-populations. Build 38 is a nice step that prepares our tools and
analysis methods to consider alternative haplotypes across the genome.</p>
</li>
</ul>
<p>In short, the smart people at the
<a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml">Genome Reference Consortium</a>
produced a thoroughly improved version of the human genome based on our
increasing knowledge of variability. It’s a scientific improvement, and the main
challenges are how to validate and ensure that our existing tools work well with
the changes.</p>
<h1 id="validation">Validation</h1>
<p>Our primary goal was to provide variant calling on build 38 with
<a href="http://bcb.io/2014/10/07/joint-calling/">automated validation</a>. Based on
<a href="https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation">Heng Li’s validation of 38 using haploid/diploid comparisons with CHM1/NA12878</a>
we expected to find a larger number of variants detected in build 38 with a reduction in
false positives when using a build with alternative haplotypes.</p>
<p>We aligned 50x NA12878 100bp reads, available from
<a href="http://www.illumina.com/platinumgenomes/">Illumina Platinum Genomes</a>, using
<a href="https://github.com/lh3/bwa">bwa-mem</a> (v0.7.12) and performed variant calling
with
<a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">GATK HaplotypeCaller</a>
(v3.3-0) and <a href="https://github.com/ekg/freebayes">FreeBayes</a> (v0.9.21-7). We did
not
<a href="http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/">perform realignment or base quality score recalibration (BQSR) prior to variant calling</a>.
We filtered all calls with hard filters, as GATK variant quality score
recalibration (VQSR) intermittently failed to converge on these single sample
whole genome inputs. We aligned and called separately on build 37 (hg19/GRCh37)
and two build 38 (hg38/GRCh38) reference sets, with and without alternative
alleles. We performed validations against the truth sets using
<a href="https://github.com/RealTimeGenomics/rtg-tools">Real Time Genomics vcfeval</a>.</p>
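<p>Conceptually, validation partitions calls into true positives, false positives and false negatives. The naive positional sketch below illustrates the idea with hypothetical variants; vcfeval itself does a more sophisticated comparison that also reconciles differences in variant representation:</p>

```python
def compare_to_truth(calls, truth):
    """Classify calls against a truth set, keyed on (chrom, pos, ref, alt)."""
    calls, truth = set(calls), set(truth)
    return {
        "tp": len(calls & truth),   # called and in the truth set
        "fp": len(calls - truth),   # called but not in the truth set
        "fn": len(truth - calls),   # in the truth set but missed
    }

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
calls = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr3", 10, "T", "C")}
result = compare_to_truth(calls, truth)  # {'tp': 2, 'fp': 1, 'fn': 1}
```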
<h2 id="illumina-platinum-genomes">Illumina Platinum Genomes</h2>
<p>Illumina’s Platinum Genome initiative has natively prepared truth sets for both
build 37 and 38. This makes interpretation of validation results more
straightforward since we only have two potential sources of error: incorrect
calls in the validation set and errors in the truth set. Illumina created their
build 38 truth set by mapping against
<a href="https://groups.google.com/d/msg/genome-in-a-bottle/tDOcR0VRxcc/KPmifJlRAgAJ">38 without alternative alleles but with decoy sequence</a>.
They called variants using 4 different approaches, performed pedigree checks,
then combined the individual calls. Below are the results of comparing variant
calls from bcbio against this truth set. The comparison measures sensitivity and
precision metrics for SNPs and indels:</p>
<ul>
<li>
<p><a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">False negative rate (FNR)</a>
– A measure of the sensitivity of variant detection, calculated as 1 -
sensitivity (so 0.5% FNR = 99.5% sensitivity). By having a natural zero point for
plotting, this metric provides better resolution for numbers with high values where
small differences matter. Lower values of FNR are better.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">False discovery rate (FDR)</a>
– Measures the precision of variant detection based on false positives. It’s
calculated as 1 - precision. Lower values have a smaller number
of false positives compared to true positives.</p>
</li>
</ul>
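<p>In terms of the underlying counts, both metrics are simple ratios; a worked example with hypothetical numbers, matching the 0.5% FNR = 99.5% sensitivity relationship above:</p>

```python
def fnr(tp, fn):
    """False negative rate: 1 - sensitivity = FN / (TP + FN)."""
    return fn / (tp + fn)

def fdr(tp, fp):
    """False discovery rate: 1 - precision = FP / (TP + FP)."""
    return fp / (tp + fp)

# Hypothetical counts: 99,500 true positives, 500 missed, 300 false calls
sensitivity = 1 - fnr(99500, 500)  # 0.995, i.e. 0.5% FNR = 99.5% sensitivity
precision = 1 - fdr(99500, 300)
```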
<p><a href="http://i.imgur.com/phzZ09Y.png">
<img src="http://i.imgur.com/phzZ09Y.png" width="700" alt="Illumina Platinum Genomes truth set" />
</a></p>
<p>hg38 and hg38-noalt (without alternative haplotypes) have improved detection
compared to hg19, with both FreeBayes and GATK HaplotypeCaller. For SNPs, the
full hg38 comparison has fewer false positives (lower FDR) than hg38-noalt,
demonstrating the advantage of alternative haplotypes for resolving mapping and
calling issues. For indels, we detect more total indels in build 38 at the cost
of decreased sensitivity and precision. This reflects tuning of current
algorithms for build 37, and offers the opportunity for better overall detection
on 38. Additionally, since the Platinum Genomes mapping does not use alternative
alleles, some of the discordant calls from build 38 may be due to incorrect
calls in the truth set.</p>
<h2 id="genome-in-a-bottle-reference-with-coordinate-remapping">Genome in a Bottle reference with coordinate remapping</h2>
<p>We also validated build 38 calls against the Genome in a Bottle NA12878 truth
set. Genome in a Bottle uses inputs from
<a href="http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html">multiple sequencing technologies and callers</a>,
but is only currently available for build 37. To prepare it for this comparison
we converted the coordinates to 38 in two ways: using
<a href="http://www.ncbi.nlm.nih.gov/genome/tools/remap">NCBI’s remap</a> and
<a href="http://crossmap.sourceforge.net/">CrossMap</a> with a
<a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/">UCSC chain file</a>.
This adds an additional layer into the comparison since we’re now considering
three potential sources of errors in the evaluation:</p>
<ul>
<li>Incorrect calls in the evaluation dataset</li>
<li>Errors in the truth set</li>
<li>Incorrect or missing variants created during coordinate conversion</li>
</ul>
<p>This approach evaluates coordinate conversion methods at the cost of more
subtleties when assessing the concordant and discordant calls. Even with
potential coordinate conversion issues, the Genome in a Bottle truth set
shows good performance of mapping and calling on build 38.</p>
<ul>
<li>
<p>We see an improvement in total indels detected in build 38 for both coordinate
conversion methods. This is due to locations where a single location in 37
maps to multiple locations in 38. FreeBayes indel resolution is not as good as
GATK HaplotypeCaller, unlike the results for the Platinum Genomes comparisons.
This reflects some bias towards GATK heuristics in the Genome in a Bottle
dataset, which uses HaplotypeCaller for variant resolution. A majority of the
additional discordant FreeBayes SNPs and Indels have overlapping calls with
the truth set but differ in the allele representation, indicating areas where
FreeBayes can improve resolution of heterozygotes versus homozygotes.</p>
</li>
<li>
<p>For SNP calling we have fewer concordant variants, due to a loss or
mismapping of calls in difficult regions. CrossMap has fewer overall SNPs
(concordant + discordant missing), indicating a loss of some SNPs during
conversion. In contrast, NCBI remap converts more variants but also has a
larger number that aren’t called and show up as false negatives in the
comparison. In both cases, these are possible areas to improve coordinate
remapping resolution.</p>
</li>
<li>
<p>There is a reduction in SNP false positives in hg38 compared to hg38-noalt,
but not as dramatic as seen in the native Platinum Genomes truth set. There is
value in the alternative alleles in this comparison, moderated by the
challenges in coordinate remapping of variants in complex regions.</p>
</li>
</ul>
<p><a href="http://i.imgur.com/4XTIf2p.png">
<img src="http://i.imgur.com/4XTIf2p.png" width="700" alt="Genome in a Bottle truth set" />
</a></p>
<h2 id="conclusions">Conclusions</h2>
<p>Both the native and coordinate converted validation sets inspire confidence that
we can call variants well on build 38. They also demonstrate improvements due to
better alignment with the alternative alleles, with improved indel detection
and removal of false positive SNPs.</p>
<p>Coordinate conversion of build 37 resources with CrossMap and Remap captures
the majority of the signal in the validations, with some loss in more difficult
regions. Continued work to coordinate-convert missing resources to 38 should
also provide additional information about edge cases where we should use more
caution in interpreting results.</p>
<h1 id="scripts-chromosome-naming-and-future-work">Scripts, chromosome naming and future work</h1>
<p>All of the data and code used here are freely available to build off this work.
A
<a href="https://github.com/hbc/giab_remap_38">GitHub repository contains the scripts used to convert build 37 to 38</a>,
along with links to the prepared Genome in a Bottle truth sets and validation
outputs. Configuration files and input data for re-running the validations are
in
<a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#human-genome-build-38">the bcbio example pipeline documentation</a>.
We welcome re-analyses and re-interpretation of these results.</p>
<p>For build 38, all of the resources in bcbio use the ‘chr’ prefixed naming scheme
(chr1, chr2, chr3…). In pre-release discussions about build 38, UCSC, Ensembl,
NCBI and other major providers agreed on this as the preferred nomenclature.
NCBI’s build 38 analysis sets use it and
<a href="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS">document the chromosome naming conventions</a>.
Unfortunately, some groups
<a href="https://www.biostars.org/p/94826/#116487">decided to stay with non-prefixed names</a>
(1, 2, 3…). There is too much interesting biology I’d prefer to work on,
rather than spending time supporting two sets of resources that differ only in
the chromosome name. In bcbio we’ve decided not to perpetuate this dual naming
scheme for another genome release and will convert non-conforming naming over to
the ‘chr’ prefixed naming scheme.</p>
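<p>As a small illustration of the conversion involved (a hypothetical sketch, not bcbio’s implementation), mapping non-prefixed contig names onto the ‘chr’ convention for a tab-delimited resource like a BED file might look like:</p>

```python
def chr_prefix(name):
    """Map a non-prefixed contig name to the 'chr' naming scheme.
    Assumes the common UCSC convention where 'MT' becomes 'chrM'."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    return "chr" + name

def convert_bed_line(line):
    fields = line.rstrip("\n").split("\t")
    fields[0] = chr_prefix(fields[0])
    return "\t".join(fields)

converted = [convert_bed_line(l) for l in ["1\t100\t200\tgene1", "MT\t5\t50\tgene2"]]
```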
<p>Currently bcbio has validated SNP and indel calling for build 38. We plan to support
structural variant validations, using lifted over resources from Genome in a
Bottle and the
<a href="https://www.synapse.org/#!Synapse:syn312572/wiki/">ICGC-TCGA DREAM challenge data</a>.
We also plan to validate RNA-seq tools on build 38.</p>
<p>For small variant calling, we hope to continue to work with the community to
convert additional 37 only resources to 38. Supporting build 38 in
<a href="http://gemini.readthedocs.org/en/latest/">GEMINI</a> is one of the last large
hurdles. We also plan to integrate tools like
<a href="https://github.com/ekg/vg">Erik Garrison’s vg</a> for mapping and variant calling
against variant graphs. Continuing to move to graph based methods will provide
us with a more reliable assessment of variation in sub-populations, and improved
variant calling in regions with multiple alternative alleles. Build 38 is a
great first step in moving towards a more accurate, graph-based, representation
of human heterogeneity.</p>Brad Chapmanchapmanb@fastmail.comValidating multiple cancer variant callers and prioritization in tumor-only samples2015-03-05T11:13:00+00:002015-03-05T11:13:00+00:00http://bcb.io/2015/03/05/cancerval<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Overview</h2>
<div class="outline-text-2" id="text-1">
<p> The post discusses work validating multiple cancer variant callers in <a href="http://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> using a synthetic reference call set from <a href="https://www.synapse.org/#!Synapse:syn312572">the ICGC-TCGA DREAM challenge</a>. We've previously <a href="http://bcb.io/2014/10/07/joint-calling/">validated germline variant calling methods</a>, but cancer calling is additionally challenging. Tumor samples have mixed cellularity due to contaminating normal sample, and consist of multiple sub-clones with different somatic variations. Low-frequency sub-clonal variations can be critical to understand disease progression but are more difficult to detect with high sensitivity and precision. </p>
<p> Publicly available whole genome truth sets like the <a href="http://genomeinabottle.org/">NA12878 Genome in a Bottle reference materials</a> don't currently exist for cancer calling, but other groups have been working to prepare standards for use in evaluating callers. The <a href="https://www.synapse.org/#!Synapse:syn312572">DREAM challenge</a> provides a set of <a href="https://www.synapse.org/#!Synapse:syn312572/wiki/62018">synthetic datasets</a> that include cellularity and multiple sub-clones. There is also an ongoing DREAM contest with real, non-simulated data. In addition, the <a href="https://icgc.org/">ICGC</a> has done extensive work on <a href="http://biorxiv.org/content/early/2014/12/24/013177">assessing the challenges in cancer variant calling</a> and produced a detailed comparison of <a href="http://biorxiv.org/content/early/2014/12/24/013177">multiple variant calling pipelines</a>. Finally, <a href="http://www.bina.com/">Bina</a> recently released a <a href="https://github.com/bioinform/varsim">simulation framework called varsim</a> that they used to <a href="http://info.bina.com/hs-fs/hub/419826/file-1900487108-pdf/Posters/ASHG_2014_VarSim.pdf">compare multiple callers</a> as part of their <a href="http://info.bina.com/cancer-resources">cancer benchmarking</a>. I'm excited about all this community benchmarking and hope this work can contribute to the goal of having fully consented real patient reference materials, leading to a good understanding of variant caller tradeoffs. </p>
<p> In this work, we <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal">evaluated cancer tumor/normal variant calling</a> with <a href="https://www.synapse.org/#!Synapse:syn312572/wiki/62018">synthetic dataset 3</a> from the DREAM challenge, using multiple approaches to detect SNPs, indels and structural variants. A majority voting ensemble approach combines inputs from multiple callers into a final callset with good sensitivity and precision. We also provide a prioritization method to enrich for somatic mutations in tumor samples without matched normals. </p>
<p> Cancer variant calling in <a href="http://github.com/chapmanb/bcbio-nextgen">bcbio</a> is due to contributions and support from multiple members of the community: </p>
<ul class="org-ul">
<li><a href="https://github.com/mjafin">Miika Ahdesmaki</a> and <a href="https://twitter.com/BioInfo">Justin Johnson</a> at <a href="http://www.astrazeneca.com/Home">AstraZeneca</a> collaborated with our group on integrating and evaluating multiple variant callers. Their financial support helped fund our time on this comparison, and their scientific contributions improved SNP and indel calling.
</li>
<li><a href="https://github.com/lbeltrame">Luca Beltrame</a> integrated a number of cancer variant callers into bcbio and wrote the initial framework for somatic calling.
</li>
<li><a href="https://github.com/lpantano">Lorena Pantano</a> integrated variant callers and performed benchmarking work. Members of our group, the <a href="http://hsphbio.ghost.io/">Harvard Chan school Bioinformatics core</a>, regularly contribute to bcbio development and validation work.
</li>
<li>The <a href="http://www.gla.ac.uk/researchinstitutes/cancersciences/ics/">Wolfson Wohl Cancer Research Centre</a> supported our work on validation of somatic callers.
</li>
<li>James Cuff, Paul Edmon and the team at <a href="https://rc.fas.harvard.edu/">Harvard FAS research computing</a> provided compute infrastructure and support that enabled the large number of benchmarking runs it takes to get good quality calling.
</li>
</ul>
<p> I'm continually grateful to the community for all the contributions and support. The <a href="https://github.com/chapmanb/bcbio-nextgen/commits/master/bcbio/variation/mutect.py">MuTect commit history for bcbio</a> is a great example of multiple distributed collaborators working towards the shared goal of integrating and validating a caller. </p>
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Variant caller validation</h2>
<div class="outline-text-2" id="text-2">
<p> We used a large collection of open source variant callers to detect SNPs, Indels and structural variants: </p>
<ul class="org-ul">
<li><a href="https://www.broadinstitute.org/cancer/cga/mutect">MuTect (1.1.5)</a> – A SNP only caller, built on GATK UnifiedGenotyper, from the Broad Institute. MuTect requires a license if used for commercial purposes.
</li>
<li><a href="https://github.com/AstraZeneca-NGS/VarDict">VarDict (2014-12-15)</a> – A SNP and indel caller from Zhongwu Lai and the oncology team at AstraZeneca.
</li>
<li><a href="https://github.com/ekg/freebayes">FreeBayes (0.9.20-1)</a> – A haplotype aware realigning caller for SNPs and indels from Erik Garrison and Gabor Marth's lab.
</li>
<li><a href="http://varscan.sourceforge.net/">VarScan (2.3.7)</a> – A heuristic/statistic based somatic SNP and indel caller from Dan Koboldt and The Genome Institute at Washington University.
</li>
<li><a href="http://scalpel.sourceforge.net/">Scalpel (0.3.1)</a> – A micro-assembly based indel caller from Giuseppe Narzisi and the Schatz lab. We pair Scalpel with MuTect to provide a complete set of small variant calls.
</li>
<li><a href="https://github.com/arq5x/lumpy-sv">LUMPY (0.2.7)</a> – A <a href="http://genomebiology.com/2014/15/6/R84/abstract">probabilistic structural variant caller</a> incorporating both split read and read pair discordance, developed by Ryan Layer in <a href="http://quinlanlab.org/">Aaron Quinlan</a> and <a href="http://genome.wustl.edu/people/groups/detail/hall-lab/">Ira Hall's</a> labs.
</li>
<li><a href="https://github.com/tobiasrausch/delly">DELLY (0.6.1)</a> – An <a href="http://bioinformatics.oxfordjournals.org/content/28/18/i333.abstract">integrated paired-end and split-end structural variant caller</a> developed by Tobias Rausch.
</li>
<li><a href="https://github.com/jewmanchue/wham">WHAM (1.5.1)</a> – A structural variant caller that can incorporate association testing, from Zev Kronenberg in Mark Yandell's lab at the University of Utah.
</li>
</ul>
<p> bcbio runs these callers and uses simple ensemble methods to combine <a href="http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/">small variants (SNPs, indels</a>) and <a href="http://bcb.io/2014/08/12/validated-whole-genome-structural-variation-detection-using-multiple-callers/">structural variants</a> into final combined callsets. The <a href="https://github.com/chapmanb/bcbio.variation.recall">new small variant ensemble method</a> uses a simplified majority rule classifier that picks variants to pass based on being present in a configurable number of samples. This performs well and is faster than the previous implementation that made use of both this approach and a subsequent support vector machine step. </p>
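<p> As a rough illustration, the majority-rule ensemble reduces to counting support for each variant across the individual inputs. The sketch below is a simplified toy version (variant keys and caller names are invented for the example), not bcbio's actual implementation: </p>

```python
from collections import Counter

def ensemble_calls(calls_by_caller, min_support=2):
    """Keep variants reported by at least `min_support` of the inputs.

    calls_by_caller: dict mapping caller name -> set of
    (chrom, pos, ref, alt) tuples. Illustrative only.
    """
    counts = Counter(v for calls in calls_by_caller.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}

calls = {
    "mutect":    {("1", 100, "A", "T"), ("1", 200, "G", "C")},
    "vardict":   {("1", 100, "A", "T")},
    "freebayes": {("1", 100, "A", "T"), ("1", 300, "C", "G")},
}
# The variant at position 100 has three supporting inputs and passes;
# the singleton calls at 200 and 300 are dropped.
assert ensemble_calls(calls, min_support=2) == {("1", 100, "A", "T")}
```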
<p> Using the 100x whole genome tumor/normal pair from <a href="https://www.synapse.org/#!Synapse:syn312572/wiki/62018">DREAM synthetic dataset 3</a> we evaluated each of the callers for sensitivity and precision on small variants (SNPs and indels). This synthetic dataset contains 100% tumor cellularity with 3 subclones at allele frequencies of 50%, 33% and 20%. </p>
<p><a href="http://i.imgur.com/Ck8AJQk.png"><br />
<img src="http://i.imgur.com/Ck8AJQk.png" width="700" alt="Cancer evaluation for SNPs and indels" /><br />
</a></p>
<p> In addition to the whole genome results, the <a href="http://imgur.com/a/qba5k">validation album</a> includes results from running against the same dataset limited to exome regions. This has identical patterns of sensitivity and precision. It runs quicker, so is useful for evaluating changes to filtering or program parameters. </p>
<p> We also looked at structural variant calls for larger deletions, duplications and inversions. Here is the precision and sensitivity for duplications across multiple size classes: </p>
<p><a href="http://i.imgur.com/csQPl5p.png"><br />
<img src="http://i.imgur.com/csQPl5p.png" width="700" alt="Cancer evaluation for structural variants -- duplications." /><br />
</a></p>
<p> The <a href="http://imgur.com/a/qba5k">full album of validation results</a> includes the comparisons for deletion and inversion events. These comparisons measure the contributions of individual callers within <a href="http://bcb.io/2014/08/12/validated-whole-genome-structural-variation-detection-using-multiple-callers/">an ensemble approach</a> that attempts to maximize sensitivity and specificity for the combined callset. Keep in mind that each of the individual results makes use of other caller information in filtering. Our goal is to create the best possible combined calls, rather than a platform for unbiased comparison of structural variant callers. We're also actively working on improving sensitivity and specificity for individual callers and expect the numbers to continue to evolve. For example, <a href="https://github.com/jewmanchue/">Zev Kronenberg</a> added WHAM's ability to identify different classes of structural changes, and we're still in the process of improving the classifier. </p>
</div>
</div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">Improvements in filtering</h2>
<div class="outline-text-2" id="text-3">
<p> Our evaluation comparisons show best effort attempts to provide good quality calls for every caller. The final results often come from multiple rounds of improving sensitivity and precision by adjusting program parameters or downstream filtering. The goal of tightly integrating bcbio with validation is that the community can work on defining a set of parameters and filters that work best in multiple cases, and then use these directly within the same framework for processing real data. </p>
<p> Because we present only the final results, it may not be clear that plugging a specific tool into a custom bash script will not necessarily produce the same results we see here. As an example, here are the improvements in FreeBayes sensitivity and precision from our initial implementation, presented over the exome regions of synthetic dataset 3: </p>
<p><a href="http://i.imgur.com/NJFaoas.png"><br />
<img src="http://i.imgur.com/NJFaoas.png" width="500" alt="FreeBayes caller improvements" /><br />
</a></p>
<p> The original implementation used a <a href="https://github.com/ekg/vcflib">vcfsamplediff based</a> approach to filtering, as recommended on the <a href="https://groups.google.com/d/msg/freebayes/beLYRuHMkQE/RwFMniDmBYoJ">FreeBayes mailing list</a>. The current, improved version uses a <a href="https://github.com/chapmanb/bcbio-nextgen/blob/4fe770cc1343f8e1a3f3fab1771bad13eb94df7a/bcbio/variation/freebayes.py#L217">custom filter based on genotype likelihoods</a>, based on the approach in the <a href="https://github.com/cc2qe/speedseq">speedseq pipeline</a>. </p>
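<p> The genotype-likelihood idea can be sketched as requiring both strong variant evidence in the tumor and confident reference calls in the normal. The function, threshold and scoring below are illustrative assumptions, not the exact speedseq or bcbio filter: </p>

```python
def likely_somatic(tumor_gls, normal_gls, thresh=3.5):
    """Classify a call as somatic from log10 genotype likelihoods.

    tumor_gls / normal_gls: [hom_ref, het, hom_alt] log10 likelihoods.
    Requires the tumor to favor a variant genotype AND the normal to
    favor homozygous reference, each by at least `thresh` log10 units.
    Threshold and scoring are illustrative, not bcbio's actual filter.
    """
    tumor_lod = max(tumor_gls[1], tumor_gls[2]) - tumor_gls[0]
    normal_lod = normal_gls[0] - max(normal_gls[1], normal_gls[2])
    return tumor_lod >= thresh and normal_lod >= thresh

# Strong variant evidence in tumor, confident reference in normal: somatic.
assert likely_somatic([-20.0, -0.1, -8.0], [-0.05, -9.0, -30.0])
# Normal also shows variant evidence: likely germline, so filtered.
assert not likely_somatic([-20.0, -0.1, -8.0], [-5.0, -0.2, -10.0])
```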
<p> In general, you can find all of the integration work for individual callers in the <a href="https://github.com/chapmanb/bcbio-nextgen/tree/master/bcbio/variation">bcbio source code</a>, broken down by caller. For instance, here is the integration work on <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/mutect.py">MuTect</a>. The goal is to make all of the configuration settings and filters fully transparent so users can understand how they work when using bcbio, or transplant them into their own custom pipelines. </p>
</div>
</div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4">Tumor-only prioritization</h2>
<div class="outline-text-2" id="text-4">
<p> The above validations were all done on cancer calling with tumor and normal pairs. The filters to separate pre-existing <a href="https://en.wikipedia.org/wiki/Germline_mutation">germline mutations</a> from <a href="https://en.wikipedia.org/wiki/Mutation#Somatic_mutations">cancer specific somatic mutations</a> rely on the presence of a normal sample. In some cases, we don't have matched normal samples to do this filtering. Two common examples are <a href="http://www.mobio.com/blog/2012/07/20/formalin-fixed-paraffin-embedded-tissue-dna-isolation-the-basics/">FFPE samples</a> and tumor cell lines. For these samples, we'd like to be able to prioritize likely tumor specific variations for followup using publicly available resources. </p>
<p> We implemented a prioritization strategy for tumor-only samples in bcbio that takes advantage of publicly available resources like <a href="http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/">COSMIC</a>, <a href="http://www.clinvar.com/">ClinVar</a>, <a href="http://www.1000genomes.org/">1000 genomes</a>, <a href="http://evs.gs.washington.edu/EVS/">ESP</a> and <a href="http://exac.broadinstitute.org/">ExAC</a>. It uses <a href="https://github.com/arq5x/gemini">GEMINI</a> to annotate the initial tumor-only VCF calls with external annotations, then extracts these to <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/prioritize.py">prioritize variants</a> with high or medium predicted impact, not present in 1000 genomes or ExAC at more than 1% in any subpopulation, or identified as pathogenic in COSMIC or ClinVar. </p>
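<p> The core of that prioritization can be sketched as a simple predicate over the annotations. The field names and the exact combination of criteria here are illustrative assumptions for the example, not GEMINI's schema or bcbio's actual code: </p>

```python
def prioritize(variant):
    """Flag a tumor-only call as a likely somatic candidate.

    `variant` is a dict of annotation fields (illustrative names):
      impact     -- predicted impact severity ("HIGH", "MED", "LOW")
      pop_freqs  -- list of subpopulation frequencies from 1000G/ExAC
      cosmic_id / clinvar_pathogenic -- known-pathogenic evidence
    """
    known_pathogenic = variant.get("cosmic_id") or variant.get("clinvar_pathogenic")
    rare = all(freq <= 0.01 for freq in variant.get("pop_freqs", [0.0]))
    impactful = variant.get("impact") in {"HIGH", "MED"}
    return bool(known_pathogenic) or (impactful and rare)

# Common in a subpopulation: deprioritized as likely germline.
assert not prioritize({"impact": "HIGH", "pop_freqs": [0.05]})
# Rare and high impact: prioritized.
assert prioritize({"impact": "HIGH", "pop_freqs": [0.001]})
# Known COSMIC hit: prioritized regardless of predicted impact.
assert prioritize({"impact": "LOW", "pop_freqs": [0.0], "cosmic_id": "COSM123"})
```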
<p> Validating this prioritization strategy requires real tumor samples with known mutations. Our synthetic datasets are not useful here, since the variants <a href="https://www.synapse.org/#!Synapse:syn312572/wiki/62018">do not necessarily model standard biological variability</a>. You could spike in biologically relevant mutations, as done in the <a href="http://info.bina.com/hs-fs/hub/419826/file-1900487108-pdf/Posters/ASHG_2014_VarSim.pdf">VarSim cancer simulated data</a>, but this will bias towards our prioritization approach since both would use the same set of necessarily imperfect known variants and population level mutations. </p>
<p> We took the approach of using published tumor data with validated mutations. <a href="https://twitter.com/druvus">Andreas Sjödin</a> identified a <a href="http://onlinelibrary.wiley.com/doi/10.1002/hep.27243/full">Hepatoblastoma exome sequencing paper</a> with <a href="http://www.ebi.ac.uk/ena/data/view/SRP037747">publicly available sample data</a> and 23 validated cancer related variations across 5 samples. This is a baseline to help determine how stringent to be in removing potential germline variants. </p>
<p> The prioritization enriches variants of interest by 35-50x without losing sensitivity to confirmed variants: </p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups">
<colgroup>
<col class="left" />
<col class="left" />
<col class="left" />
<col class="right" />
<col class="right" />
<col class="right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="left">sample</th>
<th scope="col" class="left">caller</th>
<th scope="col" class="left">confirmed</th>
<th scope="col" class="right">enrichment</th>
<th scope="col" class="right">additional</th>
<th scope="col" class="right">filtered</th>
</tr>
</thead>
<tbody>
<tr>
<td class="left">HB2T</td>
<td class="left">freebayes</td>
<td class="left">6 / 7</td>
<td class="right">44x</td>
<td class="right">1288</td>
<td class="right">56046</td>
</tr>
<tr>
<td class="left">HB2T</td>
<td class="left">mutect</td>
<td class="left">6 / 7</td>
<td class="right">48x</td>
<td class="right">1014</td>
<td class="right">47755</td>
</tr>
<tr>
<td class="left">HB2T</td>
<td class="left">vardict</td>
<td class="left">6 / 7</td>
<td class="right">36x</td>
<td class="right">1464</td>
<td class="right">52090</td>
</tr>
<tr>
<td class="left">HB3T</td>
<td class="left">freebayes</td>
<td class="left">4 / 4</td>
<td class="right">46x</td>
<td class="right">1218</td>
<td class="right">54997</td>
</tr>
<tr>
<td class="left">HB3T</td>
<td class="left">mutect</td>
<td class="left">4 / 4</td>
<td class="right">49x</td>
<td class="right">961</td>
<td class="right">46894</td>
</tr>
<tr>
<td class="left">HB3T</td>
<td class="left">vardict</td>
<td class="left">4 / 4</td>
<td class="right">35x</td>
<td class="right">1511</td>
<td class="right">51404</td>
</tr>
<tr>
<td class="left">HB6T</td>
<td class="left">freebayes</td>
<td class="left">4 / 4</td>
<td class="right">43x</td>
<td class="right">1314</td>
<td class="right">56240</td>
</tr>
<tr>
<td class="left">HB6T</td>
<td class="left">mutect</td>
<td class="left">4 / 4</td>
<td class="right">51x</td>
<td class="right">946</td>
<td class="right">47747</td>
</tr>
<tr>
<td class="left">HB6T</td>
<td class="left">vardict</td>
<td class="left">3 / 4</td>
<td class="right">35x</td>
<td class="right">1497</td>
<td class="right">51625</td>
</tr>
<tr>
<td class="left">HB8T</td>
<td class="left">freebayes</td>
<td class="left">6 / 6</td>
<td class="right">42x</td>
<td class="right">1364</td>
<td class="right">57121</td>
</tr>
<tr>
<td class="left">HB8T</td>
<td class="left">mutect</td>
<td class="left">6 / 6</td>
<td class="right">47x</td>
<td class="right">1053</td>
<td class="right">48639</td>
</tr>
<tr>
<td class="left">HB8T</td>
<td class="left">vardict</td>
<td class="left">6 / 6</td>
<td class="right">35x</td>
<td class="right">1542</td>
<td class="right">52642</td>
</tr>
<tr>
<td class="left">HB9T</td>
<td class="left">freebayes</td>
<td class="left">2 / 2</td>
<td class="right">41x</td>
<td class="right">1420</td>
<td class="right">57582</td>
</tr>
<tr>
<td class="left">HB9T</td>
<td class="left">mutect</td>
<td class="left">2 / 2</td>
<td class="right">44x</td>
<td class="right">1142</td>
<td class="right">49858</td>
</tr>
<tr>
<td class="left">HB9T</td>
<td class="left">vardict</td>
<td class="left">2 / 2</td>
<td class="right">36x</td>
<td class="right">1488</td>
<td class="right">53098</td>
</tr>
</tbody>
</table>
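<p> The enrichment column is consistent with dividing the total candidate calls by those retained after prioritization. A quick check, assuming that definition (the formula is our inference from the table, not a stated one): </p>

```python
# Assuming enrichment = (additional + filtered) // additional,
# i.e. total candidate calls over retained calls, truncated.
# Rows copied from the table above: (sample, caller, additional,
# filtered, reported enrichment).
rows = [
    ("HB2T", "freebayes", 1288, 56046, 44),
    ("HB2T", "mutect",    1014, 47755, 48),
    ("HB6T", "mutect",     946, 47747, 51),
    ("HB9T", "freebayes", 1420, 57582, 41),
]
for sample, caller, additional, filtered, reported in rows:
    assert (additional + filtered) // additional == reported, (sample, caller)
```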
<p> We consistently missed one confirmed mutation in the HB2T sample. This variant, reported as a somatic mutation in an <a href="http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000177673;r=2:231592901-231594283;t=ENST00000313965;redirect=no">uncharacterized open reading frame (C2orf57</a>), may actually be a germline mutation in the study sub-population. The variant is present at a <a href="http://exac.broadinstitute.org/variant/2-232458818-C-T">10% frequency in the East Asian population</a> but only 2% in the overall population, based on data from both the ExAC and 1000 genomes projects. Although the ethnicity of the original samples is not reported, the study authors are all from China. This helps demonstrate the effectiveness of large population frequencies, stratified by population, in prioritizing and evaluating variant calls. </p>
<p> The major challenge with tumor-only prioritization approaches is that you can't expect to accurately filter private germline mutations that you won't find in genome-wide catalogs. With a sample set using a small number of validated variants it's hard to estimate the proportion of 'additional' variants in the table above that are germline false positives versus the proportion that are additional tumor-only mutations not specifically evaluated in the study. We plan to continue to refine filtering with additional validated datasets to help improve this discrimination. </p>
<p> Practically, bcbio automatically runs prioritization with all tumor-only analyses. It filters variants by adding a <code>LowPriority</code> filter to the output VCF so users can readily identify variants flagged during the prioritization. </p>
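<p> For downstream work with that output, selecting the records that were not flagged is a one-line filter on the VCF FILTER column. A minimal sketch assuming a standard tab-delimited VCF (in practice a dedicated tool such as bcftools handles this more robustly): </p>

```python
def passing_records(vcf_lines):
    """Yield VCF data lines whose FILTER field (column 7) does not
    include the LowPriority flag added during prioritization."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        filters = line.split("\t")[6].split(";")
        if "LowPriority" not in filters:
            yield line

vcf = ["##fileformat=VCFv4.1",
       "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
       "1\t100\t.\tA\tT\t50\tPASS\t.",
       "1\t200\t.\tG\tC\t40\tLowPriority\t."]
# Only the PASS record at position 100 survives.
assert [l.split("\t")[1] for l in passing_records(vcf)] == ["100"]
```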
</div>
</div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5">Future work</h2>
<div class="outline-text-2" id="text-5">
<p> This is a baseline for assessing SNP, indel and structural variant calls in cancer analyses. It also prioritizes likely impactful variants in cases where we lack matched normal samples. We plan to continue to improve cancer variant calling in <a href="http://github.com/chapmanb/bcbio-nextgen">bcbio</a> and some areas of future focus include: </p>
<ul class="org-ul">
<li>Informing variant calling using estimates of tumor purity and sub-clonal frequency. bcbio integrates <a href="http://cnvkit.readthedocs.org/en/latest/">CNVkit</a>, a copy number caller, which exports read count segmentation data. Tools like <a href="https://github.com/raphael-group/THetA">THetA2</a>, <a href="https://github.com/morrislab/phylowgs">phyloWGS</a>, <a href="https://github.com/uci-cbcl/PyLOH">PyLOH</a> and <a href="https://github.com/genome/sciclone">sciClone</a> use these inputs to estimate normal contamination and sub-clonal frequencies.
</li>
<li>Focusing on difficult-to-call genomic regions and providing additional mechanisms to better resolve and improve caller accuracy in these regions. Using small variant calls to define problematic genome areas with large structural rearrangements can help prioritize and target the use of more computationally expensive methods for copy number assessment, structural variant calling and de novo assembly.
</li>
<li>Evaluating calling and tumor-only prioritization on <a href="http://www.horizondx.com/our-products/q-seq-hdx-reference-standards">Horizon reference standards</a>. They contain a larger set of <a href="http://www.horizondx.com/media/item/202">validated mutations</a> at a variety of frequencies.
</li>
</ul>
<p> As always, we welcome suggestions, contributions and feedback. </p>
</div>
</div>Brad Chapmanchapmanb@fastmail.com
Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker2014-12-19T15:17:00+00:002014-12-19T15:17:00+00:00http://bcb.io/2014/12/19/awsbench<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Overview</h2>
<div class="outline-text-2" id="text-1">
<p> We developed a freely available, easy to run implementation of <a href="http://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> on <a href="http://aws.amazon.com/">Amazon Web Services (AWS)</a> using <a href="https://docker.com/">Docker</a>. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without the need to install and configure a local installation. </p>
<p> The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses <a href="https://github.com/gc3-uzh-ch/elasticluster">Elasticluster</a> to provision and configure the cluster. We automate the process with <a href="http://boto.readthedocs.org/en/latest/">the boto Python interface to AWS</a> and <a href="http://www.ansible.com/home">Ansible scripts</a>. <a href="https://github.com/chapmanb/bcbio-nextgen-vm">bcbio-vm</a> isolates code and tools inside a <a href="https://docker.com/">Docker</a> container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data. </p>
<p> We provide timing benchmarks for running a full variant calling analysis using <a href="http://github.com/chapmanb/bcbio-nextgen">bcbio</a> on AWS. The benchmark dataset was <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal">a cancer tumor/normal evaluation</a>, from <a href="https://www.synapse.org/#!Synapse:syn312572">the ICGC-TCGA DREAM challenge</a>, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from <a href="http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html">the Sequencing Quality Control (SEQC) project</a>. </p>
<p> We developed bcbio on AWS and ran these timing benchmarks thanks to work with great partners. A collaboration with <a href="http://www.biogenidec.com/">Biogen</a> and <a href="http://www.massgeneral.org/research/researchlab.aspx?id=1402">Rudy Tanzi's group at MGH</a> funded the development of bcbio on AWS. A second collaboration with <a href="http://www.intel.com/content/www/us/en/healthcare-it/healthcare-overview.html">Intel Health and Life Sciences</a> and <a href="http://www.astrazeneca.com">AstraZeneca</a> funded the somatic variant calling benchmarking work. We're thankful for all the relationships that make this work possible: </p>
<ul class="org-ul">
<li>John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front line testing and improvements to all the components of the bcbio AWS interaction.
</li>
<li>Kristina Kermanshahche and Robert Read at <a href="http://www.intel.com/content/www/us/en/healthcare-it/healthcare-overview.html">Intel</a> provided great support helping us get the <a href="https://wiki.hpdd.intel.com/display/PUB/Intel+Cloud+Edition+for+Lustre*+Software">Lustre ICEL CloudFormation</a> templates running.
</li>
<li>Ronen Artzi, Michael Heimlich, and Justin Johnson at <a href="http://www.astrazeneca.com">AstraZeneca</a> set up Lustre, Gluster and NFS benchmarks using a bcbio <a href="http://star.mit.edu/cluster/index.html">StarCluster</a> instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS.
</li>
<li>Jason Tetrault, Karl Gutwin and Hank Wu at <a href="http://www.biogenidec.com/">Biogen</a> provided valuable feedback, suggestions and resources for developing bcbio on AWS.
</li>
<li>Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings.
</li>
<li>James Cuff, Paul Edmon and the team at <a href="https://rc.fas.harvard.edu/">Harvard FAS research computing</a> built and administered the Regal Lustre setup used for local testing.
</li>
<li>John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation. Community feedback and contributions are essential to bcbio development.
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Architecture</h2>
<div class="outline-text-2" id="text-2">
<p> The implementation provides a practical way to run large scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production quality runs. This writeup might feel a bit like a <a href="https://web.archive.org/web/20131122230658/http://rampantgames.com/blog/2004/10/black-triangle.html">black triangle moment</a> since I also wrote about <a href="https://bcb.io/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/">running bcbio on AWS three years ago</a>. That implementation was a demonstration for small scale usage rather than a production ready system. We now have a setup we can support and run on <a href="https://bcb.io/2013/05/22/scaling-variant-detection-pipelines-for-whole-genome-sequencing-analysis/">large scale projects</a> thanks to numerous changes in the backend architecture: </p>
<ul class="org-ul">
<li>Amazon, and cloud based providers in general, now provide high end filesystems and networking. Our AWS runs are fast because they use SSD backend storage, fast networking connectivity and high end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The dichotomy between local cluster hardware and cloud hardware will continue to expand with upcoming improvements in <a href="http://aws.amazon.com/blogs/aws/new-c4-instances/">compute (Haswell processors)</a> and <a href="http://www.infoq.com/news/2014/11/new-features-ec2-ebs-s3">storage (16Tb EBS SSD volumes)</a>.
</li>
<li>Isolating all of the software and code inside <a href="https://docker.com/">Docker</a> containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system specific issues. From a researcher's perspective, this will allow use of bcbio without needing to spend time installing and testing locally.
</li>
<li>The setup runs from standard Linux base images using <a href="http://www.ansible.com/home">Ansible scripts</a> and <a href="https://github.com/gc3-uzh-ch/elasticluster">Elasticluster</a>. This means we no longer need to build and update AMIs for changes in the architecture or code. This simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container specific management services like <a href="http://aws.amazon.com/ecs/">Amazon's EC2 container service</a>.
</li>
<li>All long term data storage happens in <a href="http://aws.amazon.com/s3/">Amazon's S3 object store</a>, including both analysis specific data as well as general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. On the analysis side, you maintain only the input files and high value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage storage of EBS volumes also provides a cost savings (<a href="http://aws.amazon.com/s3/pricing/">$0.03/Gb/month for S3</a> versus <a href="http://aws.amazon.com/ebs/pricing/">$0.10+/Gb/month for EBS</a>) and allows the option of archiving in <a href="https://aws.amazon.com/glacier/">Glacier</a> for long term storage.
</li>
</ul>
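<p> To make the S3-versus-EBS storage tradeoff concrete, here is a quick back-of-the-envelope calculation using the per-Gb prices quoted above. The 500Gb project size is an arbitrary illustrative figure, not a number from the benchmarks: </p>

```shell
# Monthly storage cost at the quoted rates: $0.03/Gb for S3, $0.10/Gb for EBS.
# 500Gb is an illustrative project size, not a figure from this post.
gb=500
s3=$(awk -v g="$gb" 'BEGIN { printf "%.2f", g * 0.03 }')
ebs=$(awk -v g="$gb" 'BEGIN { printf "%.2f", g * 0.10 }')
echo "S3:  \$${s3}/month"   # prints "S3:  $15.00/month"
echo "EBS: \$${ebs}/month"  # prints "EBS: $50.00/month"
```

<p> The gap widens further with Glacier archiving for inputs that only need occasional retrieval. </p>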
<p> All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher friendly interface to setting up and running analyses. We hope to achieve that through the in-development <a href="https://github.com/rabix/common-workflow-language">Common Workflow Language</a> from <a href="http://galaxyproject.org/">Galaxy</a>, <a href="https://arvados.org/">Arvados</a>, <a href="https://www.sbgenomics.com/">Seven Bridges</a>, <a href="http://www.taverna.org.uk/">Taverna</a> and the <a href="http://www.open-bio.org/wiki/Main_Page">open bioinformatics community</a>. </p>
</div>
</div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">Variant calling – benchmarking AWS versus local</h2>
<div class="outline-text-2" id="text-3">
<p> We benchmarked somatic variant calling in two environments: on the elasticluster Docker AWS implementation and on local <a href="https://rc.fas.harvard.edu/">Harvard FAS</a> machines. </p>
<ul class="org-ul">
<li>AWS processing was twice as fast as a local run. The gains occur in disk IO intensive steps like alignment post-processing. AWS offers the opportunity to rent SSD backed storage and obtain a 10GigE connected cluster without contention for network resources. Our local test machines have an in-production Lustre filesystem attached to a large highly utilized cluster provided by <a href="https://rc.fas.harvard.edu/">Harvard FAS research computing</a>.
</li>
<li>At this scale Lustre and NFS have similar throughput, with Lustre outperforming NFS during IO intensive steps like alignment, post-processing and large BAM file merging. From <a href="https://bcb.io/2013/05/22/scaling-variant-detection-pipelines-for-whole-genome-sequencing-analysis/">previous benchmarking work</a> we'll need to process additional samples in parallel to fully stress the shared filesystem and differentiate Lustre versus NFS performance. However, the resource plots at this scale show potential future bottlenecks during alignment, post-processing and other IO intensive steps. Generally, having Lustre scaled across 4 <a href="http://en.wikipedia.org/wiki/Logical_unit_number">LUNs</a> per object storage server (OSS) enables better distribution of disk and network resources.
</li>
</ul>
<p> AWS runs use two c3.8xlarge instances clustered in a single <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html">placement group</a>, providing 32 cores and 60Gb of memory per machine. Our local run was comparable with two compute machines, each with 32 cores and 128Gb of memory, connected to a Lustre filesystem. The benchmark is <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal">a cancer tumor/normal evaluation</a> consisting of alignment, recalibration, realignment and variant detection with four different callers. The input is a tumor/normal pair from <a href="https://www.synapse.org/#!Synapse:syn312572">the ICGC-TCGA DREAM challenge</a> with 100x coverage in exome regions. Here are the times, in hours, for each benchmark: </p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups">
<colgroup>
<col class="left" />
<col class="right" />
<col class="right" />
<col class="right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="left"> </th>
<th scope="col" class="right">AWS (Lustre)</th>
<th scope="col" class="right">AWS (NFS)</th>
<th scope="col" class="right">Local (Lustre)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="left">Total</td>
<td class="right">4:42</td>
<td class="right">5:05</td>
<td class="right">10:30</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="left">genome data preparation</td>
<td class="right">0:04</td>
<td class="right">0:10</td>
<td class="right"> </td>
</tr>
<tr>
<td class="left">alignment preparation</td>
<td class="right">0:12</td>
<td class="right">0:15</td>
<td class="right"> </td>
</tr>
<tr>
<td class="left">alignment</td>
<td class="right">0:29</td>
<td class="right">0:52</td>
<td class="right">0:53</td>
</tr>
<tr>
<td class="left">callable regions</td>
<td class="right">0:44</td>
<td class="right">0:44</td>
<td class="right">1:25</td>
</tr>
<tr>
<td class="left">alignment post-processing</td>
<td class="right">0:13</td>
<td class="right">0:21</td>
<td class="right">4:36</td>
</tr>
<tr>
<td class="left">variant calling</td>
<td class="right">2:35</td>
<td class="right">2:05</td>
<td class="right">2:36</td>
</tr>
<tr>
<td class="left">variant post-processing</td>
<td class="right">0:05</td>
<td class="right">0:03</td>
<td class="right">0:22</td>
</tr>
<tr>
<td class="left">prepped BAM merging</td>
<td class="right">0:03</td>
<td class="right">0:18</td>
<td class="right">0:06</td>
</tr>
<tr>
<td class="left">validation</td>
<td class="right">0:05</td>
<td class="right">0:05</td>
<td class="right">0:09</td>
</tr>
<tr>
<td class="left">population database</td>
<td class="right">0:06</td>
<td class="right">0:07</td>
<td class="right">0:09</td>
</tr>
</tbody>
</table>
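<p> As a quick sanity check on the headline claim (the arithmetic here is mine, not from the table), converting the total times to minutes shows the AWS Lustre run finishing in a bit under half the local wall time: </p>

```shell
# Convert h:mm timings from the table to minutes, then compute the
# AWS (Lustre) versus local (Lustre) speedup.
to_min() { awk -F: '{ print $1 * 60 + $2 }' <<< "$1"; }
aws_total=$(to_min "4:42")      # 282 minutes
local_total=$(to_min "10:30")   # 630 minutes
awk -v a="$aws_total" -v l="$local_total" \
    'BEGIN { printf "speedup: %.1fx\n", l / a }'   # prints "speedup: 2.2x"
```

<p> Most of that gap comes from the alignment post-processing row (0:13 on AWS Lustre versus 4:36 locally), matching the disk IO explanation above. </p>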
<p> To provide more insight into the timing differences between these benchmarks, we automated collection and plotting of resource usage on AWS runs. </p>
</div>
</div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4">Variant calling – resource usage plots</h2>
<div class="outline-text-2" id="text-4">
<p> bcbio retrieves <a href="http://collectl.sourceforge.net/">collectl</a> usage statistics from the server <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html#graphing-resource-usage">and prepares graphs</a> of CPU, memory, disk and network usage. These plots allow in-depth insight into limiting factors during individual steps in the workflow. </p>
<p> We'll highlight some interesting comparisons between NFS and Lustre during the variant calling benchmarking. During this benchmark, the two critical resources were CPU usage and disk IO on the shared filesystems. We also measured memory usage, but it was not a limiting factor in these analyses. In addition to the comparisons highlighted below, we have the full set of resource usage graphs available for each run: </p>
<ul class="org-ul">
<li><a href="http://imgur.com/a/AZjuC">Variant calling with NFS on AWS</a>
</li>
<li><a href="http://imgur.com/a/HfrqY">Variant calling with Lustre on AWS</a>
</li>
<li><a href="http://imgur.com/a/LSDFz">RNA-seq on a single machine on AWS</a>
</li>
</ul>
</div>
<div id="outline-container-sec-4-1" class="outline-3">
<h3 id="sec-4-1">CPU</h3>
<div class="outline-text-3" id="text-4-1">
<p> These plots compare CPU usage during processing steps for Lustre and NFS. The largest differences between the two runs are in the alignment, alignment post-processing and variant calling steps: </p>
</div>
<div id="outline-container-sec-4-1-1" class="outline-4">
<h4 id="sec-4-1-1">NFS</h4>
<div class="outline-text-4" id="text-4-1-1">
<a href="http://i.imgur.com/iUpvyHx.png"><br />
<img src="http://i.imgur.com/iUpvyHx.png" width="700" alt="CPU resource usage for NFS during variant calling" /><br />
</a>
</div>
</div>
<div id="outline-container-sec-4-1-2" class="outline-4">
<h4 id="sec-4-1-2">Lustre</h4>
<div class="outline-text-4" id="text-4-1-2">
<a href="http://i.imgur.com/59W8YvL.png"><br />
<img src="http://i.imgur.com/59W8YvL.png" width="700" alt="CPU resource usage for Lustre during variant calling" /><br />
</a>
<p> For alignment and alignment post-processing the Lustre runs show more stable CPU usage. NFS specifically spends more time in the CPU wait state (red line) during IO intensive steps. On larger scale projects this may become a limiting factor in processing throughput. The variant calling step was slower on Lustre than NFS, with inconsistent CPU usage. We'll have to investigate this slowdown further, since no other metrics point to an obvious bottleneck. </p>
</div>
</div>
</div>
<div id="outline-container-sec-4-2" class="outline-3">
<h3 id="sec-4-2">Shared filesystem network usage and IO</h3>
<div class="outline-text-3" id="text-4-2">
<p> These plots compare network usage during processing for Lustre and NFS. We use this as a consistent proxy for the performance of the shared filesystem and disk IO (the <a href="http://imgur.com/a/AZjuC">NFS plots</a> do have directly measured disk IO for comparison purposes). </p>
</div>
<div id="outline-container-sec-4-2-1" class="outline-4">
<h4 id="sec-4-2-1">NFS</h4>
<div class="outline-text-4" id="text-4-2-1">
<a href="http://i.imgur.com/vvru0sv.png"><br />
<img src="http://i.imgur.com/vvru0sv.png" width="700" alt="Network resource usage NFS" /><br />
</a>
</div>
</div>
<div id="outline-container-sec-4-2-2" class="outline-4">
<h4 id="sec-4-2-2">Lustre</h4>
<div class="outline-text-4" id="text-4-2-2">
<a href="http://i.imgur.com/onut3GI.png"><br />
<img src="http://i.imgur.com/onut3GI.png" width="700" alt="Network resource usage Lustre" /><br />
</a>
<p> The biggest difference in the IO intensive steps is that Lustre network usage is smoother compared to the spiky NFS input/output, due to spreading out read/writes over multiple disks. Including more processes with additional read/writes will help determine how these differences translate to scaling on larger numbers of simultaneous samples. </p>
</div>
</div>
</div>
</div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5">RNA-seq benchmarking</h2>
<div class="outline-text-2" id="text-5">
<p> We also ran an RNA-seq analysis using 4 samples from <a href="http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html">the Sequencing Quality Control (SEQC) project</a>. Each sample has 15 million 100bp paired reads. bcbio handled trimming, alignment with <a href="https://github.com/alexdobin/STAR">STAR</a>, and quantitation with <a href="http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html">DEXSeq</a> and <a href="http://cufflinks.cbcb.umd.edu/">Cufflinks</a>. We ran on a single AWS c3.8xlarge machine with 32 cores, 60Gb of memory, and attached SSD storage. </p>
<p> RNA-seq optimization in bcbio is at an earlier stage than variant calling. We've done work to speed up trimming and alignment, but haven't yet optimized the expression estimation and count steps. The full analysis finishes in a reasonable 6 1/2 hours, but there is still room for improvement, and this is a nice example of how we can use benchmarking plots to identify targets for additional work: </p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups">
<colgroup>
<col class="left" />
<col class="right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="left">Total</th>
<th scope="col" class="right">6:25</th>
</tr>
</thead>
<tbody>
<tr>
<td class="left">genome data preparation</td>
<td class="right">0:32</td>
</tr>
<tr>
<td class="left">adapter trimming</td>
<td class="right">0:32</td>
</tr>
<tr>
<td class="left">alignment</td>
<td class="right">0:24</td>
</tr>
<tr>
<td class="left">estimate expression</td>
<td class="right">3:41</td>
</tr>
<tr>
<td class="left">quality control</td>
<td class="right">1:16</td>
</tr>
</tbody>
</table>
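<p> Breaking the table down by each step's share of the 6:25 total (my arithmetic, using the figures above) makes the optimization target obvious: expression estimation alone accounts for well over half of the runtime. </p>

```shell
# Share of total RNA-seq runtime (6:25 = 385 minutes) spent in expression
# estimation (3:41 = 221 minutes), the slowest step in the table.
to_min() { awk -F: '{ print $1 * 60 + $2 }' <<< "$1"; }
total=$(to_min "6:25")
expr_est=$(to_min "3:41")
awk -v e="$expr_est" -v t="$total" \
    'BEGIN { printf "expression estimation: %.0f%% of total\n", 100 * e / t }'
# prints "expression estimation: 57% of total"
```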
<p> The <a href="http://imgur.com/a/LSDFz">RNA-seq collectl plots</a> show the cause of the slower steps during expression estimation and quality control. Here is CPU usage over the run: </p>
<p><a href="http://i.imgur.com/D43c94L.png"><br />
<img src="http://i.imgur.com/D43c94L.png" width="700" alt="RNA-seq CPU usage" /><br />
</a></p>
<p> The low CPU usage during the first 2 hours of expression estimation corresponds to DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample. Similarly, the QC steps could benefit from parallel processing. Alternatively, we're looking at validating other approaches for doing quantification like <a href="http://bio.math.berkeley.edu/eXpress/">eXpress</a>. These are the type of benchmarking and validation steps that are continually ongoing in the development of bcbio pipelines. </p>
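<p> A minimal sketch of that per-sample parallelization using shell background jobs; <code>count_sample</code> and the SEQC-style sample names are hypothetical placeholders for the real single-core DEXSeq counting command, not part of bcbio: </p>

```shell
# Launch one single-core counting job per sample in the background rather than
# looping serially, then wait for all of them. count_sample stands in for the
# real per-sample DEXSeq invocation; the sample names are placeholders.
count_sample() {
    echo "counted $1"   # placeholder for the actual counting command
}
for sample in UHRR_1 UHRR_2 HBRR_1 HBRR_2; do
    count_sample "$sample" &
done
wait   # all four jobs run concurrently; wall time tracks the slowest sample
```

<p> With 32 cores and 4 single-core jobs, this would shrink the serial portion to roughly the time of the slowest sample. </p>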
</div>
</div>
<div id="outline-container-sec-6" class="outline-2">
<h2 id="sec-6">Reproducing the analysis</h2>
<div class="outline-text-2" id="text-6">
<p> The process to launch the cluster and an NFS or optional Lustre shared filesystem is <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html">fully automated and documented</a>. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. <code>bcbio_vm.py</code> has commands to: </p>
<ul class="org-ul">
<li>Add an IAM user, a VPC and create the Elasticluster config.
</li>
<li>Launch a cluster and bootstrap with the latest bcbio code and data.
</li>
<li>Create and mount a Lustre filesystem attached to the cluster.
</li>
<li>Terminate the cluster and Lustre stack upon completion.
</li>
</ul>
<p> The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html">IAM instance profiles</a>. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation. </p>
</div>
<div id="outline-container-sec-6-1" class="outline-3">
<h3 id="sec-6-1">Running the somatic variant calling evaluation</h3>
<div class="outline-text-3" id="text-6-1">
<p> This analysis performs evaluation of variant calling using a <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal">tumor/normal somatic sample from the DREAM challenge</a>. To run it, first prepare an S3 bucket for the analysis. Copy the <a href="https://s3.amazonaws.com/bcbio-syn3-eval/cancer-dream-syn3-aws.yaml">configuration file</a> to your own personal bucket and add a <a href="https://www.broadinstitute.org/gatk/">GATK</a> jar. You can use the AWS console or any available S3 client to do this. For example, using the <a href="https://aws.amazon.com/cli/">AWS command line client</a>: </p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">aws s3 mb s3://YOURBUCKET-syn3-eval
aws s3 cp s3://bcbio-syn3-eval/cancer-dream-syn3-aws.yaml s3://YOURBUCKET-syn3-eval
aws s3 cp GenomeAnalysisTK.jar s3://YOURBUCKET-syn3-eval/jars/</code></pre></figure>
<p> Now ssh to the cluster head node, create the work directory and use bcbio_vm to create a batch script that we submit to SLURM. This example uses an attached Lustre filesystem: </p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">bcbio_vm.py elasticluster ssh bcbio
<span class="nb">sudo </span>mkdir <span class="nt">-p</span> /scratch/cancer-dream-syn3-exome
<span class="nb">sudo </span>chown ubuntu <span class="o">!</span><span class="err">$</span>
<span class="nb">cd</span> <span class="o">!</span><span class="nv">$ </span><span class="o">&&</span> mkdir work <span class="o">&&</span> <span class="nb">cd </span>work
bcbio_vm.py ipythonprep s3://YOURBUCKET-syn3-eval/cancer-dream-syn3-aws.yaml <span class="se">\</span>
slurm cloud <span class="nt">-n</span> 60
sbatch bcbio_submit.sh</code></pre></figure>
<p> This runs alignment and variant calling with multiple callers (MuTect, FreeBayes, VarDict and VarScan), validates against the <a href="https://www.synapse.org/#!Synapse:syn312572">DREAM validation dataset truth calls</a> and uploads the results back to S3 in YOURBUCKET-syn3-eval/final. </p>
</div>
</div>
<div id="outline-container-sec-6-2" class="outline-3">
<h3 id="sec-6-2">Running the RNA-seq evaluation</h3>
<div class="outline-text-3" id="text-6-2">
<p> This example runs an RNA-seq analysis using inputs from <a href="http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html">the Sequencing Quality Control (SEQC) project</a>. Full details on the analysis are available in the <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#rnaseq-example">bcbio example run documentation</a>. To setup the run, we copy the input configuration from a publicly available S3 bucket into your own personal bucket: </p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">aws s3 mb s3://YOURBUCKET-eval-rna-seqc/
aws s3 cp s3://bcbio-eval-rna-seqc/eval-rna-seqc.yaml s3://YOURBUCKET-eval-rna-seqc/</code></pre></figure>
<p> Now ssh to the cluster head node, create the work directory and use bcbio_vm to run on 32 cores using the attached EBS SSD filesystem: </p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">bcbio_vm.py elasticluster ssh bcbio
mkdir <span class="nt">-p</span> ~/run/eval-rna-seqc/work
<span class="nb">cd</span> ~/run/eval-rna-seqc/work
bcbio_vm.py run s3://YOURBUCKET-eval-rna-seqc/eval-rna-seqc.yaml <span class="nt">-n</span> 32</code></pre></figure>
<p> This will process three replicates from two different SEQC panels, performing adapter trimming, alignment with <a href="https://github.com/alexdobin/STAR">STAR</a> and produce counts, <a href="http://cufflinks.cbcb.umd.edu/">Cufflinks quantitation</a> and quality control metrics. The results upload back into your initial S3 bucket as YOURBUCKET-eval-rna-seqc/final, and you can shut down the cluster used for processing. </p>
</div>
</div>
</div>
<div id="outline-container-sec-7" class="outline-2">
<h2 id="sec-7">Costs per hour</h2>
<div class="outline-text-2" id="text-7">
<p> These are the instance costs, per hour, for running a 2 node 64 core cluster and associated Lustre filesystem on AWS. For NFS runs, the compute costs are identical but there will not be any additional instances for the shared filesystem. Other run costs will include EBS volumes, but these are small ($0.13/Gb/month) compared to the instance costs over these time periods. We use S3 for long term storage rather than the Lustre or NFS filesystems. </p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups">
<colgroup>
<col class="left" />
<col class="left" />
<col class="right" />
<col class="left" />
<col class="left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="left"> </th>
<th scope="col" class="left">AWS type</th>
<th scope="col" class="right">n</th>
<th scope="col" class="left">each</th>
<th scope="col" class="left">total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="left">compute entry node</td>
<td class="left">c3.large</td>
<td class="right">1</td>
<td class="left">$0.11</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left">compute worker nodes</td>
<td class="left">c3.8xlarge</td>
<td class="right">2</td>
<td class="left">$1.68</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left"> </td>
<td class="left"> </td>
<td class="right"> </td>
<td class="left"> </td>
<td class="left">$3.47/hr</td>
</tr>
<tr>
<td class="left">ost (object data store)</td>
<td class="left">c3.2xlarge</td>
<td class="right">4</td>
<td class="left">$0.42</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left">mdt (metadata target)</td>
<td class="left">c3.4xlarge</td>
<td class="right">1</td>
<td class="left">$0.84</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left">mgt (management target)</td>
<td class="left">c3.xlarge</td>
<td class="right">1</td>
<td class="left">$0.21</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left">NATDevice</td>
<td class="left">m3.medium</td>
<td class="right">1</td>
<td class="left">$0.07</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left">Lustre licensing</td>
<td class="left"> </td>
<td class="right">1</td>
<td class="left">$0.48</td>
<td class="left"> </td>
</tr>
<tr>
<td class="left"> </td>
<td class="left"> </td>
<td class="right"> </td>
<td class="left"> </td>
<td class="left">$3.28/hr</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="left"> </td>
<td class="left"> </td>
<td class="right"> </td>
<td class="left"> </td>
<td class="left">$6.75/hr</td>
</tr>
</tbody>
</table>
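<p> The subtotals and grand total in the table follow directly from the instance counts and per-hour prices; a quick recomputation (my arithmetic, using the unit prices listed above): </p>

```shell
# Recompute the hourly totals from the table: compute nodes, the Lustre
# filesystem stack, then both combined.
compute=$(awk 'BEGIN { printf "%.2f", 1 * 0.11 + 2 * 1.68 }')
lustre=$(awk 'BEGIN { printf "%.2f", 4 * 0.42 + 0.84 + 0.21 + 0.07 + 0.48 }')
total=$(awk -v c="$compute" -v l="$lustre" 'BEGIN { printf "%.2f", c + l }')
echo "compute: \$${compute}/hr  lustre: \$${lustre}/hr  total: \$${total}/hr"
# prints "compute: $3.47/hr  lustre: $3.28/hr  total: $6.75/hr"
```

<p> At the benchmark's 4:42 runtime, a full tumor/normal exome evaluation on the Lustre setup therefore costs on the order of $30 in instance time. </p>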
</div>
</div>
<div id="outline-container-sec-8" class="outline-2">
<h2 id="sec-8">Work to do</h2>
<div class="outline-text-2" id="text-8">
<p> The bcbio AWS implementation is <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html">freely available and documented</a> and we'll continue to develop and support it. Some of the areas of immediate improvement we hope to tackle in the future include: </p>
<ul class="org-ul">
<li>Supporting encryption at rest on EBS volumes for both NFS and Lustre to meet the <a href="http://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf">security requirements for storing genomic data on AWS</a>. We currently encrypt data stored in S3.
</li>
<li>Run directly on container-based parallel frameworks like Amazon's <a href="http://aws.amazon.com/blogs/aws/ec2-container-service-in-action/">EC2 container service</a>, which is also supported by the <a href="https://coreos.com/docs/running-coreos/cloud-providers/ecs/">CoreOS framework</a>.
</li>
<li>Add spot instance support using <a href="http://clusterk.com/">Clusterk</a> or Elasticluster.
</li>
</ul>
<p> We welcome feedback on the implementation, usage and benchmarking results. </p>
</div>
</div>Brad Chapmanchapmanb@fastmail.comOverview We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using Docker. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without the need to install and configure a local installation. The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses Elasticluster to provision and configure the cluster. We automate the process with the boto Python interface to AWS and Ansible scripts. bcbio-vm isolates code and tools inside a Docker container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data. We provide timing benchmarks for running a full variant calling analysis using bcbio on AWS. The benchmark dataset was a cancer tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from the Sequencing Quality Control (SEQC) project. We developed bcbio on AWS and ran these timing benchmarks thanks to work with great partners. A collaboration with Biogen and Rudy Tanzi's group at MGH funded the development of bcbio on AWS. A second collaboration with Intel Health and Life Sciences and AstraZenenca funded the somatic variant calling benchmarking work. 
We're thankful for all the relationships that make this work possible: John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front line testing and improvements to all the components of the bcbio AWS interaction. Kristina Kermanshahche and Robert Read at Intel provided great support helping us get the Lustre ICEL CloudFormation templates running. Ronen Artzi, Michael Heimlich, and Justin Johnson at AstraZenenca setup Lustre, Gluster and NFS benchmarks using a bcbio StarCluster instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS. Jason Tetrault, Karl Gutwin and Hank Wu at Biogen provided valuable feedback, suggestions and resources for developing bcbio on AWS. Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings. James Cuff, Paul Edmon and the team at Harvard FAS research computing built and administered the Regal Lustre setup used for local testing. John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation. Community feedback and contributions are essential to bcbio development. Architecture The implementation provides both a practical way to run large scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production quality runs. This writeup might feel a bit like a black triangle moment since I also wrote about running bcbio on AWS three years ago. That implementation was a demonstration for small scale usage rather than a production ready system. 
We now have a setup we can support and run on large scale projects thanks to numerous changes in the backend architecture: Amazon, and cloud based providers in general, now provide high end filesystems and networking. Our AWS runs are fast because they use SSD backend storage, fast networking connectivity and high end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The dichotomy between local cluster hardware and cloud hardware will continue to expand with upcoming improvements in compute (Haswell processors) and storage (16Tb EBS SSD volumes). Isolating all of the software and code inside Docker containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system specific issues. From a researcher's perspective, this will allow use of bcbio without needing to spend time installing and testing locally. The setup runs from standard Linux base images using Ansible scripts and Elasticluster. This means we no longer need to build and update AMIs for changes in the architecture or code. This simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container specific management services like Amazon's EC2 container service. All long term data storage happens in Amazon's S3 object store, including both analysis specific data as well as general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. 
On the analysis side, you maintain only the input files and high value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage storage of EBS volumes also provides a cost savings ($0.03/Gb/month for S3 versus $0.10+/Gb/month for EBS) and allows the option of archiving in Glacier for long term storage.

All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher friendly interface for setting up and running analyses. We hope to achieve that through the in-development Common Workflow Language from Galaxy, Arvados, Seven Bridges, Taverna and the open bioinformatics community.

Variant calling – benchmarking AWS versus local

We benchmarked somatic variant calling in two environments: the elasticluster Docker AWS implementation and local Harvard FAS machines. AWS processing was twice as fast as a local run. The gains occur in disk IO intensive steps like alignment post-processing. AWS offers the opportunity to rent SSD backed storage and obtain a 10GigE connected cluster without contention for network resources. Our local test machines have an in-production Lustre filesystem attached to a large, highly utilized cluster provided by Harvard FAS research computing. At this scale Lustre and NFS have similar throughput, with Lustre outperforming NFS during IO intensive steps like alignment, post-processing and large BAM file merging. From previous benchmarking work we know we'll need to process additional samples in parallel to fully stress the shared filesystem and differentiate Lustre versus NFS performance. However, the resource plots at this scale show potential future bottlenecks during alignment, post-processing and other IO intensive steps. Generally, having Lustre scaled across 4 LUNs per object storage server (OSS) enables better distribution of disk and network resources. 
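As a quick illustration of the S3 versus EBS storage rates quoted above, here is the monthly cost difference for a hypothetical 500Gb dataset (the dataset size is a made-up example):

```shell
# Monthly cost of keeping 500Gb on S3 versus EBS at the quoted rates
# ($0.03/Gb/month for S3, $0.10/Gb/month for EBS).
size_gb=500
s3=$(awk -v g="$size_gb" 'BEGIN { printf "%.2f", g * 0.03 }')
ebs=$(awk -v g="$size_gb" 'BEGIN { printf "%.2f", g * 0.10 }')
echo "S3: \$${s3}/month versus EBS: \$${ebs}/month"
```

The gap widens further with Glacier archiving for data that is rarely accessed.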
AWS runs use two c3.8xlarge instances clustered in a single placement group, providing 32 cores and 60Gb of memory per machine. Our local run was comparable: two compute machines, each with 32 cores and 128Gb of memory, connected to a Lustre filesystem. The benchmark is a cancer tumor/normal evaluation consisting of alignment, recalibration, realignment and variant detection with four different callers. The input is a tumor/normal pair from the ICGC-TCGA DREAM challenge with 100x coverage in exome regions. Here are the times, in hours, for each benchmark:

                            AWS (Lustre)   AWS (NFS)   Local (Lustre)
Total                       4:42           5:05        10:30
genome data preparation     0:04           0:10        -
alignment preparation       0:12           0:15        -
alignment                   0:29           0:52        0:53
callable regions            0:44           0:44        1:25
alignment post-processing   0:13           0:21        4:36
variant calling             2:35           2:05        2:36
variant post-processing     0:05           0:03        0:22
prepped BAM merging         0:03           0:18        0:06
validation                  0:05           0:05        0:09
population database         0:06           0:07        0:09

To provide more insight into the timing differences between these benchmarks, we automated collection and plotting of resource usage on AWS runs.

Variant calling – resource usage plots

bcbio retrieves collectl usage statistics from the server and prepares graphs of CPU, memory, disk and network usage. These plots allow in-depth insight into limiting factors during individual steps in the workflow. We'll highlight some interesting comparisons between NFS and Lustre from the variant calling benchmarking. During this benchmark, the two critical resources were CPU usage and disk IO on the shared filesystems. We also measured memory usage, but it was not a limiting factor in these analyses. In addition to the comparisons highlighted below, we have the full set of resource usage graphs available for each run:

- Variant calling with NFS on AWS
- Variant calling with Lustre on AWS
- RNA-seq on a single machine on AWS

CPU

These plots compare CPU usage during processing steps for Lustre and NFS. 
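As a quick sanity check of the "twice as fast" figure, the totals in the table can be converted from h:mm to minutes and compared:

```shell
# Ratio of the local Lustre total (10:30) to the AWS Lustre total (4:42),
# using the numbers from the benchmark table above.
speedup=$(awk 'BEGIN {
  split("4:42", a, ":");  aws   = a[1] * 60 + a[2]   # AWS (Lustre) total
  split("10:30", l, ":"); local = l[1] * 60 + l[2]   # Local (Lustre) total
  printf "%.1f", local / aws
}')
echo "local/AWS speedup: ${speedup}x"
```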
The largest differences between the two runs are in the alignment, alignment post-processing and variant calling steps:

(NFS and Lustre CPU usage plots)

For alignment and alignment post-processing the Lustre runs show more stable CPU usage. NFS spends more time in the CPU wait state (red line) during IO intensive steps. On larger scale projects this may become a limiting factor in processing throughput. The variant calling step was slower on Lustre than NFS, with inconsistent CPU usage. We'll have to investigate this slowdown further, since no other metrics point to an obvious bottleneck.

Shared filesystem network usage and IO

These plots compare network usage during processing for Lustre and NFS. We use this as a consistent proxy for the performance of the shared filesystem and disk IO (the NFS plots do have directly measured disk IO for comparison purposes).

(NFS and Lustre network usage plots)

The biggest difference in the IO intensive steps is that Lustre network usage is smoother than the spiky NFS input/output, due to spreading reads and writes over multiple disks. Including more processes with additional read/writes will help determine how these differences translate to scaling on larger numbers of simultaneous samples.

RNA-seq benchmarking

We also ran an RNA-seq analysis using 4 samples from the Sequencing Quality Control (SEQC) project. Each sample has 15 million 100bp paired reads. bcbio handled trimming, alignment with STAR, and quantitation with DEXSeq and Cufflinks. We ran on a single AWS c3.8xlarge machine with 32 cores, 60Gb of memory, and attached SSD storage. RNA-seq optimization in bcbio is at an earlier stage than variant calling. We've done work to speed up trimming and alignment, but haven't yet optimized the expression and count steps. 
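One straightforward way to speed up serial per-sample steps like these is to launch them as concurrent jobs from the shell. A minimal sketch, in which `echo` stands in for the real per-sample quantification command:

```shell
# Run the per-sample step for all four samples concurrently instead of
# serially, then wait for the slowest one. The echo is a stand-in for the
# actual per-sample command, not bcbio's real invocation.
out=$(
    for sample in SEQC1 SEQC2 SEQC3 SEQC4; do
        ( echo "quantifying $sample" ) &   # one background job per sample
    done
    wait
)
echo "$out"
```

bcbio itself manages parallelization through its own scheduler, so this is only an illustration of the sample-splitting idea.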
The analysis runs quickly, in 6 1/2 hours, but there is still room for further optimization, and this is a nice example of how we can use benchmarking plots to identify targets for additional work:

Total                     6:25
genome data preparation   0:32
adapter trimming          0:32
alignment                 0:24
estimate expression       3:41
quality control           1:16

The RNA-seq collectl plots show the cause of the slower steps during expression estimation and quality control. Here is CPU usage over the run: the low CPU usage during the first 2 hours of expression estimation corresponds to DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample. Similarly, the QC steps could benefit from parallel processing. Alternatively, we're looking at validating other approaches to quantification like eXpress. These are the types of benchmarking and validation steps that are continually ongoing in the development of bcbio pipelines.

Reproducing the analysis

The process to launch the cluster and an NFS or optional Lustre shared filesystem is fully automated and documented. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. bcbio_vm.py has commands to:

- Add an IAM user, create a VPC and generate the Elasticluster config.
- Launch a cluster and bootstrap it with the latest bcbio code and data.
- Create and mount a Lustre filesystem attached to the cluster.
- Terminate the cluster and Lustre stack upon completion.

The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using IAM instance profiles. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation. 
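The cluster lifecycle steps listed above roughly correspond to a sequence of `bcbio_vm.py` invocations. The subcommand names below are assumptions reconstructed from the description, not verified against the tool (check `bcbio_vm.py --help` for the exact interface); nothing is executed here, the commands are only printed:

```shell
# Assumed bcbio_vm.py subcommands for the lifecycle described above.
# These names are illustrative guesses; consult the bcbio AWS docs.
cmds=$(cat <<'EOF'
bcbio_vm.py aws iam            # add an IAM user with analysis credentials
bcbio_vm.py aws vpc            # set up the VPC and Elasticluster config
bcbio_vm.py aws cluster start  # launch the cluster and bootstrap bcbio
bcbio_vm.py aws icel create    # create and mount the Lustre (ICEL) stack
bcbio_vm.py aws cluster stop   # terminate the cluster when finished
EOF
)
printf '%s\n' "$cmds"
```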
Running the somatic variant calling evaluation

This analysis evaluates variant calling using a tumor/normal somatic sample pair from the DREAM challenge. To run it, prepare an S3 bucket for the analysis. Copy the configuration file to your own personal bucket and add a GATK jar. You can use the AWS console or any available S3 client to do this. For example, using the AWS command line client:

Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools2014-10-07T12:53:00+00:002014-10-07T12:53:00+00:00http://bcb.io/2014/10/07/joint-calling<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Incremental joint variant calling</h2>
<div class="outline-text-2" id="text-1">
<p> Variant calling in large populations is challenging due to the difficulty in providing a consistent set of calls at all possible variable positions. A finalized set of calls from a large population should distinguish reference calls, without a variant, from no calls, positions without enough read support to make a call. Calling algorithms should also be able to make use of information from other samples in the population to improve sensitivity and precision. </p>
<p> There are two issues with trying to provide complete combined call sets. First, it is computationally expensive to call a large number of samples simultaneously. Second, adding any new samples to a callset requires repeating this expensive computation. This <a href="http://gatkforums.broadinstitute.org/discussion/4150/should-i-analyze-my-samples-alone-or-together">N+1 problem</a> highlights the inflexibility around simultaneous pooled calling of populations. </p>
<p> The GATK team's recent 3.x release has a solution to these issues: <a href="https://www.broadinstitute.org/gatk/guide/article?id=3893">Incremental joint variant discovery</a>. The approach calls samples independently but produces a <a href="https://www.broadinstitute.org/gatk/guide/article?id=4017">genomic VCF (gVCF)</a> output for each individual that contains probability information for both variants and reference calls at non-variant positions. The genotyping step combines these individual gVCF files, making use of the information from the independent samples to produce a final callset. </p>
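The two-step flow can be sketched as GATK 3.x command lines (tool and flag names follow the linked GATK documentation; file names are placeholders for the trio used later in this post, and the commands are assembled as strings rather than executed):

```shell
ref=GRCh37.fa
# Step 1: call each sample independently, emitting a gVCF that records
# reference confidence at non-variant positions (repeat per sample).
hc="java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref -I NA12878.bam --emitRefConfidence GVCF -o NA12878.g.vcf"
# Step 2: joint genotyping across the independently produced gVCFs.
gg="java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $ref --variant NA12878.g.vcf --variant NA12891.g.vcf --variant NA12892.g.vcf -o trio-joint.vcf"
printf '%s\n%s\n' "$hc" "$gg"
```

Adding a new sample only requires producing its gVCF and re-running the comparatively cheap second step, which is what resolves the N+1 problem.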
<p> We added GATK incremental joint calling to <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> along with a <a href="https://github.com/chapmanb/bcbio.variation.recall">generalized implementation</a> that performs joint calling with other variant callers. Practically, bcbio now supports this approach with four variant callers: </p>
<ul class="org-ul">
<li><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">GATK HaplotypeCaller</a> (3.2-2) – Follows current GATK recommended best practices for calling, with Variant Quality Score Recalibration used on whole genome and large population callsets. This uses individual sample gVCFs as inputs to joint calling.
</li>
<li><a href="https://github.com/ekg/freebayes">FreeBayes (0.9.14-15)</a> – A haplotype-based caller from Erik Garrison in Gabor Marth's lab.
</li>
<li><a href="http://www.well.ox.ac.uk/platypus">Platypus (0.7.9.2)</a> – A recently published haplotype-based variant caller from Andy Rimmer at the Wellcome Trust Centre for Human Genomics.
</li>
<li><a href="http://www.htslib.org/">samtools (1.0)</a> – The recently released version of samtools and bcftools with a new <a href="http://samtools.github.io/bcftools/call-m.pdf">multiallelic calling method</a>. John Marshall, Petr Danecek, James Bonfield and Martin Pollard at Sanger have continued samtools development from Heng Li's code base.
</li>
</ul>
<p> The implementation includes integrated validation against the <a href="http://genomeinabottle.org/">Genome in a Bottle</a> NA12878 reference standard, allowing comparisons between joint calling, multi-sample pooled calling and single sample calling. Sensitivity and precision for joint calling is comparable to pooled calling, suggesting we should optimize design of variant processing to cater towards individual calling and subsequent finalization of calls, rather than pooling. Generalized joint calling enables combining multiple sets of calls under an identical processing framework, which will be important as we seek to integrate large publicly available populations to extract biological signal in complex multi-gene diseases. </p>
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Terminology</h2>
<div class="outline-text-2" id="text-2">
<p> There is not a consistent set of terminology around combined variant calling, but to develop one, here is how I'll use the terms: </p>
<ul class="org-ul">
<li>Joint calling – Calling a group of samples together with algorithms that do not need simultaneous access to all population BAM files. GATK's incremental joint calling uses gVCF intermediates. Our generalized implementation performs recalling using individual BAMs supplemented with a combined VCF file of variants called in all samples.
</li>
<li>Pooled or batch calling – Traditional grouped sample calling, where algorithms make use of read data from all BAM files of a group. This scales to smaller batches of samples.
</li>
<li>Single sample calling – Variant calling with a single sample only, not making use of information from other samples.
</li>
<li>Squaring off or Backfilling – Creating a VCF file from a group of samples that distinguishes reference from no-call at every position called as a variant in one of the samples. With a squared off VCF, we can use the <a href="http://cdn.vanillaforums.com/gatk.vanillaforums.com/FileUpload/9f/f0619642db06b73b599253f42ef2bf.png">sample matrix</a> to consider call rate at any position. Large populations called in smaller batches will not be able to distinguish reference from no-call at variants unique to each sub-pool, so will need to be re-processed to achieve this.
</li>
</ul>
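To make "squaring off" concrete, here is a toy example (made-up positions and coverage intervals, not real data) that takes the union of per-sample variant positions and fills in covered reference (0/0) versus no-call (./.) for each sample:

```shell
# Two samples' variant calls: chromosome, position, genotype.
cat > s1.calls <<'EOF'
chr1 100 1/1
chr1 300 0/1
EOF
cat > s2.calls <<'EOF'
chr1 200 0/1
chr1 300 1/1
EOF
# Square off over the union of positions. Sample 1 has read coverage over
# 100-300, sample 2 only over 200-300, so position 100 is a no-call for s2.
squared=$(awk '
  BEGIN { split("100 200 300", p, " ")
          lo[1] = 100; hi[1] = 300
          lo[2] = 200; hi[2] = 300 }
  { gt[FILENAME, $2] = $3 }
  END {
    for (i = 1; i <= 3; i++) {
      line = "chr1:" p[i]
      for (s = 1; s <= 2; s++) {
        f = "s" s ".calls"
        if ((f, p[i]) in gt)                             g = gt[f, p[i]]
        else if (p[i] + 0 >= lo[s] && p[i] + 0 <= hi[s]) g = "0/0"  # covered reference
        else                                             g = "./."  # no data: no-call
        line = line " " g
      }
      print line
    }
  }' s1.calls s2.calls)
echo "$squared"
rm -f s1.calls s2.calls
```

The resulting matrix distinguishes "confidently reference" from "no data" at every variant position, which is exactly what batched sub-pool calling cannot do without re-processing.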
</div>
</div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">Implementation</h2>
<div class="outline-text-2" id="text-3">
<p> <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> automates the calling and validation used in this comparison. We aim to make it easy to install, use and extend. </p>
<p> For GATK HaplotypeCaller based joint genotyping, we implement the <a href="http://www.broadinstitute.org/gatk/guide/article?id=3893">GATK best practices</a> recommended by the Broad. Individual sample variant calls produce a gVCF output file that contains both variants as well as probability information about reference regions. Next, variants are jointly called using GenotypeGVCFs to produce the final population VCF file. </p>
<p> For the other supported callers – FreeBayes, Platypus and samtools – we use a generalized recalling approach, implemented in <a href="https://github.com/chapmanb/bcbio.variation.recall">bcbio.variation.recall</a>. <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> first calls each individual sample as a standard VCF. We then combine these individual sample VCFs into a global summary of all variant positions called across all samples. Finally we recall at each potential variant position, producing a globally squared off joint callset for the sample that we merge into the final joint VCF. This process parallelizes by chromosome region and by sample, allowing efficient use of resources in both clusters and <a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html">large multiple core machines</a>. </p>
<p> bcbio.variation.recall generalizes to any variant caller that supports recalling with an input set of variants. Knowing the context of potential variants helps inform better calling. This method requires having the individual sample BAM file available to perform recalling. Having the reads present does provide the ability to improve recalling by taking advantage of realigning reads into haplotypes given known variants, an approach we'll explore more in future work. The implementation is also general and could support gVCF based combining as this becomes available for non-GATK callers. </p>
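As a sketch of the recalling flow just described, the commands below are built as strings and printed rather than executed. The FreeBayes flags (`--variant-input`, `--only-use-input-alleles`) and the bcftools merge step come from those tools' documentation and are assumptions about how a recalling loop could look, not bcbio.variation.recall's internal invocation; sample names and paths are placeholders:

```shell
# Illustrative recalling loop: independent calls, a global variant summary,
# then per-sample recalling at every potential variant position.
samples="S1 S2 S3"
out=$(
    echo "# 1. independent per-sample calling"
    for s in $samples; do
        echo "freebayes -f ref.fa ${s}.bam > ${s}.vcf"
        echo "bgzip ${s}.vcf && tabix -p vcf ${s}.vcf.gz"
    done
    echo "# 2. combine into a global summary of all called positions"
    echo "bcftools merge S1.vcf.gz S2.vcf.gz S3.vcf.gz -O z -o union.vcf.gz"
    echo "# 3. recall each sample at every potential variant position"
    for s in $samples; do
        echo "freebayes -f ref.fa --variant-input union.vcf.gz --only-use-input-alleles ${s}.bam > ${s}-joint.vcf"
    done
)
printf '%s\n' "$out"
```

Both loops parallelize naturally by sample, and the whole procedure additionally parallelizes by chromosome region as described above.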
</div>
</div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4">Generalized joint calling</h2>
<div class="outline-text-2" id="text-4">
<p> We evaluated all callers against the NA12878 Genome in a Bottle reference standard using the <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">NA12878/NA12891/NA12892 trio</a> from the <a href="http://blog.goldenhelix.com/wp-content/uploads/2013/03/Utah-Pedigree-1463-with-NA12878.png">CEPH 1463 Pedigree</a>, with 50x whole genome coverage from <a href="http://www.illumina.com/platinumgenomes/">Illumina's platinum genomes</a>. The validation provides putative true positives (concordant), false negatives (discordant missing), and false positives (discordant extra) for all callers: </p>
<p><a href="http://i.imgur.com/ddsRkkd.png"><br />
<img src="http://i.imgur.com/ddsRkkd.png" width="650" alt="Incremental joint calling: GATK HaplotypeCaller, FreeBayes, Platypus, samtools" /><br />
</a></p>
<p> Overall, there is not a large difference in sensitivity and precision for the four methods, giving us four high-quality options for performing joint variant calling on germline samples. The post-calling filters provide similar levels of false positives to enable comparisons of sensitivity. Notably, samtools' new calling method is now as good as other approaches, in contrast with <a href="http://bcb.io/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/">previous evaluations</a>, demonstrating the value of continuing to improve open source tools and having updated benchmarks to reflect these improvements. </p>
<p> Improving sensitivity and precision is always an ongoing process and this evaluation identifies some areas to focus on for future work: </p>
<ul class="org-ul">
<li>Platypus SNP and indel calling is slightly less sensitive than other approaches. We worked on <a href="https://github.com/chapmanb/bcbio-nextgen/blob/9320479d8f21677b61ed1274b4da23d569c686ae/bcbio/variation/platypus.py#L29">Platypus calling parameters</a> and <a href="https://github.com/chapmanb/bcbio-nextgen/blob/9320479d8f21677b61ed1274b4da23d569c686ae/bcbio/variation/vfilter.py#L180">post-call filtering</a> to increase sensitivity from the defaults without introducing a large number of false positives, but welcome suggestions for more improvements.
</li>
<li>samtools indel calling needs additional work to reduce false positive indels in joint and pooled calling. There is more detail on this below in the comparison with single sample samtools calling.
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5">Joint versus pooled versus single approaches</h2>
<div class="outline-text-2" id="text-5">
<p> We validated the same NA12878 trio with pooled and single sample calling to assess the advantages of joint calling over single sample, and whether joint calling is comparable in quality to calling simultaneously. The full evaluation for pooled calling shows that performance is similar to joint methods: </p>
<p><a href="http://i.imgur.com/Dna8hrI.png"><br />
<img src="http://i.imgur.com/Dna8hrI.png" width="650" alt="Pooled calling: GATK HaplotypeCaller, FreeBayes, Platypus, samtools" /><br />
</a></p>
<p> If you plot joint, pooled and single sample calling next to each other there are some interesting small differences between approaches that identify areas for further improvement. As an example, here are GATK HaplotypeCaller and samtools with the three approaches presented side by side: </p>
<p><a href="http://i.imgur.com/bcCvXxP.png"><br />
<img src="http://i.imgur.com/bcCvXxP.png" width="750" alt="Joint, pooled and single calling: GATK HaplotypeCaller and samtools" /><br />
</a></p>
<p> GATK HaplotypeCaller sensitivity and precision are close between the three methods, with small trade offs for different methods. For SNPs, pooled calling is most sensitive at the cost of more false positives, and single calling is more precise at the cost of some sensitivity. Joint calling is intermediate between these two extremes. For indels, joint calling is the most sensitive at the cost of more false positives, with pooled calling falling between joint and single sample calling. </p>
<p> For samtools, precision is currently best tuned for single sample calling. Pooled calling provides better sensitivity, but at the cost of a larger number of false positives. The joint calling implementation regains a bit of this sensitivity but still suffers from increased false positives. The authors of samtools tuned variant calling nicely for single samples, but there are opportunities to increase sensitivity when incorporating multiple samples via a joint method. </p>
<p> Generally, we don't expect the same advantages for pooled or joint calling in a trio as we'd see in a larger population. However, even for this small evaluation population we can see the improvements available by considering additional variant information from other samples. For Platypus we unexpectedly had better calls from joint calling compared to pooled calling, but expect these differences to harmonize over time as the tools continue to improve. </p>
<p> Overall, this comparison identifies areas where we can hope to improve generalized joint calling. We plan to provide specific suggestions and feedback to samtools, Platypus and other tool authors as part of a continuous validation and feedback process. </p>
</div>
</div>
<div id="outline-container-sec-6" class="outline-2">
<h2 id="sec-6">Reproducing and extending the analysis</h2>
<div class="outline-text-2" id="text-6">
<p> All variant callers and calling methods validated here are available for running in <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a>. bcbio automatically installs the generalized joint calling implementation, and it is also available as a java executable at <a href="https://github.com/chapmanb/bcbio.variation.recall">bcbio.variation.recall</a>. All tools are freely available, open source and community developed and we welcome your feedback and contributions. </p>
<p> The documentation contains <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#whole-genome-trio-50x">full instructions for running the joint analysis</a>. This is an extended version of <a href="http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/">previous work on validation of trio calling</a> and uses the same input dataset with a bcbio configuration that includes single, pooled and joint calling: </p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">mkdir <span class="nt">-p</span> NA12878-trio-eval/config NA12878-trio-eval/input NA12878-trio-eval/work-joint
<span class="nb">cd </span>NA12878-trio-eval/config
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-joint.yaml
<span class="nb">cd</span> ../input
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh
bash NA12878-trio-wgs-validate-getdata.sh
<span class="nb">cd</span> ../work-joint
bcbio_nextgen.py ../config/NA12878-trio-wgs-joint.yaml <span class="nt">-n</span> 16</code></pre></figure>
<p> Having a general joint calling implementation with good sensitivity and precision is a starting point for more research and development. To build off this work we plan to: </p>
<ul class="org-ul">
<li>Provide better <a href="http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/">ensemble calling methods</a> that scale to large multi-sample calling projects.
</li>
<li>Work with FreeBayes, Platypus and samtools tool authors to provide support for gVCF style files to avoid the need to have BAM files present during joint calling, and to improve sensitivity and precision during recalling-based joint approaches.
</li>
<li>Combine variant calls with local reassembly to improve sensitivity and precision. Erik Garrison's <a href="https://github.com/ekg/glia">glia</a> provides streaming local realignment given a set of potential variants. Jared Simpson used the <a href="https://github.com/jts/sga">SGA assembler</a> to combine <a href="https://github.com/jts/sga-extra">FreeBayes calls with de-novo assembly</a>. Ideally we could identify difficult regions of the genome <a href="http://knightlab.commons.yale.edu/gava-pt-2/">based on alignment information</a> and focus more computationally expensive assembly approaches there.
</li>
</ul>
<p> We plan to continue working with the open source scientific community to integrate, extend and improve these tools and are happy for any feedback and suggestions. </p>
</div>
</div>Brad Chapmanchapmanb@fastmail.comIncremental joint variant calling Variant calling in large populations is challenging due to the difficulty in providing a consistent set of calls at all possible variable positions. A finalized set of calls from a large population should distinguish reference calls, without a variant, from no calls, positions without enough read support to make a call. Calling algorithms should also be able to make use of information from other samples in the population to improve sensitivity and precision. There are two issues with trying to provide complete combined call sets. First, it is computationally expensive to call a large number of samples simultaneously. Second, adding any new samples to a callset requires repeating this expensive computation. This N+1 problem highlights the inflexibility around simultaneous pooled calling of populations. The GATK team's recent 3.x release has a solution to these issues: Incremental joint variant discovery. The approach calls samples independently but produces a genomic VCF (gVCF) output for each individual that contains probability information for both variants and reference calls at non-variant positions. The genotyping step combines these individual gVCF files, making use of the information from the independent samples to produce a final callset. We added GATK incremental joint calling to bcbio-nextgen along with a generalized implementation that performs joint calling with other variant callers. Practically, bcbio now supports this approach with four variant callers: GATK HaplotypeCaller (3.2-2) – Follows current GATK recommended best practices for calling, with Variant Quality Score Recalibration used on whole genome and large population callsets. This uses individual sample gVCFs as inputs to joint calling. FreeBayes (0.9.14-15) – A haplotype-based caller from Erik Garrison in Gabor Marth's lab. 
Platypus (0.7.9.2) – A recently published haplotype-based variant caller from Andy Rimmer at the Wellcome Trust Centre for Human Genomics. samtools (1.0) – The recently released version of samtools and bcftools with a new multiallelic calling method. John Marshall, Petr Danecek, James Bonfield and Martin Pollard at Sanger have continued samtools development from Heng Li's code base. The implementation includes integrated validation against the Genome in a Bottle NA12878 reference standard, allowing comparisons between joint calling, multi-sample pooled calling and single sample calling. Sensitivity and precision for joint calling is comparable to pooled calling, suggesting we should optimize design of variant processing to cater towards individual calling and subsequent finalization of calls, rather than pooling. Generalized joint calling enables combining multiple sets of calls under an identical processing framework, which will be important as we seek to integrate large publicly available populations to extract biological signal in complex multi-gene diseases. Terminology There is not a consistent set of terminology around combined variant calling, but to develop one, here is how I'll use the terms: Joint calling – Calling a group of samples together with algorithms that do not need simultaneous access to all population BAM files. GATK's incremental joint calling uses gVCF intermediates. Our generalized implementation performs recalling using individual BAMs supplemented with a combined VCF file of variants called in all samples. Pooled or batch calling – Traditional grouped sample calling, where algorithms make use of read data from all BAM files of a group. This scales to smaller batches of samples. Single sample calling – Variant calling with a single sample only, not making use of information from other samples. 
Squaring off or Backfilling – Creating a VCF file from a group of samples that distinguishes reference from no-call at every position called as a variant in one of the samples. With a squared off VCF, we can use the sample matrix to consider call rate at any position. Large populations called in smaller batches will not be able to distinguish reference from no-call at variants unique to each sub-pool, so will need to be re-processed to achieve this. Implementation bcbio-nextgen automates the calling and validation used in this comparison. We aim to make it easy to install, use and extend. For GATK HaplotypeCaller based joint genotyping, we implement the GATK best practices recommended by the Broad. Individual sample variant calls produce a gVCF output file that contains both variants as well as probability information about reference regions. Next, variants are jointly called using GenotypeGVFs to produce the final population VCF file. For the other supported callers – FreeBayes, Platypus and samtools – we use a generalized recalling approach, implemented in bcbio.variation.recall. bcbio-nextgen first calls each individual sample as a standard VCF. We then combine these individual sample VCFs into a global summary of all variant positions called across all samples. Finally we recall at each potential variant position, producing a globally squared off joint callset for the sample that we merge into the final joint VCF. This process parallelizes by chromosome region and by sample, allowing efficient use of resources in both clusters and large multiple core machines. bcbio.variation.recall generalizes to any variant caller that supports recalling with an input set of variants. Knowing the context of potential variants helps inform better calling. This method requires having the individual sample BAM file available to perform recalling. 
Having the reads present does provide the ability to improve recalling by taking advantage of realigning reads into haplotypes given known variants, an approach we'll explore more in future work. The implementation is also general and could support gVCF based combining as this becomes available for non-GATK callers. Generalized joint calling We evaluated all callers against the NA12878 Genome in a Bottle reference standard using the NA12878/NA12891/NA12892 trio from the CEPH 1463 Pedigree, with 50x whole genome coverage from Illumina's platinum genomes. The validation provides putative true positives (concordant), false negatives (discordant missing), and false positives (discordant extra) for all callers: Overall, there is not a large difference in sensitivity and precision for the four methods, giving us four high-quality options for performing joint variant calling on germline samples. The post-calling filters provide similar levels of false positives to enable comparisons of sensitivity. Notably, samtools new calling method is now as good as other approaches, in contrast with previous evaluations, demonstrating the value of continuing to improve open source tools and having updated benchmarks to reflect these improvements. Improving sensitivity and precision is always an ongoing process and this evaluation identifies some areas to focus on for future work: Platypus SNP and indel calling is slightly less sensitive than other approaches. We worked on Platypus calling parameters and post-call filtering to increase sensitivity from the defaults without introducing a large number of false positives, but welcome suggestions for more improvements. samtools indel calling needs additional work to reduce false positive indels in joint and pooled calling. There is more detail on this below in the comparison with single sample samtools calling. 
Joint versus pooled versus single approaches

We validated the same NA12878 trio with pooled and single sample calling to assess the advantages of joint calling over single sample calling, and whether joint calling is comparable in quality to calling all samples simultaneously in a pool. The full evaluation for pooled calling shows that performance is similar to the joint methods. Plotting joint, pooled and single sample calling next to each other reveals some interesting small differences between approaches that identify areas for further improvement. As an example, here are GATK HaplotypeCaller and samtools with the three approaches presented side by side:

GATK HaplotypeCaller sensitivity and precision are close between the three methods, with small trade-offs for different methods. For SNPs, pooled calling is most sensitive at the cost of more false positives, and single calling is more precise at the cost of some sensitivity. Joint calling is intermediate between these two extremes. For indels, joint calling is the most sensitive at the cost of more false positives, with pooled calling falling between joint and single sample calling.

For samtools, precision is currently best tuned for single sample calling. Pooled calling provides better sensitivity, but at the cost of a larger number of false positives. The joint calling implementation regains a bit of this sensitivity but still suffers from increased false positives. The authors of samtools tuned variant calling nicely for single samples, but there are opportunities to increase sensitivity when incorporating multiple samples via a joint method.

Generally, we don't expect the same advantages for pooled or joint calling in a trio as we'd see in a larger population. However, even for this small evaluation population we can see the improvements available by considering additional variant information from other samples.
For Platypus we unexpectedly had better calls from joint calling compared to pooled calling, but expect these differences to harmonize over time as the tools continue to improve. Overall, this comparison identifies areas where we can hope to improve generalized joint calling. We plan to provide specific suggestions and feedback to samtools, Platypus and other tool authors as part of a continuous validation and feedback process.

Reproducing and extending the analysis

All variant callers and calling methods validated here are available for running in bcbio-nextgen. bcbio automatically installs the generalized joint calling implementation, and it is also available as a java executable at bcbio.variation.recall. All tools are freely available, open source and community developed and we welcome your feedback and contributions. The documentation contains full instructions for running the joint analysis. This is an extended version of previous work on validation of trio calling and uses the same input dataset with a bcbio configuration that includes single, pooled and joint calling:Validated whole genome structural variation detection using multiple callers2014-08-12T17:22:00+00:002014-08-12T17:22:00+00:00http://bcb.io/2014/08/12/validated-whole-genome-structural-variation-detection-using-multiple-callers<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Structural variant detection goals</h2>
<div class="outline-text-2" id="text-1">
<p> This post describes community based work integrating structural variant calling and validation into <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a>. I've previously written about approaches for <a href="http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/">validating single nucleotide changes (SNPs) and small insertions/deletions (Indels)</a>, but it has always been unfortunate to not have reliable ways to detect larger structural variations: deletions, duplications, inversions, translocations and other disruptive events. Detecting these events with short read sequencing is difficult, and our goal in bcbio is to create a global summary of predicted structural variants from multiple callers alongside measures of sensitivity and precision. </p>
<p> The latest release of bcbio automates structural variant calling with three callers: </p>
<ul class="org-ul">
<li><a href="https://github.com/arq5x/lumpy-sv">LUMPY</a> – a <a href="http://genomebiology.com/2014/15/6/R84/abstract">probabilistic structural variant caller</a> incorporating both split read and read pair discordance, developed by Ryan Layer in <a href="http://quinlanlab.org/">Aaron Quinlan</a> and <a href="http://faculty.virginia.edu/irahall/">Ira Hall's</a> labs. </li>
<li><a href="https://github.com/tobiasrausch/delly">DELLY</a> – an <a href="http://bioinformatics.oxfordjournals.org/content/28/18/i333.abstract">integrated paired-end and split-end structural variant caller</a> developed by Tobias Rausch. </li>
<li><a href="http://www.bioconductor.org/packages/release/bioc/html/cn.mops.html">cn.mops</a> – a <a href="http://nar.oxfordjournals.org/content/40/9/e69">read count based copy number variation (CNV) caller</a> developed by Günter Klambauer. </li>
</ul>
<p> bcbio integrates structural variation predictions from all approaches into a high level BED file. This is a first pass way to identify potentially disruptive large scale events. Here are example regions: a duplication called by all 3 callers, a deletion called by 2 callers, and a complex region with both deletions and duplications. </p>
<pre class="example">
9 139526855 139527537 DUP_delly,DUP_lumpy,cnv3_cn_mops
10 99034861 99037400 DEL_delly,cnv0_cn_mops,cnv1_cn_mops
12 8575814 8596742 BND_lumpy,DEL_delly,DEL_lumpy,DUP_delly,DUP_lumpy,cnv1_cn_mops,cnv3_cn_mops
</pre>
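<p> Rows like these can be produced by flattening the per-caller intervals and collapsing any that overlap, in the spirit of a bedtools merge. A minimal sketch with invented coordinates and caller labels: </p>

```python
# Sketch of collapsing structural variant intervals from multiple callers into
# combined summary rows, bedtools-merge style. Each input is
# (chrom, start, end, label); overlapping intervals on the same chromosome are
# merged and their labels joined. Coordinates and labels below are invented.

def merge_calls(intervals):
    merged = []
    for chrom, start, end, label in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end), prev[3] | {label})
        else:
            merged.append((chrom, start, end, {label}))
    return [(c, s, e, ",".join(sorted(labels))) for c, s, e, labels in merged]

calls = [("9", 100, 500, "DUP_delly"),
         ("9", 150, 480, "DUP_lumpy"),
         ("9", 90, 510, "cnv3_cn_mops"),
         ("10", 700, 900, "DEL_delly")]
rows = merge_calls(calls)
```

<p> This reproduces the shape of the combined BED output: one row per merged region, with the supporting caller evidence collapsed into a comma-separated label. </p>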
<p>
<p> This output associates larger structural events with regions of interest in a high level way, while allowing us to quickly determine the individual tool support for each event. Using this, we are no longer blind to potential structural changes and can use the summary to target in-depth investigation with the more detailed metrics available from each caller and a read viewer like <a href="http://melissagymrek.com/pybamview/">PyBamView</a>. The results can also help inform prioritization of SNP and Indel calls since structural rearrangements often associate with false positives. Longer term we hope this structural variant summary and comparison work will be useful for community validation efforts like the <a href="http://genomicsandhealth.org/">Global Alliance for Genomics and Health benchmarking group</a>, the <a href="http://www.genomeinabottle.org/">Genome in a Bottle consortium</a> and the <a href="https://www.synapse.org/#!Synapse:syn312572">ICGC-TCGA DREAM Mutation Calling challenge</a>. </p>
<p> Below I'll describe a full evaluation of the sensitivity and precision of this combined approach using an NA12878 trio, as well as describe how to run and extend this work using <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a>. </p>
<p> This blog post is the culmination of a lot of work and support from the open source bioinformatics community. <a href="https://twitter.com/dfjenkins3">David Jenkins</a> worked with <a href="http://compbio.sph.harvard.edu/chb/">our group</a> for the summer and headed up evaluation of structural variation results. We received wonderful support from Colby Chang, Ryan Layer, Ira Hall and Aaron Quinlan on both LUMPY and structural variation validation in general. They freely shared scripts and datasets for evaluation, which gave us the materials to make these comparisons. Günter Klambauer gave us great practical advice on using cn.mops. Tobias Rausch helped immensely with tips for speeding up DELLY on whole genomes, and Ashok Ragavendran from <a href="http://talkowski.mgh.harvard.edu/people/">Mike Talkowski's lab</a> generously discussed tricks for scaling DELLY runs. <a href="https://rc.fas.harvard.edu/">Harvard Research Computing</a> provided critical support and compute for this analysis as part of a collaboration with <a href="https://01.org/">Intel</a>. </p>
</p></div>
</p></div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Evaluation</h2>
<div class="outline-text-2" id="text-2">
<p> To validate the output of this combined structural variant calling approach we used a set of over 4000 validated deletions made available by Ryan Layer as part of the <a href="http://genomebiology.com/2014/15/6/R84/abstract">LUMPY manuscript</a>. These are deletion calls in <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">NA12878</a> with secondary support evidence from <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131209_na12878_moleculo/README_na12878_moleculo_20131209">Moleculo</a> and <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131209_na12878_pacbio/Schadt/README.NA12878_PacBio_data_from_Schadt">PacBio</a> datasets. We further subset these regions by removing calls in <a href="http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/">low complexity regions</a> identified in <a href="http://bioinformatics.oxfordjournals.org/content/early/2014/07/03/bioinformatics.btu356">Heng Li's variant calling artifacts paper</a>. An alternative set of problematic regions are the <a href="https://github.com/cc2qe/speedseq#annotations">potential misassembly regions</a> identified by Colby Chang and Ira Hall during LUMPY and <a href="https://github.com/cc2qe/speedseq">speedseq</a> development (Edit: The original post mistakenly mentioned these overlap significantly with low complexity regions, but that is only due to overlap in obvious problem areas like the N gap regions. We'll need additional work to incorporate both regions into future evaluations. Thanks to Colby for catching this error.). These are a useful proxy for regions we're currently not able to reliably call structural variants in. </p>
<p> We ran all structural variant callers using the <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">NA12878/NA12891/NA12892 trio</a> from the <a href="http://blog.goldenhelix.com/wp-content/uploads/2013/03/Utah-Pedigree-1463-with-NA12878.png">CEPH 1463 Pedigree</a> as an input dataset. This consists of 50x whole genome reads from <a href="http://www.illumina.com/platinumgenomes/">Illumina's platinum genomes</a> project, and is the same dataset used in our previous analyses of population based SNP and small indel calling. </p>
<p> Our goal is to define boundaries on <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">sensitivity</a> – the percentage of validated calls we detect – and <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">precision</a> – how many of the total calls overlap with validated regions. We required a simple overlap of the called regions with validated regions to consider a called variant as validated, and stratified results by event size to quantify detection metrics at different size ranges. </p>
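<p> The simple overlap criterion and size stratification can be sketched as follows; the size bins and example regions here are invented for illustration: </p>

```python
# Sketch of overlap-based validation: a called region counts as validated if it
# overlaps any validated region, and results are stratified by event size.
# Size bins and regions below are hypothetical.

def overlaps(a, b):
    """Half-open interval overlap on (chrom, start, end) tuples."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def validate(called, validated,
             bins=((0, 1000), (1000, 25000), (25000, float("inf")))):
    stats = {b: {"validated": 0, "total": 0} for b in bins}
    for region in called:
        size = region[2] - region[1]
        for b in bins:
            if b[0] <= size < b[1]:
                stats[b]["total"] += 1
                if any(overlaps(region, v) for v in validated):
                    stats[b]["validated"] += 1
    return stats

called = [("1", 100, 600), ("1", 5000, 9000), ("2", 100, 50000)]
validated = [("1", 550, 800), ("2", 40000, 60000)]
stats = validate(called, validated)
```

<p> Sensitivity per bin follows from comparing validated hits against the reference set, and precision from the validated fraction of total calls. </p>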
<p> The comparison highlights the value of providing a combined call set. I'd caution against using this as a comparison between methods. Accurate structural variation calling depends on multiple sources of evidence and we still have work to do in improving our ability to filter for maximal sensitivity and specificity. The ensemble method in the figure below displays results of our final calls, made from collapsing structural variant calls from all three input callers: </p>
<p> <a href="http://i.imgur.com/DOqjHRP.png"> <img src="http://i.imgur.com/DOqjHRP.png" width="650" alt="Structural variant calling sensitivity and precision at different event sizes" /> </a>
<p> Across all size classes, we detect approximately half of the structural variants and expect that about half of the called events are false positives. Smaller structural variants of less than 1kb are the most difficult to detect with these methods. Larger events from 1kb to 25kb have better sensitivity and precision. As the size of the events increases, precision decreases, so larger called events tend to have more false positives. </p>
<p> Beyond the values for sensitivity and precision, the biggest takeaway is that combining multiple callers helps detect additional variants we'd miss with any individual caller. Count based callers like cn.mops enable improved sensitivity on large deletions but don't resolve small deletions at 50x depth using our current settings, although tuning can help detect these smaller sized events as well. Similarly, LUMPY and DELLY capture different sets of variants across all of the size classes. </p>
<p> The comparison also emphasizes the potential for improving both individual caller filtering and ensemble structural variant preparation. The ensemble method uses <a href="http://bedtools.readthedocs.org/en/latest/index.html">bedtools</a> to create a merged superset of all individually called regions. This is the simplest possible approach to combine calls. Similarly, individual caller filters are intentionally simple as well. cn.mops calling performs no additional filtering beyond the defaults, and could use adjustment to detect and filter smaller events. Our <a href="https://github.com/chapmanb/bcbio-nextgen/blob/ac82a1d25cab0c498d645f899abc3af65c8fbbba/bcbio/structural/delly.py#L132">DELLY filter</a> requires 4 supporting reads or both split and paired read evidence. Our <a href="https://github.com/chapmanb/bcbio-nextgen/blob/ac82a1d25cab0c498d645f899abc3af65c8fbbba/bcbio/structural/lumpy.py#L74">LUMPY filter</a> requires at least 4 supporting reads to retain an event. We welcome discussion of the costs and tradeoffs of these approaches. For instance, requiring split and paired evidence for DELLY increases precision at the cost of sensitivity. These filters are a useful starting point, and we hope to continue to refine and improve them over time. </p>
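<p> The filters described above reduce to simple predicates over per-event read support. A hedged sketch — the field names are hypothetical, not the callers' actual output fields: </p>

```python
# Sketch of the simple evidence filters described above: a DELLY-style filter
# keeps events with >= 4 supporting reads or with both split-read and
# paired-end evidence, and a LUMPY-style filter keeps events with >= 4
# supporting reads. Field names are hypothetical illustrations.

def keep_delly(split_reads, paired_reads):
    total = split_reads + paired_reads
    return total >= 4 or (split_reads > 0 and paired_reads > 0)

def keep_lumpy(support_reads):
    return support_reads >= 4

assert keep_delly(split_reads=1, paired_reads=1)      # both evidence types
assert not keep_delly(split_reads=3, paired_reads=0)  # one type, under 4 reads
```

<p> Requiring both evidence types in the DELLY-style predicate is what trades sensitivity for precision, as discussed above. </p>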
</p></div>
</p></div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">Implementation</h2>
<div class="outline-text-2" id="text-3">
<p> <a href="https://github.com/chapmanb/bcbio-nextgen/">bcbio-nextgen</a> handles installation and automation of the programs used in this comparison. The documentation contains <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#structural-variant-calling-whole-genome-trio-50x">instructions to download the data and run the NA12878 trio calling and validation</a>. This <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/NA12878-trio-sv.yaml">input configuration file</a> should be easily adjusted to run on your data of interest. </p>
<p> The current implementation has reasonable run times for whole genome structural variant calling. We use <a href="https://github.com/GregoryFaust/samblaster">samblaster</a> to perform duplicate marking alongside identification of discordant and split read pairs. The aligned reads from <a href="https://github.com/lh3/bwa">bwa</a> stream directly into samblaster, adding minimal processing time to the run. For LUMPY calling, the pre-prepared split and discordant reads feed directly into <a href="https://github.com/cc2qe/speedseq">speedseq</a>, which nicely automates the process of running LUMPY. For DELLY, we <a href="https://github.com/chapmanb/bcbio-nextgen/blob/ac82a1d25cab0c498d645f899abc3af65c8fbbba/bcbio/structural/delly.py#L110">subsample correct pairs in the input BAM to 50 million reads</a> and combine with the pre-extracted problematic pairs to improve runtimes for whole genome inputs. </p>
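<p> The subsampling step amounts to computing a sampling fraction from the total pair count, which would then drive a downsampling tool. A small sketch with invented totals: </p>

```python
# Sketch of computing a subsample fraction to cap concordant pairs at a target
# count (50 million in the description above). The resulting fraction would be
# handed to a downsampling step; pair totals here are invented.

def subsample_fraction(total_pairs, target=50_000_000):
    return min(1.0, target / total_pairs)

frac = subsample_fraction(200_000_000)   # keep a quarter of the pairs
full = subsample_fraction(30_000_000)    # already under target: keep everything
```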
<p> We processed three concurrently called 50x whole genome samples from FASTQ reads to validated structural variants in approximately 3 days using 32 cores. Following the preparation work described above, LUMPY calling took 6 hours, DELLY took 24 hours parallelized on 32 cores and cn.mops took 16 hours parallelized by chromosome on 16 cores. This is a single data point for current capabilities, and is an area where we hope to continue to improve scalability and parallelization. </p>
<p> The implementation and validation are fully integrated into the community developed <a href="https://github.com/chapmanb/bcbio-nextgen/">bcbio-nextgen</a> project and we hope to expand this work to incorporate additional structural variant callers like <a href="https://github.com/genome/pindel">Pindel</a> and <a href="http://cnvkit.readthedocs.org/en/latest/">CNVkit</a>, as well as improving filtering and ensemble calling. We also want to expand structural variant validation to include tumor/normal cancer samples and targeted sequencing. We welcome contributions and suggestions on current and future directions in structural variant calling. </p>
</p></div>
</p></div>Brad Chapmanchapmanb@fastmail.comWhole genome trio variant calling evaluation: low complexity regions, GATK VQSR and high depth filters2014-05-12T10:03:22+00:002014-05-12T10:03:22+00:00http://bcb.io/2014/05/12/wgs-trio-variant-evaluation<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Whole genome trio validation</h2>
<div class="outline-text-2" id="text-1">
<p> I've written previously about the approaches we use to validate the <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> variant calling framework, specifically evaluating <a href="https://bcb.io/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/">aligners and variant calling methods</a> and <a href="https://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/">assessing the impact of BAM post-alignment preparation methods</a>. We're continually looking to improve both the pipeline and validation methods and two recent papers helped advance best-practices for evaluating and filtering variant calls: </p>
<ul class="org-ul">
<li><a href="http://research.mssm.edu/linderman/index.html">Michael Linderman and colleagues</a> describe approaches for <a href="http://www.biomedcentral.com/1755-8794/7/20/abstract">validating clinical exome and whole genome sequencing results</a>. One key result I took from the paper was the difference in assessment between exome and whole genome callsets. Coverage differences due to capture characterize discordant exome variants, while complex genome regions drive whole genome discordants. Reading this paper pushed us to evaluate whole genome population based variant calling, which is now feasible due to improvements in bcbio-nextgen scalability. </li>
<li><a href="https://twitter.com/lh3lh3">Heng Li</a> identified <a href="http://arxiv.org/abs/1404.0929">variant calling artifacts and proposed filtering approaches</a> to remove them, as well as characterizing caller error rates. We'll investigate two of the filters he proposed: removing variants in low complexity regions, and filtering high depth variants. </li>
</ul>
<p> We use the <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">NA12878/NA12891/NA12892 trio</a> from the <a href="http://blog.goldenhelix.com/wp-content/uploads/2013/03/Utah-Pedigree-1463-with-NA12878.png">CEPH 1463 Pedigree</a> as an input dataset, consisting of 50x whole genome reads from <a href="http://www.illumina.com/platinumgenomes/">Illumina's platinum genomes</a>. This enables both whole genome comparisons, as well as pooled family calling that replicates best-practice for calling within populations. We aligned reads using <a href="http://bio-bwa.sourceforge.net/">bwa-mem</a> and performed streaming de-duplication detection with <a href="https://github.com/GregoryFaust/samblaster">samblaster</a>. Combined with no recalibration or realignment based on <a href="https://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/">our previous assessment</a>, this enabled fully streamed preparation of BAM files from input fastq reads. We called variants using two realigning callers: <a href="https://github.com/ekg/freebayes">FreeBayes (v0.9.14-7)</a> and <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html">GATK HaplotypeCaller (3.1-1-g07a4bf8)</a> and evaluated calls using the <a href="http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html">Genome in a Bottle reference callset for NA12878 (v0.2-NIST-RTG)</a>. The bcbio-nextgen documentation has <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#whole-genome-trio-50x">full instructions for reproducing the analysis</a>. </p>
<p> This work provides three practical improvements for variant calling and validation: </p>
<ul class="org-ul">
<li>Low complexity regions contribute 45% of the indels in whole genome evaluations, and are highly variable between callers. This replicates Heng's results and Michael's assessment of common errors in whole genome samples, and indicates we need to specifically identify and assess the 2% of the genome labeled as low complexity. Practically, we'll exclude them from further evaluations to avoid non-representative bias, and suggest removing or flagging them when producing whole genome variant calls. </li>
<li>We add a filter for removing false positives from FreeBayes calls in high depth, low quality regions. This removes variants in high depth regions that are likely due to copy number or other larger structural events, and replicates Heng's filtering results. </li>
<li>We improved settings for <a href="http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr">GATK variant quality recalibration (VQSR)</a>. The default VQSR settings are conservative for SNPs and need adjustment to be compatible with the sensitivity available through FreeBayes or GATK using hard filters. </li>
</ul></div>
</p></div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Low complexity regions</h2>
<div class="outline-text-2" id="text-2">
<p> Low complexity regions (LCRs) consist of locally repetitive sections of the genome. Heng's paper identified these using <a href="http://compbio.dfci.harvard.edu/tgi/software/">mdust</a> and provides a <a href="https://github.com/lh3/varcmp/raw/master/scripts/LCR-hs37d5.bed.gz">BED file of LCRs</a> covering 2% of the genome. Repeats in these regions can lead to artifacts in sequencing and variant calling. Heng's paper provides examples of areas where a full de-novo assembly correctly resolves the underlying structure, while local reassembly variant callers do not. </p>
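<p> Excluding calls in low complexity regions amounts to subtracting the LCR BED intervals from a callset. A simplified sketch with invented coordinates — in practice this subtraction is done with interval tools like bedtools: </p>

```python
# Sketch of excluding variant calls that fall within low complexity regions
# (LCRs), given a BED-style interval list. Coordinates are invented; real
# comparisons use dedicated interval tools such as bedtools.

def in_lcr(chrom, pos, lcrs):
    return any(c == chrom and start <= pos < end for c, start, end in lcrs)

def remove_lcr_calls(variants, lcrs):
    """Keep only variants whose position lies outside all LCR intervals."""
    return [v for v in variants if not in_lcr(v[0], v[1], lcrs)]

lcrs = [("1", 1000, 1200), ("2", 500, 800)]
variants = [("1", 950, "SNP"), ("1", 1100, "indel"), ("2", 600, "indel")]
kept = remove_lcr_calls(variants, lcrs)
```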
<p> To assess the impact of low complexity regions on variant calling, we compared calls from FreeBayes and GATK HaplotypeCaller to the Genome in a Bottle reference callset with and without low complexity regions included. The graph below shows concordant non-reference variant calls alongside discordant calls in three categories: missing discordants are potential false negatives, extras are potential false positives, and shared are variants that overlap between the evaluation and reference callsets but differ due to zygosity (heterozygote versus homozygote) or indel representation. </p>
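<p> The three discordance categories can be sketched as a set comparison over calls keyed by position. This is a hypothetical minimal representation, assuming calls are dicts mapping <code>(chrom, pos)</code> to <code>(alleles, genotype)</code>; a real comparison also needs to normalize indel representation before positions will match: </p>

```python
def classify_calls(eval_calls, ref_calls):
    """Split calls into concordant, missing, extra and shared-discordant.

    eval_calls, ref_calls: dicts mapping (chrom, pos) -> (alleles, genotype).
    Minimal sketch: real comparisons must also normalize indel
    representation before position keys will match.
    """
    concordant, shared = [], []
    for key, call in eval_calls.items():
        if key not in ref_calls:
            continue
        if call == ref_calls[key]:
            concordant.append(key)  # same position, alleles and zygosity
        else:
            shared.append(key)      # overlaps but differs: zygosity or representation
    missing = [k for k in ref_calls if k not in eval_calls]  # potential false negatives
    extra = [k for k in eval_calls if k not in ref_calls]    # potential false positives
    return concordant, missing, extra, shared
```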
<p> <a href="http://i.imgur.com/D2z9Kyp.png"> <img src="http://i.imgur.com/D2z9Kyp.png" width="700" alt="Low complexity regions for GATK and FreeBayes validation" /> </a>
<ul class="org-ul">
<li>For SNPs, removing low complexity regions removes approximately 2% of the total calls for both FreeBayes and GATK. This corresponds to the 2% of the genome subtracted by removing LCRs. </li>
<li>For indels, removing LCRs removes 45% of the calls due to the over-representation of indels in repeat regions. Additionally, this results in approximately equal GATK and FreeBayes concordant indels after LCR removal. Since the Genome in a Bottle reference callset uses GATK HaplotypeCaller to resolve discrepant calls, this change in concordance is likely due to bias towards GATK's approaches for indel resolution in complex regions. </li>
<li>The default GATK VQSR calls for SNPs are not as sensitive as FreeBayes calls. I'll describe additional work to improve this below. </li>
</ul>
<p> Practically, we'll now exclude low complexity regions in variant comparisons to avoid potential bias and more accurately represent calls in the remaining non-LCR genome. We'll additionally flag low complexity indels in non-evaluation callsets as likely to require additional followup. Longer term, we need to incorporate callers specifically designed for repeats like <a href="http://lobstr.teamerlich.org/index.html">lobSTR</a> to more accurately characterize these regions. </p>
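<p> Excluding LCRs amounts to an interval subtraction against the BED file. In practice you'd use <code>bedtools subtract</code> or pybedtools against LCR-hs37d5.bed, but the core check is simple; a pure-Python sketch assuming sorted, non-overlapping BED intervals: </p>

```python
import bisect

def load_lcrs(bed_lines):
    """Parse BED lines into per-chromosome sorted (start, end) interval lists.
    BED coordinates are 0-based, half-open."""
    lcrs = {}
    for line in bed_lines:
        chrom, start, end = line.split()[:3]
        lcrs.setdefault(chrom, []).append((int(start), int(end)))
    for ivals in lcrs.values():
        ivals.sort()
    return lcrs

def in_lcr(lcrs, chrom, pos):
    """Check whether a 1-based VCF position falls inside a low complexity region.
    Assumes non-overlapping intervals, so only the rightmost interval starting
    at or before the position needs checking."""
    ivals = lcrs.get(chrom, [])
    i = bisect.bisect_right(ivals, (pos - 1, float("inf"))) - 1
    return i >= 0 and ivals[i][0] <= pos - 1 < ivals[i][1]
```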
</p></div>
</p></div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">High depth, low quality filter for FreeBayes</h2>
<div class="outline-text-2" id="text-3">
<p> The second filter proposed in Heng's paper was removal of high depth variants. This was a useful change in mindset for me as I've primarily thought about removing low quality, low depth variants. However, high depth regions can indicate potential copy number variations or hidden duplicates which result in spurious calls. </p>
<p> Comparing true and false positive FreeBayes calls with a pooled multi-sample call quality of less than 500 identifies a large grouping of false positive heterozygous variants at a combined depth, across the trio, of 200: </p>
<p> <a href="http://i.imgur.com/S9lObRf.png"> <img src="http://i.imgur.com/S9lObRf.png" width="700" alt="Heterozygotes by depth and quality: true versus false positive" /> </a>
<p> The cutoff proposed by Heng was to calculate the average depth of called variants and set the cutoff as the average depth plus a multiplier of 3 or 4 times the square root of average depth. This dataset has an average depth of 169 for the trio, corresponding to a cutoff of 208 if we use the 3 multiplier, which compares nicely with a manual cutoff you'd set looking at the graph above. Applying a cutoff of QUAL < 500 and DP > 208 produces a reduction in false positives with little impact on sensitivity: </p>
<p> <a href="http://i.imgur.com/zOcrnKS.png"> <img src="http://i.imgur.com/zOcrnKS.png" width="700" alt="Improvement in filtering false positives with high depth filter" /> </a>
<p> A nice bonus of this filter is that it makes intuitive sense: variants with high depth and low quality indicate there is something problematic, and depth manages to partially compensate for the underlying issue. Inspired by <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_QualByDepth.html">GATK's QualByDepth annotation</a> and default filter of QD < 2.0, we incorporated a generalized version of this into <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vfilter.py#L75">bcbio-nextgen's FreeBayes filter</a>: QUAL < (depth-cutoff * 2.0) and DP > depth-cutoff. </p>
</p></div>
</p></div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4">GATK variant quality score recalibration (VQSR)</h2>
<div class="outline-text-2" id="text-4">
<p> The other area where we needed to improve was using GATK Variant Quality Score Recalibration. The default parameters provide a set of calls that are overly conservative relative to the FreeBayes calls. VQSR provides the ability to tune the filtering so we experimented with multiple configurations to achieve approximately equal sensitivity relative to FreeBayes for both SNPs and indels. The comparisons use the Genome in a Bottle reference callset for evaluation, and include VQSR default settings, multiple tranche levels and GATK's suggested hard filters: </p>
<p> <a href="http://i.imgur.com/c4yWTMv.png"> <img src="http://i.imgur.com/c4yWTMv.png" width="700" alt="VQSR tuning: SNPs" /> </a> <a href="http://i.imgur.com/3yZnkA6.png"> <img src="http://i.imgur.com/3yZnkA6.png" width="700" alt="VQSR tuning: indels" /> </a>
<p> While the sensitivity/specificity tradeoff depends on the research question, in trying to set a generally useful default we'd like to be less conservative than the GATK VQSR default. We learned these tips and tricks for tuning VQSR filtering: </p>
<ul class="org-ul">
<li>The default setting for VQSR is not a tranche level (like 99.0), but rather a LOD score of 0. In this experiment, that corresponded to a tranche of ~99.0 for SNPs and ~98.0 for indels. The best-practice example documentation uses command line parameters that specify a consistent tranche of 99.0 for both SNPs and indels, so depending on which you follow as a default you'll get different sensitivities. </li>
<li>To increase sensitivity, increase the tranche level. My expectations were that decreasing the tranche level would include more variants, but that actually applies additional filters. My suggestion for understanding tranche levels is that they specify the percentage of variants you want to capture; a tranche of 99.9% captures 99.9% of the true cases in the training set, while 99.0% captures less. </li>
<li>We found tranche settings of 99.97% for SNPs and 98.0% for indels correspond to roughly the sensitivity/specificity that you achieve with FreeBayes. These are the new default settings in bcbio-nextgen. </li>
<li>Using <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vfilter.py#L125">hard filtering of variants based on GATK recommendations</a> performs well and is also a good default choice. For SNPs, the hard filter defaults are less conservative and more in line with FreeBayes results than VQSR defaults. VQSR has improved specificity at the same sensitivity and has the advantage of being configurable, but will require an extra tuning step. </li>
</ul>
<p> Overall VQSR provides good filtering and the ability to tune sensitivity but requires validation work to select tranche cutoffs that are as sensitive as hard filter defaults, since default values tend to be overly conservative for SNP calling. In the absence of the ability or desire to tune VQSR tranche levels, the GATK hard filters provide a nice default without much of a loss in precision. </p>
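<p> The GATK-recommended SNP hard filters mentioned above can be sketched as a predicate over INFO annotations. The thresholds below are the commonly cited GATK 3 defaults; check the current best-practice documentation before relying on them: </p>

```python
# Commonly cited GATK 3 hard-filter thresholds for SNPs. Rank-sum
# annotations are absent at homozygous sites, so missing values pass.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}

def snp_hard_filter(info):
    """Return the names of failed hard filters for a dict of INFO annotations."""
    return [name for name, fails in SNP_HARD_FILTERS.items()
            if name in info and fails(info[name])]
```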
</p></div>
</p></div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5">Data availability and future work</h2>
<div class="outline-text-2" id="text-5">
<p> Thanks to continued community work on improving variant calling evaluations, this post demonstrates practical improvements in bcbio-nextgen variant calling. We welcome interested contributors to re-run and expand on the analysis, with full instructions in the <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#whole-genome-trio-50x">bcbio-nextgen example pipeline documentation</a>. Some of the output files from the analysis may also be useful: </p>
<ul class="org-ul">
<li>VCF files for FreeBayes <a href="https://s3.amazonaws.com/bcbio_nextgen/comparison3/freebayes-tp-het.vcf.gz">true positive</a> and <a href="https://s3.amazonaws.com/bcbio_nextgen/comparison3/freebayes-fp-het.vcf.gz">false positive</a> heterozygote calls, used here to improve filtering via assessment of high depth regions. Heterozygotes make up the majority of false positive calls so take the most work to correctly filter and detect. </li>
<li><a href="https://s3.amazonaws.com/bcbio_nextgen/comparison3/giab-comparision-shared-gatk-fb-extras.vcf.gz">Shared false positives from FreeBayes and GATK HaplotypeCaller</a>. These are potential missing variants in the Genome in a Bottle reference. Alternatively, they may represent persistent errors found in multiple callers. </li>
</ul>
<p> We plan to continue to explore variant calling improvements in bcbio-nextgen. Our next steps are to use the trio population framework to compare pooled population calling versus <a href="http://gatkforums.broadinstitute.org/discussion/3896/the-gatk-reference-model-pipeline-for-incremental-joint-discovery-in-full-detail">the incremental joint discovery approach introduced in GATK 3</a>. We'd also like to compare with single sample calling followed by <a href="https://github.com/chapmanb/bcbio.variation.recall">subsequent squaring off/backfilling</a> to assess the value of concurrent population calling. We welcome suggestions and thoughts on this work and future directions. </p>
</p></div>
</p></div>