<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Blue Collar Bioinformatics</title>
	
	<link>http://bcbio.wordpress.com</link>
	<description />
	<lastBuildDate>Tue, 21 May 2013 10:28:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain="bcbio.wordpress.com" port="80" path="/?rsscloud=notify" registerProcedure="" protocol="http-post" />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Blue Collar Bioinformatics</title>
		<link>http://bcbio.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://bcbio.wordpress.com/osd.xml" title="Blue Collar Bioinformatics" />
	
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/bcbio" /><feedburner:info uri="bcbio" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://bcbio.wordpress.com/?pushpress=hub" /><item>
		<title>Framework for evaluating variant detection methods: comparison of aligners and callers</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/WiTS3ECzD_U/</link>
		<comments>http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/#comments</comments>
		<pubDate>Mon, 06 May 2013 12:29:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[variation]]></category>
		<category><![CDATA[alignment]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[clinical]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[variant]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=419</guid>
		<description><![CDATA[Variant detection and grading overview Developing pipelines for detecting variants from high throughput sequencing data is challenging due to rapidly changing algorithms and relatively low concordance between methods. This post will discuss automated methods providing evaluation of variant calls, enabling detailed diagnosis of discordant differences between multiple calling approaches. This allows us to: Characterize strengths [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=419&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Variant detection and grading overview</h2>
<div class="outline-text-2" id="text-1">
<p> Developing pipelines for detecting variants from high throughput sequencing data is challenging due to rapidly changing algorithms and <a href="http://genomemedicine.com/content/5/3/28/abstract">relatively low concordance between methods</a>. This post will discuss automated methods providing evaluation of variant calls, enabling detailed diagnosis of discordant differences between multiple calling approaches. This allows us to: </p>
<ul>
<li>Characterize strengths and weaknesses of alignment, post-alignment   preparation and calling methods.  </li>
<li>Automatically verify pipeline updates and installations to ensure   variant calls recover expected variations. This extends the   <a href="http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/">XPrize validation protocol</a> to provide full summary metrics on   concordance and discordance of variants.  </li>
<li>Make recommendations on best-practice approaches to use in   sequencing studies requiring either exome or whole genome variant   calling.  </li>
<li>Identify characteristics of genomic regions more likely to have   discordant variants which require additional care when making   biological conclusions based on calls, or lack of calls, in these   regions. </li>
</ul>
<p> This evaluation work is part of a larger community effort to better characterize variant calling methods. A key component of these evaluations is a well characterized set of reference variations for the <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">NA12878 human HapMap genome</a>, provided by <a href="http://www.genomeinabottle.org/">NIST&#8217;s Genome in a Bottle consortium</a>. The diagnostic component of this work supplements emerging tools like <a href="http://www.bioplanet.com/gcat/">GCAT (Genome Comparison and Analytic Testing)</a>, which provides a community platform for comparing and discussing calling approaches. </p>
<p> I&#8217;ll show a 12 way comparison between 2 different aligners (<a href="http://www.novocraft.com/main/index.php">novoalign</a> and <a href="http://bio-bwa.sourceforge.net/">bwa mem</a>), 2 different post-alignment preparation methods (<a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">GATK best practices</a> and the <a href="http://gkno.me/">Marth lab&#8217;s gkno pipeline</a>), and 3 different variant callers (<a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html">GATK UnifiedGenotyper</a>, <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html">GATK HaplotypeCaller</a>, and <a href="https://github.com/ekg/freebayes">FreeBayes</a>). This allows comparison of openly available methods (bwa mem, gkno preparation, and FreeBayes) with those that require licensing (novoalign, GATK&#8217;s variant callers). I&#8217;ll also describe <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a>, the fully automated open-source pipeline used for variant calling and evaluation, which allows others to easily bring this methodology into their own work and extend this analysis. </p>
</p></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Aligner, post-alignment preparation and variant calling comparison</h2>
<div class="outline-text-2" id="text-2">
<p> To compare methods, we called variants on a NA12878 exome dataset from <a href="http://www.edgebio.com/">EdgeBio&#8217;s clinical pipeline</a> and assessed them against the NIST Genome in a Bottle reference material. Discordant positions where the reference and evaluation variants differ fall into three different categories: </p>
<ul>
<li>Extra variants, called in the evaluation data but not in the   reference. These are potential false positives.  </li>
<li>Missing variants, found in the NA12878 reference but not in the   evaluation data set. These are potential false negatives. The use   of high quality reference materials from NIST enables   identification of genomic regions we fail to call in.  </li>
<li>Shared variants, called in both the evaluation and reference but   differently represented. This could result from allele differences, such as   heterozygote versus homozygote calls, or variant identification   differences, such as indel start and end coordinates. </li>
</ul>
<p> To further identify causes of discordance, we subdivide the missing and extra variants using annotations from the <a href="https://github.com/arq5x/gemini">GEMINI variation framework</a>: </p>
<ul>
<li>Low coverage: positions with limited read coverage (4 to 9 reads). </li>
<li>Repetitive: regions identified by <a href="http://repeatmasker.org/">RepeatMasker</a>. </li>
<li>Error prone: variants falling in   <a href="http://www.biomedcentral.com/1471-2105/14/S5/S1">motifs found to induce sequencing errors</a>. </li>
</ul>
<p> We subdivide and restrict our comparisons to help identify sources of differences between methods indistinguishable when looking at total discordant counts. A critical subdivison is comparing SNPs and indels separately. With lower total counts of indels but higher error rates, each variant type needs independent visualization. Secondly, it&#8217;s crucial to distinguish between discordance caused by a lack of coverage, and discordance caused by an actual difference in variant assessment. We evaluate only in callable regions with 4 or more reads. This low minimum cutoff provides a valuable evaluation of low coverage regions, which differ the most between alignment and calling methods. </p>
<p> I&#8217;ll use this data to provide recommendations for alignment, post-alignment preparation and variant calling. In addition to these high level summaries, the full dataset and summary plots available below providing a starting place for digging further into the data. </p>
</p></div>
<div id="outline-container-2-1" class="outline-3">
<h3 id="sec-2-1">Aligners</h3>
<div class="outline-text-3" id="text-2-1">
<p> We compared two recently released aligners designed to work with longer reads coming from new sequencing technologies: <a href="http://www.novocraft.com/main/index.php">novoalign (3.00.02)</a> and <a href="http://bio-bwa.sourceforge.net/">bwa mem (0.7.3a</a>). bwa mem identified 1389 additional concordant SNPs and 145 indels not seen with novoalign. 1024 of <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/validation/NA12878_cmp/NA12878-cmp-in-bwa-not-novoalign.vcf">these missing variants</a> are in regions where novoalign does not provide sufficient coverage for calling. Of those, 92% (941) have low coverage with less than 10 reads in the bwa alignments. Algorithmic changes impact low coverage regions more due to the decreased evidence and susceptibility to crossing calling coverage thresholds, so we need extra care and consideration of calls in these regions. </p>
<p> Our standard workflow uses novoalign based on its stringency in <a href="http://f1000research.com/articles/1-2/v2#f3">resolving large insertions and deletions</a>. These results suggest equally good results using bwa mem, along with improved processing times. One caveat to these results is that some of the available Illumina call data that feeds into NIST&#8217;s reference genomes comes from a bwa alignment, so some differences may reflect a bias towards bwa alignment heuristics. Using non-simulated reference data sets has the advantage of capturing real biological and process errors, but requires iterative improvement of the reference materials to avoid this type of potential algorithmic bias. </p>
<p> <img src="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-alignerdiff.png" alt="Comparison of concordant variants by aligner type" width="700" /> </p>
</p></div>
</p></div>
<div id="outline-container-2-2" class="outline-3">
<h3 id="sec-2-2">Post-alignment preparation and quality score recalibration</h3>
<div class="outline-text-3" id="text-2-2">
<p> We compared two methods of quality recalibration: </p>
<ul>
<li><a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">GATK&#8217;s best practices (2.4-9)</a>: This involves de-duplication with   <a href="http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates">Picard MarkDuplicates</a>, GATK base quality score recalibration and   GATK realignment around indels.  </li>
<li><a href="http://gkno.me/">The Marth Lab&#8217;s gkno realignment pipeline</a>: This performs de-duplication   with <a href="http://samtools.sourceforge.net/">samtools rmdup</a> and realignment around indels using <a href="https://github.com/ekg/ogap">ogap</a>. All   commands in this pipeline work on streaming input, avoiding disk IO   penalties by using unix pipes. This piped approach improves   scaling on large numbers of whole genome samples. Notably, our   implementation of the pipeline does not use a base quality score   recalibration step. </li>
</ul>
<p> GATK best practice pipelines offer an advantage over the gkno-only pipeline primarily because of improvements in SNP calling from base quality recalibration. Specifically we lose ~1% (824 / 77158) of called variations. These fall into the discordant missing &#8220;other&#8221; category, so we cannot explain them by metrics such as coverage or genome difficulty. The simplest explanation is that initial poor quality calculations in those regions result in callers missing those variants. Base quality recalibration helps recover them. These results match <a href="http://basecallbio.wordpress.com/2013/04/23/base-quality-score-rebinning/">Brendan O&#8217;Fallon&#8217;s recent analysis of base quality score recalibration</a>. </p>
<p> This places a practical number on the lost variants when avoiding recalibration either due to scaling or GATK licensing concerns. Some other options for recalibration include <a href="#novoalign-qual">Novoalign&#8217;s Quality Recalibration</a> and <a href="#bamtuil">University of Michigan&#8217;s BamUtil recab</a>, although we&#8217;ve not yet tested either in depth as potential supplements to improve calling in non-GATK pipelines. </p>
<p> <img src="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-bamprepdiff.png" alt="Comparison of concordant variants by post-alignment prep method" width="700" /> </p>
</p></div>
</p></div>
<div id="outline-container-2-3" class="outline-3">
<h3 id="sec-2-3">Variant callers</h3>
<div class="outline-text-3" id="text-2-3">
<p> For this comparison, we used three general purpose callers that handle SNPs and small indels, all of which have updated versions since <a href="http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/">our last comparison</a>: </p>
<ul>
<li><a href="https://github.com/ekg/freebayes">FreeBayes (0.9.9 296a0fa)</a>: A haplotype-based Bayesian caller from   the Marth Lab, with filtering on quality score and read depth.  </li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html">GATK UnifiedGenotyper (2.4-9)</a>: GATK&#8217;s widely used Bayesian caller,   using filtering recommendations for exome experiments from   <a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">GATK&#8217;s best practices</a>.  </li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html">GATK HaplotypeCaller (2.4-9)</a>: GATK&#8217;s more recently developed   haplotype caller which provides local assembly around variant   regions, using filtering recommendations for exomes from   <a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">GATK&#8217;s best practices</a>. </li>
</ul>
<p> Adjusting variant calling methods has the biggest impact on the final set of calls. Called SNPs differ by 4577 between the three compared approaches, in comparison with aligner and post-alignment preparation changes which resulted in a maximum difference of 1389 calls. This suggests that experimenting with variant calling approaches currently provides the most leverage to improve calls. </p>
<p> A majority of the SNP concordance differences between the three calling methods are in low coverage regions with between 4 and 9 reads. GATK UnifiedGenotyper performs the best in detecting SNPs in these low coverage regions. FreeBayes and GATK HaplotypeCaller both call more conservatively in these regions, generating more potential false negatives. FreeBayes had the fewest heterozygote/homozygote discrimination differences of the three callers. </p>
<p> For indels, FreeBayes and HaplotypeCaller both provide improved sensitivity compared to UnifiedGenotyper, with HaplotypeCaller identifying the most, especially in low coverage regions. In contrast to the SNP calling results, FreeBayes has more calls that match the expected indel but differ in whether a call is a heterozygote or homozygote. </p>
<p> <img src="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-callerdiff.png" alt="Comparison of concordant variants by calling method" width="800" /> </p>
<p> No one caller outperformed the others on all subsets of the data. GATK UnifiedGenotyper performs best on SNPs but is less sensitive in resolving indels. GATK HaplotypeCaller identifies the most indels, but is more conservative than the other callers on SNPs. FreeBayes provides intermediate sensitivity and specificity between the two for both SNPs and indels. A combined UnifiedGenotyper and HaplotypeCaller pipeline for SNPs and indels, respectively, would provide the best overall calling metrics based on this set of comparisons. </p>
<p> Low coverage regions are the key area of difference between callers. Coupled with the alignment results and investigation of variant changes resulting from <a href="http://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/">quality score binning</a>, this suggests we should be more critical in assessing both calls and coverage in these regions. Assessing coverage and potential false negatives is especially critical since we lack good tools to summarize and prioritize genomic regions that are potentially missed during sequencing. This also emphasizes the role of population-based calling to help resolve low coverage regions, since callers can use evidence from multiple samples to better estimate the likelihoods of low coverage calls. </p>
</p></div>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Automated calling and grading pipeline</h2>
<div class="outline-text-2" id="text-3">
<p> Method comparisons become dated quickly due to the continuous improvement in aligners and variant callers. While these recommendations are useful now, in 6 months there will be new releases with improved approaches. This rapid development cycle creates challenges for biologists hoping to derive meaning from variant results: do you stay locked on software versions whose trade offs you understand, or do you attempt to stay current and handle re-verifying results with every new release? </p>
<p> Our goal is to provide a community developed pipeline and comparison framework that ameliorates this continuous struggle to re-verify. The analysis done here is fully automated as part of the <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> analysis framework. This framework code provides full exposure and revision tracking of all parameters used in analyses. For example, the ngsalign module contains the command lines used for <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/bwa.py#L38">bwa mem</a> and <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/novoalign.py#L44">novoalign</a>, as well as all other tools. </p>
<p> To install the pipeline, third-party software and required data files: </p>
<pre class="example">wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local /usr/local/share/bcbio-nextgen
</pre>
<p> 
<p> The installer bootstraps all installation on a bare machine using <a href="http://cloudbiolinux.org">the CloudBioLinux framework</a>. More details and options are available in the <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/installation.html">installation documentation</a>. </p>
<p> To re-run this analysis, retrieve the input data files and configuration as described in the <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#example-pipelines">bcbio-nextgen example documentation</a> with: </p>
<pre class="example">$ mkdir config &amp;&amp; cd config
$ wget https://raw.github.com/chapmanb/bcbio-nextgen/master/config/\
   examples/NA12878-exome-methodcmp.yaml
$ cd .. &amp;&amp; mkdir input &amp;&amp; cd input
$ wget https://dm.genomespace.org/datamanager/file/Home/EdgeBio/\
   CLIA_Examples/NA12878-NGv3-LAB1360-A/NA12878-NGv3-LAB1360-A_1.fastq.gz
$ wget https://dm.genomespace.org/datamanager/file/Home/EdgeBio/\
   CLIA_Examples/NA12878-NGv3-LAB1360-A/NA12878-NGv3-LAB1360-A_2.fastq.gz
$ wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-nist-v2_13-NGv3-pass.vcf.gz
$ wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-nist-v2_13-NGv3-regions.bed.gz
$ gunzip NA12878-nist-*.gz
$ wget https://s3.amazonaws.com/bcbio_nextgen/NGv3.bed.gz
$ gunzip NGv3.bed.gz
</pre>
<p> 
<p> Then run the analysis, distributed on 8 local cores, with: </p>
<pre class="example">$ mkdir work &amp;&amp; cd work
$ bcbio_nextgen.py bcbio_system.yaml ../input ../config/NA12878-exome-methodcmp.yaml -n 8
</pre>
<p> 
<p> The <a href="https://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html">bcbio-nextgen documentation</a> describes how to parallelize processing over multiple machines using cluster schedulers (LSF, SGE, Torque). </p>
<p> The pipeline and comparison framework are open-source and configurable for multiple aligners, preparation methods and callers. We invite anyone interested in this work to provide feedback and contributions. </p>
</p></div>
</p></div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">Full data sets</h2>
<div class="outline-text-2" id="text-4">
<p> We extracted the conclusions for alignment, post-alignment preparation and variant calling from analysis of the full dataset. The visualizations for the full data are not as pretty but we make them available for anyone interested in digging deeper: </p>
<ul>
<li><a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep.csv">Summary CSV of comparisons</a> split by methods and   concordance/discordance types, easily importable into <a href="http://cran.r-project.org/">R</a> or <a href="http://pandas.pydata.org/">pandas</a>   for further analysis. </li>
<li><a href="https://github.com/chapmanb/bcbb/tree/master/validation">Code for preparing and plotting results</a> </li>
<li>Full comparisons of all 12 methods, stratified by concordance and   discordance: <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-SNP.pdf">SNPs</a> and <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-Indel.pdf">indels</a> </li>
<li>Boxplots of differences between alignment methods: <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-aligner-SNP.pdf">SNPs</a> and <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-aligner-Indel.pdf">indels</a> </li>
<li>Boxplots of differences between post-alignment preparation methods:   <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-bamprep-SNP.pdf">SNPs</a> and <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-bamprep-Indel.pdf">indels</a> </li>
<li>Boxplots of differences between variant calling methods: <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-caller-SNP.pdf">SNPs</a> and <a href="https://github.com/chapmanb/bcbb/raw/master/posts/calling_pipeline_compare/grading-summary-prep-caller-Indel.pdf">indels</a> </li>
</ul>
<p> The comparison variant calls are also useful for pinpointing algorithmic differences between methods. Some useful subsets of variants: </p>
<ul>
<li>Concordant variants called by bwa and not novoalign, where novoalign did   not have sufficient coverage in the region. These are calls where   either novoalign fails to map some reads, or bwa maps too aggressively:   <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/validation/NA12878_cmp/NA12878-cmp-in-bwa-not-novoalign.vcf">VCF of bwa calls with low or no coverage in novoalign.</a>  </li>
<li>Discordant variants called consistently by multiple calling methods.   These are potential errors in the reference material, or   consistently problematic calling regions for multiple   algorithms. Of the 9004 shared discordants, the majority are   potential false negatives not seen in the evaluation calls (7152; 79%). Another large portion is   heterozygote/homozygote differences, which make up 1627 calls (18%).   6652 (74%) of the differences have low coverage in the exome   evaluation, again reflecting the difficulties in calling in these regions.   The <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/validation/NA12878_cmp/NA12878-cmp-shared-discordants.vcf">VCF of discordants found in 2 or more callers</a> contains these   calls, with a &#8216;GradeCat&#8217; INFO tag specifying the discordance category. </li>
</ul>
<p> We encourage reanalysis and welcome suggestions for improving the presentation and conclusions in this post. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/419/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/419/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=419&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=WiTS3ECzD_U:JA1P5JNV5OI:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=WiTS3ECzD_U:JA1P5JNV5OI:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=WiTS3ECzD_U:JA1P5JNV5OI:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=WiTS3ECzD_U:JA1P5JNV5OI:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=WiTS3ECzD_U:JA1P5JNV5OI:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/WiTS3ECzD_U" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-alignerdiff.png" medium="image">
			<media:title type="html">Comparison of concordant variants by aligner type</media:title>
		</media:content>

		<media:content url="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-bamprepdiff.png" medium="image">
			<media:title type="html">Comparison of concordant variants by post-alignment prep method</media:title>
		</media:content>

		<media:content url="https://raw.github.com/chapmanb/bcbb/master/posts/calling_pipeline_compare/grading-summary-prep-callerdiff.png" medium="image">
			<media:title type="html">Comparison of concordant variants by calling method</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/</feedburner:origLink></item>
		<item>
		<title>Bioinformatics open source interoperability Hackathon at the Broad Institute</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/VjR0XBk6zS4/</link>
		<comments>http://bcbio.wordpress.com/2013/04/08/bioinformatics-open-source-interoperability-hackathon-at-the-broad-institute/#comments</comments>
		<pubDate>Mon, 08 Apr 2013 19:25:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[bosc]]></category>
		<category><![CDATA[hackathon]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=413</guid>
		<description><![CDATA[Interoperability Hackathon On April 7th and 8th, a group of biologists and programmers gathered at the Broad Institute to work on improving interoperability of open-source bioinformatics tools. Organized by the Open Bioinformatics Foundation and GenomeSpace team, this was part of the lead up to the Bioinformatics Open Source Conference (BOSC) in July in Berlin. The [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=413&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Interoperability Hackathon</h2>
<div class="outline-text-2" id="text-1">
<p> On April 7th and 8th, a group of biologists and programmers gathered at the <a href="http://www.broadinstitute.org/">Broad Institute</a> to work on improving interoperability of open-source bioinformatics tools. Organized by the <a href="http://www.open-bio.org/wiki/Main_Page">Open Bioinformatics Foundation</a> and <a href="http://www.genomespace.org/">GenomeSpace team</a>, this was part of the lead up to the <a href="http://www.open-bio.org/wiki/BOSC_2013">Bioinformatics Open Source Conference (BOSC)</a> in July in Berlin. The event is part of an ongoing series of coding sessions (Codefests or Hackathons) organized by the open bioinformatics community, which give programmers who typically work together remotely a chance to code and discuss in the same place for two days. These have been successful in both producing new code and in building connections which help sustain development of these community projects. </p>
</p></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Goals and outcomes</h2>
<div class="outline-text-2" id="text-2">
<p> One major challenge in analyzing biological data is interfacing multiple bioinformatics tools. Tools often work independently, and where general architectures like plugins or API exist they are often project specific. This results in isolated islands of data exchange, but transferring data or resources between tools requires work that is often rate-limiting or insurmountable. </p>
<p> Our goal at the hackathon was to provide simple APIs and implementations that help facilitate transfers between multiple islands of functionality. <a href="http://www.genomespace.org/">GenomeSpace</a> does this by providing a central hub and API to push and pull from tools. We wanted to generalize this to support multiple tools, and build client implementations that demonstrate this in practice. The long term goal is to encourage tool developers to provide server side APIs compatible with the more general library, making extension of the connector toolkit easier. For developers, the client API would allow them to easily transfer files between multiple tools without needing to learn and implement the specific transfer APIs of each tool. </p>
<p> We called this high level client library Genome Connector (gcon, for short) and took a practical approach by implementing client libraries that provide a common interface to multiple tools: GenomeSpace, <a href="http://wiki.galaxyproject.org/Learn/API">Galaxy</a>, <a href="https://developer.basespace.illumina.com/">BaseSpace</a>, <a href="https://api.23andme.com/">23andMe</a> and  <a href="http://www.jclouds.org/">general key-value stores through jClouds.</a> To identify a reasonable amount of work for two days, we focused on file transfer: authentication, finding files, getting and putting files to remote analysis platforms. In addition we defined some critical components for doing biological work: </p>
<ul>
<li>File metadata: We need to be able to store arbitrary key/value   on objects to assign essential biological information necessary to   interpret it, like organisms and genome build. In addition, metadata   allows provenance and tracking of files by enabling annotation of   files with history and processing steps.  </li>
<li>Filesets: Large biological files have secondary files with indexes,   allowing indexed retrieval of data (for example: read bam and bai,   variant vcf and idx, tabix gz and tbi). To avoid expensive   reindexing, we want to group and transfer these together. </li>
</ul>
<p> We also identified other useful extensions that would help improve interoperability and facilitate building connected tools, like providing <a href="https://en.wikipedia.org/wiki/Publish/subscribe">Publish/subscribe</a> hooks to avoid having to poll servers for updates, and smarter approaches to sending data to avoid duplication and unnecessary transfer of data. </p>
<p> The output of our discussion and coding are common Genome Connector implementations in multiple languages. GitHub repositories are available for in-progress <a href="https://github.com/heuermh/java-gcon/">Java</a>, <a href="https://github.com/roryk/python-gcon">Python</a> and <a href="https://github.com/chapmanb/clj-gcon">Clojure</a> implementations. These wrap multiple diverse tools and expose them through a common top level API, allowing developers to push and pull data from multiple tools. </p>
<p> I&#8217;m immensely grateful to the incredible <a href="https://docs.google.com/spreadsheet/ccc?key=0Agxg-o4ZmoZ4dFBpVllXX0s5OFJ1WFVPN2pzb29tWlE#gid=0">participants</a> who generously donated their time and expertise to help with these projects. For anyone interested we also have <a href="https://docs.google.com/document/d/1k71Wpm9_MMmLxKOeX_76TJyCaSGRZdsVgPgheL3eTLQ/edit">detailed documentation</a> on discussions during the hackathon. </p>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Bioinformatics Open Source Conference</h2>
<div class="outline-text-2" id="text-3">
<p> If you&#8217;re a bioinformatics programmers interested in open source coding and helping answer biological questions by improving usability and connectivity of tools, you&#8217;re welcome to join the OpenBio and BOSC communities. We&#8217;ve created a <a href="http://groups.google.com/group/biointerop">biological interoperability mailing list</a> for additional discussion. The next <a href="http://www.open-bio.org/wiki/BOSC_2013">BOSC conference</a> is July 19th and 20th in Berlin, Germany as part of the ISMB conference. There will also be another two day <a href="http://www.open-bio.org/wiki/Codefest_2013">Codefest</a> proceeding BOSC on July 17th and 18th. <a href="http://www.open-bio.org/wiki/BOSC_2013#Session_Topics">Abstracts for talks at BOSC</a> are due this Friday, April 12th. Looking forward to seeing everyone at future BOSC and coding events. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/413/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/413/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=413&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=VjR0XBk6zS4:nVKN9bmB9tU:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=VjR0XBk6zS4:nVKN9bmB9tU:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=VjR0XBk6zS4:nVKN9bmB9tU:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=VjR0XBk6zS4:nVKN9bmB9tU:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=VjR0XBk6zS4:nVKN9bmB9tU:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/VjR0XBk6zS4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2013/04/08/bioinformatics-open-source-interoperability-hackathon-at-the-broad-institute/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2013/04/08/bioinformatics-open-source-interoperability-hackathon-at-the-broad-institute/</feedburner:origLink></item>
		<item>
		<title>The influence of reduced resolution quality scores on alignment and variant calling</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/ZUo1r1dTJ5E/</link>
		<comments>http://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/#comments</comments>
		<pubDate>Wed, 13 Feb 2013 10:49:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[variation]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[clinical]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[variant]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=382</guid>
		<description><![CDATA[BAM file size reduction and quality score binning We have a large upcoming whole genome sequencing project with Illumina, and they approached us about delivering BAM files with reduced resolution base quality scores. They have a white paper describing the approach, which involves binning scores to reduce resolution. This reduces the number of scores describing [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=382&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">BAM file size reduction and quality score binning</h2>
<div class="outline-text-2" id="text-1">
<p> We have a large upcoming whole genome sequencing project with <a href="http://www.illumina.com/">Illumina</a>, and they approached us about delivering BAM files with reduced resolution base quality scores. They have a <a href="http://www.illumina.com/Documents/products/whitepapers/whitepaper_datacompression.pdf">white paper</a> describing the approach, which involves binning scores to reduce resolution.  This reduces the number of scores describing the quality of a base from 40 down to 8. </p>
<p> The advantage of this approach is a significant reduction in file size. BAM files use <a href="http://blastedbio.blogspot.ie/2011/11/bgzf-blocked-bigger-better-gzip.html">BGZF compression</a>, and the underlying <a href="http://www.chipestimate.com/tech-talks/2010/03/23/Altior-Unzipping-the-GZIP-compression-protocol">gzip DEFLATE algorithm</a> compresses based on shared text regions.  Reducing the number of quality values increases shared blocks and improves compression. This reduces BAM file sizes by 25-35%: an exome BAM file reduced from 5.7Gb to 3.7Gb after quality binning. </p>
<p> The potential downside is that the reduction in quality resolution may impact alignment and variant calling approaches that rely on base quality scores. To assess this, I implemented quality score binning as part of the <a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen analysis pipeline</a> using the <a href="http://www.ebi.ac.uk/ena/about/cram_toolkit/">CRAM toolkit</a> and ran alignment, recalibration, realignment and variant calling on: </p>
<ul>
<li>The original unbinned 40-resolution base quality BAM from an   NA12878 exome. </li>
<li>The BAM binned into 8-resolution base qualities before   alignment. </li>
<li>The BAM binned into 8-resolution base qualities before alignment and   binned again following <a href="http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr">base quality score recalibration</a>. </li>
</ul>
<p> A comparison of alignment and variant calls from the three approaches indicates that binning has nearly no impact on alignment and a small impact on variant calls, primarily in low depth regions. </p>
</p></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Alignment differences</h2>
<div class="outline-text-2" id="text-2">
<p> We aligned 100bp paired end reads with <a href="http://www.novocraft.com/main/index.php">Novoalign</a>, a quality aware aligner. Comparison of mapped reads showed nearly no impact on total mapped reads. The plot below shows a generic delta of changes in mapped reads across the 22 autosomes alongside the increase in unmapped pairs. Out of 73 million total reads, the changes account for ~0.003% of the total reads. There also did not appear to be any worrisome patterns of loss for specific chromosomes. Overall, there is a minimal impact of quality score binning on the ability to align the reads. </p>
<p> <img src="http://bcbio.files.wordpress.com/2013/02/wpid-qualbin-alignment-changes.png?w=600" alt="Alignment changes following quality binning" width="600" /> </p>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Variant call differences</h2>
<div class="outline-text-2" id="text-3">
<p> We called variants using the <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html">GATK Unified Genotyper</a> following the <a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">best practice recommendations for exomes</a> and then compared calls from original and binned quality scores. Both approaches for binning &#8212; pre-binning, and pre-binning plus post-quality recalibration binning &#8212; showed similar levels of concordance to non-binned quality scores: 99.81 and 99.78, respectively. Since the additional binning after recalibration provides a smaller prepared BAM file for storage and has a similar impact to pre-binning only, we used this for additional analysis of discordant variants. </p>
<p> The table below shows the discordant differences between the 40 quality score resolution and binned, 8 quality score resolution BAMs. 40 quality discordant variants are those called with full quality score resolution but not called, or called differently, after binning to 8 quality score resolution. Conversely, the 8-quality discordants are those called uniquely after quality binning: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:250px;margin:10px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">Overall genotype concordance</td>
<td class="right">99.78</td>
</tr>
<tr>
<td class="left">concordant: total</td>
<td class="right">117887</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">109144</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">8743</td>
</tr>
<tr>
<td class="left">40-quality discordant: total</td>
<td class="right">821</td>
</tr>
<tr>
<td class="left">40-quality discordant: SNPs</td>
<td class="right">759</td>
</tr>
<tr>
<td class="left">40-quality discordant: indels</td>
<td class="right">62</td>
</tr>
<tr>
<td class="left">8-quality discordant: total</td>
<td class="right">1289</td>
</tr>
<tr>
<td class="left">8-quality discordant: SNPs</td>
<td class="right">1240</td>
</tr>
<tr>
<td class="left">8-quality discordant: indels</td>
<td class="right">49</td>
</tr>
<tr>
<td class="left">het/hom discordant</td>
<td class="right">259</td>
</tr>
</tbody>
</table>
<p> We investigated the discordant variants further since 1.5%  of the total variant calls change as a result of binning, Of the 1851 unique discordant variants, approximately half (928) fall into reproducible variants  <a href="http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/">identified by looking at ensemble combinations of replicates</a>. Of these potentially problematic discordant variants more than half are in low coverage regions with less than 10 reads: </p>
<p> <img src="http://bcbio.files.wordpress.com/2013/02/wpid-qualbin-variant-changes.png?w=600" alt="Variant changes following quality binning" width="600" /> </p>
<p> The major influence of quality score binning is resolution of variants in low coverage regions. This manifests as differences in heterozygote and homozygote calling, indel representation and filtering differences related to quality and mappability. To assess the potential impact,  we looked at the loss in callable bases on a 30x whole genome sequence  when moving from a minimum of 5 reads to a minimum of 10, using  <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_CallableLoci.html">GATK&#8217;s CallableLoci tool</a>. Regions with read coverage of 5 to 9 make up 4.7 million genome positions, 0.17% of the  total callable bases. </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:400px;margin:10px;">
<caption></caption>
<col class="left" />
<col class="left" />
<col class="left" />
<thead>
<tr>
<th scope="col" class="left"></th>
<th scope="col" class="left">5 read minimum</th>
<th scope="col" class="left">10 read minimum</th>
</tr>
</thead>
<tbody>
<tr>
<td class="left">Callable bases</td>
<td class="left">2,775,871,235</td>
<td class="left">2,771,109,000</td>
</tr>
<tr>
<td class="left">Percent callable</td>
<td class="left">96.90%</td>
<td class="left">96.73%</td>
</tr>
<tr>
<td class="left">Low coverage</td>
<td class="left">17,641,980</td>
<td class="left">22,404,215</td>
</tr>
<tr>
<td class="left">No coverage/ poor mapping</td>
<td class="left">71,272,008</td>
<td class="left">71,272,008</td>
</tr>
</tbody>
</table>
<p> In conclusion, quality score binning provides a useful reduction in input file sizes with minimal impact on alignment. For variant calling, use additional caution in low coverage regions with less than 10 supporting reads. Given the rapid increases in read throughput that are driving the need for file size reduction, quality score binning is a worthwhile tradeoff for high-coverage recalling work. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/382/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/382/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=382&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=ZUo1r1dTJ5E:u8TdAdvyZro:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=ZUo1r1dTJ5E:u8TdAdvyZro:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=ZUo1r1dTJ5E:u8TdAdvyZro:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=ZUo1r1dTJ5E:u8TdAdvyZro:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=ZUo1r1dTJ5E:u8TdAdvyZro:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/ZUo1r1dTJ5E" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2013/02/wpid-qualbin-alignment-changes.png" medium="image">
			<media:title type="html">Alignment changes following quality binning</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2013/02/wpid-qualbin-variant-changes.png" medium="image">
			<media:title type="html">Variant changes following quality binning</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/</feedburner:origLink></item>
		<item>
		<title>An automated ensemble method for combining and evaluating genomic variants from multiple callers</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/POa6e0A0K8I/</link>
		<comments>http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/#comments</comments>
		<pubDate>Wed, 06 Feb 2013 12:25:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[xprize]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[clinical]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[variant]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=338</guid>
		<description><![CDATA[Overview A key goal of the Archon Genomics X Prize infrastructure is development of a set of highly accurate reference genome variants. I&#8217;ve described our work preparing these reference genomes, and specifically defined the challenges behind merging genomic variant calls from multiple technologies and calling methods. Comparing calls from two different calling methods, for example [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=338&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Overview</h2>
<div class="outline-text-2" id="text-1">
<p> A key goal of the <a href="http://genomics.xprize.org/">Archon Genomics X Prize</a> infrastructure is development of a set of highly accurate reference genome variants. I&#8217;ve described our work <a href="http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/">preparing these reference genomes</a>, and specifically defined the challenges behind merging genomic variant calls from multiple technologies and calling methods. Comparing calls from two different calling methods, for example <a href="http://www.broadinstitute.org/gatk/">GATK</a> and <a href="http://samtools.sourceforge.net/mpileup.shtml">samtools mpileup</a>, produces a large number of differing variants which need reconciliation. Taking the overlapping subset from multiple callers is too conservative and will miss real variations, while including all calls is too liberal and introduces false positives. </p>
<p> Here I&#8217;ll describe a fully automated approach for preparing an accurate set of combined variant calls. <a href="https://en.wikipedia.org/wiki/Ensemble_learning">Ensemble machine learning methods</a> are a powerful way to incorporate inputs from multiple models. We use a heuristic and <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machine (SVM)</a> algorithm to consolidate variants, producing a final set of calls with better sensitivity and specificity than current best practice methods. The approach is open source, fully automated and generalizable to both  human diploid sequencing as well as X Prize haploid reference fosmids. </p>
<p> We use a pair of replicates from <a href="http://www.edgebio.com/">EdgeBio&#8217;s</a> clinical exome sequencing pipeline to prepare ensemble variant calls in the  <a href="http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878">widely studied HapMap NA12878 genome</a>. Compared to variants from a single calling method, the ensemble method produced more concordant  variants when comparing the replicates, with fewer discordants. The finalized ensemble calls also provide a useful method to compare strengths and weaknesses of individual calling methods. The implementation is freely available and I&#8217;ll discuss how to get it running on your  data so you can use, critique and extend the methods. This work is a collaboration between <a href="http://compbio.sph.harvard.edu/chb/">Harvard School of Public Health</a>, <a href="http://www.edgebio.com/">EdgeBio</a> and <a href="http://genomeinabottle.org/">NIST</a>. </p>
</p></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Comparison materials and algorithm</h2>
<div class="outline-text-2" id="text-2">
<p> A difficult aspect of evaluating variant calling methods is establishing a reference set of calls. For X Prize we use three established methods, each of which comes with tradeoffs. Metrics like  <a href="http://genome.sph.umich.edu/wiki/SNP_Call_Set_Properties">transition/transversion ratios or dbSNP overlap</a> provide a global picture of calling but are not fine grained enough to distinguish improvements over best practices. Sanger validation restricts you to a manageable subset of calls. Comparisons against public resources like <a href="http://www.1000genomes.org/">1000 genomes</a> bias results towards technologies and callers used in preparing those variant callsets. </p>
<p> Here we employ a fourth method by comparing replicates from EdgeBio&#8217;s clinical exome sequencing pipeline. These are NA12878 samples independently prepared using <a href="http://www.nimblegen.com/products/seqcap/ez/v3/index.html">Nimblegen&#8217;s version 3.0 kit</a> and sequenced on an <a href="http://www.illumina.com/systems/hiseq_2500_1500.ilmn">Illumina HiSeq</a>. By comparing the replicates in regions with 4 or more reads in both samples, we identify the ability of variant calling algorithms to call identical variations with differing coverage and error profiles. </p>
<p> We aligned reads with <a href="http://www.novocraft.com/main/index.php">novoalign</a> and performed deduplication, base recalibration and realignment using <a href="http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0">GATK best practices</a>. With these prepared reads, we called variants with five approaches: </p>
<ul>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html">GATK UnifiedGenotyper</a> &ndash; Bayesian approach to call SNPs and indels,   treating each position independently. </li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html">GATK HaplotypeCaller</a> &ndash; Performs local de-novo assembly to call SNPs   and indels on individual haplotypes. </li>
<li><a href="https://github.com/ekg/freebayes">FreeBayes</a> &ndash; Bayesian calling approach that handles simultaneous   SNPs and indel calling via assessment of regional haplotypes. </li>
<li><a href="http://samtools.sourceforge.net/mpileup.shtml">samtools mpileup</a> &ndash; Uses an approach similar to GATK&#8217;s   UnifiedGenotyper for SNP and indel calling. </li>
<li><a href="http://varscan.sourceforge.net/">VarScan</a> &ndash; Calls variants using a heuristic/statistic approach   eliminating common sources of bias. </li>
</ul>
<p> We took a combined heuristic and machine learning approach to consolidate these five sets of variant calls into a final ensemble callset. The first step is to prepare the union of all variant calls from the input callers, identifying calling methods that support each variant. Secondly, we annotate each variant with metrics including strand bias, allele balance, regional sequence entropy, position of calls within reads, regional base quality and overall genotype likelihoods. We then filter this prepared set of all possible variants to produce a final set of trusted calls. </p>
<p> The first filtering step is to heuristically identify trusted variants based on the number of callers supporting them. This configurable parameter allow you to make an initial conservative cutoff for including variants in the final calls: I trust variants supported by N or more callers. </p>
<p> For the remaining calls that fall below the trusted support cutoff, we distinguish true and false positives using a support vector machine (SVM). The annotated metrics described above are the input parameters and we prepare true and false positives for the classifier using a multi-step process: </p>
<ul>
<li>Use variants found in all callers as the true positive set, and   variants found in a single caller as false positives. Use these   training variants to identify an initial set of below-cutoff   variants to include and exclude.  </li>
<li>With this initial set of below-cutoff true/false variants, re-train    multiple classifiers stratified based on variant characteristics:   variant type (indels vs SNPs), zygosity (heterozygous vs   homozygous) and regional sequence complexity.  </li>
<li>Use these final classifiers to identify included and excluded variants   falling below the trusted calling support cutoff. </li>
</ul>
<p> The final set of calls includes the trusted variants and those that pass the SVM filtering. An input configuration file for variant preparation and assessment allows adjustment of the trusted threshold as well as defining which metrics to use for building the SVM classifiers. </p>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Ensemble calling improvements</h2>
<div class="outline-text-2" id="text-3">
<p> We assess calling sensitivity and specificity by comparing concordant and discordant variant calls between the replicates. To provide a consistent method to measure both SNP and indel correctness, we use the <a href="https://en.wikipedia.org/wiki/Positive_predictive_value">positive predictive value</a> as the percentage of concordant calls between duplicates (concordant variants / (concordant variants + discordant variants)). This is different than the overall concordance rate, which also includes non-variant regions where both replicates do not call a variation. As a result these percentages will be lower if  you expect the 99% values that result when including reference calls. The advantage of this metric is that it&#8217;s easily interpreted as the percentage of concordant called variants. It also allows individual comparisons of SNPs and indels, since the overall number of indels are low compared to the total bases considered.  <a href="http://gatkforums.broadinstitute.org/discussion/48/using-varianteval">GATK&#8217;s VariantEval documentation</a> has a nice discussion of alternative metrics to genotype concordance. </p>
<p> As a baseline we used calls from GATK&#8217;s UnifiedGenotyper to represent a current best practice approach. GATK calls 117079 SNPs, 86.6% of which are concordant. It also calls 14966 indels, with 64.6% concordant. Here are the full concordant and discordant numbers, broken down by variant type and replicate: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:250px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">111159</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">101495</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">9664</td>
</tr>
<tr>
<td class="left">rep1 discordant: total</td>
<td class="right">9857</td>
</tr>
<tr>
<td class="left">rep1 discordant: SNPs</td>
<td class="right">7514</td>
</tr>
<tr>
<td class="left">rep1 discordant: indels</td>
<td class="right">2343</td>
</tr>
<tr>
<td class="left">rep2 discordant: total</td>
<td class="right">11029</td>
</tr>
<tr>
<td class="left">rep2 discordant: SNPs</td>
<td class="right">8070</td>
</tr>
<tr>
<td class="left">rep2 discordant: indels</td>
<td class="right">2959</td>
</tr>
<tr>
<td class="left">het/hom discordant</td>
<td class="right">4181</td>
</tr>
</tbody>
</table>
<p> Our ensemble method produces improvements in both total concordant variants detected and the ratio of concordant to discordants. For SNPs, the ensemble calls add 5345 additional variants to a total of 122424, with an increase in concordance to 87.4%. For indels the major improvement is in removal of discordants: We identify 14184 indels with 67.2% concordant. Here is the equivalent table for the ensemble method: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:250px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">116608</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">107063</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">9545</td>
</tr>
<tr>
<td class="left">rep1 discordant: total</td>
<td class="right">9555</td>
</tr>
<tr>
<td class="left">rep1 discordant: SNPs</td>
<td class="right">7581</td>
</tr>
<tr>
<td class="left">rep1 discordant: indels</td>
<td class="right">1974</td>
</tr>
<tr>
<td class="left">rep2 discordant: total</td>
<td class="right">10445</td>
</tr>
<tr>
<td class="left">rep2 discordant: SNPs</td>
<td class="right">7780</td>
</tr>
<tr>
<td class="left">rep2 discordant: indels</td>
<td class="right">2665</td>
</tr>
<tr>
<td class="left">het/hom discordant</td>
<td class="right">3975</td>
</tr>
</tbody>
</table>
<p> For scientists who have worked to increase sensitivity and specificity of individual variant callers, it&#8217;s exciting to be able to improve both simultaneously. As mentioned above, you can also tune the method to increase specificity or sensitivity by adjusting the support needed for including trusted variants. </p>
<p> The final ensemble callsets from both replicates are available as VCF files from <a href="http://www.genomespace.org/">GenomeSpace</a> in the <code>xprize/NA12878-exome-v_03</code> folder: </p>
<ul>
<li>NA12878 exome ensemble calls, replicate 1 &ndash; <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/xprize/NA12878-exome-v0_3/NA12878-NGv3-LAB1360-A-ensemble.vcf">Variants (VCF)</a>,   <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/xprize/NA12878-exome-v0_3/NA12878-NGv3-LAB1360-A-regions.bed">Callable regions (BED)</a> </li>
<li>NA12878 exome ensemble calls, replicate 2 &ndash; <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/xprize/NA12878-exome-v0_3/NA12878-NGv3-LAB1363-A-ensemble.vcf">Variants (VCF)</a>,   <a href="https://dm.genomespace.org/datamanager/file/Home/chapmanb/xprize/NA12878-exome-v0_3/NA12878-NGv3-LAB1363-A-regions.bed">Callable regions (BED)</a> </li>
</ul></div>
</p></div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">Comparison of calling methods</h2>
<div class="outline-text-2" id="text-4">
<p> Calling the same samples with multiple callers allows direct comparisons between calling methods. The advantage of producing an accurate final set of ensemble calls is that this provides a baseline to evaluate the strengths and weaknesses of different calling methods. The figure below compares concordant, missing variants and additional variants called by each of the 5 methods in comparison with the consolidated ensemble calls: </p>
<p> <img src="http://bcbio.files.wordpress.com/2013/01/wpid-na12878-ngv3-lab1360-a-callers.png?w=600" alt="Concordance/discordance for calling methods" width="600" /> </p>
<ul>
<li>GATK UnifiedGenotyper provides the best SNP calling, followed   closely by samtools mpileup.  </li>
<li>For indel calling, the GATK HaplotypeCaller produces the most   concordant calls followed by UnifiedGenotyper and FreeBayes.   UnifiedGenotyper does good as well, but is conservative and has the   fewest additional indels. FreeBayes and GATK HaplotypeCaller both   provide resolution of individual haplotypes which helps in regions   with heterozygous indels or closely spaced SNPs and indels.  </li>
<li>If you want to use a single variant caller, GATK UnifiedGenotyper   does the best overall job.   </li>
<li>If you wanted to choose free open-source tools for calling, I would   recommend samtools for SNP calling and FreeBayes for indel calling. </li>
</ul>
<p> Variant calling methods with recommendations for both calling and filtering provide the best out of the box performance. An advantage of GATK and samtools is they provide calling, variant quality metrics, and filtering. On the other side, FreeBayes is a good example of a powerful tool that takes some time to learn to filter optimally. One potential source of bias in producing the individual calls is that I personally have more experience with GATK tools so may have room to improve with the other callers. </p>
</p></div>
</p></div>
<div id="outline-container-5" class="outline-2">
<h2 id="sec-5">Availability and usage</h2>
<div class="outline-text-2" id="text-5">
<p> Combining multiple calling approaches improves both sensitivity and specificity of the final set of variants. The downside is the need to run and coordinate calls from all of the different callers. To mitigate this, we developed an automated pipeline that ties together multiple open-source tools using two custom components: </p>
<ul>
<li><a href="https://github.com/chapmanb/bcbio-nextgen">bcbio-nextgen</a> &ndash; A Python framework to run a full sequencing   analysis pipeline from input fastq files to consolidated ensemble   variant calls. It supports multiple aligners and variant callers,   and enables distributed work over multiple cores on a large machine   or multiple machines in a cluster environment.  </li>
<li><a href="https://github.com/chapmanb/bcbio.variation">bcbio.variation</a> &ndash; A Clojure toolkit build on top of GATK&#8217;s variant   API that provides ensemble call preparation as well as more general   functionality for normalizing and comparing variants produced by   multiple callers. </li>
</ul>
<p> bcbio-nextgen has a script, built on functionality in the <a href="http://cloudbiolinux.org">CloudBioLinux</a> project, that automates installation of associated variant callers and data dependencies: </p>
<pre class="example">wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py install_directory data_directory
</pre>
<p>
<p> With the dependencies installed, you describe the input files and analysis with a <a href="https://en.wikipedia.org/wiki/YAML">YAML</a> formatted input file. The  <a href="https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/NA12878-ensemble.yaml">NA12878 ensemble configuration file</a> used for this analysis provides a useful starting point. Run the analysis, distributed on multiple cores, with: </p>
<pre class="example">bcbio_nextgen.py bcbio_system.yaml ensemble_sample.yaml -n 8
</pre>
<p>
<p> The <a href="https://bcbio-nextgen.readthedocs.org/en/latest/">bcbio-nextgen documentation</a> provides additional details about configuration inputs and distributed processing. The framework generally handles the automation and processing involved with high throughput sequencing analysis. </p>
<p><a href="http://www.edgebio.com/">EdgeBio</a> kindly made the NA12878 datasets used in this analysis publicly available:</p>
<ul>
<li><a href="https://dm.genomespace.org/datamanager/v1.0/file/Home/EdgeBio/CLIA_Examples/NA12878-NGv3-LAB1360-A">NA12878-NGv3-LAB1360-A (replicate 1)</a>
<li><a href="https://dm.genomespace.org/datamanager/v1.0/file/Home/EdgeBio/CLIA_Examples/NA12878-NGv3-LAB1363-A">NA12878-NGv3-LAB1363-A (replicate 2)</a>
</ul>
</p>
<p> I welcome feedback on the approach, data or tools and am actively working to extend this and make it easier to use. As re-sequencing becomes increasingly important for human health applications it is critical that we develop open, shared best-practice workflows to handle the data processing. This allows us to focus back on the fun and difficult work of understanding the biology. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/338/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/338/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=338&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=POa6e0A0K8I:AnhtbmJdvz0:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=POa6e0A0K8I:AnhtbmJdvz0:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=POa6e0A0K8I:AnhtbmJdvz0:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=POa6e0A0K8I:AnhtbmJdvz0:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=POa6e0A0K8I:AnhtbmJdvz0:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/POa6e0A0K8I" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2013/01/wpid-na12878-ngv3-lab1360-a-callers.png" medium="image">
			<media:title type="html">Concordance/discordance for calling methods</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/</feedburner:origLink></item>
		<item>
		<title>Genomics X Prize public phase update: variant classification and de novo calling</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/2pRQmN7889M/</link>
		<comments>http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/#comments</comments>
		<pubDate>Mon, 17 Sep 2012 12:41:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[xprize]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[clinical]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[gatk]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[variant]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=311</guid>
		<description><![CDATA[Background Last month I described our work at HSPH and EdgeBio preparing reference genomes for the Archon Genomics X Prize public phase, detailing methods used in the first version of our NA19239 variant calls. We&#8217;ve been steadily improving the calling approaches, and released version 0.2 on the X Prize validation website and GenomeSpace. Here I&#8217;ll [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=311&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Background</h2>
<div class="outline-text-2" id="text-1">
<p> Last month <a href="#bc-public-phase">I described our work</a> at <a href="http://compbio.sph.harvard.edu/chb/">HSPH</a> and <a href="http://www.edgebio.com/">EdgeBio</a> preparing reference genomes for the <a href="http://genomics.xprize.org/">Archon Genomics X Prize</a> public phase, detailing methods used in the first version of our NA19239 variant calls. We&#8217;ve been steadily improving the calling approaches, and released version 0.2 on the <a href="https://validationprotocol.org/">X Prize validation website</a> and <a href="http://www.genomespace.org/">GenomeSpace</a>. Here I&#8217;ll describe the improvements we&#8217;ve made over the last month, focusing on two specific areas: </p>
<ul>
<li><i>De novo</i> calling: Zam Iqbal suggested using <a href="http://cortexassembler.sourceforge.net/index_cortex_var.html">his cortex_var <i>de novo</i> variant caller</a>   in addition to the current GATK, FreeBayes and samtools callers.   With his help, we&#8217;ve included these calls in this release, and   provide comparisons between <i>de novo</i> and alignment based methods.  </li>
<li>Improved variant classification: Consolidating variant calls from   multiple callers involves making tough choices about when to include   or exclude variants. I&#8217;ll describe the details of selecting metrics   for use in SVM classification and filtering of variants. </li>
</ul>
<p> Our goal is to iteratively improve our calling and variant preparation to create the best possible set of reference calls. I&#8217;d be happy to talk more with anyone working on similar problems or with insight into useful ways to improve our current callsets. We have a <a href="http://agxpprotocol.xprize.org/agxp">Get Satisfaction site</a> for discussion and feedback, and have offered a <a href="http://agxpprotocol.xprize.org/agxp/topics/_2_500_bioinformatics_challenge">$2500 prize for helpful comments</a>. </p>
<p> As a reminder, all of the code and data used here is freely available: </p>
<ul>
<li>The <a href="#bcbio-variation">variant analysis infrastructure</a>, built on top of <a href="http://www.broadinstitute.org/gatk/">GATK,</a> automates   genome preparation, normalization and comparison. It provides a full   pipeline, driven by simple configuration files, for consolidating   multiple variant calls.  </li>
<li>The combined variant calls, including training data and potential true   and false positives, are available from <a href="http://www.genomespace.org/">GenomeSpace</a>:   <code>Public/chapmanb/xprize/NA19239-v0_2</code>.  </li>
<li>The individual variant calls for each technology and calling method are   also available from <a href="http://www.genomespace.org/">GenomeSpace</a>: <code>Public/EdgeBio/PublicData/Release1</code>. </li>
</ul></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2"><i>de novo</i> variant calling with cortex_var</h2>
<div class="outline-text-2" id="text-2">
<p> <i>de novo</i> variant calling performs reference-free assembly of either local or global genome regions, then subsequently uses these assemblies to call variants relative to a known reference. The advantage is that assemblies can avoid errors associated with mapping to the reference, helping resolve large variations as well as small variations near problem alignments or low complexity regions. </p>
<p> Hybrid approaches that use localized <i>de novo</i> assembly in variant regions help mitigate the extensive computational requirements associated with whole-genome assembly. <a href="http://www.completegenomics.com/FAQs/Assembly-Mapping-and-Variant-Calling/">Complete Genomics variant calling</a> and <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html">GATK 2.0&#8242;s Haplotype Caller</a> both provide pipelines for hybrid <i>de novo</i> assembly in variant detection. The <a href="https://github.com/lh3/fermi">fermi</a> and <a href="https://github.com/jts/sga">SGA</a> assemblers are also used in variant calling, although the paths from assembly to variants are not as automated. </p>
<p> Thanks to Zam&#8217;s generous assistance, we used <a href="http://cortexassembler.sourceforge.net/index_cortex_var.html">cortex_var</a> for localized <i>de novo</i> assembly and variant calling within individual fosmid boundaries. As a result, <a href="http://cloudbiolinux.org/">CloudBioLinux</a> now contains <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/custom/bio_nextgen.py#L576">automated build instructions for cortex_var</a> , handling binary builds for multiple k-mer and color combinations. An <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/cortex.py">automated cortex_var pipeline</a>, part of the <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">bcbio-nextgen</a> toolkit, runs the processing workflow: </p>
<ul>
<li>Start with reads aligned to fosmid regions using novoalign.  </li>
<li>For each fosmid region, extract all mapped reads along with a local   reference genome for variant calling.  </li>
<li><i>de novo</i> assemble all reads in the fosmid region and call variants   against the local reference genome using cortex_var&#8217;s Bubble Caller.  </li>
<li>Remap regional variant coordinates back to the full genome.  </li>
<li>Combine all regional calls into the final set of cortex_var calls. </li>
</ul>
<p> Directly comparing GATK and cortex_var calls shows similar levels of concordance and discordance as the GATK/samtools comparison from the last post: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">153787</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">130913</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">22874</td>
</tr>
<tr>
<td class="left">GATK discordant: total</td>
<td class="right">20495</td>
</tr>
<tr>
<td class="left">GATK discordant: SNPs</td>
<td class="right">6522</td>
</tr>
<tr>
<td class="left">GATK discordant: indels</td>
<td class="right">13973</td>
</tr>
<tr>
<td class="left">cortex_var discordant: total</td>
<td class="right">26790</td>
</tr>
<tr>
<td class="left">cortex_var discordant: SNPs</td>
<td class="right">21342</td>
</tr>
<tr>
<td class="left">cortex_var discordant: indels</td>
<td class="right">5448</td>
</tr>
</tbody>
</table>
<p> 11% of the GATK calls and 14% of the cortex_var calls are discordant. The one area where cortex_var does especially well is on indels: 19% of the cortex_var indels do not agree with GATK, in comparison with 37% of the GATK calls and 25% of the samtools calls. The current downside to this is SNP calling, where cortex_var has 3 times more discordant calls than GATK. </p>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Selection of classification metrics</h2>
<div class="outline-text-2" id="text-3">
<p> Overlapping variant calls from different calling methods (GATK, FreeBayes, samtools and cortex_var) and sequencing technologies (Illumina, SOLiD and IonTorrent) produces 170,286 potential calls in the fosmid regions. 58% (99,227) of these are present in all callers and technologies, so we need to do better than the intersection in creating a consolidated callset. </p>
<p> As detailed in the previous post, we filter the full set in two ways. The first is to keep trusted variants based on their presence in a defined number of technologies or callers. We currently have an inclusive set of criteria: keep variants present in either 4 out of the 7 callsets, 2 distinct technologies, or 3 distinct callers. This creates a trusted set containing 95% (162,202) of the variants. Longer term the goal is to reduce the trusted count and rely on automated filtering approaches based on input metrics. </p>
<p> This second automated filtering step uses a support vector machine (SVM) to evaluate the remaining variants. We train the SVM on potential true positives from variants that overlap in all callers and technologies, and potential false positives found uniquely in one single caller. The challenge is to find useful metrics associated with these training variants that will help provide discrimination. </p>
<p> In version 0.1 we filtered with a vanilla set of metrics: depth and variant quality score. To identify additional metrics, we used a great variant visualization tool developed by <a href="http://keminglabs.com/">Keming Labs</a> alongside <a href="http://compbio.sph.harvard.edu/chb/">HSPH</a> and <a href="http://www.edgebio.com/">EdgeBio</a>. I&#8217;ll write up more details about the tool once we have a demonstration website but the code is already <a href="https://github.com/lynaghk/vcf">available on GitHub</a>. </p>
<p> To remove variants preferentially associated with poorly mapping or misaligned reads, we identified two useful metrics. <code>ReadPosEndDist</code>, written as a <a href="https://github.com/chapmanb/bcbio.variation/blob/master/src/java/bcbio/variation/annotate/ReadPosEndDist.java">GATK annotation</a> by <a href="http://www.nist.gov/mml/biochemical/biomolecular/jzook_bio.cfm">Justin Zook at NIST</a>, identifies variants primarily supported by calls at the ends of reads. Based on visual examination, these associate with difficult to map regions, as identified by <a href="http://sourceforge.net/apps/mediawiki/gma-bio/index.php?title=Genomic_Dark_Matter:_The_limitations_of_short_read_mapping">Genome Mappability Scores</a>: </p>
<p>    <span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/mPvrxYPfm2I?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span>
<p> Secondly, we identified problematic allele balances that differ from the expected ratios. For haploid fosmid calls, we expect 100% of reads to support variants and 0% to support reference calls (in diploid calls, you also need to handle heterozygotes with 50% expected allele balance). In practice, the distribution of reads can differ due to sequencer and alignment errors. We use a metric that measures deviation from the expected allele balance and associates closely with variant likelihoods: </p>
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/iTqmjSCCe4M?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></div>
</p></div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">Improved consolidated calls</h2>
<div class="outline-text-2" id="text-4">
<p> To assess the influence of adding <i>de novo</i> calls and additional filtering metrics on the resulting call set, we compare against whole genome Illumina and Complete Genomics calls for NA19239. Previously we&#8217;d noticed two major issues during this comparison: a high percentage of discordant indel calls and a technology bias signaled by better concordance with Illumina than Complete. </p>
<p> The comparison between fosmid and Illumina data shows a substantial improvement in the indel bias. Previously 46% of the total indel calls were discordant, indicative of a potential false positive problem. With <i>de novo</i> calls and improved filtering, we&#8217;ve lowered this to only 10% of the total calls. </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">147684</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">133861</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">13823</td>
</tr>
<tr>
<td class="left">fosmid discordant: total</td>
<td class="right">7519</td>
</tr>
<tr>
<td class="left">fosmid discordant: SNPs</td>
<td class="right">5856</td>
</tr>
<tr>
<td class="left">fosmid discordant: indels</td>
<td class="right">1663</td>
</tr>
<tr>
<td class="left">Illumina discordant: total</td>
<td class="right">5640</td>
</tr>
<tr>
<td class="left">Illumina discordant: SNPs</td>
<td class="right">1843</td>
</tr>
<tr>
<td class="left">Illumina discordant: indels</td>
<td class="right">3797</td>
</tr>
</tbody>
</table>
<p> This improvement comes with a decrease in the total number of concordant indel calls, since we moved from 17,816 calls to 13,823. However a large number of these seemed to be Illumina specific since 60% of the previous calls were discordant when compared with Complete Genomics. The new callset reduces this discrepancy and only 22% of the indels are now discordant: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">139155</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">127243</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">11912</td>
</tr>
<tr>
<td class="left">fosmid discordant: total</td>
<td class="right">15484</td>
</tr>
<tr>
<td class="left">fosmid discordant: SNPs</td>
<td class="right">12028</td>
</tr>
<tr>
<td class="left">fosmid discordant: indels</td>
<td class="right">3456</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: total</td>
<td class="right">7311</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: SNPs</td>
<td class="right">4972</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: indels</td>
<td class="right">2273</td>
</tr>
</tbody>
</table>
<p> These comparisons provide some nice confirmation that we&#8217;re moving in the right direction on filtering. As before, we extract potential false positives and false negatives to continue to refine the calls: potential false positives are those called in the fosmid dataset and in neither of the Illumina or Complete Genomics sets. Potential false negatives are calls that both Illumina and Complete agree on that the fosmid calls lack. </p>
<p> In the new callsets, there are 5,499 (3.5%) potential false positives and 1,422 (0.9%) potential false negatives. We&#8217;ve reduced potential false positives in the previous set from 10% with a slight increase in false negatives. These subsets are available along with the full callset on GenomeSpace. We&#8217;re also working hard on an NA12878 callset with equivalent approaches and will make that available soon for community feedback. </p>
<p> I hope this discussion, open source code, and dataset release is useful to everyone working on problems of improving variant calling accuracy and filtering. I welcome feedback on calling, consolidation methods, interesting metrics to explore, machine learning or any of the other topics discussed here. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/311/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/311/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=311&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=2pRQmN7889M:bQByGyZ3bsI:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=2pRQmN7889M:bQByGyZ3bsI:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=2pRQmN7889M:bQByGyZ3bsI:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=2pRQmN7889M:bQByGyZ3bsI:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=2pRQmN7889M:bQByGyZ3bsI:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/2pRQmN7889M" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2012/09/17/genomics-x-prize-public-phase-update-variant-classification-and-de-novo-calling/</feedburner:origLink></item>
		<item>
		<title>Genomics X Prize public phase: reference genome preparation and comparisons to Illumina and Complete Genomics</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/TOGpMlBspH4/</link>
		<comments>http://bcbio.wordpress.com/2012/08/15/genomics-x-prize-public-phase-reference-genome-preparation-and-comparisons-to-illumina-and-complete-genomics/#comments</comments>
		<pubDate>Wed, 15 Aug 2012 13:10:00 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[xprize]]></category>
		<category><![CDATA[bioinformatics variant ngs xprize clinical gatk clojure]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=272</guid>
		<description><![CDATA[Background The Archon Genomics X Prize, presented by Express Scripts, is a 10 million dollar competition to establish highly accurate clinical grade sequencing and variation detection methods. Our group at Harvard School of Public Health works with the EdgeBio team on developing the infrastructure for the competition: identify variations in the grading genomes and provide [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=272&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Background</h2>
<div class="outline-text-2" id="text-1">
<p> The <a href="http://genomics.xprize.org/">Archon Genomics X Prize, presented by Express Scripts</a>, is a 10 million dollar competition to establish highly accurate clinical grade sequencing and variation detection methods. Our group at <a href="http://compbio.sph.harvard.edu/chb/">Harvard School of Public Health</a> works with the <a href="http://www.edgebio.com/">EdgeBio</a> team on developing the infrastructure for the competition: identify variations in the grading genomes and provide software to compare these reference variation sets against a competitor&#8217;s list of variations. </p>
<p> The exciting aspect of the Genomics X Prize is that it enables open comparisons between sequencing technologies and variant calling methodologies. Sequencing genomes to the high degree of accuracy sufficient for clinical usage is a difficult, open, problem. Here I&#8217;ll present detailed numbers comparing variants called by different sequencing technologies and variant callers. </p>
<p> The public phase of the Genomics X Prize starts today, August 15th. The goal of this six month period is to have an open dialog with everyone working in the sequencing and variant calling communities. We want to refine our methods to provide the most accurate and fair variant calling for the reference genomes. To start the discussion we&#8217;ve prepared: </p>
<ul>
<li>Variant calls for a HapMap individual:   <a href="http://ccr.coriell.org/Sections/Collections/NHGRI/Yoruba.aspx?PgId=128&amp;coll=HG">NA19239, a Yoruba male from Ibadan, Nigeria</a>.   We sequenced haploid fosmids from NA19239 with two   technologies: <a href="https://en.wikipedia.org/wiki/Illumina_(company)">Illumina</a> and <a href="https://en.wikipedia.org/wiki/ABI_Solid_Sequencing">SOLiD</a>; and called variants with three   different methods:   <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html">GATK Unified Genotyper</a>, <a href="https://github.com/ekg/freebayes">FreeBayes</a>, and <a href="#samtools">SAMtools mpileup</a>.   We combined these calls into a unified final call set, NA19239   version 0.1, that I discuss in detail below.  </li>
<li>Fully documented methods, access to all data files used, and a public   scoring site. The <a href="https://edgebio.atlassian.net/wiki/display/AGXPPUB/Home">X Prize Validation wiki</a> contains detailed   information about sequencing, variant calling, validation and   scoring. The <a href="https://validationprotocol.org/">validationprotocol.org</a> website provides a simple way   for anyone to compare their variant calls against the public   reference genomes. It encourages submission and analysis in public   tools like <a href="http://usegalaxy.org">Galaxy</a> through transparent interoperability with   <a href="http://www.genomespace.org/">GenomeSpace</a>.  </li>
<li><a href="https://github.com/chapmanb/bcbio.variation">An automated variant analysis infrastructure</a> built on top of the   Broad&#8217;s <a href="http://www.broadinstitute.org/gatk/">Genome Analysis Toolkit (GATK)</a> that performs comparisons as   well as variant unification. This is a generally useful toolkit of   functionality to manipulate variants, and we <a href="http://chapmanb.github.com/bcbio.variation/presentations/variation_bosc_2012/variation_chapman.pdf">presented an overview</a>   at the <a href="http://open-bio.org/wiki/BOSC_2012">Bioinformatics Open Source Conference</a> last month.   This is an open-source community developed project, and has   received great contributions from the <a href="http://www.nist.gov/mml/biochemical/biomolecular/genome_in_a_bottle_consortium.cfm">Genome in a Bottle Consortium</a> at   <a href="#justin-nist">the National Institute of Standards and Technology</a>. </li>
</ul>
<p>   The goal of this writeup, and the X Prize public phase, is to iterate over calling and unification methods to improve our algorithms and approaches. Rather than promoting or disparaging any particular technology or calling method, we&#8217;re instead providing full transparency and a good-faith effort to combining approaches. Our hope is that this will help engage the community, encourage feedback, and result in a unbiased and accurate set of reference genomes for the competition. </p>
</p></div>
</p></div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Unification of variant calls</h2>
<div class="outline-text-2" id="text-2">
<p> For the August 15th public phase kickoff, we prepared a reference data set of NA19239 based on <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3202284/#__sec1title">pooled sequencing of haploid fosmid clones</a>. The callable regions of these clones totaled 129,513,026 total bases, covering ~4% of the 3.1 billion bases in the human genome. We use fosmid clones to obtain complete regional haplotype coverage and focus on partial genome coverage to achieve high coverage depth and accuracy for assessed regions. </p>
<p> Version 0.1 of the NA19239 reference set uses variant calls from two technologies: Illumina and SOLiD; and three callers: GATK&#8217;s Unified Genotyper, FreeBayes and SAMtools. To move from these data to a unified call set we: </p>
<ul>
<li>Align to <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml">GRCh37 reference genome</a> with <a href="http://www.novocraft.com/main/index.php">Novoalign</a>. </li>
<li>Perform post-processing and indel realignment with   <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner.html">GATK&#8217;s IndelRealigner</a>. </li>
<li>Perform variant calling with GATK&#8217;s UnifiedGenotyper, FreeBayes and   samtools mpileup. </li>
<li>Do pairwise comparisons between all technology/caller approaches. </li>
<li>Generate the union of all possible calls and merge with initial GATK   calls, recalling any no-call positions at expected sites. </li>
<li>Use validation information on variants found in multiple   technologies, plus metrics associated with common variants, to filter   the full call set to a final set of trusted calls. </li>
</ul>
<p> The challenging decisions begin when merging and filtering the final call set. This requires careful bookkeeping and variant representation to ensure identical variants are directly comparable, followed by setting cutoffs for variant inclusion. </p>
</p></div>
<div id="outline-container-2-1" class="outline-3">
<h3 id="sec-2-1">Comparison details</h3>
<div class="outline-text-3" id="text-2-1">
<p> The details of variant comparisons introduce an additional layer of complexity during assessment. The approach we&#8217;ve taken is create a normalized set of variants so all comparison differences are due to actual call differences rather than variant representation. We split multiple nucleotide polymorphisms into individual calls, split complex indel-variant combinations, and left-align remaining variants. </p>
<p> For haploid/diploid comparisons, we establish haplotype blocks for the diploid sequence based on phasing provided in the input variant file, and then compare the best matching haplotype to our fosmid reference. Single nucleotide polymorphisms and indels less than 30bp require exact machines between two comparison genomes. Larger indels and structural variations receive more flexible matching with confidence intervals around start and end coordinates. </p>
<p> The goal of the normalized, compared variants is to reflect real underlying differences in calling approaches relative to how well we can currently resolve variation endpoints. </p>
</p></div>
</p></div>
<div id="outline-container-2-2" class="outline-3">
<h3 id="sec-2-2">Comparisons between variation callers</h3>
<div class="outline-text-3" id="text-2-2">
<p> For a concrete example of two different variant calling approaches, below is a table comparing GATK variants against samtools calls for the NA19239 sample, using identically aligned and post-processed BAMs: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">160851</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">136146</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">24705</td>
</tr>
<tr>
<td class="left">GATK discordant: total</td>
<td class="right">13925</td>
</tr>
<tr>
<td class="left">GATK discordant: SNPs</td>
<td class="right">1315</td>
</tr>
<tr>
<td class="left">GATK discordant: indels</td>
<td class="right">12610</td>
</tr>
<tr>
<td class="left">samtools discordant: total</td>
<td class="right">25368</td>
</tr>
<tr>
<td class="left">samtools discordant: SNPs</td>
<td class="right">17247</td>
</tr>
<tr>
<td class="left">samtools discordant: indels</td>
<td class="right">8121</td>
</tr>
</tbody>
</table>
<p> The number of discordant variant calls is high, making up 8% of the GATK calls and 14% of the samtools calls, and samtools calls almost 16,000 additional SNPs compared to GATK. As a result, a large percentage of variants require making hard decisions: are those additional calls interesting, real variants in samtools and false negatives in the GATK calls? Or conversely, are they false positives in samtools that GATK correctly excludes? </p>
</p></div>
</p></div>
<div id="outline-container-2-3" class="outline-3">
<h3 id="sec-2-3">Comparisons between sequencing technologies</h3>
<div class="outline-text-3" id="text-2-3">
<p> There is a similar level of discrepancy when comparing variant calls between Illumina and SOLiD sequencing. Below is a comparison between GATK Unified genotyper calls on the two technologies: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">135263</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">122267</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">12996</td>
</tr>
<tr>
<td class="left">Illumina discordant: total</td>
<td class="right">39491</td>
</tr>
<tr>
<td class="left">Illumina discordant: unique</td>
<td class="right">7079</td>
</tr>
<tr>
<td class="left">Illumina discordant: SNPs</td>
<td class="right">15188</td>
</tr>
<tr>
<td class="left">Illumina discordant: indels</td>
<td class="right">24303</td>
</tr>
<tr>
<td class="left">SOLiD discordant: total</td>
<td class="right">16022</td>
</tr>
<tr>
<td class="left">SOLiD discordant: unique</td>
<td class="right">3800</td>
</tr>
<tr>
<td class="left">SOLiD discordant: SNPs</td>
<td class="right">3908</td>
</tr>
<tr>
<td class="left">SOLiD discordant: indels</td>
<td class="right">12114</td>
</tr>
</tbody>
</table>
<p> Unique coverage explains some differences: 4% of the Illumina variants (7079) and 2.5% (3800) of the SOLiD variants were uniquely covered by the technologies. However, the remaining variant discordant calls are on the order of those seen in the technology comparisons. Adding to the complexity, we find only 84% of the total concordant variants compared to the Illumina only GATK/samtools comparison. </p>
</p></div>
</p></div>
<div id="outline-container-2-4" class="outline-3">
<h3 id="sec-2-4">Unified call set</h3>
<div class="outline-text-3" id="text-2-4">
<p> The level of discrepancy between calling methods and sequencing approaches introduces complexity in the preparation of the final call set: How much evidence does a variant need for inclusion? Can single calls be true positives if supported by high confidence values? This will require extensive refinement throughout the public phase. For the initial version 0.1 release of NA19239, we took the following high level approach to filtering: </p>
<ul>
<li>Retain variants found in 4 out of 6 calling/technology methods   (including genotyping data). </li>
<li>Retain variants identified across multiple technologies. </li>
<li>Retain variants found in both more stringent (GATK) and more lenient   (FreeBayes, samtools) callers. </li>
<li>Assess remaining variants using a <a href="https://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machine</a> with   quality score, read depth and <a href="https://github.com/chapmanb/bcbio.variation/blob/master/src/java/bcbio/variation/annotate/ReadPosEndDist.java#L32">variant distance from read ends</a>   metrics, training the classifier on likely true and false positives   from the pairwise overlap comparisons. </li>
</ul>
<p> The result is a unified call set of 171,009 variants derived from all technologies and callers, that we&#8217;re releasing as NA19239 version 0.1. </p>
</p></div>
</p></div>
</p></div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">Comparisons with whole genome datasets</h2>
<div class="outline-text-2" id="text-3">
<p> To assess the quality of the unified call set, we compared to two public genomes: </p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Complete_Genomics">Complete Genomics&#8217;s</a> NA19239 variants from their <a href="http://www.completegenomics.com/public-data/">public whole genome datasets</a>, </li>
<li>An Illumina whole genome dataset for NA19239 at 30x coverage. </li>
</ul>
<p> This provides us with three independent call sets to assess variability between approaches. To provide a baseline, here is the comparison of the Illumina and Complete Genomics calls in our assessment regions: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">Overall genotype concordance</td>
<td class="right">98.47</td>
</tr>
<tr>
<td class="left">concordant: total</td>
<td class="right">205868</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">186365</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">19503</td>
</tr>
<tr>
<td class="left">Illumina discordant: total</td>
<td class="right">31267</td>
</tr>
<tr>
<td class="left">Illumina discordant: SNPs</td>
<td class="right">19334</td>
</tr>
<tr>
<td class="left">Illumina discordant: indels</td>
<td class="right">11933</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: total</td>
<td class="right">15174</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: SNPs</td>
<td class="right">9586</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: indels</td>
<td class="right">5510</td>
</tr>
</tbody>
</table>
<p> We see familiar discordance rates: 13% of the Illumina calls and 7% of the Complete Genomics calls differ. Since it&#8217;s diploid versus diploid, this comparison includes all heterozygous variant matches. As a result the numbers in this comparison will be higher, but it is a good order of magnitude approximation for looking at our fosmid reference set versus each individual technology. </p>
</p></div>
<div id="outline-container-3-1" class="outline-3">
<h3 id="sec-3-1">Illumina</h3>
<div class="outline-text-3" id="text-3-1">
<p> The comparison against the Illumina whole genome variant calls contains 12% discordant calls in our fosmid reference set, with 79% of those being indel differences. Indels are notoriously more difficult to identify and assess, so this will be an area of increased focus as we move forward: </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">150420</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">132604</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">17816</td>
</tr>
<tr>
<td class="left">fosmid discordant: total</td>
<td class="right">19624</td>
</tr>
<tr>
<td class="left">fosmid discordant: SNPs</td>
<td class="right">4165</td>
</tr>
<tr>
<td class="left">fosmid discordant: indels</td>
<td class="right">15459</td>
</tr>
<tr>
<td class="left">Illumina discordant: total</td>
<td class="right">5475</td>
</tr>
<tr>
<td class="left">Illumina discordant: SNPs</td>
<td class="right">2952</td>
</tr>
<tr>
<td class="left">Illumina discordant: indels</td>
<td class="right">2523</td>
</tr>
</tbody>
</table></div>
</p></div>
<div id="outline-container-3-2" class="outline-3">
<h3 id="sec-3-2">Complete Genomics</h3>
<div class="outline-text-3" id="text-3-2">
<p> The Complete Genomics comparison has 17% discordant calls including 2x more discordant SNP calls. This highlights another key area of call set refinement: identifying and correcting for technology specific calls. </p>
<table border="1" cellspacing="0" cellpadding="6" rules="all" style="width:350px;">
<caption></caption>
<col class="left" />
<col class="right" />
<tbody>
<tr>
<td class="left">concordant: total</td>
<td class="right">139559</td>
</tr>
<tr>
<td class="left">concordant: SNPs</td>
<td class="right">126296</td>
</tr>
<tr>
<td class="left">concordant: indels</td>
<td class="right">13263</td>
</tr>
<tr>
<td class="left">fosmid discordant: total</td>
<td class="right">29883</td>
</tr>
<tr>
<td class="left">fosmid discordant: SNPs</td>
<td class="right">10162</td>
</tr>
<tr>
<td class="left">fosmid discordant: indels</td>
<td class="right">19721</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: total</td>
<td class="right">7571</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: SNPs</td>
<td class="right">5542</td>
</tr>
<tr>
<td class="left">Complete Genomics discordant: indels</td>
<td class="right">1965</td>
</tr>
</tbody>
</table></div>
</p></div>
</p></div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">Summary</h2>
<div class="outline-text-2" id="text-4">
<p> The initial NA19239 public genome for the Genomics X Prize provides unified variant calls based on two sequencing technologies and three calling methods. I&#8217;ve delved into a lot of details on our approaches, challenges and goals with the hopes of encouraging suggestions from other researchers working on these problems. We&#8217;re especially interested in feedback on these areas of ongoing research: </p>
<ul>
<li>Digging deeper into potential false positives and negatives: By   combining comparison information between the unified callset and   external resources, we can identify 17654 fosmid variants (10%) not   found in both the Complete Genomics and Illumina datasets. These   require additional in-depth analysis to classify as uniquely   identified fosmid calls or potential false positives. Similarly,   Illumina and Complete Genomics combine to call 1228 variants (0.7%)   that are not in the fosmid call set. These need examination to   classify as fosmid false negatives, or false positive calls in the   individual technologies.  </li>
<li>Additional public genomes: We&#8217;re actively working with teams like   the <a href="http://www.nist.gov/mml/biochemical/biomolecular/genome_in_a_bottle_consortium.cfm">Genome in a Bottle Consortium</a> and <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml">Genome Research Consortium</a> to   compare with their reference sets and approaches. Our next target   public genome is NA12878, used in both of these projects and widely   studied.  </li>
<li>Improve variant representation and assessment: The   <a href="https://github.com/chapmanb/bcbio.variation">variation software framework</a> works hard to make variant   representations as uniform as possible. Indels are especially   challenging and we welcome practical examples of regions that need   additional standardization.  </li>
<li>Refine approaches to unifying variant calls: What we learn from the   additional inspection of discordant variants can help inform   improved approaches to filtering. This is a great opportunity to   develop generalized, reusable methods for combining variants from   multiple approaches. </li>
</ul>
<p>   The call sets used here are available as public data folders on <a href="http://www.genomespace.org/">GenomeSpace</a>: </p>
<ul>
<li>Public/chapmanb/xprize/NA19239-v0_1 &ndash; The combined final call set   along with training true/false positives and Illumina/Complete   Genomics comparison based potential false positives and negatives.  </li>
<li>Public/EdgeBio/PublicData/Release1 &ndash; All of the raw input data,   including fastq files, BAM alignments and individual variant calls. </li>
</ul>
<p> Combined with the open source code and configurations, we hope this will provided interested researchers with all the raw materials needed to reproduce and extend these analyses. Your feedback and suggestions are very welcome. </p>
</p></div>
</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/272/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/272/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=272&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=TOGpMlBspH4:5HGGgDfs3V8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=TOGpMlBspH4:5HGGgDfs3V8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=TOGpMlBspH4:5HGGgDfs3V8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=TOGpMlBspH4:5HGGgDfs3V8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=TOGpMlBspH4:5HGGgDfs3V8:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/TOGpMlBspH4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2012/08/15/genomics-x-prize-public-phase-reference-genome-preparation-and-comparisons-to-illumina-and-complete-genomics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2012/08/15/genomics-x-prize-public-phase-reference-genome-preparation-and-comparisons-to-illumina-and-complete-genomics/</feedburner:origLink></item>
		<item>
		<title>Extending the GATK for custom variant comparisons using Clojure</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/VGPyBs179Ig/</link>
		<comments>http://bcbio.wordpress.com/2012/03/04/extending-the-gatk-for-custom-variant-comparisons-using-clojure/#comments</comments>
		<pubDate>Mon, 05 Mar 2012 01:13:37 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[variation]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=263</guid>
		<description><![CDATA[The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=263&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">Genome Analysis Toolkit (GATK)</a> is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page">Genome Sequencing and Analysis Group at the Broad Institute</a>, exposes a <a href="http://en.wikipedia.org/wiki/Mapreduce">Map/Reduce framework</a> allowing developers to code custom tools taking advantage of support for: BAM Alignment files through <a href="http://picard.sourceforge.net/">Picard</a>, BED and other interval file formats through <a href="http://code.google.com/p/tribble/">Tribble</a>, and variant data in <a href="http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41">VCF format</a>.</p>
<p>Here I&#8217;ll show how to utilize the GATK API from <a href="http://clojure.org/">Clojure</a>, a functional, dynamic programming language that targets the Java Virtual Machine. We&#8217;ll:</p>
<ul>
<li>Write a GATK walker that plots variant quality scores using the Map/Reduce API.</li>
<li>Create a custom annotation that adds a mean neighboring base quality metric using the <a href="http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator">GATK VariantAnnotator</a>.</li>
<li>Use the VariantContext API to parse and access variant information in a VCF file.</li>
</ul>
<p>The <a href="https://github.com/chapmanb/bcbio.variation">Clojure variation library</a> is freely available and is part of a larger project to provide variant assessment capabilities for the <a href="http://genomics.xprize.org/">Archon Genomics XPRIZE competition</a>.</p>
<h2 id="mapreduce-gatk-walker">Map/Reduce GATK Walker</h2>
<p>GATK&#8217;s <a href="http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development">well documented</a> Map/Reduce API eases the development of custom programs for processing BAM and VCF files. The <a href="http://db.tt/y29eDuh">presentation from Eli Lilly</a> is a great introduction to developing your own custom GATK Walkers in Java. Here we&#8217;ll follow a similar approach to code these in Clojure.</p>
<p>We&#8217;ll start by defining a simple Java base class that extends the base GATK walker and defines an output and input variable. The output is a string specifying the output file to write to, and the input is any type of variant file the GATK accepts. Here we&#8217;ll be dealing with VCF input files:</p>
<pre class="brush: java; title: ; notranslate">
public abstract class BaseVariantWalker extends RodWalker {
  @Output
  public String out;

  @ArgumentCollection
  public StandardVariantContextInputArgumentCollection invrns = new StandardVariantContextInputArgumentCollection();
}
</pre>
<p>This base class is all the Java we need. We implement the remaining walker in Clojure and will walk through the <a href="http://chapmanb.github.com/bcbio.variation/uberdoc.html#bcbio.variation.vcfwalker">fully annotated source</a> in sections. To start, we import the base walker we wrote and extend this to generate a Java class, which the GATK will pick up and make available as a command line walker:</p>
<pre class="brush: clojure; title: ; notranslate">
(ns bcbio.variation.vcfwalker
  (:import [bcbio.variation BaseVariantWalker])
  (:gen-class
   :name bcbio.variation.vcfwalker.VcfSimpleStatsWalker
   :extends bcbio.variation.BaseVariantWalker))
</pre>
<p>Since this is a Map/Reduce framework, we first need to implement the map function. GATK passes this function a tracker, used to retrieve the actual variant call values and a context which describes the current location. We use the <code>invrns</code> argument we defined in Java to reference the input VCF file supplied on the commandline. Finally, we extract the quality score from each <code>VariantContext</code> and return those. This map function produces a stream of quality scores from the input VCF file:</p>
<pre class="brush: clojure; title: ; notranslate">
(defn -map
  [this tracker ref context]
  (if-not (nil? tracker)
    (for [vc (map from-vc
                    (.getValues tracker (.variants (.invrns this))
                                (.getLocation context)))]
      (-&gt; vc :genotypes first :qual))))
</pre>
<p>For the reduce part, we take the stream of quality scores and plot a histogram. In the GATK this happens in 3 functions: <code>reduceInit</code> starts the reduction step and creates a list to add the quality scores to, <code>reduce</code> collects all of the quality scores into this list, and <code>onTraversalDone</code> plots a histogram of these scores using the <a href="http://incanter.org/">Incanter</a> statistical library:</p>
<pre class="brush: clojure; title: ; notranslate">
(defn -reduceInit
  [this]
  [])

(defn -reduce
  [this cur coll]
  (if-not (nil? cur)
    (vec (flatten [coll cur]))
    coll))

(defn -onTraversalDone
  [this result]
  (doto (icharts/histogram result
                           :x-label &quot;Variant quality&quot;
                           :nbins 50)
    (icore/save (.out this) :width 500 :height 400)))
</pre>
<p>We&#8217;ve implemented a full GATK walker in Clojure, taking advantage of existing Clojure plotting libraries. To run this, compile the code into a jarfile and run like a standard GATK tool:</p>
<pre><code>$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats
  -r test/data/grch37.fa --variant test/data/gatk-calls.vcf --out test.png
</code></pre>
<p></p>
<p>which produces a plot of quality score distributions:</p>
<p>
<a href="http://bcbio.files.wordpress.com/2012/03/gatk-quals.png"> <img src="http://bcbio.files.wordpress.com/2012/03/gatk-quals.png?w=700" width="400px" alt="GATK walker quality scores" /></a>
</p>
<h2 id="custom-gatk-annotation">Custom GATK Annotation</h2>
<p><a href="http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator">GATK&#8217;s Variant Annotator</a> is a useful way to add metrics information to a file of variants. These metrics allow filtering and prioritization of variants, either by <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration">variant quality score recalibration</a> or <a href="http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker">hard filtering</a>. We can add new annotation metrics by inheriting from GATK Java interfaces. Here we&#8217;ll implement Mean Neighboring Base Quality (NBQ), a metric from the <a href="http://www.biomedcentral.com/1471-2105/13/8/abstract">Atlas2 variation suite</a> that assesses the quality scores in a region surrounding a variation.</p>
<p>We start walking through the <a href="http://chapmanb.github.com/bcbio.variation/uberdoc.html#bcbio.variation.annotate.nbq">full implementation</a> by again defining a generated Java class that inherits from a GATK interface. In this case, <code>InfoFieldAnnotation</code>:</p>
<pre class="brush: clojure; title: ; notranslate">
(ns bcbio.variation.annotate.nbq
  (:import [org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation]
           [org.broadinstitute.sting.utils.codecs.vcf VCFInfoHeaderLine VCFHeaderLineType])
  (:require [incanter.stats :as istats])
  (:gen-class
   :name bcbio.variation.annotate.nbq.MeanNeighboringBaseQuality
   :extends org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation))
</pre>
<p>The <code>annotate</code> function does the work of calculating the mean quality score. We define functions that use the GATK API to:</p>
<ul>
<li>Retrieve the pileup at the current position.</li>
<li>Get the neighbor qualities from a read at a position.</li>
<li>Combine the qualities for all reads in a pileup.</li>
</ul>
<p>With these three functions, we can use the Clojure threading macro to cleanly organize the steps of the operation as we retrieve the pileup, get the qualities and calculate the mean:</p>
<pre class="brush: clojure; title: ; notranslate">
(defn -annotate
  [_ _ _ _ contexts _]
  (letfn [(get-pileup [context]
            (if (.hasExtendedEventPileup context)
              (.getExtendedEventPileup context)
              (.getBasePileup context)))
          (neighbor-qualities [[offset read]]
            (let [quals (-&gt; read .getBaseQualities vec)]
              (map #(nth quals % nil) (range (- offset flank-bp) (+ offset flank-bp)))))
          (pileup-qualities [pileup]
            (map neighbor-qualities (map vector (.getOffsets pileup) (.getReads pileup))))]
    {&quot;NBQ&quot; (-&gt;&gt; contexts
                vals
                (map get-pileup)
                (map pileup-qualities)
                flatten
                (remove nil?)
                istats/mean
                (format &quot;%.2f&quot;))}))
</pre>
<p>With this in place we can now run this directly using the standard GATK command line arguments. As before, we create a jar file with the new annotator, and then pass the name as a desired annotation when running the <code>VariantAnnotator</code>, producing a VCF file with NBQ annotations:</p>
<pre><code>$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf
</code></pre>
<p></p>
<h2 id="access-vcf-variant-information">Access VCF variant information</h2>
<p>In addition to extending the GATK through walkers and annotations you can also utilize the extensive API directly, taking advantage of parsers and data structures to handle common file formats. Using Clojure&#8217;s Java interoperability, the <a href="http://chapmanb.github.com/bcbio.variation/uberdoc.html#bcbio.variation.variantcontext">variantcontext module</a> provides a high level API to parse and extract information from VCF files. To loop through a VCF file and print the location, reference allele and called alleles for each variant we:</p>
<ul>
<li>Open a VCF source providing access to the underlying file inside a <code>with-open</code> statement to ensure closing of the resource.</li>
<li>Parse the VCF source, returning an iterator of <code>VariantContext</code> maps for each variant in the file.</li>
<li>Extract values from the map: the chromosome, start, reference allele and called alleles for the first genotype.</li>
</ul>
<pre class="brush: clojure; title: ; notranslate">
(use 'bcbio.variation.variantcontext)

(with-open [vcf-source (get-vcf-source &quot;test/data/gatk-calls.vcf&quot;)]
  (doseq [vc (parse-vcf vcf-source)]
    (println (:chr vc) (:start vc) (:ref-allele vc)
             (-&gt; vc :genotypes first :alleles)))
</pre>
<p>This produces:</p>
<pre><code>MT 73 #&lt;Allele G*&gt; [#&lt;Allele A&gt; #&lt;Allele A&gt;]
MT 150 #&lt;Allele T*&gt; [#&lt;Allele C&gt; #&lt;Allele C&gt;]
MT 152 #&lt;Allele T*&gt; [#&lt;Allele C&gt; #&lt;Allele C&gt;]
MT 195 #&lt;Allele C*&gt; [#&lt;Allele T&gt; #&lt;Allele T&gt;]
</code></pre>
<p></p>
<p>I hope this tour provides some insight into the powerful tools that can be rapidly built by leveraging the GATK from Clojure. The <a href="https://github.com/chapmanb/bcbio.variation">full library</a> contains a range of additional functionality including normalization of complex MNPs and support for phased haplotype comparisons.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/263/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/263/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=263&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=VGPyBs179Ig:GgcVXtB1fy4:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=VGPyBs179Ig:GgcVXtB1fy4:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=VGPyBs179Ig:GgcVXtB1fy4:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=VGPyBs179Ig:GgcVXtB1fy4:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=VGPyBs179Ig:GgcVXtB1fy4:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/VGPyBs179Ig" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2012/03/04/extending-the-gatk-for-custom-variant-comparisons-using-clojure/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2012/03/gatk-quals.png" medium="image">
			<media:title type="html">GATK walker quality scores</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2012/03/04/extending-the-gatk-for-custom-variant-comparisons-using-clojure/</feedburner:origLink></item>
		<item>
		<title>Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/GzBhkLCfQ9k/</link>
		<comments>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/#comments</comments>
		<pubDate>Wed, 30 Nov 2011 01:50:52 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[galaxy]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=251</guid>
		<description><![CDATA[My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running CloudBioLinux and CloudMan during the AWS Genomics [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=251&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>My <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">previous post described running an automated exome pipeline using CloudBioLinux and CloudMan</a>, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running <a href="http://cloudbiolinux.org/">CloudBioLinux</a> and <a href="http://www.usecloudman.org/">CloudMan</a> during the <a href="http://aws.amazon.com/genomicsevent/">AWS Genomics Event</a> (<a href="http://www.slideshare.net/chapmanb/developing-distributed-analysis-pipelines-with-shared-community-resources-using-cloudbiolinux-and-cloudman">talk slides are available</a>).</p>
<p>The culmination of all this feedback are two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:</p>
<ul>
<li>
<p><a href="http://biocloudcentral.org">BioCloudCentral</a> &#8212; A web service that launches CloudBioLinux and CloudMan clusters on <a href="http://aws.amazon.com/">Amazon Web Services</a> hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.</p>
</li>
<li>
<p>A custom <a href="http://galaxyproject.org/">Galaxy</a> integrated front-end to next-generation sequencing pipelines. A <a href="http://jqueryui.com/">jQuery UI</a> wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.</p>
</li>
</ul>
<p>Special thanks are due to <a href="http://www.usecloudman.org/enis/">Enis Afgan</a> for his help building these tools. He provided his <a href="http://readthedocs.org/docs/boto/en/latest/">boto</a> expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.</p>
<p>This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind the scene details described earlier are available: the piepline uses a CloudBioLinux image containing a wide variety of bioinformatics software and you can use ssh or an <a href="http://www.nomachine.com/download.php">NX graphical client</a> to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.</p>
<h2 id="biocloudcentral----starting-a-cloudbiolinux-instance">BioCloudCentral &#8212; starting a CloudBioLinux instance</h2>
<p>To get started, sign up for an <a href="http://aws.amazon.com/">Amazon Web services account</a>. This gives you access to on demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the <a href="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key">Amazon security credentials page</a>.</p>
<p>With these, navigate to <a href="http://biocloudcentral.org">BioCloudCentral</a> and fill out the simple entry form. In addition to your access credentials, enter your choice of a name used to identify the cluster, and your choice of password to access the CloudMan web interface and the cluster itself via ssh or NX.</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/9ZqvxhwcPaw?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span><br />
</p>
<p>Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.</p>
<p>Before leaving the monitoring page, you want to download a pre-formatted user-data file; this allows you to later start the same CloudMan instance directly from the <a href="https://console.aws.amazon.com/ec2/home">Amazon web services console</a>.</p>
<h2 id="cloudman----managing-the-cluster">CloudMan &#8212; managing the cluster</h2>
<p>The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:</p>
<pre><code>cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23/
</code></pre>
<p></p>
<p>This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/NkayXBBAr8I?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span><br />
</p>
<h2 id="galaxy-pipeline-interface----running-the-analysis">Galaxy pipeline interface &#8212; running the analysis</h2>
<p>This Galaxy instance is a <a href="https://bitbucket.org/hbc/galaxy-central-hbc/overview">fork of the main codebase</a> containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Login with the demonstration account (user: example@example.com; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/dIeQCIi3EXw?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span><br />
</p>
<p>The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional work node (Large instance type).</p>
<h2 id="galaxy----retrieving-and-displaying-results">Galaxy &#8212; retrieving and displaying results</h2>
<p>The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don&#8217;t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab separate file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/LtOK3U8990w?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span><br />
</p>
<p>As a reminder, don&#8217;t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the <a href="https://console.aws.amazon.com/ec2/home">Amazon console</a>.</p>
<h2 id="analysis-pipeline-details-and-extending-this-work">Analysis pipeline details and extending this work</h2>
<p>The backend analysis pipeline is a <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">freely available set of Python modules</a> included on the CloudBioLinux AMI. The pipeline closely follows current <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3">best practice variant detection recommendations from the Broad GATK team</a>:</p>
<ul>
<li>FASTQ alignment with <a href="http://bio-bwa.sourceforge.net/">BWA</a>; <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/ngsalign/bwa.py">source code</a></li>
<li><a href="http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration">Base quality score recalibration</a> with GATK: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/recalibrate.py">source code</a></li>
<li><a href="http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels">Local realignment around indels</a> with GATK: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/realign.py">source code</a>:</li>
<li>Variant calling (SNPs and indels) using the <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper">GATK Unified Genotyper</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/genotype.py">source code</a></li>
<li>Variant effect estimation with <a href="http://snpeff.sourceforge.net/">snpEff</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/effects.py">source code</a></li>
<li>Read coverage visualization with <a href="http://hgdownload.cse.ucsc.edu/admin/exe/">wigToBigWig</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/bam_to_wiggle.py">source code</a></li>
</ul>
<p>The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.</p>
<p>We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/251/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/251/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=251&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=GzBhkLCfQ9k:lahg0d8D8-k:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=GzBhkLCfQ9k:lahg0d8D8-k:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=GzBhkLCfQ9k:lahg0d8D8-k:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=GzBhkLCfQ9k:lahg0d8D8-k:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=GzBhkLCfQ9k:lahg0d8D8-k:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/GzBhkLCfQ9k" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/</feedburner:origLink></item>
		<item>
		<title>Parallel approaches in next-generation sequencing analysis pipelines</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/HW_RgYiYXKQ/</link>
		<comments>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/#comments</comments>
		<pubDate>Sat, 10 Sep 2011 19:12:50 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[distributed-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=229</guid>
		<description><![CDATA[My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I&#8217;ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing. Incredible innovation in throughput [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=229&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>My last post described a <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">distributed exome analysis pipeline</a> implemented on the <a href="http://cloudbiolinux.org">CloudBioLinux</a> and <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan</a> frameworks. This was a practical introduction to running the pipeline on <a href="http://aws.amazon.com/">Amazon resources</a>. Here I&#8217;ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.</p>
<p>Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single <a href="http://www.illumina.com/systems/hiseq_2000.ilmn">Hi-Seq</a> run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.</p>
<p>The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support <a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html">powerful multi-core servers</a>, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarassingly parallel</a> nature of the computation. An <a href="http://www.rabbitmq.com/">AMQP messaging queue</a> allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.</p>
<div id="process-overview----points-for-parallel-implementations">
<h2>Process overview &#8212; points for parallel implementations</h2>
<p>The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files:</p>
<p><a href="http://chapmanb.github.com/bcbb/lane_processing.png"> <img src="http://chapmanb.github.com/bcbb/lane_processing.png" width="650px" alt="Initial lane processing" /></a></p>
<p>The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling:</p>
<p><a href="http://chapmanb.github.com/bcbb/sample_processing.png"> <img src="http://chapmanb.github.com/bcbb/sample_processing.png" width="650px" alt="Sample processing overview" /></a></p>
<p>The variant calling steps utilize <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">The Genome Analysis Toolkit (GATK)</a> from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls:</p>
<p><a href="http://chapmanb.github.com/bcbb/variant_call.png"> <img src="http://chapmanb.github.com/bcbb/variant_call.png" width="650px" alt="GATK variant calling details" /></a></p>
<p>Subsequent steps include <a href="http://snpeff.sourceforge.net/">assessment of variant effects using snpEff</a> and <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Read-backed_phasing_algorithm">haplotype phasing of variants</a> in diploid organism analyses.</p>
</div>
<div id="messaging-approach-to-parallel-execution">
<h2>Messaging approach to parallel execution</h2>
<p>The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface start this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes:</p>
<p><a href="http://chapmanb.github.com/bcbb/parallel_messaging.png"> <img src="http://chapmanb.github.com/bcbb/parallel_messaging.png" width="650px" alt="Messaging approach" /></a></p>
<p>In <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">my previous post</a>, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.</p>
<p>The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are be a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">bcbio-nextgen GitHub repository</a></p>
</div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/229/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/229/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=229&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=HW_RgYiYXKQ:Ngev2kD4Ico:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=HW_RgYiYXKQ:Ngev2kD4Ico:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=HW_RgYiYXKQ:Ngev2kD4Ico:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=HW_RgYiYXKQ:Ngev2kD4Ico:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=HW_RgYiYXKQ:Ngev2kD4Ico:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/HW_RgYiYXKQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/lane_processing.png" medium="image">
			<media:title type="html">Initial lane processing</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/sample_processing.png" medium="image">
			<media:title type="html">Sample processing overview</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/variant_call.png" medium="image">
			<media:title type="html">GATK variant calling details</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/parallel_messaging.png" medium="image">
			<media:title type="html">Messaging approach</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/</feedburner:origLink></item>
		<item>
		<title>Distributed exome analysis pipeline with CloudBioLinux and CloudMan</title>
		<link>http://feedproxy.google.com/~r/bcbio/~3/ufrwsHl1jl0/</link>
		<comments>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/#comments</comments>
		<pubDate>Fri, 19 Aug 2011 21:33:16 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[distributed-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=217</guid>
		<description><![CDATA[A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=217&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.</p>
<p>Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:</p>
<ul>
<li>
<p><a href="http://cloudbiolinux.org">CloudBioLinux</a> &#8212; A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.</p>
</li>
<li>
<p><a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan</a> &#8212; Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by <a href="http://userwww.service.emory.edu/~eafgan/">Enis Afgan</a> and the <a href="http://wiki.g2.bx.psu.edu/">Galaxy Team</a>, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on <a href="http://aws.amazon.com/">Amazon AWS</a>.</p>
</li>
</ul>
<p>Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput <a href="http://en.wikipedia.org/wiki/Exome_Sequencing">exome sequencing</a> data:</p>
<ul>
<li>The underlying analysis software is from CloudBioLinux.</li>
<li>CloudMan provides an SGE cluster managed via a web front end.</li>
<li><a href="http://www.rabbitmq.com/">RabbitMQ</a> is used for communication between cluster nodes.</li>
<li><a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">An automated pipeline</a>, written in <a href="http://python.org">Python</a>, organizes parallel processing across the cluster.</li>
</ul>
<p>Below are instructions for starting a cluster on <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> resources to run an exome sequencing pipeline that processes <a href="http://en.wikipedia.org/wiki/FASTQ_format">FASTQ</a> sequencing reads, producing fully annotated <a href="http://en.wikipedia.org/wiki/Variant_Call_Format">variant calls</a>.</p>
<div id="start-cluster-with-cloudbiolinux-and-cloudman">
<h2>Start cluster with CloudBioLinux and CloudMan</h2>
<p>Start in the <a href="https://console.aws.amazon.com/ec2/">Amazon web console</a>, a convenient front end for managing EC2 servers. The first step is to follow the <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan setup instructions</a> to create an Amazon account and set up appropriate security groups and user data. The <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">wiki page</a> contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:</p>
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/eS8vmKIXIB4?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span>
<p></p>
<p>Once this is booted, proceed to the CloudMan web interface on the server and startup an instance from this shared identifier:</p>
<pre><code>cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23
</code></pre>
<p></p>
<p>This screencast shows all of the details, including starting an additional node on the SGE cluster:</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='700' height='424' src='http://www.youtube.com/embed/4kIRI1m0g7Y?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span><br />
</p>
</div>
<div id="configure-amqp-messaging">
<h2>Configure AMQP messaging</h2>
<p><b>Edit:</b> The AMQP messaging steps have now been full automated so the configuration steps in this section are no longer required. Skip down to the &#8216;Run Analysis&#8217; section to start processing the data immediately.
</p>
<p>With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:</p>
<pre><code>$ ssh -i your-keypair ubuntu@ec2-50-19-177-134.compute-1.amazonaws.com
</code></pre>
<p></p>
<p>Edit the <code>/export/data/galaxy/universe_wsgi.ini</code> configuration file to add the internal hostname. After editing, the AMQP section will look like:</p>
<pre><code>[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
</code></pre>
<p></p>
<p>Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:</p>
<pre><code>$ sudo rabbitmqctl add_user biouser tester
creating user &quot;biouser&quot; ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost &quot;bionextgen&quot; ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser &quot;.*&quot; &quot;.*&quot; &quot;.*&quot;
setting permissions for user &quot;biouser&quot; in vhost &quot;bionextgen&quot; ...
...done.
</code></pre>
<p>
</div>
<div id="run-analysis">
<h2>Run analysis</h2>
<p>With messaging in place, we are ready to run the analysis. <code>/export/data</code> contains a ready to run example exome analysis, with <a href="http://en.wikipedia.org/wiki/FASTQ_format">FASTQ</a> input files in <code>/export/data/exome_example/fastq</code> and configuration information in <code>/export/data/exome_example/config</code>. Start the fully automated pipeline with a single command:</p>
<pre><code> $ cd /export/data/work
 $ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/fastq
                                   /export/data/exome_example/config/run_info.yaml
</code></pre>
<p></p>
<p><code>distributed_nextgen_pipeline.py</code> starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:</p>
<ul>
<li>Alignment with <a href="http://bio-bwa.sourceforge.net/">BWA</a></li>
<li>Preparation of merged alignment files with <a href="http://picard.sourceforge.net/">Picard</a></li>
<li>Recalibration and realignment with <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">GATK</a></li>
<li>Variant calling with GATK</li>
<li>Assessment of predicted variant effects with <a href="http://snpeff.sourceforge.net/">snpEff</a></li>
<li>Preparation of summary PDFs for each sample with read details from <a href="http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/">FastQC</a> alongside alignment, hybrid selection and variant calling statistics from Picard</li>
</ul>
</div>
<div id="monitor-the-running-process">
<h2>Monitor the running process</h2>
<p>The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGEs <code>qstat</code> command will tell you where the analysis servers are running on the cluster:</p>
<pre><code>$ qstat
ob-ID  prior   name   user  state submit/start at   queue
----------------------------------------------------------------------------------
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47 all.q@ip-10-125-10-182.ec2.int
</code></pre>
<p></p>
<p>Listing files in the working directory will show our progress:</p>
<pre><code>$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp
</code></pre>
<p></p>
<p>The files that end with <code>.o*</code> are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:</p>
<pre><code>$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
</code></pre>
<p></p>
</div>
<div id="retrieve-results">
<h2>Retrieve results</h2>
<p>The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:</p>
<pre><code>$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml
                      /export/data/exome_example/fastq
                      /export/data/work
                      /export/data/exome_example/config/run_info.yaml
</code></pre>
<p></p>
<p>For each sample, this script copies:</p>
<ul>
<li>A <a href="http://samtools.sourceforge.net/SAM1.pdf">BAM file</a> with aligned sequeneces and original FASTQ data</li>
<li>A realigned and recalibrated BAM file, ready for variant calling</li>
<li>Variant calls in <a href="http://en.wikipedia.org/wiki/Variant_Call_Format">VCF format</a>.</li>
<li>A tab delimited file of predicted variant effects.</li>
<li>A PDF summary file containing alignment, variant calling and hybrid selection statistics.</li>
</ul>
<p>into an output directory for the flowcell: <code>/export/data/galaxy/storage/100326_FC6107FAAXX</code>:</p>
<pre><code>$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd
</code></pre>
<p></p>
<p>As suggested by the name, the script can also integrate the data into a <a href="http://usegalaxy.org">Galaxy instance</a> if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the <a href="http://genome.ucsc.edu/">UCSC browser</a>.</p>
</div>
<div id="learn-more">
<h2>Learn more</h2>
<p>All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through <a href="http://en.wikipedia.org/wiki/YAML">YAML</a> configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.</p>
<p>The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.</p>
<p>If you&#8217;d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I&#8217;ll be discussing it at the <a href="http://aws.amazon.com/genomicsevent/">AWS Genomics Event</a> in Seattle on September 22nd.</p>
</div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/bcbio.wordpress.com/217/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bcbio.wordpress.com/217/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&#038;blog=5850073&#038;post=217&#038;subd=bcbio&#038;ref=&#038;feed=1" width="1" height="1" /><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/bcbio?a=ufrwsHl1jl0:ddtrJwJWAWE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bcbio?i=ufrwsHl1jl0:ddtrJwJWAWE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=ufrwsHl1jl0:ddtrJwJWAWE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bcbio?i=ufrwsHl1jl0:ddtrJwJWAWE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/bcbio?a=ufrwsHl1jl0:ddtrJwJWAWE:clraHZBW0_I"><img src="http://feeds.feedburner.com/~ff/bcbio?d=clraHZBW0_I" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/bcbio/~4/ufrwsHl1jl0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/907db62471c84437d0875095402e94b6?s=96&amp;d=http%3A%2F%2Fs0.wp.com%2Fi%2Fmu.gif" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	<feedburner:origLink>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/</feedburner:origLink></item>
	</channel>
</rss>
