Getting Genetics Done

Staying Current in Bioinformatics & Genomics: 2017 Edition

2017-02-01T13:35:00.001-06:00

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.

People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.

What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.

Twitter

In the 2012 post I ended with Twitter, but I have to lead with it this time. Twitter is probably my most valuable resource for learning about the bleeding-edge developments in genomics & bioinformatics. It’s great for learning what’s new and contributing to the dialogue in your field, but only when used effectively.

I aggressively prune the list of people I follow to keep what I see relevant and engaging. I can tolerate an occasional digression into politics, posting pictures of you drinking with colleagues at a conference, or self-congratulatory announcements. But once these off-topic Tweets become the norm, I unfollow. I also rely on the built-in list feature. I follow a few hundred people, but I only add a select few dozen to a “notjunk” list that I look at when I’m short on time. Folks in this list don’t Tweet too often and have a high signal-to-noise ratio (as far as what I’m interested in reading). If I don’t get a chance to catch up on my entire timeline, I can at least breeze through recent Tweets from folks on this list.

I’m also wary of following extremely prolific users. For example — if someone’s been on Twitter less than a year, already has 20,000 Tweets, but only 100 followers, it tells me they’ve got a lot to say but nobody cares. I let the hive mind work for me in this case, using this Tweet-to-follower ratio as sort of a proxy for signal-to-noise.

I mostly follow individuals and aggregators, but I also follow a few organization accounts. These can be a mixed bag. Only a few organization accounts do this well, delivering interesting and applicable content to a targeted audience, while many more are poor attempts at marketing and self-promotion while not offering any substantive value or interesting content.

Individuals: In no particular order, here’s an incomplete list of people who Tweet content that I find consistently on-topic and interesting.

Aaron Quinlan (aaronquinlan)
Adam Phillippy (aphillippy)
Andrew Severin (isugif)
Casey Greene (GreeneScientist)
Clive Brown (Clive_G_Brown)
Dan MacArthur (dgmacarthur)
David Robinson (drob)
Elisabeth Bik (MicrobiomDigest)
Frank Harrell (f2harrell)
Hadley Wickham (hadleywickham)
Heng Li (lh3lh3)
James Hadfield (coregenomics)
Jared Simpson (jaredtsimpson)
Jeff Leek (jtleek)
Jenny Bryan (JennyBryan)
Julia Silge (juliasilge)
Krista Ternus (KristaTernus)
Lex Nederbragt (lexnederbragt)
Lior Pachter (lpachter)
Mick Watson (biomickwatson)
Mike Love (mikelove)
Nick Loman (pathogenomenick)
Nicolas Robine (notSoJunkDNA)
Phil Ashton (flashton2003)
RNA-seq Blog (rnaseqblog)
Rob Patro (nomad421)
Roger Peng (rdpeng)
Sam Minot (sminot)
Sean Davis (seandavis12)
Titus Brown (ctitusbrown)
Torsten Seemann (torstenseemann)
Tuuli Lappalainen (tuuliel)
Vince Buffalo (vsbuffalo)
Willem van Schaik (WvSchaik)
Zamin Iqbal (ZaminIqbal)
Many more I’m failing to specifically mention…

Others: Besides individual accounts, there are also a number of aggregators and organizations that I keep on a high signal-to-noise list.

bioRxiv (biorxivpreprint)
bioRxiv Bioinfo (biorxiv_bioinfo)
bioRxiv Genomics (biorxiv_genomic)
Metagenomics Papers (metagenomic_lit)
InformaticsGW (UduakGW)
Hacker News 300 (newsyc300)
CompBiolPapers (compbiolpapers)
RNA-seq paper aggregator (RNA_seq)
Bioconductor (Bioconductor)
RStudio Tips (rstudiotips)

Blogs

I follow these and other blogs using RSS. I’ve been happy with the free version of Feedly ever since Google Reader was killed. The web interface and iOS app have everything I need, and they both integrate nicely with other services like Evernote, Instapaper, Buffer, Twitter, etc. If you can’t find a direct link to the blog’s RSS feed, you can usually type the name of the blog into Feedly’s search bar and it’ll find it for you. Similar to my “notjunk” list in Twitter, I have a Favorites category in Feedly where I include only the feeds I absolutely wouldn’t want to miss.

These are some of the few that I try to read whenever something new is posted, and Feedly helps me keep those organized, either by “starring” something I want to come back to, or saving it for later with Instapaper. They’re in no particular order, and I’m sure I’ve forgotten something.

Variance Explained: David Robinson’s blog (Data Scientist at Stack Overflow, works in R and Python).
Global Biodefense: News on pathogens, outbreaks, and preparedness, with periodic posts on genomics and bioinformatics-related developments and funding opportunities.
In between lines of code: Lex Nederbragt’s blog on biology, sequencing, bioinformatics, …
Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.
Bits of DNA: Reviews and commentary on computational biology by Lior Pachter (fair warning: dialogue here can get a bit heated!).
Blue Collar Bioinformatics: articles related to tool validation and the open source bioinformatics community.
Microbiome Digest - Bik’s Picks: A daily digest of scientific microbiome papers, by Elisabeth Bik, Science Editor at uBiome.
Living in an Ivory Basement: Titus Brown’s blog on metagenomics, open science, testing, reproducibility, and programming.
Enseqlopedia: James Hadfield’s blog on all things NGS.
Epistasis Blog: Jason Moore’s computational biology blog.
RStudio Blog: announcements about new RStudio functionality, updates about the tidyverse, and more.
nextgenseek.com: Next-Gen Sequencing Blog covering new developments in NGS data & analysis.
RNA-Seq Blog: Transcriptome Research & Industry News.
The Allium: We all need a little humor in our lives. Like The Onion, but for science.

Others

I’m unsure how to categorize the rest. These are things like aggregators, Q&A sites/forums, and others.

Nuzzel is something I’ve only been using for a few months but it works very well. It’s meant to solve the Twitter / social media overload problem. If you’re following a few hundred people, you could easily have thousands of Tweets per day to read through (or miss). Nuzzel emails you a daily newsletter of the most relevant content in your Twitter feed. I’m guessing it does this by analyzing how many people you follow share, retweet, or favorite the same links. I try to read everything in my RSS feeds but I could never do this with Twitter (nor should you worry about trying). Nuzzel helps you catch up on things that are trending among the people you follow. It’s not a substitute for following the right people (see the Twitter section above).
RWeekly: weekly updates from the entire R community. Offers an RSS feed but I subscribe to the weekly email. Each email sends out about 50 links with one-sentence descriptions to things being done in the R community that week.
R Bloggers aggregates RSS feeds from hundreds of blogs about R. Much more comprehensive than RWeekly, but lots to sort through.
GenomeWeb still provides high-quality original content as well as summaries of what’s going on in the field. Create an account, log in, view your profile page, and subscribe to some of their regular emails. I subscribe to their daily news, the scan, informatics, sequencing, and infectious diseases bulletins. Pro tip: Much of their content is only available for premium subscribers. If you sign up with a .edu address, you can access all this content for free.
F1000’s Smart Search is one of the few literature recommendation services that I find useful, relevant, and current. My RNA-seq and metagenomics alerts consistently deliver relevant and fresh content.
BioStars: This is a Stack Exchange-style Q&A site focused on bioinformatics, computational genomics, and biological data analysis. You can go to the homepage and sort by topic, views, answers, etc., and the platform offers several granular ways to subscribe via RSS.
Bioconductor Support: This is a Q&A site much like BioStars that replaced the Bioconductor mailing list. You can do things like limit to a certain time period and sort by views, for example, if you only want to log in occasionally to see what’s being talked about.
SEQanswers: I subscribe to all new threads in the SEQanswers bioinformatics forum, and regularly browse post titles. When something sparks my interest, I’ll click into that post and subscribe to future updates on that post via email.
Google Scholar lets you search and create email alerts.
PubMed Alerts: You can save, automate, and have search results emailed to you through your MyNCBI account. Surprisingly, these seem to be more relevant than the Google Scholar searches for the terms that I use.
PubMed Trending - I have no idea how PubMed ranks these. It seemed to be more useful in the past, but now it seems that the top “trending” articles alternate between CRISPR/Cas9, and old kinesiology / sports medicine articles.
IFTTT: If This Then That is a service that connects many different web services together in an endless number of ways. At home I might connect Facebook and Dropbox, so that whenever someone tags me in a photo, that photo is automatically downloaded to my Dropbox. At work I can connect an RSS feed to an Evernote note or Google Doc. It’s useful in so many ways, both for personal and for work-related tasks. I mostly use it here as a last safeguard so that things I really shouldn’t miss don’t slip through the cracks. I have recipes that do things like email me if certain low-volume Twitter accounts post a new Tweet, and others that automatically save things like starred articles in Feedly to Instapaper. I also use this to keep a close eye on a few accounts on GitHub. I have connections set up for a few users on GitHub so that whenever one of these users creates a new public repository, I get an email. I’ve also used IFTTT to archive Tweets coming out of various hashtags: you can create a recipe where if a new Tweet contains certain keywords or hashtags, then that Tweet is saved to Evernote, a shared Google Doc spreadsheet, etc. Zapier is a similar service that I’ve heard provides more granular control, but I haven’t tried it.
Podcasts: I listen to every episode of Roger Peng and Hilary Parker’s Not So Standard Deviations data science podcast, and most episodes of Roger Peng and Elizabeth Matsui’s The Effort Report (this one’s more about life in academia in general). I use the Overcast iOS app to listen to these and other podcasts on ~1.75X speed. (When I met Hilary at the RStudio Conference I heard her speak for the first time at regular 1X speed. Odd experience.) Finally, I just learned about the R podcast. I haven’t listened to much yet, but I’ve added it to my long Overcast queue.

Preprints!

Preprints in life sciences were nearly unheard of when I wrote the 2012 post. Now everybody’s doing it. There are still a few people using the arXiv Quantitative biology channel, and I’ll occasionally find something in PeerJ Preprints that grabs my attention.

bioRxiv is the biggest player here, hands down. The Alerts/RSS page lets you sign up for email alerts on particular topics, or subscribe to RSS feeds coming from particular categories that interest you. I subscribe to the Genomics and Bioinformatics feeds. I also follow several of bioRxiv’s top-level and category Twitter feeds (@biorxivpreprint, @biorxiv_bioinfo, and @biorxiv_genomic).

F1000 Research deserves some special attention here. It’s somewhere in-between a preprint server and a peer-reviewed publication. You can upload manuscripts (or other research outputs like posters or slides), and they’re immediately and permanently published, and given a DOI. Then one or more rounds of open peer review as well as public comment take place, and authors can update the published paper for further review. Check out the transcript estimates / gene inference paper I mentioned earlier. You’ll see it’s “version 2,” and was approved by two referees. If you look at the right-hand panel, you can actually go back and see the version prior to revision, as well as see who reviewed it, what the reviewers wrote, and how the authors responded to those reviews. It’s an innovative platform where peer review is open and transparent, and is independent of publication, since papers are published before they are reviewed, and remain regardless of the outcome of the review. F1000 Research has a number of channels that are externally curated by different organizations, societies, conferences, etc. I subscribe to and get alerts about the R package and Bioconductor channels. Whenever a new preprint is dropped into one of these channels, I’ll get an email and an RSS item.

I only recently discovered PrePubMed, which looks very useful. PrePubMed indexes preprints from arXiv q-bio, PeerJ Preprints, bioRxiv, F1000Research, preprints.org, The Winnower, Nature Precedings, and Wellcome Open Research. In the tools box on the homepage, you can enter a search string and get back an RSS feed with results from that search. It looks like PrePubMed is maintained by a single person, but he’s made the entire thing open source, so you could presumably set this up and mirror it on your own, should you check back in 2021 and the link be dead.

Journals

I started with Journals in my 2012 post, but they’re last (and probably least) here. I still subscribe to a few journals’ RSS feeds, but in most cases, by the time I see a new Table of Contents hit my RSS reader, I probably saw the publications making the rounds on Twitter, blogs, or other channels mentioned above. It’s also no longer unusual to see a “publication” land where I read the preprint on bioRxiv months ago, and perhaps even a blog post before that! What “publication” means is changing rapidly, and I’m sure the lines between a blog post, preprint, and journal article will be even blurrier in the year 2022 post.

How do you have the time to do this?

How do you not? It’s not as bad as it seems. I probably spend an hour each weekday scanning all the resources mentioned here, and I find the time well spent. I can breeze through my Twitter and RSS feeds on my bus ride into work, and save things I actually want to look at later with a bookmark, star, favorite, Instapaper, etc.

I should have prefaced this whole article with the note that I hardly ever actually fully read any of the papers or blog posts I see here. If I see, for example, a new WGS variant caller published, I’ll glance at the figures benchmarking it against GATK and FreeBayes, and skim through the documentation on the GitHub README or Bioconductor vignette. If either of these is missing or falls short, that’s usually enough for me to ignore the publication completely (don’t underestimate the importance of good documentation!).

It’s taken me a decade to compile and continually hone this list of resources to the things that I find useful and relevant. This is what works for me, now, in 2017. It’s not a one-size-fits-all, and the 2018-me will probably have a somewhat different list, but I hope you’ll find it useful. If your interests are similar to what I’ve discussed here, how do you stay current? What have I left out? Let me know in the comments!

RStudio Conference 2017 Recap

2017-01-14T15:48:00.000-06:00

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown.

This is not a complete list by any means; with split sessions I could only go to half the talks at most. Here are some noncomprehensive notes and links to slides and resources for some of the awesome things people are doing with R and RStudio that I learned about at the RStudio Conference.

Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and the notion of splitting functions into commands that do something, as compared to queries that calculate something, and how it’s generally a good idea to keep these different functionalities contained in their own separate functions. (Contrast this with things like lm, which both computes values and does things, like printing those values to the screen, making the output difficult to capture; see broom.)
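As a rough illustration of that command/query split (a minimal sketch of my own, not code from the talk): printing a fitted lm object is a command whose output is awkward to capture, while tidy() from broom behaves like a query and hands back the coefficient table as a data frame. The mtcars model is just an arbitrary example.

```r
library(broom)

# An arbitrary example model
fit <- lm(mpg ~ wt, data = mtcars)

# "Command": prints to the console, nothing convenient to capture
print(fit)

# "Query": returns the coefficient table as a data frame
coefs <- tidy(fit)
coefs$term
```

Because the result is an ordinary data frame, it can be filtered, joined, or plotted like any other table.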

I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals.

Another #rstudioconf-inspired addition to fuzzyjoin:

genome_join, for overlapping intervals on the same chromosome @genetics_blog #rstats pic.twitter.com/oUctyNYc09
— David Robinson (@drob) January 13, 2017
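Here is a small sketch of what joining on genomic intervals with fuzzyjoin can look like. The two tables and their column names (id_a, id_b) are made up for illustration; genome_inner_join() expects the by argument to name the chromosome, start, and end columns in that order, and it needs the Bioconductor IRanges package to be installed.

```r
library(tibble)
library(fuzzyjoin)  # the genome_* joins rely on the IRanges package

# Hypothetical intervals, invented for this example
a <- tibble(
  id_a = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 400),
  end = c(150, 250, 350, 450)
)

b <- tibble(
  id_b = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 400, 300),
  end = c(160, 240, 415, 320)
)

# Keep pairs of rows on the same chromosome whose intervals overlap
overlaps <- genome_inner_join(a, b, by = c("chromosome", "start", "end"))
overlaps[, c("id_a", "id_b")]
```

Only two pairs overlap here: interval 1 in a with interval 1 in b (both on chr1), and interval 4 in a with interval 3 in b (both on chr2).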

Charlotte Wickham’s 2-hour purrr tutorial was awesome. Here’s a link to a shared dropbox folder with code, challenges, slides, data, etc. The purrr package is a core package in the tidyverse, and I’ll be replacing many of the base ?apply and plyr ??ply functions that I still use here and there. The map_* functions are integral to working with nested list-columns in dplyr, and I think I’m finally starting to grok how to work with these.

Jenny Bryan gave a great talk on list columns. You can see her slides here. Jenny also put together this excellent tutorial with lots of worked examples and code snippets. And if you need some example list data structures for more practice or for teaching that aren’t foo/bar/iris/mtcars-level boring, see her repurrrsive package. Related to this, for more on list columns and purrr map functions, start reading at the “Many Models” section of Hadley’s R for Data Science book.
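For a flavor of how the map_* functions and list columns fit together, here is a minimal sketch (my own toy example on mtcars, not taken from either tutorial): fit one model per group with map(), pull out a number per model with map_dbl(), and keep the fitted models alongside the summaries in a list column.

```r
library(purrr)
library(tibble)

# One data frame per number of cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# map() returns a list: one fitted model per group
models <- map(by_cyl, function(d) lm(mpg ~ wt, data = d))

# map_dbl() returns a plain numeric vector: one R-squared per model
r2 <- map_dbl(models, function(m) summary(m)$r.squared)

# Store the models themselves in a list column next to the summaries
results <- tibble(cyl = names(models), model = models, r_squared = r2)
results
```

The same pattern replaces most uses of lapply/sapply, with the advantage that map_dbl() fails loudly if a result is not a single number.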

Julia Silge, data scientist at Stack Overflow, gave a great introduction to tidy text mining with R. You can read Julia and David’s Tidy Text Mining with R book here online (the book was authored in Rmarkdown using bookdown!).

Andrew Flowers, data journalist and former writer at FiveThirtyEight, gave the second day’s keynote address on finding and telling stories using R. He gave a series of examples illustrating six motivating features that make data stories worth telling, along with a potential danger inherent to each one:

Novelty (potential danger: triviality)
Outlier (spurious result; see also, p-hacking)
Archetype (oversimplification)
Trend (variance)
Debunking (confirmation bias)
Forecast (overfitting)

Yihui Xie led a two-hour tutorial on advanced RMarkdown. You can see his slides here. The rticles package has LaTeX Journal Article Templates for R Markdown for various journals. The tufte package now supports both PDF and HTML output. See an example here. Yihui’s xaringan package ports the remark.js library for slideshows into R. Careful. Yihui warns that you may not sleep after learning about how cool remark.js is. Yihui showed an early version of the in-development blogdown package that can build blog-aware static websites using the blazing-fast and well-documented Hugo static site generator. Finally, the bookdown package is just awesome. It takes multiple RMarkdown documents as input and renders into multiple output formats (screen-readable ebook, PDF, epub, etc.). It looks great for writing books and technical documentation with pushbutton publishing to multiple output formats with some nice built-in styles out of the box. Some examples:

bookdown.org/yihui/bookdown — The bookdown book, written in RMarkdown with bookdown. (whoa, meta)
r4ds.had.co.nz — Garrett Grolemund and Hadley Wickham’s R for Data Science book.
tidytextmining.com — Julia and David’s book on text mining
moderndive.com — an open-source introductory statistics class textbook

Finally, a few gems from other talks that I jotted down:

Chester Ismay gave a great talk on teaching introductory statistics using R, with the open-source course textbook written in RMarkdown using bookdown.
Bob Rudis talked about using pipes (%>%), and pipes within pipes, and best piping practices. See his slides here.
Hilary Parker talked about the idea of analysis development (and analysis developers), drawing similarities to software development/developers. Hilary discussed this once before on the excellent podcast that she and Roger Peng host, and you can probably find it in their Conversations On Data Science ebook that summarizes and transcribes these conversations.
Simon Jackson introduced the corrr package for exploring and manipulating correlations and correlation matrices in a tidy way.
Gordon Shotwell introduced the easymake package that generates Makefiles from a data frame using R.
Karthik Ram quickly introduced several of the (many) rOpenSci packages related to data publication, data access, scientific literature access, scalable & reproducible computing, databases, visualization, taxonomy, geospatial analysis, and many utility tools for data analysis and manipulation.

With split sessions I missed more than half the talks. Lots of people here are active on Twitter, and you can catch many more notes and tidbits on the #rstudioconf hashtag. The meeting was superbly organized, I learned a ton, and I enjoyed meeting in person many of the folks I follow on Twitter and elsewhere online. A few days of 80-degree weather in mid-January didn’t hurt either. I’ll definitely be coming again next year. Kudos to the rstudio::conf organizers and speakers!

All the talks were recorded and will supposedly find their way to rstudio.com at some point soon. I’ll update this post with a link when that happens.

Update Feb 16, 2017: All the talks have now been posted online here under the rstudio::conf2017 heading.

Primers in computational biology

2016-09-19T09:20:00.001-05:00

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.

...

How does multiple testing correction work?
http://www.nature.com/nbt/journal/v27/n12/full/nbt1209-1135.html

What is principal component analysis?
http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html

SNP imputation in association studies
http://www.nature.com/nbt/journal/v27/n4/full/nbt0409-349.html

How does gene expression clustering work?
http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html

What is a hidden Markov model?
http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html

What is a support vector machine?
http://www.nature.com/nbt/journal/v24/n12/full/nbt1206-1565.html

What is the expectation maximization algorithm?
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html

Syntax Highlight Code in Keynote or Powerpoint

2016-06-30T13:22:00.001-05:00

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.

Install Homebrew if you don’t have it already, then run brew install highlight.
Run highlight -O rtf myfile.ext | pbcopy to convert the code to syntax-highlighted RTF and copy the result to the system clipboard.
Paste into Keynote or Powerpoint.

If I’ve got some code in a file called eset_pca.R:

I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.

Covcalc: Shiny App for Calculating Coverage Depth or Read Counts for Sequencing Experiments

2016-06-01T13:48:00.000-05:00

How many reads do I need? What's my sequencing depth? I get these questions all the time. Calculating how much sequence data you need to hit a target depth of coverage, or conversely, the coverage depth you'll get from a set amount of sequencing, takes only some basic algebra. Given one or the other, plus the genome size and read length/configuration, you can calculate either. This app was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.

Check out the app here:
http://apps.bioconnector.virginia.edu/covcalc/

And the source code on GitHub:
https://github.com/stephenturner/covcalc

Give it your read length, whether you're using single- or paired-end sequencing, select a genome or enter your own. Then, select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or number of reads sequenced, and the output text below is automatically updated.
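The algebra behind this kind of calculator can be sketched in a few lines of R. This is not the app's source code (see the GitHub link above for that), just the standard relationship coverage = reads x read length / genome size. The function names are hypothetical, n_reads counts individual reads (so halve it to get read pairs for paired-end runs), and the 3e9 genome size is an assumed round figure for a human-sized genome.

```r
# Coverage depth from a given number of reads:
# coverage = (number of reads * read length) / genome size
coverage_depth <- function(n_reads, read_length, genome_size) {
  n_reads * read_length / genome_size
}

# Inverse: number of reads needed to reach a target coverage
reads_needed <- function(target_coverage, read_length, genome_size) {
  ceiling(target_coverage * genome_size / read_length)
}

genome_size <- 3e9  # assumption: round figure for a human-sized genome

# Reads needed for 30x coverage with 150 bp reads
n <- reads_needed(30, read_length = 150, genome_size = genome_size)
n      # individual reads
n / 2  # read pairs, if sequencing paired-end

# Sanity check: plugging the answer back in recovers the target depth
coverage_depth(n, read_length = 150, genome_size = genome_size)
```

With these assumed numbers, 30x coverage works out to 6e8 reads of 150 bp, or 3e8 read pairs.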

Shiny Developer Conference 2016 Recap

2016-02-05T11:17:00.000-06:00

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.

tl;dr

reactivity is a beast
javascript isn’t cheating
there are already a ton of shiny features … and more on the way

reactivity

For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:

output$plot = renderPlot() is not an imperative to the browser to do a what … it’s a recipe for how the browser should do something.

Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.

If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force-directed network that outlines the connections between inputs and outputs.

Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.

Here’s a trivial example:

library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    output$size = renderPlot({

        dat = subset(mtcars, mpg > input$threshold)
        hist(dat$wt)

    })

    output$names = renderText({

        dat = subset(mtcars, mpg > input$threshold)
        rownames(dat)

    })
}

shinyApp(ui = ui, server = server)

The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.

The code below does the same thing but stores “dat” in a reactive expression, so the subset is calculated only once each time the threshold changes and is shared by both outputs.

library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)

    })

    output$size = renderPlot({

        hist(dat()$wt)

    })

    output$names = renderText({

        rownames(dat())

    })
}

shinyApp(ui = ui, server = server)

javascript

For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.

Below is code for a masterpiece that does some hide() and show():

# https://apps.bioconnector.virginia.edu/game
library(shiny)
library(shinyjs)
shinyApp(

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        useShinyjs(),
        hidden(actionButton("restart", "restart the game")),
        tags$h3(hidden(textOutput("game_over")))
  ),

  server = function(input, output) {

        output$game_over =
            renderText({
                "game over, man ... game over"
            })  

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")
            hide("start")
            show("restart")
        })

       observeEvent(input$restart, {
            hide("game_over")
            hide("restart")
            show("start")
        })

  }
)

everything else

brushing

http://shiny.rstudio.com/articles/plot-interaction.html

Adding a brush argument to plotOutput() lets you click and drag to select points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package. In this case the brush lets you zoom to see name frequency over a specific range of years.

# http://apps.bioconnector.virginia.edu/names/
library(shiny)
library(ggplot2)
library(ggthemes)
library(babynames)
library(scales)

options(scipen=999)

ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),
                plotOutput("plot2"),
                htmlOutput("info")

)

server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))

    })

    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(title ="% of individuals born with name by year and gender",
                 x = "\n click-and-drag over the plot to 'zoom'",
                 y = "")

    })

    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        req(input$plot_brush)
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(title ="% of individuals born with name by year and gender",
                 x = "",
                 y = "")

    })

    output$info = renderText({

        "data source: social security administration names from babynames package

"

    })

}

shinyApp(ui, server)

gadgets

http://shiny.rstudio.com/articles/gadgets.html

Gadgets are a relatively easy way to leverage Shiny reactivity for visual inspection of, and interaction with, data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

library(shiny)
library(miniUI)
library(ggplot2)

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
        miniContentPanel(
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")
            )
        )

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()
        })

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # flag the rows of the original data whose ids were brushed
            g = data$id %in% p$id

            # return a subset of the original data without brushed points
            # (logical indexing also returns all rows when nothing was brushed)
            stopApp(data[!g,])
        })
    }

    runGadget(ui, server)
}

# run to open plot viewer
# click and drag to brush
# press done to return a subset of the original data without brushed points
library(gapminder)
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also use the same method above but pipe the output into dplyr
# without the selected points, what is the mean life expectancy by country?
library(dplyr)
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%
    summarise(mean(lifeExp))

req()

http://shiny.rstudio.com/reference/shiny/latest/req.html

This solves the issue of requiring an input - I’m definitely going to use this so I don’t have to do the return(NULL) workaround:

# no need to do this any more
# 
# inFile = input$file1
# 
#         if (is.null(inFile))
#             return(NULL)

# use req() instead
req(input$file1)

profvis

http://rpubs.com/wch/123888

Super helpful method for digging into the call stack of your R code to see how you might optimize it.

One or two seconds of processing can make a big difference, particularly for a Shiny app …
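Here’s a minimal sketch of what that looks like in practice. The two helper functions (slow_squares and fast_squares) are made up for illustration and aren’t from the talk; the only profvis call used is profvis() itself, guarded so that it only runs in an interactive session with the package installed.

```r
# two ways to build the same vector of squares (hypothetical helpers)
slow_squares = function(n) {
    out = c()
    # growing the vector on every pass forces repeated copying
    for (i in seq_len(n)) out = c(out, i^2)
    out
}

# vectorized version of the same calculation
fast_squares = function(n) seq_len(n)^2

# profile both calls side by side; the viewer only opens interactively
if (interactive() && requireNamespace("profvis", quietly = TRUE)) {
    profvis::profvis({
        a = slow_squares(2e4)
        b = fast_squares(2e4)
    })
}
```

The flame graph should show where the time is going line by line, which is usually enough to spot a loop like the one in slow_squares().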

rstudio connect

https://www.rstudio.com/rstudio-connect-beta/

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Repel overlapping text labels in ggplot2

2016-01-08T09:50:00.003-06:00

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.

But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.
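As a minimal sketch of that swap, assuming simulated results rather than the dataset read from GitHub (the column names Gene, log2FoldChange, and padj, and the labeling thresholds, are assumptions made for illustration):

```r
library(ggplot2)
library(ggrepel)

set.seed(42)

# simulated results standing in for the real differential expression table
res = data.frame(
    Gene = paste0("gene", 1:200),
    log2FoldChange = rnorm(200, sd = 2),
    padj = runif(200)
)

# label only the points that pass both (arbitrary) thresholds
sig = subset(res, padj < 0.2 & abs(log2FoldChange) > 1)

p = ggplot(res, aes(log2FoldChange, -log10(padj))) +
    geom_point(colour = "grey60") +
    geom_point(data = sig, colour = "red") +
    # with geom_text() here instead, nearby labels would overlap
    geom_text_repel(data = sig, aes(label = Gene)) +
    theme_bw()
p
```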

And the result (much better!):

See the ggrepel package vignette for more.

GRUPO: Shiny App For Benchmarking Pubmed Publication Output

2015-12-14T08:29:00.000-06:00

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

The What

GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.

The How

The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.

The methodology is fairly simple:

Build the search query in Pubmed syntax based on user input parameters.
Extract total number of articles from results.
Output a visualization of the total counts for both selected institutions.
Extract unique article identifiers from results.
Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.

Build Query

The syntax for searching Pubmed relies on MEDLINE tags and boolean operators. You can peek into how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.

GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term will have to be built multiple times (at least twice to compare results for two institutions) I wrote a helper function called build_query():

# use %Y/%m/%d (e.g. 1999/02/14) date format for startDate and endDate arguments

build_query = function(institution, startDate, endDate) {

    if (grepl("-", institution)==TRUE) {                
        split_name = strsplit(institution, split="-")
        search_term = paste(split_name[[1]][1], '[Affiliation]',
                             ' AND ',
                             split_name[[1]][2],
                             '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate,
                             '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    } else {
        search_term = paste(institution, 
                             '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate,
                             '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    }

    return(search_term)
}

The if/else logic in there accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a - symbol. In other words, if you passed in “University of Colorado/Boulder” you’d be stuck.

So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:

University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]

And for University of Texas-Austin over the same dates you get the following term:

University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
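One way to relax the separator limitation mentioned above would be to split on a set of separators rather than only on the - symbol. The sketch below is not part of GRUPO: build_query2 and its seps argument are hypothetical names, and an institution whose name contains a legitimate hyphen or slash would still be split in the wrong place.

```r
# hypothetical variant of build_query() that accepts "-" or "/" as the
# separator between the institution and its locale
build_query2 = function(institution, startDate, endDate, seps = "[-/]") {

    # split the institution name into its parts and trim stray whitespace
    parts = trimws(strsplit(institution, split = seps)[[1]])

    # tag every part as an affiliation and join the parts with AND
    affil = paste0(parts, "[Affiliation]", collapse = " AND ")

    # Pubmed expects YYYY/MM/DD, so swap any dashes in the dates
    dates = gsub("-", "/", c(as.character(startDate), as.character(endDate)))

    paste0(affil, " AND ", dates[1], "[PDAT] : ", dates[2], "[PDAT]")
}

q = build_query2("University of Colorado/Boulder", "2014-01-01", "2015-01-01")
q
```

Building the affiliation part with paste0(..., collapse = " AND ") also means the if/else branches collapse into a single code path.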

The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.

For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:

search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])

Run Query

With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:

ids (unique Pubmed identifiers for each article in the result list)
count (total number of results)
retmax (maximum number of results that could have been returned)
file (the actual XML record containing the values above)

The following code returns total articles for each of two different searches:

affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)

total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)

Plot Results

The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.

We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.

With a data frame in hand, we can load it into ggplot2 and do some very simple barplotting:
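A minimal sketch of that step might look like the following. The counts are made up and simply stand in for total_articles and total_articles2, and the styling choices are assumptions rather than what the app necessarily uses.

```r
library(ggplot2)

# hypothetical totals standing in for total_articles and total_articles2
benchmark = data.frame(
    Institution = c("University of Virginia", "University of Texas-Austin"),
    Total.Articles = c(1200, 1500)
)

# one bar per institution, with bar height taken directly from the counts
p = ggplot(benchmark, aes(x = Institution, y = Total.Articles, fill = Institution)) +
    geom_bar(stat = "identity") +
    labs(x = "", y = "Total articles") +
    theme_minimal() +
    # the x axis already names the institutions, so drop the legend
    theme(legend.position = "none")
p
```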

Output Collaborations

Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.

The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.

To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:

collaboration_count = length(intersect(affiliation_search$ids, affiliation_search2$ids))
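To see what that calculation does, here is a toy version with made-up identifiers standing in for the two “ids” vectors:

```r
# made-up article identifiers standing in for affiliation_search$ids
# and affiliation_search2$ids
ids1 = c("101", "102", "103", "104")
ids2 = c("103", "104", "105")

# identifiers that appear in both result sets
shared = intersect(ids1, ids2)

# the number of shared identifiers is the collaboration count
toy_collaboration_count = length(shared)
toy_collaboration_count
```

Keep in mind that only the identifiers actually returned by the search (up to retmax) can be compared, so a retmax smaller than the result count would undercount the overlap.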

By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.

NB: This approach to assessing collaboration counts is unreliable for articles published before October 2013, which was when the National Library of Medicine (NLM) began including affiliation tags for all authors.

The Next Steps

What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.

For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.

And neither is any information about the institutions. While all of them are considered to have very high research activity one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.

So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.

Grupo Shiny App: http://apps.bioconnector.virginia.edu/grupo/

Grupo Source Code: https://github.com/vpnagraj/grupo

Tutorial: RNA-seq differential expression & pathway analysis with Sailfish, DESeq2, GAGE, and Pathview

2015-12-04T11:40:00.000-06:00

Background

This tutorial shows an example of RNA-seq data analysis with DESeq2, followed by KEGG pathway analysis using GAGE. It uses data from GSE37704, with processed data available on Figshare (DOI: 10.6084/m9.figshare.1601975). This dataset has six samples from GSE37704, where expression was quantified by either: (A) mapping to GRCh38 using STAR then counting reads mapped to genes with featureCounts under the union-intersection model, or (B) alignment-free quantification using Sailfish, summarized at the gene level using the GRCh38 GTF file. Both datasets are restricted to protein-coding genes only. Here I’ll use the Sailfish gene-level estimated counts.

Differential expression analysis

First, import the countdata and metadata directly from the web. Set up the DESeqDataSet, run the DESeq2 pipeline.

# Note importing BioC pkgs after dplyr requires explicitly using dplyr::select()
library(dplyr)
library(DESeq2)

# Which data do you want to use? Let's use the sailfish counts.
# browseURL("http://dx.doi.org/10.6084/m9.figshare.1601975")
# countDataURL = "http://files.figshare.com/2439061/GSE37704_featurecounts.csv"
countDataURL = "http://files.figshare.com/2600373/GSE37704_sailfish_genecounts.csv"

# Import countdata
countData = read.csv(countDataURL, row.names=1) %>% 
  dplyr::select(-length) %>% 
  as.matrix()

# Filter data where you only have 0 or 1 read count across all samples.
countData = countData[rowSums(countData)>1, ]
head(countData)

##                 SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
## ENSG00000198888     17528     23007     30241     24418     29152
## ENSG00000198763     21264     26720     35550     28878     32416
## ENSG00000198804    130975    151207    195514    178130    196727
## ENSG00000198712     49769     61906     78608     66478     69758
## ENSG00000228253      9304     11160     12830     12608     13041
## ENSG00000198899     45401     51260     66851     63433     66123
##                 SRR493371
## ENSG00000198888     34416
## ENSG00000198763     38422
## ENSG00000198804    244670
## ENSG00000198712     86808
## ENSG00000228253     16063
## ENSG00000198899     79215

# Import metadata
colData = read.csv("http://files.figshare.com/2439060/GSE37704_metadata.csv", row.names=1)
colData

##               condition
## SRR493366 control_sirna
## SRR493367 control_sirna
## SRR493368 control_sirna
## SRR493369      hoxa1_kd
## SRR493370      hoxa1_kd
## SRR493371      hoxa1_kd

# Set up the DESeqDataSet Object and run the DESeq pipeline
dds = DESeqDataSetFromMatrix(countData=countData,
                              colData=colData,
                              design=~condition)
dds = DESeq(dds)
dds

## class: DESeqDataSet 
## dim: 16755 6 
## metadata(0):
## assays(3): counts mu cooks
## rownames(16755): ENSG00000198888 ENSG00000198763 ...
##   ENSG00000267795 ENSG00000165795
## rowRanges metadata column names(27): baseMean baseVar ... deviance
##   maxCooks
## colnames(6): SRR493366 SRR493367 ... SRR493370 SRR493371
## colData names(2): condition sizeFactor

Next, get results for the HoxA1 knockdown versus control siRNA, and reorder them by p-value. Call summary on the results object to get a sense of how many genes are up or down-regulated at FDR 0.1.

res = results(dds, contrast=c("condition", "hoxa1_kd", "control_sirna"))
res = res[order(res$pvalue),]
summary(res)

## 
## out of 16755 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 4193, 25% 
## LFC < 0 (down)   : 4286, 26% 
## outliers [1]     : 22, 0.13% 
## low counts [2]   : 1299, 7.8% 
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Since we mapped and counted against the Ensembl annotation, our results only have information about Ensembl gene IDs. But, our pathway analysis downstream will use KEGG pathways, and genes in KEGG pathways are annotated with Entrez gene IDs. I wrote an R package for doing this offline the dplyr way (https://github.com/stephenturner/annotables), but the canonical Bioconductor way to do it is with the AnnotationDbi and organism annotation packages. Here we’re using the organism package (“org”) for Homo sapiens (“Hs”), organized as an AnnotationDbi database package (“db”) using Entrez Gene IDs (“eg”) as primary keys. To see what all the keys are, use the columns function.

library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

Let’s use the mapIds function to add more columns to the results. The row.names of our results table has the Ensembl gene ID (our key), so we need to specify keytype=ENSEMBL. The column argument tells the mapIds function which information we want, and the multiVals argument tells the function what to do if there are multiple possible values for a single input value. Here we ask to just give us back the first one that occurs in the database. Let’s get the Entrez IDs, gene symbols, and full gene names.

res$symbol = mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
res$entrez = mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")
res$name =   mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="GENENAME",
                     keytype="ENSEMBL",
                     multiVals="first")

head(res, 10)

## log2 fold change (MAP): condition hoxa1_kd vs control_sirna 
## Wald test p-value: condition hoxa1_kd vs control_sirna 
## DataFrame with 10 rows and 9 columns
##                  baseMean log2FoldChange      lfcSE      stat    pvalue
##                 <numeric>      <numeric>  <numeric> <numeric> <numeric>
## ENSG00000148773  1885.344      -3.172502 0.07868572 -40.31865         0
## ENSG00000138623  2939.936      -2.418238 0.05889229 -41.06205         0
## ENSG00000104368 13601.963       2.016802 0.05249643  38.41789         0
## ENSG00000124766  2692.200       2.379545 0.06193654  38.41908         0
## ENSG00000122861 35889.413       2.224779 0.05258658  42.30697         0
## ENSG00000116016  4558.157      -1.885339 0.04258766 -44.26961         0
## ENSG00000164251  2404.103       3.325196 0.07021236  47.35912         0
## ENSG00000125257  6187.386       1.943762 0.04259189  45.63692         0
## ENSG00000104321  9334.555       3.186856 0.06227530  51.17367         0
## ENSG00000183508  2110.345       3.190612 0.07488305  42.60794         0
##                      padj      symbol      entrez
##                 <numeric> <character> <character>
## ENSG00000148773         0       MKI67        4288
## ENSG00000138623         0      SEMA7A        8482
## ENSG00000104368         0        PLAT        5327
## ENSG00000124766         0        SOX4        6659
## ENSG00000122861         0        PLAU        5328
## ENSG00000116016         0       EPAS1        2034
## ENSG00000164251         0       F2RL1        2150
## ENSG00000125257         0       ABCC4       10257
## ENSG00000104321         0       TRPA1        8989
## ENSG00000183508         0      FAM46C       54855
##                                                                               name
##                                                                         <character>
## ENSG00000148773                                      marker of proliferation Ki-67
## ENSG00000138623 semaphorin 7A, GPI membrane anchor (John Milton Hagen blood group)
## ENSG00000104368                                      plasminogen activator, tissue
## ENSG00000124766                               SRY (sex determining region Y)-box 4
## ENSG00000122861                                   plasminogen activator, urokinase
## ENSG00000116016                                   endothelial PAS domain protein 1
## ENSG00000164251                   coagulation factor II (thrombin) receptor-like 1
## ENSG00000125257            ATP-binding cassette, sub-family C (CFTR/MRP), member 4
## ENSG00000104321 transient receptor potential cation channel, subfamily A, member 1
## ENSG00000183508                       family with sequence similarity 46, member C

Pathway analysis

We’re going to use the gage package (Generally Applicable Gene-set Enrichment for Pathway Analysis) for pathway analysis. See also the gage package workflow vignette for RNA-seq pathway analysis. Once we have a list of enriched pathways, we’re going to use the pathview package to draw pathway diagrams, shading the molecules in the pathway by their degree of up/down-regulation.

KEGG pathways

The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms. kegg.sets.hs is a named list of 229 elements. Each element is a character vector of member gene Entrez IDs for a single KEGG pathway. (See also go.sets.hs). sigmet.idx.hs is an index of numbers of signaling and metabolic pathways in kegg.sets.hs. In other words, KEGG pathways include other types of pathway definitions, like “Global Map” and “Human Diseases”, which may be undesirable in pathway analysis. Therefore, kegg.sets.hs[sigmet.idx.hs] gives you the “cleaner” gene sets of signaling and metabolic pathways only.

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
data(sigmet.idx.hs)
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
head(kegg.sets.hs, 3)

## $`hsa00232 Caffeine metabolism`
## [1] "10"   "1544" "1548" "1549" "1553" "7498" "9"   
## 
## $`hsa00983 Drug metabolism - other enzymes`
##  [1] "10"     "1066"   "10720"  "10941"  "151531" "1548"   "1549"  
##  [8] "1551"   "1553"   "1576"   "1577"   "1806"   "1807"   "1890"  
## [15] "221223" "2990"   "3251"   "3614"   "3615"   "3704"   "51733" 
## [22] "54490"  "54575"  "54576"  "54577"  "54578"  "54579"  "54600" 
## [29] "54657"  "54658"  "54659"  "54963"  "574537" "64816"  "7083"  
## [36] "7084"   "7172"   "7363"   "7364"   "7365"   "7366"   "7367"  
## [43] "7371"   "7372"   "7378"   "7498"   "79799"  "83549"  "8824"  
## [50] "8833"   "9"      "978"   
## 
## $`hsa00230 Purine metabolism`
##   [1] "100"    "10201"  "10606"  "10621"  "10622"  "10623"  "107"   
##   [8] "10714"  "108"    "10846"  "109"    "111"    "11128"  "11164" 
##  [15] "112"    "113"    "114"    "115"    "122481" "122622" "124583"
##  [22] "132"    "158"    "159"    "1633"   "171568" "1716"   "196883"
##  [29] "203"    "204"    "205"    "221823" "2272"   "22978"  "23649" 
##  [36] "246721" "25885"  "2618"   "26289"  "270"    "271"    "27115" 
##  [43] "272"    "2766"   "2977"   "2982"   "2983"   "2984"   "2986"  
##  [50] "2987"   "29922"  "3000"   "30833"  "30834"  "318"    "3251"  
##  [57] "353"    "3614"   "3615"   "3704"   "377841" "471"    "4830"  
##  [64] "4831"   "4832"   "4833"   "4860"   "4881"   "4882"   "4907"  
##  [71] "50484"  "50940"  "51082"  "51251"  "51292"  "5136"   "5137"  
##  [78] "5138"   "5139"   "5140"   "5141"   "5142"   "5143"   "5144"  
##  [85] "5145"   "5146"   "5147"   "5148"   "5149"   "5150"   "5151"  
##  [92] "5152"   "5153"   "5158"   "5167"   "5169"   "51728"  "5198"  
##  [99] "5236"   "5313"   "5315"   "53343"  "54107"  "5422"   "5424"  
## [106] "5425"   "5426"   "5427"   "5430"   "5431"   "5432"   "5433"  
## [113] "5434"   "5435"   "5436"   "5437"   "5438"   "5439"   "5440"  
## [120] "5441"   "5471"   "548644" "55276"  "5557"   "5558"   "55703" 
## [127] "55811"  "55821"  "5631"   "5634"   "56655"  "56953"  "56985" 
## [134] "57804"  "58497"  "6240"   "6241"   "64425"  "646625" "654364"
## [141] "661"    "7498"   "8382"   "84172"  "84265"  "84284"  "84618" 
## [148] "8622"   "8654"   "87178"  "8833"   "9060"   "9061"   "93034" 
## [155] "953"    "9533"   "954"    "955"    "956"    "957"    "9583"  
## [162] "9615"

The gage() function requires a named vector of fold changes, where the names of the values are the Entrez gene IDs.

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)

##      4288      8482      5327      6659      5328      2034 
## -3.172502 -2.418238  2.016802  2.379545  2.224779 -1.885339
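Because mapIds can return NA when an Ensembl ID has no Entrez match, and multiple Ensembl IDs can map to the same Entrez ID, the names of this vector may contain missing or duplicated values. The sketch below shows one possible cleanup on a toy vector. The values and the decision to keep the first occurrence are assumptions for illustration, and I haven’t checked here whether this changes the gage results.

```r
# toy fold changes with one missing and one duplicated Entrez ID
fc = c(-3.17, 2.02, 0.50, 1.10)
names(fc) = c("4288", NA, "5327", "4288")

# keep entries that have a name, and only the first occurrence of each name
keep = !is.na(names(fc)) & !duplicated(names(fc))
fc_clean = fc[keep]
fc_clean
```

Since the results were sorted by p-value earlier, keeping the first occurrence would keep the most significant entry for each Entrez ID.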

Now, let’s run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir. This value determines whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously (any genes in the pathway dysregulated).

For experimentally derived gene sets, GO term groups, etc., coregulation is commonly the case, hence same.dir = TRUE (default). In KEGG and BioCarta pathways, genes frequently are not coregulated, hence it could be informative to let same.dir = FALSE, although same.dir = TRUE could also be interesting for pathways.

Here, we’re using same.dir = TRUE, which will give us separate lists for pathways that are upregulated versus pathways that are downregulated. Let’s look at the first few results from each.

# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)

# Look at up (greater), down (less), and statistics.
lapply(keggres, head)

## $greater
##                                          p.geomean stat.mean        p.val
## hsa04142 Lysosome                     0.0002630657  3.517890 0.0002630657
## hsa04640 Hematopoietic cell lineage   0.0017919390  2.976432 0.0017919390
## hsa04630 Jak-STAT signaling pathway   0.0048980977  2.604390 0.0048980977
## hsa00140 Steroid hormone biosynthesis 0.0051115493  2.636206 0.0051115493
## hsa04062 Chemokine signaling pathway  0.0125582961  2.250765 0.0125582961
## hsa00511 Other glycan degradation     0.0223819919  2.104311 0.0223819919
##                                            q.val set.size         exp1
## hsa04142 Lysosome                     0.04261664      116 0.0002630657
## hsa04640 Hematopoietic cell lineage   0.14514706       61 0.0017919390
## hsa04630 Jak-STAT signaling pathway   0.20701775      119 0.0048980977
## hsa00140 Steroid hormone biosynthesis 0.20701775       39 0.0051115493
## hsa04062 Chemokine signaling pathway  0.40688879      156 0.0125582961
## hsa00511 Other glycan degradation     0.49956506       15 0.0223819919
## 
## $less
##                                      p.geomean stat.mean        p.val
## hsa04110 Cell cycle               2.165725e-06 -4.722301 2.165725e-06
## hsa03030 DNA replication          3.807440e-06 -4.835336 3.807440e-06
## hsa04114 Oocyte meiosis           1.109869e-04 -3.767561 1.109869e-04
## hsa03013 RNA transport            1.181787e-03 -3.071947 1.181787e-03
## hsa03440 Homologous recombination 1.197124e-03 -3.190747 1.197124e-03
## hsa00240 Pyrimidine metabolism    1.570318e-03 -2.992059 1.570318e-03
##                                          q.val set.size         exp1
## hsa04110 Cell cycle               0.0003084027      121 2.165725e-06
## hsa03030 DNA replication          0.0003084027       36 3.807440e-06
## hsa04114 Oocyte meiosis           0.0059932916      101 1.109869e-04
## hsa03013 RNA transport            0.0387868193      145 1.181787e-03
## hsa03440 Homologous recombination 0.0387868193       28 1.197124e-03
## hsa00240 Pyrimidine metabolism    0.0423985796       96 1.570318e-03
## 
## $stats
##                                       stat.mean     exp1
## hsa04142 Lysosome                      3.517890 3.517890
## hsa04640 Hematopoietic cell lineage    2.976432 2.976432
## hsa04630 Jak-STAT signaling pathway    2.604390 2.604390
## hsa00140 Steroid hormone biosynthesis  2.636206 2.636206
## hsa04062 Chemokine signaling pathway   2.250765 2.250765
## hsa00511 Other glycan degradation      2.104311 2.104311

Now, let’s process the results to pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting.

# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% 
  tbl_df() %>% 
  filter(row_number()<=5) %>% 
  .$id %>% 
  as.character()
keggrespathways

## [1] "hsa04142 Lysosome"                    
## [2] "hsa04640 Hematopoietic cell lineage"  
## [3] "hsa04630 Jak-STAT signaling pathway"  
## [4] "hsa00140 Steroid hormone biosynthesis"
## [5] "hsa04062 Chemokine signaling pathway"

# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids

## [1] "hsa04142" "hsa04640" "hsa04630" "hsa00140" "hsa04062"

Finally, the pathview() function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.

# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, plot_pathway)

Here are the plots:

Gene Ontology (GO)

We can also do a similar procedure with gene ontology. Similar to above, go.sets.hs has all GO terms. go.subs.hs is a named list containing indexes for the BP, CC, and MF ontologies. Let’s only do Biological Process.

data(go.sets.hs)
data(go.subs.hs)
gobpsets = go.sets.hs[go.subs.hs$BP]

gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)

lapply(gobpres, head)

## $greater
##                                                             p.geomean
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
##                                                          stat.mean
## GO:0007156 homophilic cell adhesion                       4.017207
## GO:0008285 negative regulation of cell proliferation      3.453345
## GO:0016339 calcium-dependent cell-cell adhesion           3.543891
## GO:0016337 cell-cell adhesion                             3.244296
## GO:0048729 tissue morphogenesis                           3.223979
## GO:1901617 organic hydroxy compound biosynthetic process  3.157421
##                                                                 p.val
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
##                                                              q.val
## GO:0007156 homophilic cell adhesion                      0.1613977
## GO:0008285 negative regulation of cell proliferation     0.4720349
## GO:0016339 calcium-dependent cell-cell adhesion          0.4720349
## GO:0016337 cell-cell adhesion                            0.4720349
## GO:0048729 tissue morphogenesis                          0.4720349
## GO:1901617 organic hydroxy compound biosynthetic process 0.4720349
##                                                          set.size
## GO:0007156 homophilic cell adhesion                           124
## GO:0008285 negative regulation of cell proliferation          458
## GO:0016339 calcium-dependent cell-cell adhesion                27
## GO:0016337 cell-cell adhesion                                 355
## GO:0048729 tissue morphogenesis                               429
## GO:1901617 organic hydroxy compound biosynthetic process      141
##                                                                  exp1
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
## 
## $less
##                                             p.geomean stat.mean
## GO:0048285 organelle fission             4.411540e-18 -8.850004
## GO:0000280 nuclear division              7.459684e-18 -8.805564
## GO:0007067 mitosis                       7.459684e-18 -8.805564
## GO:0000087 M phase of mitotic cell cycle 2.286444e-17 -8.655644
## GO:0007059 chromosome segregation        1.872901e-13 -7.686883
## GO:0051301 cell division                 5.841375e-12 -6.887763
##                                                 p.val        q.val
## GO:0048285 organelle fission             4.411540e-18 1.025209e-14
## GO:0000280 nuclear division              7.459684e-18 1.025209e-14
## GO:0007067 mitosis                       7.459684e-18 1.025209e-14
## GO:0000087 M phase of mitotic cell cycle 2.286444e-17 2.356752e-14
## GO:0007059 chromosome segregation        1.872901e-13 1.544394e-10
## GO:0051301 cell division                 5.841375e-12 4.013998e-09
##                                          set.size         exp1
## GO:0048285 organelle fission                  376 4.411540e-18
## GO:0000280 nuclear division                   352 7.459684e-18
## GO:0007067 mitosis                            352 7.459684e-18
## GO:0000087 M phase of mitotic cell cycle      362 2.286444e-17
## GO:0007059 chromosome segregation             141 1.872901e-13
## GO:0051301 cell division                      462 5.841375e-12
## 
## $stats
##                                                          stat.mean
## GO:0007156 homophilic cell adhesion                       4.017207
## GO:0008285 negative regulation of cell proliferation      3.453345
## GO:0016339 calcium-dependent cell-cell adhesion           3.543891
## GO:0016337 cell-cell adhesion                             3.244296
## GO:0048729 tissue morphogenesis                           3.223979
## GO:1901617 organic hydroxy compound biosynthetic process  3.157421
##                                                              exp1
## GO:0007156 homophilic cell adhesion                      4.017207
## GO:0008285 negative regulation of cell proliferation     3.453345
## GO:0016339 calcium-dependent cell-cell adhesion          3.543891
## GO:0016337 cell-cell adhesion                            3.244296
## GO:0048729 tissue morphogenesis                          3.223979
## GO:1901617 organic hydroxy compound biosynthetic process 3.157421
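
One way to pull out only the gene sets that survive multiple-testing correction is to filter each result matrix on its q.val column. Here's a minimal sketch using two small hand-built matrices whose q.val entries are copied from the $less and $greater output above. The 0.05 cutoff and the helper name sig_sets are my assumptions for illustration; neither is part of gage.

```r
# Toy matrices mimicking the layout of gobpres$less and gobpres$greater
# (q-values copied from the printed output above; other columns omitted)
toy_less = matrix(
  c(1.025209e-14, 1.544394e-10, 4.013998e-09), ncol = 1,
  dimnames = list(c("GO:0048285 organelle fission",
                    "GO:0007059 chromosome segregation",
                    "GO:0051301 cell division"),
                  "q.val"))
toy_greater = matrix(
  c(0.1613977, 0.4720349), ncol = 1,
  dimnames = list(c("GO:0007156 homophilic cell adhesion",
                    "GO:0008285 negative regulation of cell proliferation"),
                  "q.val"))

# Hypothetical helper: names of gene sets with a q-value below the cutoff
sig_sets = function(res, cutoff = 0.05) {
  q = res[, "q.val"]
  names(q)[!is.na(q) & q < cutoff]
}

sig_sets(toy_less)     # all three down-regulated sets pass
sig_sets(toy_greater)  # none of the up-regulated sets pass
```

With real results, the same helper could be applied as sig_sets(gobpres$less).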

Annotables: R data package for annotating/converting Gene IDs

2015-11-13T09:54:00.000-06:00

I work with gene lists on a nearly daily basis. Lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, etc. And lots of times I’ll need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this. I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.

This package has basic annotation information from Ensembl release 82 for:

Human (grch38)
Mouse (grcm38)
Rat (rnor6)
Chicken (galgal4)
Worm (wbcel235)
Fly (bdgp6)

Where each table contains:

ensgene: Ensembl gene ID
entrez: Entrez gene ID
symbol: Gene symbol
chr: Chromosome
start: Start
end: End
strand: Strand
biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
description: Full gene name/description.

Additionally, there are tables for human and mouse (grch38_gt and grcm38_gt, respectively) that link Ensembl gene IDs to Ensembl transcript IDs.
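
To make the ID-conversion use case concrete, here's a small base R sketch. It uses a toy three-row table with the same column names as the annotables tables (a few human mitochondrial genes), not the package itself, and the last query ID is a made-up placeholder included to show what happens when an ID has no match.

```r
# Toy annotation table with annotables-style column names (not the full table)
anno = data.frame(
  ensgene = c("ENSG00000210049", "ENSG00000198888", "ENSG00000198763"),
  entrez  = c(NA, 4535L, 4536L),
  symbol  = c("MT-TF", "MT-ND1", "MT-ND2"),
  stringsAsFactors = FALSE
)

# Gene list to convert; the last ID is a placeholder that will not match
my_genes = c("ENSG00000198763", "ENSG00000198888", "ENSG00000000000")

# match() returns the row index of each query ID, or NA when it is absent
idx = match(my_genes, anno$ensgene)
converted = data.frame(ensgene = my_genes,
                       entrez  = anno$entrez[idx],
                       symbol  = anno$symbol[idx],
                       stringsAsFactors = FALSE)
converted
```

Unmatched IDs come back as NA rather than being dropped, which makes it easy to see what failed to convert.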

Usage

The package isn’t on CRAN, so you’ll need devtools to install it.

# If you haven't already installed devtools...
install.packages("devtools")

# Use devtools to install the package
devtools::install_github("stephenturner/annotables")

It isn’t necessary to load dplyr, but the tables are tbl_df and will print nicely if you have dplyr loaded.

library(dplyr)
library(annotables)

Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):

grch38

## Source: local data frame [66,531 x 9]
## 
##            ensgene entrez  symbol   chr start   end strand        biotype
##              (chr)  (int)   (chr) (chr) (int) (int)  (int)          (chr)
## 1  ENSG00000210049     NA   MT-TF    MT   577   647      1        Mt_tRNA
## 2  ENSG00000211459     NA MT-RNR1    MT   648  1601      1        Mt_rRNA
## 3  ENSG00000210077     NA   MT-TV    MT  1602  1670      1        Mt_tRNA
## 4  ENSG00000210082     NA MT-RNR2    MT  1671  3229      1        Mt_rRNA
## 5  ENSG00000209082     NA  MT-TL1    MT  3230  3304      1        Mt_tRNA
## 6  ENSG00000198888   4535  MT-ND1    MT  3307  4262      1 protein_coding
## 7  ENSG00000210100     NA   MT-TI    MT  4263  4331      1        Mt_tRNA
## 8  ENSG00000210107     NA   MT-TQ    MT  4329  4400     -1        Mt_tRNA
## 9  ENSG00000210112     NA   MT-TM    MT  4402  4469      1        Mt_tRNA
## 10 ENSG00000198763   4536  MT-ND2    MT  4470  5511      1 protein_coding
## ..             ...    ...     ...   ...   ...   ...    ...            ...
## Variables not shown: description (chr)

Look at the human genes-to-transcripts table:

grch38_gt

## Source: local data frame [216,133 x 2]
## 
##            ensgene          enstxp
##              (chr)           (chr)
## 1  ENSG00000210049 ENST00000387314
## 2  ENSG00000211459 ENST00000389680
## 3  ENSG00000210077 ENST00000387342
## 4  ENSG00000210082 ENST00000387347
## 5  ENSG00000209082 ENST00000386347
## 6  ENSG00000198888 ENST00000361390
## 7  ENSG00000210100 ENST00000387365
## 8  ENSG00000210107 ENST00000387372
## 9  ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## ..             ...             ...

Tables are tbl_df, pipe-able with dplyr:

grch38 %>% 
  filter(biotype=="protein_coding" & chr=="1") %>% 
  select(ensgene, symbol, chr, start, end, description) %>% 
  head %>% 
  pander::pandoc.table(split.table=100, justify="llllll", style="rmarkdown")

ensgene	symbol	chr	start	end
ENSG00000158014	SLC30A2	1	26037252	26046133
ENSG00000173673	HES3	1	6244192	6245578
ENSG00000243749	ZMYM6NB	1	34981535	34985353
ENSG00000189410	SH2D5	1	20719732	20732837
ENSG00000116863	ADPRHL2	1	36088875	36093932
ENSG00000188643	S100A16	1	153606886	153613145

Table: Table continues below

description
solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013]
hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226]
ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021]
SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819]
ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304]
S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441]

Example with RNA-seq data

Here’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:

# Load libraries (install with Bioconductor if you don't have them)
library(DESeq2)
library(airway)

# Load the data and do the RNA-seq data analysis
data(airway)
airway = DESeqDataSet(airway, design = ~cell + dex)
airway = DESeq(airway)
res = results(airway)

# tidy results with biobroom
library(biobroom)
res_tidy = tidy.DESeqResults(res)
head(res_tidy)

## Source: local data frame [6 x 7]
## 
##              gene    baseMean    estimate   stderror  statistic
##             (chr)       (dbl)       (dbl)      (dbl)      (dbl)
## 1 ENSG00000000003 708.6021697  0.37424998 0.09873107  3.7906000
## 2 ENSG00000000005   0.0000000          NA         NA         NA
## 3 ENSG00000000419 520.2979006 -0.20215550 0.10929899 -1.8495642
## 4 ENSG00000000457 237.1630368 -0.03624826 0.13684258 -0.2648902
## 5 ENSG00000000460  57.9326331  0.08523370 0.24654400  0.3457140
## 6 ENSG00000000938   0.3180984  0.11555962 0.14630523  0.7898530
## Variables not shown: p.value (dbl), p.adjusted (dbl)

Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):

res_tidy %>% 
  arrange(p.adjusted) %>% 
  head(20) %>% 
  inner_join(grch38, by=c("gene"="ensgene")) %>% 
  select(gene, estimate, p.adjusted, symbol, description) %>% 
  pander::pandoc.table(split.table=100, justify="lrrll", style="rmarkdown")

gene	estimate	p.adjusted	symbol
ENSG00000152583	-4.316	4.753e-134	SPARCL1
ENSG00000165995	-3.189	1.44e-133	CACNB2
ENSG00000101347	-3.618	6.619e-125	SAMHD1
ENSG00000120129	-2.871	6.619e-125	DUSP1
ENSG00000189221	-3.231	9.468e-119	MAOA
ENSG00000211445	-3.553	3.94e-107	GPX3
ENSG00000157214	-1.949	8.74e-102	STEAP2
ENSG00000162614	-2.003	3.052e-98	NEXN
ENSG00000125148	-2.167	1.783e-92	MT2A
ENSG00000154734	-2.286	4.522e-86	ADAMTS1
ENSG00000139132	-2.181	2.501e-83	FGD4
ENSG00000162493	-1.858	4.215e-83	PDPN
ENSG00000162692	3.453	3.563e-82	VCAM1
ENSG00000179094	-3.044	1.199e-81	PER1
ENSG00000134243	-2.149	2.73e-81	SORT1
ENSG00000163884	-4.079	1.073e-80	KLF15
ENSG00000178695	2.446	6.275e-75	KCTD12
ENSG00000146250	2.64	1.143e-69	PRSS35
ENSG00000198624	-2.784	1.707e-69	CCDC69
ENSG00000148848	1.783	1.762e-69	ADAM12

Table: Table continues below

description
SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]
calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]
SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925]
dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064]
monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833]
glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555]
STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885]
nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557]
metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406]
ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217]
FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125]
podoplanin [Source:HGNC Symbol;Acc:HGNC:29602]
vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663]
period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845]
sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]
Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536]
potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678]
protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387]
coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487]
ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190]
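
Writing the joined table to a file, as suggested above, might look something like this. The sketch uses a two-row stand-in copied from the table above rather than the real DESeq2 results, and writes to a temporary path (an assumption, so nothing on disk gets clobbered).

```r
# Toy stand-in for the joined results (two rows copied from the table above)
top_genes = data.frame(
  gene        = c("ENSG00000152583", "ENSG00000165995"),
  estimate    = c(-4.316, -3.189),
  symbol      = c("SPARCL1", "CACNB2"),
  description = c(
    "SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]",
    "calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file to a temporary path
outpath = tempfile(fileext = ".tsv")
write.table(top_genes, file = outpath, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm each gene sits on a single row with all columns
check = read.delim(outpath, stringsAsFactors = FALSE, quote = "")
dim(check)
```

Tab delimiters are a reasonable choice here because the description column contains commas.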

Explore!

This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, get some help.

ls("package:annotables")
?grch38

Let’s join the transcript table to the gene table.

gt = grch38_gt %>% 
  inner_join(grch38, by="ensgene")

Now, let’s filter to get only protein-coding genes, group by the ensembl gene ID, summarize to count how many transcripts are in each gene, inner join that result back to the original gene list, so we can select out only the gene, number of transcripts, symbol, and description, mutate the description column so that it isn’t so wide that it’ll break the display, arrange the returned data descending by the number of transcripts per gene, head to get the top 10 results, and optionally, pipe that to further utilities to output a nice HTML table.

gt %>% 
  filter(biotype=="protein_coding") %>% 
  group_by(ensgene) %>% 
  summarize(ntxps=n_distinct(enstxp)) %>% 
  inner_join(grch38, by="ensgene") %>% 
  select(ensgene, ntxps, symbol, description) %>% 
  mutate(description=substr(description, 1, 20)) %>% 
  arrange(desc(ntxps)) %>% 
  head(10) %>% 
  pander::pandoc.table(split.table=100, justify="lrll", style="rmarkdown")

ensgene	ntxps	symbol	description
ENSG00000165795	77	NDRG2	NDRG family member 2
ENSG00000205336	77	ADGRG1	adhesion G protein-c
ENSG00000196628	75	TCF4	transcription factor
ENSG00000161249	68	DMKN	dermokine [Source:HG
ENSG00000154556	64	SORBS2	sorbin and SH3 domai
ENSG00000166444	62	ST5	suppression of tumor
ENSG00000204580	58	DDR1	discoidin domain rec
ENSG00000087460	57	GNAS	GNAS complex locus [
ENSG00000169398	57	PTK2	protein tyrosine kin
ENSG00000104529	56	EEF1D	eukaryotic translati

Let’s look up DMKN (dermokine) in Ensembl. Search Ensembl for ENSG00000161249, or use this direct link. You can browse the table or graphic to see the splicing complexity in this gene.
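
If the dplyr chain above feels opaque, the core step (counting distinct transcripts per gene and sorting) can be sketched in base R. The gene and transcript IDs below are made-up placeholders, not real Ensembl identifiers.

```r
# Toy gene-to-transcript table; all IDs here are placeholders
toy_gt = data.frame(
  ensgene = c("geneA", "geneA", "geneA", "geneB", "geneC", "geneC"),
  enstxp  = c("txA1", "txA2", "txA3", "txB1", "txC1", "txC2"),
  stringsAsFactors = FALSE
)

# Split transcripts by gene, count the distinct ones, then sort descending
ntxps = vapply(split(toy_gt$enstxp, toy_gt$ensgene),
               function(x) length(unique(x)),
               integer(1))
ntxps = sort(ntxps, decreasing = TRUE)
ntxps
```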

Or, let’s do something different. Let’s group the data by what type of gene it is (e.g., protein coding, pseudogene, etc), get the number of genes in each category, and plot the top 20.

library(ggplot2)
grch38 %>% 
  group_by(biotype) %>% 
  summarize(n=n_distinct(ensgene)) %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  ggplot(aes(reorder(biotype, n), n)) + 
  geom_bar(stat="identity") + 
  xlab("Type") + 
  theme_bw() + 
  coord_flip()

Software from CSHL Genome Informatics 2015

2015-11-02T08:50:00.001-06:00

I just returned from the Genome Informatics meeting at Cold Spring Harbor. This was, hands down, the best scientific conference I've been to in years. The quality of the talks and posters was excellent, and it was great meeting in person many of the scientists and developers whose tools and software I use on a daily basis. To get a sense of what the meeting was about, 140 characters at a time, you can access all the Tweets sent Oct 28-31 2015 tagged #gi2015 at this link.

Below is a very short list of software that was presented at GI2015. This is only a tiny slice of the tools and methods that were presented at the meeting, and the list is highly biased toward tools that I personally find interesting or useful to my own work (please don't be offended if I omitted your stuff, and feel free to mention it in the comments).

Monocle: Software for analyzing single-cell RNA-seq data
Paper: http://www.nature.com/nbt/journal/v32/n4/full/nbt.2859.html
Software: http://cole-trapnell-lab.github.io/monocle-release/

Kallisto: very fast RNA-seq transcript abundance estimation using pseudoalignment.
Preprint: http://arxiv.org/abs/1505.02710
Software: http://pachterlab.github.io/kallisto/about.html

Sleuth: R package for analyzing & reporting differential expression analysis from transcript abundances estimated with Kallisto.
Preprint: coming soon?
Software: http://pachterlab.github.io/sleuth/about.html
See also: The bear's lair (http://lair.berkeley.edu/): reanalysis of published RNA-seq studies using kallisto+sleuth.

QoRTs: Quality of RNA-Seq Toolset. Toolkit for QC, gene/junction counting, and other miscellaneous downstream processing from RNA-seq alignments.
Software: https://github.com/hartleys/QoRTs
Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4506620/

JunctionSeq: R package for testing differential junction usage with RNA-seq data.
Software: https://github.com/hartleys/JunctionSeq
Vignette: http://hartleys.github.io/JunctionSeq/doc/JunctionSeq.pdf

HISAT2: RNA-seq alignment against populations of genomes (aligns DNA also).
Software: http://ccb.jhu.edu/software/hisat2/index.shtml

Rail: software for aligning many-sample RNA-seq data, producing alignments, genome coverage bigWigs, and splice junction BED files.
Software: http://rail.bio
Preprint: http://biorxiv.org/content/early/2015/08/11/019067

LobSTR: genotype short tandem repeats from NGS data.
Software: http://melissagymrek.com/lobstr-code/
Paper: http://www.ncbi.nlm.nih.gov/pubmed/22522390

Basset: convolutional neural networks for learning functional/regulatory features of DNA sequence.
Software: https://github.com/davek44/Basset
Preprint: http://biorxiv.org/content/early/2015/10/05/028399

Genotype Query Tools (GQT): fast/efficient individual-level queries of large-scale variation data.
Software: https://github.com/ryanlayer/gqt
Preprint: http://biorxiv.org/content/early/2015/06/05/018259

Centrifuge: a metagenomics classifier.
Software: https://github.com/infphilo/centrifuge
Poster: http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf

Mash: MinHash-based method for rapidly estimating pairwise distances between genomes or metagenomes.
Software: https://github.com/marbl/Mash
Docs: http://mash.readthedocs.org/en/latest/
Preprint: http://biorxiv.org/content/early/2015/10/26/029827

VCFanno: ultrafast large-sample VCF annotation
Software: https://github.com/brentp/vcfanno

Ginkgo: Interactive analysis and assessment of single-cell copy-number variations
Paper: http://www.nature.com/nmeth/journal/v12/n11/full/nmeth.3578.html
Software: https://github.com/robertaboukhalil/ginkgo

StringTie: RNA-seq transcript assembly+quantification, with or without a reference. See paper for comparison to existing tools.
Software: http://ccb.jhu.edu/software/stringtie/
Source: https://github.com/gpertea/stringtie
Poster: http://ccb.jhu.edu/software/stringtie/cshl2015.pdf
Paper: http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html

Compiling RMarkdown from a Helper R Script

2015-08-06T11:17:00.000-05:00

The problem

I was looking for a way to compile an RMarkdown document and have the filename of the resulting PDF or HTML document contain the name of the input data that it processed. That is, if I compiled the analysis.Rmd file, where in that file it did some analysis and reporting on data001.txt, I’d want the resulting filename to look something like data001.txt.analysis.html. Or even better, to stick in a timestamp with the date, so if the analysis was compiled today, August 6 2015, the resulting filename would be data001.txt.2015-08-06.html. I also wanted to implement the entire solution in R, not relying on fiddly makefiles or scripts that may behave differently depending on the OS/environment.

I found a near-solution as described on this SO post and detailed on this follow-up blog post, but neither really addressed my problem.

The solution

The simplest solution I could come up with involved creating two files:

A .Rmd file that would actually do all the analysis and generate the compiled report.
A second .R script to be used as a config file. Here you’d specify the input data (and potentially other analysis parameters).

When you call rmarkdown::render() from an R script, the environment in which the code chunks are evaluated during knitting is parent.frame() by default, so anything you define in the .R config file will get passed on to the .Rmd that is to be compiled.
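
The mechanism can be illustrated without rmarkdown at all. In this rough sketch, run_chunk is a hypothetical stand-in for the chunk evaluation that happens during knitting (it is not how knitr is actually implemented): code evaluated in the caller's environment sees infile, while code evaluated in a clean environment does not.

```r
# Hypothetical stand-in for chunk evaluation: run code in a given
# environment, defaulting to the caller's environment
run_chunk = function(code, envir = parent.frame()) {
  eval(code, envir = envir)
}

infile = "data001.txt"

# Default: the "chunk" finds infile in the calling environment
seen = run_chunk(quote(paste("Reading", infile)))
seen

# A clean environment that inherits only from base R cannot see infile
clean = new.env(parent = baseenv())
res = tryCatch(run_chunk(quote(paste("Reading", infile)), envir = clean),
               error = function(e) "infile not found")
res
```

render() exposes the same choice through its envir argument, so the default is convenient for a config-script workflow, at the cost of the report depending on whatever happens to be defined in the calling session.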

Here’s what it looks like in practice.

First, the analysis.Rmd file that actually runs the analysis:

 ---
 title: "Analysis Markdown document"
 author: "Stephen Turner"
 date: "August 6, 2015"
 output: html_document
 ---

 This is the Rmarkdown document that runs the analysis.
 Some narrative text goes here. 
 Maybe we'll do some analysis here. The `infile` variable is passed 
 in from the config script. You could pass in other variables too.

 ```{r}
 # check that you defined infile from the config and that 
 # the file actually exists in the current directory
 stopifnot(exists("infile"))

 stopifnot(file.exists(infile))

 # read in the data
 x = read.table(infile)

 # do some stuff, make a plot, etc.
 result = mean(x$value)
 hist(x$value)
 ```

 Here is some conclusion narrative text. Maybe show some notes:

 - Input file used for this report: `r infile`
 - This report was compiled: `r Sys.Date()`
 - The mean of the `value` column is: `r result`

 Also, never forget to show your...

 ```{r}
 sessionInfo()
 ```

And the config.R helper script:

#-------- define the input filename --------#
infile = "data001.txt"
#----- Now just hit the source button! -----#

# check that the input file actually exists!
stopifnot(file.exists(infile))

# create the output filename
outfile = paste(infile, Sys.Date(), "analysis.html", sep=".")

# compile the document
rmarkdown::render(input="analysis.Rmd", output_file=outfile)

All I’d need to do now is open up the config.R script, edit the infile variable, and hit the source button in RStudio. This runs the analysis.Rmd as shown above for the input (data001.txt in this example) and saves the resulting compiled report as data001.txt.2015-08-06.analysis.html.
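
The filename logic is easy to check in isolation. In this sketch, make_outfile is a hypothetical helper (not part of the config script above) that wraps the same paste() call, with the date as an argument so the result is reproducible. data002.txt is a made-up second input.

```r
# Hypothetical helper wrapping the paste() call from config.R
make_outfile = function(infile, date = Sys.Date()) {
  paste(infile, date, "analysis.html", sep = ".")
}

# Fixed date so the result does not depend on when this is run
one = make_outfile("data001.txt", date = as.Date("2015-08-06"))
one

# paste() is vectorized, so several inputs can be handled at once
many = make_outfile(c("data001.txt", "data002.txt"),
                    date = as.Date("2015-08-06"))
many
```

Each resulting name could then be passed as output_file in a loop that sets infile before calling rmarkdown::render().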

(Crosspost at RPubs).

R: single plot with two different y-axes

2015-04-21T08:23:00.000-05:00

I forgot where I originally found the code to do this, but I recently had to dig it out again to remind myself how to draw two different y axes on the same plot to show the values of two different features of the data. This is somewhat distinct from the typical use case of aesthetic mappings in ggplot2 where I want to have different lines/points/colors/etc. for the same feature across multiple subsets of data.

For example, I was recently poking around with some data examining enrichment of a particular set of genes using a hypergeometric test as I was fiddling around with other parameters that included more genes in the selection (i.e., in the classic example, the number of balls drawn from some hypothetical urn). I wanted to show the -log10(p-value) on one axis and some other value (e.g., “n”) on the same plot, using a different axis on the right side of the plot.

Here’s how to do it. First, generate some data:

set.seed(2015-04-13)  # note: R evaluates this as arithmetic, so the seed is 1998

d = data.frame(x =seq(1,10),
           n = c(0,0,1,2,3,4,4,5,6,6),
           logp = signif(-log10(runif(10)), 2))

x	n	logp
1	0	1.400
2	0	0.590
3	1	1.200
4	2	1.500
5	3	0.028
6	4	0.380
7	4	2.500
8	5	0.067
9	6	0.041
10	6	0.360

The strategy here is to first draw one of the plots, then draw another plot on top of the first one, and manually add in an axis. So let’s draw the first plot, but leave some room on the right hand side to draw an axis later on. I’m drawing a red line plot showing the p-value as it changes over values of x.

par(mar = c(5,5,2,5))
with(d, plot(x, logp, type="l", col="red3", 
             ylab=expression(-log[10](italic(p))),
             ylim=c(0,3)))

Now, draw the second plot on top of the first using the par(new=T) call. Draw the plot, but don’t include an axis yet. Put the axis on the right side (axis(...)), and add text to the margin (mtext(...)). Finally, add a legend.

par(new = T)
with(d, plot(x, n, pch=16, axes=F, xlab=NA, ylab=NA, cex=1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number genes selected')
legend("topleft",
       legend=c(expression(-log[10](italic(p))), "N genes"),
       lty=c(1,0), pch=c(NA, 16), col=c("red3", "black"))
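
For context, the -log10(p) values in a plot like this would come from a hypergeometric test along the following lines. This is only a sketch of the urn calculation mentioned earlier: every count below is hypothetical, and the data plotted above were simulated with runif() rather than computed this way.

```r
# All counts are hypothetical
universe = 20000  # genes in the universe (balls in the urn)
in_set   = 200    # genes in the set of interest (white balls)
drawn    = 500    # genes selected (balls drawn)
hits     = 15     # selected genes that fall in the set (white balls drawn)

# P(X >= hits): upper tail, hence hits - 1 with lower.tail = FALSE
p = phyper(hits - 1, m = in_set, n = universe - in_set, k = drawn,
           lower.tail = FALSE)
neglogp = -log10(p)
neglogp
```

Repeating the calculation as drawn grows gives one -log10(p) per selection size, which is the kind of pair of series the two-axis plot is meant to show.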

Translational Bioinformatics Year In Review

2015-04-10T15:47:00.000-05:00

Per tradition, Russ Altman gave his "Translational Bioinformatics: The Year in Review" presentation at the close of the AMIA Joint Summit on Translational Bioinformatics in San Francisco on March 26th. This year, papers came from six key areas (and a final Odds and Ends category). His full slide deck is available here.

I always enjoy this talk because it routinely points me to new collections of data and new software tools that are useful for a variety of analyses; as such, I thought I would highlight these resources from his talk this year.

GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database
Some of you may have accessed the Johnson and O'Donnell catalog of GWAS results published in 2009. This data set was a more extensive collection of GWAS findings than the popular NHGRI GWAS catalog, as it did not impose a genome-wide significance threshold for reported associations. The GRASP database is a similar effort, reporting numerous attributes of each study.
A zip archive of the full data set (a flat file) is available here.

Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome
This paper tackles the enormously complex task of diagnosing rare genetic diseases using a combination of genetic variants (from a VCF file), a list of phenotype characteristics (fed from the Human Phenotype Ontology), and a few other aspects of the disease.
The online tool called PhenIX is available here.

A network based method for analysis of lncRNA disease associations and prediction of lncRNAs implicated in diseases
Here, Yang et al. examine relationships between known long non-coding RNAs and disease using graph propagation. Their underlying database, however, was generated using PubMed mining along with some manual curation.
Their lncRNA-Disease database is available here.

SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci
This tool is a type of SNP set enrichment, designed to specifically look at functional enrichment in the context of specific tissues and cell types. The tool is a C++ executable, available for download here.
The data sources underlying the SNPsea algorithm are available here.

Human symptoms-disease network
Here Zhou et al. systematically extract a symptom-to-disease network by exploiting MeSH annotations. They compiled a list of 322 symptoms and 4,442 diseases from the MeSH vocabulary, and document their occurrence within PubMed. Using this disease-symptom network, the authors explore the biological underpinnings of certain symptoms by looking at shared genomic elements between diseases with similar symptoms.
The full list of ~130,000 edges in their disease-symptom network is available here.

A circadian gene expression atlas in mammals: implications for biology and medicine
This fascinating paper explores the temporal impact on gene expression traits from 12 mouse organs. By systematically collecting transcriptome data from these tissues at two hour intervals, the authors construct a temporal atlas of gene expression, and show that 43% of protein-coding genes have a circadian expression profile in at least one organ.
The accompanying CircaDB database is available online here.

dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
The authors of dRiskKB use text mining across MEDLINE citations using a controlled disease vocabulary, in this case the Human Disease Ontology, to generate pairs of diseases that co-occur with specific patterns in abstract text. These pairs are ranked with a scoring algorithm and provide a new resource for disease co-morbidity relationships.
The flat file data driving dRiskKB can be found online here.

A tissue-based map of the human proteome
In this major effort, a group of investigators have published the most detailed atlas of human protein expression to date. The transcriptome has been extensively studied across human tissues, but it remains unclear to what extent transcriptional activity reflects translation into protein. But most importantly, the data are searchable via a beautiful website.
The underlying data from the Human Protein Atlas is available here.

R User Group Recap: Heatmaps and Using the caret Package

2015-04-10T10:01:00.002-05:00

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.

library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2", 
         distfun = "pearson", treeheight=c(200, 50), 
         fontsize=13, cexCol=.7, 
         filename="heatmap.png", width=8, height=16)
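
To see what "z-scores of columns" and "Pearson correlation as the distance metric" mean in practice, here's a base R sketch of the two ideas. It is my approximation of the concepts, assuming a correlation distance of 1 - r; it isn't necessarily what aheatmap does internally.

```r
# Numeric part of iris as a matrix (150 flowers x 4 measurements)
x = as.matrix(iris[, 1:4])

# Column z-scores: each measurement centered to mean 0 and scaled to sd 1
z = scale(x)

# Pearson correlation distance between rows (flowers): 1 - r
# cor() correlates columns, so transpose to compare rows
d_rows = as.dist(1 - cor(t(z)))

# Hierarchical clustering on that distance (complete linkage by default)
hc = hclust(d_rows)
hc
```

Correlation distance groups flowers with similar measurement profiles regardless of overall size, whereas Euclidean distance on raw values is dominated by the largest measurements.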

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.

First, get a look at the data with the featurePlot function in the caret package:

library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:

	setosa	versicolor	virginica
Sensitivity	1.00	1.00	0.00
Specificity	1.00	0.50	1.00
Pos Pred Value	1.00	0.50	NaN
Neg Pred Value	1.00	1.00	0.67
Prevalence	0.33	0.33	0.33
Detection Rate	0.33	0.33	0.00
Detection Prevalence	0.33	0.67	0.00
Balanced Accuracy	1.00	0.75	0.50
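
The per-class statistics above all derive from a confusion matrix on the test set. The base-R sketch below rebuilds one confusion matrix that is consistent with the table (every setosa and versicolor sample classified correctly, every virginica sample called versicolor) and recomputes a few of the statistics. The test-set size of 5 per class is a hypothetical value chosen for illustration; the post does not state it, and the ratios do not depend on it.

```r
# Hypothetical test set: k samples per class (k is an assumption).
classes <- c("setosa", "versicolor", "virginica")
k <- 5
truth <- factor(rep(classes, each = k), levels = classes)
# Predictions consistent with the table: virginica is always called versicolor.
pred <- factor(rep(c("setosa", "versicolor", "versicolor"), each = k),
               levels = classes)

# Rows are predicted classes, columns are the true (reference) classes.
cm <- table(Predicted = pred, Reference = truth)

# One-vs-rest statistics for each class.
per_class <- sapply(classes, function(cl) {
  tp <- cm[cl, cl]
  fp <- sum(cm[cl, ]) - tp
  fn <- sum(cm[, cl]) - tp
  tn <- sum(cm) - tp - fp - fn
  c(Sensitivity  = tp / (tp + fn),
    Specificity  = tn / (tn + fp),
    PosPredValue = tp / (tp + fp),   # 0/0 gives NaN when a class is never predicted
    NegPredValue = tn / (tn + fn))
})
round(per_class, 2)
```

Rounded to two digits, these four rows reproduce the first four rows of the table, including the NaN positive predictive value for virginica.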

A big thanks to Mark and Steve at Hemoshear for putting this together!

Using and Abusing Data Visualization: Anscombe's Quartet and Cheating Bonferroni

2015-02-26T13:30:00.002-06:00

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

Let’s load and view the data. There’s a built-in dataset, but I munged the data into a tidy format and included it in an R package that I wrote primarily for myself.

# If you don't have Tmisc installed, first install devtools, then install
# from github: install.packages('devtools')
# devtools::install_github('stephenturner/Tmisc')
library(Tmisc)
data(quartet)
str(quartet)

## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...

set	x	y
I	10	8.04
I	8	6.95
I	13	7.58
…	…	…
II	10	9.14
II	8	8.14
II	13	8.74
…	…	…
III	10	7.46
III	8	6.77
III	13	12.74
…	…	…
IV	8	6.58
IV	8	5.76
IV	8	7.71
…	…	…

Now, let’s compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.

library(dplyr)
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))

## Source: local data frame [4 x 6]
##
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817

Looks like each dataset has essentially the same mean, standard deviation, and correlation coefficient between x and y.
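
If you'd rather not install Tmisc, a similar summary can be sketched from the built-in anscombe data frame, which stores the quartet in wide form (columns x1 to x4 and y1 to y4). The reshaping approach and the names quartet_long and stats are assumptions for illustration, not how the Tmisc dataset was built. The median of y is included as an extra column to show a summary that is not shared across the sets.

```r
# Reshape the built-in wide `anscombe` data into one row per (set, x, y).
data(anscombe)
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = as.character(as.roman(i)),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# Per-set summary statistics, one row per set.
stats <- do.call(rbind, lapply(split(quartet_long, quartet_long$set), function(d) {
  data.frame(set = d$set[1],
             mean_x = mean(d$x), sd_x = sd(d$x),
             mean_y = mean(d$y), sd_y = sd(d$y),
             cor_xy = cor(d$x, d$y),
             median_y = median(d$y))
}))
stats
```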

Now, let’s plot y versus x for each set with a linear regression trendline displayed on each plot:

library(ggplot2)
p = ggplot(quartet, aes(x, y)) + geom_point()
p = p + geom_smooth(method = lm, se = FALSE)
p = p + facet_wrap(~set)
p

This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.

With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.

Here’s a simple simulation showing why that doesn’t fly. In this example, I’m simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact the null hypothesis is true, because that’s what I’ve simulated. Now we can make a scatterplot for each predictor variable against our outcome, and look at that plot.

library(dplyr)
set.seed(42)
ndset = 64
n = 100
d = data_frame(
  set = factor(rep(1:ndset, each = n)),
  x = rnorm(n * ndset),
  y = rep(rnorm(n), ndset))
d

## Source: local data frame [6,400 x 3]
##
##    set       x       y
## 1    1  1.3710  1.2546
## 2    1 -0.5647  0.0936
## 3    1  0.3631 -0.0678
## 4    1  0.6329  0.2846
## 5    1  0.4043  1.0350
## 6    1 -0.1061 -2.1364
## 7    1  1.5115 -1.5967
## 8    1 -0.0947  0.7663
## 9    1  2.0184  1.8043
## 10   1 -0.0627 -0.1122
## .. ...     ...     ...

ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)

Now, if I were to go through this data and compute the p-value for the linear regression of y on each x, I’d get a roughly uniform distribution of p-values, my type I error would be where it should be, and my FDR- and Bonferroni-corrected p-values would all be far from significant (most of the Bonferroni-corrected values are exactly 1). This is what we expect: remember, the null hypothesis is true.

library(dplyr)
results = d %>%
  group_by(set) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
  mutate(bon = p.adjust(p, method = "bonferroni")) %>%
  mutate(fdr = p.adjust(p, method = "fdr"))
results

## Source: local data frame [64 x 4]
##
##    set      p   bon   fdr
## 1    1 0.2738 1.000 0.749
## 2    2 0.2125 1.000 0.749
## 3    3 0.7650 1.000 0.900
## 4    4 0.2094 1.000 0.749
## 5    5 0.8073 1.000 0.900
## 6    6 0.0132 0.844 0.749
## 7    7 0.4277 1.000 0.820
## 8    8 0.7323 1.000 0.900
## 9    9 0.9323 1.000 0.932
## 10  10 0.1600 1.000 0.749
## .. ...    ...   ...   ...

library(qqman)
qq(results$p)

BUT, if I were to look at those plots above and cherry-pick out which hypotheses to test based on how strong the correlation looks, my type I error will skyrocket. Looking at the plot above, it looks like the x variables 6, 28, 41, and 49 have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?

results %>% filter(set %in% c(6, 28, 41, 49))

## Source: local data frame [4 x 4]
##
##   set      p   bon   fdr
## 1   6 0.0132 0.844 0.749
## 2  28 0.0338 1.000 0.749
## 3  41 0.0624 1.000 0.749
## 4  49 0.0898 1.000 0.749

When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I'll say it again, the null hypothesis is true in this experiment, because I've simulated random data). In other words, my type I error is now completely out of control, with half of these hand-picked tests coming up as false positives at a p<0.05 level. You'll notice that the Bonferroni and FDR-corrected p-values (correcting for all 64 tests) are still not significant.
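
To put a rough number on this, here is a small base-R sketch that repeats the null simulation many times and, as a stand-in for eyeballing the scatterplots, always picks the four predictors with the largest absolute correlation. The number of replicates, the pick-by-correlation rule, and the variable names are assumptions made for illustration; they are not part of the analysis above.

```r
set.seed(42)
n <- 100       # samples per experiment
n_pred <- 64   # predictors per experiment
n_rep <- 200   # simulated experiments (assumption)
n_pick <- 4    # predictors "cherry-picked" per experiment (assumption)

p_all <- numeric(0)
p_picked <- numeric(0)
for (i in seq_len(n_rep)) {
  y <- rnorm(n)
  X <- matrix(rnorm(n * n_pred), nrow = n)
  r <- as.vector(cor(X, y))
  # t-test for a correlation; equivalent to the slope test in lm(y ~ x).
  tstat <- r * sqrt((n - 2) / (1 - r^2))
  p <- 2 * pt(-abs(tstat), df = n - 2)
  # Stand-in for visual selection: keep the strongest-looking correlations.
  picked <- order(abs(r), decreasing = TRUE)[seq_len(n_pick)]
  p_all <- c(p_all, p)
  p_picked <- c(p_picked, p[picked])
}

rate_all <- mean(p_all < 0.05)        # false positive rate, all tests
rate_picked <- mean(p_picked < 0.05)  # false positive rate, picked tests only
c(all_tests = rate_all, cherry_picked = rate_picked)
```

Under these assumptions the false positive rate across all tests should sit near the nominal 0.05, while the rate among the cherry-picked tests should be many times higher.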

The moral of the story here is to always look at your data, but don't "cheat" by choosing which statistical tests to perform based solely on that visualization exercise.

Microbial Genomics: the State of the Art in 2015

2015-02-04T11:19:00.000-06:00

Current Opinion in Microbiology recently published a special issue on genomics. In an excellent editorial overview, “Genomics: The era of genomically-enabled microbiology”, Neil Hall and Jay Hinton survey the state of the field in microbial genomics, summarize recent contributions, and give a great synopsis of each of the reviews in this issue. Hall and Hinton’s editorial overview goes into a little more depth, but here’s a rundown of the reviews in this special issue. There’s a lot of good stuff here!

Quantitative bacterial transcriptomics with RNA-seq (James Creecy and Tyrrell Conway) discusses RNA-seq in bacteria and how transcriptome analysis adds a wealth of annotation information to the genome.

One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly (Sergey Koren and Adam Phillippy) describes newer long-read sequencing technologies and their characteristics, discusses how microbial genomes can be easily and automatically finished using these methods for under $1,000, and discusses challenges for microbial and metagenome assembly.

Using comparative genomics to drive new discoveries in microbiology (Daniel Haft) describes progress using comparative genomics to make new discoveries, and takes the reader on a “bioinformatics journey” to describe a code-breaking exercise in comparative genomics that starts with weak hypotheses and uses genomics to fill in the biological picture.

Taking the pseudo out of pseudogenes (Ian Goodhead and Alistair Darby) reviews how pseudogenes are surprisingly prevalent, and discusses how problems with genome annotation can be addressed by combining multiple “omics” data.

Ten years of pan-genome analyses (George Vernikos et al.) describes how pan-genome analyses provide a framework for predicting and modeling genomic diversity, where the “core genome” of many bacterial species constitutes only a minority of genes.

Lateral gene transfers and the origins of the eukaryote proteome: a view from microbial parasites (Robert Hirt et al.) reviews the dynamic nature of lateral gene transfer, its role in microbial diversity, how it contributes to eukaryotic genomes, and how once again integrating different “omics” methodologies is needed to recognize the extent to which LGT affects eukaryotes.

The application of genomics to tracing bacterial pathogen transmission (Nicholas Croucher and Xavier Didelot) reviews how bacterial whole-genome sequencing gives you the ultimate resolution for investigating direct pathogen transmission, distinguishing transmission chains, and defining outbreaks. If you haven’t kept up with this quickly growing body of literature, this review is a great place to start catching up.

The impact of genomics on population genetics of parasitic diseases (Daniel Hupalo et al.) describes the influence of genomics on parasite population genetics and how burgeoning genomic data has enabled new types of investigations, and focuses on Plasmodium population genomics as a foundation for studies of neglected parasites.

R + ggplot2 Graph Catalog

2015-02-03T07:33:00.000-06:00

Joanna Zhao’s and Jenny Bryan’s R graph catalog is meant to be a complement to the physical book, Creating More Effective Graphs, but it’s a really nice gallery in its own right. The catalog shows a series of different data visualizations, all made with R and ggplot2. Click on any of the plots and you get the R code necessary to generate the data and produce the plot.

You can use the panel on the left to filter by plot type, graphical elements, or the chapter of the book if you’re actually using it. All of the code and data used for this website is open-source, in this GitHub repository. Here's an example for plotting population demographic data by county that uses faceting to create small multiples:

library(ggplot2)
library(reshape2)
library(grid)

this_base = "fig08-15_population-data-by-county"

my_data = data.frame(
  Race = c("White", "Latino", "Black", "Asian American", "All Others"),
  Bronx = c(194000, 645000, 415000, 38000, 40000),
  Kings = c(855000, 488000, 845000, 184000, 93000),
  New.York = c(703000, 418000, 233000, 143000, 39000),
  Queens = c(733000, 556000, 420000, 392000, 128000),
  Richmond = c(317000, 54000, 40000, 24000, 9000),
  Nassau = c(986000, 133000, 129000, 62000, 24000),
  Suffolk = c(1118000, 149000, 92000, 34000, 26000),
  Westchester = c(592000, 145000, 123000, 41000, 23000),
  Rockland = c(205000, 29000, 30000, 16000, 6000),
  Bergen = c(638000, 91000, 43000, 94000, 18000),
  Hudson = c(215000, 242000, 73000, 57000, 22000),
  Passiac = c(252000, 147000, 60000, 18000, 12000))

my_data_long = melt(my_data, id = "Race",
                     variable.name = "county", value.name = "population")

my_data_long$county = factor(
  my_data_long$county, c("New.York", "Queens", "Kings", "Bronx", "Nassau",
                         "Suffolk", "Hudson", "Bergen", "Westchester",
                         "Rockland", "Richmond", "Passiac"))

my_data_long$Race =
  factor(my_data_long$Race,
         rev(c("White", "Latino", "Black", "Asian American", "All Others")))

p = ggplot(my_data_long, aes(x = population / 1000, y = Race)) +
  geom_point() +
  facet_wrap(~ county, ncol = 3) +
  scale_x_continuous(breaks = seq(0, 1000, 200),
                     labels = c(0, "", 400, "", 800, "")) +
  labs(x = "Population (thousands)", y = NULL) +
  ggtitle("Fig 8.15 Population Data by County") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "grey60"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.margin = unit(0, "lines"),
        plot.title = element_text(size = rel(1.1), face = "bold", vjust = 2),
        strip.background = element_rect(fill = "grey80"),
        axis.ticks.y = element_blank())

p

ggsave(paste0(this_base, ".png"),
       p, width = 6, height = 8)

Keep in mind not all of these visualizations are recommended. You’ll find pie charts, ugly grouped bar charts, and other plots for which I can’t think of any sensible name. Just because you can use the add_cat() function from Hilary Parker’s cats package to fetch a random cat picture from the internet and create an annotation_raster layer to add to your ggplot2 plot, doesn’t necessarily mean you should do such a thing for a publication-quality figure. But if you ever needed to know how, this R graph catalog can help you out.

library(ggplot2)

this_base = "0002_add-background-with-cats-package"

## devtools::install_github("hilaryparker/cats")
library(cats)
## library(help = "cats")

p = ggplot(mpg, aes(cty, hwy)) +
  add_cat() +
  geom_point()
p

ggsave(paste0(this_base, ".png"), p, width = 6, height = 5)

R graph catalog (via Laura Wiley)

Microbiome Digest Blog

2015-01-20T14:55:00.000-06:00

I have a noteworthy blogs tag on this blog that I sort of forgot about, and haven't used in years. But I started reading one recently that's definitely qualified for the distinction.

The Microbiome Digest is written by Elisabeth Bik, a scientist studying the microbiome at Stanford. It's a near-daily compilation of papers and popular press articles mostly relating to microbiome research, split up into categories like the human microbiome, the non-human microbiome (soil, animal, plants, other environments), metagenomics and bioinformatics methods, reviews, news articles, and other general science or career advice articles.

I imagine Elisabeth spends hours each week culling the huge onslaught of literature into these highly relevant digests. I wish someone else would do the same for other areas I care about so I don't have to. I subscribe to the RSS feed and the email list so I never miss a post. If you're at all interested in metagenomics or microbiome research, I suggest you do the same!

Microbiome Digest

Using the microbenchmark package to compare the execution time of R expressions

2015-01-14T07:56:00.001-06:00

I recently learned about the microbenchmark package while browsing through Hadley’s advanced R programming book. I’ve done some quick benchmarking using system.time() in a for loop and taking the average, but the microbenchmark function in the microbenchmark package makes this much easier. Hadley gives the example of taking the square root of a vector using the built-in sqrt function versus the mathematical equivalent of raising the vector to the power of 0.5.

library(microbenchmark)
x = runif(100)
microbenchmark(
  sqrt(x),
  x ^ 0.5
)

By default, microbenchmark runs each argument 100 times to get an average look at how long each evaluation takes. Results:

Unit: nanoseconds
    expr  min     lq    mean median     uq   max neval
 sqrt(x)  825  860.5 1212.79  892.5  938.5 12905   100
   x^0.5 3015 3059.5 3776.81 3101.5 3208.0 15215   100

On average sqrt(x) takes 1212 nanoseconds, compared to 3776 for x^0.5. That is, the built-in sqrt function is about 3 times faster. (This was surprising to me. Anyone care to comment on why this is the case?)
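
For comparison, the system.time()-in-a-loop approach mentioned above might look something like the sketch below. The time_per_call helper and the repetition count are assumptions for illustration. Wrapping each expression in a function adds call overhead, so treat the numbers as rough; the sketch also checks that the two expressions really do return the same values.

```r
set.seed(1)
x <- runif(100)

# Both expressions should agree to within floating-point tolerance.
same_result <- isTRUE(all.equal(sqrt(x), x^0.5))

# Hypothetical helper: average elapsed seconds per call over `reps` repetitions.
time_per_call <- function(f, reps = 10000) {
  elapsed <- system.time(for (i in seq_len(reps)) f())[["elapsed"]]
  elapsed / reps
}

timings <- c(sqrt = time_per_call(function() sqrt(x)),
             power = time_per_call(function() x^0.5))
same_result
timings
```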

Now, let’s try it on something just a little bigger. This is similar to a real-life application I faced where I wanted to compute summary statistics of some value grouping by levels of some other factor. In the example below we’ll use the nycflights13 package, which is a data package that has info on 336,776 outbound flights from NYC in 2013. I’m going to go ahead and load the dplyr package so things print nicely.

library(dplyr)
library(nycflights13)
flights

Source: local data frame [336,776 x 16]

year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1  2013     1   1      517         2      830        11      UA  N14228
2  2013     1   1      533         4      850        20      UA  N24211
3  2013     1   1      542         2      923        33      AA  N619AA
4  2013     1   1      544        -1     1004       -18      B6  N804JB
5  2013     1   1      554        -6      812       -25      DL  N668DN
6  2013     1   1      554        -4      740        12      UA  N39463
7  2013     1   1      555        -5      913        19      B6  N516JB
8  2013     1   1      557        -3      709       -14      EV  N829AS
9  2013     1   1      557        -3      838        -8      B6  N593JB
10 2013     1   1      558        -2      753         8      AA  N3ALAA
..  ...   ... ...      ...       ...      ...       ...     ...     ...
Variables not shown: flight (int), origin (chr), dest (chr), air_time
(dbl), distance (dbl), hour (dbl), minute (dbl)

Let’s say we want to know the average arrival delay (arr_delay) broken down by each airline (carrier). There’s more than one way to do this.

Years ago I would have used the built-in aggregate function.

aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE)

This gives me the results I’m looking for:

   Group.1          x
1       9E  7.3796692
2       AA  0.3642909
3       AS -9.9308886
4       B6  9.4579733
5       DL  1.6443409
6       EV 15.7964311
7       F9 21.9207048
8       FL 20.1159055
9       HA -6.9152047
10      MQ 10.7747334
11      OO 11.9310345
12      UA  3.5580111
13      US  2.1295951
14      VX  1.7644644
15      WN  9.6491199
16      YV 15.5569853
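
If nycflights13 isn't installed, the same aggregate pattern can be tried on the built-in airquality data, which also has missing values. This is a sketch on a different dataset, chosen only because it ships with R; the object names are illustrative. It shows why na.rm=TRUE is passed through to mean, and checks the result against tapply, another base-R way to compute grouped means.

```r
# Mean Ozone by Month; Ozone contains NAs, like arr_delay in flights.
agg_raw <- aggregate(airquality$Ozone, by = list(Month = airquality$Month), mean)
agg <- aggregate(airquality$Ozone, by = list(Month = airquality$Month),
                 mean, na.rm = TRUE)

# Without na.rm = TRUE, any group containing an NA gets an NA mean.
anyNA(agg_raw$x)
anyNA(agg$x)

# tapply computes the same grouped means, returned as a named array.
tap <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
all.equal(agg$x, as.vector(tap))
```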

Alternatively, you can use the sqldf package, which feels natural if you’re used to writing SQL queries.

library(sqldf)
sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier")

Not long ago I learned about the data.table package, which is good at doing these kinds of operations extremely fast.

library(data.table)
flightsDT = data.table(flights)
flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier]

Finally, there’s my new favorite, the dplyr package, which I covered recently.

library(dplyr)
flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE))

Each of these will give you the same result, but which one is faster? That’s where the microbenchmark package becomes handy. Here, I’m passing all four evaluations to the microbenchmark function, and I’m naming those “base”, “sqldf”, “datatable”, and “dplyr” so the output is easier to read.

library(microbenchmark)
mbm = microbenchmark(
  base = aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE),
  sqldf = sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier"),
  datatable = flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier],
  dplyr = flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)),
  times=50
)
mbm

Here’s the output:

Unit: milliseconds
      expr     min      lq    mean  median      uq     max neval
      base 1487.39 1521.12 1544.73 1539.96 1554.55 1676.25    50
     sqldf  867.14  880.34  892.24  887.88  897.28  982.91    50
 datatable    4.12    4.57    5.29    4.89    5.43   18.69    50
     dplyr   14.49   15.53   16.59   15.86   16.58   25.04    50

In this example, data.table was clearly the fastest. Comparing median times, dplyr took ~3 times longer, sqldf took ~180 times longer, and the base aggregate function took over 300 times longer. Let’s visualize those results using ggplot2 (microbenchmark has an autoplot method available, and note the log scale):

library(ggplot2)
autoplot(mbm)

In this example data.table and dplyr were both relatively fast, with data.table being just a few milliseconds faster. Sometimes this will matter, other times it won’t. This is a matter of personal preference, but I personally find the data.table incantation not the least bit intuitive compared to dplyr. The way we pronounce flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)) is: “take flights then group that data by the carrier variable then summarize the data taking the mean of arr_delay.” The dplyr syntax, for me, is much easier to use and extend to much more complex data management and analysis tasks, so I’ll sacrifice those few milliseconds of program run time for the minutes or hours of programmer debugging time. But if you’re planning on running a piece of code on, for instance, millions or more simulations, then those few milliseconds might be important to you. The microbenchmark package makes benchmarking easy for small pieces of code like this.

The code used for this analysis is consolidated here on GitHub

Importing Illumina BeadArray data into R

2014-12-08T11:39:00.001-06:00

A colleague needed some help getting Illumina BeadArray gene expression data loaded into R for data analysis with limma. Hopefully whoever ran your arrays can export the data as text files formatted as described in the code below. If so, you can import those text files directly using the beadarray package. This way you avoid getting bogged down with GenomeStudio, which requires a license (ugh) and only runs on Windows (ughhh). Here's how I do it.

RNA-seq Data Analysis Course Materials

2014-11-20T16:08:00.001-06:00

Last week I ran a one-day workshop on RNA-seq data analysis in the UVA Health Sciences Library. I set up an AWS public EC2 image with all the necessary software installed. Participants logged into AWS, launched the image, and we kicked off the morning session with an introduction to the Unix shell (taught by Jessica Bonnie, a biostatistician here in our genomics group, and a fellow Software Carpentry instructor). I followed with a walkthrough of using FastQC for quality assessment, FASTX toolkit for trimming, TopHat for alignment, and featureCounts to summarize gene expression read counts at the gene level. I started the afternoon session with an introduction to R, followed by a tutorial on analyzing the count data we generated in the first part using DESeq2 in R.

All of the rendered course material is available here. The source code used to generate this material is all available on GitHub (go read my post on collaborative lesson development, if you haven't already). Much of the introductory Unix lesson material was adapted from the Software Carpentry and Data Carpentry projects.

I wrote a more thorough blog post about how the course went here on the Software Carpentry blog.

I also compiled a PDF of all the course materials, available on Figshare: http://dx.doi.org/10.6084/m9.figshare.1247658.

Operate on the body of a file but not the header

2014-10-14T13:22:00.001-05:00

Sometimes you need to run some UNIX command on a file but only want to operate on the body of the file, not the header. Create a file called body somewhere in your $PATH, make it executable, and add this to it:

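#!/bin/bash
# Read the first line of stdin (the header) and print it unchanged.
IFS= read -r header
printf '%s\n' "$header"
# Run the given command on the rest of stdin (the body).
eval $@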

Now, when you need to run something but ignore the header, use the body command first. For example, we can create a simple data set with a header row and some numbers:

$ echo -e "header\n1\n5\n4\n7\n3"
header
1
5
4
7
3

We can pipe the whole thing to sort:

$ echo -e "header\n1\n5\n4\n7\n3" | sort
1
3
4
5
7
header

Oops, we don’t want the header to be included in the sort. Let’s use the body command to operate only on the body, skipping the header:

$ echo -e "header\n1\n5\n4\n7\n3" | body sort
header
1
3
4
5
7

Sure, there are other ways to solve the problem with sort, but body will solve many more problems. If you have multiple header rows, just call body multiple times.

Inspired by this post on Stack Exchange.

R package to convert statistical analysis objects to tidy data frames

2014-09-16T10:23:00.000-05:00

I talked a little bit about tidy data in my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.

R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr packages are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning them in a tidy data frame format.

(From the documentation): if you performed a linear model on the built-in mtcars dataset and view the object directly, this is what you’d see:

lmfit = lm(mpg ~ wt, mtcars)
lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344

summary(lmfit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

If you’re just trying to read it this is good enough, but if you’re doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quick.
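
To make the "hacking around" concrete, here is a minimal base-R sketch of pulling the coefficient table out of the summary object by hand. The column names in coef_df are arbitrary choices for illustration, not broom's conventions.

```r
lmfit <- lm(mpg ~ wt, data = mtcars)

# The coefficient table is a numeric matrix buried inside the summary object.
coefs <- summary(lmfit)$coefficients

# Rebuild it as a data frame, moving the row names into a regular column.
coef_df <- data.frame(term = rownames(coefs),
                      estimate = coefs[, "Estimate"],
                      std_error = coefs[, "Std. Error"],
                      statistic = coefs[, "t value"],
                      p_value = coefs[, "Pr(>|t|)"],
                      row.names = NULL)
coef_df
```

This works, but the matrix column names differ between model types, so the same code has to be rewritten for each kind of model object.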

But the tidy function in the broom package run on the fit object probably gives you what you were looking for in a tidy data frame:

library(broom)
tidy(lmfit)

         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10

The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().

View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.

broom: Convert statistical analysis objects from R into tidy format

UVA / Charlottesville R Meetup

2014-09-11T14:55:00.000-05:00

TL;DR? We started an R Users group, awesome community, huge turnout at first meeting, lots of potential.

---

I've sat through many hours of meetings where faculty lament the fact that their trainees (and the faculty themselves!) are woefully ill-prepared for our brave new world of computing- and data-intensive science. We've started to address this by running annual Software Carpentry bootcamps (March 2013, and March 2014). To make things more sustainable, we're running our own Software Carpentry instructor training here later this month, where we'll train scientists how to teach other scientists basic computing skills like using UNIX, programming in Python or R, version control, automation, and testing. I went through this training course online a few months ago, and it was an excellent introduction to pedagogy and the psychology of learning (let's face it, most research professors were never taught how to teach; it's a skill you learn, not one you inherit with the initials behind your name).

Something that constantly comes up in these conversations is how to promote continued learning and practice after the short bootcamp is over. Students get a whirlwind tour of scientific computing skills over two days, but there's generally very little of the follow-up that's necessary to encourage continued practice and learning.

At the same time we've got a wide variety of scientists spanning all disciplines including social sciences, humanities, medicine, physics, chemistry, math, and engineering that are doing scientific computing and data analysis on a daily basis who could really benefit from learning from one another.

These things motivated us to start a local R Users Group. So far we have 118 people registered on Meetup.com, and this week we had an excellent turnout at our first meeting, with over 70 people who RSVP'd.

At this first meetup we kicked things off with an introduction to the group, why we started it, and our goals. I then launched into a quick demo of some of the finer features in the dplyr package for effortless advanced data manipulation and analysis. The meetup group has a GitHub repository where all the code from our meetups will be stored. Finally, we concluded with a discussion of topics the group would like to see presented in the future: ggplot2, R package creation, reproducible research, dynamic documentation, and web scraping were a few of the things mentioned. We collectively decided that talks could be either research talks highlighting how these things were used in an actual research project, or they could be demo/tutorial in nature, like the dplyr talk I gave.

The rich discussion we had at the end of this session really highlighted the potential this community has. In addition to the diversity and relative gender-balance we saw in our first meetup's participants, we had participants from all over UVA as well as representation from local industry and non-profit organizations.

I, for one, am tremendously excited about this new community, and will post the occasional update from the group here on this blog.