<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Test early; Refactor often; Make things]]></title>
  <link href="https://www.eriwen.com/atom.xml" rel="self"/>
  <link href="https://www.eriwen.com/"/>
  <updated>2020-04-30T08:27:54-07:00</updated>
  <id>https://www.eriwen.com/</id>
  <author>
    <name><![CDATA[Eric Wendelin]]></name>
    <email><![CDATA[me@eriwen.com]]></email>
  </author>

  
  <entry>
    <title type="html"><![CDATA[Analyzing library use with BigQuery]]></title>
    <link href="https://www.eriwen.com/analytics/analyzing-library-use-bigquery-github-activity/"/>
    <updated>2019-07-15T06:00:00-07:00</updated>
    <id>https://www.eriwen.com/analytics/analyzing-library-use-bigquery-github-activity</id>
    <content type="html"><![CDATA[<p>How do people use your technology? 
Ask any website owner, and they’ll point to rich Google Analytics charts. 
Many applications have built-in analytics that tell developers how features are used.
If you are an OSS library author, however, you don’t have these tools available.</p>

<p>In this post, I’m going to demonstrate how you can use BigQuery to get usage info for your open-source libraries.
<a href="#library-use-analysis-with-example-queries">Skip ahead to the advanced stuff</a> if you don’t need any convincing.</p>

<h2 id="are-library-analytics-important">Are library analytics important?</h2>
<p>I think it’s widely accepted that data driven decision-making is effective in many circumstances.
Having downloads data is only a course-grained adoption indicator, but it does not help you understand important usage questions like:</p>

<ul>
  <li>How many users would I break if I change this API?</li>
  <li>Are projects adopting this new feature?</li>
  <li>Which deprecated APIs can I drop from the next major version?</li>
  <li>In what context are people using this library?</li>
</ul>

<p>I have provided <a href="#library-use-analysis-with-example-queries">examples to start answering these questions</a> below.</p>

<p>In my experience, even partial data and answers can de-risk large efforts or stop fatally flawed work before investing too much.
<a href="https://hbr.org/2019/06/the-first-thing-great-decision-makers-do">Understand your default action first</a> before jumping into data.</p>

<p>Data analysis won’t help you if you don’t have much data, obviously.
Even if you haven’t shipped something to measure, much can be learned from how users use similar technologies.</p>

<h2 id="what-you-need-to-know-about-bigquery">What you need to know about BigQuery</h2>
<p>I’m not going to cover how to use <a href="https://cloud.google.com/bigquery/">Google BigQuery</a> in this post.
They have <a href="https://cloud.google.com/bigquery/docs/">excellent docs</a> that will give you the basics in a jiffy.</p>

<p>All you need to know is:</p>

<ul>
  <li>It is a very high-scale data store queried using a <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/">standard SQL syntax</a>.</li>
  <li>It has <a href="https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset">many useful public data sets</a>, including a <a href="https://console.cloud.google.com/marketplace/details/github/github-repos?filter=solution-type:dataset&amp;id=46ee22ab-2ca4-4750-81a7-3ee0f0150dcb">GitHub Activity dataset</a> we’ll see today.</li>
  <li>You can query up to 1TB of data per month for free. You can get quite a lot done with that amount; I share <a href="#todo">data-savings tips below</a>.</li>
</ul>

<p>BigQuery is awesome.
You might feel a child-like sense of wonder when querying Terabytes in seconds.</p>

<h2 id="library-use-analysis-with-example-queries">Library use analysis with example queries</h2>
<p>Here are some queries I make to answer common questions about my projects. 
You can try these out yourself after you <a href="https://cloud.google.com/bigquery/public-data/">enable BigQuery and the GitHub Activity dataset</a> on your Google account.</p>

<p>The first thing we’ll want to do is save a subset of file contents to our own table to save a bunch of data.
Make sure you “Create Dataset” so you can store your own BigQuery tables.
You can story 10GB for free.</p>

<p>Create your own (much smaller) <code class="highlighter-rouge">files</code> and <code class="highlighter-rouge">contents</code> tables.
Change the <code class="highlighter-rouge">WHERE</code> clauses to only match what you’re interested in.</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="c1">-- "Save Result" to '&lt;your_dataset&gt;.files' table</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">repo_name</span> <span class="n">repo</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">path</span> <span class="n">path</span>
<span class="k">FROM</span> <span class="nv">`bigquery-public-data.github_repos.files`</span> <span class="n">f</span>
<span class="k">WHERE</span> <span class="n">f</span><span class="p">.</span><span class="n">path</span> <span class="k">LIKE</span> <span class="s1">'%.gradle%'</span><span class="p">;</span></code></pre></figure>

<p>The <code class="highlighter-rouge">github_repos</code> dataset has contents of all files under 1MB.
Change <code class="highlighter-rouge">&lt;your_dataset&gt;</code> and note this is only querying <code class="highlighter-rouge">github_repos.sample_contents</code> to save your data for demo purposes; change to <code class="highlighter-rouge">contents</code> to get everything.</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="c1">-- "Save Result" to '&lt;your_dataset&gt;.contents' table</span>
<span class="k">SELECT</span>
  <span class="n">files</span><span class="p">.</span><span class="n">id</span> <span class="k">AS</span> <span class="n">id</span><span class="p">,</span>
  <span class="n">files</span><span class="p">.</span><span class="n">repo</span> <span class="k">AS</span> <span class="n">repo_name</span><span class="p">,</span>
  <span class="n">contents</span><span class="p">.</span><span class="n">content</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.files`</span> <span class="n">files</span><span class="p">,</span>
  <span class="nv">`bigquery-public-data.github_repos.sample_contents`</span> <span class="n">contents</span>
<span class="k">WHERE</span> <span class="n">files</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">contents</span><span class="p">.</span><span class="n">id</span><span class="p">;</span></code></pre></figure>

<p><strong>NOTE</strong>: subsequent sections of this article will use <code class="highlighter-rouge">&lt;your_dataset&gt;</code> in order to save data.</p>

<h3 id="counting-project-adoption">Counting Project Adoption</h3>
<p>First, let’s count the number of repositories that use the library or API we’re interested in.
For example, we can count the number of open-source Android applications this way:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span> <span class="n">files</span><span class="p">.</span><span class="n">repo</span><span class="p">)</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.files`</span> <span class="n">files</span><span class="p">,</span>
  <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="n">contents</span>
<span class="k">WHERE</span>
  <span class="n">files</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">contents</span><span class="p">.</span><span class="n">id</span>
  <span class="k">AND</span> <span class="n">files</span><span class="p">.</span><span class="n">path</span> <span class="k">LIKE</span> <span class="s1">'build.gradle%'</span>
  <span class="k">AND</span> <span class="n">contents</span><span class="p">.</span><span class="n">content</span> <span class="k">LIKE</span> <span class="s1">'%com.android.application%'</span><span class="p">;</span></code></pre></figure>

<p>You can inspect build scripts or <code class="highlighter-rouge">yarn.lock</code> or other files to get a different view of usage by version than what you’d get from your download numbers.</p>

<p>You might want to filter out repositories with little activity.
You can join to the <code class="highlighter-rouge">commits</code> table to achieve this:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">repo</span><span class="p">,</span> <span class="n">TIMESTAMP_SECONDS</span><span class="p">(</span><span class="n">created_date</span><span class="p">)</span> <span class="k">AS</span> <span class="n">created_date</span>
<span class="k">FROM</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">repo</span><span class="p">,</span> <span class="k">MIN</span><span class="p">(</span><span class="n">committer</span><span class="p">.</span><span class="nb">date</span><span class="p">.</span><span class="n">seconds</span><span class="p">)</span> <span class="k">AS</span> <span class="n">created_date</span>
  <span class="k">FROM</span>
    <span class="nv">`bigquery-public-data.github_repos.commits`</span> <span class="n">commits</span><span class="p">,</span>
    <span class="k">UNNEST</span><span class="p">(</span><span class="n">commits</span><span class="p">.</span><span class="n">repo_name</span><span class="p">)</span> <span class="k">AS</span> <span class="n">repo</span>
  <span class="k">WHERE</span>
    <span class="n">repo</span> <span class="k">IN</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">files</span><span class="p">.</span><span class="n">repo</span>
    <span class="k">FROM</span> <span class="nv">`&lt;your_dataset&gt;.files`</span> <span class="n">files</span><span class="p">,</span> <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="n">contents</span>
    <span class="k">WHERE</span>
      <span class="n">files</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">contents</span><span class="p">.</span><span class="n">id</span>
      <span class="k">AND</span> <span class="n">files</span><span class="p">.</span><span class="n">path</span> <span class="k">LIKE</span> <span class="s1">'build.gradle%'</span>
      <span class="k">AND</span> <span class="n">contents</span><span class="p">.</span><span class="n">content</span> <span class="k">LIKE</span> <span class="s1">'%android%'</span><span class="p">)</span>
  <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">repo</span>
  <span class="k">HAVING</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">committer</span><span class="p">.</span><span class="nb">date</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">5</span>
<span class="p">)</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">created_date</span> <span class="k">DESC</span><span class="p">;</span></code></pre></figure>

<p>I haven’t found a way to use BigQuery to find the timestamp when the code I’m interested in was added.
This is possible with the GitHub API, however, if you have found a better way to achieve this please comment.</p>

<h3 id="counting-uses-of-specific-apis">Counting uses of specific APIs</h3>
<p>I often count results to see if they are statistically meaningful before doing text processing.</p>

<p>Now that you have a smaller dataset, you can query without much fear of using a ton of data.
Even then, Google Cloud offers a $300 free trial credit and won’t charge actual money without asking.</p>

<p>Here’s a query that counts the repos that use the <code class="highlighter-rouge">@CacheableTask</code> (a Gradle API for build caching).</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span> <span class="k">c</span><span class="p">.</span><span class="n">repo_name</span><span class="p">)</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="k">c</span><span class="p">,</span>
  <span class="k">UNNEST</span><span class="p">(</span><span class="n">SPLIT</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">))</span> <span class="n">line</span>
<span class="k">WHERE</span> <span class="n">line</span> <span class="k">LIKE</span> <span class="s1">'@CacheableTask%'</span><span class="p">;</span></code></pre></figure>

<p>This will just return the count.
You can also rank uses of a subset of APIs.
For example, this query ranks internal APIs by most used:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span>
  <span class="n">REGEXP_EXTRACT</span><span class="p">(</span><span class="n">line</span><span class="p">,</span> <span class="n">r</span><span class="s1">'import ([a-zA-Z0-9</span><span class="se">\.</span><span class="s1">_]*)'</span><span class="p">)</span> <span class="k">class</span><span class="p">,</span>
  <span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span> <span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="p">)</span> <span class="k">count</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="k">c</span><span class="p">,</span>
  <span class="k">UNNEST</span><span class="p">(</span><span class="n">SPLIT</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">))</span> <span class="n">line</span>
<span class="k">WHERE</span> <span class="n">line</span> <span class="k">LIKE</span> <span class="s1">'import org.gradle.%internal%'</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="k">count</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span></code></pre></figure>

<h3 id="how-a-given-api-is-used">How a given API is used</h3>
<p>The GitHub Activity BigQuery dataset has all properly-licenced sources under 1Mb.
We can use this to look at the sources that use an API we’re interested in.</p>

<p>This query will select the details of all files that use a given API:</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="k">c</span><span class="p">.</span><span class="n">repo_name</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="k">c</span><span class="p">.</span><span class="n">content</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="k">c</span><span class="p">,</span>
  <span class="nv">`&lt;your_dataset&gt;.files`</span> <span class="n">f</span>
<span class="k">WHERE</span>
  <span class="k">c</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
  <span class="k">AND</span> <span class="k">c</span><span class="p">.</span><span class="n">content</span> <span class="k">LIKE</span> <span class="s1">'%@CacheableTask%'</span><span class="p">;</span></code></pre></figure>

<p>You can also look for remote endpoint identifiers such as <code class="highlighter-rouge">https://myservice.company.com/api/</code>.</p>

<p>This will give us all context surrounding the use of the API.
We can store this into another BigQuery table for further analysis.</p>

<p>Of course, you can also look at single lines instead (exploring use of Gradle’s ObjectFactory here):</p>

<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="k">c</span><span class="p">.</span><span class="n">repo_name</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="n">line</span>
<span class="k">FROM</span>
  <span class="nv">`&lt;your_dataset&gt;.contents`</span> <span class="k">c</span><span class="p">,</span>
  <span class="nv">`&lt;your_dataset&gt;.files`</span> <span class="n">f</span><span class="p">,</span>
  <span class="k">UNNEST</span><span class="p">(</span><span class="n">SPLIT</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">))</span> <span class="n">line</span>
<span class="k">WHERE</span>
  <span class="k">c</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> 
  <span class="k">AND</span> <span class="n">line</span> <span class="k">LIKE</span> <span class="s1">'%project.objects%'</span><span class="p">;</span></code></pre></figure>

<p>The results will look something like this:</p>

<table>
  <tbody>
    <tr>
      <td><strong>gradle-docker-plugin</strong></td>
      <td><code class="highlighter-rouge">val filteredImageName : Property&lt;String&gt; = project.objects.property(String::class)</code></td>
    </tr>
    <tr>
      <td><strong>ultrasonic</strong></td>
      <td><code class="highlighter-rouge">ListProperty&lt;File&gt; jacocoFiles = project.objects.listProperty(File.class)</code></td>
    </tr>
    <tr>
      <td><strong>gradle-script-kotlin</strong></td>
      <td><code class="highlighter-rouge">val message = project.objects.property&lt;String&gt;()</code></td>
    </tr>
    <tr>
      <td><strong>zap-extensions</strong></td>
      <td><code class="highlighter-rouge">val declaredAddOns = project.objects.setProperty&lt;String&gt;()</code></td>
    </tr>
  </tbody>
</table>

<hr />

<p>From here you can dig deeper into how your APIs are (or aren’t) used.</p>

<h2 id="advanced-tips">Advanced tips</h2>
<p>Here are some pro tips for working with BigQuery, and the <code class="highlighter-rouge">github_repos</code> public dataset in particular.</p>

<h3 id="use-the-sample_-tables-for-testing-before-querying-full-dataset">Use the <code class="highlighter-rouge">sample_</code> tables for testing before querying full dataset</h3>
<p>The <code class="highlighter-rouge">github_repos.contents</code> and <code class="highlighter-rouge">github_repos.files</code> tables are very large. 
Try your queries using <code class="highlighter-rouge">sample_*</code> tables first.
<strong>Heads up</strong>: The schemas for <code class="highlighter-rouge">sample_*</code> tables are <em>slightly different</em>.</p>

<h3 id="save-query-results-to-a-new-bigquery-table-and-use-it-for-subsequent-queries">Save query results to a new BigQuery table and use it for subsequent queries</h3>
<p>I’m repeating myself here because this tip is important: query large datasets only once to get the interesting subset, then query that table. You can store 10GB for free.</p>

<h3 id="use-cmd-shift-f-in-the-bigquery-editor-to-format-your-query">Use <em>Cmd-Shift-F</em> in the BigQuery editor to format your query</h3>
<p>I’ve found a good number of bugs in my queries by using the built-in formatting.
There are other shortcuts for running queries and auto-suggestions as well.</p>

<h2 id="conclusion">Conclusion</h2>
<p>I hope you found this a useful introduction to using BigQuery for OSS analysis.</p>

<p>If you want to go deeper, you might consider downloading and parsing sources using ASTs.
Matt Silverlock posted some <a href="https://speakerdeck.com/campoy/csi-gopher">interesting slides that describe this</a>.</p>

<p>Have you used this technique to understand OSS usage?
If so, what have you learned?</p>
]]></content>
  </entry>
  
</feed>
