<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Ahmet Alp Balkan</title>
    <link>https://ahmet.im/blog/</link>
    <description>Random thoughts on software development, tech, web startups and other boring stuff.</description>
    
    <lastBuildDate>Wed, 09 Jul 2025 10:39:11 -0700</lastBuildDate>
    
        <atom:link href="https://ahmet.im/blog/feed/rss.xml" rel="self" type="application/rss+xml"/>
    

    
    <item>
      <title>Kubernetes List API performance and reliability</title>
      <link>https://ahmet.im/blog/kubernetes-list-performance/</link>
      <guid>https://ahmet.im/blog/kubernetes-list-performance/</guid>
      <pubDate>Wed, 09 Jul 2025 10:39:11 -0700</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>kubernetes internals</category>
        
      
      <description>&lt;p&gt;At my current employer, we use Kubernetes to run hundreds of thousands of bare
metal servers, spread over hundreds of Kubernetes clusters. We use Kubernetes
beyond officially supported/tested scale limits by running more than 5,000
nodes and over a hundred thousand pods in a single cluster.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; In these
large-scale setups, expensive &amp;ldquo;list&amp;rdquo; calls on the Kubernetes API are the
Achilles&amp;rsquo; heel of control plane reliability and scalability. In this article,
I&amp;rsquo;ll explain which list call patterns pose the most risk, and how recent and
upcoming Kubernetes versions are improving the list API performance.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>From Metal To Apps: our Kubecon EU 2025 talk</title>
      <link>https://kccnceu2025.sched.com/event/1txGQ</link>
      <guid>https://kccnceu2025.sched.com/event/1txGQ</guid>
      <pubDate>Thu, 03 Apr 2025 11:45:00 +0100</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>linkedin</category>
        
      
      <description>We presented our LinkedIn compute infrastructure team&amp;rsquo;s journey moving LinkedIn&amp;rsquo;s fleet of 500,000+ bare metal servers, which run thousands of microservices and many stateful workloads, to a Kubernetes-based platform.
In this session, we talk about LinkedIn&amp;rsquo;s scale, how we automate bare metal server management and maintenance from the ground up, how we built Kubernetes node and cluster management layers for our needs, and how we&amp;rsquo;re building workload platforms for stateless, stateful and batch workloads.</description>
    </item>
    
    <item>
      <title>LinkedIn on the Kubernetes Podcast</title>
      <link>https://kubernetespodcast.com/episode/249-linkedin/</link>
      <guid>https://kubernetespodcast.com/episode/249-linkedin/</guid>
      <pubDate>Mon, 10 Mar 2025 20:19:29 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>linkedin</category>
        
      
      <description>Ronak and I were Abdel&amp;rsquo;s guests on the Kubernetes Podcast by Google ahead of our KubeCon talk in London next month. We talked about our work building the next generation of the compute infrastructure at LinkedIn with Kubernetes, the challenges we faced, and our journey dealing with scale and complexity so far.</description>
    </item>
    
    <item>
      <title>Every pod eviction in Kubernetes, explained</title>
      <link>https://ahmet.im/blog/kubernetes-evictions/</link>
      <guid>https://ahmet.im/blog/kubernetes-evictions/</guid>
      <pubDate>Thu, 27 Feb 2025 20:19:29 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
      
      <description>&lt;p&gt;Anyone who is running Kubernetes in a large-scale production setting cares about
having a predictable Pod lifecycle. Having unknown actors that can terminate
your Pods is a scary thought, especially when you’re running stateful workloads
or care about availability in general.&lt;/p&gt;
&lt;p&gt;There are many ways Kubernetes terminates workloads, each with non-trivial
(and not always predictable) machinery, and there&amp;rsquo;s no page that lists all
eviction modes in one place. This article digs into Kubernetes internals to
walk you through all the eviction paths that can terminate your Pods and to
explain why “kubelet restarts don’t impact running workloads” isn’t always
true. I&amp;rsquo;ll leave you with a cheatsheet at the end.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>So you wanna write Kubernetes controllers?</title>
      <link>https://ahmet.im/blog/controller-pitfalls/</link>
      <guid>https://ahmet.im/blog/controller-pitfalls/</guid>
      <pubDate>Wed, 22 Jan 2025 21:26:45 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>controller development</category>
        
      
      <description>&lt;p&gt;Any company using Kubernetes eventually starts looking into developing its
own custom controllers. After all, what&amp;rsquo;s not to like about being able to provision
resources with declarative configuration? &lt;a href="https://youtu.be/zCXiXKMqnuE?t=128"&gt;Control loops&lt;/a&gt; are fun,
and &lt;a href="https://kubebuilder.io/"&gt;Kubebuilder&lt;/a&gt; makes it extremely easy to get started with writing Kubernetes
controllers. Next thing you know, customers in production are relying on the
buggy controller you developed without understanding how to design idiomatic
APIs and build reliable controllers.&lt;/p&gt;
&lt;p&gt;Low barrier to entry combined with good intentions and the &amp;ldquo;illusion of
&lt;em&gt;working&lt;/em&gt; implementation&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&amp;rdquo; is not a recipe for
success while developing production-grade controllers. I&amp;rsquo;ve seen the real-world
consequences of controllers developed without adequate understanding of
Kubernetes and the controller machinery at multiple large companies. We went
back to the drawing board and rewrote nascent controller implementations a few
times, enough to observe which mistakes people new to controller development
make.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Notes on OpenAI Kubernetes outage</title>
      <link>https://ahmet.im/blog/openai-kubernetes-incident/</link>
      <guid>https://ahmet.im/blog/openai-kubernetes-incident/</guid>
      <pubDate>Mon, 18 Nov 2024 17:01:00 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
      
      <description>&lt;p&gt;Last week, OpenAI suffered an outage that lasted several hours and &lt;a href="https://status.openai.com/incidents/ctrsv3lwd797"&gt;published a
detailed postmortem&lt;/a&gt; about it.
I highly recommend reading it. These technical reports are usually a gold mine for
all large-scale Kubernetes users, as we all go through a similar set of
reliability issues running Kubernetes in production.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Tale of a Kubernetes node-feature-discovery incident</title>
      <link>https://ahmet.im/blog/nfd-incident/</link>
      <guid>https://ahmet.im/blog/nfd-incident/</guid>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>controller development</category>
        
      
      <description>&lt;p&gt;This is an analysis of a low-severity incident that took place in the
Kubernetes clusters at the company I work for. It taught me a lot about how to
think about the off-the-shelf components we bring from the ecosystem into the
critical path and operate at a scale much larger than these components were
designed for.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Kubernetes CRD generation pitfalls</title>
      <link>https://ahmet.im/blog/crd-generation-pitfalls/</link>
      <guid>https://ahmet.im/blog/crd-generation-pitfalls/</guid>
      <pubDate>Tue, 10 Sep 2024 09:57:53 -0700</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
        <category>controller development</category>
        
      
      <description>&lt;p&gt;A quick &lt;a href="https://sourcegraph.com/search?q=context:global++%2Bkubebuilder:object:root%3Dtrue+fork:false+-file:vendor+-file:staging+-file:test+-file:demo+file:go%24+count:20000+-repo:upbound+-repo:Azure+-repo:provider+-repo:crossplane&amp;amp;patternType=keyword&amp;amp;sm=0&amp;amp;groupBy=repo"&gt;code search query&lt;/a&gt; reveals at least 7,000 Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"&gt;Custom
Resource Definitions&lt;/a&gt; in the open source corpus,&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; most of which are
likely generated with &lt;a href="https://sigs.k8s.io/controller-tools"&gt;controller-gen&lt;/a&gt;, a tool that turns Go structs
with &lt;a href="https://book.kubebuilder.io/reference/markers"&gt;comment-based markers&lt;/a&gt; into Kubernetes CRD manifests, which
end up being custom APIs served by the Kubernetes API server.&lt;/p&gt;
&lt;p&gt;At LinkedIn, we develop our fair share of custom Kubernetes APIs and controllers
to run workloads or manage infrastructure. In doing so, we rely on the custom
resource machinery and &lt;code&gt;controller-gen&lt;/code&gt; heavily to generate our CRDs.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Why Kubernetes secrets take so long to update?</title>
      <link>https://ahmet.im/blog/kubernetes-secret-volumes-delay/</link>
      <guid>https://ahmet.im/blog/kubernetes-secret-volumes-delay/</guid>
      <pubDate>Wed, 28 Dec 2022 13:45:00 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
      <description>I recently ran a Twitter poll, and only 20% of the participants correctly predicted that it takes Kubernetes 60-90 seconds to propagate changes to Secrets and ConfigMaps on the mounted volumes. So I want to take you on a journey through the codebase to see how these volume types work and why it takes so long.
Before going on this journey, I would have answered the poll &amp;ldquo;nearly instantly&amp;rdquo; (like the largest group, 40%, did).</description>
    </item>
    
    <item>
      <title>Pitfalls reloading files from Kubernetes Secret &amp; ConfigMap volumes</title>
      <link>https://ahmet.im/blog/kubernetes-inotify/</link>
      <guid>https://ahmet.im/blog/kubernetes-inotify/</guid>
      <pubDate>Thu, 22 Sep 2022 16:00:37 +0000</pubDate>
      <author>Ahmet Alp Balkan</author>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Ahmet Alp Balkan</dc:creator>
      
        
        <category>kubernetes</category>
        
      
      <description>&lt;p&gt;Files on Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/#secret"&gt;Secret and ConfigMap volumes&lt;/a&gt; work in peculiar and
undocumented ways when it comes to watching changes to these files with the
&lt;a href="https://man7.org/linux/man-pages/man7/inotify.7.html"&gt;&lt;code&gt;inotify(7)&lt;/code&gt; syscall&lt;/a&gt;. Your typical file watch that works outside
Kubernetes might not work as you expect when you run the same program on
Kubernetes.&lt;/p&gt;
&lt;p&gt;On a normal filesystem, you start a watch on a file on disk with a library and
expect to get an event like &lt;code&gt;IN_MODIFY&lt;/code&gt; (file modified) or &lt;code&gt;IN_CLOSE_WRITE&lt;/code&gt;
(a file opened for writing was closed) when the file is changed. But these filesystem
events never happen for files on Kubernetes Secret/ConfigMap volumes.&lt;/p&gt;</description>
    </item>
    
  </channel>
</rss>