Writing is clarifying

Productivity and Entropy

2026-03-07T11:22:43-08:00

I’ve spent over 25 years watching software systems accumulate entropy and drift into disrepair and failure. I’ve led several projects and know the excruciating pain it takes to bring such systems back on track. That experience makes me cautious about unbounded productivity gains with AI. The early productivity gains are clear, but we should discuss complexity and entropy before jumping to conclusions.

While there is no universal definition of complexity, when we say some software is complex, we usually mean a system with many parts that occasionally behaves in ways that go counter to our understanding. Entropy is a measure of uncertainty and unpredictability. It shows up as tangled dependencies, unpredictable states, cascading failures, etc.

Frederick Brooks explored these topics in the context of productivity 40 years ago in his ACM article titled No Silver Bullet: Essence and Accidents of Software Engineering. Since that article is behind a paywall, see this 1995 reproduction. In that article, Brooks made a few observations: (a) software construction involves two kinds of complexities, essential and accidental, (b) essential complexity is irreducible, and (c) there is no silver bullet for productivity.

AI is reshaping both essential and accidental complexities. We can now automate several aspects of determining what to build and how to build it. It sounds like we found the silver bullet that Brooks had given up on: more tokens → higher productivity → fewer people → more profits. But is that so? Let’s probe into the origins of complexity and entropy before jumping to such a conclusion.

Path dependence

The idea of path dependence is that early choices have irreversible consequences. As a system gets successfully adopted, early choices create a lock-in. We see it in every successful tech company: some early assumptions and designs usually dictate the rest of the software’s evolution. Those assumptions and designs ultimately influence not just code evolution but also team structures and even culture.

Paul David, an economist who introduced the idea of path dependence, uses the QWERTY keyboard as an example of how early typewriter designs in the 1880s have locked us in ever since. Path dependence constrains future choices. By limiting choice, it increases the cost of changes as requirements change and business conditions evolve.

Once I read about path dependence, I could not unsee its impact in the companies I worked at in my career. At every place, path dependence was at play - large monoliths, custom frameworks that lock the data in particular databases, particular team structures because “it has always been that way”, etc. Changing such things takes enormous time and effort. Most of what we consider legacy, or technical debt, is often the result of path dependence.

As we add more features to such software, the path dependence constraint will lead people to work around it. Then it becomes more difficult to understand the implications of changes. Entropy accumulates over time as we try to work within the constraints of path dependence.

Will AI help you circumvent path dependence? One might argue so: you could direct an agentic coding tool to refactor and rewrite the code to tear apart path dependence. However, in practice, that can prove to be disastrous. Rewriting any complex system, including all its dependencies, while preserving data and existing user behavior, is a hard task. The risk is high.

The next three factors are based on two excellent books on systems thinking: Thinking in Systems by Donella Meadows and Drift into Failure by Sidney Dekker.

Competing feedback loops

In the first chapter, Meadows introduces the concepts of “stocks” and “flows” and two kinds of feedback loops: balancing (stability-seeking) and reinforcing (amplifying, growth-seeking). Stock is the material you work with. In the context of software development, stock refers to the amount of code, services, data stores, various components, teams, and people. Flows are activities we do to manage the stock. Flows change the stock.

The best way to think about stability-seeking and amplifying feedback loops is to ask whether your organization is changing the stock for stability or growth. When focused on stability, you constrain the flow to reduce bugs and improve system stability and performance. When focused on growth, you increase the flow to prioritize business growth. You can try to improve both at the same time, but you cannot ignore the tension between the two.
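The tension between the two loops can be made concrete with a toy simulation. This is a minimal sketch, not a model taken from Meadows: the function name `simulate_stock` and every number in it (growth rate, target, correction strength) are assumptions chosen purely for illustration.

```python
def simulate_stock(steps, inflow_rate, target=None, correction=0.0, initial=100.0):
    """Simulate one stock (say, the amount of code) changed by two flows.

    inflow_rate: reinforcing loop, growth proportional to the current stock.
    target, correction: balancing loop, pulls the stock toward a target.
    All values are illustrative assumptions, not measurements.
    """
    stock = initial
    history = [stock]
    for _ in range(steps):
        growth = inflow_rate * stock  # reinforcing (growth-seeking) flow
        # Balancing (stability-seeking) flow; zero when no target is set.
        pull = correction * (target - stock) if target is not None else 0.0
        stock += growth + pull
        history.append(stock)
    return history


growth_only = simulate_stock(steps=20, inflow_rate=0.10)
both_loops = simulate_stock(steps=20, inflow_rate=0.10, target=150.0, correction=0.30)
print(round(growth_only[-1], 1), round(both_loops[-1], 1))
```

With these made-up numbers, the growth-only stock keeps compounding, while the stock with both loops levels off near 225 rather than at the target of 150, because the growth flow never stops pushing against the correction.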

In practice, no software organization does either alone. Some parts of an organization may be stability-seeking (such as your infrastructure or platform teams), while others may be growth-seeking (such as your feature teams). Similarly, your architects and senior engineers may prioritize the architecture’s stability and integrity, while the rest may prioritize speed. The tension between these two leads to conflicting choices and workarounds. The net result is an increase in the overall system’s entropy.

Now, bring AI into the picture. Unless constrained, AI will rapidly accelerate conflicting feedback loops. Empowered by a high-speed tool, each team could attempt to optimize the system in conflicting ways - some for stability, and many for growth. Most might get their outcomes in the short term, but very quickly, the competition between these factions will increase entropy faster than it would without AI.

Delayed feedback

Delayed feedback is like the slow drip you forgot to fix in the basement, and now you have mold in the house. In software, certain effects take time to manifest. For example, you might delay some cleanup or scale-out activity because everyone is busy introducing other changes to the system, unaware that entropy has been increasing and that the system is reaching a critical state. Things would be fine for some time, and then one day you might find yourself firefighting. The delay in the feedback creates a false sense of safety, which then leads to delayed repairs. Per Dekker, delayed maintenance and repair contribute to systems drifting into failure.

Delays show up in other forms, too. In one case, a minor data corruption issue remained undetected for several months, and correcting it became expensive and time-consuming. In reality, in any large software-powered enterprise, there are likely several such slow feedback loops at play. Such delays usually result in future unplanned maintenance work and drain your productivity calculations.
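A small simulation shows why stale feedback is costly. This is a minimal sketch under stated assumptions: "entropy" is reduced to a single number, every change adds one unit, and maintenance happens only when the observed (delayed) reading crosses a threshold. The function `peak_entropy` and all its parameters are hypothetical.

```python
def peak_entropy(delay, steps=60, growth=1.0, threshold=10.0, repair=5.0):
    """Return the highest actual entropy reached when the team only sees
    the entropy level from `delay` steps ago. All numbers are made up."""
    actual = [0.0]
    for t in range(1, steps + 1):
        level = actual[-1] + growth  # every change adds a little entropy
        observed = actual[max(0, t - 1 - delay)]  # stale reading of the system
        if observed > threshold:
            # Maintenance starts only once the problem becomes visible.
            level = max(0.0, level - repair)
        actual.append(level)
    return max(actual)


immediate = peak_entropy(delay=0)
delayed = peak_entropy(delay=10)
print(immediate, delayed)
```

In this toy model the repair policy is identical in both runs; only the age of the information differs, and the delayed run lets the system climb much further before anyone reacts.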

Will AI help detect such delayed feedback loops sooner? Will it prioritize such repair work over other kinds of work without our prompting? Or will we have more such delayed feedback loops as we rapidly change the system with AI? While we don’t have evidence for either, AI will likely introduce more delayed feedback loops, requiring more frequent unplanned maintenance.

Stale/incorrect models

Meadows also reminds us that whatever we think we know about the world is a model. Models are incomplete, and different people construct different models to deal with the world outside their minds. Meadows says in Chapter 4,

… our models fall short of representing the world fully.

What does this have to do with software? As software ages and multiple people touch the software, our models drift apart. For example, the senior-most person in the company who wrote the original software (and thus the creator of path dependence) might have one model of the software. A junior engineer who joined the organization recently will have a very different model of the software. As software ages, changes by different people with different models lead to even more drifted models of how the system is supposed to work. Eventually, nobody would have the complete picture to reason about the system. Most technical debates I’ve witnessed are the result of people holding different models of how the system is supposed to work. People argue about how to add or modify something before checking whether they share the same assumption about how the current system is supposed to work.

As more changes get made, the coupling within the system changes in unexplainable ways. The result is increased entropy. Changes become time-consuming to make and difficult to validate.

AI will likely add more fuel to this situation unless we find ways to coerce everyone to use the same model of how the system is supposed to work. It will be difficult to construct such unified models for large monoliths, or even for monorepos.

Will AI have a better model of software (including systems’ runtime behavior and user behavior), carefully balance between stability-seeking and growth-seeking patterns, and manage entropy?

No. The four factors we discussed above - path dependence, competing feedback loops, delays in feedback, and above all, incomplete models - will create a complexity ceiling for AI.

Just consider our models. Like us, AI builds a model of the system and uses that model to determine actions. Like all models, an AI-built model would be imperfect too. Further, multiple people working on the same system will likely see their AI tool generate a slightly different model tailored to their use. AI thus magnifies the same model problem we humans have. It is like 100 teenage developers working on the same system, each with a different model of the system.

Since we’re the ones setting the goals for AI, we will likely continue to favor growth-seeking feedback loops over stability-seeking ones, delay maintenance, and allow our systems to drift toward failure faster.

AI will require us to hold on to good software engineering principles even tighter. Those who understand this will build systems that grow and last. The ones chasing unbounded productivity gains won’t know why they failed.

Agentic IDEs - 100 Teenager Dev Problem

2026-01-25T21:04:41-08:00

Imagine this situation: You are an experienced dev manager. You know how to assemble 15-20 adult software developers, analyze a moderately complex set of requirements, develop a project plan, break the work down, and delegate tasks across the team. You are capable of running appropriate SDLC rituals and delivering results on a predictable schedule. One day, you arrive at work only to find your team replaced by a hundred hyperactive teenagers. Those teenagers have googled everything and believe they know it all. They are excited about their first job, curious, restless, and eager to start working on their keyboards. Your task is to organize this team to achieve the same results much faster.

Where do you begin? How would you approach this challenge? That’s the kind of skill needed to produce predictable outcomes quickly with agentic IDEs. While my comparison of agentic IDEs to managing a team of 100 teenagers might sound snarky or funny, I recognize that it’s unfair to teenagers. After all, our future depends on them. I’m making this comparison to emphasize that taming agentic IDEs to achieve predictable results is much like getting a 100-teenager software development team to work well.

Here are some reasons why it is important to draw such a reference.

Agentic IDEs act confidently. They behave as though they know everything there is to know and impress you.
Yet it takes little effort on your part to coerce them into contradicting themselves, changing their minds, or giving up altogether.
You can give a complex specification to get them to quickly build a working system, but they can also just as easily send the system into chaos when asked to fix a bug.
Unless you are carefully watching, they can break your architecture and make changes in unwanted places. You must be extremely granular and specific with your instructions.
They act like they know how to follow instructions, but they can conveniently ignore them. They will give you vague explanations about why they ignored your instructions.
Depending on how their system prompts were designed, some respond obediently while others respond tersely to your questions.
Once in a while, you have to pull them back by their tails to keep them out of rabbit holes.

None of these patterns should be surprising. LLMs are massive prediction machines. They hallucinate, and they don’t understand language, code, or concepts. They are sophisticated statistical engines designed to predict the next token. Coercing them to produce decent outcomes is an art and not a science.

So, how would you get them to produce effective outcomes? Through December 2025 and early January 2026, I’ve dealt with three moderately complex problems: modernizing a legacy codebase, an mTLS-based system for authentication and authorization between services, and an internal developer CI/CD platform for creating and deploying apps in the cloud. Each problem would normally take 3-6 people over several months, and I spent about $500 (a decent portion of which went to AWS). In each case, I addressed full-stack concerns and got working solutions.

My experience varied from mediocre to excellent. I used multiple agentic IDEs. As I reflected on the differences between scenarios where they worked well and those where they struggled, I began to notice patterns and formulate hypotheses. I have some evidence supporting these hypotheses, and I will continue to refine them in the coming weeks and months. The goal of these hypotheses is to find out how to tame these agentic systems to produce reasonably correct software solutions.

Hypothesis 1: Good engineering principles matter more than ever.

Let us go back to the 100-teenager analogy. How would you organize a 100-teenager team to do some constructive work?

You would create an overall plan for how the work should be carried out. You would break the work into manageable parts and assign each to a small enough team. You will provide them with clear instructions on what to do. You will outline some dos and don’ts.

You will provide sufficient space and boundaries between subgroups so that each can work independently to accomplish its task. You will consider the dependencies between subgroups and identify what can be done in parallel and what must wait for other subgroups to finish. You will set communication pathways and protocols between the subgroups. You will assign the task of ensuring the correctness of those dependencies and communication to a few. You will tell them how to test that various subgroups did their work correctly to your satisfaction.

In other words, you would meticulously focus on a documented system design, architecture constraints, component choice, project plan, modularity, interfaces, composability, module testability, tests (unit, integration, end-to-end, etc.), acceptance criteria, etc. The same holds for taming agentic IDEs to work for you.
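The step of identifying what can be done in parallel and what must wait can be sketched with Python's standard `graphlib` module. The task names and their dependencies below are hypothetical placeholders, and `parallel_batches` is an illustrative helper, not part of any tool mentioned in this post.

```python
from graphlib import TopologicalSorter

# Hypothetical work items, each mapped to the items it must wait for.
dependencies = {
    "api-contract": set(),
    "db-schema": set(),
    "backend": {"api-contract", "db-schema"},
    "frontend": {"api-contract"},
    "integration-tests": {"backend", "frontend"},
}


def parallel_batches(deps):
    """Group tasks into waves; every task in a wave has all its dependencies done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can start right now
        batches.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Each wave could be handed to separate people or agent sessions at the same time, while the order of the waves captures what must wait.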

Furthermore, when you have a large number of people (and teenagers at that) working for you, you would focus more on correctness and completion and less on getting an MVP out the door quickly. When I compare the situations where I made steady progress and where I had to fight code regressions, the difference was how well I thought through modularity, interfaces, tests, and acceptance criteria at every phase, instead of short-circuiting the process with poorly tested modules and porous interfaces. The more structured I was in determining how the work should be done and in following contract- and test-driven development, the greater the throughput I achieved with agentic IDEs. The more component-level, integration, and end-to-end tests I had, the faster it was to implement new features or find bugs. Taking shortcuts had the opposite effect. In one particular instance, I wasted almost half a day getting AI to fix some regressions. My experience made sense in light of the seven points I listed above.

To summarize, meticulously following good software development practices is essential to get high throughput with agentic IDEs. Some of those principles may matter less when you have a small human development team that can only produce a handful of changes a day, but not with agentic IDEs that can make dozens of changes in minutes.
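One way to picture contract- and test-driven development is to fix the interface and its acceptance check before any implementation exists. This is a minimal sketch: `TokenStore`, `InMemoryTokenStore`, and `check_token_store_contract` are hypothetical names invented for illustration, not code from the projects described above.

```python
from typing import Optional, Protocol


class TokenStore(Protocol):
    """Hypothetical module boundary, agreed on before any code is generated."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryTokenStore:
    """One possible implementation; an agent could generate or replace this."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._items.get(key)


def check_token_store_contract(store: TokenStore) -> None:
    """Acceptance criteria that every implementation must satisfy."""
    assert store.get("missing") is None   # unknown keys return None
    store.put("svc-a", "t1")
    assert store.get("svc-a") == "t1"     # stored values are readable
    store.put("svc-a", "t2")
    assert store.get("svc-a") == "t2"     # later writes overwrite earlier ones


check_token_store_contract(InMemoryTokenStore())
print("contract satisfied")
```

Because the check depends only on the interface, it can be rerun after every generated change, which is what makes regressions cheap to catch.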

Hypothesis 2: High development throughput depends on simple, automated, and well-integrated dev tools and processes.

When was the last time you carefully reviewed all the dev tools and processes and eliminated friction points? Probably never, or not for a long time. But the time is now. Here is why.

Development throughput is a function of the tools and processes employed. Consider these:

Dev tools: Source control, local dev setup tools, local builds and tests, artifact repositories, secrets management, CI/CD pipelines, test and integration environments, logging and monitoring systems, distributed tracing, docs and standards, task and bug tracking systems, ticketing and approval flows, internal developer portals, etc.

Dev processes: Architecture and system design procedures and reviews, planning rituals, sizing and estimation, task breakdown, checkpoints such as standups and status meetings, code reviews, change review processes, etc.

Your current development tools and processes, whether you like them or not, are perfectly aligned to yield your team’s current throughput. Furthermore, those tools and processes are a result of your company’s and team’s legacy, culture, technical debt, organization’s structural debt, business conditions, and leadership attitude. To add, most aspects of these tools and processes are evolutionary. Those are often a patchwork of what you inherited, what you learned, and what you instituted over time. Those are likely not designed from first principles to yield high throughput.

Just like the 100-teenager dev problem, agentic IDEs introduce a scaling problem. Your current tools and processes will become bottlenecks as these tools can design, reason about designs, write, test, and debug complex software problems in seconds or minutes rather than hours or days.

For example, in my experiments, I gave my agentic IDE seamless access to my source control system, CI/CD infrastructure, and cloud-based deployment environment. That smooth integration allowed me to let my agentic IDE review and commit code, review CI logs to catch issues, review CD steps, check deployment logs, perform health checks, and, in general, do everything I would do manually.

But if your development flow involves hand-offs across teams (like dev to QA to release), meetings, manual checkpoints, approvals, tickets, fragile test environments, logging and monitoring systems that are hard to access, long lead times from commit to production, etc., then your team’s throughput will be severely constrained. This is the time to carefully review tools and processes, eliminate bottlenecks, and prepare for a world where agentic IDEs, and not just people, are the ones iterating through every aspect of the SDLC. This step will involve moving cheeses and breaking norms, which brings me to my next hypothesis.

Hypothesis 3: Yet, most enterprises will struggle to get high developer throughput.

Here is my stark prediction. Getting agentic IDE licenses to everyone at work is the easy part, and most enterprises will struggle to yield consistently high throughput with those tools. Many will get occasional proof points, which I call “accidental wins”, when a team performs a task or project at 5x or 10x throughput. Yet, many will likely struggle to achieve sustainable throughput increases.

I am basing this hypothesis on the following factors.

First, most managers are usually too detached from the finer details of how work gets done within and across various teams to know how to (a) improve engineering hygiene and rigor to support my first hypothesis, and (b) rewire the team’s dev processes and tools to support my second hypothesis. They are usually busy with meetings, reviews, escalations, 1:1s, and all such operational tasks. Their memory of how good software should be built may also be dated. But leading the change with agentic IDEs requires that you know how code moves through your system of processes and tools.

Second, even when a manager is capable and willing, most organizations don’t incentivize improving hygiene and rigor on existing code bases. Say, your team’s test coverage is low, and you want to prioritize improving test coverage for a few sprints. Or you want to invest in refactoring the code to make it more modular. Or you want to invest in automating some long-pending manual workflow. Which product manager or senior leader will support you and make room to prioritize such hygiene factors over shipping a new feature?

Third, managers, due to the nature of the pressures they face, develop throughput-busting habits over time. For example, consider how you estimate the time required to complete a project. You will base it on prior experience and everything that went wrong in the past. You also make room for some attrition, leaves, dependencies, and so on. However, your estimates will influence and control the pace at which work gets done. To find out what it takes to yield high throughput, you should be willing to say yes to bigger ideas. But bigger projects come with risks and uncharted territory. Why bother and leave the predictable territory?

Under the right conditions and organizational factors, managers can be great enablers of increased team throughput by focusing on the factors I identified in my first and second hypotheses. But when those conditions are not right, managers will be the biggest bottlenecks for change.

As I wrote last month, agentic IDEs require you to shift your mindset from artisanal to industrial software crafting, which introduces a scaling problem. Your systems of SDLC that were originally set up for human developers to emit at most one or two merge requests per day will not allow dozens or more per day with agentic IDEs. It also introduces a leadership problem. As a manager, you may already be dealing with tightened headcount budgets, and trust levels in your team may be low. You may lack the conviction and confidence to stand before your team and proclaim that agentic IDEs are the solution.

Yet, the opportunity to learn and lead change is huge. You now have powerful tools that can turn each member of your team into a superhuman and do impactful work. For many in their 40s and older, this may be the last big change they witness in their careers. It is therefore imperative for managers to get their hands dirty, lead from the front, and figure out how to let their teams become superhumans at work.

What Counts

2026-01-04T21:55:12+05:30

Welcome to the new year. I spent my time traveling, coding, and reading during the slow period over the past two weeks. This is not the first time I have taken the time to reflect and write in long-form about what matters to me and what I believe (1, 2, 3). But this time, I’m sharing a few things without accompanying prose. These are what I want to count more, at this time.

(+) Learning to share, and earning to give away
(-) Sharing for the echo, and earning to impress

(+) Sharing space
(-) Elbowing

(+) Getting big things done
(-) Incrementality

(+) Taking time to reflect and refine, every day
(-) New Year’s resolutions

(+) Production
(-) Consumption

(+) Subtracting things
(-) Adding things

(+) Typing things, one character at a time
(-) AI-sloppery, sprayed in bulk

(+) Honest silence
(-) Platitudes

(+) Presence
(-) Rethinking the same things

(+) Personhood, playing a role on a canvas
(-) Selfhood, fixated at the center of the canvas

Thanks, Steve, for writing What Deserves My Attention Now, which motivated me to take a break from what I was doing this week.

Five More Leadership Lessons

2025-12-30T10:41:04+01:00

A year ago, I compiled twenty tiny leadership lessons. That article resonated with many. Here are five more based on what I observed and felt in 2025. I hope you find these ideas helpful.

Learn to deliver big things

Learning to bootstrap and deliver big projects is an essential leadership competency. Here is why. Most corporate tech projects are incremental. With such projects, you improve a thing or deliver a feature or two on a predictable cadence. Such projects may be necessary for continually delivering value on a foundation you have already built. However, incrementality will only take your product or tech stack so far. Such projects won’t give you a chance to make step-function improvements. Incrementality also breeds debt. You need the organizational capacity to handle transformational projects successfully. Taking on large projects also tests your abilities in multiple ways - from formulating a compelling narrative, to garnering support, to implementing the mechanics for a successful completion. Try to tackle at least one big thing every year.

But here is the challenge. More often than not, large projects struggle to finish, and the promised outcomes never materialize. Most of us have seen these at work: someone initiates a big platform project to unify two or three legacy systems, but they end up with one more unfinished system to manage. Or, someone plans a modernization project to decouple legacy systems, but the systems continue to rely on the legacy. The problem is usually the leader’s upfront underestimation of what is required to complete it, or their inability to prioritize the hard parts first. Such unfinished projects typically exacerbate the situation for everyone and add to the pile of debt.

As they say in climbing, getting to the top is optional, but getting down is mandatory. Learn to do both.

Don’t start with resources

One of the most common managerial bad habits is asking other managers for resources. Say, team A needs team B to do some work. Team A’s manager would develop a task list of requirements for Team B and approach Team B’s manager to allocate resources on Team A’s timelines. Team B’s manager would typically respond by explaining why the tasks or timelines are not feasible. Things get even more complicated when Team A depends on Team B, and Team B needs Team C to do something to accomplish what Team A wants. Imagine getting everyone to agree on what needs to be done and a timeline. Such dependencies usually lead to stalled projects or watered-down outcomes.

A better approach is to align on the objective or the vision. Imagine how Team B’s manager would react when Team A’s manager approaches them to talk about the business problem and seeks a partnership. That would motivate Team B’s manager to help out. Together, they might also discover some shared problem that both teams could benefit from solving. In a recent example, when someone similarly approached my team, we probed the issue and identified shared concerns that needed to be addressed, which led to a shared objective. The resource problem disappeared because both teams were willing to address it.

Share space

Earlier this year, I was reviewing some of the books I had purchased over the past few years but had never read. One of those books was Robert W. Keidel’s Seeing Organizational Patterns. The central thesis of the book is that autonomy, control, and cooperation are the three variables in any organizational design or role assignment, and that one must purposefully specify which to optimize. In an autonomy-favored design, each team would pursue its work with minimal dependency on others. In a control-favored design, the hierarchy forces alignment among teams, whereas in a cooperation-favored design, people must form working relationships to get anything done. The sentence I most liked from this book is: “Organizational design is a purposeful specification of relationships” among autonomy, control, and cooperation. These days, it is common to see organizational designs that balance autonomy and cooperation.

However, autonomy won’t work without cooperation. To exercise your autonomy, you must learn to share space with others. Let me give an example. In most companies I have worked at, friction between product management and software development teams is common. Why? Each would feel autonomous in certain decisions and would pursue them independently. The product team might independently develop a prioritized roadmap, expecting the development team to follow orders. Alternatively, the development team might independently plan to address technical issues that would disrupt the product team’s intended timelines. Each would become upset that the other made those decisions without consulting them. The result is damaged cooperation and a diluted focus on outcomes. A better alternative is to share the space - that is, exercising your autonomy in a collaborative setting. Most of us mistake autonomy for speed, but, contrary to this perception, pursuing autonomy within a collaborative setting is faster than pursuing it independently. Try it out. It will take you a long way.

There is more to technical strategy than writing it down

Search your company’s intranet - you might find several discarded or partially implemented strategy docs, 5-page one-pagers, or 10-page 6-pagers. Strategy is much more than writing it down.

What is strategy? Strategy is the process of driving change from point A to point B. It involves choosing what to do and what not to do. You articulate the strategy in terms of goals, choices, initiatives, and metrics. Then you go through the moves to implement the strategy, which is where the rubber meets the road. Business schools teach models for implementing strategy. While I won’t go into the details of those models, here are some questions to consider:

Did you get the hard facts right? The strategy can’t just be “we are going to build this and this.” Have you identified the hard facts to show that doing those things would lead to material benefits?
Do you have the credibility to implement the strategy? Do people trust you? Have you already earned the stripes?
Is there commitment behind the strategy? Do you know who the stakeholders are, and whether they agree with and are willing to defend the strategy’s objectives? The commitment needs to be deeper than lip service. You want people to be excited about the strategy.
Is the timing right? Are the conditions suitable for change? If not, should you wait?
Do you have the right alliances to implement the strategy? Who all needs to work on the implementation? What’s in it for them?
Do you have the right people to execute the strategy? Do they have the skills?
How do you propose to manage change? Strategy may require people to let go of existing ways of doing things, and you may face resistance to change. How do you propose to address such resistance?
Do you have the proper organizational structure to support the strategy?
How will people make decisions during strategy implementation? Who decides? Who provides inputs? Who is accountable for what?
How will you make tradeoffs and manage risks? What are the guiding principles?
Do you have quantitative or qualitative ways of measuring progress?
What processes are required to manage the implementation? Processes help catch blockers early and rebuild necessary alignment as things change.

Having answers to questions like this can make strategy implementation fun and rewarding. In their absence, strategies will languish as merely decorative slides or loftily produced documents.

Do less harm

This one is particularly important to me this year. We live in an interconnected world. Most of the choices we make at work or in our personal lives contribute to harm. We all understand direct harm, such as injuring someone or stealing something, and we don’t intentionally engage in such acts. Unfortunately, most of the harm we inflict in the contemporary world is structural and not individual. The fact that you live in a particular place and vote a certain way can make life less equitable for a class of people. A product you build at work can directly displace jobs for a group of people that you may not even know exist. Our shopping habits might have enabled a few oligarchs, who may then have funded activities you would disapprove of. That is the nature of an interconnected world, and structural harm is part of it. Jay Garfield, a renowned Buddhist scholar, elaborates on this “structural violence” in his Buddhism and Nonviolence in the Contemporary World. Read it, or better yet, listen to his lecture at the University of British Columbia in 2023 on the same topic. As he rightfully says in that lecture, “ethics is meant to be demanding.” Thinking about ethics should trouble us.

Furthermore, 2025 has been a surreal year. Hundreds of thousands more died, and millions were displaced due to war and genocide. We have developed sufficient inertia and political unwillingness to do anything about climate change. Bullying is now an acceptable form of public discourse. Most of the wealthy joined the bullies and narcissists to fortify their fortunes and economic interests. Systems of governance, policy, trade, healthcare, and welfare are being actively rewritten. It is now socially acceptable to belittle and bully the less privileged.

Unfortunately, we are not neutral observers watching the action between perpetrators and victims. We are all part of that action. You may be right to say that it is better than it has ever been historically, but that doesn’t change anything if you are on the receiving end. What do you do in such a situation? Engage. Speak up. Support things that minimize harm at work and in personal lives.

The Beginning of the End of Artisanal Software Crafting

2025-12-17T20:39:48-08:00

We are likely the last generation of developers to have an emotional attachment to the shape of our code. After letting AI modernize a 14-year-old codebase to make it faster and better at low cost, I realized that the era of artisanal software crafting is beginning to end; we are entering the age of industrial software production, and each of our cheeses is going to move.

The last time I coded fiercely was during 2011-12. Back then, REST and HTTP APIs were the rage. Node.js was young and rapidly gaining popularity and adoption. I just left Yahoo to join eBay. While at Yahoo, I witnessed the traction Yahoo Pipes got for easily integrating multiple public APIs and feeds into responses that had the data you needed. However, Yahoo Pipes was a hosted service, and developers could not use it to integrate with their companies’ APIs. Soon after joining eBay, I built a quick prototype of a SQL-based query engine in Node.js to easily aggregate eBay’s internal APIs. In that prototype, I modeled each API as a table and used SQL for selects, filters, joins, and related operations to massage response data. It was lightweight and fast.
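To make the "each API is a table" idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not ql.io's grammar or engine, and it makes no HTTP calls. The two "API responses", their field names, and the aggregate function are hypothetical stand-ins, hard-coded for illustration.

```python
import sqlite3

# Hypothetical, hard-coded stand-ins for the JSON two internal APIs might return.
ITEMS_API = [
    {"item_id": 1, "title": "Camera", "seller_id": 10},
    {"item_id": 2, "title": "Tripod", "seller_id": 11},
    {"item_id": 3, "title": "Lens", "seller_id": 10},
]
SELLERS_API = [
    {"seller_id": 10, "name": "alice", "rating": 4.9},
    {"seller_id": 11, "name": "bob", "rating": 3.2},
]


def aggregate(min_rating: float):
    """Load each API response into its own table, then join and filter with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (item_id INTEGER, title TEXT, seller_id INTEGER)")
    conn.execute("CREATE TABLE sellers (seller_id INTEGER, name TEXT, rating REAL)")
    conn.executemany("INSERT INTO items VALUES (:item_id, :title, :seller_id)", ITEMS_API)
    conn.executemany("INSERT INTO sellers VALUES (:seller_id, :name, :rating)", SELLERS_API)
    rows = conn.execute(
        "SELECT i.title, s.name FROM items i "
        "JOIN sellers s ON i.seller_id = s.seller_id "
        "WHERE s.rating >= ? ORDER BY i.item_id",
        (min_rating,),
    ).fetchall()
    conn.close()
    return rows
```

Calling aggregate(4.0) returns only the items whose seller clears the rating filter, shaped to the two columns the caller asked for. The caller declares what data it wants and leaves the joining to the engine.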

I called it ql.io. It was like GraphQL before GraphQL was introduced in 2015. While GraphQL modeled data as a graph, ql.io modeled data sources as tables. A few leaders at eBay saw my demos and encouraged me to keep going, and I went into a mad coding spree. I wrote a more elaborate query grammar, used an open-source parser generator to compile SQL-like scripts into an execution tree, built an engine to orchestrate HTTP requests, and created a graphical user interface with syntax highlighting to learn the syntax and quickly try it out. A few more engineers joined me, and we continued building it. People who saw the demos loved it, and we open-sourced it in November 2011. However, by late spring of 2012, despite receiving nearly 1,000 stars on GitHub, the project struggled to gain adoption both within eBay and beyond. We decided to abandon the project and pursue other areas. I have not touched the code since then, and it rotted badly.

Then, two Saturdays ago, I downloaded Kiro (it could have been Cursor or another tool, but Kiro had free credits to get me started), forked ql.io, and pointed Kiro at the codebase. I provided some context about the project and asked Kiro to analyze the codebase. In about a minute, it gave me a detailed analysis of the good, the bad, and the ugly, highlighting long-broken dependencies, language changes, callback hell, long-retired test frameworks, vulnerabilities, and more. I then asked Kiro to propose a detailed plan to modernize it. I set constraints, such as not breaking module-level interfaces and ensuring the test suite remains functional at every step. It gave me a 10-step project plan with detailed steps. I asked some probing questions to modify the plan.

We (my dev servant and I) then got to work. Over the following days, we modernized the codebase to work with the latest version of Node.js, adopted modern language features such as promises to address callback hell, upgraded all tests, and rewrote the examples. Through this process, I got Kiro to add more tests to address some lightly tested areas, fix all the vulnerabilities, write a performance test suite, set up test gates on pull requests, and even generate infrastructure code to deploy ql.io on AWS using a serverless architecture, all without touching any code artifacts myself. Mind you, I had to pull Kiro out of rabbit holes, challenge its proposals, and force it to backtrack on its changes multiple times. But in the end, it turned out okay. Kiro made thousands of edits to the code, and I did not even bother to review the changes, other than asking Kiro to play the role of an annoying code reviewer and account for the feedback. I leaned on tests to ensure the correctness of what Kiro did.

In case you’re curious, see the fork and the AI-generated summary of modernization. The codebase is AI-generated, and it does what it was meant to do.

THIS IS SCARY AND EXCITING.

I believe we are at the very beginning of the end of artisanal software crafting, and we are about to enter into industrialized software building.

How we practice software today is artisanal. In this mode, we carefully choose languages and tools to craft code and develop an attachment to them. We have opinions about its shape and structure, and we have ideas about its beauty and elegance. We build religions around languages and code. Then we set up tools and processes to address the complexity of building software in teams. We create roles such as architects, designers, test engineers, performance engineers, infrastructure engineers, and site reliability engineers to tackle different aspects of building and running software. We do all this because building software has been time-consuming, and we spend our working hours playing these roles and following the processes. The artisanal way helps us cope with that complexity. But we may be the last generation to practice this craft in this way.

When I wrote ql.io, I learned a lot. I developed better mental models of what to build and how to build. You could wake me up at night, and I could explain the design and structure. I could apply the lessons learned in the years that followed.

But this rewrite was different. I had Kiro play the roles of a project manager, an architect, a Node.js developer, a performance test engineer, an infra engineer, and a code reviewer. I felt like I was the master, a jack-of-all-trades focusing more on the “what” and less on the “how.” Yet, other than becoming effective at guiding Kiro, I didn’t learn much. But I cleared 14 years of code rot in a few days. The rewrite didn’t take multiple people and months. It was cheap. I could throw it away and do it again in a few days. This is industrial software production. Software thus becomes fungible, like that trinket you buy on Temu.

While I continue to stand by my skepticism, given that AI hallucinates and has jumbled representations of concepts, AI-native development environments have gotten so much better that the experience is almost the same as working with an imperfect colleague. After all, we are all imperfect in some way.

In the coming months and years, we will continue to industrialize software production. The cost of producing software will likely plummet. We will likely enter an era of cheaply constructed software abundance. Our relationship with code is about to change in ways we didn’t expect even a couple of years ago.

Will we continue with the belief that “programs must be written for people to read and only incidentally for machines to execute” (this was from the first edition of Structure and Interpretation of Computer Programs)? Does it matter if people no longer understand the code as long as machines (including LLMs) can read, create, and manipulate the code? Likely not, as software production gets industrialized.

As we go through this shift, each of our cheeses will move—some at a crawl, others at a sprint. Our current organizational structures were designed for a world where people handcraft code at a specific, manageable pace. But when production speed increases by an order of magnitude, “artisanal” habits become bottlenecks.

It will soon be impossible for teams to justify spending weeks on tasks that an “agent whisperer” (I am borrowing this phrase from a friend of mine, Anand, who helped me clarify my thoughts for this article) can orchestrate in a few hours. It can get very, very difficult for individuals and teams to justify why they do things in a certain way. It can be difficult for specific tools, internal platforms, functions, and roles to justify their existence. Brace yourself for the change.

If you are an engineer: Relinquish the glamor of being an artisan. Stop measuring your value by the code you write and start measuring it by your impact. Become an “agent whisperer.” You define the “what” and set the boundaries of correctness to meet that “what.” In an industrial setting, the generalist who can steer the machine is king.

If you are a manager: Get hands-on to learn what’s possible, and then figure out how to lead change to produce some intentional productivity wins. Don’t just hand out AI licenses and hope for the best. Plunge into the discomfort. Consider your team a system and determine what to dismantle to serve a high-throughput world.

The Future is Here, If You Are Ready

2025-12-04T20:33:48-08:00

These are exciting times to be a hands-on software developer. A decade ago, the excitement was about learning and adopting cloud services. Although everyone benefited, not everyone had the opportunity to be on the front lines adopting the cloud, because most of the innovation and change happened at lower levels of the software stack. It is different this time. Innovation and excitement have now landed in development environments like Cursor, Kiro, Copilot, Antigravity, and Claude Code. For the first time, you can experience intent/spec-driven software development: you can declare your intent to your AI-native IDE, and, based on its understanding of the codebase, let it generate code and orchestrate developer workflows. These development environments are now active collaborators.

But there is a catch. The underlying tech hallucinates, lacks conceptual understanding, and is incapable of consistently creating correct, maintainable, and secure code. Unless you get the basics right, we’re looking at AI slop creeping into our codebases and many slow-moving organizations being left behind by this wave.

Here is what contemporary research says.

First, LLMs hallucinate. Hallucinations are not bugs; they are inevitable, and scaling laws are not the answer to eliminating hallucinations. Though OpenAI disputes the claim that hallucinations are unavoidable, research argues otherwise. For example, a widely cited paper, LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples, points out that hallucinations are a property of LLMs and are merely examples we don’t like. Another paper this year, LLMs Will Always Hallucinate, and We Need to Live with This, challenges the notion that we can fully mitigate hallucinations. I also found papers showing that neither parameter scaling (larger models) nor inference scaling (longer execution) reduces hallucinations.

Second, LLMs lack conceptual understanding. LLMs develop internal representations with semantic value, but they don’t understand concepts in the way we do. There are also examples to show that LLMs’ internal representations are a jumble of contradictions. See, for instance, Potemkin Understanding in Large Language Models.

Third, though there are techniques to minimize hallucinations, there are limits. For example, Chain-of-thought (CoT) prompting can reduce hallucinations by instructing an LLM to decompose the problem. However, research shows that such prompting can also make hallucinations appear more coherent and plausible. In other words, CoT can make hallucinations more credible. Retrieval-augmented generation (RAG) is a practical option to reduce hallucinations, and yet it does not eliminate hallucinations. For example, the paper Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review gives several reasons why RAG won’t eliminate hallucinations.

In the context of software development, there is ample research showing security and maintainability issues. For example, Veracode’s GenAI Code Security Report found that about 45% of code-generation tasks contained security flaws. They also confirm that larger models don’t perform significantly better than smaller models. Related research shows that more code-generation iterations don’t improve the situation. With respect to maintainability, there is evidence showing increased code smells, redundant dead code, etc., that contribute to long-term debt. See, for example, a recent Sonarsource report. Though I’m citing two reports from companies with commercial interests in the problem space, search https://arxiv.org to find ample research papers with examples.

Jonathan Ostroff, a York University professor, summarizes the impact of these limitations very succinctly:

Beyond tests, my experience with LLMs is that they have no concept of software requirements, architecture, and design.

* What is the structure of the system?

* What are the software modules?

* What is the relationship between them?

* Information hiding: What are the externally visible properties of modules at the interface, i.e., the externally visible properties that other modules using it rely upon?

Where does this leave us? It means two things: first, we can’t rely on LLMs alone to ensure correctness (including quality, robustness, and security) and maintainability (testability, adherence to the coding practices prevalent in your organization, code smells, etc.) of the code they generate. LLMs are too unreliable to guarantee correctness and maintainability. There is a mistaken belief among some developers that you need to find the next best model and get the prompt right to produce correct, maintainable code. That’s not the case. You can’t prompt your way out of this.

But this problem is not new in software development. We have been building reliable systems out of unreliable parts. What you need is closed-loop thinking to yield reliable outcomes, which, in the case of code generation, include code correctness and maintainability. In a closed-loop system, you have an idea of a desired output, and you give an input to the system that you hope will give you the desired output. You then measure the actual output, detect the difference between the actual and desired outputs, and adjust the input to reduce that difference.
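The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular IDE works: closed_loop, fake_generate, fake_measure, and SPEC are hypothetical names, the "generator" is a hard-coded stand-in for an LLM call, and the "measurement" is a single toy test.

```python
from typing import Callable, List, Tuple

SPEC = "Write total(xs) that returns the sum of a list of numbers."


def closed_loop(
    generate: Callable[[str], str],
    measure: Callable[[str], List[str]],
    spec: str,
    max_iterations: int = 5,
) -> Tuple[str, List[str], int]:
    """Generate, measure, and feed the measured difference back into the input."""
    prompt = spec
    output: str = ""
    failures: List[str] = []
    for iteration in range(1, max_iterations + 1):
        output = generate(prompt)      # give the system an input
        failures = measure(output)     # compare actual output with desired output
        if not failures:               # no difference left, so stop
            return output, failures, iteration
        # Adjust the input using the detected difference.
        prompt = spec + "\nFix these failures:\n" + "\n".join(failures)
    return output, failures, max_iterations


def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM: handles the empty list only once told about it.
    if "empty input" in prompt:
        return "def total(xs): return sum(xs) if xs else 0"
    return "def total(xs): return sum(xs) / len(xs) * len(xs)"


def fake_measure(code: str) -> List[str]:
    # Stand-in for a test suite: run the generated code against one expectation.
    namespace: dict = {}
    exec(code, namespace)
    try:
        if namespace["total"]([]) != 0:
            return ["empty input should return 0"]
    except ZeroDivisionError:
        return ["empty input raises ZeroDivisionError"]
    return []
```

The reliability here comes from the measurement step and not from the generator. Swap in a real model and a real test suite, and the structure stays the same, which is why the quality of the tests bounds the quality of the outcome.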

I’m really excited to see that AI-native IDEs like Cursor, Kiro, Antigravity, Copilot, and others are fast approaching a point where you can begin building correct, maintainable code in a closed-loop setting. It’s time people let go of their classic IDEs. This new genre of AI-native IDEs will surely improve rapidly, helping developers enhance code correctness and maintainability.

We need to get three things right to get the most value out of these tools: (1) well-written specs that encapsulate not only your intent, but also the practices you want the output to follow, and your correctness expectations; (2) comprehensive tests that can validate correctness of your code, and (3) the CI/CD machinery to get fast iterations.

Most folks I talk to seem to focus more on picking the right AI-native IDE or the next best model, but the focus should be on designing the specs, investing in tests, and building a solid CI/CD feedback loop that enables rapid iteration. In fact, the next generation of CI/CD systems may look different to enable lightning-fast iterations.

Those organizations that already have a culture of comprehensive testing and decent CI/CD systems have a significant advantage now. They can leapfrog others in terms of developer productivity and rapid reductions in time-to-value. But those with complicated release processes, manual handoffs, difficult-to-test code bases, broken dev/test environments, and painful-to-work-with CI/CD systems will not be able to AI their way out of their pain. That’s the stark reality.

Leading in Low-Trust Times

2025-11-23T09:57:51-08:00

The tech world has been experiencing a low-trust climate in recent years. Cynicism is in the air. Widespread and repeated layoffs, the mechanical and impersonal handling of layoffs, delayering, shrinking team sizes, disappearing backfills, tightening budgets, etc., have created a sense of scarcity and mistrust among the rank and file of many tech companies. In such a climate, it is delusional to think that employees’ trust in their leadership and engagement with it have stayed the same as before.

The issue of trust has come up multiple times in my recent conversations with my tech friends. They all shared the same view: the tech industry enjoyed a relatively high level of trust until around the time of the pandemic, but that is no longer the case. One friend even said that having people report to you is now a liability - though it sounds like an extreme case, it might be true in some companies. More and more, people don’t seem to believe their leadership’s strategy and decisions. Additionally, there is fear and uncertainty about getting caught up in the next round of layoffs. Would you trust your leadership chain in such a situation?

Low-trust situations at work have consequences. Ignoring such situations only prolongs the damage. In a low-trust climate, your team would be less willing to follow your lead to support your vision and strategy. They might see your inspirational pep talks about why things are great as delusional or tone-deaf. They would be less enthusiastic or even apathetic to the purpose of your team - why bother when things seem to be falling apart, or you could be out in the next round? Given their fear of uncertainty, they would be less likely to be collaborative and more likely to fall back on poor organizational citizenship behaviors, such as self-benefit and preservation. Organizational citizenship is a psychological concept that refers to voluntary behaviors that go beyond formal job requirements to improve team performance, reduce friction, and generally foster a positive culture. Low-trust situations eventually create a vortex of cynicism. Engagement suffers, and good people will leave the team. This doesn’t sound very optimistic, but we are psychological beings that rely on reason to justify emotions and instincts.

It is not my place to judge decisions in the tech industry. These companies have experienced leadership and boards, and they do what they believe is necessary to pursue their economic interests and fulfill their fiduciary duties. You and I don’t have to agree. Companies and industries undergo periods of change for various reasons. Over twenty-five years in the tech industry have taught me that change is constant, and we experience cycles of things we like and things we don’t.

So, how would you lead when your teams don’t trust you?

A common response is to stay silent and hide behind prepared remarks. You might think, after all, that you are not the one who caused the low trust situation, so you would let “them,” the executive leadership, the board, the investors, or some similar group, figure it out and fix it. Or you might act like a victim, vent to your team and peers, and blame the “people upstairs.” Your team may sympathize with you and see you as a buddy on their side, but such behavior won’t help you lead or be an effective manager. These passive leadership behaviors lack humility, courage, and authenticity and keep fueling cynicism.

The solution to this challenge isn’t complicated. The best way to handle such situations is to stay authentic. Authenticity is the antidote to mistrust and cynicism. Here are some techniques to consider:

Begin by listening and acknowledging the situation. Create a climate where people feel comfortable telling you how they are feeling. You may not have answers, but people appreciate it when their managers listen and acknowledge how they feel. When I faced low-trust situations, I resorted to listening tours and unscripted Q&A sessions.
Be open, honest, and straightforward. Don’t hide behind scripted statements. Transparency, even when the reasons for a low-trust situation may not be popular, helps repair trust. Tell your teams like it is. Transparency is a hallmark of authentic leadership.
Only then, provide or co-develop an objective direction. When possible, provide a framework and invite the teams to participate in that process to address the challenges they face. Create conditions for people to collaborate. Collaborative problem solving improves organizational citizenship behaviors. In a collaborative problem-solving setting, attention shifts from scarcity to abundance.

Low-trust situations are not uncommon. Companies go through phases we sometimes don’t like. Leading through such situations calls for authentic leadership. Rekindling trust may be hard and time-consuming, but not complicated.

One last word: the sky is not falling. The tech industry is still one of the few careers where we get paid to learn every day. Over the last thirty-plus years, this industry has created opportunities for many. Most of us in this industry have reinvented ourselves multiple times. The industry is not going anywhere. The opportunities to learn and reinvent yourself are not going to disappear tomorrow.

Tangled Mess

2025-10-26T09:11:42-07:00

Once upon a time, our software was simple. All it took was a database, a user interface, and some glue code between them, running behind a web server farm, to go online. Most of the business-critical data stayed in the database. People rented space in colocation centers and bought servers to run their databases and code. Those colocation centers were not reliable, and your business could go down due to power or cooling failures. The servers were also not reliable, and you could lose data due to disk failures. A nasty cable damage somewhere, and the network could be out for hours or days. To get around such booboos, IT teams rented space in secondary colocation centers, bought servers and expensive replication systems to replicate data to those secondary sites, and took steps to bring up secondary sites in the event of disasters.

Businesses were worried about doing business with other companies whose infrastructure failures could hurt them. So legal people got involved to add contractual language to their agreements. Those clauses required their vendors to implement disaster recovery policies. In some cases, those contracts also required declaring procedures for recovering critical systems at the secondary site, testing them once or twice a year, and sharing the results. Since breaches of such contracts could lead to termination of business agreements, companies hired compliance professionals to draft rules for their tech teams to follow. Tech teams begrudgingly followed those rules to ensure their backups and replication were working and that they could bring up the most critical systems in the secondary colocation centers. Paying such a disaster recovery tax made sense because the infrastructure was unreliable.

It was okay, sort of, until the cloud happened. Cloud providers started offering a variety of infrastructure services to build software. Open source gave us even more flexibility to introduce new software architecture patterns. As engineers, we loved that choice and started binging on it, adopting whatever cloud companies began offering. Of course, this choice gave us a tremendous speed advantage without requiring years of engineering investments to build complex distributed systems. Those services also began offering higher fault tolerance and availability than what most enterprises could fathom. Moving systems from on-prem to the cloud became necessary to take advantage of that flexibility. Many people have built their careers around cloud transformation, and that is still ongoing.

For a while, cloud services were flaky. It fell to customers to keep their businesses running even when a cloud provider’s systems were down. It seemed sensible to work around those problems and build redundancy in a second cloud region. Many people tried that pattern. Some, like Netflix, succeeded, or at least are known to have succeeded; I don’t know if that is still the case today. Many had partial success to the extent of getting some stateless parts of the business running from multiple cloud regions.

Around the same time, the SaaS industry took off. The proliferation of online systems has increased complexity and fueled the hunger for automation across enterprises. This created opportunities for SaaS-based companies to fill that gap and offer a variety of services, from infrastructure to customer service to finance to sales and marketing. Relying on third-party SaaS became a necessity for every enterprise. You can no longer take code to production without depending on a subscription or pay-as-you-go service from another company. The net result of this flexibility and abundance is that almost everything is now interconnected.

We are now a tangled mess. There are no more independent systems. Our systems share fate with large swaths of the Internet. Almost every business now depends on other companies, mostly consuming their services. Thus, all bets on building redundant secondary sites are off. You not only have to get your part right, but you also need all your dependencies to do the same to be fully redundant. Most companies can’t even make their software redundant across multiple locations due to the variety of services they are building, their interconnectedness, and the types of infrastructure needs. After all, building highly available and fault-tolerant systems requires more discipline, talent, and time than most enterprises can afford. Let’s not kid ourselves.

Where does this leave us?

First, get unstuck from the old paradigm of redundancy in secondary sites. That is over-simplified thinking. It no longer makes sense for most companies to waste their precious resources on building redundancy across multiple cloud regions. Yes, cloud providers will fail from time to time, as last week’s AWS us-east-1 outage showed. Yet they are still incentivized to invest billions of dollars and time in the resilience of their infrastructure, services, and processes. As for yourself, instead of focusing on redundancy, invest in learning to use those cloud services correctly. These days, most cloud services offer knobs (like automated backups and database failovers) to help their customers survive disasters such as availability zone failures. Know what those knobs are, and use them meticulously.

Second, if you truly need and care about five or more nines of absolute (i.e., not brown-out) availability for your business, make sure your business can afford the cost. To achieve such availability, you have to do several things right. You need the right talent who understands how to build highly available, fault-tolerant systems. In most cases, you have to develop that talent in-house because such talent is rare. Then you need to standardize patterns like cells, global caches, replication systems, and eventual consistency for every critical piece of code you create. You will need to invest in paved paths to make it easy to follow those patterns. Implementing those patterns takes time, and you need to get them right. Most importantly, you also need a disciplined engineering culture that prioritizes high availability and fault tolerance in every decision. Your culture needs to embrace constrained choices and sacrifice flexibility in favor of high availability and fault tolerance.
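To put numbers on that cost, here is a back-of-the-envelope sketch in Python. It rests on two simplifying assumptions that rarely hold exactly in practice: failures are independent, and every dependency is a hard, serial one (if it is down, you are down). The function names and the example figures are illustrative, not measurements of any real service.

```python
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Five nines -> 0.99999, four nines -> 0.9999, and so on."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every link must be up (independence assumed)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Your own service at five nines, sitting on three hard dependencies at four nines each.
five_nines_budget = downtime_minutes_per_year(availability_from_nines(5))
combined = serial_availability(
    [availability_from_nines(5)] + [availability_from_nines(4)] * 3
)
combined_budget = downtime_minutes_per_year(combined)
```

Under these assumptions, five nines leaves a budget of about 5.3 minutes of downtime a year, while the combined chain comes out at roughly 163 minutes a year. Your own part can be flawless and the dependencies still set the ceiling.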

Third, somehow cajole your compliance and legal people to refine or avoid the dangerous parts like “secondary sites.” Unless your company’s architecture is stuck in time, like 20 years ago, such language no longer makes sense. Refining such language can be easier said than done since some contracts are dated, and fixing that may be the least important priority for your legal teams. Don’t get me wrong. You still need to invest in game days and similar failure-embracing exercises to build resilience in your culture. But how you practice resilience needs to evolve.

We are indeed in a tangled mess. Resilience is costly. Know what you need, figure out the business cost, and do what you can.

Rock and a Hard Place

2025-08-04T13:09:13-07:00

It’s not worth rehashing that the tech landscape is undergoing a tectonic shift. Thirty minutes on LinkedIn can feel like being caught in a stampede. While some find it exciting and share their enthusiasm openly, I bet many managers are confused, exhausted, and burned out privately. Most are probably leading teams unprepared to adopt AI, unsure how to get involved, understaffed to meet their current workload, and uncertain about how far tightening budgets will go. It’s a good case of feeling stuck between a rock and a hard place.

On the one hand, there is an AI-led technology change happening at a rapid pace. There are new ways of building software that were not possible until now. However, finding strategic, credible guidance is hard, and developing an independent point of view is time-consuming. The signal-to-noise ratio is too low for most people to discern potential from hype. Further, AI investments are still early, and their returns have not yet been fully proven or justified. Where do you even start? If you and your teams aren’t already using AI or developing AI agents, you might be worried about falling behind. It is natural to feel concerned about staying relevant in this change.

On the other hand, companies have been cutting back on headcount. The era of large tech teams with many layers of managers appears to be over. Will we go back to the highs of 2022 anytime soon? Likely not. I came across a WSJ article a few days ago that shows what most of us in the tech industry already knew: tech jobs peaked sometime during the first half of 2022, followed by a steady decline since the middle of 2023. The percentage of manager jobs has been declining even more rapidly. Simultaneously, there is an implicit call to most managers to leverage AI tools to show productivity gains. But, given the nature of software engineering jobs and various company cultures, situations, legacy, and technical debt, showing productivity gains with any AI-based productivity tool is easier said than done.

Amidst these constraints, it is not unreasonable to feel confused, exhausted, or burned out. What do you do about it? I’ve spent almost three quarters exploring this topic, and here are my suggestions to managers.

Begin by giving up the notion that these constraints are unfair or that you are immune. Business models, economic factors, and business leadership create such constraints now and then. These constraints create opportunities for some but may disadvantage many. It is fair to want to understand what is driving such constraints, but there is no point in feeling like a victim or debating the fairness of such constraints. Instead, I recommend focusing on adapting your leadership style and your teams to the changing situation. But what does this mean in practice? For example, in steady state, your default management style may involve an infinite loop through work intake, execution, and delivery. Such steps allow you to maintain stability. That style won’t work when you want to adapt yourself and your teams for change. To prepare for change, you ought to focus on finding blockers and enablers for change, inspiring the team, rallying them to embrace change, and providing adequate air cover for change.

Second, tone down the common managerial instinct of treating your org size as a proxy for your impact. The tech sector enjoyed a period of getting headcount to generate new business value. It’s excellent if your org size grew because of that, but I suspect that any such growth would be highly restrained for a while. Instead, focus on creating and showing the impact of the headcount you already have. What matters more is the outsized economic value you can generate without putting your teams through death marches.

Third, most importantly, create space for your team to adapt. Instead of waiting for new headcount that is unlikely to materialize, figure out what is keeping your teams busy and develop a strategy to reduce the workload so that they can invest that time in learning new things and experimenting. Here is a process that might work.

Start by thoroughly diagnosing what is keeping your teams busy. Such a diagnosis will help you improve your understanding of the nature of your team’s work, how they receive new work, current processes, prevalent culture, etc.
Determine how much time your teams spend maintaining the current systems instead of on new projects that create new value. You may notice technical (like technical debt, quality gaps, and fragile systems in production), process (poor planning, unclear expectations, too many handoffs, poor documentation, etc.), or team/culture-related challenges (unproductive assumptions, apathy, etc.) behind the busyness of your teams.
Use such a diagnosis to also reflect on your leadership behaviors and mistakes, such as poor delegation, confusing organizational design, slow decision-making, and lack of role/goal clarity.
Create a roadmap to address the findings and simplify your team’s work. Make sure to incorporate people (training, experimentation), processes (planning, communication, etc.), structures (like org design, dependencies, etc.), along with tech in your roadmap. Drive a sense of urgency and inspire the team on why it matters. Get involved to unblock your teams. Make organizational and process changes to simplify the way work is done.
Along the way, identify opportunities to leverage AI for repetitive, mundane, or grunt work.
As you see results that create space for your team, invest that time to build an objective AI strategy that is aligned with your business objectives.

In other words, there is no magic fix for dealing with these constraints. Steps like the ones above are just part of managing a team. Sometimes you work to bring stability to the team, but other times, you thoughtfully disrupt the current ways of work to drive change. Employ the latter style to create efficiency, and use the savings to invest for the future.

Hard Things First

2025-07-12T21:16:20-07:00

Excellence is unlikely to happen if you don’t address the hard problems early in project execution. Habits for cultivating excellence won’t form unless you shed light on hard problems from the start. Excellence requires practice, and you need the team to work through the strenuous parts to build endurance. Psychologically speaking, teams build confidence in themselves when they solve the hard parts early. Self-efficacy enhances a team’s morale and attitude, creating a positive feedback loop that fosters high performance. Simply put, when you lead a team to confront challenging aspects, the team will cultivate the right habits for excellence, and their efficacy levels increase, ultimately leading to improved team performance.

Many project ideas fail to reach their full potential for excellence because people either overlook the challenging aspects or assume that those issues will be addressed in some unspecified future. Unfortunately, hard problems rarely become easy over time unless you act. When you or your team formulates a new idea or goal, most people get excited about starting something new. There is hope of achieving remarkable success, driving value for your customers or stakeholders, and thereby reaping personal benefits, such as recognition or career growth. As people work on those ideas, more often than not, reality sets in, excitement fades, and the team’s performance gradually settles into mediocrity.

Even when you build a minimum viable product (MVP) and want to incrementally improve it, unless you make principled choices early and create the proper infrastructure to practice incremental improvements, you may struggle to enhance your product beyond its MVP.

Often, the hard things may be technical, specific to the systems you work with. Let me give you an example from my personal experience to illustrate the point. At one point, in an enterprise-level transformation project, addressing a massive, yet critical, legacy piece of software was the hard problem. In the first incarnation of that transformation project, the project founders didn’t diagnose the current state to identify the hard problems. They started the project and asked the people to run. The project picked up pace, but folks began solving smaller and more convenient aspects of the project. The hard problem remained. Some people became skeptical and didn’t believe in the project’s success because they knew the hard parts, but they were not forthcoming or influential in changing the course. Several months later, the project had to be paused and restructured to address the hard problem and reignite the fire in the project.

However, sometimes the hard problem may be the organization’s readiness or the necessary alignment required for the project to be successful. In one particular case, I had to delay the tactical execution of a project since the team was not ready. The hard problem was finding the right leader and then fixing the team structure. In a more recent situation, there was a struggle to integrate AI into a team’s ways of working. But the hard part was not AI — it involved (a) teams not having the time due to the current busyness, (b) some debt that was contributing to their busyness, and (c) not knowing where to employ and benefit from AI. The team began leveraging AI once we started addressing those hard parts.

I hope I persuaded you of the benefits of tackling the hard problems early. But how do you get going?

First, start with a thorough diagnosis of the situation to identify the hard problems. Diagnosis is one of my favorite leadership tools. Most leaders and teams are persuaded and sometimes misled by the words and phrases used to describe outcomes, and fail to take the time to question why success is guaranteed. It’s no wonder projects fail or never rise above mediocrity. Diagnosis helps identify potential blockers for success. It can help reveal organizational, technical, and cultural obstacles.

Second, you must have the backbone to acknowledge the hard parts and the influence to get others to focus on them. Acknowledging the hard parts privately is one thing, but influencing others to recognize them takes work. People may ignore you and mistake you for a naysayer who does not believe in the leadership or the organization. This brings me to my third point.

Third, practice surfacing and solving hard problems. Start small and gradually build a track record and credibility. Work hard at building the track record early in your career. Recognizing, acknowledging, and solving hard problems is a hallmark of leadership excellence. Such excellence, too, requires practice over time.

Frame for Success

2025-07-02T23:12:14-07:00

Language is a loose construct for exchanging ideas, though it is the best tool we have. We try to distill a complex mix of difficult-to-explain mental phenomena and express them through words, hoping that those receiving the message can recreate the same mental images in their minds. But that’s an impossible expectation because that’s not how our minds work. In reality, people on the receiving end interpret those words using their own unique mental processes to assign meaning. Communication is such a difficult thing.

The same is true for problem-solving. Each of us sees the same problem differently, based on our individual experiences, skills, attitudes, and motivations. Essentially, the sender and receiver use their flawed transcodings.

So what? Any big goal you present to a large group can be interpreted in many ways. Individual and team skills, attitudes toward each other and the organization, and personal motivations all influence this process. They cause people to use different mental models to understand the problem and figure out how to solve it. That’s why, in most organizations, as goals are passed down to teams and then to individuals, people develop their own interpretations and come up with creative ways to connect what they believe matters to the organizational goal.

The result: Scattered focus, where everyone’s attention is spread across multiple subgoals, tasks, and outcomes, leading to poor team performance and subpar outcomes.

It is the leader’s job to create focus, helping most of the organization see the same problem in the same way, so that the team employs a cohesive set of techniques to solve it. That’s the task of framing the problem. In my previous article, I discussed the importance of diagnosing before delegating. Framing is another essential step to incorporate before delegating.

Poor framing often leads to poor performance. Here’s an example: once, a team achieved outstanding results in a year, which was impressive. The following year, their performance was sloppy, and they failed to make a significant impact. Why? It was the same team and similar conditions. What changed? An analysis revealed an issue with how the problem was framed. The problem was too vague; people interpreted it differently, and they didn’t focus on finding and fixing the root causes. In other words, the team’s focus was scattered, resulting in poor performance.

In another instance, several smart individuals came together to address a few problems. They all used the right words to describe the situation. They wrote documents outlining what should be done. They formed working groups, held multiple meetings, and provided status updates. Nine months later, nothing had come of it. Why? Their problem statements at the beginning were vague — they sounded correct but failed to create focus, leading to wasted time and missed opportunities.

Such situations taught me that high performance is neither a constant nor a fixed attribute of any team. Just because you bring the best people together to solve a problem doesn’t mean the group will deliver a great outcome. You need to focus on framing the problem in a way that helps the team perform.

What is framing the problem? Let me offer three characteristics of well-framed problems.

First, framing a problem is like taking a picture with a narrow depth of field, which brings the subject into sharp focus while blurring the rest of the scene. It involves emphasizing specific aspects of the problem and bringing them into focus while diverting attention from non-essential parts. A well-framed problem leaves less room for misinterpretation. It is intentionally specific about what matters. A well-framed problem fosters alignment among different teams and stakeholders, as everyone understands what truly matters.

Second, due to its narrow depth of field, a well-framed problem is not open-ended but constrained. People involved in the project would immediately understand what not to prioritize. When the constraints are explicit, everyone knows which problem to address first and what to tackle next. Unconstrained problems, on the other hand, can lead to scattered efforts and potential misalignment among the various participants working on the problem.

Third, a well-framed problem also includes conflicting key results. Let me give you an example. One particular problem involved replicating a large amount of data with a high consistency guarantee but also requiring a certain failover time budget to switch between the primary and the replica. Solving for one or the other is easier than solving for both. This conflict spurred creativity. Initially, the team found it uncomfortable to tackle this conflict, but they eventually found a solution. I used this technique a few other times, and every time, we found creative approaches to handle the conflict. In his book, Unreasonable Hospitality, Will Guidara shares similar examples of this technique.

To reiterate, don’t assume everyone gets it just because you said so. Work on framing the problem to create focus, add necessary constraints and conflicts before delegating the problem. Do this early in any project execution. Such a process can help increase focus and alignment, as well as spur creativity.

Diagnose Before You Delegate

2025-06-14T23:48:53-07:00

Diagnosis is often an overlooked step in goal setting, project planning, and execution. I have come across many projects where a senior leader proclaims a worthy goal, some get excited, and most nod — after all, the goal sounds right. The leader then delegates the goal down the hierarchy, and six months later, little progress is made. Eventually, the following year’s planning cycle arrives, and old goals are replaced with new ones, repeating the drill. It’s a symptom of a slowly decaying organization with low performance and little accountability.

A more common occurrence is paltry outcomes: goals are well-intended, sound right, are delegated appropriately, people start with excitement, but then encounter execution hurdles; timelines slip, results don’t materialize, people lose faith in the goals, and the goals fizzle out sometime later. So much time and energy lost.

Such instances are relatively common in large organizations due to the sheer complexity of problems and the numerous layers of people involved in completing tasks. Reason? It is usually not an issue of intent or team competence, but rather the missed crucial step of leaders not diagnosing the problem before crafting goals, launching projects, and delegating them down the organization.

I have made this mistake myself a few times. In one case, I led a goal formulation exercise to set goals, delegated them to the respective teams, and waited. Nothing happened. Reason? I had just assumed that everyone knew how to break the problem down and execute. Big mistake. Recognizing this, I conducted a root cause analysis, an effective diagnostic tool. I gathered a few people in front of a whiteboard and began asking questions about the problem, how they were attempting to solve it, the hurdles they were facing, the assumptions they were making, and so on. That exercise blew my mind, as I found that the team was stuck due to some technical hurdles, unclear problem decomposition, self-imposed constraints, and unhelpful assumptions. That diagnostic exercise helped me figure out how to bring the project back on track. Participants of that exercise also enjoyed the process as they saw a path forward, and their moods lifted.

In another case, we had a big organizational goal, but one of the teams took some shortcuts in the architecture for expediency. I was uncomfortable with those shortcuts but didn’t act. The team gained the initial momentum, but that didn’t last long. Execution slowed down as complexity and debt accumulated. My mistake was not intervening early to diagnose the problem and set the appropriate course of action. In that situation, I didn’t challenge the prevalent beliefs that led to those shortcuts.

The lesson is to incorporate diagnosis into your leadership practice. Diagnosis is a valuable tool for enhancing team performance. You can’t drive change and improve team performance without incorporating diagnosis before and even after delegating. Ignoring diagnosis makes you a laissez-faire leader, i.e., one who does not feel the need to provide direction, distances themselves from poor team performance, and eventually avoids accountability. On the other hand, an effective leader would employ diagnosis to form crisp goals, design metrics that matter, craft constraints to drive focus, and design the proper execution rituals for high performance.

But how do you go about diagnosis?

First, begin by moderating the bias for action. You must pause everyone from jumping into action to instead focus on developing a thorough understanding of the problem space, the nature of the outcome, and the constraints involved. Some may get upset with you for holding up everyone, but it is necessary. In one case, when a CTO inspired everyone to pursue an ambitious goal, everyone ran to execute. However, six months later, it became clear that people were running in different directions, solving the easier parts of the problem while ignoring the more difficult parts. Meanwhile, skeptics remained on the sidelines, as they did not see a clear path to success; they waited for the project to fail or for someone to rescue the project. The project regained its traction only after an intensive diagnosis was conducted to establish a structure, set guardrails, and implement a prioritized sequence to solve the problem.

Second, when diagnosing, cut through organizational layers. Don’t prematurely delegate the diagnosis to someone. Be active during the diagnosis. Never hesitate to ask questions. If your organizational hierarchy consists of multiple management layers, don’t hesitate to go down a few layers below to bring people to a common forum, set the stage, and get down to diagnosis. Don’t worry about not respecting the management layers. Involve them in the process instead.

Third, use diagnosis to discover prevalent assumptions. Ask questions to understand how others perceive different aspects of the problem. What one team considers a significant problem may not be a major issue in the broader context. Similarly, a minor inefficiency or inconvenience at the team level may compound into a bigger deal at the broader organizational level. What may be an efficient approach at a team level may be expensive at a wider level due to its side effects. A well-conducted diagnosis helps you develop shared context across your teams — all participants in the diagnosis benefit from that context. Shared context also helps improve trust because everyone is heard and receives a more comprehensive understanding of the problem. You may also discover prevalent beliefs and attitudes, and learn about how work is done. For example, you might find the prevalence of a fixed mindset or helplessness regarding particular challenges, such as known but often overlooked process inefficiencies, technical debt, and rarely questioned issues, as well as misplaced assumptions about peer teams. Diagnosis helps remove that fog.

Use diagnosis as a leadership tool as often as you need. Simply asking questions is a good start. Consider frameworks like the “five whys” to ensure your diagnosis is exhaustive.

Leadership Fundamentals and Tech

2025-03-18T08:16:03-08:00

Yassine Kachchani recently interviewed me for his Exec Engineering newsletter. I am crossposting the interview on this site for posterity.

Highlights

Leading teams with firmness and humility
Transforming teams beyond technical solutions
Building developer platforms for flow state
Managing hands-on leadership in small teams
Navigating AI disruption through rapid experimentation

Yassine: You’ve written extensively about the human aspects of technical leadership. What sparked your interest in the psychology of leading engineering teams?

Subbu: I came across a fascinating statement by Paul O’Neill several years ago: “With leadership, anything is possible, and without it, nothing is possible.” Let’s zoom out for a minute and consider all the big positive things in recent history. We can’t name many that happened without strong leadership. But, to appreciate this statement, we must know what leadership is. Leadership is nothing but influencing others to follow you. Most of us get that point, but how do you influence others? That question prompted my interest in the psychology of leadership.

Influencing is a serious business. It takes intense self-development, learning about others, and perspective-taking. To influence others, you need to recognize and accept that we are all motivated by different things. You have to model the right behaviors and set the standards for excellence so that others will follow you. It takes inspiring others to do things they wouldn’t otherwise do. You have to have a track record of past outcomes to have the credibility to inspire others. You have to have the tenacity to stick with ambiguous situations. You must have the backbone to take a position and fight for what you believe is right and the humility to let go and yield to others. Above all, you must learn to manage your attitudes and emotions better daily. These are all aspects of human psychology.

Y: Throughout your career, you oversaw major technical transformations that changed how hundreds of engineers worked. What have you learned about leading teams through significant technical change?

Subbu: I was lucky to have dealt with a few technical transformations over the last 12-15 years. Most of those were complex and required naivete, patience, and humility. Once I got a taste of the first few, I pursued more such opportunities as I found such work incredibly rewarding. Those experiences taught me two good lessons.

First, care enough and then build the courage to put your hands into ambiguous problems. Every company has such issues. These are complex and depend on multiple teams to agree, with many unknowns and no clear path forward. Most organizations struggle and rot when not enough people care and dare to stand up and say, “I’m going to give it a try and get it done.” You also need to build that capacity and culture in your organization.

Second, most technical transformations are not technical at all. They appear to have a technical problem at the core, but once you begin to peel the layers, you will stumble on structural, cultural, and leadership issues. For example, one of the messy transformations I dealt with involved aligning several leaders and their teams over a sustained period. In another example, it took a few organizational tweaks to give space for a few teams to work differently and have the right goals. There were technical components in these transformations, for sure, but those were less significant when compared to driving and managing the change. So, the lesson is to look at problems that require a transformation as change management or adaptation problems, not technical ones.

Y: As both a builder and leader of developer platforms, what patterns have you found most valuable for improving developer productivity?

Subbu: In my experience, the most useful and successful developer platforms focus on creating the flow state. How do you get to a flow state? You typically get into the flow state when your goals are clear, most steps to realize those goals are clear, and the tools you use give you clear feedback and guide you to move code from your keyboard to a live environment where you can experience the output of your work. In this process, how various tools work well together is more important than the popularity of individual tools. But if most things you need to do force you to pay attention to things outside your goal accomplishment and get lost in rabbit holes to make things work, you won’t reach the flow state.

So, here is my message to teams building developer platforms: focus on simplifying choices, providing clear feedback, and offering a friction-free end-to-end experience to facilitate the flow state. Don’t get distracted by popularity wars like this tech vs. that tech. Engineer your approach to create the flow. Look at your code structure, build systems, test environments, release pipelines, and infrastructure choices, and engineer those to enable a smooth flow. That will help you allow a higher throughput of work.

Y: From your experience building teams of different sizes, what have you found essential for building high-performing engineering teams?

Subbu: I once made the mistake of imagining that I must do such and such to build high-performance teams. I thought of a prescriptive approach based on what I saw other leaders do and tried to follow. My approach failed miserably. That’s when I realized that what helps drive performance needs to be situational. Many factors go into building high-performing teams, but what works in one situation may not work in another. You first have to diagnose the current situation to determine what may be blocking strong performance. It varies from team to team and organization to organization. Could it be goal setting? Could it be cultural norms? Could it be a structural issue? Could it be politics? Based on your diagnosis, determine what needs to happen to improve team performance.

Y: Many of our readers are leaders running small engineering teams. What advice would you share with leaders who are trying to balance staying hands-on while building their leadership capabilities?

Subbu: In such cases, you should be ready to wear multiple hats, like managing projects, reviewing code, reviewing or even documenting designs, triaging bugs, etc. To be effective at these, you have no option but to balance being hands-on while acting as a manager-leader for the team. You can’t just do one or the other.

On this note, allow me to refer to a reproduction of a 1950s article in the Harvard Business Review: Skills of an Effective Administrator. The author, Robert Katz, a management guru from that era, wrote about three skills for managers: technical, human, and conceptual. Even though that article is over 70 years old, his recipe still holds good.

Y: Following your recent one-day AI coding experiment, what aspects of engineering leadership do you think will become more crucial, not less, as “AI coding” becomes the norm?

Subbu: Driving and managing change is the most essential skill to sharpen to accommodate this AI disruption. The landscape has been rapidly changing and will continue to change for a while. It is also going to be very noisy. The barrier to entry is lower, and the required skills are less specialized than a couple of years ago.

Companies are going to have to figure out how to derive value from this disruption. There are many hypotheses to be tried out, which will require a lot of experimentation across the board. While most experiments will likely fail, those that succeed might alter how people work. This is no longer a technology problem but a change management problem. To be successful, you have to find comfort in letting go of stability to let teams learn, experiment, and be comfortable failing fast.

My One-Day AI Coding Experience

2025-01-03T15:11:43-08:00

I set a goal for January 2nd: create a Q&A bot to answer questions based on all the articles I wrote over the years and all the Twitter messages I sent. I wanted to complete it in a single day. My project involved crawling my blog, parsing all my tweets, using RAG with the embeddings from parsed content, indexing those in a vector store, and calling an LLM to process my prompts to answer questions. I also wanted to use GitHub Copilot with Visual Studio Code to see how fast I could go. I did some prep work by reading a LangChain tutorial to build a RAG app and got some basics working before starting.
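To make the retrieval step of that pipeline concrete, here is a minimal sketch under loud assumptions. It is not the project's code: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for the vector store, and `build_prompt` only assembles the text that would be sent to an LLM rather than calling one. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Assemble retrieved context plus the question into one prompt string."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real setup, the embeddings would come from a model and the similarity search would be delegated to the vector store; the shape of the flow (embed, rank, assemble a prompt) stays roughly the same.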

My experience blew my mind. By mid-afternoon, I had most of it working. I got so excited that I had to step out for a run to cool down. There are clear implications for dev productivity, generalist software developers, and leaders.

For context, my coding credentials are shallow: I’ve not coded in a while and will not pass a serious Staff Engineer coding interview. Yet, I completed my project by the evening.

My Experience

Let me begin with my developer experience. I generated new code using GitHub Copilot and iterated to meet my needs. Copilot generated more than 50% of my code. I let Copilot teach me how to fix some runtime errors. I asked it to improve my code to handle edge conditions. I generated unit tests and used those tests to find minor issues. I prettified the code and generated documentation. Look at my prompt history to get an idea.

“Write code to crawl a website to find all the unique links on that site. Ignore non HTML content.”

“Prettify this file.”

“Write unit tests for the parser.”

“Catch the edge conditions in this file.”

“How do I clear indexes in OpenSearch from the command line?”

“Write Python code to convert tweets.js in the data subdirectory to JSON and then parse it to extract fields id, created_at, and full_text.”

“Adjust the parse_tweets.py code to sort the output by created_at in the ascending chronological order.”

“What’s wrong with line 29?”

“Add code to index embeddings of parsed_tweets to OpenSearch in the file parse_tweets.py file.”

“Send one document at a time on line 55.”

“Refactor index-the-blog.py and crawler.py to remove duplicated code. Move the crawl and crawl_the_site methods from index-the-blog.py to crawler.py. Also, adjust the unit tests in test_crawler.py accordingly.”

“Add response headers for all mocks in test_crawler.py.”

“Write better code.”
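To give a feel for what the first prompt in that list asks for, here is a minimal sketch of a same-site crawler that collects unique links and ignores non-HTML content. It is not the code Copilot generated. The `fetch` callable is a hypothetical helper assumed to return the page body for HTML responses and `None` for anything else, which keeps the sketch independent of any particular HTTP library.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url: str, fetch: Callable[[str], Optional[str]]) -> set[str]:
    """Breadth-first crawl of one site; returns the unique same-host HTML pages.

    `fetch(url)` is a hypothetical helper: it returns the page body for HTML
    responses and None for anything else (images, PDFs, errors).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    html_pages: set[str] = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # non-HTML content: ignore it
            continue
        html_pages.add(url)
        extractor = LinkExtractor()
        extractor.feed(body)
        for href in extractor.hrefs:
            # Resolve relative links and drop any #fragment so duplicates collapse.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return html_pages
```

A real `fetch` would make the HTTP request and check the response's Content-Type header before returning the body; passing it in as a parameter also makes the crawler easy to unit test with a dictionary of fake pages.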

The generated code and Copilot suggestions were not always correct, but I could fix issues quickly to keep moving. The end product works and does what it is supposed to. I indexed all the content in OpenSearch and created a basic command-line Q&A bot. The answers are not spectacular, as the information in my tweets is shallow, but the bot works. You can check it out on GitHub. I could feed more content to the bot to make the answers useful and continue to iterate on it.
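The two tweet-related prompts above (extract `id`, `created_at`, and `full_text`, then sort chronologically) can be sketched as follows. This is an illustration, not the contents of the actual `parse_tweets.py`. It assumes the archive's `tweets.js` is a JavaScript assignment followed by a JSON array, that each entry may wrap its fields in a `tweet` object, and that timestamps look like `Wed Oct 10 20:19:24 +0000 2018`; verify those assumptions against your own archive.

```python
import json
from datetime import datetime

# Assumed timestamp layout in the archive, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweets(raw: str) -> list[dict]:
    """Extract id, created_at, and full_text from the text of a tweets.js file."""
    # Drop the assumed JavaScript assignment prefix so the remainder is plain JSON.
    payload = raw[raw.index("["):]
    records = []
    for entry in json.loads(payload):
        tweet = entry.get("tweet", entry)  # tolerate entries with or without a wrapper
        records.append(
            {
                "id": tweet["id"],
                "created_at": tweet["created_at"],
                "full_text": tweet["full_text"],
            }
        )
    # Sort in ascending chronological order by parsed timestamp, not by string.
    records.sort(key=lambda t: datetime.strptime(t["created_at"], TWITTER_TIME_FORMAT))
    return records
```

Sorting on the parsed timestamp matters because the raw strings begin with the weekday name, so a plain string sort would not be chronological.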

I tried the last prompt (“Write better code”) after switching Copilot to use Sonnet 3.5 Preview. It made my code more respectable and faster by switching to async IO, and it added more error checks, retries, logging, and configuration. This prompt promoted me from a junior engineer to a senior!
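For readers unfamiliar with that kind of change, here is a minimal sketch of async IO with retries, logging, and a configurable concurrency limit. It illustrates the general pattern only and is not the code the model produced; `with_retries`, `run_all`, and their parameters are hypothetical names.

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Run an async operation, retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception as error:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs", attempt, error, delay)
            await asyncio.sleep(delay)
    raise ValueError("attempts must be at least 1")


async def run_all(operations: list[Callable[[], Awaitable[T]]], limit: int = 5) -> list[T]:
    """Run operations concurrently, at most `limit` at a time, each with retries."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(operation: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await with_retries(operation)

    return await asyncio.gather(*(guarded(op) for op in operations))
```

The speedup from async IO comes from overlapping network waits, such as fetching pages or indexing documents, while the semaphore keeps the number of in-flight requests bounded.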

Implications

Don’t be a skeptic and wait. We are already at the point of double-digit productivity gains. My coding days started with vi. Back in the day, getting ctags to work with vi to navigate code felt awesome. Later, debugging and refactoring with IDEs like IntelliJ brought modest productivity improvements to handle structurally more complex code. However, AI-assisted IDEs offer the opportunity for double-digit productivity gains to complete a lot of glue work involved in software development.

Skeptics will argue that the generated code is imperfect and point to examples of when the generated code was silly or incorrect. But I find such an attitude narrow. Given how generative AI works, AI-generated code is usually good enough to iterate on. AI assistance makes it easy to consume borrowed knowledge (like searching Stack Overflow). Once you have assembled the building blocks with an assistant like Copilot, it is up to you to iterate on them to make them work. Your mileage will vary based on the code and task complexity. However, consider that coding is no longer about writing every character by yourself but about assembling solutions by reusing prior work. AI-assisted IDEs help you iterate on the assembly part of coding faster. If you are in tech, learn to take advantage of such tooling instead of waiting skeptically.

Another common objection to productivity gains is the argument that most developers spend most of their time on non-coding tasks. While this is likely true, it is an argument for removing inefficiencies in the developer flow, not against using AI-assisted tools.

Generalists have tremendous opportunities to learn and broaden their skills. The AI space continues to be democratized, significantly lowering the bar for entry. Even two years ago, I could not have done what I did in this project without much more training, expertise, and time. I was, in fact, motivated to work on this project because I felt challenged when someone said, “We need the AI team to help us build a RAG for the chatbot.” I wanted to find out if that is still the case.

My experience shows that most generalists could put together decent solutions, leaving the specialists to focus on more advanced tasks. Most of the complex parts of such solutions deal with usual full-stack development issues like provisioning cloud resources, setting up roles and access policies, collecting and processing data, putting together a user experience, collecting logs and metrics, etc. So, if you are a generalist software developer, learn to put together such solutions.

If you are a manager leading tech teams, figure out how to activate your team. Leadership faces some challenges: ignorance of what is possible, skeptics in their teams, and organizational bureaucracy preventing such tech adoption. Handle these challenges as a priority in 2025. Get your hands dirty to learn what is possible with these tools and form opinions to shape the direction of your teams. Then, figure out ways to activate your teams, create early wins, persuade the skeptics, and learn to benefit from these tools.

There are as many industry claims of significant developer productivity gains as there are counterclaims. A recent study by researchers from MIT, Princeton, UPenn, and Microsoft showed a 26% increase in task completion by developers. Another study by Uplevel showed that Copilot introduced 41% more bugs. Given the complexity of the software we deal with, both could be true in different circumstances. Such tooling will not eliminate the human in the loop but might help reduce some grunt work. The only way to find out is by trying it out rather than ignoring it. The SDLC will continue to be disrupted throughout 2025 and beyond. Ignorance or apathy is not a strategy for success.

Twenty Tiny Leadership Lessons

2024-12-30T21:45:59-08:00

Most leadership learning is experiential. We observe, learn, and emulate from others, often subconsciously. Yet, the core of such learning starts shallow, leading to behavioral and decision-making mistakes, learned and uncorrected bad behaviors, and dysfunction. Some get better with experience and scope, but more often than not, we wing it, frequently repeating the same behaviors and mistakes for years. Recognizing this challenge, two years ago, I enrolled in a Master’s program in the Psychology of Leadership at Penn State University. It turned out to be an excellent investment of time and money. In this article, I share the top twenty things I learned from those studies.

1. Roles and not traits: At its core, leadership is a role you play in a situation. You don’t become a leader just because you show some personality traits. You are a leader when you can influence others to follow you to accomplish a common goal in a particular situation and not otherwise. To become a leader, you must create a willingness from others to follow and work with you.

2. Followership: The leadership process also includes another critical role: that of a follower. Followership is not about unthinkingly obeying orders but about actively supporting the leader’s vision and contributing to the team’s goals. You can be a leader in one situation but a follower in another. You must be flexible and willing to follow others for effective organizational outcomes. You will be less influential if your ego prevents you from following others.

3. Being a team member: Your other role is that of a team member. Consider an organizational setting. Say you manage ten people, and your manager manages eight people, you being one of those eight. What is your role? There are three roles to play. First, you play the leadership role for the people you manage. You could be a mentor, coach, cheerleader, supporter, etc. for those ten, but you are not their teammate. The second role is that of a follower to your manager. The third role is that of a team member with your peers. Being a team member is different from being a leader or a follower. It requires effective collaboration. As a team member, you establish partnerships, influence your peers, are influenced by them, drive or contribute to organizational decisions, and adaptively contribute to broader organizational goals.

4. Leadership and management are complementary: J. P. Kotter, the management guru, described the differences between leadership and management: leadership is about driving and managing change, while management is about creating order and consistency to cope with complexity. The former involves establishing direction, showing a positive can-do attitude, the drive to get things done, creating a vision, inspiring others, clarifying the big picture, influencing and aligning, etc. In contrast, the latter requires planning, organizing, resource allocation, project execution, staffing, budgeting, incentives, etc. Learning about these differences helps you adequately invest in developing your leadership and managerial skills.

5. Standards for excellence: Setting standards is a potent tool for driving change. From behaviors to technical or business goals, it is up to you to raise the bar of excellence. Setting standards for excellence is one of a leader’s top three jobs. The other two are creating a willingness for others to follow and building teams. Don’t accept the status quo. Keep on raising the standards for excellence. Doing so stimulates everyone’s creativity.

6. Three skills: Broadly speaking, leadership growth depends on acquiring three skills: technical, human, and conceptual. Technical skills are the usual domain-specific competencies to execute tasks and are usually the ones that get you your first job. Human skills are about working with others; your leadership growth gets stunted when you find that challenging. Conceptual skills take you further into articulating visions and establishing roadmaps and strategies. Instead of coasting on what you learned in school and winging it, focus on developing these three skill groups.

7. Task vs. relationship behaviors: Leadership effectiveness depends on balancing task-oriented and relationship-oriented behaviors. Task-oriented behaviors involve taking the necessary steps to get work done. In contrast, relationship-oriented behaviors involve concern for people. Excessively task-focused leaders usually lack strong human skills and ignore team building. They only focus on results. Excessively relationship-focused people care more about making people feel good than getting things done. People love to work for relation-oriented leaders but may not achieve their potential. You need to find a balance between the two.

8. Power: Power is the leadership currency to influence others. Your technical expertise, brand, what you have, what others want, role and title, place in the organizational hierarchy, network, and ability to reward and punish others all contribute to your power. How you use them is up to you; your values, standards of behavior, and ethics help you appropriately use your power. As I wrote in Mid-Career Stuckness, a lack of understanding of one’s own and others’ power and a failure to develop sources of power can get one stuck in one’s career. Effective use of power also involves inspiring and motivating others rather than simply commanding them.

9. Defining yourself: Gaining power and influencing others is not easy, and the situation may not be in your favor. For example, for whatever reason, your boss might have formed an inner circle that excludes you, thus giving you less access to challenging work assignments and resources. Gaining trust and entering that circle may take time and patience. Or, you may be in a less critical part of the organization with fewer opportunities for visibility and influence. Disrupting such a situation could be challenging, but I would not give up. Seek opportunities to contribute to broader organizational outcomes or keep driving improvements in your area. Expect resistance, but don’t give up. You will have to find ways to continue to define your role instead of letting the situation box you into a definition. Be persistent.

10. Being situational: There are many leadership theories, such as authentic leadership, transformational leadership, servant leadership, etc., but being situational is the most practical approach to leadership. It means exhibiting the behavior that the situation requires. To be situational, you must assess the situation and ask yourself what the situation needs. For example, showing anger and frustration is a valid response if the situation demands not tolerating someone’s behavior at work. When your team is not yet competent in a particular area, it is OK to be hands-on and micromanage the tasks. But then, when a part of your team is highly proficient and requires the least supervision, it is OK to pay cursory and periodic attention. Applying the same style everywhere is usually the least practical leadership approach. See Authenticity and Acting for some examples.

11. Being adaptive: One of the challenges of leadership is diagnosing the problem. Say you inherited a team that is under deep toil. People are overworked. The systems are critical, and yet there is ample technical debt that will take time to repay. People know the solutions to modernize but lack the competencies and time to drive change. What do you do in such a situation? Should you start to modernize the systems first to clear the debt? Should you make an organizational change? Should you fire the bottom performers and get new talent? Where do you begin? The book The Practice of Adaptive Leadership gives a framework for such situations. It describes two kinds of challenges: technical and adaptive. Technical problems have a clear problem definition and a solution. Adaptive problems, on the other hand, require learning and change to identify the problem and the solution. Most organizational problems are not technical but adaptive. For such problems, the path from the problem to the solution is not a straight line.

12. Leading strategically: Strategic leadership is a foreign construct for most mid-management. Middle managers expect the C-suite to develop the company strategy and then hand it down to the hierarchy to align everyone’s work to implement it. However, most organizations have areas that could use strategic approaches. For instance, most of my career involved building and improving horizontal platform capabilities to enable other parts of the company. Such areas are unlikely to become part of the overall strategy, and it is up to the mid-managers to lead strategically in their respective functions. But how does it work? Strategic leadership involves identifying a target state, establishing the business case to reach that state, mobilizing and aligning others for support, articulating the right goals and measures, running programs and processes to implement the strategy, and adapting as you go. The more senior you are in the organization, the more important such skills are. Without strategic leadership in the middle, organizations can rot.

13. Ethics and being sensitive: A tricky and potentially taxing part of being a leader is dealing with sensitive topics where any response could harm someone or send the wrong message. Access to power gives leaders ample opportunities to misuse or abuse that power, creating a toxic work atmosphere. Even indecision in the face of bad behavior ruins the work culture. People overlay their morals and values onto the leader. There is an implicit expectation that leaders know and do what is right. You gain followership when your behavior conforms to and goes beyond such expectations. Alternatively, others lower their standards when they see their leaders lowering theirs. Hence, having the backbone to maintain moral, ethical, and behavioral standards is integral to being a leader.

14. Motivations and attitudes: As a group activity, the leadership process involves interactions between people with different motivations, attitudes, and behaviors. Motivation drives people to behave a certain way, and attitudes reflect our positive or negative evaluations of the world. Our motivations and attitudes drive our behaviors. However, motivations and attitudes are internal to us, while behaviors are visible. It is worthwhile asking what motivates you. I use that information to develop and maintain positive attitudes. However, since it is tough to judge what motivates others, it is best to avoid guessing their motivations.

15. Inspiring others: Storytelling and creating an inspirational vision are among the least frequently used techniques for influencing others. Say you need another team to do some work to unblock your team. The most common path is to approach the manager of that team and tell them what you need from them and by when. The most common response you receive is to come back later since they are busy with their priorities. A better approach is to start with the why and make that team part of that why. When you get this step right, you and that manager will likely find ways to align your objectives for broader organizational success, thus creating a win-win situation.

16. Goals, rewards, and mechanisms: People perform better when they know what they are expected to do, how their performance will be evaluated and rewarded, and when they have set mechanisms to follow to plan and organize work and track progress. Keeping these unspecified or under-specified leads to subpar performance. There is also a mistaken belief that people know what they are expected to do and how to get there and that there is no need to monitor progress. However, clear goals and progress-tracking mechanisms maintain focus and alignment.

17. Watching for biases: As human beings, we take cognitive shortcuts called heuristics. But heuristics lead to biases and decision-making errors. For example, based on some project outcomes, you might have concluded that a particular team is underperforming and their manager is incompetent. A closer inspection might have revealed factors outside the team’s control. Developing critical thinking practices like role-playing, getting others’ opinions, and analyzing the situation more deeply helps avoid biased decisions.

18. Errors and dysfunction: We all like good leaders. However, we often encounter leaders who sometimes show dysfunctional behaviors such as indecisiveness, ineffectiveness, moodiness, and unpreparedness. There are a few techniques to practice to minimize such tendencies, like getting 360-degree feedback, including others in critical decisions, and developing self-awareness.

19. Toxicity: Not all leadership is positive and healthy. Toxic leaders exist; you may work with or work for some of them. These are abusive bullies and control freaks who purposefully punish or hurt others for personal gain. If you are fortunate, they are not above your level in the organizational hierarchy, so be candid with your feedback. If such toxic leaders are within your organization, act swiftly to weed them out. There is no need to tolerate such toxicity. But if you work for toxic leaders, walk out as soon as possible.

20. Self-awareness: Finally, leadership requires self-awareness at its core. Can you sit down for ten minutes and treat every feeling, emotion, and thought that comes into your mind as an object separate from yourself? Can you detach from those and let them float by? Can you also do so amid a high-stakes discussion with others? You are self-aware when you can. Self-awareness multiplies your leadership in multiple ways. It helps you shift your focus from your wants and needs to the needs of others and the situation. It sharpens your sense of reality, thus improving your leadership presence. It helps you be situational and adapt to change. It reduces biases in your decision-making. But trust me, developing self-awareness is the hardest of all these lessons.

I hope you find these lessons helpful. They helped me better influence, adapt my style to various situations, lead strategically, build a high-functioning team, and develop a solid leadership bench to produce solid outcomes. As Eduardo Briceño says in his TED talk about alternating between learning and performance zones, it helps to learn about leadership without the pressure of performance so that when you get on that leadership stage to perform, you will have figured out some proper techniques.

Contemplative Reading

2024-12-01T14:52:27-08:00

Like most bibliophiles, I read many books and acquire even more than I can read. Most of my reading is non-fiction across philosophy, psychology, natural history, biographies, economics, health, and business leadership. But over time, I realized that I forget more than 99% of what I read, and only a few ideas stick in my head. That’s sad. Why spend so much time and energy acquiring and reading books to gain just a few ideas or just the memory of having read a book? That changed this year. About a year ago, I stumbled upon contemplative reading as a way to increase the quality of my reading experience.

Contemplative reading is a slow form of reading, usually involving the reader asking questions like

What is the text saying?

What does it mean?

What does it mean to me in my life or situation?

How would I explain this idea in my own words?

Essentially, you ask what the material is telling you. You ask yourself if and how it might apply to you. You make notes. You may write about it in your journal. You try to explain to others. Contemplative reading may also be repetitive, where you stay on each topic for hours or even days. This is a slow internalizing form of consuming information to assimilate the essence and blend it with the ideas you already have in your head.

On the other hand, the most common form of reading is fast and consumptive. We read or listen in to get the gist quickly. We want to read what everyone else is reading so we don’t miss out on good ideas. Such consumption is helpful in the short term, but perhaps not in the long term, as the ideas fade away and only the memory of the book survives in our minds.

Not every book deserves to be read contemplatively. For instance, most business/self-help style books are fluffy and repetitive; reading a summary or browsing such books may be good enough. But once you find the right book, contemplative reading is a low-throughput, high-ROI exercise. Contemplative reading expanded what I read. Now, I’m not afraid of picking up dense books that take time and effort. For example, during this summer and fall, I read Jay Garfield’s The Fundamental Wisdom of the Middle Way. It is a dense commentary on the philosophical arguments of a second-century Buddhist scholar, Nagarjuna. The text and ideas are abstract. I started reading it in the early summer and got frustrated as I could not understand most of what the author was saying. Then, I switched to a contemplative style, starting again from the first page. I made copious notes. I went back and forth. That experience changed what I got from the book, and it became one of my favorite books in my library.

By the way, contemplative reading is not new. Reading in most spiritual practices is contemplative, involving reading, meditating, praying, discussing, etc. It’s worth a try to get more out of reading.

Monkey Business

2024-10-06T10:08:00+05:30

In 1999, Harvard Business Review published Management Time: Who’s Got the Monkey? This article uses a monkey metaphor to warn managers about accepting and carrying every problem that comes their way on their backs instead of deflecting or passing on such issues to others. It has been a popular article—I encountered it in multiple conversations. The metaphor of a monkey on the back is easy to explain. The metaphor is so powerful that it motivated me to address some root causes when I first came across it. Though I don’t relate to the solutions proposed in that article, I used it in several coaching conversations. The metaphor helps debug the root causes behind managers struggling with busy calendars, not investing in themselves, not having detached vacations, or, more importantly, not having time to be strategic. In this article, let me introduce that metaphor visually and point out some techniques to address the root causes.

Below is a visual illustration of this metaphor — the visualization is mine. Imagine a manager walking in the corridor, encountering a team member struggling with a challenge, and gladly volunteering to solve their problem, thus inheriting a monkey on their back. This ‘monkey’ could be a project deadline, a team conflict, or a technical issue. As the day went on, the manager would run into various other meetings dealing with misalignments, escalations, due dates, etc., and inheriting more ‘monkeys’ on their back. The manager would not mind inheriting one more ‘monkey’ in their 1:1 with their manager. Thus, by the end of the day, the manager would have plenty of problems to address, i.e., ‘monkeys’ on their back, leaving no time for self-care, team development, or strategic leadership. That’s the gist of this metaphor.

The metaphor sounds funny. It is easy to visualize others carrying monkeys on their backs. But when you are the one carrying those monkeys, how do you fix it?

An easy answer to managing the ‘monkeys’ is to set boundaries and develop a thick skin. This doesn’t mean being defensive or apathetic. It’s about learning to say no to some issues, accepting others but not immediately, and becoming skilled at deflection. You might even decline some ‘monkeys’ by providing a list of reasons. Many experienced managers use these techniques. However, it’s important to remember that such defensive methods can make you seem unhelpful to others at work and be a poor organizational citizen. As I wrote in Leadership for Results and Peace of Mind, being useful is a leadership behavior I value. Hence, I don’t recommend setting boundaries as the only technique. Use it moderately and appropriately.

What else can you do to pass monkeys with ease? Consider a few more essential factors that help you responsibly pass the monkeys.

The first thing to check is the structure and capabilities of your team. Ask a few questions:

Do you have a clear charter for your team and an idea of what they are supposed to do and not do?
Have you defined the roles and areas of responsibility for each direct report? Do they understand their roles and areas of ownership? Are they capable of meeting those expectations?
Is your team’s structure easy for others to understand? Do people outside your team understand who does what?

If the answer to these questions is no, you may carry more monkeys than you should. You might not be set up to delegate effectively to your team. Your first challenge is to work on shaping your team. Having a bunch of direct reports does not make you a manager; you have to shape the team for easier management. You must organize the team and develop people for efficient execution and results. Without a clear charter and structure, others have no choice but to come to you for problem-solving. You cannot delegate, i.e., pass the monkey, because you don’t always know who to delegate it to. If others know who to go to, they might gladly bypass you, which helps reduce the influx of monkeys toward you. Effective delegation can be a game-changer: it develops others and gives you more control over your time and responsibilities.

The second thing to check is your team’s operating processes.

Do you have a ritual or forum to triage ad hoc issues? When someone is looking to hand off their monkey to you, can you instead show them the way of an existing process? The process could be as simple as a bug queue or a roadmap intake process.
Do you have a decision backlog and a decision-making process to groom that backlog of pending decisions or approvals? Some of the monkeys may be pending decisions and approvals. Instead of accepting such monkeys as they come, can you point them to the process?
Do you have a process to deep dive into your team’s work so you’re better equipped to know what your team is working on, the current state, and any lingering issues?

Again, if the answer is no, it’s time to take action. Well-designed and implemented processes are powerful. They streamline team management and equip you to handle incoming monkeys. Having such processes in place can make you feel more equipped and prepared to handle the challenges that come your way.

The third thing to check is whether you have a strategy for your team. A strategy involves determining what to do and what not to do. It’s about making explicit choices concerning what is in your team’s scope and what is not. I will get into the specifics of strategy development in a separate article, but a strategy provides a filter to decide what monkey to accept and what not to accept. Having a clear strategy can make you feel more focused and purposeful in your role as a manager.

Don’t be mistaken — the central premise of this article is not about deflecting or passing monkeys so you can relax. It is about creating discretionary time on your calendar that you can use to invest in what matters. For example, if your answer to why you don’t have a point of view on your team’s strategy is lack of time, then you must find a way to make time. If you can’t disconnect on your vacation because your team is not ready to manage independently for a few days, you’ve not done a good job developing your team to step up occasionally. Your calendar tells you whether you have any discretionary time to pick up such developmental activities.

Leadership for Results and Peace of Mind

2024-07-05T12:04:46-07:00

I deeply care about getting results. I like to see things improved. I also want to enjoy my work and like the people I work with to enjoy what they do. Plus, I like to have peace of mind every day. How do you get all three right most of the time, if not always?

After years of testing various ideas and behaviors, I developed a leadership framework that has proven effective in achieving results, enjoying work, and maintaining peace of mind. While there’s no one-size-fits-all leadership recipe, I’m eager to share a few key behaviors that have helped me and could benefit you.

The following factors motivated me to formulate these leadership behaviors.

Leadership does not just begin at the top and flow down. Leadership is a process of influence. Your effectiveness depends on your ability to influence people across multiple degrees of separation. On the one hand, you are expected to direct and motivate your teams to get results, which can be challenging given expectations, budget, timelines, and their motivations and attitudes. On the other hand, you also need to influence your peers, your managers, their peers, and stakeholders inside and outside your organization to get the necessary resources, funding, support, and, most importantly, alignment.
Your work relationships and the influencing process come with emotional baggage. Ambiguity, constraints, setbacks, wins and losses, consequences of failures, power, politics, and other “soft stuff” challenge your psychological state. Very few people gracefully handle such “soft stuff.” To cope, many become passive, selfish, edgy, pushy, dominating, controlling, moody, arrogant, narcissistic, or dishonest. Most of these behaviors have consequences for others, which brings us to the next point.
Your character and balanced behavior matter to others. Here, I’m using the word character and not authenticity, as authenticity can mean different things for different individuals. Character has a simpler meaning — it refers to your moral and ethical qualities. As a leader, you can be a force for good for your company, your team, and other stakeholders. Or you can hurt others by playing against, blocking, taking credit, favoritism, etc. Leadership roles give you ample opportunities to be contemptible. The choice is yours.

Given such diverse operating forces, how should one lead to (a) get results, (b) enjoy what they do, and (c) have peace of mind?

The Framework

You can’t have peace of mind, foster an enjoyable working environment, and yet produce results unless you build a harmonious relationship between your role and pursuit, your attitude toward others, and the interests of others. My framework consists of five leader behaviors: (a) practice equanimity, (b) nurture power, (c) drive, (d) be useful, and (e) develop others. Of these five, equanimity is the foundation, on which you layer power to drive, being useful, and developing others. That’s my recipe.

Some of these behaviors may sound paradoxical or defeatist to those who have worked for or learned from aggressive leaders, but as I clarify below, you can have the cake and eat it, too.

Behavior 1: Practice Equanimity

Equanimity, often overlooked in leadership, is a state of even-tempered mind. It allows you to handle victories, failures, praises, and abuses gracefully. This calm state of mind gives you the power of presence, the ability to listen, observe more, and react less. It enhances your self-awareness and emotional intelligence. With equanimity, you can lead complex issues with a steady hand and not panic. Most importantly, equanimity ensures you have peace of mind every day.

Equanimity increases distress tolerance and improves your ability to perceive reality. When you are equanimous, you are less likely to be swayed by strong emotions or biases, which can distort your perception. This balanced state of mind allows for more objective observation and understanding of situations, leading to a more accurate sense of reality and enabling you to make better decisions and handle healthy conflicts.

Behavior 2: Nurture Power

Leadership is the process of influencing others to get things done. How do you influence others? You influence others through various types of power, such as your knowledge, how others perceive you, who you know, your credibility, the budget and headcount you manage, who you report to and who you manage, etc.

Power gives you leverage. The more types and quantities of power you possess, the more leverage you have to influence others to get things done. So, don’t be agnostic of power and politics at work. Acknowledge that power and politics are part of the natural fabric of any organization. Understand the sources of power you have and the sources of power others have, and then continue to nurture your sources of power. These could include your leadership competencies, relationships, what you have done for others, budget, headcount, strategic projects, strategic capabilities, etc.

But don’t get anxious about accumulating power. Be patient. Begin by analyzing your sources of power and weaknesses. Develop organizational awareness. Be strategic about nurturing power. Some sources of power, like the title, headcount, and budget, can disappear or change as companies change, whereas your relationships, competencies, and what you have done for others stay with you. Also note that power has a nasty way of corrupting your character. Be aware. You must develop a healthy relationship with power and politics.

Behavior 3: Drive

Drive gets results. When you are driven, you will find things to improve. No drive, no results. Leadership does not matter much without results. Look around you — how many leaders are actively driving, and how many are just going through the managerial mechanics to stay the course?

Driving does not just mean doing what is expected and keeping things steady. You must have the foresight to see challenging problems and the backbone to stand up and address them. Driving involves challenging the status quo, stepping up, setting up unarguable goals, inspiring others, driving clarity, paving the path, creating necessary alignments, and organizing and operating to get results. Driving must be your job. Mind that drive does not come naturally to everyone. You have to practice again and again until it becomes your nature.

Behavior 4: Be Useful

Being useful is a form of humility. Show humility and be useful to others. Seek to understand how your goals might help others and the broader organization. Ask your peers what you can do to help achieve their goals.

It might seem paradoxical and counter-intuitive to be useful to others instead of always focusing on what you want. Here is a secret — you increase your influence when you shift your focus from your goals to the goals of the broader organization. When you do so, your team will also collaborate more amongst themselves, and they, too, focus on shared goals instead of their narrow personal goals. Try it out — you will realize that being useful to others detoxifies your work life and contributes to your influence.

But being useful takes courage. What if others’ goals are so much more important than yours, and you might need to give up on them to support theirs? That’s possible. So be it. If that is the reality, face it.

Behavior 5: Develop Others

Leadership is a team sport, and your role as their leader is that of a coach. You have two choices: treat your team as tools for you to use to get what you want, or treat them as individuals with distinct motivations and attitudes and invest in their careers by providing opportunities, stretch assignments, and increased scope. Exercise the former to get your way, but make it toxic for your team. Or, exercise the latter to reduce toxicity at work while increasing your leverage. Think of this analogy — when you invest in others, they will come to the war and fight for you, and you are less likely to die alone. It’s a win-win.

But you may only be able to develop some. So, use your intuition to pick your bets. Mentor and coach them. Open up stretch opportunities to increase their scope and performance. As they develop, so does your ability to produce results and influence.

Now What?

I didn’t come up with this framework overnight. It took years of asking why I wanted to lead. I experimented with different versions of my ideas to find logic and cohesiveness between them. I also spent nearly two years studying topics related to leadership psychology, such as goal setting, strategy, motivation, attitudes and behaviors, power and influence, ethics, driving organizational change, and leadership theories like servant leadership, transformational leadership, and authentic leadership. I complemented this activity with some philosophy studies.

Of the five behaviors I listed, equanimity didn’t initially make it to my list. But as I wrote and rewrote my framework and experimented with different ideas, equanimity bubbled up to the top. I consider it the most essential characteristic for living and leading well. I strongly recommend you take up some mindfulness practices to learn more about equanimity and how to get into an equanimous state. It takes rigorous practice. If you want inspiration about this quality, watch Ted Lasso. In this clip, what does Ted imply when he asks Sam to be a goldfish?

The next one to assess is your sources of power and their reach. Nurture power. Don’t treat power as evil. I’ve written about power in the past with some references. I recommend reading Power: Why Some People Have It and Others Don’t and Managing With Power by Jeffrey Pfeffer and The Psychology of Persuasion by Robert Cialdini.

The foundation of equanimity and a moderate (not excessive) penchant for power equip you to drive. Being useful to others and developing others helps you lead larger, and these behaviors also contribute back to your equanimity and power.

Let me know if these ideas make sense. If you are interested in exploring your leadership and want to talk to someone, don’t hesitate to contact me. You can drop me a message on LinkedIn or email me at “subbu at this domain” for a coaching conversation.

Goal Crafting

2024-05-05T12:28:30-07:00

Goal crafting is one of the most essential leadership activities. Organizational performance and team growth depend on well-crafted goals. Without a good goal-crafting exercise, your teams may focus on what is in front of their noses, solving what seems quickly solvable. Good goal crafting forces you not to ignore or postpone problems that require new ways of thinking, collaboration, or hardships. Without a good goal-crafting exercise, you can get stuck in the status quo or focus on what matters to you or your opinions, not what your stakeholders might need. Good goal crafting creates and drives your organizational strategy.

Here are my guidelines for setting objectives and key results. In this article, I use goal and objective interchangeably and consider OKR a practical goal-setting framework.

Make your goal unarguable: An unarguable goal is one that most people agree with as it aligns with the organizational principles and direction. You’ve already lost your battle when others debate and argue about the validity of your goal. A well-crafted goal makes people at least say, “Of course, we should do that,” or ask, “Why are we not already doing that?” Unarguable objectives are typically not subjected to individual opinions. People may disagree on how to accomplish such an objective but not disagree with the objective itself.
Manufacture consent: A leader’s job involves creating willingness for others to work with the organization to support their objectives. Such willingness manufactures consent, and people will refer to the goal when debating priorities or choices. A well-crafted goal makes others associate with you as they like to see the same outcomes because it benefits them. Here is a litmus test — you’ve done an excellent job crafting an objective when other teams speak of your goal as their goal, too. That’s an indication of inspiration and manufactured consent. When others begin to talk of your goal as theirs, you have inspired others to work with you toward that goal, and you have a much higher chance of realizing the goal. You will likely struggle to get their time and support when that does not happen. When my team asks me to escalate some issue to another team, I usually probe the goal first. Often, the root issue turns out to be a misaligned goal and not understanding the broader context.
Let it make everyone uncomfortable: Well-crafted goals should make your team uncomfortable. They should put them out of their comfort zone, testing their assumptions and technical and human-relationship competencies. Such goals require a growth mindset and learning things that have not been done before. On your part, a well-crafted goal requires fierce determination and unwillingness to give up. It should force your organization to continually seek options to get around obstacles. In the best case, your organization finds multiple options when no option seems possible.

What about key results? Consider two key attributes of key results.

Meaningful: Your key results should be meaningful to your stakeholders. You should craft the key results in terms of what makes sense and is beneficial to your stakeholders. For example, replacing five ways of doing something with one way might benefit your team and be operationally beneficial to them. But why should your stakeholders care about your team’s operational efficiency? What’s in it for them? Perhaps replacing five ways with one way might help your stakeholders eliminate some pain and make them productive. Think of that pain, and craft your key result to focus on that pain. Consider what matters to your stakeholders and not yourself.
Measurable: Ideally, your key result should be measurable. Typically, objectives are qualitative, and key results are quantitative. Measurable goals force you to be data-driven, reduce the fog of opinions, and improve clarity. What you measure should usually mean something to your stakeholders. In the example of five ways of doing something, your key result could be to improve efficiency for your stakeholders by some percentage or to reduce the time they take to perform some tasks. Measurable key results help you track progress. People will know when they get to the finish line.

What about things (i.e., tasks or activities) your team needs to do to realize the objective and key results? Key results are outcomes your stakeholders want to see. Those are generally fixed. You might change or refine the activities when the going gets tough, but should generally keep the goals and key results the same. Track your activities separately from your objective and key results.

Be creative. Consider goal crafting as a leadership and team development exercise and not just a word-smithing exercise to represent what you want your organization to work on. Remember that goal crafting creates and enables your strategy.

Thought Spirals

2024-03-06T13:18:37-08:00

Have you been in situations where you are in a thought spiral and stay in it for hours or days? It happened to me recently, and that was not the first time. Since I’m of the same species as everyone who might be reading this (except those bots that pretend to be human), and since thought spirals are a common source of made-up human misery, I decided to write down my observations.

Recently, I felt less in control due to how I had been processing various events at work and in my personal life. Everything was and is okay, but my processing mechanisms malfunctioned, and I entered a few thought spirals.

I feel less safe and more anxious when I go through one of those thought spirals. I make up stories about what is happening. I then begin making up plans and telling more stories about how I might be able to gain control of those situations. Those plans lead to more stories about gaining control, what and who might sabotage those plans, and attention turns towards judging myself, others, or circumstances. Those judgments keep me going in the thought spiral.

While meditation and journaling help, I recognize that the way to exit such thought spirals consists of three timeless recipes. These recipes have been known to us for thousands of years and appear in different ancient philosophies dealing with human consciousness. We need to seek them — more about those later. Here are the recipes.

Recipe 1: Be self-aware when you enter a thought spiral. Learn to observe your thoughts and emotions. Viewing your thoughts and emotions as an observer might sound puzzling or even delusional, but that ability is nothing but self-awareness.

Recipe 2: Remind yourself that there is nothing for you to control. The desire for control leads to anxiety. Observe instead of controlling. Reflecting on my behaviors, I influenced and dealt with situations the best when I took an observer position instead of a controller position. Being an observer does not mean watching on the sidelines and keeping quiet. It means being present and processing the proceedings around you.

Recipe 3: Turn your attention towards improving the situation and helping others. Doing so shifts your focus from anxiety to action. Clarity and success follow when you turn your attention from yourself to others. Again, this idea might sound defeatist and letting go, but it is the other way around. You have little chance of influencing others when you make any issue about yourself. Your chances improve when you make it about others.

Essentially, these recipes remind us that the best way to lead yourself is to take yourself out of the picture. That sounds hard initially, but it unlocks clarity, purpose, and joy. I’ve been told time and again that I lead with a steady hand. Some friends called me Yoda. I worked with some fantastic leaders who excel at maintaining a steady hand. I also know that I fail at it sometimes. I now know the secret.

These recipes not only help you gracefully exit those thought spirals, they help you gain the power to influence situations and others. My realization began nearly two years ago when I posted a tweet and pinned it to my profile:

The secret to power in leadership is detachment. Not detachment from the outcome or others, but detachment from yourself and your way.

The essence was that the best way to lead in difficult situations is not to make it about yourself and gain control. Dealing with such situations requires self-awareness, observation over control, and making it about others, not yourself.

Pluralism

2023-12-31T21:31:38-08:00

One of my most influential and inspiring experiences in 2023 was reading a couple of Ramachandra Guha’s books on Mahatma Gandhi. The first was Gandhi 1915-1948: The Years that Changed the World, which covered Gandhi’s life from 1915 leading up to Gandhi’s assassination in 1948. This book gave me a breadth of Gandhi’s role in India’s struggle for freedom and politics, which I enjoyed reading very much.

But it didn’t answer some of my fundamental questions about Gandhi’s leadership development. So, I picked up Gandhi Before India, which traced Gandhi’s life from 1869 to 1914. I found this book much more insightful than the former. It showed Gandhi’s development as a leader from his humble beginnings, his unsuccessful attempts at being a lawyer, and then his journey to South Africa, where his taste of discriminatory policies of British colonialism triggered his development. This book also highlighted the roots of South Africa’s apartheid, which took 100 more years to resolve with its struggle. This book gives you a front-row seat to watch Gandhi’s leadership development.

These books covered nearly 80 years of India’s history in about 1400 pages (excluding author’s notes) of fluid English. These are not just biographical sketches of Mahatma Gandhi. Based on extensive research, Guha narrated India’s culture, history, and certain facets of geopolitics. I’m not done yet. I’m now waiting for my turn at the King County Library System to read India After Gandhi.

My initial interest in reading these books was to observe how leaders develop. I had several questions in my mind. How did Gandhi become who he was? What led him to pick up the causes he picked up? What experiences shaped his philosophy of non-violent resistance? Who played what role in his life in shaping his philosophy? Through Guha’s books, I learned much more than I bargained for. In addition to giving answers to these questions, Guha’s books strengthened my preference for pluralism.

First, leadership development won’t happen unless you force yourself into situations that demand solving ambiguous problems requiring change and dealing with people of different viewpoints. You can read many books on how to influence others. Still, you won’t get to learn to influence without putting yourself in situations that demand you to develop the ability to influence. This is precisely what Gandhi did in his years in South Africa when confronting the discrimination of British colonizers against the so-called “Asiatics” — these were the peoples that British colonizers brought from India, China, and other Asian countries on indenture to South Africa. Gandhi’s philosophy of non-violent resistance and pluralism evolved during this time. Gandhi, as we know now, might not have happened if he had not put himself in those situations.

Of course, to lead change, you have to have courage to form opinions, stamina to hold your ground, tenacity to keep pursuing the cause with different tactics, patience to influence others, humility to be brutally self-critical and acknowledge mistakes, willingness to do the hard work of organizing and mobilizing people, and presence to reiterate messages again and again through writings and speeches. You see Guha describing all these behaviors in these books.

Second and most important, religious and political pluralism is the only viable path to peace and harmony. That was true during India’s struggles in the 1900s, and it is true today. Pluralism recognizes and permits different interests, convictions, and lifestyles to coexist peacefully. Per Guha, by the 1920s, Gandhi saw the “sustenance of religious and linguistic pluralism was central to the nurturing of nationhood.”

We have ample historical evidence of what happens when we don’t embrace pluralism. Not recognizing and permitting the interests, convictions, and lifestyles of others brought us near-extinction of the natives in the United States, racism and genocides in many parts of the world, the apartheid in South Africa, and even the ongoing Israel-Palestine conflict. It also got Gandhi assassinated in 1948 as the person who shot Gandhi on January 30, 1948, was a Hindu fanatic who got upset by Gandhi’s pluralistic views on Muslims. Though Gandhi persevered with his pluralism to a large extent till his assassination, India had begun drifting towards monoism, which is why Gandhi is less popular now in India than before.

Pluralism is hard. It requires you to check your judgment of others and your religious and political views. Pluralism may force you to give up your beliefs. For example, initially, Gandhi viewed native peoples in South Africa as inferior to Whites and Browns. Later on, he gave up that belief. Monoism, on the other hand, is easy. With monoism, for you to be right, the other person must be wrong. For your way of life to prosper, the other way must be stopped by all means. Unfortunately, monoism gets you clicks and generates anger.

Guha’s books broadened my perspective. I’m glad I read his books. I strongly urge you to read those. Remember that the other side does not need to perish for your side to exist.

Mid Career Stuckness

2023-07-15T16:28:09-07:00

It may sound harsh, but let me offer a hypothesis — most of us get disillusioned and potentially stuck in mid-careers, and the most common cause is not learning about power and influence early enough. It is not a lack of technical knowledge and related competencies that makes one get stuck in their mid-career. What often gets you stuck is a lack of understanding of your power and the power of others, followed by not developing your sources of power and not using them to influence others to get things done. This stuckness starts with an improper understanding of “power” and “influence” and subsequently not recognizing and appreciating your power and that of others. Based on an ad hoc sample, I can tell that many people at work secretly hope that power doesn’t exist and stay away from it, which is sad and career-limiting.

Why is this usually a mid-career phenomenon? In the early-career phase, your technical competencies get you the job and, perhaps, a few promotions. By technical, I don’t mean competencies related to software technology but essential skills associated with a particular profession. But as you move further in your career, what gets you the job and helps you succeed is your ability to influence (and be influenced) to produce ever-larger outcomes. Your technical competencies still matter, but not as much as they do during your early career. That’s because larger outcomes appropriate for your job level depend on getting others’ support and contributions to do what you want to get done. Those others may be your direct reports if you’re a manager, your peers, your manager, their peers, and even people outside your company. Even more important, what you want to get done can’t just be anything — it must be consequential and desirable to the organization you work for. So, you need others to cooperate for you to be successful. This is influencing and takes non-technical work.

But we don’t consider influencing as work and don’t put in enough time and effort. There is a simple reason for that — most of us believe “we know what’s right” and “can get it done” but fail to recognize that succeeding at work requires creating willingness and cooperation between individuals with different motivations and different stakes in any outcome. You can only create that willingness and cooperation by recognizing and appreciating your power, that of others, and each other’s needs.

What’s the process of influencing called? It is leadership. As I mentioned in my previous article on followership, leadership is a process of influencing a group of people to produce some common goals. Once you ground this definition of leadership and put aside all the pop-leadership posts on social media (particularly LinkedIn, these days) on what a leader should be or do, it is easy to realize that leadership is an influencing process. When you fail to influence others, you naturally fail to do the job you’re hired for. As Paul O’Neill said in The Irreducible Components of Leadership, “With leadership, anything is possible, and without it, nothing is possible.” You can replace “leadership” in this quote with “influence” to get the point — with influence, anything of consequence to others is possible. Without it, nothing of consequence to others is possible. I added “of consequence to others” to highlight that the net outcome should mean something to you and others.

What’s the role of power in this influencing process? Before answering, let’s first put aside common misconceptions about power. Many associate power with being dictatorial, telling people what to do, being political for self-gain, being brash, contradicting, bullying, etc. We most commonly associate these with “not being nice.” Those are all signs of demonstration of power, but there is more to power than such perceptions. Let’s get back to definitions.

The most commonly used definition of power in leadership is that power is the capacity to influence others’ behaviors. For example, if I hold an enormous amount of money and dangle a big bundle of cash in front of you to make you dance against your volition, then my source of power is my wealth. You might consider such power bad or good depending on your beliefs (including ethics) and any harm done. Or, consider that you’re skilled at negotiating with difficult customers, and your boss relies on you and not your colleagues to deal with difficult customers. Then your source of power is your negotiation expertise. Power is essential to influence others, and you need to recognize and develop multiple sources of power to be effective at work.

With me so far? Assuming that you are convinced that power exists and can be a useful tool to influence others, the next question is, can you recognize your or others’ power? Why is this question relevant? Influencing is causing a psychological change in others, including their attitudes, behaviors, actions, etc., towards you. Then you better know what they want and where their power comes from. Also, ask yourself what you want and what power you have to influence others. Unless you recognize your power and that of others and what each wants, the influencing process feels like navigating without a compass. Let us review some commonly identified sources of power.

Your technical expertise: Your technical competencies in your professional domain are an important source of power for you to influence others. For those early in their career, their technical expertise is likely the strongest source of their power. As you look around at those early in their careers, what draws others to them is their ability to apply technical skills to get something others want. If your manager always allocates important and difficult technical problems to a few, it is most likely because your manager needs them done well. The more technical expertise you build, the stronger your influence on your manager. You will find this source of power called “expert power” in the leadership literature.
How you feel about others and others feel about you: When people see you as a role model, like working with you, want to be associated with you, or are drawn towards you for your charisma, you can influence them without formal authority. Treating others with enthusiasm, kindness, and empathy helps you influence others. Don’t make the mistake of always leading with data and using rational arguments to influence others initially. Consider Ted Lasso, the lead character in the TV show of the same name. He is not a football expert. You don’t see him rewarding and punishing people. Instead, look at how he feels about others and makes them feel. He is kind, transparent, ethical, present (well, mostly), considerate, etc., which allows him to unify the team and influence others. He never holds grudges, and often says, “Be a goldfish” (referring to a goldfish’s short memory). Check why others are drawn to him. The strongest factor of his influence is how he makes others feel. You’re far from building such power if you see yourself as important and others as lesser beings.
What you have and others want, and vice versa: Don’t ignore what others want and what you want. That awareness helps you understand the dynamics of power and influence. Recently, in a coaching conversation, someone asked me about how to influence their boss about a project. I asked that person whether they knew what their boss wanted. The answer was no. Therein lies the trouble — you’re seeking someone to support you. But you are not entitled to that support. Before approaching others for what you want, learn about what they want and see if you can be of help. Through that discovery, you might find a path to get you what you want while providing what they want. You may even drop your project if you find a better opportunity in what your manager wants.
Your role and title: I’ve coached several people in their mid-career who were feeling helpless despite holding important-enough titles at work. I cannot overemphasize this — your role and title give you the power to influence others. In the leadership literature, this source of power is also called “legitimate power” — it is the power that comes from others’ perception that you ought to do certain things or act a certain way. Usually, organization structures, roles, titles, and even social norms (like someone being called a “lead”) grant you this power. Say, your title at work is “Director of Engineering” or a “Distinguished Engineer.” You may think of such a title as a nice gesture your company granted you based on your accomplishments — like an honor bestowed upon you. But that’s a myopic interpretation. Your company gave you that title for you to use to produce results. You should therefore be comfortable enough to disagree with or disapprove of decisions, introduce new ways of working or procedures, or lobby for things because you’re a Director of Engineering or a Distinguished Engineer. Such actions and decisions may sound non-democratic and unilateral, but that’s expected of you based on your role and title. You’re not supposed to act helplessly and neglect to use your role and title to assert decisions and set direction. You should know your role and title-level expectations at your work and go beyond those.
Your place in the org hierarchy: No matter what you hear about flat organizational structures and everyone being equal, your place in the organizational hierarchy acts as a source of power. Who you report to and who reports to you matters in the influencing process. The reporting relationship may grant you access to information early and often, which you may use to influence others. It may sound unfair, but accept that the reporting relationship can play a role in the influencing process and deal with it rather than ignore it and regret it later. After all, even where people sit in the office (such as people sitting in the same row as the big boss vs. people sitting away) may indicate their importance and contribute to their power in the organization. Pay attention to such subtle factors to recognize your source of power and that of others.
Your relationships: Your network, including people you know and those who know you, can help you influence others. Relationships give you information, ideas, knowledge, and expertise that can assist you. Without such a network, how would people know that you exist, that you are looking for such and such, and that you can offer such and such? Reach out to others, and be available to others. Be a node in a graph and not a singleton.
Your ability to reward and punish: If you’re a people manager, you likely possess the ability to reward people (such as promotions, bonuses, equity, etc.) as well as punish them by not granting them what they need or, worse yet, taking away what they already have, such as by letting them go. Such a source of power is common in workplaces, and it influences employees’ behavior. For example, the current economy and recent massive layoffs continue to influence employee behaviors for fear of being the next to be fired. Operating under such fear does not seem fair, but recognize that such factors play a role in influencing behaviors.

These are some examples of sources of power. I listed these to help you see that you may already have enough power to influence others, to identify what you lack, and to recognize the power of others. Such recognition is a fundamental step in the influencing process.

Note that there is no single source of power that works all the time. You have to pick and choose based on the situation and the people involved in that situation. Each source will have different effects on others — some of which may positively influence them toward the common goal, and some may not. There is ample literature to learn more. Read Understanding the Dynamics of Power in Healthy Organizations for a gentle introduction to power. Get Jeffrey Pfeffer’s books like Power: Why Some People Have It and Others Don’t and Managing With Power. Better yet, find ways to take his classes at Stanford.

Finally, don’t consider that others can’t have power for you to have power. A healthy demonstration of power is not a zero-sum game. You can be humane, kind, empathetic, respectful and yet exercise power. Don’t let the pursuit of power take out joy. Being respectful, and making others, including those with fewer sources of power than you, feel better and important helps you and the other person. Your sources of power might vary, but an effective combination of the power of different individuals can produce exceptional outcomes.

Followership

2023-06-18T08:53:55-07:00

In organizational settings, we take leadership far more seriously than followership. Followership is a rarely used term in organizations. In my career, I’ve only heard of followership once when a wiser colleague proposed we include “willingness to follow” as an expectation for senior individual contributors. But unfortunately, followership and “willingness to follow” sound like sycophancy — doing what your boss wants while sacrificing your values and dignity. But there is more to followership than taking orders.

Followership is a Thing

A widely accepted definition of leadership in psychology is that leadership is a process of influencing a group of people to produce some common goals. The person influencing others is a leader, and leadership is the process of influencing. Wikipedia puts it well by incorporating power — “leadership can be defined as an influential power-relationship in which the power of one party (the “leader”) promotes movement/change in others (the “followers”).” The source of that power varies from situation to situation and person to person. In some cases, the source of power could be authority; in other cases, it could be softer aspects like expertise, trust, and persuasion.

Nonetheless, the people willing to be influenced in a given context for specific goals are followers. Followership exists wherever leadership exists. Leadership is incomplete unless followers are willing to follow. Yet, we rarely discuss followership. There is far more discourse about leadership than followership. For example, some years ago, my employer brought someone to conduct an all-day leadership workshop. The thesis of that workshop was to persuade that everyone is a leader and should act like one. That’s a fine approach. However, that workshop forgot to mention the other side: everyone is a follower too, and followership is a skill to acquire. Such skills help you accomplish your and your organizational goals.

In leadership, influencing others and being willing to be influenced coexist. Suppose you are not ready to be influenced. In that case, the relationship between you and the person looking to influence you will be ineffective, and you won’t be able to work together to achieve that common goal. This happens far more frequently at work than we acknowledge. As a manager, you sometimes run into situations where most of your team likes you, except some who are less willing to ride along. As a follower, you may not get along well with your manager, while your peers seem okay with that manager. What do you do then? Let me focus on the follower side of the issue in this article.

Followership at Work

In organizational settings, for the most part, the org design determines who follows whom and who the leaders are. Leadership and followership are roles we play at work. These roles usually start with the org design and hierarchy, and influence comes later. Good managers realize this and begin building relationships with their team from the start, but inexperienced managers may take their leadership and others’ followership for granted. Since they are the anointed “leader,” they expect their “followers” to follow along. Such managers may have good intentions but may tap into their positional power to influence their team, ignoring the softer aspects of power. What do you do when you’re caught in such situations?

In such cases, you could leave the team and the company and find a job elsewhere. You may get lucky and find a better manager elsewhere, but remember that your choice of manager could be one reorg away. As a Gallup survey said, good managers are rare. You can’t keep changing jobs until you find a manager you can get along with. Your choice of manager is not entirely in your control. Moreover, choosing one’s manager is a luxury many people don’t have. Understanding followership and developing followership techniques could help you ride along and grow in your career.

First, most of us construct inflated views of ourselves based on past successes. When we encounter a manager who doesn’t share our opinions and ways, we struggle to influence and be influenced. Hence, put your self-inflated view of yourself aside and figure out your manager. Listen and build empathy. We expect our managers to listen and be empathetic — play the same card toward your manager. Figure out your manager’s context, strengths, weaknesses, how they operate, and their motivations. It takes work, and that’s one of the aspects of managing up.

Second, followership requires adaptation from your views and your way to another person’s views and practices. You should be willing to put aside your views and ways and enthusiastically follow someone else’s views and ways in the context of work. You might develop a better relationship with your manager in that adaptation process. Enthusiasm is vital for survival (i.e., being in the game) and growth (i.e., winning). Don’t be a grumpy and disgruntled follower. In his 1988 article In Praise of Followers, Robert Kelley puts such followers in his “alienated followers” quadrant. In his description, such followers “are critical and independent in their thinking but passive in carrying out their role.” I was in that quadrant more than once in my career. Not a worthy place to be.

But in this adaptation process, you might discover that your manager’s motivations, values, or behaviors are toxic and damaging to you and the rest of the organization. In that case, by all means, you should quit.

Third, put aside your expectations of what a good manager should be. Be practical. Such expectations can prevent you from understanding your manager and building a relationship. Since good managers are rare, the odds of you working for a perfect manager are slim. One of my friends once said (quoting someone else) that “You don’t have to like someone to work with them.” You don’t have to like your manager’s values and leadership beliefs. Following does not have to require sacrificing your values. In one difficult situation, my coach once said, “Your manager is acting like a child.” Her point changed my perspective. When I imagined that manager as a child throwing a tantrum, my views became clearer, and I could build coping strategies to work with that manager. In any case, a job is a business relationship between you and your company. So, get off the high chair.

To develop personal resilience, you must learn to play leader and follower roles with various people. Don’t always walk away when the going gets tough. As long as you remember that leadership and followership are roles you play in an organizational setting, and tone down your inflated views of yourself and inflated expectations of your manager, you can build a better relationship with your manager and greater resilience. Don’t take yourself too seriously. Be humble. Be proactive and a problem solver. Raise your hand. Get feedback and dissent constructively. Don’t be a grumpy and passive follower. Remember that followership, just like leadership, is essential for goal accomplishment.

Leadership Masterclass for Individual Contributors

2023-05-29T22:54:06-07:00

I’m excited to announce that I’m offering a limited-edition masterclass for individual contributors (Principal, Distinguished, Staff, or Senior Staff levels) on June 24th, 2023. The sole purpose of this masterclass is to help you improve your leadership skills so you can lead with joy and not frustration.

If you want to attend, sign up for Leadership Masterclass for Individual Contributors. Also, forward this to others who may be interested in this class.

This masterclass is limited to 10 individuals, offered online. The class is free, but each participant must contribute at least 300 USD to a charity of their choice to be eligible to join this masterclass. I will ask you to provide proof of your contribution. Based on interest, I might offer repeat sessions.

Attendees must have at least ten years of experience as individual contributors, as the material we cover may not be helpful for the less experienced.

Here is why I’m conducting this masterclass and focusing on individual contributor roles.

In most organizations, senior individual contributor roles are some of the toughest to succeed in. Organizations expect individuals in these roles to demonstrate several aspects of leadership, such as setting and influencing the direction of several teams, driving consensus, dealing with conflict, establishing standards and procedures, and leading and mentoring others. For most, developing the skills to meet such expectations without the ability to manage and control resources (like people and budget) seems like a frustrating task. Even more frustrating, organizations rarely facilitate leadership development for individual contributors.

This program includes two parts.

Part 1: A two-hour class

In this part, we will broadly cover three topics:

Nature of individual contributor roles and potential leadership obstacles
Leadership competencies for effectiveness and continued growth
A framework to self-identify areas for development

The material is based on my personal experience, lessons observed from coaching several individual contributors, and a few of my lectures in private settings.

Part 2: A 45-min one-on-one coaching session

In this session, we will discuss your context, unique challenges, and potential opportunities. We will also develop and review a self-development plan.

Boss Test of Goal Crafting

2023-02-21T21:54:16-08:00

Here is a basic test to know if you’ve picked the right goals for your team.

Start with a simple question — Can you explain your goals to your boss and their peers in simple terms? If yes, you may be heading in the right direction, but keep going.

Then ask if your boss can explain and sell at least one of your goals to their boss. If so, that’s better. You’re doing something right and have a higher chance of being relevant to the organization.

But don’t stop there. Can your boss’s boss explain that goal to their boss? If so, you’ve made it. You’re very likely working on the right things and will potentially survive and advance in your career, provided you deliver on those goals.

This is my “boss test of goal-crafting.” Some of you reading this article may like these questions and agree with them. But many of you may feel that this is just doing what your boss wants regardless of what you and your team think you all should work on. After all, isn’t it your job, as the servant leader of the team, to support and help your team with what they want to work on?

Yes and no. Yes, you should help your team and support what they would like to get done. But no, you should not work on things just because you or your team thinks they are right. You will be doing a disservice to your team if you cannot channel their energy toward what your organization needs, which is what the boss test can help you figure out.

But what should you do when the stuff your team is dealing with is in shambles and needs fixing? Should you dedicate all your energy to fixing it even if your goal fails the boss test? Alternatively, how about when you and your team know what to do, and your boss is incompetent? Maybe, but tread carefully, and you had better produce a surprisingly positive outcome. Under normal circumstances, though, I wouldn’t do so without first crafting a case to explain to my boss and their peers. After all, if the stuff your team is dealing with is in shambles, your boss had better care about it. And it is your job to craft the right story to make them care.

Does this rule apply to even those areas that are not customer-facing or not in the critical path of the value-generation pursuit of your organization? Absolutely, and even more certainly, yes. The farther you are from value generation in your organization, the more critical it is for you as a manager to craft objectives that matter to value generation. As a manager, finding out what’s important is your job. If you don’t know, you’re not networking sufficiently with the people above you, or you’ve not learned how your organization’s business model works. Fix those problems first.

(I’m inspired to write this article tonight as one of the managers at work asked me today to clarify her understanding of the business value of her team’s work. Kudos to her.)

Authenticity and Acting

2023-01-28T12:37:07-08:00

Acting seems like the most inauthentic thing to do — like putting an external façade to manipulate the audience to get the desired effect. Most people learning to lead don’t associate acting with leadership. I balked when I opened the first chapter, “Presence: What Actors Have That Leaders Need,” of Belle Linda Halpern and Kathy Lubar’s Leadership Presence a few years ago. In that chapter, the authors described presence as “the ability to connect authentically with the thoughts and feelings of others.” I was okay with that definition, but it did not make sense to compare it to acting. For me, acting was an external façade, while authenticity was conforming your external behaviors to your internal state. I thought that acting was the opposite of being authentic.

I was not alone. Just last week, an ex-colleague of mine made a similar remark, comparing acting in the context of leadership to disingenuousness and manipulation. Some time back, I read Jeffrey Pfeffer’s Power: Why Some People Have It and Others Don’t. In that book, he wrote a chapter on “Acting and Speaking with Power.” I did not like his style or substance when I first read it. My firm belief in authenticity made me dislike the association between leadership and acting. That began to change as my understanding of authenticity and leadership developed. Let’s look at some examples.

Watch Obama’s speech on the night of the New Hampshire primary in 2008. This is the speech where he rallied the audience and the nation watching television with his “Yes, We Can” message. Watch the video and follow the emotions of the audience.

Obama started the address with light-hearted laughter, but then you will notice concern, anger, disappointment, and optimism in his speech. Also, see the change in the pace of his delivery. He slowed down for applause sometimes, but other times he continued to build up the audience’s emotions. As the speech continued, notice an emotional harmony develop between the audience behind the stage and Obama. Then see the audience erupt in a chorus when he started the part of the speech with “Yes, We Can” at about the 10-minute, 30-second mark. You will notice that Obama was entirely in control of his and the audience’s emotions. Your political affiliation aside, would you doubt Obama’s authenticity? Would you call it acting?

If you’re not convinced, let’s watch Martin Luther King Jr.’s “I Have a Dream” speech, which he delivered near the steps of the Lincoln Memorial on August 28, 1963.

In that speech, MLK seemed aware of the historical context of his remarks. His delivery too had a purpose. Listen to him as he switches from describing the “sweltering heat of oppression” to “I have a dream” without a pause at the three-minute mark of the video. Listen to his voice quiver when he says, “not be judged by the color of their skin but by the content of their character.” A few seconds later, listen to his intonation change and pitch rise when he says, “down in Alabama.” Would you doubt his authenticity? Would you call it acting? He knew that the address was a defining moment in the history of the civil rights movement. A New York Times article written on the 50th anniversary of that speech described that speech as a “testament to the transformative powers of one man and the magic of his words.”

I hope you are convinced by now. Acting is what great leaders do. They choose the right emotions in front of the right audience at the right time for the right purpose. Not doing so dilutes the purpose, and the leader may not get that opportunity again.

Imagine Obama’s state of mind when he made the “Yes, We Can” speech. He didn’t win the primary that day in New Hampshire. He lost to his rival, Hillary Clinton. That speech was supposed to be a concession speech. Yet, that was one of Obama’s best speeches. Consider what might have happened if Obama said, “Good night folks, we lost it,” and left. For those curious, read the backstory of this phrase in The Washington Post.

That’s the hallmark of great leadership. Great leaders tune their emotions to the need of the hour but do not let their feelings ruin the cause. Note the difference between allowing your internal state to guide your behaviors vs. letting the purpose and audience tune your inner state.

That’s why great leaders can show anger in one meeting, kindness in another, and frustration in another. Each emotion has a purpose. When you don’t tune your emotions and words for the audience and purpose, you are letting opportunities for positive change go. This is why self-awareness and a calm internal state matter for leadership. That’s when you can choose emotions purposefully and time them to the right audience. Doing so does not make you inauthentic. It makes you situational.

Shaping Your Authenticity

2023-01-15T21:05:47-08:00

There are many genres of leadership theories. One of those is authentic leadership. Regardless of how leadership psychology researchers define it, the most common view of authentic leadership is leading while being true to oneself and acting according to one’s feelings, emotions, and values. Unfortunately, most of us make career decisions and exhibit behaviors based on this perception, and we shortchange ourselves as a result. Let me give a couple of examples.

You believe in empowering others, and you run all your meetings and interactions democratically to let everyone speak, solicit opinions, and then decide based on consensus. It feels good to do it that way. But then you struggle when consensus does not emerge quickly or when dissent and strong voices take over the forum. Because telling people what to do and directing them runs counter to your belief in empowering others, you remain uncomfortable even when the situation needs direction for the right outcome.
You avoid taking certain roles because you consider them political. For example, you may decide not to choose the managerial career ladder because you believe managers have to deal with politics and make choices counter to their beliefs. Or you may not show up and influence certain forums or people because you think something about them is political or against your core beliefs.

There are also many people who chuckle when referring to authenticity. They believe that the world is unfair and that you must play politics to be in and win the game. I once worked with a leader who believed others were out to get him. For such people, authentic leadership is taking the high road and not accepting reality.

Furthermore, authenticity gets in the way of getting things done for most. But it does not need to be that way. You can be authentic and still deal with reality to evolve to be a better leader. But for that to happen, you must amend how you think of authenticity.

I recently came across an excellent article by Herminia Ibarra of the London Business School, who wrote the following in her The Authenticity Paradox:

Because going against our natural inclinations can make us feel like impostors, we tend to latch on to authenticity as an excuse for sticking with what’s comfortable.

That’s right. One of the flaws in our thinking is our belief that our authenticity is innate in us, as though we’re born with certain beliefs and values. We then limit ourselves to certain behaviors and possibilities and remain in a comfort zone that does not challenge our beliefs and values. We filter out possibilities because of the potential conflict with our authenticity.

There is a better way, which I learned a few months ago. As part of my coursework at Penn State, I had a chance to examine my beliefs and probe why I believe in them. This exercise was based on Bill George and his team’s 2007 article Discovering Your Authentic Leadership and several other papers on the psychology of leadership. Bill George is best known for his books like Authentic Leadership and Discover Your True North. I never read his works until I was forced to read his 2007 article as part of my coursework.

The key lesson from my exercise was that our self-stories strongly influence our beliefs and values. Instead of treating our authenticity as innate, consider that the stories we tell ourselves shape our authenticity.

I can give an example. People who know me or work with me closely know that I don’t pick battles quickly. That’s because certain situations I saw made me dislike interpersonal conflict and feuds. Because I disliked conflict, I rarely employed conflict as a tool of engagement. I instead chose the belief that conflict is not good and mostly avoided it. Consequently, my conflict muscle remained weak. In certain situations, conflict may be the most effective tool, for example, telling someone their behavior is not cool and they must stop it. Of course, that person would likely disagree with me, and I must be prepared to deal with it.

I would encourage you to do a weekend exercise: write down three or four of your most important beliefs or values. Then ask yourself why you picked those, and write down those stories. Then put yourself in situations that challenge some of those beliefs and values, and let those experiences influence you. For example, if you don’t believe in telling people what to do, put yourself in a situation where you have to do that to be effective. If you feel certain people are political, try to find a situation where you need to interact with them to get something done. See what happens. Let yourself change.

In other words, don’t think authenticity is something you’re born with. It’s not fixed. It’s something that you shape with experience. But you need to broaden your experiences to shape your authenticity. Break the mold.

Five Nuggets from 2022

2023-01-08T22:29:51-08:00

Hello. Welcome to 2023. The holiday break gave me ample time to reflect on 2022 and distill some insights and lessons learned. Here are the top nuggets I carried from 2022 into 2023.

Deal with interpersonal problems with two questions — What are you observing? How are you feeling? Once you deal with the second, you will be better equipped to deal with the real problem with positive perspectives. We often mix up both and tend to complicate interpersonal situations further. Practice asking these questions, and it gets better over time.
Emotions are not facts. In most situations, your emotions depend on the perspective you choose to look at a situation. See if it is possible to change your emotions by changing your perspective. More often than not, you will find a better perspective.
Observe more and react less. It makes you see more and hear more. There are exceptions, of course, but those should be rare. Doing this can be particularly hard when your situation is not going in the direction you were hoping it would go or you’re uncomfortable with some aspects of that situation. In such cases, I tend to interrupt or steer the discussion in a different direction. The trick, I realized, is to pause more to see more and become comfortable with letting situations evolve a bit, regardless of what you would like to see. Let things pan out a bit before reacting.
Prefer relationships over issues. Work extra hard to build relationships before approaching others to solve your problems. Know them as people and understand what they are trying to do.
A secret to power in leadership is detachment. Not detachment from the outcome or others, but separation from yourself and your way. When you put what you want aside, you increase your capacity to observe, listen, and influence others much more profoundly. Doing this in real-time can be difficult, but try it out.

This is an ice-breaker article. I’ve taken a break from publishing articles on this website for over six months. I didn’t stop writing during that time, though. I read and wrote extensively in classroom settings. That’s right. Last spring, I decided to go back to school. After some research and the help of a close friend, I settled on two programs: the Stanford LEAD program and the Psychology of Leadership degree at Penn State. Both started in September 2022. In the beginning, I found it challenging to manage these on top of my day job. But as time went on, I got into a productive pace and began to enjoy what I was learning.

I can’t wait to write frequently in 2023. Here we go, 2023.

Building Career Resilience

2022-07-30T22:26:32+00:00

Several weeks ago, I was having a coaching conversation about career choices with someone. The question was whether to choose the individual contributor or the manager career path. That particular individual tried both and was contemplating what was next.

Our conversation centered around a few related questions:

What types of work or roles energize you? What kinds of work or roles drain you?
Should you favor roles that minimize the energy drain? For example, if some managerial work drains you, should you go back to designing software and writing code?
How to build career resilience?

Aside from letting that individual solidify their choice, that conversation allowed me to answer a few questions that I was facing myself about (a) building a long career with opportunities for personal growth and the growth of others around me and (b) letting my leadership beliefs, behaviors, contributions and impact define my working identity, as opposed to the logos of the companies I work for defining me.

My recently-ended sabbatical also allowed me to think long and hard about these questions, though I could not formulate these questions clearly until that coaching conversation. That’s why I love coaching conversations. In addition to letting the other person explore their career and leadership journey, they help me learn and improve.

From that conversation, we drew a few conclusions.

First, one must know where their energy comes from and what drains them.

Each of us learns to get energy from certain activities. These are usually the activities we become skilled at early in our careers and build some success. Positive reinforcement from such success makes us enjoy and do more of those activities. It’s like a deer always going back to certain ponds for drinking water. Water is there, it is fresh, and you enjoy it. Nothing wrong with it.

But making career choices solely based on where you get energy from can be a mistake, particularly if you want to remain useful and want to grow in your career. Here is why.

Eventually, you mistake such activities as things you are innately good at and continue the positive reinforcement loop. Others then identify you with those activities and may want you to continue those, thus perpetuating that loop. It might feel good when others pull you into conversations because you’re good at problems they want to see solved.

But you may eventually get bored and upset that you’re not getting opportunities to do other activities. I have had this happen to me in my career and have seen it happen to others.

A recent episode of Muriel Wilkins’ Coaching Real Leaders podcast also reminded me of this point. In this episode, the participant, Krish, gets pulled back into a role/function that he is good at. He feels he would be good at other things and is upset that he is not being called to do those other things. Though this episode does not delve into where Krish gets his energy from, the conversation in this podcast makes it clear that he gets energy from diving deep into certain kinds of problems. As a result, those problems keep drawing him back even though he wants to be called to tackle other types of issues.

You may find that what made you strong and gave you energy in the first place can get you stuck. So, pay attention to where you get your energy from.

This brings me to the second conclusion. Learn to draw energy from activities that drain you. This may sound counter-intuitive at first. Why bother learning to draw energy from things that you don’t enjoy? Here is why.

Think of where you are and where you want to be in your career a few years from now. Can you embark on a journey from here to there by continuing to do the same activities that energize you? In most cases, the answer is no. People operating at that future level most likely draw energy from a different set of activities from what you currently do, and those activities may look draining to you now.

Consider, for example, the process of negotiating, influencing, and coercing others to get something done. If those activities drain you and you prefer to avoid them, your career will likely get stuck, as it does for most people. Hate spreadsheets? Guess what? Most leadership roles at tech companies deal with spreadsheets. Don’t like conflict? To progress in your career, you’ll have to learn to deal with conflict to get things done. Not comfortable speaking in front of large groups? Most senior roles need you to address groups of people to influence, inspire, and create movement. Don’t like working with people? Interpersonal skills are essential for career success and growth.

Third, the ability to draw energy from diverse sources can help you build a long and fruitful career. Look at nature around us. Organisms that can live in diverse conditions survive change while others perish. This is also true when investing. You invest in a diverse portfolio to minimize risk. The same is true for our careers.

Looking at the example of Krish above, how do you prevent what used to be your strength from getting you stuck? You diversify. You learn to draw energy from other kinds of activities early and often in your career. Diversity builds resilience and creates options. As you diversify, you will discover opportunities you didn’t know existed.

For instance, if you are a software developer good at coding certain problems, consider developing related technical or non-technical skills. Consider learning to influence others. Write. Speak at conferences, or teach others. Run a project. If you are a manager managing a particular problem domain, diversify into other disciplines involving different people, organizational, and technical complexity. Learn about finance. Develop product management skills. Run cross-functional programs. Support customers.

In the beginning, your attempts to diversify may drain you. You may question your decision to diversify and want to return to what you were doing before or your prior energy source. You might miss your prior energy sources as you’ve not yet figured out how to draw energy from the newer sources. But eventually, you evolve.

How did that conversation help answer my questions (a) about building a long career and (b) owning my work identity? The answer is to (a) get uncomfortable and learn to do things that may be draining at the moment and (b) do a variety of things at work and outside work.

Inputs and Outputs

2022-06-26T11:15:15-07:00

Focus on what you can control, but don’t let angst and frustration keep you from discovering what you can control.

One of the best books I read last year was Colin Bryar and Bill Carr’s Working Backwards. Of all the chapters in this book, Chapter 6 on “Metrics” was the most influential for me. It clarifies the difference between inputs and outputs. Gaining weight is an output. Eating healthy and staying active are inputs. Last week’s Supreme Court ruling reminded me of this book, as the verdict was just an output: the culmination of years of strategy, planning, and execution of certain controllable inputs.

Outputs make you emotional. You get upset when you gain weight. You feel happy when the scale shows a smaller number. You get upset when the stock market is down, and you suddenly seem less wealthy. But you feel glad when the market goes up, even though you know you did nothing to influence the stock market.

Inputs, on the other hand, take strategy and are actionable. Staying active and eating healthy are controllable inputs. These two inputs may or may not get you to lose weight, but at least they are two commonly used inputs you can control to influence the outcome. If those inputs don’t produce the outcome you want, then you have to find other inputs or at least gain an appreciation of your uniqueness. Similarly, being frugal and diversifying your investments are two inputs you can control to build wealth.

Along the same lines, we all are upset now since the Supreme Court struck down Roe v. Wade in the Dobbs v. Jackson case last week. Unfortunately, as legitimate as our feelings are, anger and frustration are unlikely to reverse last week’s Supreme Court ruling.

This ruling is the output of a complex legal and political system with a few controllable inputs that some people understood well and systematically manipulated over the years. Democrats mistakenly continue to focus on just one controllable input, which is to vote. President Obama made the same appeal last week, saying that we’ve got to elect officials committed to doing the same. But there are more inputs to control, and it takes a strategy and much more patient planning and execution to manipulate the inputs. Republicans understand this, but Democrats continue to preach simplistic inputs. We’re less likely to influence the future if we fail to understand the system producing the output and learn about all potential controllable inputs.

In the case of Dobbs v. Jackson, the controllable inputs the right used include gerrymandering, voter suppression laws, installing favorable county and state electoral officers and policies, and choosing specific cases to bring to the Supreme Court when the timing is right. The right has been shaping these for years, with several intermediate phases like Obama failing to appoint Merrick Garland to the Supreme Court in the final year of his presidency, and Trump becoming President and getting the chance to appoint three justices. All of these finally culminated in a decision that took away the right to abortion, which had until then been protected by the right to privacy under the Fourteenth Amendment.

To change a system, we need to understand how it works and how to control it. If we don’t understand the system’s behavior and expect outputs we like to magically manifest, we will get nothing but disappointments. Outputs drive angst and emotions. But inputs need a strategy and patience.

Work as a Socio-Business Relationship

2022-06-05T08:07:17-07:00

In recent days and weeks, we have heard about layoffs and companies deferring or rescinding offers. CEOs and comms teams are doing their best to message the public and their employees to explain the rationale. Some leaders are brash and unapologetic, like in the case of Tesla. There is no doubt that these actions are devastating for the affected people, and some of them raise ethical concerns. We can blame companies and their leaders for these actions, but I think the time is right to remind ourselves that our relationships with employers are socio-business relationships. That’s the reality, whether we like it or not.

As you work at any company, you build relations with people, learn, grow, develop others, make money, and sometimes make lasting friendships. You spend most of your waking hours at work or thinking about work. You let inter-personal relationships and work problems get under your skin even during non-work hours.

Yet, we mistake our relationship with an employer for a privileged relationship where we are entitled to certain compensation, benefits, well-being, and continued growth until we choose to end the relationship.

But that’s a mistaken view. Just like you can reject an offer before joining a company or leave your job when a better opportunity knocks on your door, companies too can walk away from their end of the bargain at any time. Though it does not seem fair and we have room to make it better, these days our working relationships with our employers are weaker than other relationships, like your contract with your landlord or your utility company.

But most of us don’t take this seriously. Consider some of the mistakes we make:

You don’t keep an updated resume, and you claim that you’ve not written one in X years. Remember that your resume is your story. If you’re not tending to your story, who will?
You don’t pick up the phone when a recruiter calls. But you are your talent agent. If you’re not curious to know why a particular recruiter called you, how do you know what opportunities you’re not exploring? How do you know if you, as a product, are still relevant to others?
You respond to a recruiter saying that you’re working on such an important problem at work that the timing is not right for you to explore other opportunities. What information do you have about your employer to believe that the employer thinks the same way you do?
When a recruiter or a potential hiring manager asks you what you’re looking for in your career, you give vague-sounding answers like “I want to solve good problems,” “help the company grow,” or “lead others to accomplish their potential.” These are not wrong answers but are so generic that they apply to most people. But what is your purpose and vision for yourself? What are your criteria for designing and making your next career choice? What is your vision for the role?
You don’t take the time to summarize and document your accomplishments and lessons learned. If you don’t, who will do it for you?
You are so immersed in what your current job requires of you that you don’t take the time to invest in yourself continually. Your investment ideas are waiting in unread browser tabs and TODO projects that you never have the time to start and finish. Who will do it for you if you don’t continually invest in yourself?
You have no idea of your brand. Your story sounds like everyone else’s, and you have not thought of how to differentiate yourself.

There are more, and as a CEO of a one-person enterprise, it is your job to build your career on realistic assumptions.

On the flip side, let us not assume that our relationships with employers are pure business relationships. People that treat their jobs purely as sources of money in high-tech companies often fail to build social connections to develop themselves in their careers.

I will leave you with one tidbit. Last week, I asked a colleague for advice about navigating the company I joined. He said, “ROI.” I asked, “What do you mean?” He said, “Relations over issues.” This is powerful career advice. His suggestion sums up why our relationships at work are social too. Once you build social relationships with others at work, you can work together well on issues to become a team player. But if you start with issues before building such relationships or do not pay attention to building relations, you may not succeed.

That’s why I like to remind you that our relationship with an employer is a socio-business relationship. The sooner we realize this dual nature in our careers, the more resilient we can become in managing our careers.

On Org Design Vulnerabilities

2022-05-08T15:58:07+00:00

As I reflect on my experience working with or for some experienced managers in large companies, I realize how good some are at developing robust org structures that survive leadership changes or other challenges. That experience taught me a few valuable lessons about the implications of poor org designs. While I would defer to Matthew Skelton and Manuel Pais’s 2019 book Team Topologies for a collection of patterns on org design, I want to share what I learned.

The first lesson is that good org design matters more than the attention it sometimes gets. People who enjoy flat structures or those who fear delegation and prefer command-and-control, or those that do not wish to rock the boat seem to focus less on their org design. Reflecting on successful designs, I noticed that their orgs didn’t dissolve when managers left or when their org picked up new initiatives. Their orgs had a cohesive purpose for their team to identify with. People outside the org could make sense of what the org did and did not and found it usually easy to figure out how to engage with them. Some people in such orgs built deep technical expertise and grew. Such orgs also created enough leverage to pick up and deliver large initiatives.

One of my mentors taught me this litmus test — would your org survive structurally if you leave? If the answer is no or “maybe not,” you have some structural issues to resolve.

The second lesson is that poorly constructed orgs will eventually disintegrate. I’ve seen complicated org structures that got broken up or deeply reorganized due to a simple change like the departure of the manager of the org. Though some managers deftly managed large, tangled org structures, their orgs were challenging to understand and engage. Those managers spent more time managing people and addressing communication challenges than delivering outcomes.

An org collapsed when the manager left in one particular case, and changes cascaded down the hierarchy. In another case, after their manager left, some of the team members left as they got disillusioned about the org’s identity and purpose, and the org was eventually broken up. Of course, the work suffered too, and rebuilding the work took time. In a similar case, an org was designed around a particular person’s skills and interests, and when that person left, the team disintegrated.

As an extension of the first lesson, the third lesson is that, even though you took care of designing a healthy org structure, you and your org may eventually suffer the consequences of a poor parent org design. In such cases, your role may be affected when your boss leaves or someone else tries to fix the org. Your boss’s role may get affected too.

Consequently, ongoing work and communication channels may suffer. Even though cleanups are essential and good, big org cleanups are expensive due to sudden changes in dynamics and communication paths. Periodic incremental structural adjustments are better since they allow people, dynamics, and culture to keep pace with changes.

The same litmus test applies to your parent org too. Would the parent org survive if your manager leaves? You may not have much control over this challenge, but you should watch out.

Career Choices and Business Models

2022-04-29T16:12:13+00:00

When considering career changes, your first criterion for choosing a company should be its business model. Read more to learn why.

There are multiple parameters for choosing between career options. Popular parameters include the total compensation, the type of work you will be doing, the team and the manager, the title and visible perks, a company’s social status, or even their declared purpose (“we’re the world’s blah”). But we generally overlook what’s underneath all these: the business model. I had to think hard recently about my parameters. The first criterion I came up with relates to the business model.

What is a business model?

I will use a simplified description of a business model here. A company’s business model answers two questions. First, how and where does the money come from? Second, how and where does the money go?

Let us look at some examples.

A company running an online marketplace like eBay, Amazon, Expedia, Airbnb, Etsy, etc., makes money when sellers come to the marketplace to list products and services, and buyers come there to buy those products and services. Aside from building the tech to run the marketplace, the company may have to spend money to get participants, i.e., buyers and sellers, to the marketplace and keep them hooked. Amazon has been doing this for over a decade through capabilities like Prime and Fulfillment by Amazon. Online travel agencies like Expedia establish business partnerships with hospitality and travel companies to bring them to the marketplace. They also incentivize their past customers to return. But a big chunk of the money goes towards acquiring traffic through search and marketing channels like Google. Other e-commerce and media companies have similar customer acquisition strategies and handle associated costs.
Social platforms like Facebook, Twitter, TikTok, etc., usually make money in two ways: when users see or interact with ads and sponsored promotional content, or when the company sells access to the users’ data and insights. To keep this business model healthy and robust, such companies need to bring large numbers of participants (content creators and users) to the network and find ways to increase the users’ time on the platform. The more engaged the users are, the more likely advertisers are to promote their content. So, these companies spend vast sums of money on building the best technology possible and then tuning it to bring and keep users on the network. Since such companies depend on getting hundreds of millions of users to their platform to be successful, they provide opportunities to work on solving complex technical problems of scale.
The business model of a media streaming company like Netflix is to keep adding subscribers who watch shows, series, and movies. Such companies spend money to license or produce that media so they can keep acquiring subscribers.

Question 1: Is the business model robust?

Now that we know what a business model is, the first question is whether the business model is robust.

It boils down to learning about (a) what the company has to do to earn more money and (b) what they have to do to spend less to make that money. For example, if a business model requires paying other companies like Google or Facebook large sums to acquire and keep customers, their business is not likely robust. Their customers might stop coming once they stop spending marketing dollars to reach those customers. Such marketing spending is a big problem for many e-commerce companies. Innovating business models to decouple from such spending is non-trivial. Very few companies have succeeded at it, and those that have not tend to stagnate.
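One rough way to make points (a) and (b) tangible is back-of-the-envelope arithmetic. The sketch below is in Python and uses entirely made-up numbers for two hypothetical companies. The function names and the two ratios are my own illustrative assumptions, not standard financial metrics.

```python
def paid_dependency(revenue_from_paid_channels: float, total_revenue: float) -> float:
    """Share of revenue that arrives through paid acquisition channels."""
    return revenue_from_paid_channels / total_revenue


def margin_after_acquisition(
    total_revenue: float, cost_of_goods: float, marketing_spend: float
) -> float:
    """Fraction of revenue left after cost of goods and acquisition spend."""
    return (total_revenue - cost_of_goods - marketing_spend) / total_revenue


# Hypothetical company A: most customers arrive through paid search and ads.
a_dependency = paid_dependency(80.0, 100.0)
a_margin = margin_after_acquisition(100.0, 60.0, 30.0)

# Hypothetical company B: most customers return on their own.
b_dependency = paid_dependency(20.0, 100.0)
b_margin = margin_after_acquisition(100.0, 60.0, 8.0)

print(f"A: {a_dependency:.0%} of revenue is paid-for, margin {a_margin:.0%}")
print(f"B: {b_dependency:.0%} of revenue is paid-for, margin {b_margin:.0%}")
```

With these invented figures, company A keeps far less of each dollar and would lose most of its revenue if it stopped paying for traffic. You will rarely get exact numbers as a candidate, but even rough estimates from public filings or interview conversations can tell you which of the two shapes a company resembles.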

Why is the health of the business model important? Many reasons.

First, when the business model is not robust, you may find it tough to connect the dots between your work and the business outcomes. You work on what is supposed to be the right initiative to grow the business, but you may not see a timely measurable impact. What happens when the critical business metrics don’t seem to respond to your work? You may become lethargic and develop apathy. Status quo wins.

Second, companies with weak business models tend to have poor incentive structures. Companies can’t pay you top dollar when their business model does not justify spending on attractive incentive structures. Though money is not everything, such companies are less likely to win top talent.

In contrast, a healthy business model can afford to create attractive incentive structures for its employees. The math is simple — when the business model is robust, the company will have more money to spend on incentives.

Third, equally important, strong tech companies emerge and remain so as long as their business models are healthy. A robust business model attracts and retains motivated and talented people who will push to improve technology and culture for self-actualization. Of course, courageous and innovative leadership is also essential, but not without a robust business model.

I argue that your first criterion for choosing a company should be its business model. A robust business model has the potential to create a flywheel to bring motivated and talented people, who will then raise the bar on technology and culture and are motivated to innovate the business model to keep it going strong.

But how do you get to learn their business model when you’re interviewing? You research online. You ask the interviewers or others. Understanding the business models of platform companies can be challenging due to the multitude of complicated and non-linear producer-consumer interactions. Still, you should ask and develop a point of view.

Question 2: How is the position connected to the business model?

The next question is how well the position you’re interviewing for is connected to that business model. The answer to this question can shed light on growth potential and incentives. More importantly, the answer can motivate you to work towards enhancing the business model and thus grow in your career. When interviewing, don’t just ask about your team and its peers. Also ask how the role can help the business model. If you are interviewing for back-end teams or shared systems, learn how those systems support the business model. Such knowledge over time will help you identify the right problems to solve.

Question 3: What about vision, mission, purpose, values, etc.?

Those usually inform you of the company’s intended purpose and the desired behaviors of their culture. But in practice, their primary goal is to manufacture consent with a business’s stakeholders like employees, partners, and even the society at large.

I’m choosing the phrase “manufacturing consent” without taking a stance on whether manufacturing consent is good or bad. It’s just an effective leadership tool. President Bush manufactured consent through his “axis of evil.” Obama, too, manufactured consent with his “Yes, we can.” So did Trump with his “Make America great again.” You may or may not like what consent they manufactured and what decisions they made based on that consent, and that’s your choice.

For instance, a company with a stated value of being customer-obsessed could keep everyone oriented to improve customer experience and not take decisions that hurt the customer experience. They could also use the same value to bulldoze purportedly customer-centric choices even when those are unfair to other stakeholders of the business model. Examples include arm-twisting suppliers or employing strategies that hurt some common good through externalization of costs, low wages, questionable labor practices, or some other impact on less visible components of the business model. Therefore, my suggestion is to be aware of a company’s stated vision, mission, purpose, values, etc., but not join a company solely based on those.

Question 4: Do you support that business model?

The final question to ask is whether you support that business model. Your response will likely be grey and not a binary yes or no. Here is why.

No business or individual operates independently. All work in a networked economy — producing, procuring, consuming, or offering goods and services. In a networked economy, establishing whether a business model is just or not can be challenging — very few companies escape from doing collateral harm to some segment of their stakeholders. As you peel the layers of any business model, you might discover some uncomfortable parts that may not sit well with your beliefs or values.

Let me take an easy example — Starbucks. Its mission is “to inspire and nurture the human spirit — one person, one cup, and one neighborhood at a time.” You can question whether a cup of a caffeinated drink inspires and nurtures the human spirit. Or you might decide not to support their business model due to their environmental impact. Or you may admire how well they standardized procurement, logistics, and stores to offer quality beverages and serve them in nice-to-hangout locations worldwide while still investing in the growth of their employees. Only you must judge whether their business model is just or not and whether you agree to work for Starbucks or not.

I don’t believe we should shield ourselves from the discomfort of knowing the collateral impact of any business model. You may make a tradeoff to accept some collateral damage. Each of us is making such choices every day anyway, and it is better to make tradeoffs with awareness instead of ignorance.

Probing into whether you agree with the business model or not will help you establish realistic assumptions of your impact on the company’s business model. It will also help clarify that your relationship with your future employer is fundamentally a business relationship and help you become aware of your tolerance of potentially uncomfortable parts of the business model. After all, we all live on the same planet with a shared fate, and our choices have a habit of catching up with us.

To summarize, incorporate the following questions in your search for the next job.

What is the company’s business model, and is it robust?
How is the position connected to the business model?
Do you understand the business model enough to answer whether you support that business model or not?

You might say that the business model is not essential for you compared to the pay, the title, the type of work, etc. That’s fine too, but I suggest that understanding the business improves your career success. You get to know how the company makes and spends money, and as a side benefit, you have a greater chance of becoming a better global citizen.

Must DOs for Interviewers

2022-03-25T20:10:16+00:00

Here are three must DOs for interviewers based on my recent interviewing experience with several companies.

1. Be Present

Presence is the most important behavior for an interviewer. Don’t distract yourself during the interview. The candidate would know when your eyes wander off to check an email or a message on your screen. When that happens, you break the communication flow. Don’t let this happen. Stay in the full-screen mode and disable notifications on all your devices. If you are taking notes, let them know.

Look, the candidate is giving you their time and attention. You also agreed to give yours to the candidate. The candidate must also have spent their time researching you in addition to the company you work for. So, the least you can do is be present during the interview.

Further, do some homework to get to know the candidate before the interview. Most companies tell candidates to research them prior to the interview. Reciprocate by researching the candidate, even if only for a few minutes before the interview. That process will ease you into being present with the candidate.

Don’t jump into the interview with no time between your prior meeting and the interview. Declutter your mind, your screens, and devices, relax for a few minutes before the interview, and only then join the interview. Don’t start the interview with a “let me open your resume” and start reading while the candidate is waiting or speaking. Show your manners.

2. Be Personable

Attempt to connect with the candidate. Smile. You’re in the interview with them to know about them. Regardless of suitability, you and the candidate are fellow human beings, so start at that human level. Be personable.

There is no need to wear a cloak of smartness. You’re there to get to know the candidate, not to show off how good you are. Candidates perform the best when they feel connected and respected. Don’t make yourself believe that interviews are about performance under stressful conditions and that you’re there to simulate such situations. Real-world stressful conditions differ vastly from interview conditions.

3. Be Curious

As an interviewer, you are in a position to judge the candidate. But don’t use the interview time to judge them. Judgement biases communication. Instead, be curious and explore the candidate during the interview, and defer judgment to a time after the interview.

When you begin to judge the candidate during the interview, your body language and tone will undoubtedly reflect your opinion. The candidate will likely feel your opinion subconsciously, influencing their subsequent behavior.

In essence, your judgment, mainly when unfavorable to the candidate, will affect the candidate’s behavior and eventually derail the outcome. Instead, a curious approach will let the candidate open up, be themselves, and play along with you so you can learn more about them.

Don’t make assumptions about things that the candidate has not told you. Ask them probing questions instead to establish facts as best as you can. Don’t jump to conclusions without probing.

Making Up Movies

2022-03-12T14:55:11+00:00

Continuing from my previous article on managing yourself, another classic trap that impedes personal growth is endlessly playing self-made movies in your mind and believing in those plots. Let me describe a particular situation.

One evening, I got a call from an ex-colleague out of the blue. We had not spoken for years. He wanted to talk, and I told him to go on. He sounded upset and beaten. For the next ten to fifteen minutes, he gave me a crude sketch of what was going on at work with his current manager, the new manager, some details about how he felt about them, and some feelings of betrayal.

It took a while for me to piece together what was happening from his perspective. Initially, he felt liked by his then-current manager and built a good rapport. He had access to that manager, which allowed him to work on good projects. That manager’s org grew, and he was now asked to report to one of his peers, which moved his former manager to the skip level. That felt like a rebuke to him. He was concerned about losing access and being pushed back. “What wrongs did I do? Why is my manager punishing me like this?” were his recurring questions.

Sounds familiar?

It was clear from his voice that he had been in deep distress for several weeks. He seemed depressed and had picked up drinking.

After about 30 minutes into the conversation, he became calmer. Let me present the pivotal moment of our discussion as I recall.

(after several minutes of me probing to understand)

Me: It sounds like you’ve been directing, acting, and living in a movie.

Him: What? Movie?

Me: The film you’ve been acting and directing in your mind.

Him: What?

Me: Yes, the same film where your manager is the villain, and you’re the victim. He is kicking you and punching you, and you’re hurting.

Him: ???

Me: You know that you have the power to get out of that movie and make a different one to play in your head?

Him: Hmm

Him: Now I see what you’re saying. Do you think I imagine all this?

Me: Most likely. Do you agree that your manager may not even know he is in your movie?

Him: Yeah

Me: And he may be directing and acting in a different film right now in which you don’t even exist.

Him: Yes, that’s possible.

Me: So, how about you stop making the movie you’re currently making and choose to make some other movie where you’re not a victim but a hero, where you imagine that things are happening around you and not to you?

Him: I see.

We then talked about alternatives to his current movie to recognize that he could control nothing but his perception of reality and his actions. He subsequently joined my coaching sessions, and we worked on some improvements.

Here is another familiar movie plot.

Sometimes, you want to have a crucial conversation with your manager or another influential colleague. It could be about a promotion, scope of work, or something of importance. You replay what you will say in your mind several times. You rehearse. You brood over it. It’s yet another movie you’re acting in and directing. In the end, that meeting may not happen, may move around, or may not go the way you had imagined. This is just wasted mental energy.

Instead of replaying, write down what you want to say, and close the book until you’re about to meet that person. If you feel it is urgent, find a way to get it over with. Call that person on the phone, or slack them. Clarify your assumptions. Talk to others about what you’re thinking. By replaying what you’re going to say, again and again, in your mind, you’re just taxing your brain. Don’t do it.

What is sad about this pattern is how common it is. Why make up such movies? Why ruminate on those stories and waste energy and time? Ancient philosophies have established that we make up much of our social reality. Cognitive scientists, too, have found the same. Mindfulness starts with minimizing such replays. The trick is to become aware, quickly, and pivot to do what’s in your control right now.

Don’t Chase Data Mesh, Yet

2022-02-02T22:13:36+00:00

Data mesh is a nice wishy-washy set of ideas to improve the current state of data. The principles are based on sound reasoning and are well intended, but I find the story incomplete to transform the current state of data radically. There is plenty of money flowing into the industry. So there are companies to be funded, differentiated products to be built, customer segments to be carved out, conferences to be held, and we may still miss the opportunity to drive change.

Here is why I say this.

You need modern developer-facing abstractions to decentralize data ownership to domain teams. Period. Otherwise, decentralizing centralized functions like data engineering would be more expensive and chaotic.
Those abstractions need to expose automated polyglot access patterns to support different types of use cases and users. These patterns include CRUD, search, real-time analytics, streaming to data lakes, generating business events, etc.
More importantly, you need to argue for data mesh in terms of value and not pain. What’s broken is clear, but that’s insufficient to motivate and drive radical changes. Opportunity drives innovation at a much more rapid scale and rate than pain. I find today’s Data Mesh arguments generally pain-based.

Let me dive into each of these arguments. This article is a continuation of my previous article on the broken state of data.

Modern Data Abstractions

Let’s look back to see forward. Today, thanks to systems like Kubernetes, containers, and cloud APIs, most developers interact with and make changes to complex infrastructure without realizing the complexity underneath. The developer and operator-facing abstractions of systems like Kubernetes eliminated several time-consuming and frictional steps for deploying, scaling, remediating, and managing apps. These made principles like automating everything, infrastructure as code, immutability, and repeatability easy to do. A successive generation of tools made implementing these principles progressively easier, which made DevOps successful. In other words, principles alone won’t drive innovation; you need developer-facing abstractions to make it easy to do the right things.

So, what type of abstractions do we need on the data side? I surmise that these include usual CRUD operations, search, real-time analytics, schema tracking, creating business events, transforming such events, etc. Such abstractions should let domain teams own domain data and do most things without handoffs across org silos or needing an army of people to put these together manually.

Polyglot interfaces and multi-sided abstractions

Data is always a shared resource. Regardless of how you organize data ownership, you will end up with multiple teams producing and consuming related data. Modern data abstractions must let multiple teams contribute and use data for their needs without manually shoveling data around. For instance, some of your operational systems might use a proprietary query language for CRUD operations, your analytics teams might use SQL, and some app teams might use GraphQL to access that data. It does not mean you need to explicitly clone the data, create glue layers, or rewrite database engines to support polyglot access. Your core data store might do the internal reorganization for you and expose simple operational knobs. Such multi-sided abstractions move the complexity behind the store’s interfaces, where it is easier to manage.
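To make the idea of multi-sided access a little more concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database stands in for the core data store. The class name DomainStore, the orders table, and the event shape are all hypothetical, and a real product would look very different. The point is only that one copy of the data can serve an operational CRUD interface, a SQL interface for analysts, and a stream of business events.

```python
import sqlite3
from typing import Callable, Optional


class DomainStore:
    """Hypothetical domain-owned store: one copy of data, three access patterns."""

    def __init__(self) -> None:
        # In-memory SQLite stands in for the real storage engine.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        self._subscribers = []

    # Operational side: CRUD-style calls for the owning app team.
    def create_order(self, customer: str, total: float) -> int:
        cur = self._db.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
        self._db.commit()
        order_id = cur.lastrowid
        self._publish(
            {"type": "order_created", "id": order_id, "customer": customer, "total": total}
        )
        return order_id

    def get_order(self, order_id: int) -> Optional[dict]:
        row = self._db.execute(
            "SELECT id, customer, total FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "customer": row[1], "total": row[2]}

    # Analytics side: read-only SQL over the same data, with no copy or pipeline.
    def query(self, sql: str, params: tuple = ()) -> list:
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("only SELECT statements are allowed here")
        return self._db.execute(sql, params).fetchall()

    # Event side: other teams subscribe to business events.
    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def _publish(self, event: dict) -> None:
        for handler in self._subscribers:
            handler(event)


if __name__ == "__main__":
    store = DomainStore()
    store.subscribe(lambda event: print("event:", event))
    store.create_order("acme", 120.0)
    store.create_order("globex", 50.0)
    print(store.query("SELECT customer, SUM(total) FROM orders GROUP BY customer"))
```

In this toy version, the producing team never hands data to a separate pipeline. Consumers pick the interface that suits them, and the store owns the plumbing.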

This trend is already beginning to happen for some cloud-based offerings. Companies like MongoDB and DataStax seem to be headed in this direction, and there must be others. We should expect the innovation in this space to continue to make polyglot access easy. I suspect that disaggregation of database engines from storage will accelerate this trend and lower performance and cost penalties.

Value

My final point is about value. We tend to accept known pains. Yup, the current state is painful, but why drive a change when you have ten other value-generating projects in your enterprise? If you are a leader of a data organization in an enterprise, you need value-based arguments to lead socio-technical changes. Without it, your initiative won’t make it to the top-10 key results for your CEO.

Remember that DevOps did not happen just because we said so a thousand times. There was a clear value driver — reduce time to value. That was the single metric to go after. Once you aligned on that metric, change was possible. You were releasing once or twice a month, and now suddenly, you get to release hundreds or thousands of times a day. You’re taking less time to recover from incidents, and your teams are learning from production systems. Such stories broke the dev vs. ops silos and made cultural and organizational changes possible. We had some remarkable examples to show during the early days, like 10+ Deploys per Day: Dev and Ops Cooperation at Flickr in 2009.

But we don’t hear such clear arguments for value today. There are plenty of articles on “what is data mesh,” but I could not find any on “why data mesh?” to propose a singular value argument. For example, in one of those “what is” articles, the author says, “the faster access to query data directly translates into faster time to value without needing data transportation.” But how? What’s the return on investment? Another similar article highlights “greater data experimentation and innovation while lessening the burden on data teams to field the needs of every data consumer through a single pipeline.” That’s a pain argument. There is nothing wrong with these points, but these are not enough to drive step-function changes.

Quantifying value from data is already hard and quantifying value for a significant socio-technical transformation around data is harder. But that’s what you need to drive momentum. I’ve not seen arguments and examples along these lines. Sorry, I don’t know what that is either.

No socio-technical change will happen in one go, and winners won’t emerge overnight. We need multiple attempts over the next several years. Incumbents in the data landscape might not be willing to lead the path initially as it would impact their current business models. Open source will likely need to play a role to make it easier to put things together with standard parts. That’s my hypothesis for the future.

Managing Yourself

2022-01-30T16:40:54+00:00

As the readers of this blog know, I have been offering free coaching sessions since November last year. From then, I’ve coached over twenty people from several tech companies. I knew of a few from prior experience, but many others were new. Though most of the participants were individual contributors, there were some engineering, product, and program managers as well.

During these sessions, the participants and I explored various aspects of their leadership. Each participant is at a different point in their leadership journey, yet we identified challenges to overcome. In some cases, the challenges were apparent, but in most cases, we had to peel a few layers to discover areas of improvement and tactics to employ. What we felt was the challenge turned out to be a symptom of some other underlying challenge.

But as I started spending more time with participants, I began to see a more significant issue. Several sessions underscored one essential skill we all need to acquire: managing yourself.

Managing yourself includes managing your internal state (how you feel about yourself and others and your emotions), time, relations at work, dealing with conflicts, setbacks, and difficulties, being present and investing in personal growth. Through these coaching sessions, it’s become clear that most of us aren’t even aware of ourselves, take managing ourselves for granted, and float along in our daily lives.

A few participants told me that time management was their biggest challenge, and they wanted to figure out how to be more efficient at doing and completing all the things they wanted to do. They have tried some approaches, which didn’t work. On probing a bit more, I ended up telling them that they don’t have a time-management problem, and their problem is self-management. Time is not a manageable entity — every moment, we inherit a new moment, and the current moment becomes the past. Each moment comes and goes. But, what we do with those moments is manageable.

You’re the CEO of yourself — a one-person enterprise — act like one

In the introduction to his classic management book High Output Management, Andy Grove writes that “you are in effect a chief executive of an organization yourself” (see page xv, in the first chapter). Although he expressly referred to middle managers in that chapter, his point applies to everyone.

Imagine yourself being the CEO of a one-person enterprise. To be a functioning enterprise, you have to set a purpose, a strategy, and objectives for yourself. You have a business plan and metrics. You have an operating plan to function as an enterprise.

As a CEO, every day, you get to decide how you’re going to invest your time, which is your primary fixed capital. You choose the activities you want to do and those you don’t want to do. You prioritize. A CEO rarely runs the company based on that morning’s news, social media chatter, or the stock ticker. Same for you. Why let incoming email, meetings, rumination, and doom-scrolling dictate how you spend your time?

Every day, you, the CEO, are accountable for the mood in your company. Have an all-hands with yourself and hear what you are telling yourself. Are you setting a purpose for each day, or are you letting daily activities take you in different directions? Are you mumbling to yourself, giving yourself excuses, or do you sound energetic and joyful with clarity of purpose and action?

Ask yourself about your investment plan for yourself. What skills are you acquiring? What are you learning? What alliances and partnerships are you building? Are you being effective at those partnerships?

When you imagine yourself being a CEO of your one-person enterprise, you don’t take conflicts and setbacks at work personally. You tackle those with strategy and tactics. You realize that your relationship with your employer is fundamentally a business relationship.

I find that such a mental model of being the CEO of a one-person enterprise can build self-awareness. It puts you in a position of driving action, not stagnation. It is an effective tool. Try it out.

If you’re open to working with a coach, I may be able to help. I’m still accepting a limited number of people for coaching. Send me a DM on LinkedIn or Twitter to get started.

2021 in Books

2021-12-28T19:40:25+00:00

Here are the top books that influenced me the most this year.

Positive Intelligence by Shirzad Chamine

My coach recommended this book late last year, and I picked it up this year. In this book, Shirzad provides a framework to quiet out the worst (the saboteurs) by naming them and revealing them. His framework helped me discover my saboteurs and recognize them quickly when I see them again. I recommended this book to several people in my circle this year. In my coaching sessions, I ask people to read this book and come back to tell me what they discovered and the justification lies they tell themselves when demonstrating certain behaviors. This exercise usually leads to several insightful conversations and aha moments for self-improvement.

Though unrelated to this book by genre, I also recommend Seven and a Half Lessons About the Brain by Lisa Feldman Barrett to learn that “we all live in a world of social reality that exists only in our human brains” and that this social reality is neither innate in our minds nor fixed. We have the power to recondition it.

The Ends of the World by Peter Brannen

This book completely changed my understanding of the current global warming trend. Here is the uncomfortable truth you discover from this book — unlike the stuff we buy in stores or online, this planet has a “no return” policy on the carbon we extract, transform and consume at the most rapid rate this planet has ever seen. We have no easy and quick way to put back all the carbon we extracted in the last 100+ years in time to stop or reverse the current warming trend. The most effective and natural way to put back the carbon takes tens of thousands of years through a process known as “weathering.” Is the game over? I hope not, but we don’t have the time.

This book led me to Neil Shubin’s Some Assembly Required, which I also recommend reading.

Trillion Dollar Coach by Eric Schmidt, Jonathan Rosenberg, and Alan Eagle

Imagine learning about someone through a few accomplished people. This is a story of Bill Campbell. I finished reading this book just last week. As you read the book, you will recognize the power of listening, care, and candor to influence others. In his The Hard Thing About Hard Things, Ben Horowitz also talks about Bill Campbell, saying that “truly great leaders create an environment where the employees feel that the CEO cares more about the employees than she cares about herself” and that “(Bill is) the man who is the best I’ve ever seen at this.”

However, if you’re interested in cultivating coaching habits, I recommend starting with Michael Bungay Stanier’s The Coaching Habit.

Below is the complete set that I managed to read this year. It was a productive year, and I managed to improve the diversity of books a bit this year.

Follow me on Goodreads so I can follow you back to learn what you’re reading.

On Feedback

2021-12-26T16:19:27+00:00

Asking for feedback can be challenging. It can be uncomfortable to know. It may force you to look at things that you would rather look past. What if there is a better way?

What if we consider that the feedback is already there, floating in front of us, nicely wrapped, and all we have to do is grasp it with our hands, and unwrap it to know what’s inside? What if we seek it like we seek anything new or unopened, with curiosity and no judgment of the giver as well as of the seeker, to see what others saw, to hear what others heard, and to know what others felt?

Yes, that’s the nature of feedback. Whether we seek it or not, it is already there, floating around us, waiting for us to unwrap. If we are present enough, we can get glimpses of feedback from others’ body language, tone, and behaviors. If we’re curious enough, we can ask for it.

The choice is ours — grasp it to learn from it but not judge it or the seeker, or move past it with continued apathy.

Cultivate the Coaching Practice

2021-12-12T16:21:14+00:00

Most managers and companies spend weeks and months hiring but do little to feed and groom the minds of the people they employ – that’s what my current coaching experience reemphasizes.

We know that complexity in the workplace has increased over the past decade. We are demanding a higher rate of change and lower time to value from each other. We ask individuals to excel in boundaryless situations, influence others and lead cross-functional projects. Yet, most individuals are lost, unguided, alone, stuck, or stumbling to find their better selves at work. Companies compete to offer money and perks and yet rarely equip themselves to feed their employees with the mental nourishment they need to lead themselves. Coaching is one of the ways to provide that nourishment, but it rarely happens at work. Managers often lack the time or maturity to coach their team members.

Suppose you happened to have some coaching moments from someone at work — lucky you, you’re in the minority. If your workplace has a formal coaching program to pair you up with an internal or external coach, you’re in a great place — take advantage of such programs. However, for the majority, these don’t happen. The result is sub-optimal growth and performance. I believe people underperform, not for their lack of abilities and willingness, but for the absence of tools like coaching to lead themselves better.

Since I announced my offer to coach last month, I have had the privilege of conducting nearly twenty coaching sessions. Most participants mentioned that, through these sessions, they have had some breakthrough moments and discovered better ways of leading themselves at work. In a feedback survey I sent, 90% said they would recommend coaching to others. In these sessions, the participants are doing all the hard work reflecting and exploring different options during and between sessions. I was merely present listening, asking questions, observing, offering pointers to explore, and sometimes sharing personal experiences. That’s all. Given this experience, I plan to continue coaching in the future and urge other leaders to do something similar. More about the future later.

If you’re a manager or someone in a leadership role, I want to ask for three things.

First, read Michael Bungay Stanier’s The Coaching Habit. The book is easy enough to read and gives you seven excellent steps to practice coaching. Trillion Dollar Coach by Eric Schmidt, Jonathan Rosenberg, and Alan Eagle is another good book to read, but I find The Coaching Habit more direct to offer the crux.

Second, appropriately incorporate coaching in your 1:1s with the people in your team as well as other teams. Before your 1:1s, declutter and calm yourself, and put all the transactional work aside. You can’t offer coaching moments unless you’re present and are in the moment. Poor work habits like back-to-back meetings, email streams, and multi-tasking make it hard to be present. Don’t waste their time when you’re not able to be present.

Third, don’t judge people. You can’t offer coaching moments to people when you are in the mood to judge. If you feel compelled to judge others based on how they are approaching problems or what they are going through, you are not ready to provide coaching moments. People won’t trust you or open up if they sense that you’re judging. They may also mistrust you if you’re their manager since you can penalize them. Instead, first train yourself not to judge others.

For some time, I’ve been looking for effective ways to develop others. I now know that coaching is one of the most effective ways to help others rise. To grow as a leader, incorporate appropriate coaching moments in your interactions with others.

Tell Me More

2021-12-07T00:17:58+00:00

As a form of asking a question, “why” is the queen. Asking for why demonstrates a curious mind at work. We expect a question with a “why” to prompt inquiry. Five whys is a popular iterative interrogative technique to explore causes behind effects. But when dealing with people, “tell me more” is a more powerful way of inquiry than asking for “why.”

Try this next time you want to know why someone did something in a certain way. Instead of asking

Hi, I noticed such and such about this thing. Why so?

ask

Hi, I noticed such and such about this thing. Can you tell me more?

There is a subtle but important difference between these two. I won’t tell which one does what to the listener, but one of these forms makes the listener want to protect themselves (imagine the listener wrapping their arms around themselves). In contrast, the other makes the listener open up (imagine the listener opening up their arms relaxed).

Try it out. Replace “why” with “tell me more.” Find the difference?

Broken State of Data

2021-11-19T16:31:02+00:00

(Also see Don’t Chase Data Mesh, Yet.)

In recent years, data has gotten a more oversized seat at the enterprise table. Data was never less prominent, though the focus used to be more on choosing and running operational systems like SQL or NoSQL databases at scale. Given the prominence of closed-loop ML systems and the potential customer value they could generate from analytics (such as statistical, predictive, diagnostic, and machine learning) use cases, enterprises have an even stronger desire to be data-driven.

Just this week, I came across an insightful tweet by Adi Polak, a big data developer at Microsoft, who said, “If you can understand how to produce, collect, manage and analyze data, you’ll own your future.” She could not be more right. However, it turns out that being data-driven is more complicated than what appears on slideware.

Over the last few weeks, I have had a chance to speak with some ex-colleagues about the state of data, the emerging notion of data mesh, and how different vendors talk about data. For those not familiar with data mesh, see Zhamak Dehghani’s original articles How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh and Data Mesh Principles and Logical Architecture in 2019 and 2020, respectively. You can also find several videos on YouTube on this subject. As an outsider to this space, I have the luxury of a fresh perspective. Here is what I found out from my explorations.

The state of data is messy for several reasons, but these four seem to stand out.

Fragmented ownership and accountability across organizational boundaries
Centralization of specific functions like database management and data engineering but without full skin in the game
Skill imbalance — software development teams rarely treat data as part of their service, and data engineers rarely get exposed to the software development lifecycle; they often speak and approach problems differently
Glue tech that you need to shovel and transform data around to power analytics use cases, including machine learning

Often, the teams that produce data are decoupled from the downstream analytics and ML use cases. Likewise, the people dealing with analytics use cases are typically unaware of operational systems that produce the data. Data engineering teams get caught up in the middle with a mandate to make an enterprise data-driven. This chasm is not a surprise to the data practitioners who live through this mess.

On the one hand, such central data teams often lack the ownership and skills to implement this mandate. The same is valid on the operational side as well. When reviewing this article, Mammad Zadeh, who ran data engineering teams at LinkedIn and Intuit, rightly points out the lack of “right tools to empower developers to take on that responsibility.” When neither side is well-equipped, the enterprise is the loser.

In a recent example, I saw an ML team wait for the data engineering team to fix things in areas they don’t own. A subset of data stopped flowing into a particular destination due to some defect in the source app, but the data engineering team ended up holding that bug in their queue for weeks. They had no idea where to look since they didn’t own those source apps. It is no wonder that data scientists often cite data-related issues like data collection, clean-up, and quality of data at the top of their list of pain points. Machine learning, after all, is a feedback loop problem. You need all parts of the feedback loop working well together for it to learn quickly and produce value. But for that to work, you must find ways for the operational and analytical worlds to work together seamlessly.

While it is exciting to see the innovation and choice in the database arena, the resulting flexibility has amplified particular challenges. First, it is hard for software development teams to make database choices for the data model and anticipated access patterns. Second, once a team makes a choice, they have to deal with the consequences of that choice in the face of data or access pattern changes or lessons learned. Changes consume time, create downtime, and could leave dirty data behind. Third, choice breeds sprawl, making it complicated to trace the source of truth, the lineage, and establish which data to consume.

These challenges also contribute to increased cost and time for governance. Just knowing what data field exists in what data store could be a program over several weeks. It is not a happy place to be in a centralized data organization entrusted with solving problems like establishing the source of truth, lineage, model management, and governance.

The situation on the analytical side seems worse, with the work often devolving into keeping things going or “keeping the lights on.” By this, I don’t mean that enterprises struggle to do analytics or machine learning. There are great products and technologies for these use cases for sure, but these involve stitching together multiple sources of data and performing data movement and transformations. In the current state of affairs, the success of the analytical world depends on first finding the suitable glue and then keeping that glue together end to end. When the glue breaks, things can go unnoticed for a while. When someone notices, many people may need to contribute to fixing it. The current state of affairs does not seem to offer a winning strategy.

In my opinionated observation, there is a lot of vendor-speak in the material I reviewed, which shows that the buyers for the products are not the doers but decision-makers keen to get some outcomes quickly. Such decision-makers may lack time to dive deep. I believe that tech designed and sold to decision-makers contributes to widening the chasm between operational and analytical worlds and adds more glue in between.

Also, look at the tool landscape, like Matt Turck’s 2021 Machine Learning, AI and Data (MAD) Landscape. It’s a tool bazaar. Making sense of the landscape can be challenging, and you may not know if you’re getting the value you want or just adding more tools to your tool chest. The tools in the landscape are growing, and you will undoubtedly need some handcrafted glue to put things together to drive value quickly. Building upon an analogy by Erik Bernhardsson, Benn Stancil describes this situation better in his The Modern Data Experience, where he says that “hyperspecialization (of tools) makes us great at chopping onions and baking apple tarts, but it’s a bad way to manage a restaurant.” In my experience, ML problems are like running a restaurant — you need to establish decent closed feedback to derive value.

Finally, many organizations seem to have left out data in their microservices and DevOps transformations of the prior decade. These transformations radically changed technology and culture through massive decentralization of infrastructure, people, and responsibilities. Yet, we still see large data engineering organizations struggling between producers and consumers of data, holding the problematic parts.

Centralization of certain functions isn’t bad as long as there is a clear shared responsibility model. Consider, for example, centralized platform teams that provide CI/CD tools for cloud environments. There is a clear separation of concerns and a shared responsibility model with platform teams owning the tools and individual dev teams owning their apps and apps’ health. However, I don’t believe that the current toolset and patterns can enable a similar separation of concerns for data. Consequently, I don’t think it is fair for organizations to entrust data engineering teams to own discovery, governance, data cleanup, etc., in addition to the infrastructure and tools.

When reviewing a draft of this article, my one-time colleague and mentor, Debashis Saha, who has led data engineering teams at eBay and PayPal, made an excellent observation. He said that “today’s technology creates a polarized architecture” and that “centralized data teams cannot keep up with the needs of the business or the decentralized operational teams that create the data mess.” It’s just hard to succeed.

Where do these challenges leave us? I have a few hypotheses to offer.

First, I don’t believe that the analytical world would drastically improve unless we take the domain-driven approach to data very seriously in the operational world. Any change needs to start with data production, most of which usually happens in the operational world. We likely require a new generation of developer-facing abstractions for operational systems to better bridge with the analytical systems. I don’t have answers; I have a hunch.

Second, we need organizational transformations to implement “data as a product” with a bounded context. It requires shifting responsibilities. You can have centralized teams provide and operate shared tech, but you have to give the data ownership and accountability back to domain teams. You also have to hire appropriately and adjust goal setting and planning rituals to support such decentralization. I’m not the first one to say this. Zhamak describes data mesh as a “decentralized sociotechnical approach.” I want to emphasize the social aspect of this description because you have to make changes at every level to reap the benefits of the concept of data as a product. But transformations are expensive and require leadership with tenacity, endurance, and foresight. These can be hard to come by.

Third, while the core principles of the data mesh, such as (1) domain-oriented decentralized data ownership and architecture, (2) data as a product, (3) self-service data infrastructure as a platform, and (4) federated computational governance, promise a better state for the analytical world, I don’t believe that there is a straight road ahead for all the reasons I describe above. Each of these four items requires significant shifts in tech, processes, roles, and responsibilities. I would march with eyes wide open, start from the first principles and be open to changing minds as we go.

Thanks to Debashis Saha for several discussions and feedback on this article. Also, thanks to Mammad Zadeh and Bala Natarajan for their perspectives and feedback.

Looking for a Coach?

2021-11-13T16:07:26+00:00

Hello. Are you an individual contributor in the tech field? Are you looking to make an impact at work but are running into obstacles? Would you like some guidance to figure out how to influence others? Do you find it hard to deal with conflicts at work? Do you feel you have ideas, but those are not gaining traction? If you’re sensing some of these at work, I can partner with you to uncover some ways to unblock them.

Update: See this page to get started.

Here is how it works:

You are an individual contributor with at least seven years of experience.
I will offer three sessions over three weeks, with each for about 45 minutes.
I won’t teach you or offer solutions. We will instead discuss you, the circumstances, and the challenges, and together, we will discover potential options.
You will not have to disclose anything that you do not want to.
You will also not tell me anything you must not, such as proprietary and confidential work-related details.
Our sessions will be completely confidential. You and I agree not to disclose the contents of our sessions to anyone without explicit consent from the other.
These won’t cost you a thing.

In return, I would ask you for two things:

Give me feedback
Donate a reasonable amount to a charity and share

Why am I doing this?

First, it can be tough to discuss such topics at work. Managers have more access to information and expertise to get help with these issues than individual contributors. Managers also get on-the-job practice over time. At the same time, companies expect individual contributors to act as leaders for their career growth, but they rarely provide any guidance. Hence, for this round, I’m limiting these sessions to individual contributors.

Second, I believe in the power of coaching to solve problems. Such issues are not new to me, and I’ve personally navigated such matters and coached others in the past. I’ve also benefited from some excellent coaches.

Third, I’m currently on a sabbatical and have the luxury of time to explore different things.

If you’re interested, follow me on Twitter, and send me a direct message. We will take it from there.

Being Nice and Effective

2021-10-23T11:06:13+00:00

I’m writing this article for the nice people in leadership roles. By the meaning of the word “nice,” such people are kind, polite, and friendly. They generally trust the goodness of the people they lead. They are not usually difficult to work with, care for others, are patient, and listen. They venture to be vulnerable. You don’t hesitate to approach them.

In contrast, not-so-nice leaders believe in their control and power more than the people they lead and hence can be indifferent and challenging to work with. They rarely show humility. They like to be right all the time. You hesitate to approach them. You can spot them in meetings — they run their meetings like court sessions.

I picked up this question of whether nice people can be effective based on a few passing comments of being nice in a couple of situations. The question bothered me a bit, and I wanted to find out.

What does it mean to be effective? Effective is getting things done. Effective is being clear of what you want and what you are willing to forego to produce results.

In 2000, Daniel Goleman wrote a widely referenced article in HBR titled Leadership That Gets Results. In 2017, HBR Press also published a book by the same title. At the beginning of this book, Daniel poses the question “What should leaders do?” and proceeds to answer with “the leader’s singular job is to get results.” Based on some prior research, he then introduces six leadership styles and their impact on an organization’s working environment, which he refers to as “climate.” Here are the six leadership styles.

Coercive: This style demands immediate compliance. You get told more often than being asked about what you think. This style hurts the climate.
Authoritative: This style mobilizes people towards a vision. They motivate people through vision and work to maximize your commitment to goals. You would know how your work is contributing to the broader vision. This style has the most positive impact on the climate.
Affiliative: This style creates harmony and builds emotional bonds with the people. Under such leadership, people share ideas and their inspiration with one another. This style also has a positive impact on the climate.
Democratic: This style forges consensus through participation. This style is characterized by listening to people and getting their buy-in to drive flexibility and responsibility, even if it takes more time. This style also has a positive impact on the climate.
Pacesetting: This style sets high standards for performance and creates “do as I do, now” pressure on people in an “I know it all” mode. Of course, this style also harms the climate.
Coaching: This style develops people for the future. In this style, you will see leaders delegate and give people challenging assignments. You will be lucky to work with such leaders since you grow. This style also has a positive impact on the climate.

In this book, he describes climate with the following six factors:

flexibility — that is, how free employees feel to innovate unencumbered by red tape
their sense of responsibility to the organization
the level of standards that people set
the sense of accuracy about performance feedback and aptness of rewards
the clarity people have about mission and values
the level of commitment to a common purpose.

He identifies that authoritative, democratic, affiliative, and coaching styles have “the best climate and business performance.” The other two styles, coercive and pacesetting, hurt the climate. I’m sure you all have worked with or for managers that employed these two negative styles, and you could not run away from them fast enough.

Coming to being nice, since nice people are kind, polite, and friendly to others, they have better chances at practicing positive leadership styles. We can therefore dispel the perception that you can’t be nice if you want to be effective. Being nice is essential to being effective.

However, nice people can sabotage their effectiveness in at least the following ways:

Hesitation to act – could be due to waiting for consensus for decisions, setting a vision but not implementing the mechanisms to translate that vision into outcomes, defaulting to humble inquiry and coaching all the time, or participating instead of leading
Not employing their leverage – see “Behavior 3: Using your leverage” in my leadership document
Being diplomatic and tactful when candor is needed and forgetting that care without candor creates an ineffective climate.
Not saying “no” enough

Here is my summary:

First, being nice should not come at the expense of being effective. Likewise, being effective should not require not being nice. You can be nice and effective.

Second, don’t think in terms of extremes of being nice vs. being effective. Find the nuance by developing candor.

Third, don’t stop being nice if someone accuses you of being nice. Kinder leadership styles are in short supply. Instead, invest in radically improving your effectiveness.

Infinity Minus One Possibilities

2021-10-16T15:51:09+00:00

Career changes come in different forms, some of which are more socially acceptable than others. For example, you find a new job and call the break between jobs “funemployment.” It’s safe; you figured out the next steps and arranged to take a break. It appears to be the most socially acceptable style. Although less common, it is also socially okay to quit your job, take a break for a few months, and explore new opportunities. In terms of acceptance levels, though, it falls slightly behind funemployment. But one of the less socially acceptable styles is for someone to part ways on their employer’s terms and then take the time to explore possibilities. My recent departure from my prior employer fell into the last category. There it is.

My situation is not unique — many people go through such career-changing events with far worse ramifications. I have the luxury of time and space to write about it and some friends and family to lean on. Yet, here is why I’m writing this article.

First, I have a choice to make. I can stay ambivalent about my departure and imply that I chose to depart without saying so clearly. I could even call it funemployment. Alternatively, I could choose to acknowledge that the choice was not mine, and yet I’m genuinely curious about what lies ahead. I’m choosing to go with the latter option to face myself with clarity and remove any ambivalence from my mind. You might say that it takes courage to do so, but I think it opens more doors.

Second, I want to break the stereotype that such breakups are bad. I want to highlight that it is perfectly okay to acknowledge such a change and then operate from the point of strength and not weakness.

Third, I’m sharing the mental models I used to process the change, which may be of value to others. Once I realized what was happening, I struggled over a weekend to regain my equilibrium. I concluded that what happened was a gift for me to explore possibilities that I may have been ignoring. But that conclusion was not instantaneous — some mental wrangling preceded it.

Over the last several years, I’ve been fascinated with the idea of playing with different mental models to influence how you feel about and react to events around you. For me, this is a way of building resilience and agility. I’ve been able to apply this technique to address several ambiguous technical and leadership problems when I felt stuck or noticed others stuck. I got out of difficult situations with more than one possible option when no option seemed to exist. If there is only one thing I could take away from my last job, it’s learning this ability and applying it at scale.

I think of a mental model as a pair of tinted sunglasses. You swap one pair with another with a different tint, and the world looks different.

As Lisa Feldman Barrett says in her 7½ Lessons About the Brain, “we all live in a world of social reality that exists only in our human brains.” This social reality is neither innate in our minds nor fixed. She adds that we “create the social reality with other people without even trying because we have human brains.” Per her research, one way to create the social reality is by “reliably copying one another to establish laws and norms to live in harmony.” In other words, we seem to learn about how we feel about events around us from others. We copy both the useful and not so useful mental models from others.

Thanks to such copying, in most cultures, we default to mental models like the ones below when faced with such situations.

Victimhood: With this mental model, you think something awful happened to you, and you’re a victim. You paint certain people as perpetrators that worked against you and took something of value from you. You feel hurt. You feel right to believe that those perpetrators are wrong and that you’re right. You’re confident that they willfully did this to you. You think that they mistreated you. You wish that, if only those perpetrators cared, were thoughtful, and acted accordingly, this situation wouldn’t have happened.
Self-doubt: With this mental model, you attribute whatever happened to your incompetence. Once you start judging yourself, you will find several examples of your incompetence. All your prior mistakes dance around you. You worry that you’re not good enough and that you failed. You wonder if you could ever recover from this.
Entitlement: I find that “feeling entitled” is a companion to “feeling like a victim.” They go hand in hand. As you fight against victimhood, you will begin to remember all the good things you accomplished and wonder how they could harm you despite those good things. You feel that you deserve better and are entitled to a better outcome.
Rumination: In this model, you start to guess what might have happened and why. You want to find out who said and did what to you and why. You want to understand all the events that occurred and the rationale. You will have so many questions, and you keep thinking about those questions till the cows come home.
Heroism: You could also think that the place is burning up, and you were going to leave anyway, but perhaps not yet. You feel good about what happened because you know they were wrong, incompetent, etc.

These are some examples, and there can be more. All such models seem valid and justifiable. You may vacillate between these models to feel discouraged, helpless, angry, confused, etc., for a while. Depending on the culture you grew up in, and the traits learned in life, you may go through these for minutes, days, or even weeks.

Unfortunately, such mental models do nothing but close doors for future possibilities. They make us stuck in the past and make us carry that burden into the future. But it does not take much to find that you can form a different social reality with a different set of mental models.

Things generally don’t happen to you; they occur around you. Nobody did anything to you. You’re not a victim. You happen to be there when certain things are happening.
In a complex system like a workplace, you know that there is no single root cause. You’re okay not to know what happened and why. You know you can choose to second guess what might have happened, but recognize that you can’t do much with those guesses. You let go of rumination.
You admit that it is no different when your employer, instead of you, makes the first move. A divorce is a divorce. You recognize that the timing of who says it first is of little consequence in the long term. You question why it is okay for you to plan your departure to land elsewhere on your terms, but not the other way around. You challenge the notion that the former is superior to the latter.
You recognize that your current working identity is undone (see Herminia Ibarra’s Working Identity). You realize that you’re free to explore, unencumbered by your current work identity and the daily grind that comes with that identity.
You are aware of your strengths and areas of improvement. You know your accomplishments and what you stand for. You know that when one door closes, you still can discover and open a million other doors.

Unlike the first set of models, these models create a different social reality in which you don’t glue yourself to the past but are genuinely curious and excited about what lies ahead.

But discovering such alternative models can be challenging, particularly when emotions due to the first set of models overtake you. Most of the difficulty is being aware of the mental models you’re currently using. It takes practice to remind yourself that there are other mental models to explore. For me, journaling helps.

When things happen, you get to choose: entertain mental models that close doors or only consider mental models that open new doors. You can choose to carry the burden of the past or choose to be present, enjoy the moment, and build for the future. One weakens your backbone with the weight of history; the other strengthens your spine for the next phase. When put this way, the choice is simple. The faster you become aware of this choice, the better for you. That’s the gist of what I learned over time at work.

In the end, you might ask if I’m spinning a negative into a positive story and rationalizing. Yes, that’s the point of choosing alternative mental models to aim for the outcomes you want. Our social reality is malleable. Conditioning the mind is an essential element in life to improve resilience. It involves picking up mental models that work for you and discarding those that don’t. I could not have gotten certain things done in my career if I hadn’t taken that approach, and I know it is valid for millions of others. That’s the ability I’m taking with me to explore possibilities.

Leading vs. Participating

2021-10-05T20:45:38+00:00

Over time, I observed a particular leadership trap that managers and others in leadership roles fall into. The trap is participating as opposed to leading. I fell into such a trap myself a few times. Let me give you an example to highlight the difference between leading and participating.

This example involved the manager of a large team. The team ran into a defect in a system owned by another group. They first tried to work around the issue and then escalated it to the other team over email. The email thread grew over the next few days to include several others as the problem turned out to be in yet another system. Folks shared ideas on potential causes and fixes over email. It was not clear for days who was playing what role and who was accountable for fixing the defect.

The manager was also on the email thread. Along with others, he too asked questions and shared ideas about potential solutions. Days went by. The email thread grew longer. Everyone, including that manager, operated as though they were collaboratively discussing the issue. The issue lingered for 3–4 weeks before someone fixed the problem.

In this case, the manager was participating instead of leading. A leader would instead drive the appropriate sense of urgency, clarify roles and responsibilities, ensure those are understood, and oversee a timely resolution. A leader would participate and discuss the issue only to the extent needed to get enough details for that facilitation; no more, no less.

There is a clear difference between leading and participating. A leader’s job is to get results. However, given the plethora of material on various leadership concepts like “coaching,” “servant leadership,” and “empowering teams,” managers sometimes forget their role in delivering results and default to participating instead of leading. Under such a management style, problems linger, and outcomes take longer.

I want to describe a few patterns to check whether a manager is participating as opposed to leading.

Pattern 1: Reiterate the right things in the hope that the right things will happen automatically

Knowing and preaching the right things is one thing, but getting everyone to do those things for a positive change is another. Knowing and preaching the right things is what advisors and experts do. They look at the problem landscape and tell you what you should do. But leaders don’t stop there. They need to figure out how to get the team to act on those and produce results.

I recall one particular conversation where a manager explained to me certain things that his product should do. But when I asked how he was going to turn them into reality, he had no plan.

In another case, a manager got the team together to discuss certain best practices. At the end of the meeting, everyone agreed to the best practices that they identified. The manager felt good that the team understood the need for and identified a set of best practices. The team members, too, felt good that they now have the best practices.

But then the team waited and waited for their manager to tell them when and how to introduce those practices. The manager walked away, assuming that the team would practice those on their own. The waiting continued. A few months later, everyone forgot about those and went about their business as before. In this case, the manager assumed that the team would naturally adopt those practices but didn’t lift a finger to ensure that they did. As Colin Bryar and Bill Carr highlight in their Working Backwards, good intentions don’t matter unless supported by mechanisms to translate those into excellent outcomes. That’s the manager’s job.

Pattern 2: Thought-leadershipping because you’ve thoughts

Here is another pattern for people with deep experience in a domain. When they come across a team dealing with problems they had previously worked on, they would get into the details themselves. In doing so, they may lose track of their role and indulge themselves. As far as they are concerned, they enjoy the subject and have valid opinions to give to help the team. But in doing so, they may shut the debate, or waste others’ time, or worse, inadvertently push the team to make unnatural choices.

I recall one particular senior manager of a large organization who used to have the team bring a series of topics to him every week. People used to prepare and present that material to that manager. Then they would discuss, and he would offer intellectual opinions. As far as he was concerned, those topics were fun, engaging, and he provided much-needed expertise. The presenters used to feel great at the beginning for the chance to present to such a senior person. That sense faded away later on as no outcomes came out of those meetings.

It’s okay for a manager to institute rituals to bring information to them, but then they are responsible for letting those rituals yield positive change. In the above example, the discussions were just theater.

Pattern 3: Being friendly with no parameters

Some managers like to be nice to everyone and rarely want to confront anyone or even set parameters. They want to be liked, treat their team as peers, and not let hierarchies come in the way. They prefer to avoid introducing rituals and processes for fear of stepping on the toes of their team members. One manager thought it was more important for three teams collaborating on a project to pick three different tools and rituals for tracking their work than to pick one to reduce the communication burden. Consequently, team autonomy prevailed, but outcomes suffered.

Such managers employ democratic and coaching leadership styles that Daniel Goleman describes in his article Leadership That Gets Results (also consider reading his book by the same title). These two styles emphasize building consensus, participation, and developing people over producing results.

These two are basic styles for a manager to incorporate, but only in moderation when appropriate. Without a strong vision and mechanisms for goal setting, goal tracking, decision making, and delivery, such styles fail to create a productive work environment. People generally like to work for such managers as they feel heard and empowered and are free to take the initiative, but may fail to stretch and grow when the manager is too polite to set goals or challenge them. Some individuals may still grow and do exemplary work under such managers, but only because they want to, conditions permitting.

I ran into a few managers that employed this style. One particular manager comes to my mind. He was well-liked by the team for his leadership style. People raved about him when asked to provide feedback for their manager. Yet, the team’s project outcomes slipped, important decisions took weeks, and even those decisions were re-opened and debated because everyone felt empowered to question and discuss ad nauseam. People felt good, but results suffered.

The team became productive only after the manager and one or two vocal members left the team, and new folks joined. The new manager took accountability for outcomes and introduced some team rituals for decision-making and goal tracking. These simple adjustments made the team effective.

Here is the bottom line. Assuming a leadership role requires being responsible for results. As I discussed in my leadership document a few months ago, managers must employ their team as leverage to produce outcomes.

One last thing: hierarchies are real; they have a purpose, and regardless of your leadership style, your title or level in a company’s hierarchy makes you responsible for outcomes.

Let’s Discuss Attrition

2021-09-05T18:45:17+00:00

Let’s discuss attrition. Attrition gets the most negative attention at workplaces. Folks talk in hushed tones about who is leaving, who is going where, the types of offers they are getting, and what could be going wrong with your team or company. Managers periodically sit with HR teams to review attrition trends in different geographies, pay levels, and diverse groups of employees, nod their heads and move on. A few courageous folks surface attrition concerns during team meetings, townhalls and all-hands meetings only to get vague answers from their managers. Most of us consider attrition bad and don’t often look at it with a growth mindset.

For those leaving, there is the usual drama of long emails and LinkedIn posts about all the good things they are leaving behind. People write about why they are leaving, and how they are feeling, and sometime later, follow with why they are excited to join a new company. No doubt. It’s natural to feel emotional when leaving a team with whom you spent most of your waking hours. It’s also natural to feel excited when embarking on a new journey. This phase replaces all the accumulated negativity of the company they are leaving with the positivity and anticipation of what they will do at the new company. This phase, too, is natural.

The mood is different for those not leaving yet. They fear missing out on the good things those leaving are likely getting. They worry about things falling apart in their current team or company. They feel uncertain, not knowing what others might know about what’s wrong, and speculate about why others are leaving. Worse, people feel anxious about their worth and fear that they are not qualified enough to pocket the same kind of jobs that others are getting. These are nothing but symptoms of a fear mindset.

Most managers also play the victim when people leave their teams. They talk to their managers and HR teams about things that are supposedly driving attrition – like compensation, cultural challenges, lack of innovation, the market, the other companies paying more, and so on. I’ve had people come to me and tell me that some team is falling apart because one or two people left that team to work elsewhere.

For a manager, this is an unproductive drama. Fear and feeling uncertain about attrition are not going to take you anywhere. Let’s look at what you should focus on instead.

First, recognize that attrition is natural. Everyone moves on one day or the other, including yourself. There is no need to be dramatic about it. It’s okay for people to leave a team or a company. Also, realize that companies and teams go through cycles of new work, scaling out the work, stabilizing that work, or even shrinking back. Hiring, attrition, and divestments support these cycles and are healthy.

Second, reasons for attrition vary. You can try to put each exit into a bucket, but that’s simplistic. There is usually more than one reason for someone to leave a team or a company, and those reasons vary from person to person. I find it more important to get feedback during exit interviews than to speculate about why someone is leaving. It is difficult to know after a person has decided to leave.

Third, and most importantly, recognize that attrition is an output. You know when it happens, but you can’t control it. It is too late. Instead, you should focus on what you can control. You should also recognize that there is usually a lag of at least a few months between the controllable inputs, the things you should focus on, and the output, which is attrition.

Besides cash and deferred compensation, look at how well your teams are executing. Are roles and responsibilities clear? Are decisions languishing? How quick is your and your team’s decision-making? How is execution? Is the team producing results? Is the work aligned to business outcomes? How are those business outcomes?

Does your team have a purpose and a strategy to realize that purpose? Does your team understand that strategy, and is the team driven by it?

Are you investing in the growth and development of the people that work in your team? Are you spotting growth areas for each of your team members and taking the time to invest in their development? Are you facilitating stretch goals to promote growth?

Are you learning? Is your team learning? Is the team continually improving their work? Are they leading the change?

Look at all these things and more.

I’m sure you will find plenty of reasons to improve. Those are what you can and must learn to control. Even then, people will leave your team or company. When they do, congratulate them. Hopefully, your leadership enabled them to grow beyond their current roles and hence new opportunities knocked on their doors.

Every exit gives you, the manager, an opportunity to improve the team structure and dynamics. Attrition loosens the team fabric and gives you a chance to reshape it. Don’t let go of that opportunity.

I’ve had people leave my team. Though I panicked once and wanted to retain someone on impulse, I eventually took the time to figure out what I needed for the next phase. That contemplation and analysis helped me morph the team charter to support our strategy, and hire someone to lead that strategy.

It is also common to see others stepping up when someone leaves their team. When this happens, you have to ask yourself what made you not see it before that person left the team, and what you should have done differently. Perhaps you didn’t realize that the person leaving had outgrown their role? Perhaps you didn’t notice that someone else was ready to step up but the person leaving was blocking them?

Look around, and you will find examples.

Finally, keep calm. Remember that we’re all temporary custodians at work. Just focus on leaving things in a better shape than when you started.

My Leadership Document — 2021 Edition

2021-06-11T17:33:58+00:00

Leadership is a loaded word. We attribute a lot to it; we expect a lot from it; we know when we see it, and yet, we don’t have a concise way to describe what it means to be a great leader due to its many facets.

I’m a student of leadership. I’ve been learning, observing, and practicing different facets of leadership for several years. In this post, I want to summarize how I view leadership currently, as of June 2021. I will describe a few core leadership beliefs that I hold and behaviors that I lean on and practice. These beliefs and behaviors are neither absolutes nor complete. They reflect my personality, what I have learned so far, my experiences, and the context in which I currently work and lead. They are subject to change as I continue to learn and practice this craft of leadership.

My Beliefs

Belief 1: Leadership is about being a better person

I fundamentally believe in John C. Maxwell’s point from his 2011 book 5 Levels of Leadership that “Leadership is much less about what you do and much more about who you are.” I read it 7–8 years ago and recently read the second half of the book.

I hold on to this belief as I believe that one must learn to lead themselves before one can lead others. As a person holding a leadership role, you face constant jugglery of tactics and decisions that impact others’ careers, growth, successes, and team outcomes. Unfortunately, a leader’s insecurities, opinions, bad habits, and ego come in the way of deciding what’s suitable for the team’s success and growth.

There were times when my insecurities and blind spots obstructed decision quality and positive change. My approach ended up controlling someone’s ideas once. I wanted to approach the problem space differently and ended up overshadowing that person’s enthusiasm. Later on, I realized my insecurities. There were more. I don’t think I’ll ever be out of the woods as I continue to discover behaviors that I must fix. Leadership is a journey, not a destination.

I’ve seen senior managers with a broad scope of large organizations struggle to lead due to their egos and insecurities. One particular manager wanted to take my spot in a technical forum because he felt entitled to that spot given his title. Another manager was hell-bent on getting his way and couldn’t stand the idea of his teams and others in the company taking a different way. Under his controlling leadership approach, the teams in his org got fragmented, and the strategies went nowhere. A different manager wanted to be in the driver’s seat but did not know where to drive. Finally, one particular leader’s strong beliefs on the charter of specific roles set back the careers of a few individuals by years. I’m sure most of you have stories to tell.

As John C. Maxwell writes in the same book, “If you want to become a better leader, you must not only know yourself and define your values, but you must also live them out.”

Upon my coach Janis Machala’s recommendation, I recently read Shirzad Chamine’s Positive Intelligence to discover a few of my saboteurs and lies I tell myself to justify some of my behaviors. I’m glad I did. Finding and dealing with one’s fears, egos, and poor habits is a challenging but necessary part of leadership growth. Such discovery must be a periodic exercise.

Belief 2: To create a high performing team, you must help others grow as leaders

This belief is a well-known leadership expectation, yet it’s often forgotten. You can’t create a high-performing team when you neglect team members’ leadership development. You need your team to grow to be better leaders for you to be a successful leader. Quoting again from John C. Maxwell’s book, “people development empowers the leader to lead larger.” Developing people is a force multiplier.

Easier said than done. It is why this belief is number 2 in my list of beliefs. People in leadership roles get busy day-to-day, and other than occasional feedback rituals, they don’t have enough time to develop their people. Consequently, most team interactions become transactions like meetings, reviews, plans, checkpoints, etc. Paradoxically, unless the leader carves out the time and does the hard work of developing their people, the leader won’t have time to scale themselves.

Furthermore, it’s not uncommon to see managers treating their team members as tools to create success. But, unfortunately, the same insecurities, egos, and bad habits that hinder personal growth also impede people’s development.

I must add that developing others requires investing in many other facets of leadership like listening, empathizing, coaching, inspiring, and even giving feedback and managing performance mismatches.

Belief 3: Leaders must set unarguable goals for the team they lead

I’ve been a practitioner of this belief for some time. I first came across this notion of “unarguable goals” in an undated (2009?) talk by Paul O’Neill on “The Irreducible Components of Leadership.”

In this talk, Paul O’Neill shares a couple of nuggets.

He starts his talk with the first nugget, “With leadership, anything is possible, and without it, nothing is possible.” I can not help but repeat this in my head over and over. Sure, there are many components to yield positive change, but leadership is among the key elements. The leader’s beliefs and behaviors can make or break a team.

The best part of his talk is where he introduces the notion of unarguable goals — “It’s necessary for a real leader to articulate what I call unarguable goals and aspirations for the institution that they lead.”

Once I heard this, I kept seeing examples. Consider the unarguable goals that leaders in history like MLK and Mahatma Gandhi set for the people they led. Imagine back in the 1920s a lean guy coming back from South Africa to India and setting a goal “I’m going to get freedom for this country.” “Crazy!” others might have said at that time. Establishing such an unarguable goal takes courage and conviction.

I’ve had the most professional and team growth happen when the goal ahead was challenging with no clear path to success and plenty of reasons to give up midway. These are the kinds of goals everyone would agree with, but most would remain skeptical about reaching those due to the hurdles involved. Even in my current role, when I picked up a particular goal last year, more than half of my team were not convinced about reaching the goal. That changed over time, and innovation kicked in.

Unarguable goals drive creativity. Under a genuinely inspirational leader invested in their team’s growth, people go to great lengths to innovate, solve challenging problems, and endure difficulties. While doing so, people develop the skills necessary to deal with ambiguity, obstacles, and setbacks. With the experience gained, they go on to solve even more significant problems.

Positive transformational change starts with unarguable goals. It is the leader’s job to set such a goal for the team they lead. In the absence of such a goal, leadership dissolves into busy work and the illusion of progress. As a result, people don’t get a chance to discover their potential. I recall a particular discussion back in 2016 when a database engineer said that “we don’t know how to run databases well in the data center, and we would never be able to run them on the cloud.” But today, teams at work run large online databases on the cloud in multiple regions. That’s because we had an unarguable goal to be on the cloud.

But articulating unarguable goals does not mean you make up goals that you don’t believe in and move on when the going gets tough. For example, a senior leader picked up an ambitious target for a metric for his org in a previous company. Large screens showed the progress in real-time. After a few months, the metric slowed down, that particular leader moved on to another pasture, and the screens disappeared.

On picking an unarguable goal, you must be honest to highlight brutal facts, invest in tackling the hard problems first, be grounded in reality, and continually persuade, inspire, and walk the talk. You should expect criticism and pushback from peers and your team members. You must show empathy and patience to talk to people who don’t believe in what you’re saying.

You might say that the particular problems your team is dealing with, the constraints that the team is facing, or the size of your team does not empower you to set unarguable goals for your team. I beg to differ. See my Behavior 2 below. Look across your team and work areas, and you will find opportunities to set unarguable goals.

My Leadership Behaviors

Let me share how I think of leadership behaviors at this time. These are my beliefs in action. These behaviors are far more grounded in my current context than my leadership beliefs. As that context changes, I expect to refine and tune these behaviors.

Behavior 1: Setting the Pace

I was reminded of this phrase recently, and hence it is at the top of my list. A crucial part of a leader’s job is to set the team’s pace. As we see time and again, as teams grow and change, organizational inertia often sets in. An issue first noticed by a team member could linger for weeks and months without a solution.

Setting the pace includes goal setting, tracking progress, timely decision making, continuous progress towards outcomes, hiring, addressing performance concerns, tackling lingering topics, anticipating issues, and asking lots of questions like there is no end. The leader and their leadership team must create a flywheel to keep things moving.

How might setting the pace appear in action? It depends.

It usually starts with a watch for lingering topics and chasing and resolving those as quickly as possible. Nobody likes to work in a team when problems remain for too long. By resolving lingering issues rapidly, you provide direction and a positive working environment for the team. When you let issues linger, you let apathy develop. People stop believing their leaders. They stop caring and move on.

It also involves some processes or rituals to check the pace regularly. Some time ago, one of my friends and an ex-colleague explained the idea of bringing information to you to scale better. There are several ways to do this. For example, in our team, we run a weekly portfolio review where we review all people-related issues like lateral movements, departures, hiring, balancing investments across different initiatives, etc. We have deep dives to go through strategic topics. Other areas require other rituals to bring information, uncover problems, and manage the pace.

Above all, setting the pace requires following your instincts and asking questions. Practicing this behavior requires coaching skills. My mantra is to try to ask five more questions when you think you’ve no more questions to ask.

Behavior 2: Watching for excuses

Nobody likes the word “excuse,” but the reality is that we may not realize when we’re making excuses. Where do excuses come from? There are several sources, but let me share some patterns.

First, constraints offer a great way to make excuses. Think of everyday examples: “we lost this particular person to attrition;” “we’re not able to hire quickly;” “we have such and such unmet dependency;” “our team is not trained to do this;” “a particular person is on vacation,” etc. Yes, you may wish these would go away and that everything would be great afterward. But you will only get new constraints to replace the old, mitigated ones. Recognize that leadership is a constraint management game. Of course, your parental instinct might tempt you to listen to those constraints and sympathize. But I believe that, as a leader, you can’t accept such reasoning as a given and must probe into why and let the team come up with multiple options when no option seems possible.

The second one is making decisions to suit convenience. When you are making decisions with incomplete information amidst constraints, I ask whether you’re making a decision because it is convenient (say, to avoid some difficult people, org or technical problem) or because you believe it is the right decision. Unfortunately, convenience can be an enemy of good decisions. Convenience-based decisions tend to come back and haunt you.

The third one is stopping at boundaries. As organizations grow and multiple teams form, people tend to stop at team boundaries and make excuses of why such and such is not working or slow or not meeting some other expectation. I often remind my team that stopping at such boundaries limits the outcomes we can get for the customer, but it also limits personal growth. When we stop at team or org boundaries, we grow apathy and sion. That’s because you are signaling your team that poor outcomes are somebody else’s fault.

Behavior 3: Using your leverage

The titles that come with leadership roles are not privileges bestowed on particular individuals. Instead, those are expectations on those individuals to use those titles effectively as tools of leverage to get things done. Unfortunately, people with such titles sometimes forget to put those titles to good use.

I recall one particular situation in a previous job when the team I was part of had multiple opinions about the future tech strategy. There was no clear plan, and people were debating for weeks. It was chaotic. One of my close colleagues pulled me aside and reminded me, “Subbu, I think you should be the one setting the direction. Why don’t you draft it down and share?” I worked on it for the next couple of weeks and presented it to everyone. I ended up using my role to align the team towards a particular direction and a path. I’m thankful for that reminder as we executed on that path to produce solid outcomes, though I took some arrows because not everyone agreed.

To this day, when someone brings a problem to me, I remind them that their roles and titles give them the power to solve the problem themselves.

Behavior 4: Following through with commitments

I won’t delve into this behavior beyond emphasizing that producing timely and quality outcomes is an essential leadership behavior. I believe that people in leadership roles grow as leaders when they help their teams make difficult results possible quickly. Likewise, when leaders take quality and timeliness seriously, teams develop creative options when they find obstacles. It’s a win-win approach.

Behavior 5: Sharpening your knives

Here is my belief. I work in the technology industry and deal with certain kinds of technologies. Therefore, I must understand the technologies in use, how we’re building our systems, their strengths and deficits, and trends in the industry. Moreover, I must ask good questions, immerse myself in details when necessary, and know what types of strategic bets to make for the future. Since there is a limit to how much I can understand and grasp, I must know what questions to ask. In other words, as a technologist, I must keep my technology knives sharp.

I know not everyone agrees. I’ve worked with managers who view themselves as above technology and details and consequently fail to ask the right kind of questions or set a direction for the team. I like Apple’s approach of experts leading experts. However, it does not mean that the leader must be a superior expert above the team. The leader must be able to speak the same language and be capable of zooming in when necessary. You don’t see tennis players coaching football teams, do you?

But watch for pitfalls. Managers with deep technical experience tend to chase details, offer opinions, and tell their team how to solve their technical problems. Nobody likes to be told. The trick is to curiously ask questions and let the team explore. Use your expertise to ask better questions to promote thinking but not dispense opinions.

Behavior 6: Picking up the hard parts of growing people

Unlike influential leadership, positional leadership puts managers in charge of precious assets: the people who bet their careers on their company and their reporting managers. In this era of opportunity in the technology industry, people choose companies and managers. Managers can nurture them, grow them, and help them do great things. Or, they can use them as tools, not care for their development, not help them make lasting contributions, and let them wither away and not succeed.

However, growing people does not stop with learning and development activities, providing autonomy, coaching, mentoring, etc. Those are all necessary things. It also must incorporate hard parts:

Continually stretching your team beyond what they think they are capable of and making yourself and your team uncomfortable every once in a while.
Setting unarguable goals for the team and challenging them with care and respect.
Providing timely feedback without sugarcoating.
Active performance management. When you let someone go with a development mindset, you do that person a favor by helping them discover a future elsewhere.

These are my leadership behaviors and beliefs as of now. As always, I’m writing these down as writing is clarifying. I’m sharing these, hoping that they might help someone on a similar journey as I am.

Stop Feeding the Monkey – Journaling

2021-03-21T15:19:02+00:00

Consider a few contemporary problems for tech workers.

Problem Number 1: Our attention has been fragmented for a while. We all have plenty of doom-scrolling opportunities on all our computing devices. As Cal Newport writes in his 2016 book Deep Work, “the rise of these [messaging, social media] tools, combined with ubiquitous access to them through smartphones and networked office computers, has fragmented most knowledge workers’ attention into slivers.” He then adds, “This state of fragmented attention cannot accommodate deep work, which requires long periods of uninterrupted thinking.” Right, but that’s just part of the problem. It affects mental health too.

Problem Number 2: Distributed offices and the pandemic have amplified the fragmentation – at least for most if not all. Along with everything, we “digitally transformed” even simple acts of getting information. What could be a turn-your-head-to-ask-a-colleague-a-question is now a meeting on both of your calendars or one more Slack thread. I recently joked with a colleague that “you can butt-type a slack channel name to find a real one that you’re already part of.”

People are spending more time “syncing” and “aligning” through their favorite communication tools. Just the other day, I had to set up two 15 min meetings with two different individuals in two different time zones to put two things together.

These acts further sliver focus and attention. Not “syncing” and “aligning,” on the other hand, breed anxiety as you stop knowing what is going on and feel left out.

Problem Number 3: Slivered focus and attention multiply work in progress. You never get enough information and time at once to finish solving one problem before the next issue or task comes up. The more fragmented your attention is, the more unfinished work you accumulate.

The hilarious part is when people share their screens in meetings. Many people’s browsers nowadays have tens of tabs open. Each tab is potentially some unfinished work in progress. You wonder whether they will ever read and process every tab, and how anxious they feel about not doing so.

Also, consider that work at the workplace is becoming increasingly complex. I have three or four slow-burning topics in any given workweek and at least one fast-burning fire to handle. These add to the load on an already fragmented mind, which has to juggle between all these topics, thinking, feeling, and planning, ad nauseam.

The Consequence

Likely as a consequence of such problems, I’ve gone through years of feeling completely drained and empty by Friday evenings and used to take most of Saturdays to recover. Then Sunday comes, and we start checking and firing off emails. We’re back in the war zone by Sunday evening or by loo time early Monday morning.

I don’t seem to be alone. Others seem to be facing similar problems. I often see tired and droopy faces at the end of any workday. People routinely admit to being busy and having a lot going on.

No wonder. At least in the tech sector, with stakes and rewards being high, there is little incentive to do less. Slowing down is the least attractive option.

We can talk about mitigating or avoiding these problems for hours, but these are not going to go away. I don’t see myself slowing down anytime soon. I don’t see myself reclaiming large parts of my calendar anytime soon. I keep trying and failing. There is so much unfinished stuff to do!

I’m not a neuroscientist, and I don’t know what this constant jugglery between unfinished tasks does to the brain. But I can equate the effects of this jugglery to “feeding the monkey” in the brain with an endless stream of entertainment. The monkey first gets excited with the stimulation and keeps jumping from topic to topic, but then eventually gets tired. I needed a way to stop feeding the monkey.

Stop Feeding the Monkey

Thanks to a tweet by Steve, I discovered journaling recently. I’m 31 days into daily journaling. So far, I’ve not had the usual Friday-drains.

At least once or twice daily, I dump all unfinished things and sort them in my journal. I try to organize my thoughts, feelings, and plans into various journals. Once I sort things out in a journal, the monkey has much less to do until new information, new threats, or some change in conditions.

Consequently, I’m much more detached and relaxed on even some of the most challenging days. On the days I fall back to old habits, the monkey takes over, and I’m less focused and more drained.

Essentially I’m using a journal to stop feeding the monkey. It’s a technique to let ideas and thoughts breathe. That’s my coping technique.

On Being Present

2021-03-07T20:23:14+00:00

The concept of “presence” eluded me for some time. Based on some muzzy ideas that I formed over time, I’ve associated presence with how others see me in one-on-one or group meetings. Several questions bothered me. Am I friendly enough or too nice? Am I critical enough or overly critical? Do people see me as soft and gentle and thus ineffective, or do they find me intimidating? Am I giving enough direction, or am I too vague or, worse, overly prescriptive? There is no easy way to know.

I’ve been inconsistent in my approach of being present with varying outcomes. I could get away as long as I was an individual contributor, but as I made my career change, I was left with a troubling realization that I have no formula for conducting myself, and I was “winging it.”

We associate presence with leadership. We attribute “being present” to being charismatic, energetic, confident, commanding, perhaps intimidating, witty, sharp-looking, steering the course, etc. I also thought that leaders were supposed to command others’ attention through “leadership presence.”

There is plenty of leadership mumbo-jumbo on the Internet to tell you that “if you want to be perceived as a leader, you have to have gravitas,” and that gravitas is “that certain something that makes a great leader.” There are plenty of kitchen sink articles like The 5 Cs of Leadership Presence and 12 Habits for Building Leadership Presence that further confuse and make the concept of presence distant from day-to-day lives for most of us.

Even Kathy Lubar and Belle Linda Halpern’s 2004 book Leadership Presence confused me when I first opened the first chapter with the title “Presence: What Actors Have That Leaders Need.” I had to read the book a second time to skim for the parts that made sense and opened my eyes. This book has some great nuggets, but I had to mine for those.

It took a few iterations for me to learn about presence. I realized that this concept is simple and yet fundamental. When combined with inquiry skills, this concept can be a very productive tool to influence and produce excellent outcomes. I’ve come to appreciate that what differentiates a poor interaction from a great interaction with others is your presence. Furthermore, the farther you get from real work in the corporate hierarchy, the more being present becomes the only effective way to lead.

As I realized that presence means nothing but being present, being in the moment, and not in the past and not in the future, parts of Kathy Lubar and Belle Linda Halpern’s book began making sense to me. As I re-read the book, nuggets like the ones below started appearing.

It means being present in the moment — focused totally and completely on what is happening right here and right now. It means, when you’re with people, giving them your full attention, so that they will feel recognized and motivated. (Page 18)

When you’re not present to the people you lead, it weakens their willingness to commit. (Page 18)

The key characteristic of that behavior, we think, is flexibility. It’s the willingness and ability to move and adapt freely as circumstances prescribe right now. (Page 51)

You must practice being in the moment with people. That’s the only way you can properly assess which role is appropriate to play. (Page 65)

(Here the role refers to one of “captain”, “conceiver”, “coach” or “collaborator” from page 63.)

Here is what it means for me to be present in practice.

When I’m talking to another person, I’m hearing what the person is saying, perceiving any emotions, and asking questions to learn more. I’m not thinking of what happened to me 5 minutes ago, what I’d like to do later in the day, or some other task or concern. I’m not looking at my watch, my phone, or my email, and certainly not checking Twitter or Facebook. I’m giving 100% of myself to the moment and the subject of my meeting with that person.

Similarly, when I’m in a group, I hear what others say; I’m drawing patterns, asking clarifying questions to seek information and validate assumptions, observing what people are saying and how they are saying it, and feeling the temperature of the room. Through these steps, I’m focused on what is happening at that moment in that group. I’m helping the group stay focused on the subject of that meeting by merely asking questions. Again, I’m not distracted by checking email or multi-tasking.

But being present is hard. I rarely get it right 100% of the time. Many things sabotage our attempts to be present. Our brains are like monkeys that wander across topics thinking, feeling, and planning. At least for me, my default state of mind is mental chatter. Furthermore, tools we use at work like emails and Slack messages add to our mental chatter. Consequently, being present takes effort.

Here are a few techniques I’m incorporating to improve my ability to be present.

Noting and tucking things away: I take a few minutes after most tricky meetings to note items down and tuck them away before jumping into the next thing I want to work on or my next meeting. This brief activity helps me declutter and calm my mind before I get to the next topic. Besides, I review my calendar the evening before and make mental or written notes about most of my upcoming meetings. This step helps me be better prepared for the impending clutter.
Inquiring: The second technique is to stay focused by asking questions and listening, and not assuming. You can ask open-ended questions to bring details out. You can ask questions to surface blockers and hidden assumptions. You can ask questions to stretch people out of their current comfort zones. When you think you’ve asked all the needed questions, consider asking a few more questions. Finally, depending on your situation, don’t hesitate to switch from humble forms of inquiry to confrontational forms. I highly recommend Edgar Schein’s Humble Inquiry: The Gentle Art of Asking Instead of Telling on the subject of inquiry.
Minimizing bookmarking: I think of bookmarking as the habit of accumulating pointers to information to read, process, and return to later instead of dealing with it in the moment. It is like opening a new tab in your browser and keeping it open in the hopes of getting back to it when you have the time. As tempting as bookmarking is, it gives you an escape pass from being present. First, I will never have enough time to read and understand everything. Second, even if I have the time, it’s irresponsible to assume that I can build the context and master every topic. Third, even if I have the time and can develop the context and mastery, it’s arrogant to assume that my deep understanding of that subject can help others do their job better.

Steps like these take practice and training to form muscle memory. I’m leaning on journaling to sharpen my approach. More about journaling later.

On Getting Promoted — Push vs. Pull

2021-02-21T18:23:04+00:00

I vividly recall a particular one-on-one conversation. Several years ago, I was getting ready for a one-on-one with my manager. I rehearsed my key talking points. It was about whether I was up for a promotion or not, and if not, why not. I had just built and launched a new project. The project got a lot of kudos, and I got moderate recognition. I led a small team to build it, and I designed and wrote critical parts of the code. I was super-pumped about how good I was. I thought I had “arrived.” The question that bothered me most at that time was how to get to the next level.

My one-on-one with my manager came and went. All I got was what I felt to be a general-purpose pep talk. He talked about “pushing yourself up” vs. “others pulling you up” and indicated that at the level I was at then, I needed to work on creating the pull. At that time, it felt unfair and outright wrong to be pulled up. It became clear to me much later.

I didn’t get promoted that cycle, or even the next. When I eventually got promoted a couple of years later, it didn’t start with a one-on-one meeting. My manager called me for my input into the case he was making for my promotion.

Promotions and growing to the next level are sensitive topics. For the people wanting to get promoted, anxieties increase as they approach the promotion cycle. As in my case above, people ruminate about this topic a lot and bring it up with their managers, only to get some vague-sounding reasons and inputs. You may end up feeling entitled with a reaction like, “I’m doing great, why are you not promoting me?” Or you may feel rejected, thinking that “this manager/organization/company is not valuing my work.” Alternatively, you may feel dejected, that you’re not worthy of a promotion and that you’re stuck. Unfortunately, these feelings are just saboteurs and may contribute to nothing but lowering your chance of getting promoted.

This topic is tricky for most managers too. Not every manager takes the time to identify and candidly share growth areas for their employees. Shit-sandwiching is not uncommon. Fearing conflict, managers may obfuscate critical growth areas behind vague platitudes and hints. I’ve had a chance to read some managers’ reviews that gloss over essential development areas so gently that you wouldn’t even notice what the manager is trying to imply. Consequently, you may keep building unrealistic expectations of your readiness. When that promotion does not happen, you may also harm team performance by feeling disengaged, rebellious, or hostile.

Many factors go into promotions. Besides being determined as ready, other factors include scope and budget. The scope can be vague, but it is usually determined by whether the person has a role ready and big enough at the next level to fill. In this post, I will focus on readiness and the types of problems you should be solving to get ready.

There are some great articles on this topic — google for “how to get promoted at work.” Most of those are right and can provide some useful guidance. But of all the tips and suggestions you will hear on this subject, the most crucial technique to develop yourself to grow into the next level is to solve ambiguous problems and not settle for well-defined problems with clear expectations and accountability.

But why?

Career ladders are not linear. Each progression from one level to the next requires acquiring a different set of skills. Earlier parts of your career depend mostly on improving the so-called hard skills. As this Wikipedia entry describes, these are technical and administrative “skills relating to a specific task or situation” and involve “understanding and proficiency” of “methods, processes, procedures, or techniques.”

You can acquire such skills by solving hard problems by yourself. Through learning and practice, you can polish your hard skills to produce outcomes faster and better with little or no supervision, and thus you can increase your impact on the team. This approach can help you get promoted a few times, perhaps relatively quickly. During this time, you will likely become comfortable pushing yourself to acquire hard skills, and people recognize you and will continue to lean on you to solve similar problems.

As you enter your comfort zone, the linear career progression through the acquisition of hard skills will stall. That’s when you will start to hear about “soft skills” like “influencing,” “conflict resolution,” “teamwork,” “empathy,” “communication,” and such vaguely described leadership traits when asked about getting promoted.

But how do you acquire soft skills?

Just like you acquire hard skills by solving increasingly tricky hard problems, you can develop soft skills by solving increasingly complex ambiguous problems.

Every company has plenty of ambiguous problems with the potential to transform and create significant opportunities. The problems are often messy (read my post on The Value is in Dealing with the Messy Stuff), and not everyone wants to touch those. Ambiguous problems have unclear or several paths forward, and every direction appears to have its pitfalls. Your opportunity is to create a path forward when no path seems plausible, for which you will need to be uncomfortable and demonstrate courage to take steps without knowing all the facts. You’re not guaranteed to succeed, and your approach and decisions could lead to failure. Such problems often fall in between multiple teams, and you will need to navigate teams, know and align people, and influence and persuade them to do certain things. People may not agree to the approach or trade-offs, and you will need to learn to deal with conflict and convince others. Since you won’t have all the information, you won’t make all the decisions. Thus you will learn to facilitate decision making, thereby enabling others to make decisions. As Tanya Reilly describes in her Being Glue, you will learn how to do glue work. Such glue work may require you to focus on invisible decisions: the meta-decisions you make to help others make decisions.

In other words, ambiguous problems are gold mines to acquire soft skills. If you want to get promoted, first learn to raise your hand to solve ambiguous problems. Look around at work, and you will find those quickly. When you find such problems, try to persuade others about why those are significant problems, and be open-minded about what they say. If such problems are not apparent, ask your manager or the manager’s manager to learn what’s holding back their organization and their key priorities for the team. Above all, don’t shy away from such problems. Don’t ask, “tell me what I should do.” The opportunity is for you to figure out.

As you succeed in solving such problems, others will notice. Someone or some group will want to pull you up to the next level so that you can help solve even more significant, more complex ambiguous problems. You will get promoted as people need you at the next level. They will pull you up because they need you at the next level. That’s what my manager was trying to explain years ago.

Here is my advice. Instead of letting the promotion anxiety distract you, focus on building and improving your hard and soft skills and be ready for that knock on the door. Raise your hand to solve problems before raising your hand for that next level.

You might argue that all this sounds ideal, that your organization is political, that promotions don’t work like that, and that you need to impress your manager and that manager’s manager, and so on. I can’t speak for every company culture, but I ask you to examine if your opinions about your organization are your saboteurs.

2020 — A Year of Privilege

2020-12-31T23:06:50+00:00

2020 was a weird year. Many things were different twelve months ago. At first, there was the pandemic. Before we even realized what the pandemic meant and the work-life changes that were yet to come, the travel industry (I work for a travel company) took a big blow along with many other sectors. It felt that there was no end to the uncertainty and need for change at work and home. Then add the social and political turmoil that we all went through. Millions of people lost their jobs. Some businesses may not come back. The manager of a food distribution center I volunteered at a couple of months ago told me that they serve several times more people this year than last year. A World Bank report says that the pandemic “is estimated to push an additional 88 million to 115 million people into extreme poverty this year.”

But as I introspect on my life and look around at the lives of people I know, things seem just fine. Indeed, everyone went through changes. Despite some inconveniences, most seem to have adapted well. Nobody stopped buying things. Thanks to millions of gig-workers that we don’t mind exploiting, we’re all getting what we need and want conveniently delivered to our homes. People are meeting in small bubbles and enjoying their lives. Everyone baked, and I did too. Folks are investing in their health, wealth, and well-being. For the most part, stock markets did well. Per CNBC, the top seven tech companies added $3.4 trillion in value in 2020. Many in the tech industry got richer.

How come?

Did we make all the right choices at the right time to prepare for a year like this? Maybe those who did not work hard and failed to make the right decisions are paying the price? It seems convenient and comfortable to think so.

Or is it because of the privilege we accumulated over the years through all the opportunities provided to us?

As I introspect my own life, circumstances, opportunities, and choices, I’ve no reason to believe that I worked hard and made all the right choices at the right time every time.

I must admit that I’m privileged. My gender, who I was born to, the schools I went to, the companies I worked for, and the people in my personal and professional life slowly but steadily contributed to my privilege over the years. I can’t wash off this privilege with some year-end charitable donations, or some pithy words of wisdom.

That’s my lesson in 2020.

Jumping Across Career Ladders

2020-11-16T20:27:09+00:00

I went through a career reboot earlier this year. All these years, I grew in my career through the individual contributor (IC) ladder. Sometime during 2019, I realized that it was enough. I wanted to reboot my career and create more opportunities to grow as a leader by switching over to the manager career ladder. Thanks to certain people who believed in me, I got an incredible opportunity to build and lead a new team at the beginning of this year.

I kept quiet on this blog till now, mainly because my mind was as clear as mud for several months. The journey was rocky at first, and I had to retool how I work with others, strive for outcomes, and lead. Though I expect to continue this process, my head is less muddy than it was even a couple of months ago. Many people, including the people who report to me, helped with this transition, and I can’t be grateful enough to them. I’m also thankful to the people who doubted my readiness, as it helped me understand why.

Most people who manage others start small earlier in their careers and slowly grow to manage other managers and managers of managers. As I skipped these usual stages, I got the luxury of making certain mistakes and developing perspectives that experienced managers may have forgotten about. I want to share my perspectives in this post as I realize that I’m not in the majority and that others may find these useful.

Some of what I share below may sound discouraging for those in the IC ladder, but these are important considerations no manager would tell you. Let me also be clear that I’m a sample size of one, that others may not share my perspectives, and your mileage might vary.

IC and manager career ladders are not equivalent.

IC career ladders are not new in tech companies, and more and more companies seem to be clarifying and establishing separate career ladders for ICs and managers. These ladders clarify competencies, expectations, and differences between different levels in each ladder. For those looking to understand their career progression possibilities, such ladders provide at least a theoretical direction.

Some companies end their IC ladders at roles like Principal Engineer and Senior Staff Engineer, while larger tech companies tend to extend these to Distinguished Engineer and Fellow roles. Career ladders, when published, highlight corresponding manager levels. For instance, a Principal Engineer level might correspond to a Director level, and a Distinguished Engineer level might correspond to a Senior Director. A Fellow might correspond to a Vice President. But these vary from company to company.

As you move up along the IC ladder, you get a perspective of big problems worthy of solving, plenty of access to senior leaders and organizational resources, a seat at the table to influence impactful strategic decisions, and a latitude to work across org boundaries. These are perks that people in the lower levels of the IC ladder or even the manager ladder can only dream about. Besides, nobody will stop you from exploring different problems.

People at the top of this ladder usually have huge followership; they have done a lot to help grow the company and are well respected. It is also common to groom and promote internal candidates into these roles instead of hiring from outside. When someone leaves at this level, the position may not get backfilled at the same level.

However, these roles don’t go on forever. First, IC roles reach a plateau sooner than managerial roles. Your career progression stops upon reaching the senior-most IC level. This may be acceptable for you as long as you retain the flexibility to solve your company’s most impactful or niche problems just through influence.

Furthermore, IC career ladders tend to run steeper than the corresponding managerial ladders. Just count the number of individual contributors and managers at the topmost IC level and the equivalent managerial level in your company. Fewer and fewer get to climb to the top, with most getting stuck in the middle. Although this is generally true for managerial roles, manager ladders seem to suffer less from this steepness.

As I began to realize that I’m leaning more on influence than on specialized skills to get things done, the fanciness of the IC ladder began to fade away, which brings me to my second point.

To be successful as an IC, you must be a great influencer.

I can’t emphasize this point enough. If you are an IC and would like to continue to grow in your career and do impactful work, you had better learn to like people and learn to influence them. Don’t expect to continue to grow in your career by being the fastest or the best coder in the team. Even your deep domain knowledge acquired through working for years on the same set of systems might not help you keep going. Eventually, others below your level will do better than you do. The sooner you learn to scale yourself through influence, the better for you. I sometimes run into ICs who want to code and don’t want to deal with people. That may appear fine, but be aware that such an attitude will limit your career growth.

The single most important piece of advice I give to ICs is to learn to solve ambiguous problems. Ambiguous problems force you to learn to influence. It’s common to equate dealing with people and influencing them with “messy politics,” and I can tell from experience that it is a mistaken and counterproductive perspective.

In my case, all the impactful problems I could solve in recent years were entirely through influence. But influence is a slow process. While it may take several weeks or a few months for you to go deep into a problem and find a way to solve it, depending on your organizational culture and structure, you may have to spend a lot more time influencing and mobilizing others to execute the solution.

Moreover, as you go up the IC ladder, you will find that individuals and teams within the existing organizational structures can already pick up all the easier problems. Consequently, the problems you pick up tend to be ambiguous, with no clear ownership and accountability. Solutions may not fit well within existing structures and processes. Therefore, you need to have patience and tenacity to use your influencing skills to navigate organizational boundaries to align people.

IC roles are also consultative.

Influencing is a great force multiplier. When successful, you will find others preaching your ideas and approaches and multiplying your impact.

However, the flip side of leading through influence is settling for consultation. As you get better at influencing, you may find that your role becomes increasingly consultative in nature. You will often get asked for inputs, and those inputs often count. However, inputs are not decisions. Due to organizational or business considerations that you have limited control over, your inputs may not see the light of day. Accountability for outcomes usually lies with managers. Consequently, many key decisions, including hiring, team structures, objectives, decisions to start/stop projects, etc., lie with managers.

Also, no one is obligated to follow you and line up behind you unless they see value in what you’re proposing. Consequently, even when you succeed in influencing people across organizational boundaries, people may decide to follow their managers and not you in case of conflict.

These ladders diverge as you move up, making it harder to jump.

IC and manager ladders diverge as you move up. Opportunities to jump from the IC ladder to the manager ladder become few and far between as you move up the IC ladder. The longer you stay on the IC ladder, the harder it is to jump to the manager ladder. Jumping is possible, but you need to be patient, and you need sponsors who believe in you and clear obstacles for you.

Finding such sponsors won’t be easy. Even the people that worked with you closely in the past and leaned on your technical and influencing abilities to get major projects done will hesitate to take the risk of sponsoring you or giving you the opportunity. They may believe, rightfully so, that you are not ready or that you may not succeed as a manager. I’ve had this happen to me as well, and understand the risks involved. There are several valid reasons, but let me cite the three that I found most important.

First, your organizational leaders would have no way to judge how you would react in difficult or emotionally challenging situations. To give you a simple example, consider the difference between providing poor feedback about someone to their manager and acting on such poor feedback about someone you manage. It’s easier to give feedback than to take timely and constructive actions on people you manage. Others have no way to know how you would handle such situations.
Second, leading through influence as an IC is different from managing. As a manager, you will have to learn to use managerial leverage to improve organizational performance. For example, as an IC, you will tend to get down to work and develop one or more ways to solve the problem when you find an interesting problem. But as a manager, your job is not to solve it yourself. Your job is to leverage the people in your team, create a sense of why it is important to solve the problem, orchestrate means for the team to figure out how to solve the problem, and then set up supporting mechanisms to ensure that the problem gets solved.
Third, in case of failures, managerial roles have a larger blast radius than corresponding IC roles. It is easy to contain or repair any damage that an IC can cause to an organization. You can contain the repair to just that individual. But containing or repairing an ineffective, or arrogant, or know-it-all manager’s damage can be difficult. You may have to rebuild the entire organization.

There are more. As an IC, you may have theoretical knowledge of such differences and situations from reading books and observing managers. You may have hypothetical solutions to tackle challenges that managers need to deal with. But without a track record and muscle memory formed through trial and error, your chances of failure are high.

As they say, the grass is greener on the other side. Both roles offer you a way to learn and drive change. But, if you’re an IC who wants to jump to the managerial ladder, be patient. Continue to develop leadership competencies like influence, creating paths for ambiguous problems, dealing with conflict, etc. Even if you decide to stay on the IC ladder, these competencies will take you farther than your domain skills. As Tanya Reilly says in her Not all engineering leaders are engineering managers, “many of the leadership skills that make for good managers (e.g., setting a clear direction, caring about other humans, understanding the business, building consensus, communicating ideas) are also the ones that make stellar senior, staff, and principal engineers.”

Lessons from 2019

2019-12-31T17:49:08+00:00

As I look back into 2019 and prepare myself for 2020, I’m proud of two things. First, leading through influencing with little positional power. Second, further polishing my skills to deal with ambiguity. Not unlike other years, there were ups and downs along the way, with abundant opportunities to lead, make mistakes, observe others making mistakes, and learn from those. Here are my top seven lessons from 2019.

Lesson 1: Don’t romanticize about what you want to build and how you want to develop it

Big ideas are essential to motivate, inspire, and energize. There are many examples at workplaces. Consider, for instance, re-platforming your code to follow some new-found design principles, or building a new special-purpose logging and monitoring platform, or adopting a contemporary architecture using the latest and greatest tech. However, such ideas are just untested hypotheses, and may or may not solve your customer problems. So instead of selling what you want to build and how you want to develop it, focus on outcomes, and find ways to test your hypothesis by working backward from those outcomes incrementally. Romanticize about the results and not ideas.

Lesson 2: Have a point of view on what to standardize and what not to

Standards are a double-edged sword. On the one hand, you can cripple innovation by standardizing a lot. Excessive standardization can also lead to a culture of dependency and mistrust. On the other hand, standardizing little creates a wild west. You get agility early on but won’t be able to scale the organization due to limited leverage, lack of repeatability, and lateral moves of people.

However, over the last several years, cloud providers have entirely disrupted how organizations have been standardizing their infrastructure, developer platforms, and all the enabling systems. It’s time for a new way of thinking.

Here are three rules of thumb to decide the kinds of things you want to standardize while leaving the rest to teams to determine as they see fit.

First, identify architectural invariants: These are principles you want to follow that remain mostly unchanged with technology. Examples include:

Protocols and interfaces (APIs) between producers and consumers of the business capabilities, including security and access controls
Rules for making tradeoffs when building software, for instance, between optimizing a process for customers and partners, balancing between availability and security, or latency and availability, etc.

Second, make common agreements: These are conventions, tools, and practices you want to keep primarily aligned. Examples include your network designs, policies, and tools for managing costs and security controls, data locality, etc. Evolve these as paradigms change but keep mostly common for everyone.

Third, know where you want to foster rapid change: These are areas of differentiation for your business. If there are tools already available in the industry for the agility of these areas, adopt those quickly as standards to unlock agility, but don’t force standards onto disparate problem areas. For instance, your front-end teams may want to use a particular CI/CD system while your data science teams may use a different approach for their workloads. Identify additional common agreements (2) needed to balance between flexibility and commonality.

Lesson 3: Be comfortable about not having an opinion

It is a typical leadership mistake to lead discussions with strong opinions. I’ve seen very senior leaders start with their views. Strong opinions can cripple dialog and push the org culture into a box. Strong opinions are also an indicator of an “I know it all” mindset. As I wrote in the past, organizations may start or suspend significant initiatives based on opinions of positional leaders with titles at work. Eloquent opinionated individuals take over conversations in meetings to steer the course to conform to their views. Senior leaders’ names get dropped to shift the course of an activity or to override some data.

It takes courage and trust in the process to leave opinions outside when entering discussions. Yet, it opens options you may not have considered before.

Lesson 4: Take the time to form mental models of how things work

Related to lesson 3, this is another lesson I learned this year. There are cases where you need to build a point of view and make decisions to move forward. In one particular case, after tackling a problem in one context, I had the opportunity to produce similar outcomes in a different context. Instead of starting with the same recipe, I talked to several individuals closer to that context, asking them what they thought the issues and challenges were, and how they would address those. This listening approach gave me a ton of insights to develop a point of view on how to approach the problem. It also helped me garner support for the approach.

Lesson 5: Choose opportunity over fear

On any day, when making decisions, choose opportunity over fear. It is not uncommon to come across arguments based on fear and uncertainty of competitors, third parties, and other entities. Concerns about vendor lock-in are an excellent example of a fear-based approach. When presented with such arguments, ask questions to turn attention towards opportunity. Opportunity-based discussions usually uncover more options than discussions based on fear and uncertainty.

Lesson 6: Resist status management

When going through change, be okay with not having a status in the change and yet be willing to influence or even lead the change. I was fortunate to work closely with some individuals who demonstrate this ability so well. However, the most common tendency is to focus on how you fit in the change and not on what the change is about. Read more at Status Management.

Lesson 7: Build resilience not to get rattled

Finally, know what you do well, and invest in yourself to build resilience not to get rattled. Interactions at workplaces can and do shake us. Not every meeting or email goes well. Workplace interactions are often based on limited information and lack kindness and empathy. When faced with such situations, I remind myself that the stuff is usually happening around you, and not to you. It helps to observe, explore a few mental models to understand what’s likely happening, and build personal resilience by selecting beneficial mental models. One of my career goals for 2020 is to further master this skill and more importantly teach others how to improve their personal resilience.

Studying an Incident

2019-12-30T15:36:18+00:00

It is not often you get an opportunity to study an incident to illustrate a few lessons. A recent incident that I describe below teaches three key lessons:

There are multiple perspectives on what happened and how to improve. The more complex the system is, the more perspectives you’re likely to discover.
Asking for what went well and how things worked, instead of just asking about what went wrong, opens possibilities for improvements that you would otherwise miss.
Resilience is what people do, and being resilient likely involves doing things you’ve not done before.

Below is an approximate representation of the incident timeline with certain notable events. I’ve omitted a few partially explored parallel paths.

Multiple Perspectives

First, notice that there are at least four separate perspectives of this incident.

Team A that operates the shared compute cluster:

“We should have checked before deleting the cluster … We should avoid manual deletes like this … We should create smaller clusters to reduce the blast radius …We should chaos test this …”

Team B that owns the apps in the critical path:

“They (team A) did what? How could they? … Can we bring up these apps quickly in the second region? … Who knows the steps? … Who do we need to verify?”

Team C (external) that runs the throttled AWS services:

“They (team A) did what? How could they? … How do we (recover from the throttle)? … What is the risk to other consumers of those APIs? …”

Team D that is orchestrating the events on the incident bridge:

“How is the rest of the site doing? What else went wrong? … Can we shift traffic to the other region? Why not? Have we tested this before? … Do we have a list of apps running on that cluster? … Who do we need to call?”

Such perspectives show that the narrative of the postmortem report can vary based on who is writing the postmortem. Collecting all these perspectives, reconstructing the incident, and identifying potential corrective actions, may take separate interviews with each group. A single live large gathering of all the parties involved in a post-incident review meeting may not uncover all these perspectives. A typical operational review with a senior titled person at the head of the table will undoubtedly miss most of these perspectives, and the discussion will most likely focus on what that person feels everyone should do.

What Went Wrong vs. What Went Well and How

A common practice in conducting incident postmortems is to ask “what went wrong,” then ask a series of whys to find why that happened, and then identify future corrective actions to prevent such an incident from reoccurring.

Such an approach, in this example, will lead to the first event in the incident timeline, which is when an operator made a change to delete a compute cluster. Subsequent probing would discover that the operator made an incorrect assumption, that the operator did not make sufficient attempts to verify that assumption, and that the change was not peer-reviewed.

Consequently, you would identify potential corrective actions like the following:

Automate all deletes to include validation checks
Peer-review all manual changes
Avoid creating large clusters, to reduce the blast radius
Chaos test cluster rehydration

However, asking for “what went well and how” would uncover a different set of events and potential corrective actions.

Recall from the timeline that the deleted cluster was hosting hundreds of applications, most of which were customer critical. But why was the impact not broader than what was noticed? What protected the rest? This line of questioning would uncover the following:

Most of those apps are redundantly deployed in a second region on a similar compute cluster in an active-active configuration.
The traffic management layer automatically routed the traffic to those apps after the apps hosted on the deleted cluster became unavailable.
The compute cluster in the second region scaled up automatically to support additional traffic.
However, a few apps, including the ones found to be in the critical path of the impacted customer segment, were not redundantly deployed in the second region.
The deployment system and the post-deployment configuration and validation checks were time-consuming, which prevented the quick deployment of those apps in the second region.

In other words, a certain amount of robustness was built into the architecture, which helped limit the impact.

You would then come up with a different set of action items from those identified in the “what went wrong” approach.

Ensure that all critical apps are deployed in at least two regions. Q: How do we know what is critical?
Identify all critical apps. Q: How do we keep it up to date?
Periodically test automated failover between regions.
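The first action item can be turned into a periodic check. Below is a minimal sketch in Python, assuming a hypothetical app inventory in which each app carries a `critical` flag and the set of regions it is deployed to. The `App` type, the app names, and the region names are all made up for illustration; the questions of how to decide what is critical and how to keep the inventory current remain open.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    critical: bool
    regions: set = field(default_factory=set)


def under_replicated(apps, min_regions=2):
    """Return sorted names of critical apps deployed in fewer than min_regions regions."""
    return sorted(
        app.name
        for app in apps
        if app.critical and len(app.regions) < min_regions
    )


# Illustrative data only, not taken from the incident.
inventory = [
    App("checkout", critical=True, regions={"region-1", "region-2"}),
    App("payments", critical=True, regions={"region-1"}),
    App("internal-wiki", critical=False, regions={"region-1"}),
]

print(under_replicated(inventory))  # ['payments']
```

A check like this only reports what the inventory says. It does not replace periodically testing the failover itself.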

This example shows that probing for both “what went wrong” and “what went well and how” is essential. Each contributes to improving your understanding of the complexity of the system and how things work, and to identifying potential corrective actions. Extending this line of thinking to all the perspectives listed in the previous section further enriches this understanding.

Resilience is What People Do

Finally, this incident also illustrates that resilience is what people do. This incident involved opening up minds to alternative explanations, conducting parallel streams of analyses, adapting to new information as it emerged, and attempting things not done before. Each major component involved in this incident behaved as it was supposed to. Regardless, the system as a whole faced a catastrophe, and it took human coordination, quick thinking, and ingenuity to recover from the disaster. In fact, what is common across incidents is how people come together to restore the system, with the rest being unique to each incident.

As John Allspaw says:

Though such a distinction between robustness and resilience seems nuanced and pedantic, understanding and appreciating the difference can lead to vastly different thinking, investments, and outcomes. For example, not recognizing this difference may limit you to focusing only on software solutions like the following.

Improving observability, so you’re quick to know and learn about the dynamic behavior of the system.
Building or adopting closed-loop automation systems — These are solutions that follow an act-observe-correct pattern in a closed-loop to automatically remediate components of a system when those are observed to be drifting from their desired state.
Adopting cloud-managed services for their availability and more predictable behaviors, to replace do-it-yourself solutions.
Reducing blast radius to contain faults and to minimize coupling with techniques like circuit breakers and fall-backs
Traffic shifting and shedding for quick recovery
Chaos testing the system by subjecting it to various failure hypotheses
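To make the act-observe-correct pattern from the list above concrete, here is a minimal sketch in Python, assuming the desired and observed states are simple mappings from a component name to a replica count. The function names and the `scale_up`/`scale_down` actions are hypothetical, and a real system would observe live state rather than update a dictionary.

```python
def reconcile(desired, observed):
    """Compare observed state with desired state and return corrective actions."""
    actions = []
    for name, want in sorted(desired.items()):
        have = observed.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    return actions


def apply_actions(actions, observed):
    """Pretend to act on the system by returning the updated observed state."""
    updated = dict(observed)
    for kind, name, count in actions:
        delta = count if kind == "scale_up" else -count
        updated[name] = updated.get(name, 0) + delta
    return updated


def control_loop(desired, observed, max_rounds=3):
    """Observe, correct, and repeat until no drift remains or rounds run out."""
    for _ in range(max_rounds):
        actions = reconcile(desired, observed)
        if not actions:
            break
        observed = apply_actions(actions, observed)
    return observed


# Illustrative states only.
desired = {"api": 3, "worker": 2}
observed = {"api": 1, "worker": 4}

print(reconcile(desired, observed))
# [('scale_up', 'api', 2), ('scale_down', 'worker', 2)]
print(control_loop(desired, observed))
# {'api': 3, 'worker': 2}
```

Note that such a loop can only correct drift it was designed to observe.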

These are all essential investments. But such steps alone will not let the overall system, which includes people, the tools they use, and the rituals they follow, build the capacity to adapt to changing conditions during a catastrophe and to learn from those conditions. You need to go beyond to include the following to build resilience.

Understand how humans interact with the system and the assumptions they make when operating the system.
Avoid language that prevents dialog and discovery during and after incidents.
Move away from root cause hunting to understanding how the system works, improve your mental models for the system, and use that understanding to invest for the future.
Ensure that people with titles are also learning. Most in such positions likely have dated understanding of how to build and operate complex production systems. However, since such people set the tone of the culture for their teams, it is vital for them to also learn in this process.

In Summary

For any company generating value from its production systems, complexity is a moat. Despite our attempts to shuffle the complexity through automated layers of abstractions, complexity is the natural state of real-world systems. Each layer we add and each change we make changes our assumptions about how the system is supposed to work, and how it is actually working. Incidents provide an excellent opportunity to validate those assumptions and discover ways to improve. The practice of continuous learning from incidents must be an integral part of your operations culture to build resiliency.

Thanks to Willie Wheeler for reviewing an earlier draft of this post.

Retinitis Pigmentosa

2019-12-23T02:33:32+00:00

Nature and evolution offer some astounding patterns. Imagine a genetic disorder that affects males (sufferers), while their female siblings (carriers) carry the same disorder to pass on to their male children to suffer from, and to their female children to carry to the next generation. In this pattern, females (carriers) show no signs of this disorder, sufferers pass their mutation on through their female children, and male children born to a sufferer escape this affliction. Some male children born to a carrier may escape as well.

Retinitis Pigmentosa is such a genetic disorder that causes gradual loss of vision. There are three known variants of this disorder, one of which is transmitted from generation to generation through X-linked chromosomes. This X-linked variant exhibits the pattern I described above. My description of this pattern is based on empirical observations and is not scientific.

That’s what has been ailing our family tree for generations. No, I’m not suffering from this, because I’m a male child born to a sufferer. My father is a sufferer who lost his vision during his 30s and yet managed to have a rewarding career teaching accounting to undergrads until his retirement. He relied on the family to read from books, and to prepare lecture notes.

What’s even more astonishing is the story of one of my ancestors, Chilakamarti Lakshmi Narasimham, born in 1867. He writes in his autobiography that he had difficulty reading in school or playing in low light in his childhood, and lost his vision entirely during his adulthood. Yet, he lived an active life as a prolific playwright and novelist in the Telugu language. He was also a social reformer and was active in the movement for India’s freedom from the British. He wrote several original plays, stories, and novels. He even started and ran magazines and translated plays from Sanskrit. He wrote his autobiography, published in 1942, at the age of 75 by dictating the text to two scribes. His books are bought and read even today.

I came upon his autobiography just this week. While reading this book, I learned that his maternal grandfather, who was not born blind, also gradually lost vision by his mid-30s. This finding puts the earliest occurrence of this disorder in our family tree in the early 1800s. Any documented or oral evidence of this disorder beyond this is likely lost. Even my father didn’t know that his ailment is a genetic disorder until the 1990s. Most doctors he saw treated him for Myopia (nearsightedness).

There is still no cure for this, other than a carrier female deciding not to have children. Prospective experimental treatments cost a lot and are accessible to just a few.

In an era of plentiful platitudes, it is easy to say that, despite such disorders, you can still accomplish a lot in your life. My father’s resilience is an example. Yet, having watched several sufferers (my father and his brothers), I know that life without vision is tough. The rest of us who can see don’t exactly make it easy for those who cannot see.

Why Learn from Incidents

2019-12-17T20:33:44+00:00

Resiliency-related discussions usually delve into so-called “resiliency practices” like circuit breakers, bulkheading, and timeouts, followed by monitoring gaps, then release safety practices, fault-containment patterns like sharding and redundancy, and even chaos testing. Sometimes, these discussions also digress into concepts like “auto-remediation” and “self-healing.” What rarely comes up, though, is any question of learning.

This absence of learning from incidents in such discussions is not surprising. The number of people in the IT industry who realize the role of learning from the success and failure of production systems is still small. Even for the die-hard pager-warrior teams, “learning from incidents” is an esoteric concept. Safety is a relatively new topic in this industry.

Moreover, most work cultures want you to demonstrate bias for action and not “understanding” or “learning.” After your team recovers your system from a production incident, you’re often measured by how quickly you take the next steps, which include publishing a postmortem, determining action items, and finishing those quickly. “Wait, I’m learning” is not an expected answer. In our work cultures, we suppose learning to be automatic and implicit, and not something to talk about or explicitly do.

But why learn from incidents, or more generically, from working or failing production systems? Why is it relevant? Let me share a few reasons why, based on my personal experience.

First, learning from incidents helps close the gap between how you imagine the system to be working (the “as designed” state), and how it is working (the “as it is” state).

We use a variety of mental models to explain how complex production systems work. We form those mental models based on what we know about those systems. The inputs for these mental models include documented designs, code, configurations, metrics, monitoring charts, and other artifacts. However, as our production systems undergo change, and as they age, our mental models become rusty and drift from reality. What was true about a system six months ago or even six days ago may not be true today.

Consequently, our understanding of the “as designed” state remains incomplete. Moreover, each person in the team may have a different understanding of the “as designed” state of the system. Team members use their incomplete understanding to make further changes, thus potentially compounding the gap.

Incidents allow you to validate or even dispute your assumptions about how you imagine the system to be working. There is no better way than to learn from incidents to validate or dispute assumptions and bridge silos of understanding. By walking away from an incident after restoration, you lose that opportunity.

In other words, incidents provide a feedback loop to correct and improve your understanding. Such improved understanding can help you improve the system.

Not learning from incidents is like running production systems in an open loop with gut feelings and blind faith.

Second, learning from incidents grounds you in the realization that resilience is more than a technology problem. Most of us routinely use terms like “resilience,” “robustness,” or even “self-healing,” and “auto-remediating” interchangeably. However, until you start to learn from incidents, you may not realize that people, processes, and culture are part of the system and play a vital role in keeping the system resilient.

As John Allspaw writes in Taking Human Performance Seriously, “the expertise and adaptive capacity of engineers is what keeps serious incidents from happening more often, and what keeps incidents from being more severe than they are.”

But how does this observation help improve systems that are currently falling over often?

When you’re dealing with such a system, you can’t just rush to using tech solutions alone. If you do, you might soon realize that your approaches are not working and that you must influence the culture first. Let me give you a couple of examples.

Imagine a work culture that insists on finding “the root cause” within a certain number of hours after an incident. Over time, teams in that culture get used to writing shallow postmortems to point to a “cause” so that they can move on to other work. The same happens in work cultures that insist on “five whys.”

Learning is non-linear, whereas the “five whys” approach makes you explore a linear sequence of events to find a purported cause. If you instead create a mechanism for the team to discuss what happened and what could have happened in an open format, they might walk away with a better understanding of how their system works and what they could do to improve it.

Or imagine a work culture that uses a metric like “revenue loss” to measure the resilience of production systems. Teams in that culture get used to ignoring all other performance indicators as long as revenue loss is negligible. Consequently, they develop apathy towards issues that don’t result in revenue loss. Fixing such apathy then becomes a leadership challenge and not a technology challenge.

Though there is no cookbook for learning from incidents, recognize that it is a group learning activity. It involves sharing what you know, testing your assumptions, and adjusting approaches. Unlike individual learning, group learning allows for possibilities for critical thinking and exploration through dialog. Furthermore, it is not sufficient that the learning start and stop within the dev and operations teams. You also need leaders within the organization to foster the learning mindset for resilience.

On Public Speaking

2019-10-26T07:08:20+00:00

Public speaking is one of the most uncomfortable things I do. Though I’ve spoken occasionally over time, despite knowing the subject and having done the work to earn the stripes to speak, I’ve always dreaded the experience. My usual speaking recipe used to be as follows — prepare some slides just days before the talk, think through some talking points, and show up behind the podium with no other form of preparation. The ideas that I originally had often ended up becoming too complicated or too flimsy to articulate well in the allocated time. Add my introversion and my imposter syndrome to the list.

I know I’m not alone. A lot of us struggle with public speaking. Most don’t even try for fear of failing. Public speaking can be intimidating and stressful. Though I don’t claim to be an expert public speaker, I want to share the single most important lesson I learned this year.

I began to take some steps last year. Initially, I watched several videos of other speakers and TED/TEDx talks. I also worked with a speaking coach for a few months. The coach made me realize some common mistakes of body language, tonality, breathing, the pace of delivery, etc. We also recorded some mock talks. Watching those was a terrible experience. The most important benefit of working with a coach, though, was to receive instantaneous feedback.

Yet, it took ten more talks and about a year to find a working formula. Here is the most important lesson I learned.

Build and own your plot first. Until this year, I used to prepare slides first. Nowadays, I take a more contemplative approach that does not start with slide-making.

Build a storyboard on a few sheets of paper or a text file. Each entry on the board includes an idea and a few talking points. Starting on a few sheets of paper or a text file helps focus on the plot instead of colors and fonts in PowerPoint, Google Slides, or Keynote.
Shuffle the order of the entries in the board into a plot with a beginning and an ending. Keep refining the order and talking points until the flow is linear and cohesive. Through this process, simplify the plot so people can follow you. A few twists and turns are okay but avoid disjoint ideas as they will make you struggle through transitions, and will confuse the audience. Moreover, the order should be natural for you to tell.
Only then translate the storyboard into slides. Use pictures and diagrams, with as few words as you can. If you’re using text, prefer large fonts. This approach will help the audience focus on you and not the screen.
Save the slides into images, and insert those images into a document. Then type your talking points after each slide. See one of my latest talks for an example.

Try to spend the most time on the last step. This step is your playground to refine your plot. The beginning of your script should invite the audience into your plot. I wouldn’t worry about narrating your life story or how great the company you work for is unless those facts are part of your plot. Give some hints about your plot in the beginning. Also, take the time to summarize your key takeaways at the end.

This essential step helps you form muscle memory. Muscle memory helps you avoid looking at your slides or speaker notes when speaking. It frees you up to move on the stage and be yourself, and not remain glued to the podium. The act of writing down the script also forces you to think and clarify your points. It allows you to try various options to narrate your plot in your own words. Don’t skip this step unless you’ve given the same talk before.

I learned a few other lessons as well.

First, don’t get intimidated by those who speak well. There are so many articulate, confident, and persuasive speakers. But remember that they did not become as such overnight and may have gone through similar difficulties. Observe how they speak, but don’t feel threatened by their skills. Instead, stay humble and focus on your journey to improve.

Second, be open to feedback. Invite others to give you feedback. Let others tell you about your filler words, body language, slides, your pace, how you move or don’t move on the stage, etc.

Third, remember that you’re the expert with a few things to share. The audience wants to learn from you, not to punish you for being a poor speaker. Breathe, calm down, and be yourself.

Penalties and Purgatory

2019-10-09T19:05:41+00:00

I spoke on this topic at the ServerlessConf New York on October 8, 2019. Below are the slides and speaker notes. The thesis of my talk is to discuss a few questions:

The community universally seems to agree that the combination of functions, events, and fully managed cloud-native services makes serverless what it is. Is this conclusion premature? What can we learn from contemporary solutions like Kubernetes?
Can code size reduction alone into smaller deployable units help reduce the time to value? Is there an inflection point of such reduction increasing MTTR?
How do you make a case for tech adoption, and why is it essential to understand how your business works?

Given that this was a 20-minute talk, it stayed high level.

The Setup

Imagine you discover serverless today. You will find that the programming model is simple. You’re shipping small functions independently. This programming model is a microservices style taken to its logical conclusion. Then you will find that there are no servers to provision or manage, radically simplifying the ramp-up and operational cost. Then you will find that the runtime model keeps costs transparent and follows the demand.

As promising as this direction is, it is important to challenge ourselves a bit and ask a few questions.

Slides and Notes

This talk grew out of my experience leading cloud migration at the Expedia Group. I had a unique opportunity to burn my opinions several times over the last three and a half years. I had a chance to question my own beliefs, discard several of those, and form a few simpler ones. Hence you will find me taking a contrarian point of view on this stage.

Belief: Serverless is the future

Most of us at this conference, including those promoting alternative technologies, do believe that serverless is the future. Five, ten, or fifteen years from now, we can all look back to reminisce that we were there when it all started.

The programming model

But then let us look at the modern view of the serverless programming model. The programming model today consists of three things:

Functions as primitives in a constrained programming runtime to run stateless computational logic
Cloud-native services to do the heavy lifting for a wide variety of middleware services and state
Events and remote-procedure calls to glue everything together

Is this all we need to embark on the serverless journey? Back in 2019, I posed this question to someone in a cloud company, and the person answered yes. I didn’t believe it then, and I don’t believe it even now.

Why aren’t we there yet?

Serverless adoption has been on the rise. Where I work, we continue to witness an almost linear increase in function invocations over the past two to three years. We also hear that AWS Lambda continues to enjoy widespread adoption from a large variety of organizations to solve unique and innovative problems.

Yet, why is the world not moving to serverless quickly enough? What could be holding us back?

Counter belief

Let’s examine a counter belief that sees serverless as an evil capitalistic pursuit to let cloud providers take all your code and data, and lock you into their claws. Though it is easy to make fun of this belief, this is real. Fear sells.

What properties can we borrow from Kubernetes?

For those that buy into that point of view, the alternative is a solution (Kubernetes) that provides distributed systems primitives on boring infrastructure primitives to let you run a variety of workloads on top.

While it is tempting to make fun of this approach in serverless conferences and meetups, how about we ask ourselves a simple question — what properties can we borrow from this approach? What can we learn, instead of picking a side? I won’t answer this myself, but I would encourage each of you to explore.

Landscape

Let’s look at other contemporary beliefs driving our thinking.

Belief: Small is better

We’ve been pursuing this idea that small is better through the adoption of microservices over the last 7–10 years. We’re witnessing tremendous gains of developer agility and lower time to produce value. This model has allowed every team to move at the pace they need to create value for their customers. A function-based programming model takes this approach to its logical extreme, where you’re constrained to write nothing but functions.

Code size and time to value

Yet, will a further reduction of code size alone continue to reduce the time it takes to provide value? I don’t think so. I believe there is an inflection point beyond which you may not see a further reduction in time to value.

A not-so-uncommon phenomenon

Here is some evidence from my experience studying production incidents. With thousands of applications running in production environments, I came across several examples where a fault in one part of the environment has an indirect impact on another part of the environment several layers away. Such incidents take time to detect and remediate.

Comprehension penalty

Without additional constraints, microservices-based architectures can lead to reduced comprehension of production environments, which can widen the gap between “as designed” (our assumptions of how things are supposed to work) and “as it is” (how things are working in the real world) states. This gap can contribute to increased MTTR.

(For more discussion on “as designed” and “as it is” states, see Forming Failure Hypotheses.)

Where are the app primitives in serverless?

This observation brings us to the question — have we found all the runtime primitives for a serverless future? Where are the application-centric primitives for:

run-time boundaries between applications, with clear demarcation of private state and behavior?
service interfaces
non-fate sharing architectures?

Are there aspects we can learn from others? Are we throwing away most things we learned through the last 10–20 years? I don’t know the answers to these questions, but I believe that we must continue to explore.

OurCo vs the OtherCo

Let’s switch gears to tech adoption hurdles. Most developers I speak to share this belief privately. We often hear about tech adoption success stories at conferences and blog posts, while in reality, most of us worry that the places we work at don’t seem to be adopting the latest and greatest, including serverless. What could be going wrong?

Tech investment cycles

Imagine you went through an investment cycle (people, money, and time) to build a product. Sometime later, a tech innovation comes along. You may have to wait for your next investment cycle to be able to lay hands on that tech innovation. The point to recognize is that your investment cycles may not line up with the tech innovation cycles, and it is okay.

Multiple generations and migrations

The second hurdle we run into is older code that does not seem to go away fast enough. Here is what usually happens.

You have a product built yesterday, which is producing value for your customers. It is mature, functional, and yet complex to manage as it has so many band-aids to keep it working. You get upset with this complexity and want to build the next-gen to incorporate all the learnings from the product you created yesterday. But you’ve not finished yet. You might still start to experiment with new ideas.

Survival penalty

The longer your company survives, the more generations of systems you accumulate. You will be able to retire some, but those too take time. This phenomenon is the natural cost of staying long and going through multiple innovations to run any company. You need to develop business acumen to get comfortable with this.

Fear, anxiety or escalation won’t help

As developers, we tend to resort to fear, anxiety, or escalation. We worry about our teams not adopting the latest and greatest. We hope that some senior leader at our company will turn our wish into a dictum. We want to be able to push people to adopt the ideas we believe in. But these rarely work.

Influence can help

Influence can help. But how do you influence?

Understand how your business works

To be able to influence, you must learn to understand how your business works. Your business is typically turning a few things through a business model to produce value for your customers. The value may be revenue or some other benefit that your customers care about.

To influence, ask the following questions:

How does your business work?
How does it generate value?
What is your company currently pursuing?
How can you contribute to that pursuit?

Once you find answers to these questions, align yourself with that pursuit.

Conclusion

Where does this leave us?

First, stay humble. Don’t get attached to the side you picked. Picking sides may be fun and entertaining, but those don’t matter in the long run.

Second, prefer faster, better, and cheaper as much as possible.

Third, challenge the cloud primitives offered to you. No cloud provider holds the keys to all innovation.

Finally, align yourself with value generation to drive technology adoption.

Thank you

Thank you, ServerlessConf NYC.

Forming Failure Hypothesis

2019-09-27T22:49:40+00:00

Subjecting systems to failures is supposed to increase confidence in their stability. But why? How do you form useful failure hypotheses? How do you reason about their safety? Why should your organization listen to you and invest in testing your failure hypotheses?

I recently gave a couple of talks on this subject:

“Safety in Chaos: Forming Realistic Failure Hypotheses” at Strange Loop 2019, St Louis, on Sep 13, 2019
“Forming Failure Hypotheses” at Chaos Conf 2019, San Francisco, on Sep 26, 2019.

These talks summarize over two years of my quest to improve production stability at work. Through this time, I had to put aside some of my prior beliefs, learn from the constant chaos that our production environments are, and form new hypotheses. Slides and speaker notes are below.

The Setup

Chaos engineering has been around in the information technology industry for only about ten years. In contrast, other areas like patient care, emergency response, space and aeronautics, manufacturing, industrial engineering, mining, etc., went through decades or even centuries of learning through disasters to incorporate processes and practices to promote safety.

This nascency is why we continue to hear about the rationale for chaos testing, the tools and techniques to practice this discipline, and occasional success stories. What we don’t hear about, though, is that the road is rarely smooth, that chaos engineering programs don’t always work as expected, and that they often die after a while.

The gist of my talk is as follows.

For your chaos engineering efforts to succeed, you must invest in learning from incidents.
Chaos engineering programs that don’t consider learning from incidents will likely fail after initial enthusiasm and excitement.
The need for and value of chaos engineering will fall into place once you take the time to learn from incidents.

Slide 2: Slides and notes

You’re in the right place for the slides and notes. Look for other recent articles on this blog for more background material.

Slide 3: What is chaos engineering

Let’s take a brief look at this definition of chaos engineering.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos engineering is about making a calculated hypothesis about the system being able to withstand certain turbulent conditions. It is not about randomly killing servers or breaking other things. Some older descriptions of chaos engineering still refer to just doing those kinds of things.

Slide 4: A visual representation

Here is an easier representation.

Imagine a system operating in a stable zone in the middle. You make a hypothesis that the system will continue to operate in that state even after you introduce a perturbation, like a server crash, hardware failure, or network degradation or even a network partition, as long as the perturbation does not push the system beyond the (assumed) fault boundary (shown by the upward arrow). You ensure that the system is not pushed beyond that boundary.

If the system returns to the stable condition, you proved your hypothesis. If, however, your assumption of the fault boundary is not valid, you may find that the system goes into the danger zone.
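To make the shape of such a hypothesis concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Hypothesis` class, the `evaluate` function, the choice of error rate as the metric, and the threshold values are hypothetical and do not come from any chaos engineering tool. It encodes the stable zone as a maximum error rate and the assumed fault boundary as a second threshold, then checks whether the system came back to the stable zone after the perturbation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hypothesis:
    """A failure hypothesis expressed over one metric (here: error rate)."""
    perturbation: str      # illustrative label, e.g. "kill one app server"
    stable_max: float      # highest error rate still inside the stable zone
    fault_boundary: float  # assumed boundary; crossing it means the danger zone


def evaluate(h: Hypothesis, during: Sequence[float], after: float) -> str:
    """Classify an experiment from error rates seen during and after the perturbation."""
    # Any sample beyond the assumed fault boundary means the assumption was wrong.
    if any(sample > h.fault_boundary for sample in during):
        return "danger zone"
    # Otherwise the hypothesis holds only if the system is back in the stable zone.
    if after <= h.stable_max:
        return "returned to stable"
    return "did not recover"


h = Hypothesis("kill one app server", stable_max=0.01, fault_boundary=0.05)
print(evaluate(h, during=[0.004, 0.02, 0.03], after=0.005))
print(evaluate(h, during=[0.004, 0.08], after=0.005))
```

In a real experiment the samples would come from your monitoring, and you would abort the perturbation as soon as the boundary is crossed rather than classify after the fact.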

Slide 5: Questions to consider during this talk

As we get through the rest of this talk, keep the following questions in mind. We will answer some of these questions directly, and the rest indirectly.

First, what is the system? Is it just the software and config, the servers and the network, and the storage? Or does it also include the tools, processes, and culture that are used to build and operate the software, config, and the infrastructure?

Second, how do you form a chaos testing hypothesis? Do you copy what someone else did? Do you just guess? Or do you come up with your own? If so, how?

Third, how do you ensure system safety? How do you guarantee that the test does not push the system from a stable steady-state into an unstable state that you cannot quickly isolate, and protect your customers from?

Fourth, why should anyone listen to you, and invest time and energy for chaos engineering? How do you justify the cost of this activity against all other priorities that your organization is pursuing?

Slide 6: Foray into chaos engineering

Here is how a typical journey to chaos engineering starts. You either start cloud adoption or at least some modernization through some of the latest and greatest tech for agility. You start breaking down monoliths and begin to adopt cloud-native technologies. Things seem to be going well with a lot of excitement in the air.

Slide 7: How to build resilience

Then reality hits you, and you start to deal with outages. You also encounter some cloud outages like the recent AWS AZ loss in us-east-1. You will discover that your automation has bugs, or that you’ve not automated everything well. You will find that your changes are also causing production issues. Then you start to think about resilience. How do you build resilience?

Slide 8: Enter chaos engineering

That’s when you discover chaos engineering. You research it. You read some books. You will find some tools to practice chaos engineering. It sounds fun and exciting. You roll up your sleeves and get down to work.

Slide 9: Not everyone likes you to attack their apps

When you tell teams in your organization that you’re going to subject their apps and services to chaos engineering tests, some teams will enthusiastically tell you to attack their apps.

You will come across some teams that will say “No way. Our stuff is mission-critical. Don’t bother us,” while others would say, “We’re busy now. We don’t have time for this. Come back in x months.”

Such reactions will compel you to exclude all those apps and services, and work only with the teams that enthusiastically support you.

Slide 10: Randomly killing servers only uncovers trivial issues

Then you will discover that randomly killing servers only uncovers trivial issues. This is plausible given that the teams that are enthusiastic about chaos engineering have already invested in some basic robustness practices in their apps and services. Those that have not taken such steps are not participating anyway.

Slide 11: You can’t/won’t test more serious failures

You also can’t and won’t try serious failures because you know those would get you into trouble. These could include blackholing your critical database, or breaking the network between two data centers.

Slide 12: Self-doubt

The outcome — you become frustrated. You start to doubt yourself. You start to think that you or your company is not competent enough to adopt chaos engineering when every other company seems to be doing chaos testing all the time!

Slide 13: Valley of despair

That’s how you get into the valley of despair. Your chaos engineering program is likely to die at this point.

I faced this situation about a year and a half ago. How do you get out of this valley?

Slide 14: Null Hypothesis

For the sake of discussion, let’s consider a hypothesis that “chaos engineering has nothing to do with a system’s capability to withstand turbulent conditions in production.”

Starting with such a hypothesis lets you discover if, why, and when chaos engineering might help.

More importantly, some of the pioneers in the industry that discovered the need for chaos engineering, and incorporated this discipline into their org culture, went through a journey of self-learning and discovery. Starting with a null hypothesis might help you go through a similar discovery process but within the context of your production systems, your tools, and your org culture.

Slide 15: How is the system behaving in production today?

To start with this null hypothesis, let’s set aside the question of making the system withstand turbulent conditions. Let’s instead ask how the system is behaving in production today.

Slide 16: “as designed” vs. “as it is”

How do you check how the system is behaving in production today?

You can start with the “as designed” state. Your documents, diagrams, and even code give you an indication of how the system is designed and intended to work, but not how it is working in production today. Furthermore, even the metrics we measure, the logs we collect, and the alerts we set up represent the “as designed” state and not the “as it is” state. In other words, the “as designed” state is biased by your expectation of how the system is supposed to work, but not how it is working today. The “as designed” state is nothing but an imaginary state. It is an approximation, not real. It remains incomplete as the system complexity increases.

The “as it is” state, on the other hand, is complex. We don’t fully understand it and can’t explain all parts of it. We can’t fully explain why it works when it works, or why it is not working when it is not working. When it fails, we struggle to explain why. We use war metaphors to conduct incident procedures.

Unlike docs, diagrams, code, logs, metrics, etc., incidents tell us about the “as it is” state of a system.

Slide 17: Learn from Incidents

That’s exactly what I did when I entered the valley of despair. I studied incidents.

See my prior blog post Incidents — Trends from the Trenches, and slides of my OSCON 2019 talk If Only Production Incidents Could Speak for more details of my approach and findings. In this talk, I’m going to share a few highlights.

Slide 18: Changes are contributing to a majority of the impact

My first observation is that changes are contributing to a majority of the impact. In addition to the references from the previous slide, also see Taming the Rate of Change.

Slide 19: Second and higher-order effects are hard to troubleshoot

The second observation is that, due to the mixed nature of our production environments with fast-changing and slow-changing services, as well as monoliths and micro-services and tech debt, it is getting hard to reason about second and higher-order effects of changes and failures.

Slide 20: Unclear fault lines

My third observation is that the fault boundaries (or “blast radius” of failures) are unclear. Often, the “as designed” state does not include the intention of fault boundaries. When they exist in the design, they remain untrustworthy.

I’ve witnessed incidents where a change or issue in one part of a system impacted another, unsuspecting part of the system, surprising many people on incident bridges. I’ve also witnessed incidents where hundreds of millions of dollars of investment in supposedly redundant data centers in different locations could not empower the incident commander to shift traffic away from a data center that had lost power, all because the incident commander could not determine whether it was safe to shift traffic, given some unknown interdependencies between those data centers.
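One way to reason about second and higher-order effects is to walk a dependency map outward from a failed component. The sketch below is only an illustration under loud assumptions: the service names and the `dependents` map are made up, and `blast_radius` is a hypothetical helper, not a tool I used. It does a breadth-first traversal and reports how many hops away each affected service sits.

```python
from collections import deque
from typing import Dict, List


def blast_radius(dependents: Dict[str, List[str]], failed: str) -> Dict[str, int]:
    """Return {service: hops} for every service that transitively depends on `failed`."""
    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in hops:  # first visit is the shortest path in a BFS
                hops[service] = hops[current] + 1
                queue.append(service)
    del hops[failed]  # report only the services affected by the failure
    return hops


# Hypothetical "as designed" map: key -> services that call it directly.
dependents = {
    "config-store": ["auth"],
    "auth": ["checkout", "search"],
    "checkout": ["web"],
    "search": ["web"],
}
print(blast_radius(dependents, "config-store"))
```

Note the limitation: a map like this captures only the “as designed” dependencies. The surprises on incident bridges come from edges that are missing from the map, which is why incidents remain the better source for the “as it is” picture.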

Slide 21: Actions

What do these findings motivate you to do?

First, improve release safety through progressive delivery of changes, so that you are safely introducing changes into production.

Second, ensure tighter fault domain boundaries in the “as designed” state. That is, be explicit and intentional in your design about fault domain boundaries.

Third, implement safety in the “as designed” state. This involves implementing not only fault domains but also other robustness techniques like being able to shed or shift traffic, failovers, fallbacks, circuit breakers, etc.

And then, determine what failures to test in production. In particular, the second and third activities above will allow you to increase the severity of your tests and help you discover the “as it is” state. You can’t test your hypothesis unless you account for such safety conditions in the “as designed” state of your system.
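Of the robustness techniques named above, the circuit breaker is perhaps the easiest to sketch. The following is a minimal, single-threaded illustration, assuming a simple consecutive-failure count and a fixed reset window; the `CircuitBreaker` class and its parameters are hypothetical, and a production implementation would also need thread safety, metrics, and per-error policies.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock              # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: do not touch the dependency
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the circuit fully
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where both names stand in for your own functions. The point for chaos testing is that a fallback path like this is part of the “as designed” safety you can then verify under a real fault.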

Slide 22: Reflecting on my findings

As I reflect on my approach to incident analysis, what I find most rewarding is the act of learning from incidents. The discovery of patterns like those I shared in previous slides is interesting and relevant, but not as much as the act of learning from incidents itself.

This journey made me realize the importance of learning from incidents. Incidents tell us about the “as it is” state, and listening to incidents leads us to discover chaos engineering. It makes you realize that chaos engineering would prepare your architecture, tools, processes, and culture to be resilient to failures.

Slide 23: But how to prioritize such work?

How do you prioritize such work? How do you convince various decision-makers at your organization to invest time and resources for chaos engineering?

First, pick the most critical areas to get the most value of your investments into this kind of work. This is because not every system brings in the same value. Pick the ones that are most important, and worth protecting in your organization.

Second, learn to articulate value. Don’t let anxiety drive the conversation. Most organizations start to rally the troops after major incidents to invest in hygiene activities like chaos engineering. However, such approaches may not always be sustainable.

I will give you an example. Last year, some of us were in a room debating whether to invest in deploying a critical part of our stack in a second region and invest in testing for failover between two regions. Some of us in the room were arguing for such an approach as, in our minds, it was the right thing to do. The rest in the room argued that doing so would cost more money, and more importantly adds a few months to the project timeline. That group wanted to defer the second region investments to a future date.

Ultimately what settled the debate was the back-of-the-napkin calculation by one of our teammates. With a few calculations, he pulled up an approximate per-minute revenue driven by that part of the system and asked how much downtime we could afford. We debated whether an hour, or two hours, or 15 minutes, or 30 minutes was okay. Finally, most in the room agreed that we should not go beyond 15 minutes, and that we were unlikely to meet that 15-minute target without investing in a second region and practicing failover. Debate settled.
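A back-of-the-napkin calculation of this kind fits in a few lines. The numbers below are entirely made up for illustration (the real figures are not part of this post), and `downtime_cost` is a hypothetical helper; the sketch only shows how a per-minute revenue estimate turns a debate about minutes into a debate about dollars.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_cost(revenue_per_minute: float, minutes_down: float) -> float:
    """Approximate revenue lost during an outage of the given length."""
    return revenue_per_minute * minutes_down


# Hypothetical annual revenue attributed to this part of the system.
annual_revenue = 52_560_000
revenue_per_minute = annual_revenue / MINUTES_PER_YEAR

# The outage lengths that were debated: 15 minutes up to two hours.
for minutes in (15, 30, 60, 120):
    cost = downtime_cost(revenue_per_minute, minutes)
    print(f"{minutes:>3} min of downtime ~ ${cost:,.0f}")
```

An even spread of revenue across the year is itself a simplifying assumption; a peak-hour outage would cost more than this average suggests.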

In this example, the value is a dollar number. But it does not have to be the case. Depending on your situation, pick a value indicator that reflects one of speed, quality, or cost. Most businesses and stakeholders are interested in being “faster, better, and cheaper.” Appeal to those when formulating value-based arguments.

Also, realize that not every part of the system may need to have the same robustness. Certain amounts of losses may be acceptable for some parts of your system. Tailor your investments accordingly.

Slide 24: Journey out of the valley of misery

Given this, how do you get out of the valley of despair?

First, learn from incidents.

Slide 25: And then

And then, learn to make value-based arguments and decisions.

Slide 26: How do you learn from incidents?

But how do you learn from incidents? I don’t know. There is no documented textbook approach to learning from incidents. I took a particular approach to learn from incidents, which allowed me to make some observations and form new opinions. I’m sure there are other styles that lead to different observations.

That is why learning from incidents needs to be part of your org culture. You want everyone, and not just a few, to learn from incidents.

Slide 27: How does it feel when you learn from incidents

While there is no textbook approach to learn from incidents, we can describe how it feels when you begin to learn from incidents.

Slide 28: Learning from incidents

First, you’ve built and updated several mental models of how the system works when it does, and doesn’t when it doesn’t. You have a better understanding of the scale and complexity.

Second, you’re not chasing symptoms but are beginning to understand the system as a whole. For instance, instead of creating a backlog of action items after each incident, you may start to look for systemic improvements.

Slide 29: Learning from incidents (Continued)

Third, you begin to understand the role of people, processes, and tools for success as well as failure. You will realize that resilience is not just about code, config, and servers, but it is also about the culture.

Last, you’re able to articulate the value of hygiene investments, including chaos engineering.

Slide 30: Lessons Learned

Lessons learned.

Slide 31: Learn from incidents

Lesson number one: learn from incidents. The need for and value of chaos engineering will fall into place once you take the time to learn from incidents.

Lesson number two: sorry, but there is no lesson number two.

Slide 32: Thank you

Repeating what worked for others may not get you far. Increase the time you spend on the “as it is” state to discover what works best for you.

If Only Production Incidents Could Speak

2019-07-18T15:14:54+00:00

Below are the slides and extended speaker notes from my talk on July 18, 2019, at OSCON 2019 with the same title as this post. See my prior posts for related material:

Taming the Rate of Change (November 19, 2018)
Incidents — Trends from the Trenches (Feb 26, 2019)

The Setup

I have a lot of stories to tell you after studying over 1500 production incidents over the past year. But since these are far too many, and we have only about 30 minutes, we will focus on patterns in broad strokes and reemphasize some practices.
Why study incidents? First, I have several accumulated opinions, and I needed a way to shake those up to form better hypotheses grounded in reality. Second, most companies belong to the “we’ve too much to do” category. No team has time to implement all the best practices in the world in their architectures to be highly fault-tolerant, elastic, flexible, and cheap. Trade-offs are essential. You have to pick and choose what you want to work on and why. An analysis like this provides you with some patterns to focus on.
I will cover four topics in this talk — (1) cultural inhibitors in the industry about production incidents, (2) how I approached incident analysis, (3) some key patterns observed, and (4) some hypotheses and recommendations.

Slides and Notes

You’re in the right place for the slides and notes.

Cultural Inhibitors

One of the things I realized when studying incidents is that the time you spend after an incident is as important as, if not more important than, the time you spend during it. That’s because incidents tell you a lot about your architecture, your investments, your processes, and your culture.
However, a few cultural inhibitors prevent us from learning from incidents.

Cultural Inhibitor 1: Single Root Cause Fallacy

This is the traditional model of looking at incidents, which is why most incident reporting systems still include fields like “root cause” and “component”. In this model, you look for a particular component or activity that caused the incident.
The hypothesis behind this point of view is that, if only such and such component had not failed, or if someone did a particular activity exactly as it was supposed to be done, the incident could have been averted.
For instance, whenever a hardware component failure is involved in an incident, the focus tends to shift to that component. Ultimately the vendor that sold that component gets called into the incident and is asked to provide a “root cause analysis”. I’m aware of stories where some of these vendors reply with “canned root cause analysis” reports to please their customers.
However, most large-scale complex systems are always under some stress, and yet appear to operate in stable zones. When certain conditions align, they get pushed into unstable zones, causing customer impact. We call this an incident.

What Caused the King’s Cross Fire?

This is a great example to dispel the single root cause fallacy, thanks to my colleague Ian Butcher, who shared this with me some time ago.
What caused the fire? Was it the match? The wooden steps in the escalator? 20 years of layers of old paint on the ceiling? Staff not knowing how to use the water fog equipment to put out such fires?
Read more about this on Wikipedia.

Swiss Cheese Model

James Reason’s 2000 paper titled “Human Error: Models and Management” describes the Swiss Cheese model to make a point that accidents occur when several hazard conditions align.
In this paper, the author differentiates between two styles of understanding human error — the person approach, and the system approach. In the person approach, you look for first-order causes, like the mistakes a person did to cause an accident. In the system approach, you look at systemic factors and try to improve the system.
Each hole in a slice is a latent hazard condition. When some of those conditions align, you have a loss.
You will see this in the recent HBO series on Chernobyl. Chernobyl didn’t explode because one single component failed or one person or team didn’t follow the procedure. The protagonists show several systemic failures over the years that led to the Chernobyl disaster. These included technology, people, culture, processes, and leadership.

Cultural Inhibitor 2: Not Supposed to Happen

It is not uncommon to see teams treat incidents as events that are not supposed to happen, except when someone didn’t do their job as they were supposed to, or some system didn’t work as it was supposed to.
We play heroics when they happen. We congratulate each other on incident recoveries. We then hurry to get back to business as usual.
This mindset that incidents are not supposed to happen also prevents us from learning from incidents.

Cultural Inhibitor 3: The “Big Ones” vs the Rest

High impact incidents steal our attention. We always remember the “big ones”. We talk about those for months/years. We give plenty of attention to the big ones in social media and the press.
Low impact incidents, on the other hand, languish for our attention. We quickly forget them. In most enterprises, teams don’t write postmortems or look for improvements.
This is natural because it’s not obvious why you should spend time on low-impact incidents.
However, this attitude also prevents you from learning from incidents. Every incident is a signal from your production environments about the state of your architecture, your tech debt, your processes, and culture. You can’t improve any of these unless you pay attention to these signals.
Every incident matters, and yet, you can’t treat each as a snowflake. So, how do you study incidents?

Approach — Don’t categorize

If you’re interested in learning from incidents, toss out incident categorization like this.
Bad way: Database, Network, Storage, Server, Vendor, Partner, DevOps, …
When you categorize incidents like this, you tend to point fingers at particular components in your architecture. You can produce nice reports, but you will miss opportunities for systemic improvements.

Approach: Look for Patterns Instead

Look for patterns instead. Patterns may tell you a different story from any particular incident.
As I spent more time studying incidents, I realized that focusing on patterns helps us make better value-based arguments to make sustainable improvements.
Otherwise, you will get frustrated that your post-incident processes like post-mortems, availability scorecards, error budgets, etc are not working, and you’re not making the impact you hoped to.
Let us look at five common patterns from the incidents I studied.

Pattern 1: Changes

The first pattern that came up in my analysis is production changes. A change may be a release of new code, or a config change, or maybe even an A/B test.
I find that a majority of incidents happen when a change is in progress or was recently made.
In my first analysis last fall, changes accounted for about 70% of the customer impact. In the second analysis on a larger set of incidents that occurred in 2018, change triggered impact was about 35%. In my most recent round, about 50% of the impact was triggered by production changes.
Often the impact was immediate. You start to notice an impact on some key business metric and then realize that the change was botched.
In some cases, changes introduce latent failures that stay dormant for days. Those are difficult to debug or reproduce and can be frustrating to deal with. We will cover those later in this talk.
Let’s look at some well-publicized examples in the industry.

Twitter Incident Last Week

Twitter had an incident last week. They reported that “the outage was due to an internal configuration change”.

Google Blob Storage Incident in March 2019

Take a look at this postmortem report published by Google in March this year. Google’s internal blob storage had an outage. Their postmortem reported that “SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data”. Boom!

Facebook, Instagram, WhatsApp Incident in March 2019

You may recall a Facebook incident in March when Facebook, Instagram, and WhatsApp simultaneously suffered a massive outage. The next day, Facebook posted a cryptic tweet attributing the incident to a “server configuration change”.

Google 2016 Paper: “Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure”

There isn’t a lot of industry research that shows similar patterns, except this 2016 paper from Google which reports that “a large number of failures happen when a network management operation is in progress within the network”. No wonder.
These are not anomalies. Most enterprises have many examples like this. I’ve seen hundreds of examples where I work.
Are these companies incompetent, not knowing how to make changes to their production environments? I don’t think so. All these companies run incredibly complex systems at large scale, and changing such systems is hard.
Let me offer a few hypotheses on why.

First, More changes ⇒ More entropy

At the Expedia Group, where I work, we make a few thousand changes a day. As we make steady progress in our cloud investments, we see a near-linear increase in the number of daily production changes.
Why does this matter? With each change, you’re adding to the entropy that already exists in the production environment. In addition to features and bugs in your code, you’re also introducing new failure modes or revealing existing failure modes.
Do note that a vast majority of the changes are successful. In one particular batch of incidents over 6 months, I noticed that just about 5 out of 10,000 changes triggered a production impact.
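
To see why a low failure rate can still matter at a high change volume, here is a rough sketch of the arithmetic. The 5-in-10,000 figure comes from the batch mentioned above; the 3,000 changes per day is an assumed stand-in for “a few thousand”, and the function name is made up for illustration.

```python
def expected_impacting_changes(failures: int, sample_size: int, changes_per_day: int) -> float:
    """Scale an observed change failure rate to a daily change volume."""
    return failures * changes_per_day / sample_size


OBSERVED_FAILURES = 5      # changes that triggered a production impact
OBSERVED_CHANGES = 10_000  # size of the observed batch
CHANGES_PER_DAY = 3_000    # assumption standing in for "a few thousand"

rate = OBSERVED_FAILURES / OBSERVED_CHANGES
per_day = expected_impacting_changes(OBSERVED_FAILURES, OBSERVED_CHANGES, CHANGES_PER_DAY)

print(f"change failure rate: {rate:.2%}")
print(f"expected impacting changes per day: {per_day:.1f}")
```

Under these assumptions, even a failure rate well below one percent translates into more than one impacting change per day, which is why change volume deserves attention on its own.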

Second, Difficult to Know Second/Higher-Order Effects

As we invest to reduce the size of our code through micro-services, and CI/CD, complexity is shifting to connectedness.
This connectedness increases the chance of success (quickly creating more customer value) as well as the chance of failure (systems that are difficult to comprehend).
Furthermore, our production environments include a mixture of fast-changing and slow-changing systems including those that we can’t get rid of like those big circles on this slide.
These make it difficult to know or predict the second- and higher-order effects of changes.

Every Change is a Test in Production

Consequently, every change feels like a test in production. There is no escaping from this.
Pre-prod testing can help, but no amount of pre-prod testing can make production changes predictable and safe.
Due to high entropy, it is getting harder to mimic production environments to test changes. Our pre-prod environments, when they exist, lack the same entropy and scale that our production environments offer.
Let’s get comfortable with this trend.

Pattern 2: Latent Failures

The second pattern I observed is latent failures.
These remain hidden in the environment for days, weeks, or even months and don’t immediately cause issues. They wait for conditions to align, just like the holes in James Reason’s Swiss cheese model.
Example: Config changes caught when nodes are recycled or scaled up days later.
Example: A change was made to a service behind a cache. The cache hid the fault for a few days until the cache was refreshed.
Example: Auto-scaling kicks in to scale up a cluster, and new nodes pick up the wrong config.

Config Drift

Config drift is a special kind of latent failure.
Why do we have config drift? That’s because we rarely automate everything. We automate more frequently used workflows or more frequently occurring conditions, and leave the less frequently used ones in our backlogs. Those backlogs rarely shrink.
Unfortunately, those low-frequency workflows end up relying on tickets, manual steps, change requests, and team memory. Very few people would know how to make those changes, and when they leave or change teams, any knowledge leaves with them.
Any automation we do in the low-frequency areas is often open-loop, meaning that there is no observer to monitor and correct config drift. Over time, systems drift from their desired state.
I saw several drift related incidents with switches, routers, firewalls, and databases which don’t get as much automation love as CI/CD for stateless apps do.
One particular example I remember is a database failover. The automation let the active node failover to the passive node, but the passive node remained in the read-only state due to configuration drift.
In another case, a redundant firewall device failed to take over when the primary device failed. It turned out that someone manually made a config change several months before the incident, and forgot to undo it.
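
As a contrast to open-loop automation, here is a minimal sketch of a closed loop: read the actual state, compare it with the desired state, and correct any drift. Everything in it is hypothetical, including the setting names and the `read_actual` and `apply_setting` callables, which stand in for whatever your devices or databases expose.

```python
from typing import Any, Callable

MISSING = "<missing>"


def find_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every managed setting that drifted."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key, MISSING)
        if have != want:
            drift[key] = (want, have)
    return drift


def reconcile_once(
    desired: dict[str, Any],
    read_actual: Callable[[], dict[str, Any]],
    apply_setting: Callable[[str, Any], None],
) -> dict[str, tuple[Any, Any]]:
    """One pass of the loop: observe, compare, correct. Returns the drift that was found."""
    drift = find_drift(desired, read_actual())
    for key, (want, _have) in drift.items():
        apply_setting(key, want)
    return drift


# Hypothetical standby firewall whose failover setting was changed by hand.
desired_state = {"failover_enabled": True, "ruleset_version": 42}
device_state = {"failover_enabled": False, "ruleset_version": 42}

found = reconcile_once(
    desired_state,
    read_actual=lambda: dict(device_state),
    apply_setting=device_state.__setitem__,
)
print("drift corrected:", found)
```

Run on a schedule, a loop like this would surface the forgotten manual change within one pass rather than months later during a failover. Whether to correct automatically or only alert is a design choice that depends on how much you trust the desired state.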

Pattern 3: Mismatched Assumptions

The third pattern I noticed is mismatched assumptions between layers. Production environments are large distributed systems with several layers between most components.
Each of those layers may make different assumptions about interfaces, behavior, scale, latency, availability, etc, and you may not know when you’re violating those assumptions.
But where do layers come from?
Layers between producers and consumers in a service-oriented architecture.
Or, wrappers on wrappers to fix interfaces or data munging
Or, between apps and the automation platforms that those apps rely on. For instance, a monitoring agent can make a node unavailable when it is just busy processing a request and waiting for a downstream system to respond.
Sometimes, we create layers to work-around organizational issues, thanks to Mel Conway. Those layers make their own assumptions about their downstream services.
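
One kind of mismatched assumption that is easy to check mechanically is timeouts between layers. The sketch below assumes a simple rule of thumb: a caller should wait at least as long as the layer below it may spend on its own attempts. The `Layer` type, the layer names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    timeout_seconds: float  # how long this layer waits on the layer below it
    attempts: int = 1       # total tries this layer makes against the layer below it


def find_timeout_mismatches(chain: list[Layer]) -> list[str]:
    """Check a chain ordered from outermost caller to innermost callee."""
    problems = []
    for caller, callee in zip(chain, chain[1:]):
        # Worst case, the callee spends its full timeout on every attempt.
        callee_worst_case = callee.timeout_seconds * callee.attempts
        if caller.timeout_seconds < callee_worst_case:
            problems.append(
                f"{caller.name} waits {caller.timeout_seconds}s, but "
                f"{callee.name} may take up to {callee_worst_case}s"
            )
    return problems


# Hypothetical chain: a health check in front of an app that retries a slow dependency.
chain = [
    Layer("health check", timeout_seconds=2.0),
    Layer("app", timeout_seconds=5.0, attempts=2),
    Layer("dependency client", timeout_seconds=3.0),
]

for problem in find_timeout_mismatches(chain):
    print(problem)
```

A check like this only covers assumptions you have written down. Its main value is in forcing each layer to state what it expects from the layer below.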

Pattern 4: Not the Network

When I shared the outline of this talk with a colleague of mine, he remarked that “crap rolls downhill”, and the network gets blamed more often than it deserves.
The network may be unreliable, but you can’t easily distinguish between a slow dependency and an impaired network. So, during incidents, teams start with the network, only to later discover factors like application slowness, bad timeouts, resource (CPU/mem) starvation, etc.

Pattern 5: Unknown

Finally, a sufficiently large number of incidents have no clear reason for failure. They usually start with a key business metric going the wrong way. While you’re busy troubleshooting, the metric recovers on its own, and you don’t know why.
Such incidents show that we don’t fully understand the dynamics of production systems.

Let’s now look at some practices to improve. None of these are new techniques. Yet, I’m bringing these up as a reminder that improvements require systemic investments.
The first practice I recommend is to progressively increase confidence in the delivery of production changes.
I cannot emphasize this practice enough. In order to go fast with continuous delivery, find ways to progressively increase confidence.
Do whatever it takes to increase confidence in the change, both before making it (in test and pre-prod environments) and while making it. Don’t consider passing tests in a pre-prod environment as a green signal to go full-steam with a change in prod.
In this example here, the system is deployed in three independent fault domains. Choose independent data centers (like cloud regions) for those fault domains.
Don’t make the same change, whether it is code or config, everywhere at the same time. Instead, introduce the change slowly.
Progressive changes might feel like a slow approach. But only if you’re babysitting the rollout. Bake progressive delivery into your CI/CD instead.
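
Here is a minimal sketch of what baking progressive delivery into a pipeline might look like: apply the change to one fault domain at a time, check health, and stop and roll back at the first sign of trouble. The function and the `apply_change`, `is_healthy`, and `roll_back` callables are hypothetical placeholders for your own deployment and monitoring hooks.

```python
from typing import Callable, Sequence


def progressive_rollout(
    domains: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> list[str]:
    """Roll a change out one fault domain at a time; return the domains that kept it."""
    completed = []
    for domain in domains:
        apply_change(domain)
        # In a real pipeline this check would watch key metrics for a bake period.
        if not is_healthy(domain):
            roll_back(domain)
            break  # leave the remaining domains untouched
        completed.append(domain)
    return completed


# Hypothetical run across three fault domains where the second one turns unhealthy.
deployed: dict[str, str] = {}
result = progressive_rollout(
    ["region-a", "region-b", "region-c"],
    apply_change=lambda d: deployed.__setitem__(d, "v2"),
    is_healthy=lambda d: d != "region-b",
    roll_back=lambda d: deployed.__setitem__(d, "v1"),
)
print("domains on the new version:", result)
```

The point of the sketch is that nobody babysits the loop. The health gate decides whether the change moves on, so a bad change is contained to one fault domain.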

Practice 2: Fail Partially

When there is an incident, your first task should be to protect the customer and the business, not to debug what happened.
You can afford to protect the customer and the business only when you invest in fail-safe mechanisms.
In this slide, I’m illustrating three important mechanisms.
First, the app is compartmentalized into three independent failure domains. When you spot an issue with one of these, you may be able to shift traffic away from it to healthy fault domains. You can also use the same pattern to scale out your service.
Second, you may shed some part of the traffic to preserve the rest of the traffic. You may apply rate limits, drop connections, show shunt pages, and send 503s.
Third, implement circuit breakers and fallbacks for dependencies so you can limit cascading failures.
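
To make the third mechanism concrete, here is a minimal circuit breaker sketch with a fallback. It is an illustration under simple assumptions (consecutive-failure counting, a fixed reset window, single-threaded use), not a production-ready implementation. The class and parameter names are made up.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Open means calls are short-circuited to the fallback."""
        if self.opened_at is None:
            return False
        # After the reset window, let one trial call through.
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical dependency call. While the breaker is open, the dependency gets no traffic and the customer sees a degraded but working page.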

Practice 3: Vaccinate

Another name for this is “chaos testing”. I prefer vaccination because it makes the point of introducing faults to build immunity.
When planning your tests, incorporate tests to exercise fault domain boundaries, assumptions of scale and latency between layers, scale up and scale down conditions, as well as traffic shedding and traffic shifting.

The fourth practice in my list is to not walk away from incidents after recovery. Learning from an incident starts after the incident.
First, clean up after the incident. Leftovers usually contribute to config drift.
Second, write post-mortems. In addition to looking for contributing factors, think of systemic improvements to make.
Validate any fixes. Let those fixes form a new hypothesis for your chaos tests.
Yes, these are time-consuming activities. Do spend the time to learn.

Time and resources are always finite, while our backlogs are nearly infinite. Every team has enough work to do for years to come.
Amidst all this work, how do you influence the prioritization of work to improve production stability? How do you convince your team, or your manager, or the person in charge of making prioritization decisions to invest in resiliency-related improvements?
Examples: Architecting fault domains, investing in chaos testing, validating fixes after incidents, etc take time.
Learn to attribute value to those types of work. Learn to measure the effects of incidents like lost revenue, customers, time, productivity, etc.
If you’re not able to make such value-based arguments, you may remain frustrated.

To summarize my talk, incidents provide an excellent way to learn about your architecture, people, processes, and culture.
Study incidents. Form study groups to discover patterns from incidents. Patterns like the ones we discussed here empower you to make value-based arguments to influence your teams.
If you’ve time to do just one thing, invest in release safety.

Status Management

2019-06-10T18:06:28+00:00

I learned about “status management” recently while reading Daniel Coyle’s The Culture Code. Since then, I can not stop noticing status management.

Unfortunately, as we are always going through change, as our work interactions are turning more and more fluid, and as role/responsibility boundaries are getting blurrier, “status management” is becoming an epidemic in most work interactions. Our reactions are often motivated by how each of us wants to “fit in” to the emerging status, which is an individual objective, and not necessarily by how to effect the change, which is a group objective. In these cases, status management decides what an individual says or does. We see plenty of examples, at work or on social media, of people giving opinions that are designed to maximize their “status”. Decoupling oneself from one’s intended status during a change is hard and yet essential to effect the shift.

Daniel introduces status management through a case study of how two groups, a group of kindergartners and a group of business students, conduct themselves in Peter Skillman’s Spaghetti Tower Design challenge. This is a fairly common challenge used for team building.

You can read an excerpt from this part of the book on Daniel Coyle’s website, but let me share the part that stuck with me.

The business school students appear to be collaborating, but in fact they are engaged in a process psychologists call status management. They are figuring out where they fit into the larger picture: Who is in charge? Is it okay to criticize someone’s idea? What are the rules here? Their interactions appear smooth, but their underlying behavior is riddled with inefficiency, hesitation, and subtle competition. Instead of focusing on the task, they are navigating their uncertainty about one another. They spend so much time managing status that they fail to grasp the essence of the problem (the marshmallow is relatively heavy, and the spaghetti is hard to secure). As a result, their first efforts often collapse, and they run out of time.

In the rest of the book, Daniel walks through the skills necessary (such as building safety, sharing vulnerability, and establishing purpose), to help “tap into the power of our social brains to create interactions exactly like the ones used by the kindergartners building the spaghetti tower.”

The trick, I think, is being okay not to have a status in the change, and yet be willing to influence or even lead the change. This requires that you feel safe that your job is not at risk even if you have no status in the change, or you trust your ability to find something else to do. This ability is one of the essential characteristics of growing as a leader. Not having status is not a failure.

Opinions

2019-05-30T22:30:43+00:00

Some of the best meetings I’ve attended in recent years are those that I had no opinions on. These were meetings where folks had disagreements, and the best I could offer was to walk in with an acknowledgment that I had no opinion, and that I was there to help figure it out. Before each of those meetings, I did spend the time to research the problem statements, learn about the points of view, and consider potential trade-offs involved. Yet, as I walked in, I offered no solutions. I had nothing to gain or lose. I had no sides to pick. I might or might not have a role to play in the outcome.

Some of my not-so-effective meetings are those in which a couple of us in the room held strong opinions on the topic, were very articulate about those opinions, and were bent on making the discussion conform to those opinions.

Nowadays, my rule is to leave opinions at the door as much as possible.

But leaving opinions at the door is easier said than done. What if you can’t control the outcome? What if you don’t like the result? What if the result ends up disrupting what you’re currently doing or conflicts with what you want to accomplish? What if you end up not having any role in the outcome? This state is like walking on shaky ground and can make you uncertain about yourself.

Role of Opinions

However, opinions are an integral part of how we think. They play a crucial role in our decision making. They help us process and discard vast amounts of information and only zoom into those parts that seem relevant. Opinions make it easy for us to decide and act quickly.

Most of us hold opinions on a wide range of topics regardless of our expertise on those topics. We employ opinions to appear knowledgeable, persuasive, competent and confident. We rely on opinions as proxies for knowledge, thoughtfulness, and competence. For example, I’ve opinions on the state of politics, technology trends, arts and sciences, environment, religion and so on. I’m no expert on most of these topics, and yet, I can appear and feel like an expert by liking, commenting on, or sharing expert opinions on social media.

Opinions also give you confidence and make you feel certain. Imagine the confidence you feel when walking into a discussion, or before taking action. More often than not, such confidence is based on the opinions you formed on the subject.

Opinions are also easy. That’s why we all have plenty of them. Opinions are not scientific facts, and so we don’t need to prove ours. You can discard any point of view with “it is just your opinion.” Opinions are not hypotheses. So, we don’t need to subject them to testing with data.

That’s also why we can form opinions quickly. In contrast, activities like deliberation, discovering facts, and testing hypotheses are slow and time-consuming. You have to work hard to unearth facts, form hypotheses, and test them.

Over time, opinions become part of our identity. As others discover our opinions, they put us into groups. They make us stereotypes. Opinions make it easy for others to deal with us.

We often hang out with people that share our opinions. Since we are more likely to be persuaded by ideas that form in our minds than those of others, we form allegiances with people that express opinions that are similar to ours. Consequently, the social groups that we’re part of converge into closed echo chambers. Such grouping makes us feel safe. We belong.

The Trap

Workplaces are no different. We use opinionated software frameworks and tools for productivity. We pick or discard solutions just because we don’t “like” something about those.

Even in the most data and hypotheses-centric workplace cultures, opinions continue to rule. That’s because, apart from not needing scrutiny to form one, forming opinions is a part of how we think.

Therein lies a trap.

Through the eloquent and repetitive articulation of opinions, we have the power to extinguish deliberation and dialog and shape the course of the group that we are part of. Most political speech and marketing spin function like this. The war on Iraq from 2002 is an excellent example in recent history. The then White House and cabinet managed to shape public opinion with little factual information to back up the claim of weapons of mass destruction. Fires from that war are still raging in the Middle East. Similarly, we’re currently going through a period of opinion-shaping about global warming, where opinions and their articulation matter more than facts and hypotheses.

Similarly, significant initiatives get launched or suspended based on the opinions of positional leaders with titles at work. The highest-paid person’s opinions (HiPPO) may override data and other observations. Eloquent opinionated individuals take over conversations in meetings to steer the course to conform to their opinions. Senior leaders’ names get dropped to shift the course of an activity or to override some data, purely based on such leaders’ perceived opinions.

However, firmly held opinions can prevent us from learning. Opinions water down dialog and discovery of facts. They block us from listening.

Instead of letting a free-form dialog happen, when we don’t leave our opinions at the door, we end up forcing discussions to conform to our opinions. Meetings dissolve into arguments. Consequently, our own opinions become a barrier between us and potential opportunities. Opinions can get us stuck.

Leaving Opinions at the Door

That’s why it is important to practice the art of leaving the opinions at the door. There are several tricks to help.

First, recognize that you can’t afford not to have opinions on everything. You keep/nurture opinions on most things but the most important ones. Not having an opinion in front of an opinionated group will get you bulldozed. Not every workplace culture may have a framework for baloney detection and for minimizing such bulldozing. Using written forms (not slideware) of communication to lay out arguments can help.

Second, don’t be lazy when forming opinions on important topics that matter to you. It takes hard work to form well-grounded opinions. Keep verifying your opinions by adding facts.

Third, when disagreeing, ask yourself if you’re disagreeing based on facts, or based on your opinions. Change your disagreement into a question. At the same time, avoid the tendency to ask leading questions to confirm your opinions.

Finally, be willing to let go of opinions. I find that getting data, studying alternatives, and seeking disconfirming facts help us let go of opinions.

Our opinions are like the debt we carry on our backs, wherever we go. They are useful until they are not. I remind myself that, to be comfortable with not having opinions, I must be comfortable with saying that I don’t know, yet.

I will leave you with this quote by Charlie Munger.

I never allow myself to have an opinion on anything that I don’t know the other side’s argument better than they do.

Incidents — Trends from the Trenches

2019-02-26T18:26:22+00:00

Most publicized production incidents are war stories. Each involves drama with dead ends, twists and turns, and a victory at the end. Something innocuous happens, that then snowballs across several layers to take down some parts of a business. A big chunk of internal or external customers gets impacted. Several teams spend long hours on a conference call or in the war room to mitigate the customer impact.

You may recall well-publicized incidents like the AWS S3 outage in 2017 that impacted several AWS customers, including Apple iCloud, or the cyber attack on Dyn DNS that affected several American and European sites, or last year’s Amazon.com Prime Day outage.

Such incidents are rare, and yet they remain in our memories for years. In reality, most production environments encounter incidents almost every day. As you see below, the cumulative cost and customer impact of such incidents can be much larger than the infrequent dramatic ones.

During the fall of 2018, I set out to develop informed opinions on how to improve the availability of production systems at work. There is no dearth of architecture patterns, tools, techniques and processes available to improve availability. How do you determine which ones to focus on and when, and make continuous improvements? That was the question I was grappling with. More important, I also needed a way to challenge some of my own prior opinions.

Incident analysis

For this, I could think of no better way than to study incidents to spot patterns. During November and December of 2018, I spent several weeks studying several hundred production incidents. This sample set covered a very large set of customer-facing apps and services running on-prem and in the cloud, including some that are yet to be modernized, as well as the on-prem infrastructure. I meticulously went through each critical incident, read incident logs, and where available, reviewed postmortem reports, and classified incidents based on a few categories of potential triggers.

A clarification on the terminology here. I’m using the term “trigger” and not “root cause” to classify incidents. This is to emphasize the fact that most production incidents have several root causes. A trigger may just have surfaced an incident.

This analysis was time-consuming and laborious. Yet, the insights I gathered were well worth the time I spent. In this article, I want to share my findings, offer some hypotheses to explain them, and suggest what could be done to improve availability.

The chart below summarizes my findings. It shows the top 5 triggers behind these incidents, ordered by the cumulative customer impact.

The size of each slice represents the customer impact as measured by certain metrics, and not the number of incidents.

Contrast this chart to the one below, which shows the incidents by number under each category.

I omitted some categories in these charts because they are not relevant for this article. A similar analysis of a different sample set of incidents might produce a different set of triggers, though I suspect that the above shows common trends across most large enterprises undergoing constant change. A few peers in the industry also confirmed that they notice similar patterns.
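The difference between the two charts comes down to how the same incidents are aggregated: by summed impact or by count. Here is a minimal sketch of that aggregation, assuming each incident record carries a trigger label and a numeric impact score. The records, labels, and numbers are made up for illustration and are not taken from the sample discussed here.

```python
from collections import defaultdict

# Hypothetical records: (trigger, customer impact score). Illustrative only.
incidents = [
    ("change", 40.0),
    ("config drift", 25.0),
    ("unknown", 5.0),
    ("unknown", 4.0),
    ("unknown", 3.0),
    ("certificate", 10.0),
]


def summarize(records):
    """Return (share of total impact, share of incident count) per trigger."""
    impact = defaultdict(float)
    count = defaultdict(int)
    for trigger, score in records:
        impact[trigger] += score
        count[trigger] += 1
    total_impact = sum(impact.values())
    total_count = sum(count.values())
    impact_share = {t: v / total_impact for t, v in impact.items()}
    count_share = {t: n / total_count for t, n in count.items()}
    return impact_share, count_share


impact_share, count_share = summarize(incidents)
```

With data shaped like this, a category can dominate by count while contributing a small share of impact, which is why the two views are worth reading side by side.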

Observation 1: Change is the most common trigger

About a third of the impact was triggered by changes. Of this, about 50% was due to software deployments. In my classification, a change could be any of the following:

Automated CI/CD releases
Semi-automated deployments of legacy apps
Manual changes
Configuration changes, such as traffic routing, or ingress/egress filters
Experiments (A/B tests)

As I showed in Taming the Rate of Change, given that the production environment at work undergoes a few thousand changes every working day, the change failure rate is still low. The impact, nonetheless, is significant.

This observation supports the anecdotal evidence for a low number of incidents during long weekends and holidays when production changes are low. Just last week, a colleague of mine quipped that production systems were mostly stable during the recent Seattle Snowmageddon 2019 because most people could not get to work. Some areas also lost power and Internet access during that time.

A couple of months before this analysis that produced the above pie charts, I analyzed a smaller sample of just over 100 critical incidents that covered a particular set of business functions. For each incident, I asked a simple question — was there a change that preceded the incident. I grouped all incidents with a “yes” into one bucket, and everything else into another bucket. The result is below.

The result was surprising and extremely alarming. Over two-thirds of the sample of incidents was triggered by one or more changes. This finding led me to the latter analysis of the larger sample of incidents. Change is still at the top, by customer impact.
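The yes/no question above can also be asked mechanically when incident and change timestamps are available. The sketch below is only an illustration: the function name and the 24-hour look-back window are assumptions, and the original bucketing was done by reading each incident, not by a script.

```python
from datetime import datetime, timedelta


def preceded_by_change(incident_start, change_times, window=timedelta(hours=24)):
    """True if any change landed within `window` before the incident started.

    The 24-hour default is an arbitrary, assumed look-back period.
    """
    return any(
        timedelta(0) <= incident_start - changed_at <= window
        for changed_at in change_times
    )


# Hypothetical usage: one change two hours before the incident.
start = datetime(2018, 9, 1, 12, 0)
flagged = preceded_by_change(start, [start - timedelta(hours=2)])
```

A match only says that a change came first. It does not prove the change was the trigger, so a flagged incident still needs a human read.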

There is prior research to support this observation. A 2016 ACM paper titled Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure makes the following observation based on a detailed analysis of over 100 high-impact network failure events:

a large number of failures happen when a network management operation is in progress within the network.

Observation 2: Config drift accumulates over time and masks potential future incidents

The second trigger from the top is config drift, which contributed to about one-fifth of the impact.

For those not familiar with config drift, consider a cluster of nodes each of which is expected to maintain a certain configuration. The configuration may include the OS, OS level or application level dependencies, security groups and such access controls, config files etc.

The cluster could be a SQL database in an active-passive configuration, a Zookeeper cluster, or a pair of network switches. In order for the cluster to stay healthy when any one node fails, each node is expected to be in a certain configuration. Now, say, due to someone manually making changes, or an automation defect, one of the nodes does not have the expected configuration. This is config drift.
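Detecting drift is, at its core, a comparison between the configuration each node should have and the one it reports. The sketch below assumes configurations can be flattened into key/value dictionaries. The node names, keys, and values are hypothetical.

```python
def find_drift(expected, nodes):
    """Map each drifted node to {key: (expected_value, actual_value)}."""
    drift = {}
    for name, actual in nodes.items():
        diff = {
            key: (expected.get(key), actual.get(key))
            for key in expected.keys() | actual.keys()
            if expected.get(key) != actual.get(key)
        }
        if diff:
            drift[name] = diff
    return drift


# Hypothetical pair of switches; switch-b has drifted on one setting.
expected = {"firmware": "1.2.3", "mtu": 9000, "failover": "enabled"}
nodes = {
    "switch-a": {"firmware": "1.2.3", "mtu": 9000, "failover": "enabled"},
    "switch-b": {"firmware": "1.2.3", "mtu": 1500, "failover": "enabled"},
}
drifted = find_drift(expected, nodes)
```

Run on a schedule, a check like this would surface the drifted node while its peer is still healthy, instead of months later during a failover.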

It is fairly common for config drift to stay dormant for weeks or months and surface only when some other event happens. In one particular incident, one of the network switches configured in a pair drifted from its configuration. Months later, the other switch failed for some reason, and the drifted switch could not take over. This led to network disruption. I’ve witnessed similar incidents in the past with other types of clusters, and have stories to tell.

Observation 3: We don’t always know why systems fail

The next biggest in my finding was a large number of incidents that recovered on their own after a while. Though this category was the third by customer impact (per the first pie chart in this article) on my list, it accounted for over 40% of the incidents (per the second pie chart) I examined.

To reiterate, for over 40% of incidents, there was an alert of customer impact, an incident was declared, relevant people got on the incident bridge, and while the investigation was ongoing, the impact mitigated by itself.

Unfortunately, such incidents don’t get the attention of postmortem analysis, and hence corrective actions.

Observation 4: Infrastructure issues are less frequent than commonly believed

Infrastructure related failures like data center power, disk or other hardware, WAN link etc. are less frequent than most people believe. The same is true for public cloud service or region failures. Such issues accounted for a smaller percentage of customer impact in my analysis.

In some of the incidents I reviewed, while initial investigations pointed to misbehaving infrastructure (such as a particular vendor’s appliance failing), further analysis revealed botched changes (see Observation 1) or config drift (see Observation 2).

Finally, the fifth in my list is incidents related to certificate handling. There were just a handful of incidents in this category, and yet the impact was not insignificant. The issues related to forgetting to renew certificates in time or not coordinating the renewal across multiple systems. While these are easily fixable through automation or even processes, such errors continue to happen in complex production environments.
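As a rough illustration of the kind of automation that catches forgotten renewals, the sketch below flags certificates that are expired or about to expire. It assumes an inventory mapping certificate names to expiry times is already available. The names, dates, and the 30-day threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certs, now, threshold=timedelta(days=30)):
    """Return names of certificates expiring within `threshold` of `now`.

    Already-expired certificates are included. Soonest expiry comes first.
    """
    due = [
        (not_after, name)
        for name, not_after in certs.items()
        if not_after - now <= threshold
    ]
    return [name for _, name in sorted(due)]


# Hypothetical inventory of certificate expiry times.
now = datetime(2019, 2, 1, tzinfo=timezone.utc)
certs = {
    "api": now + timedelta(days=10),
    "web": now + timedelta(days=90),
    "db": now - timedelta(days=1),
}
to_renew = expiring_soon(certs, now)
```

The harder part in practice is keeping the inventory complete across systems, which is where the coordination failures mentioned above tend to come from.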

What is going on

Given the large sample size covering a diverse set of apps, services, and technologies, analysis like this provides an opportunity to better understand contemporary production environments at a high level. Below are my hypotheses of what might be contributing to these trends.

First, we trip on ourselves when making changes. The biggest risk to the availability of production systems is constant change. Due to the adoption of microservices, and investments into containers, CI/CD, and the cloud, our ability to make changes in production environments has been rapidly increasing. There is no turning back from this trend due to productivity gains. However, change safety is not always an inherent feature in the tools used to make changes.

As I argued in Taming the Rate of Change, these technology trends are contributing to the following:

Hyperconnectedness: Enterprises are increasingly deriving value from connecting various services in numerous ways. In a sense, the value of the enterprise is slowly shifting from nodes (systems doing particular things) to edges (interconnectedness). This is increasing possibilities for both success and failure.
Side effects: Amidst hundreds or thousands of services, anyone making a change to a particular microservice is unlikely to know all the consumers of that service across multiple layers.
Hope-driven releases: Production environments are often the only reliable environments to test a change. As most enterprises are decentralizing the once-common release engineering discipline, pre-production environments are becoming stale, unreliable, and lightly monitored. Consequently, testing in production is increasingly in vogue.

Second, the desire for speed may be stealing focus from automation. This analysis makes it clear that automation is rarely complete, with less frequently used parts of any workflow getting the least amount of attention.

Furthermore, as we move on from one generation of technology and architecture to the next one, we rarely leave the prior generation in the best possible shape.

Consequently, as systems age, less frequently used parts accumulate config drift. Unlike other kinds of bugs, drift tends to remain dormant until some other event occurs and turns it into a fault.

This trend is not limited to on-prem services. Apps and services deployed on the cloud are also subject to config drift. Teams adopting new technology usually start with automation to get going quickly, but do not necessarily automate the manageability tasks that come up later. This keeps the door open for drift to creep in.

Finally, the large number of incidents in the unknown category shows that our ability to comprehend the physics of hyperconnected systems is limited. Furthermore, as systems seem to recover on their own, we’re also losing the opportunity to learn from such incidents.

Potential ways to improve

This analysis certainly helped me refine my opinions on areas of investments. I want to highlight a few techniques to help deal with the trends I noticed.

First, the most important takeaway from this analysis is the need to improve change safety. Progressive deployments (i.e., introducing the change bit by bit), feature flags, blue-green deployments, predictable rollbacks, and shadow testing are some of the ways to improve change safety. Anyone interested in increasing deployment frequency must also invest in such safety strategies.
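To make the idea of a progressive deployment concrete, here is a minimal sketch: traffic moves to the new version in stages, and the first failed health check sends everything back to the old version. The stage percentages and the `is_healthy` callback are assumptions. In a real system that callback would consult metrics and alerts over a soak period.

```python
def progressive_rollout(stages, is_healthy):
    """Shift traffic stage by stage; roll back on the first unhealthy stage.

    `stages` is an increasing list of traffic percentages for the new version.
    `is_healthy(percent)` is a caller-supplied check (hypothetical here).
    Returns (final_percent_on_new_version, rolled_back).
    """
    current = 0
    for percent in stages:
        current = percent  # send this share of traffic to the new version
        if not is_healthy(current):
            return 0, True  # roll back: all traffic returns to the old version
    return current, False


# Hypothetical run: the new version misbehaves once it takes half the traffic.
result = progressive_rollout([1, 10, 50, 100], lambda percent: percent < 50)
```

The point of the small early stages is that a bad change is caught while only a small share of customers is exposed to it.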

The second area of investment is fault containment and redundancy. Some of the complex incidents take time to restore, and traffic shifting to a redundant copy (active or passive) may provide a faster and reliable alternative to in-place fire-fighting. See my article on Fault Domains and the Vegas Rule for a description of how redundant fault domains can help reduce time to restore. Another excellent article to read on this topic is Werner Vogels’ Looking back at 10 years of compartmentalization at AWS which describes how AWS uses “compartmentalization” for horizontal scalability as well as to contain faults to smaller domains.

However, maintaining redundancy is non-trivial. Apart from designing for redundancy, periodic traffic-shifting practice drills are essential for maintaining fault domain integrity and readiness to shift traffic.

The third area of investment is to either commit fully to automating systems or use a cloud-managed service to take care of most of the automation. I always recommend the latter due to faster time to market and lower operational overhead. Though this does not fully eliminate the possibility of config drift, it can at least help reduce the number of moving parts you have to automate yourself.

Next in the list of areas of investment is observability, in particular, tracing, to improve steady-state understanding of today’s hyperconnected production environments. Traces and service graphs help improve a team’s understanding of how their services are used and how they are behaving during the steady-state.

The fifth area, and the second most important after change safety, is spending more time after incidents through post-incident rituals. Across the industry, most teams treat incidents as distractions and are eager to get back to regularly scheduled work as soon as systems are restored. This trend needs to change, as incidents teach us about non-linear behaviors of complex hyperconnected systems. At work, we’re experimenting with a few post-incident rituals like peer reviews of postmortem reports and, in some cases, subjecting the system in production to similar triggers after fixes have been made.

Prior to this analysis, my approach to improving the availability of production systems involved adopting defensive strategies like Hystrix, ensuring redundancy, and adopting chaos testing. These are all essential techniques in a toolbox. This analysis gave a perspective on where to zoom in, and of course, boldly highlighted the need for change safety.

Let me end with a caveat. Any analysis like this will highlight some broad strokes while obscuring specifics. Take such findings as one of several inputs.

The Value is in Dealing with the Messy Stuff

2019-01-10T13:17:34+00:00

Over the last several years, I have had the opportunity to lead a few projects that were too large for any single team to execute. I also dealt with problems that did not fit the existing team and org structures at all. Some of these were also “multi-VP” problems (see below). These were messy and ambiguous projects that tempt you to give up and walk away.

I decided to write down my observations and lessons learned because I find that people who want to grow in their careers, as managers or individual contributors, must be comfortable dealing with such problems. Otherwise, they are unlikely to be able to influence and deliver anything of consequence.

It starts with silos

Who doesn’t blame silos at their place of work for its slowness, bureaucracy, and dysfunction? Who does not want to eject themselves from such places to land in magical silo-free lands where things move perfectly fluidly?

We blame silos for their resistance to change. Silos are known to produce architectures, processes, and operations that follow communication lines between those silos, with little regard to the overall problem. This is the essence of Conway’s law:

Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization’s communication structure.

In his Toward Simplifying Application Development, in a Dozen Lessons, Conway later clarified that the importance of this principle is to probe

(whether) your design organization is keeping you from designing some things that perhaps you should be building.

or even more important, to ask the following question.

Is there a better design that is not available to us because of our organization?

This question is the quintessential litmus test for silos. Most of us who have had frustrating experiences with silos would answer this question in the affirmative.

Yet, we also prefer small independent autonomous teams for faster decision making, short feedback loops, and to move accountability closer to where the information and execution is. In fact, we can sum up the whole micro-services movement as one to create small silos for agility and efficiency to create value quickly. Through micro-services, and micro-services inspired architectures, we are facilitating small independent deployments of code, each of which does one thing well, and are removing the coordination tax monoliths need to create value. As Wikipedia notes:

it parallelizes development by enabling small autonomous teams to develop, deploy and scale their respective services independently

Fair enough. But notice the dichotomy between wanting to break silos apart, and yet wanting small autonomous teams (i.e., silos) to move fast.

Why this dichotomy? Can both of these wants be right? Is there a way to have the efficiency and agility of small teams and yet avoid the trap of Conway’s law?

This dichotomy is not fictional. You can lead no major body of work of consequence without facing this dichotomy. Regardless of your title or role at your workplace, dealing with this topic is an essential prerequisite to growing influence and leadership. You are unlikely to influence change if you remain frustrated with this dichotomy, or avoid it altogether by ejecting yourself to land elsewhere.

Silos are efficiency centers

The first thing to recognize is that we need silos for a reason. Once we understand how to decompose a large problem into several smaller problems, silos help solve those parts efficiently. You structure the silo as a work center to concentrate around the information to solve the problem, equip it with the resources needed, and push decision making to the silo. Thus, each silo becomes a center of efficiency to produce value quickly.

Silos allow clear roles and responsibilities. You know where to go and who to ask when there is a problem. Everyone in the silo is in close proximity to the information needed to autonomously make or change decisions. This leads to empowerment, and empowerment leads to accountability.

Silos also breed their own micro-cultures, which include rituals, processes, rules, language, pride, and ownership.

The culture and processes that the silo breeds for itself are usually optimized to run the status quo efficiently. The status quo may be one or more problem areas, execution of certain tasks to produce some well-defined outcomes, or production of something of value that fits in a broader context. A “user profile” team in an e-commerce company or the shipping team in a retail store are perfect examples of silos.

The silo’s micro-culture also makes up an invisible boundary around it. The “pride and language” that each silo develops constitutes the other face of a coin called “ownership and accountability.”

Out of the necessity for autonomy and efficiency, silos develop a terminology of “us and them.” Hence they appear insular and resistant to change. For example, when you approach the “user profile” team (a silo) for what you think is a really important feature, you may get a response that “our team will decide after the next sprint” or that “we have decided not to build that feature for x, y, and z reasons.” You may think that they are resisting change by being inflexible when in fact they may just be exercising their autonomy.

As we shall see below, such resistance is not always the silo’s fault.

Ambiguity

Silos become inefficient when you’re attempting any major change that spans multiple existing silos. Silos appear friction-some when you overlay a transformational change on top of existing silos.

I face this challenge with most large problems I work on. These problems don’t always map to existing silos. I can’t tell which of the existing teams can help solve the problem.

I will give you one fictional example of what I used to jokingly call a “multi-VP” problem.

A multi-VP problem is one that either requires multiple VP level managers to work together to structure a solution, or the implementation of the solution takes longer than the tenure of a single VP level manager. That tenure may be around 2–3 years, whereas the solution may take 5–6 years. In either case, the chance of getting the problem solved successfully isn’t high.

Here is my fictional example. It entails breaking a large monolith database that everything in a company depends on for most of the data, into smaller decoupled databases to facilitate a micro-services architecture. Sounds familiar?

The technical architecture for the problem is simple. It may involve building blocks like picking a cool new database technology, data modeling, data migration, dual-writing data, and adapters for switching between the old and new databases, for each logical chunk of data in the monolith database. Drawing up the architecture is the easy part. Everyone gets excited about this part as it is considered innovation.
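To show how small the technical core of such a building block can be, here is a sketch of a dual-writing adapter with a switch for reads. It is a simplification under stated assumptions: the class name is hypothetical, plain dictionaries stand in for the old and new databases, and failure handling, retries, and reconciliation are left out.

```python
class DualWriteStore:
    """Write to both stores; read from whichever one `read_from_new` selects."""

    def __init__(self, old, new, read_from_new=False):
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def put(self, key, value):
        self.old[key] = value  # the old store stays authoritative for now
        self.new[key] = value  # keep the new store in step for a later cutover

    def get(self, key):
        store = self.new if self.read_from_new else self.old
        return store[key]


# Dictionaries stand in for the monolith database and its replacement.
old_db, new_db = {}, {}
profiles = DualWriteStore(old_db, new_db)
profiles.put("user-1", {"name": "Ada"})
profiles.read_from_new = True  # flip reads once the new store is trusted
migrated = profiles.get("user-1")
```

The code is the easy part. Deciding who owns it, who runs the migration, and who flips the switch for each chunk of data is where the ambiguity lives.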

However, who should actually do the dirty work of implementing this architecture? I’ve had some colleagues complain that they “need power” to “push through” their architecture.

Should we ask the team that manages and administers the monolith database to implement this architecture? That team may be equipped with the skills needed to administer the database servers and databases efficiently. They may be the best in the company at running that database at scale with high availability. They may be adept at capacity management, upgrades, backups, and restorations without a glitch. They may be experts in schema management and indexing. They may have developed the skills to spot bottlenecks just by looking at a query. But they are unlikely to lead a micro-services transformation, as they may have never looked at the apps that depend on the database, may not have the faintest idea of what those apps actually do, and may have never practiced any coding.

How about we ask one or more of the tens of teams that depend on that monolith database? Having never needed to know the nitty-gritty of managing a database on their own, they are likely used to throwing all their database tasks over the fence to the team managing the database. They probably also have long scrolls of features to build. Over time they might have developed a development cadence that leaves no cycles to learn and pick up database related work.

This is the nature of an ambiguous problem. It is too large for any single team to pick up and solve. The problem involves many building blocks with no clear mapping to existing silos. Implementing a solution takes a lot of time. The outcomes are unclear and are not guaranteed. You may encounter several surprises along the way.

I struggled, in the beginning, to get such problems solved. I used to get frustrated and sometimes considered moving on instead of trying to find a way. This changed once I tweaked my mental model.

My original mental model saw the organization as being political, inflexible, bureaucratic, and incapable of change with pockets of teams, vested in their survival, resisting any change. This is a fairly common mental model employed by a lot of us. However, this model is flawed, lethargic, and instills stagnation and not change.

My attitude changed once I changed my mental model. Now I see such problems as being ambiguous in nature. My mental model now recognizes quickly that existing silos are not optimized to resolve the ambiguity, that there is no problem with the existing silos, that the new problem requires a new approach, and the opportunity may be mine to break it down.

Disrupt current silos to create new silos

When faced with ambiguous problems like the “multi-VP problem” above, first recognize that you need to disambiguate the problem, restructure the complexity of the problem, and influence the organization to instill change. This is easier said than done. It requires empathy with existing silos, humility to let go of your ideas, and patience and tenacity to influence others.

While I can’t write down a prescription that everyone can follow, below are some of what worked for me, and what I saw others practice.

Owning the problem, by recognizing that there is an opportunity to step up and own it, as messy as it may be, instead of acting like a victim facing a villain. You should be comfortable with the mess that comes with owning an ambiguous problem.
Developing and clarifying why. It is not sufficient to say that the problem you saw is important. You have to be able to break it down into specifics that others can relate to, identify themselves with, and are motivated to solve. This, of course, requires identifying and building coalitions, and proactively identifying risks and blockers.
Following through the execution of a solution by helping form new silos to efficiently solve the problem. This may be the hardest part. But recognize that no other person may have spent as much time as you on the problem, and there is likely no one who can do this step for you.
Finally, remaining focused on the outcomes and not on “how” you want the problem solved. That is, you should be willing to let go of your ideas of the solution so that new silos can develop to own and efficiently solve the problem.

These are all traits of leadership.

Some of the most effective leaders I had a chance to work with are master-disambiguators. When faced with making a transformational change, they focus on disrupting existing silos only to form new silos to lead the change. They rely less on positional power and more on influence to disambiguate the problem and develop ways for others to contribute to a solution.

Back to the point. Instead of blaming existing silos, you may have to form new silos to make transformational changes. This does not mean re-organizing the existing teams into a new reporting structure. You may do so at a later time, or skip it entirely if the culture of the organization is built around units of work and not reporting relationships.

Deal with the mess to make an impact

To conclude, silos are not bad. Most silos are centers of efficiency. Instead of always trying to overlay a new transformation, whether small or large, on top of existing silos and getting frustrated with the slowness and friction, you may have to first structure a solution, then figure out what kind of silos you need to execute efficiently, and then step up to lead that change. You proceed, and then discover that the new work centers (silos) are not helping the next transformation. You go through the same process again. This is a cycle.

You will stop blaming the organization for its silo culture once you recognize that existing silos are optimized to run the status quo efficiently, but not to lead a change.

Conway may have recognized this when he wrote about his second lesson in Toward Simplifying Application Development, in a Dozen Lessons:

Lesson 2: If you want the cleanest possible product you have to find the simplest possible design before organizing to build, or else you have to be prepared to reorganize.

Let me reiterate that this stuff is not easy. The dichotomy between wanting to break silos apart, and yet wanting small autonomous teams is a natural process of instilling change. It can be frustrating. There will be ups and downs. Success is not guaranteed and you will make mistakes. Therein lies an opportunity.

Update on Jan 12, 2019: In response to this post, Sean Gilles observed that I used “silo” and “team” interchangeably sometimes, and offered the following difference between the two

A team is part of an organization’s model of how problems get solved and silos are part of the observed reality of how problems get solved.

Thanks Sean.

Contemporary Views on Serverless and Implications

2018-12-29T23:53:42+00:00

We want near-instantaneous elasticity of resources and never have to pre-allocate resources or pay for more resources than needed. We also want all the operational best practices baked into a runtime to free us from having to worry about most of the low-level automation, operations, and robustness to run our code. These are two of the most fundamental pursuits of cloud computing for nearly a decade, and serverless is the closest available to realize these opportunities.

However, as I look back on 2018, I find the year to have been confusing for serverless in message, value, and direction. Despite the potential, the availability of a number of frameworks, the publication of a number of books, several conferences worldwide on this subject, and, more important, continued enterprise adoption, we’re slow to realize the benefits.

What may be holding us back are our mental models and views on serverless. As a recent paper “Serverless Computing: One Step Forward, Two Steps Back” noted, “the notion of serverless computing is vague enough to allow optimists to project any number of possible broad interpretations on what it might mean.” I couldn’t agree more.

In this post, I want to summarize three contemporary views of what counts as serverless, and the implications of these views. My goal is to show that our views determine the outcomes and that unless we refine our views, we may not find a better future for ourselves.

1. Serverless as someone else managing your servers

In this view, a serverless capability shifts the operational responsibilities to a provider, so that, you, as the consumer of that serverless capability, do not have to think about managing servers. All the associated responsibilities like server provisioning, operating system upgrades, maintenance, capacity management etc., therefore, shift from the consumer to the provider of the capability. This view supposedly frees you from thinking about “ops” and lets you focus on your code.

In the extreme, you can extend this view to classify any service that someone else runs as being serverless. Here I’m using Wikipedia’s definition of a service as “a discrete unit of functionality that can be accessed remotely and acted upon and updated independently”. The slide below from Kelsey Hightower’s tweet exemplifies this point of view of serverless as an operational construct. At the time of writing this, I’m not aware of the original author of this slide.

Per this view, any cloud service qualifies to be “serverless”. You can grade each service by its “degree of serverless-ness” based on the ease of gaining agility, elasticity and cost efficiency. The easier a service gets these qualities, the more serverless the capability is. That’s what you see in the slide above from left to right.

CNCF’s serverless working group also falls into this view when describing “backend-as-a-service” (BaaS).

Backend-as-a-Service (BaaS), which are third-party API-based services that replace core subsets of functionality in an application. Because those APIs are provided as a service that auto-scales and operates transparently, this appears to the developer to be serverless.

BaaS is a jargon-word to describe multi-tenant middleware services.

A key limitation of this view is that it constrains you into focusing on outsourcing the heavy-lifting to a provider while ignoring a key property of serverless, which is that serverless also includes a programming and run-time environment to let you write and run your code.

For example, consider S3 and Lambda. Both are multi-tenant, elastic, auto-provisioned, pay-per-use services, though one offers you an API to store objects while the other gives you an opinionated programming framework and run-time to write and run your code.

This view also favors cloud providers like AWS that operate many managed services like those you see in the slide above, but not the portable open source serverless frameworks that exist in the wild today. Most open source and third-party serverless solutions require you to provision some resources and manage those yourself. For example, running a framework like OpenFaaS or Kubeless on Kubernetes is not serverless per this view since you still need to provision, manage, upgrade, and secure your Kubernetes clusters.

2. Serverless as functions and events

In this point of view, serverless is a programming model consisting of small units of code written as functions triggered by events through a declarative configuration. It is the idea of micro-services taken to its logical limit. In this view, functions, and events are the developer-facing abstractions to write applications.
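As a toy illustration of this programming model, the sketch below pairs two small functions with a declarative table that maps event types to them. Everything in it is hypothetical: the event names, the function names, and the dispatcher are stand-ins and do not correspond to any particular framework's API.

```python
def resize_image(event):
    """Function triggered when an object is created (hypothetical event)."""
    return f"resized {event['key']}"


def send_welcome(event):
    """Function triggered when a user signs up (hypothetical event)."""
    return f"welcomed {event['user']}"


# Declarative configuration: which event type triggers which function.
TRIGGERS = {
    "object.created": resize_image,
    "user.signed_up": send_welcome,
}


def dispatch(event):
    """Stand-in for the runtime: look up the function for an event and call it."""
    handler = TRIGGERS.get(event["type"])
    if handler is None:
        raise LookupError(f"no function registered for {event['type']!r}")
    return handler(event)


outcome = dispatch({"type": "object.created", "key": "photo.png"})
```

Note that nothing here says who runs `dispatch` or the machines underneath it, which is exactly the question this view leaves open.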

This view focuses entirely on developer-facing abstractions. Per this view, any framework offering functions and events as abstractions is serverless. It really does not matter who operates the run-time and the resources that the run-time needs. Below is a tweet by Chad Arimura, who leads Oracle’s Fn Project, that summarizes this preference towards developer experience.

I’ve heard other proponents of Kubernetes based function frameworks express a similar view. As this view does not focus on who runs the servers, it provides the broadest umbrella for several open source projects to offer innovative and fun to use function and event-based programming frameworks.

Though developer experience is important, this view ignores properties like elasticity, cost efficiency, and lower operational overhead. You might also wonder if this view is nothing but a reverse-engineering of AWS Lambda to recreate the developer experience without the cost and operational efficiencies.

3. Serverless as functions as a service (FaaS)

This most commonly used view of serverless describes what was originally offered by AWS Lambda in 2014, and is now followed by a few other cloud providers. It incorporates both a function and event-based stateless programming model, and a run-time offered as a service. In addition, you have access to a rich set of cloud services for middleware functions.

While this view serves certain stateless application patterns very well, it also pigeon-holes us into not thinking beyond functions and events. We can’t express every one of today’s and tomorrow’s programming problems in the world in the form of loosely coupled stateless ephemeral event-triggered functions.

No other work describes the consequences of this pigeon-holing better than the recently released UC Berkeley paper “Serverless Computing: One Step Forward, Two Steps Back”. Below are some example highlights from this paper.

For a model training problem: “Lambda’s limited resources and data-shipping architecture mean that running this algorithm on Lambda is 21× slower and 7.3× more expensive than running on EC2.”
For a low-latency prediction serving via batching problem: “This “serverful” version (that replaces Lambda and SQS with EC2 and ZeroMQ respectively) had a per batch latency of 2.8ms — 127× faster than the optimized Lambda implementation.” Text in parentheses is mine.
For a distributed computing problem: “in the (unachievable) best-case scenario — when each leader is elected immediately after it joins the system — the system will spend 1.9% of its aggregate time simply in the leader election protocol. Even if you think this is tolerable, note that using DynamoDB as a fine-grained communication medium is incredibly expensive: Supporting a cluster of 1,000 nodes costs at minimum $450 per hour.”

There are several other undocumented examples like these. For instance, I would not pigeonhole many of Apache Spark powered large-scale data crunching solutions into event-triggered functions.

Cloud-native managed services may fill the void for solving such problems while still offering elasticity, cost and operational efficiencies of serverless. I made such an argument in the past, and yet I recognize that waiting for such developments does nothing but stifle experimentation and innovation.

What is next?

Where do these views lead us? Not far from where we’re today.

Though I can’t back this up with numbers, it is very likely that only a tiny fraction of a percent of today’s worldwide compute capacity is used for running serverless workloads. The serverless opportunity is nearly infinite, and it is clear that today’s views on serverless won’t get us to a point of providing near-instantaneous elasticity, cost, and operational efficiencies for most programming problems. It is foolish to assume that we have all the serverless primitives necessary to solve all current and future programming problems.

Yet, I’m hopeful about the future. More examples like the UC Berkeley paper above will continue to shine a light on the limitations of contemporary views on serverless.

I also look forward to us acknowledging that the event-triggered function as the primary developer-facing abstraction is just one of several possibilities and that we need new types of frameworks offered as services to solve other types of problems.

Tomorrow’s serverless offerings will likely be “frameworks as services”, with “function as a service” being just one possibility to solve a certain class of stateless programming problems.

Taming the Rate of Change

2018-11-19T17:41:44+00:00

These are great times for pushing code to production. Thanks to the cloud, micro-services, and investments in CI/CD pipelines, teams that used to release code once or twice a month to production until a few years ago are now introducing production changes several times a day.

For example, at the Expedia Group, which is where I work, we are witnessing a significant increase in change frequency (number of production changes a day), with change lead time (from committing code to a successful production deployment) for most changes in minutes. See the chart below that shows change frequency over the last two years.

Change frequency is an indicator of time to create business value. In order to create value in a given amount of time, you need to be able to release your code a certain number of times and learn from those changes. The less frequently you release, the longer it can take to create value. An increase in the rate of change shows that you’re reducing the time to create value, thus increasing team performance. Conversely, low change frequency indicates high time to create value and low team performance.
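To make the two measures concrete, here is a minimal sketch that computes change frequency and change lead time from deployment records. The `DEPLOYS` data and the function names are made up for illustration; they are not the tooling behind the chart above.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Hypothetical deployment records: (commit time, production deploy time).
DEPLOYS = [
    (datetime(2018, 11, 1, 9, 0), datetime(2018, 11, 1, 9, 20)),
    (datetime(2018, 11, 1, 13, 5), datetime(2018, 11, 1, 13, 15)),
    (datetime(2018, 11, 2, 10, 0), datetime(2018, 11, 2, 10, 45)),
]


def change_frequency(deploys):
    """Number of production changes per calendar day."""
    return Counter(deployed.date() for _, deployed in deploys)


def median_lead_time_minutes(deploys):
    """Median time from commit to successful production deployment."""
    return median(
        (deployed - committed).total_seconds() / 60
        for committed, deployed in deploys
    )
```

Tracking both numbers side by side over time is what lets you see whether more frequent changes are also getting to production faster.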

As the 2018 State of DevOps report says,

Those that develop and deliver quickly are better able to experiment with ways to increase customer adoption and satisfaction, pivot when necessary, and keep up with compliance and regulatory demands.

However, change frequency alone is not a sufficient measure of team performance. As the same State of DevOps report aptly captures, production stability is an equally important measure of team performance. What good is high change frequency if the production environment is falling apart often for long periods of time? There is also empirical evidence that incident frequency stays low when change frequency is low. The chart below shows incident frequency (red line) tracking change frequency (green line) when the change frequency is low. The correlation is seen during holiday periods, when fewer changes were being made in production.

Update: See my later article Incidents — Trends from the Trenches for more evidence. My analysis of several hundred production incidents shows that change is the top trigger behind incidents.

But can an organization sustain increasingly high change frequency while simultaneously improving production stability? Don’t the tools and cultural changes used to increase change frequency also improve production stability? It depends. The very tools and cultural changes to increase deployment frequency may also contribute to increased fragility.

Based on metrics for deployments (first two rows in the table below) and stability (third and fourth rows), the DevOps Report also categorizes teams into elite, high, medium, and low performers. Once you start investing in micro-services and CI/CD, teams move from low/medium performance to high/elite performance based on deployment metrics.

However, a similar transition based on stability metrics is not automatic. Why so?

This question has no simple answer. Chaos people offer that continual chaos testing will help surface fragility. Monitoring and observability people want you to integrate with their tools to see what is going on. Service mesh people ask you to adopt their solutions to bake some of the stability best practices into your application runtime. The answer is a mixture of all these and more.

In this post, let me explore what is likely happening today, and what it might take to improve stability without sacrificing speed.

Observation: Physics of Interconnectedness

Interconnectedness is a common attribute of most contemporary architectures. Our systems are an interconnected, heterogeneous set of fast-changing and slow-changing components.

From experience, we can make the following observations of such architectures:

Since not every part of the architecture has the same need for high change frequency, each part may get different levels of people and time investments. You may chip away some parts of a monolith to gain change efficiency for those parts, and leave the remaining untouched. Consequently, monoliths and debt remain integral to some of our systems for far longer than we expect.
As it is getting easier to introduce new apps, overall architectures of our systems are changing faster than we can document them. This puts time pressure on the available knowledge and fragments team memory.
It has never been easier to introduce a diverse set of languages and frameworks into the architecture, leading to another dimension of heterogeneity.
With new code comes newly hidden assumptions about how various parts of the system work in the happy path, let alone assumptions about boundary conditions and failure modes. Every person making local decisions makes those with a peripheral understanding of how other components work. This is unavoidable as we can’t fully grok the complexity of our systems.
Cost and complexity of replicating these architectures end to end in dev/test environments are rapidly increasing, which is leading to testing a subset of changes directly in production. Testing in production is an acceptable and needed practice now.
Though a number of tests still get run in dev/test environments, most of those tests are localized and don’t exercise the interconnectedness of our architectures. The same is true about stress testing.
Traditional capacity/stress testing assumes that our systems are linear, producing predictable and proportional outputs given valid inputs. However, interconnectedness makes the relation between inputs and outputs of the overall architecture non-linear. The components of the architecture may appear linear, but not the overall system. Past success, based on certain initial conditions and inputs, therefore, does not predict future success.
Consequently, we don’t get to fully experience the dynamic and non-linear nature of our architectures until there is a fault in production.

Whenever I participate in incident response, I take an interest in observing how the participants reason about what went wrong and how to recover. Discussions on the bridge and in incident Slack channels demonstrate some of the above observations. Participants offer assertions about what went wrong and what should be done to fix it. They base these on their own certain, deterministic, and causal understanding of how the system is supposed to behave. Some turn out to be right, and some are futile guesses. Sometimes resolutions are quick, and in some cases, resolutions take hours.

Observation: Unclear or Porous Fault Domain Boundaries

With rapid change, new components come into critical paths often. What was a stable path yesterday may have a few new components today with not-yet-well-understood failure modes. This leads to unclear or porous fault domain boundaries.

You may have had confidence till yesterday that a failure inside the fault domain does not cascade outside, and vice versa. New dependencies can erode your confidence quickly. Was the timeout for the new dependency configured correctly? Was that new dependency aware of the new traffic you may be planning to take? Is that dependency soft (i.e., we can still serve the request, albeit in a degraded mode) or hard (i.e., a fault in that dependency cascades)? You may not have enough time to catch up and answer such questions as the architecture is constantly changing due to high change frequency.
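One way to make the soft versus hard distinction explicit in code is sketched below. The helper names and the `slow_recommendations` dependency are hypothetical; the point is only that the timeout and the degraded-mode behavior are stated rather than assumed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and stop waiting after timeout_s seconds.

    The worker keeps running in the background after a timeout; this
    sketch only bounds how long the caller waits.
    """
    return _pool.submit(fn).result(timeout=timeout_s)


def call_dependency(fn, timeout_s, soft, fallback=None):
    """Soft dependency: degrade to a fallback. Hard dependency: let the fault cascade."""
    try:
        return call_with_timeout(fn, timeout_s)
    except Exception:
        if soft:
            return fallback
        raise


def slow_recommendations():
    """Hypothetical new dependency that responds too slowly."""
    time.sleep(0.3)
    return ["item-1", "item-2"]
```

With `soft=True` the request is still served, just without recommendations; with `soft=False` the timeout surfaces to the caller, which is the cascade you would want to know about before production does.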

Observation: Incomplete Automation

Automate everything is a great slogan. In reality, automation is rarely complete.

There are several reasons why.

First, the most frequently executed parts of our workflows get the highest priority for automation investments. For instance, CI/CD investments for stateless apps and services far outweigh similar investments for stateful parts. Stateful parts include your self-hosted databases, caches, queues, streams, etc. The rationale is simple. In any given architecture, the need for change frequency is usually higher for stateless components than for stateful components. The usual attitude is to let the in-house expert deal with the stateful parts. “How was that database set up?” — you ask. The answer may be, “We don’t know. The DBA (or name your expert) set it up for us.”

Similarly, if one of your clusters is known to fail 3–4 times a year, would you spend two sprints to fully automate it, or jump in to fix it whenever there is a failure? Though the latter is nothing but unplanned work and contributes to the accumulation of forgotten failure modes, managers and prioritization decision makers often pick the latter over the former. This seems counter-intuitive, but most people don’t work across long time horizons when prioritizing work.

Second, you need closed-loop automation for lights-out management of systems. In a closed-loop system, an observer monitors common failure conditions and autonomously takes corrective actions. However, building closed-loop automation is hard and time-consuming. Thanks to modern frameworks like Kubernetes, we’re in a much better spot today than ever before to implement closed-loop automation. However, having a solution is different from actually using it. This could be because you invested in your current automation sometime before a solution came along, and you may have sunk too much time, resources, and process into it to quickly change it all.
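The core of a closed loop is small, even if making it production-grade is not. Below is a minimal sketch of one pass of such a loop, assuming a hypothetical desired state and caller-supplied `observe` and `start_replica` functions; it is not how Kubernetes or any other framework implements it.

```python
from typing import Callable, Dict, List

# Hypothetical desired state: service name -> healthy replicas wanted.
DESIRED: Dict[str, int] = {"checkout": 3, "search": 2}


def reconcile(desired: Dict[str, int],
              observe: Callable[[str], int],
              start_replica: Callable[[str], None]) -> List[str]:
    """One pass of the loop: observe, compare with desired state, correct."""
    actions = []
    for service, want in desired.items():
        have = observe(service)
        # Corrective action: start as many replicas as are missing.
        for _ in range(max(0, want - have)):
            start_replica(service)
            actions.append(f"started replica for {service}")
    return actions
```

Running `reconcile` on a schedule, with `observe` backed by real health checks, is what turns a one-off script into an observer that keeps correcting the system without a human on the bridge.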

Third, the ease of use of cloud services makes it very tempting to create and configure resources manually through cloud consoles and CLIs. We all know it is wrong but do it anyway. As memory fades and team composition changes, those resources become brittle to change.

Configuration drift is one of the painful consequences of incompleteness of automation. Drift is like tree rot. It happens slowly, one config variable at a time, one hidden assumption now and then, just a few misconfigured alerts, and one more manual tweak here and there. That’s how drift accumulates over time.

Over my career dealing with infrastructure and automation, I’ve witnessed many cases of drift accumulating over a period of time to disrupt planned work, degrade critical services, cause difficult-to-explain bugs, or lengthen time to restore because a critical team member is not on the bridge. The lesson I learned is to always strive to increase the level of automation but also expect drift. I would plan to measure and monitor for drift regularly, and not blame incompleteness of automation for the failures drift may have caused.
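Measuring drift can start as simply as comparing what your automation declares with what is actually running. The sketch below assumes both sides can be flattened into key-value dictionaries; the keys and values in the usage example are invented.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"


def detect_drift(declared: Dict[str, Any],
                 actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (declared value, actual value)} for every key that differs."""
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift
```

For example, `detect_drift({"timeout_ms": 500}, {"timeout_ms": 2000, "debug": True})` reports both the changed timeout and the manually added `debug` flag. Counting the entries in that report over time gives you a drift metric to monitor.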

Observation: Chaos Without Safety

Chaos engineering is not about randomly introducing faults into production systems. As Principles of Chaos Engineering explains, the idea of chaos engineering is to come up with hypotheses, create conditions to test those hypotheses, and then prove or disprove them. Through such hypothesis testing, you gain a better understanding of the physics of your system. You help surface hidden and forgotten assumptions. Such understanding is essential to reducing time to restore when failures happen.

However, despite some industry success stories, and even though chaos engineering is nearly 9 years old, its practice is still nascent in the industry. It is not often you would run into a team that says “We understand the value of chaos engineering. So we allocated x% of the development budget for chaos engineering practices”. A more likely answer is “This just isn’t the time to deal with chaos when we’re overbooked and understaffed. We’ll look at it later.”

As a mainstream activity, chaos engineering is perhaps where automation was 5–8 years ago, and experimentation (such as A/B testing) was 10+ years ago. So, what gives?

I see a few reasons for this hesitation.

(In this discussion, I’m ignoring those who like to treat production systems as sacred things that must not be willfully broken. Can’t help you. Sorry.)

First, a lack of confidence in recoverability from intentional failures inhibits the practice of chaos engineering. Would the system survive? What if we end up creating a massive production outage? Do we have the time to deal with the aftermath? Even in organizations that don’t punish people for breaking production systems on purpose, lack of confidence is a blocker for chaos testing.

Second, chaos engineering, when practiced in poorly understood complex environments, can tip the system beyond the point of equilibrium. You may not have the safeguards necessary to contain the effects of a chaos test. An intentional failure can quickly cascade, and lead the system into the zone of instability.

Finally, most enterprises lack dependable disaster recovery environments and practices. The fault domain, in such cases, envelops the entire production environment across one or more data centers. Consequently, when a chaos engineering test goes berserk, there is no escape pod to fail over to a healthy environment. Your only option is to firefight the failure in place. Who would want to intentionally create a large fire, and then jump in to fight it?

Culture of Safety to the Rescue

Stability concerns amidst high change frequency are a new reality for us to accept and adapt to. Like most things, stability is not the sole terrain of any single tool or practice.

Below are three phases of practices to consider to develop a culture of safety.

Before the change/Normal course

Design and build for redundancy: Redundancy helps improve safety. Lack of redundancy impedes your ability to test failure hypotheses, and thus your understanding of the physics of the system.

Pipelines to release safely through progressive or compartmentalized delivery, feature flags, blue-green deployments, canary releases, and finally change logging.

Pipelines to rollback: Sometimes rolling back a suspected change may be the quickest option to restore from failure. Exercise CI/CD pipelines for rollback.

Failover testing: This is an important activity to perform to increase confidence in chaos engineering, and to practice reducing time to restore by way of traffic shifting. This type of chaos testing can be much more valuable than simply turning off random machines in your environment.

During an incident

Change visibility: Quickly review changes to isolate potential suspects.

Rollback: Roll back suspected changes.

Roll forward: If rollback is not possible, rolling a new fix forward may be your next best option, provided you have visibility into key production metrics.

Failover: If you can’t roll back, or can’t push a new fix, then fail over to a healthy copy. This step takes practice. The investments made during the normal course to fail over traffic to a redundant copy will come in handy here.
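The order of preference described above can be written down as a small decision function. This is only a sketch of that ordering: the parameter names are mine, and the final "fix in place" branch is an assumption about what is left when none of the three options is available.

```python
def next_restore_action(can_rollback: bool,
                        fix_ready: bool,
                        metrics_visible: bool,
                        healthy_copy_available: bool) -> str:
    """Pick the restore step in the order: rollback, roll forward, failover."""
    if can_rollback:
        return "rollback"
    # Rolling forward is only sensible when you can watch key production metrics.
    if fix_ready and metrics_visible:
        return "roll forward"
    if healthy_copy_available:
        return "failover"
    # Assumption: with no other option left, the team firefights in place.
    return "fix in place"
```

Writing the order down, even this crudely, makes it easier to rehearse during normal times and to spot which inputs (such as metric visibility or a healthy copy) you cannot answer quickly today.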

After the incident

Postmortem: The amount of time you spend after the incident is more important than the time spent during the incident.

Low/medium performing teams don’t spend enough time after an incident to analyze what happened, to curate and document the findings, lessons learned, and action items, and to follow through on those action items in a timely manner. They get burnt out during the incident, and attention drifts away to other things in a few days.

On the other hand, high-performance teams conduct periodic operational reviews to review postmortems and follow up actions. They don’t let go of the post-incident learning phase.

Consider each postmortem as an opportunity to learn and reason about the physics of your systems, and not just as a chore to report the findings.

Post-incident validation testing: Once the fixes are made, validate design and code changes by testing if the system would survive a similar failure. Most available chaos testing tools help mimic a variety of failures. Test it in production as much as possible to increase confidence.

Remember that safety does not mean slowing down. It does not mean batching a large set of changes into big bang releases. A culture of safety means being deliberate about your actions, aware of the production environment, and conscious of customer experience. It requires you to develop an understanding of the complexity and the interconnectedness. The more you understand, the faster you can go.

Cloud Optimization Circus

2018-06-20T15:06:39+00:00

If you are a cloud adopter rapidly adopting cloud services, but not developing the finance governance muscle, you will certainly be visiting the cloud optimization circus frequently.

I compare cloud optimization exercises to going to a circus because those exercises invite all the same characters and emotions that you find in a circus. There is fear (of wasting money), trickery (by folks showing you how much you could be saving), illusion (of savings that don’t exist where you’re told they exist), excitement (of finding savings), and drama (of playing heroics). It may be fun and entertaining once or twice. Not so when you’ve a mission to accomplish, unless, of course, the mission is going to the circus.

Security and costs are the two biggest risks of cloud adoption. Security is a risk because teams that optimize for agility on the cloud tend to ignore security initially, only to realize its importance later. The cloud certainly gives you the building blocks for security, but it is up to you to use the building blocks in the intended manner. Cost is second on my list. The cloud is cheaper once you understand how cloud costs work and develop the governance muscle. The cloud can be very expensive otherwise.

Spend vs Demand

Back in February 2018, I gave a talk at the Container World Conference on Are We Ready for Serverless. One of the key themes of my talk was that serverless frameworks like AWS Lambda are the closest thing available today to ensuring that the supply of resources follows the demand for resources. Here is a hypothetical supply-demand chart.

Curve A shows the resource demand. This is the sum total of all resources required to run the business, which in this example varies during the day and the week. Curve B is the ideal supply and spend. In the best case, supply, and hence spend, closely follows the demand. This is possible with serverless frameworks. Curve C is what usually happens in cloud environments. Though supply varies due to auto-scaling and ephemeral usage, such as dev/test activities during the day tapering off over nights and weekends, it usually stays above the resource demand. Curve D shows the supply in data centers where it typically stays flat.
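To put rough numbers on the gap between these curves, the sketch below computes how much of the paid-for supply is actually used by demand. The hourly figures are invented to mimic the shape of curves A, C, and D; they are not measurements.

```python
# Hypothetical hourly figures in arbitrary resource units.
demand = [40, 35, 30, 60, 90, 80]           # curve A
cloud_supply = [70, 60, 60, 80, 110, 100]   # curve C
datacenter_supply = [120] * 6               # curve D


def spend_efficiency(demand, supply):
    """Fraction of supplied (and paid-for) capacity that demand actually used."""
    used = sum(min(d, s) for d, s in zip(demand, supply))
    return used / sum(supply)
```

In this made-up example, `spend_efficiency(demand, demand)` is 1.0 (the ideal curve B), the cloud supply comes out at roughly 0.70, and the flat data center supply at roughly 0.47.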

Let’s ignore serverless here. Though it is the most efficient and requires no effort to maintain the spend to tightly follow the demand, only a tiny fraction of total cloud workloads today run on serverless frameworks like Lambda. Serverless potential is yet to be realized at large, and each enterprise will have to carve out its own journey in the coming years.

The majority of cloud workloads today run on virtual machines, followed by multi-tenant managed services including network and storage services. Though some managed services bill you for what you need and use, for the vast majority, the task of making the supply (C) efficiently follow the demand (A) falls on development teams, an assortment of nascent tools, and mostly reactive practices.

However, the task of making the spend efficiently follow the demand is easier said than done. Cost consideration is usually an afterthought as most cloud adopters’ early focus remains on speed of delivery and not cost efficiency.

Unfortunately, this topic does not get much attention in the cloud community. Cost worries are usually brushed aside with suggestions like “use auto-scaling”, “use spot instances”, “fix your automation to clean up”, or “turn off your machines when you leave work”. Look at conference talks, meetups, and blogs: you will rarely hear about spend management practices, how to project costs, how to understand detailed billing data, how to maintain efficiency, best practices, failures, lessons learned, etc. Consequently, most cloud adopters fail to use the strongest lever that the cloud offers, which is to manage the spend to vary with the demand. But such a lever won’t operate by itself. You need to equip the organization with tools, practices, and processes to actually do the work.

For enterprises migrating from traditional data centers to the cloud, spend management is a lever that they don’t have in the data center. In the data center world, you do your best to estimate what you need in a year or so from now, spend all that, and hope it meets the need. There is no turning back if you find yourself with spare capacity. This is why most tech teams operating in traditional data center environments consider data center resources as free for all practical purposes. It would be a great missed opportunity to not deliberately practice efficient spend management as you ramp up on the cloud.

Over the last two+ years of leading cloud migration at work, I’ve had a chance to look at this area very closely, and take part in building a successful cloud finance governance engine to increase cloud spend efficiency. Let me share my observations and experience.

Problem 1: Data is Plenty, and Insights are Shallow

A RightScale post from November 2017 states that about $10 billion is wasted each year across AWS, Azure and Google. Another report by BusinessInsider from Dec 2017 proclaims that “companies waste $62 billion on the cloud by paying for capacity they don’t need”.

These are staggering numbers for sure. In my experience, we can’t quickly project such numbers at the enterprise level without producing a bottom up baseline through exercises like zero-based budgeting applied to every workload. These are time consuming activities involving testing every workload for the best price-performance ratio. Even predictions produced by tools like AWS Trusted Advisor and cloud cost dashboarding tools like CloudHealth fall short in reality as these tools lack the context of the workload. Consequently, most dev teams don’t often pay enough attention to these predictions.

Furthermore, detailed billing reports like the AWS Cost and Usage Report provide a wealth of detailed billing records. Here are a few sample billing records.

73xfp4egijc5zblnu4easfnmvmnw4gsdxpwo7v2jxu4epqct6q7a,2018-06-02T01:00:00Z/2018-06-02T02:00:00Z,,AWS,Anniversary,xxx,2018-06-01T00:00:00Z,2018-07-01T00:00:00Z,xxx,Usage,2018-06-02T01:00:00Z,2018-06-02T02:00:00Z,AmazonS3,USW2-Requests-Tier2,ReadLocation,,cf-templates-xxx-us-west-2,1.000000 0000,,,USD,0.0000004000,0.0000004000,0.0000004000,0.0000004000,"$xxx per 10,000 GET and all other requests",,“Amazon Web Services, Inc.",Amazon Simple Storage Service,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,S3-API-Tier2,GET and all other requests,,,,,,,,,,US West (Oregon),AWS Region,,,,,,,,,,,,,,,,,,,,,,,,,,, API Request,,,,us-west-2,,,,,,AmazonS3,Amazon Simple Storage Service,xxx,,,,,,,,,,,,,,,,USW2-Requests-Tier2,,,,,,,,,xxx,xxx,OnDemand,Requests,,,,,,,,,,,,,,,,,,,,xxx,,,,,,,,,,,xxx,,xxx,xxx,,,xxx,,,,
ip4a5gnjucytps5c7nittxxgwj3jo5yc5sszhu4kkk2wqqyzv4rq,2018-06-05T19:00:00Z/2018-06-05T20:00:00Z,,AWS,Anniversary,xxx,2018-06-01T00:00:00Z,2018-07-01T00:00:00Z,308506315341,Usage,2018-06-05T19:00:00Z,2018-06-05T20:00:00Z,AmazonVPC,USW2-DataTransfer-Regional-Bytes,VpcEndpoint,,arn:aws:ec2:us-west-2:xxx:vpc-endpoint/vpce-xxx,190.5361101758,,,USD,0.0100000000,1.9053611018,0.0100000000,1.9053611018,$0.010 per GB - regional data transfer - in/out/between EC2 AZs or using elastic IPs or ELB,,“Amazon Web Services, Inc.",Amazon Virtual Private Cloud,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US West (Oregon),AWS Region,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Data Transfer,,,,us-west-2,,,,,,AWSDataTransfer,AWS Data Transfer,xxx,,,,,,,,,,,US West (Oregon),AWS Region,,IntraRegion,,USW2-DataTransfer-Regional-Bytes,,,,,,,,,xxx,xxx,OnDemand,GB,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Depending on your scale and activity, you might see millions of records like these every day. These records span tens of resource types and product families, and tens of thousands of usage types. As new features are introduced, and as your adoption grows, the volume and detail, and hence the complexity, of these records also grow.
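A first step towards insight is often just grouping and ranking. The sketch below sums cost by product and usage type over a simplified extract. The three-column layout and the column names are hypothetical, chosen for readability; real reports carry far more columns and need a mapping step first. The cost values echo the two sample records above.

```python
import csv
import io
from collections import defaultdict

# A simplified, hypothetical extract. Real reports carry many more columns.
SAMPLE = """product,usage_type,cost
AmazonS3,USW2-Requests-Tier2,0.0000004
AmazonVPC,USW2-DataTransfer-Regional-Bytes,1.9053611018
AmazonS3,USW2-Requests-Tier2,0.0000004
"""


def cost_by(rows, *keys):
    """Sum the cost column, grouped by the given column names."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += float(row["cost"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_usage = cost_by(rows, "product", "usage_type")
# Rank usage types by spend to see where the money goes.
ranked = sorted(by_usage.items(), key=lambda kv: kv[1], reverse=True)
```

Even on this tiny sample, the ranking puts regional data transfer far ahead of S3 requests, which is the kind of signal that is easy to miss on a high-level dashboard.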

On one hand, having such a wealth of data shows the true power and potential of the on-demand pay-per-use model of the cloud. This data can help you understand the implications of your architecture choices, correlate workload patterns with costs, and make price-performance trade-offs. Insights from this data, when gained, can help bring cost awareness to the engineering culture.

On the other hand, most billing tools available today mainly focus on providing dashboards with high level metrics, but not many insights. For example, in one particular case, recommendations from Trusted Advisor showed significant potential savings in certain areas, while analysis of the raw billing data revealed much bigger opportunities elsewhere. The latter required deeper understanding of the billing data to spot inefficiencies.

Developing a deep and solid understanding of billing records is an engineering problem that consumes time and investment. It’s like understanding operating system level metrics: understanding them is not optional. You have to build tools to process the data, visualize it, and then derive insights. You also cannot centralize all this in one particular tech team or a finance team, as you need every team spending on the cloud to learn to gain its own insights. Such insights need to complement performance metrics to gain awareness of price for performance. This is why I believe that cost awareness must be part of the engineering culture, and it starts with developing an understanding of billing data.

Problem 2: Optimization Practices are Reactive

Unlike other drivers like dev agility, availability, and security, cost-related practices often tend to be reactionary. Cost concerns come to the front seat only when there is a sense of urgency to reduce cloud costs. Otherwise, cost concerns get left in the garage back home. Whenever there is a realization of cost increases beyond budgets, organizations scramble to conduct optimization exercises, and when the dust settles, go back to business as usual.

Part of this is due to Problem 1 above, which is not looking at the billing data, and/or not gaining enough insights from the data, and thus not being able to incorporate cost awareness into the engineering culture. The rest of it is due to the holy trinity of cost, speed, and quality.

In order to produce an outcome of a certain quality, at any given time, you can either move fast while spending more, or move slow and be efficient. You can’t maximize all three at the same time. The key question to ask therefore is, how much cost inefficiency are you willing to tolerate for a given amount of quality and speed?

Problem 3: Stigma and FUD

Though we hear about wastage on the cloud from reports like those I cited above, apart from a few “how we saved such and such by doing so and so” blog posts, we don’t hear much about building sustainable practices of spend management and governance, and, most importantly, real stories about failures.

There is a reason why. Most of us mentally equate having to optimize cloud spend with the business not being healthy. We compare it to other usual cost-cutting measures that most companies take at various points in their cycles, such as letting people go, avoiding business travel, reducing discretionary spending, shutting down offices, etc. Though more experienced managers and leaders see these as natural acts of cost governance, common perception remains otherwise. A shadow of stigma follows cost optimization.

However, most successful companies build cost governance into everything they do, whether it is hiring, business travel, discretionary spending, or technology related spending. Cloud costs are no different. Acknowledging that cloud spend is a variable that you can manage, that you must maintain the spend at a certain efficiency, and removing the stigma from cost optimization are essential to building a culture of cloud cost awareness.

Wherever there is a stigma, there is FUD. I’ve heard stories of cost optimization practitioners in the wild that make claims like “we will show you how to save $XXX, just give us $Y”. One friend once shared a story of a consulting team proposing to optimize for a percentage cut of the savings realized. Do you remember “termination assistance” from Up in the Air (pun intended)?

Such approaches might make sense in places where cloud adoption is not strategic, and is treated as a utility, like a third party maintaining your corporate media web site. However, these approaches don’t produce sustainable results for anyone running serious workloads on the cloud. While not rejecting the need to seek help, you have to equip yourself with tools, automation, and cultural changes. This is the philosophy behind DevOps: you make teams autonomous and hence accountable for development and operations for higher team performance. The same goes for cloud costs too.

Cloud Finance Governance and Ops

This brings me to the commonly dreaded term “governance”. Instead of treating cost optimization as a necessary evil, what we need is a practice of cloud finance governance and operations. Optimization is part and parcel of governance. Here is how I describe cloud finance governance.

Cloud finance governance is pushing for responsible spending practices, and introducing checks and balances. It is about learning to operate spend management levers to trade between speed, cost efficiency, and sometimes even quality.

Governance is not a bad word. Governance is not bureaucracy. Governance is not introducing roadblocks. When done right, governance is empowering, rewarding, and helps us exercise new muscles. Governance is what responsible families, cultures, societies and businesses must do in order to be adaptable and be resilient.

While prescribing a general purpose blueprint for how to practice cloud finance governance is tricky as each organization needs to determine what works best for them, below are some of the essential building blocks.

1. Attribute Costs

You can’t govern and optimize what you can’t measure. I have been in situations with charts showing large unallocated cloud spend on a big screen in front of me, struggling to explain where that money was going. You can’t optimize, let alone govern, if you don’t know who is spending what. Resource attribution to people and teams is fundamental to operating successfully on the cloud for cost as well as security reasons.

There are several techniques to consider to maintain a high percentage of attributed costs:

Metadata to map people and the organization structure to applications and resources
Resource tagging of all taggable resources — be aware that not all cloud resources are taggable
Using separate accounts for different types of workloads to reduce accumulation of shared expenses — particularly those due to untagged/untaggable resources, including network ingress/egress costs, NAT gateways, firewalls etc.
Modeling how to attribute cost of shared services — your organization may be running large shared services for logging, monitoring, caching, proxying, analytics, experimentation etc. Since these are used by many teams, you may end up in a situation of nobody being able to explain the cost of such services.
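
As a minimal sketch of what “percentage of attributed costs” could mean in practice, the snippet below sums cost per owning team from tagged line items and reports how much of the total lands in an “unattributed” bucket. The record layout, the `team` tag key, and the sample figures are all assumptions made for illustration; real billing exports look different.

```python
from collections import defaultdict

# Hypothetical billing line items; field names and values are illustrative only.
line_items = [
    {"resource": "i-app1", "cost": 120.0, "tags": {"team": "search"}},
    {"resource": "vol-1", "cost": 30.0, "tags": {}},
    {"resource": "nat-1", "cost": 50.0, "tags": {}},
    {"resource": "i-app2", "cost": 200.0, "tags": {"team": "checkout"}},
]


def attribute_costs(items, tag_key="team"):
    """Sum cost per owner; items without the tag go to 'unattributed'."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, "unattributed")
        totals[owner] += item["cost"]
    return dict(totals)


def attributed_percentage(totals):
    """Share of total spend that has a known owner, as a percentage."""
    total = sum(totals.values())
    if total == 0:
        return 100.0
    return 100.0 * (total - totals.get("unattributed", 0.0)) / total


totals = attribute_costs(line_items)
pct = attributed_percentage(totals)
print(totals)
print(f"attributed: {pct:.1f}%")
```

Tracking this one number over time is a simple way to see whether tagging and account separation are actually improving attribution.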

2. Gain Insights

The next step is to gain insights from the billing data. Insights aren’t easy and automatic. I approach this step by first observing cost and usage data across different dimensions (time, regions, accounts, dev/test/production environments, resource types, usage types, instance types, allocated vs unallocated costs, etc), asking questions, making hypotheses, and validating those hypotheses. This is an iterative process over time.

In order to do this, you need access to the raw billing data, stored and indexed in a form that allows fast and easy queries. At work, we built a data warehouse using Redshift and ElasticSearch for billing data. This system loads raw billing data as soon as it lands in S3, merges it with the metadata of people and applications, and loads it into an ElasticSearch cluster for queries and visualizations. This process helped us several times to improve our overall understanding of costs, efficiencies and inefficiencies, and areas of improvement.
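
To illustrate the kind of slicing described above, here is a small sketch that groups cost records by any combination of dimensions. It is plain Python over in-memory dictionaries, not the Redshift and ElasticSearch pipeline mentioned in the text; the field names and numbers are hypothetical.

```python
from collections import defaultdict


def cost_by(records, *dimensions):
    """Total cost per combination of dimensions, largest first."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record.get(dim, "unknown") for dim in dimensions)
        totals[key] += record["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records after merging billing data with application metadata.
records = [
    {"month": "m1", "env": "production", "region": "us-west-2", "cost": 900.0},
    {"month": "m1", "env": "test", "region": "us-west-2", "cost": 400.0},
    {"month": "m2", "env": "production", "region": "us-west-2", "cost": 950.0},
    {"month": "m2", "env": "test", "region": "us-east-1", "cost": 700.0},
]

by_env = cost_by(records, "env")
by_month_env = cost_by(records, "month", "env")

for key, cost in by_month_env:
    print(key, cost)
```

A jump such as test spend growing from 400 to 700 between two months is the sort of observation that prompts a question, a hypothesis, and another query.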

3. Automate Hygiene

While we like to automate everything, in reality, automation is never complete, and the degree of completeness varies by what you’re optimizing your automation for.

You may, for example, optimize for speed and availability, and decide to leave older deployments for a week or two to allow for rollbacks. You may optimize for performance for your analytics workloads and decide to keep all your offline data in the S3 standard access class, and run the compute on pricey instance types. You may have a bug in your automation that forgets to propagate tags from EC2 to EBS, thus increasing unattributed costs. Your teams may have forgotten to upgrade some legacy EC2 instances that may be pricier for the same performance. I’ve seen all such scenarios and more that lead to waste.

You can improve hygiene by crafting policies (such as “all unattached EBS volumes shall be deleted after 48 hours”), and then automating those policies.
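
A policy like the one quoted above can be expressed as a small, testable function. The sketch below assumes a hypothetical inventory in which each volume record carries an `attached` flag and a `detached_at` timestamp; tracking the detach time is itself an assumption, since your inventory system would need to record it. Actual deletion through the cloud provider’s API is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Policy threshold taken from the example policy in the text.
MAX_UNATTACHED_AGE = timedelta(hours=48)


def volumes_to_delete(volumes, now):
    """Return ids of volumes that have been unattached for longer than the policy allows."""
    expired = []
    for volume in volumes:
        detached_at = volume.get("detached_at")
        if volume["attached"] or detached_at is None:
            continue  # in use, or no detach time recorded: leave it alone
        if now - detached_at > MAX_UNATTACHED_AGE:
            expired.append(volume["id"])
    return expired


# Hypothetical inventory snapshot.
now = datetime(2018, 1, 10, 12, 0, tzinfo=timezone.utc)
volumes = [
    {"id": "vol-a", "attached": True, "detached_at": None},
    {"id": "vol-b", "attached": False, "detached_at": now - timedelta(hours=72)},
    {"id": "vol-c", "attached": False, "detached_at": now - timedelta(hours=6)},
]

candidates = volumes_to_delete(volumes, now)
print(candidates)
```

Keeping the decision separate from the deletion makes it easy to run the policy in a report-only mode before letting it delete anything.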

4. Build spend management levers

Remember that cloud spend is not a fixed sunk cost. It is a variable expense that you can manage. There are several levers possible:

Continually iterating the architecture to improve the price-performance ratio. Though it is commonly referred to as right-sizing, in reality, it is all about testing, looking at past data, and adjusting resource choices to improve price-performance ratio.
Shedding unnecessary/unwanted traffic
Scaling down passive regions (for applications using active-passive architectures)
Adjusting SLAs for your analytics jobs to take longer time to complete
Using cheaper tier options for S3 and EBS to trade performance for cost efficiency
Adjusting data retention policies
Using cheaper resources for test workloads

Furthermore, if you’re still running in the hybrid mode with some apps serving traffic both in your data centers and the cloud, another lever may be to shift traffic one way or the other to balance between variable cloud costs and fixed data center costs.

5. Forecast

Forecasting is another important aspect of developing a finance governance process, particularly for those moving workloads from the data center to the cloud, or those building new systems. During such phases, cloud spend tends to increase at a higher rate than in the steady state. Forecasting is less reliable during such phases as you may not have past data to build forecasting models. Regardless, you can correct for this by forecasting more frequently — ramp up some volume of traffic, build models for forecasting, forecast, then ramp up more. Also read Cloud and Finance — Lessons learned from a year ago on this topic.

6. Budget

Forecasting is what you expect to spend in future based on your cloud adoption plans, your team velocity, and the architectures your team is building. The budget tells you how much is being set aside for that area of spend. The difference should tell you how to tweak the plans, architecture, and levers you can exercise to meet the budgetary goal. Usually finance teams determine your budget.
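
The relationship between forecast and budget can be sketched with a few lines of arithmetic. The example below fits a straight line through past monthly spend using ordinary least squares, extends it forward, and compares the result with a budget. The linear model, the spend history, and the budget figure are assumptions chosen for illustration; real forecasts during a migration are rarely this tidy.

```python
def linear_forecast(history, periods_ahead):
    """Fit a least-squares line through past spend and extend it forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + i) for i in range(periods_ahead)]


# Hypothetical monthly spend and budget for the next two months.
monthly_spend = [100.0, 120.0, 140.0, 160.0]
budget = 350.0

forecast = linear_forecast(monthly_spend, 2)
gap = sum(forecast) - budget  # positive means the plan exceeds the budget

print(forecast, gap)
```

A positive gap is the signal to revisit plans, architecture, or the spend management levers described earlier.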

7. Operational Reviews

Lastly, incorporate all cost related metrics and insights into your periodic operational reviews. Most teams use such rituals to review overall KPIs of applications, and the status of projects the teams are working on. Add cost related topics to the same. This is a place to observe billing data, to ask questions to develop better understanding of the data, to identify ambiguous areas, and to keep on improving the governance muscle.

To reiterate, cloud gives you many levers to manage costs. Discovering and exercising those levers requires thinking about how to govern cloud costs, building the automation and processes to develop insights into costs, creating spend management levers, and knowing how to make cost vs speed vs quality tradeoffs.

DevOps and Governance

2018-02-11T20:33:13+00:00

Regulation and threat aware DevSecOps culture is where DevOps was 5–10 years ago. Auditors, controls, regulations and compliance are not the topics that most autonomous, empowered, cloud-native, continuous-delivery practicing, run-what-you-build teams know much about or want to welcome.

Despite the collective progress we made during the last 5–10 years to break the walls between software development, release management, and operations, there is still a wide chasm between contemporary DevOps culture, security, and compliance to controls.

Any enterprise that collects money, processes and stores customer and payment data, and partners with other enterprises is required to comply with certain controls. These controls set some expectations on what the architecture and DevOps practices must conform to. Depending on the areas of the business and geographies served, these controls may come from IT General Controls provided by the Institute of Internal Auditors, laws like the Sarbanes-Oxley Act (SOX) to govern accurate financial reporting, EU regulations like GDPR for data protection of all individuals, PCI DSS for information security of payment card data, etc.

More important, some of these laws and regulations require certain executive level roles to attest that the enterprise conforms to those controls, failure of which may have material consequences. Most of these controls are also subject to internal and external audits for compliance. Auditors look for demonstration of evidence at a statistically relevant level.

However, when internal or external auditors approach dev teams, they may get a frowning “we know the right thing to do, why are we talking about this now” response. As far as dev teams are concerned, security teams, with their big-brother approaches, are blockers for getting things done quickly; and audits and compliance are drills that they have to grudgingly deal with periodically.

Folks who are responsible for security and controls consequently end up resorting to strong-arm techniques to get conformance. The result is cultural divides between each of these groups representing different interests. Controls frustrate dev/ops teams, and exasperate controllers and auditors. The outcomes are friction and potential ineffectiveness of controls.

There are several reasons for this continued divide.

Most of us on the tech side have a cursory understanding of regulations, compliance, and the audit procedures. Similarly, most regulators and auditors don’t always keep up with the changing technology and engineering practices.
Each of these groups operates with different sets of skills and experiences. Lack of familiarity between these teams leads to differences in expectations or mistrust.
Due to the general purpose and yet binding nature, regulatory language often looks ambiguous to tech teams, leading to subjective interpretations. For instance, PCI requires “file integrity monitoring or change detection systems check for changes to critical files, and notify when such changes are noted”. The interpretation depends on whether an application is stateful or stateless, how code and configuration are packaged and distributed, and how the data is managed. This ambiguity leads to tech teams not fully understanding and realizing what they need to do. Rules may sometimes appear arbitrary.
Cloud further frustrates controllers and auditors. Though there are a number of building blocks on the cloud to build secure and compliant applications, those are not sufficient to ensure governance and show proof of compliance. You have to string together processes and automation for governance. The loss of access keys alone has exposed many organizations to data theft in recent years. While periodic rotation of access keys is a necessary practice to reduce the risk, it is up to each enterprise to invent processes and automation to implement such practices. There is a lot more that cloud providers can and must do than providing lower level building blocks and the System and Organization Control (SOC) reports. There is a vacuum of higher level cloud services that reduce the cost of governance.
Though most tech teams understand that security is of paramount importance, they may lack sufficient understanding of rules and regulations that govern different types of environments (like production, PCI vs non-PCI, PII vs non-PII, analytics vs non-analytics etc.) and the rules that security teams impose on interactions between these environments. Again, such lack of familiarity leads to mistrust and perimeter friction.
Traditionally, security teams operate under a shroud of secrecy with a “we will tell you what to do, but not why” culture. This is partly driven by the desire to keep ongoing threats confidential. Lack of such awareness further fuels mistrust and friction.
Contrary to the blame-free DevOps culture tech teams tend to practice, certain regulations require certain roles to take the blame. In order to ensure that audits are impartial, audit teams enjoy a certain amount of autonomy with clear separation of duties. Separation of duties may end up strengthening the wall between tech teams and auditors.
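
To make the file integrity monitoring example from the list above a little more concrete, here is one possible reading of “change detection” for files on disk: record a SHA-256 fingerprint per file, then compare a later snapshot against the baseline. This is only a sketch of one interpretation, not a statement of what PCI requires; the function names are hypothetical, and a stateless, immutable deployment might satisfy the same control very differently.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Map each path to its current fingerprint."""
    return {str(p): fingerprint(p) for p in paths}


def detect_changes(baseline, current):
    """Compare two snapshots and report changed, removed, and added files."""
    changed = [p for p in baseline if p in current and baseline[p] != current[p]]
    removed = [p for p in baseline if p not in current]
    added = [p for p in current if p not in baseline]
    return {
        "changed": sorted(changed),
        "removed": sorted(removed),
        "added": sorted(added),
    }
```

In use, you would take a baseline snapshot at deploy time, take another on a schedule, and notify someone whenever `detect_changes` returns a non-empty list.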

Unfortunately, solutions are non-trivial and may take time. While technology in the form of automation and tools for compliance and verification is necessary, these cultural divides won’t disintegrate until and unless controls and security become part of an organization’s DevOps culture.

This won’t happen without short and frequent feedback loops between people and processes dealing with tech, security and compliance. Teaching and joint-learning should accompany these feedback loops to align mental models and vocabulary of each of these teams.

Serverless: Looking Back to See Forward

2017-11-12T22:27:41+00:00

Last week, I attended an all-day CIO forum on cloud in Seattle, organized by one of Seattle’s top venture fund groups. Several notable speakers and panelists spoke about their views on containers and Kubernetes, public and hybrid clouds, provider lock-in and portability amongst providers, and of course, serverless computing.

I observed two patterns that day. First, there is a genuine concern about cloud provider lock-in. Second, there is a desire and hope to avoid or at least minimize such lock-in by embracing open source abstractions and platforms, most notably Docker and Kubernetes. These patterns are best illustrated by the answer I got for my question to one of the key guests of the day.

My question was whether this individual would spend six months to build and launch a new service that works and takes advantage of everything one particular public cloud offers, or spend two years to make sure that the same would run on multiple public clouds. The answer I got was that, though he might start with the former for agility, he would invest in the latter for the long-term. The implication I sensed in this answer was that the latter is the right thing to do. I didn’t press on to ask if he had an opportunity to test this hypothesis in the real world.

While I’ve not shied away from stating my opinions on multi-cloud and cloud lock-in, this event made me acknowledge the necessity to take a few steps back from these views, and notice some changes that have been slowly happening over the last five plus years in the industry.

In this post, I would like to narrate three slow changing themes in the industry, and postulate where we might land in five plus years from now. My hypothesis for the future is that serverless services will take the center stage for most common application development, and consumption of open source for infrastructure automation and services will continue to drift from being strategic to opportunistic. Looking back to see forward might help us shape and take advantage of that future and not fight it to be left behind.

These themes are not independent, and are reinforcing and accelerating one another.

Theme 1: Inversion of Control

AWS Lambda is the first successful application of inversion of control for the data center. Inversion of control is a design principle in which applications receive flow of control from a generic framework. By implementing certain common and generic tasks, the framework relieves applications from having to implement those generic tasks at the expense of losing explicit flow of control.
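
The principle is easy to see in miniature. In the sketch below, a toy framework owns the control flow: application code only registers a handler, and the framework decides when to call it and what to do when it fails. `TinyFramework` and its methods are hypothetical names invented for this illustration and are not AWS Lambda’s API.

```python
class TinyFramework:
    """A toy framework that owns the control flow and calls into application code."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type):
        """Decorator that registers a handler for an event type."""
        def register(fn):
            self._handlers[event_type] = fn
            return fn
        return register

    def dispatch(self, event):
        """Generic tasks (lookup, error handling) live here, not in the application."""
        handler = self._handlers.get(event["type"])
        if handler is None:
            return {"status": "ignored"}
        try:
            return {"status": "ok", "result": handler(event)}
        except Exception as exc:
            return {"status": "error", "error": str(exc)}


framework = TinyFramework()


@framework.on("resize")
def resize_image(event):
    # Application code: no main loop, no retry logic, no resource allocation.
    return f"resized {event['key']}"


result = framework.dispatch({"type": "resize", "key": "photo.jpg"})
print(result)
```

The application gives up deciding when it runs; in exchange it no longer implements the generic plumbing around it.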

Inversion of control is not a new concept. Wikipedia’s entry points to a 1998 reference. However, application of this concept at commodity scale to data center resources is relatively new. This is because of the complex nature of the common qualities that most applications need in the data center. These qualities include the following:

Just-in-time allocation and de-allocation of resources like compute (baremetal or virtual machines), network segments (for network isolation), network filters (firewalls), load balancers, and storage (file or block)
Elasticity of such resources to allocate nothing to as many as needed, with the needed quality of service
Keeping applications robust and available in the face of common infrastructure failures

As essential as these qualities are, achieving these still takes Herculean efforts in most data centers around the world today. Just try to get a new 1,000 node Hadoop cluster in an enterprise data center. You would be lucky if you get a working cluster in six months. Though compute virtualization has helped to an extent, resources in data centers are still horridly non-malleable. Through a series of evolutionary changes, we are now getting ready to abstract this complexity into frameworks operated as services, and the early benefits from services like AWS Lambda are clear.

See below for the inversion of control moving custom purpose built automation into app frameworks operated as services.

It is unfortunate that we call this inversion of control “serverless”. Framework as a service is a less confusing choice to describe the new abstractions. Regardless, every development so far is helping us build better frameworks, and the trend shall continue.

Theme 2: Disaggregation of State

Most enterprises have built or used open source “platform as a service” tools that make it easy and quick to create, build, deploy and run applications. Such platforms succeed(ed) when dealing with stateless code like web apps or stateless micro services, but fail(ed) to tackle state. Examples of stateful applications include SQL or NoSQL databases, search clusters, key-value stores etc.

State introduces extra complexity into automation as it requires you to think of distribution aspects like cluster awareness, synchronization, consistency, sharding, replication, and partition tolerance; and quality of service aspects like fast startup, IO latency, IOPS etc.

Just think of how much code you could delete from Kubernetes if all it did was to run stateless microservices. Building generic automation abstractions for stateful systems is hard and time consuming, and most of all, it takes experience to get the abstractions right. Even replacing a failed node from a stateful system takes special considerations and most enterprises don’t dare automate such tasks for fear of losing critical data.

Moving state to external managed services like S3, Dynamo, BigTable, Spanner is lowering the bar for application automation, which is thus eliminating some of the complexity from frameworks operated as services for the provider. For the consumer, this trend is also eliminating operational tasks that system and database administrators usually perform.

Theme 3: Power of the Ecosystem

The most important change that is accelerating the adoption of serverless frameworks operated as services is the strengthening ecosystem of services in each public cloud today. Without the ecosystem, AWS Lambda would not have gained as much adoption as it did in such a short time.

Just to give an example, at Expedia, we ran over 6.2 billion invocations of AWS Lambda in October this year. Though this number is small when compared to other types of external or internal traffic we serve, Lambda’s rapid adoption would not have been possible without the surrounding service ecosystem with IAM, SNS, SQS, API Gateway, KMS, S3, Dynamo, Kinesis etc. Most of these Lambda functions are about 100–150 lines long and are written in a day or two. Of course, we also built a deployment platform that makes creating and running Lambdas a breeze.

Without such a mature ecosystem, AWS Lambda would just be a curious cloud-enabled cgi-bin service.

To Fight or Not to Fight

These three themes are reinforcing and accelerating one another.

Frameworks as services like Lambda wouldn’t be possible without disaggregation of state and a strong ecosystem of services.
Similarly, adoption of S3 as an infinitely elastic data lake is fueled by the ease with which you can create and terminate data processing compute clusters like Hadoop map-reduce and Spark in a serverless manner without worrying about HDFS for state durability.
Complexity shift from apps to the service ecosystem is clearing the way for faster development and adoption of frameworks as services.

Resisting any of these themes has a cost. For example, resisting AWS Lambda for fear of lock-in reduces agility, raises infrastructure cost, and increases complexity of automation. Not moving data from an enterprise storage system to a cloud storage service like S3 for fear of losing control on data increases cost of data storage, makes storage less secure, increases automation complexity, and increases friction for systems that need to deal with state. Not adopting cloud provider managed services for fear of lock-in reduces agility, and raises the cost, particularly the cost of lost opportunity.

What’s in store for the future then?

Cloud services will continue to shift the complexity away from applications further fueling adoption of frameworks as services on public clouds. Docker and Kubernetes will become less relevant in the developer land than they are today as proprietary frameworks take the center stage. This is not a threat to open source. Open source shall maintain its place in many other forms.

Despite fears, natural laws of economy will shift the gravity towards frameworks operated as services. Since these are services and not code, these will remain proprietary. AWS Lambda may have been the first of its kind but won’t be the last. There is a lot of code in the industry that needs to move to app frameworks operated as services, and we will likely see more frameworks in future.

Even when multiple cloud providers offer the same open source abstraction, portability will be unlikely due to proprietary warts and the proprietary service ecosystem.

In summary, here is my suggestion. Participate in this journey, learn, and don’t be left behind.

Technology Decision Making and Architecture Reviews

2017-09-29T14:13:38+00:00

In this note, I would like to implore you to not use architecture reviews as a means to improve quality of technology decisions. Instead, I would ask you to rely on acts of asking for, and giving feedback in open forums that favor autonomy, constructive feedback and dialog over correctness and objectivity of decisions.

Technology decision making processes like architecture review boards, architecture working groups, virtual architecture teams (VATs), or “A” teams (with the “A” standing for either “architecture” or the “best of the breed”) that rely on “review” of decisions tend to be slow, impede team autonomy and produce top-heavy decisions.

On the other hand, processes or rituals that use dialog and feedback as a means of offering improvements put the autonomy back in the hands of those seeking feedback, and empower them to determine how best to incorporate the feedback into actions. Without the autonomy to make decisions, whether optimal or not, no team can learn and own outcomes.

When you take the feedback centric approach, those giving feedback would be forced to formulate constructive and actionable feedback instead of arguing against the decision or explaining why the proposed decision is wrong or inferior. Since there is no review, there are no approvals to make, and no up or down votes to give. Opinions on why any particular decision is good or bad become immaterial, unless those are followed by constructive and actionable feedback showing better alternatives. Any bar-raising push for improvements happens via constructive feedback and not outright disapproval.

The onus for incorporating the feedback falls back onto the one seeking the feedback, thus maintaining autonomy. The feedback is non-binding. The feedback seeker is free to disregard the feedback or interpret it in ways that he/she sees fit.

Isn’t this approach likely to produce inferior outcomes? Before answering this question, let us look at the rationale behind commonly practiced architecture reviews.

Meritocracy

Most architecture review initiatives start with a desire to improve quality of decisions. Teams wanting to make decisions come prepared to present their designs and analyses in the form of a proposal. A nominated set of individuals discuss, review, offer feedback and/or critique. These individuals are usually considered the best in their roles.

Upon presenting the proposal, a decision is either reached or not reached by a process of approval or voting. If a decision is made, the presenter(s) will get to implement the decision. When a decision could not be reached, the presenters may need to come back with an improved proposal at a later time.

This approach assumes that the individuals reviewing the proposals know it all, and have earned the feathers to offer critique on any topic. Instead of helping the proponent improve by way of feedback, this style puts the emphasis on a review and approval of a proposal under the guise that those reviewing know the answers. This is very rarely the case. Even when it is, it disenfranchises those implementing the decisions.

Regardless, teaching someone how to make better decisions by way of constructive feedback is far more valuable than making better decisions for them. When there is no teaching, there is no learning. When there is no learning, there is no improvement.

Silos and The Principle of Least Effort

What also prompts such architecture review boards or forums is increased size from a small team to a larger organization of several teams. When the size is small, decisions are easily understood by everyone and feedback flows through quickly. But silos form as the size increases. Individuals outside the silo find it difficult to understand what decisions are being taken, and the rationale behind those decisions. Nonetheless, decisions still move swiftly within each silo due to the autonomy.

However, autonomy without external feedback often leads to local optima, ignoring alternatives that can help produce a global optimum. In the absence of feedback, the principle of least effort takes over, and the team may gravitate towards known and comfortable decisions, avoiding uncomfortable alternatives.

Architecture Feedback

I would argue that it is okay for the feedback process to produce inferior outcomes in the beginning, as long as there is a mechanism for the feedback to flow continually. The feedback, coupled with the autonomy to own the outcomes by the proponent, will eventually autocorrect decisions.

If you are ready to rechristen your architecture review ritual into an architecture feedback ritual, here are some suggestions for the feedback seekers and feedback givers.

Feedback Seekers

Approach as though you don’t have all the answers, let alone the best answer.
Remember that the purpose of feedback is to learn, and that feedback is not intended to disenfranchise you of your autonomy.
Don’t defend your solution. Consider that there are many ways to solve the same problem.
When you don’t like the feedback or don’t agree with it, explain your thinking to open a dialog, not to defend your position.
Ask what you may be missing.
Take notes and ask clarifying questions about the feedback.
Avoid defensive phrases like “we decided” or “we want to”. Instead, start with “Here is what I/we thought … What do you think?”.
Don’t outright reject the feedback, even if you’ve already thought about it. You might instead say, “Thanks for the suggestion. Let me reconsider”, or “Would you mind walking me through this further? I couldn’t come to the same conclusion.”

Feedback Givers

Approach as though you don’t have all the answers, let alone the best answer.
Practice how to give constructive feedback about the things you disagree with.
Work on shaping your anxiety to push for what you consider as better choices into constructive feedback.
Ask clarifying questions to understand the context behind proposed decisions.
Instead of “This won’t scale/work/whatever”, try alternatives like “I’ve observed that this approach might not scale/work/whatever. Here are the reasons why. … I will be happy to walk you through such and such alternative.”
Set and raise the bar while providing additional context.
Don’t challenge or overtake the presenters’ autonomy to incorporate the feedback in the best way possible, including completely ignoring your feedback.
It’s okay to not have an opinion. Don’t be compelled to voice one.

Though this approach might sound radical, I submit that the only way to raise the bar on technology decision making while preserving team autonomy is through feedback and dialog.

Accept | Tentative ✓ | Decline

2017-09-18T02:39:34+00:00

I’m done with the excuse of not being in charge of my time. Over the last two months, I started taking a few steps to deliberately simplify my calendar. I now manage extended blocks of work time on most working days. With the exception of a few, most of my meetings are recurring meetings. I skip meetings where I’m not required to help make decisions. I try to combine ad hoc meetings into other scheduled meetings.

The early results are positive. I’m more productive, less stressed, calmer and more focused than before. I’m still looking for patterns and refining my techniques, and I have a ways to go.

I know I’m not alone when I admit that great chunks of our work time are spent in meetings. There were weeks where I spent 70–90% of my work time in meetings. Even during those weeks when this percentage was low, my free time consisted of 30 or 60 minute slivers spread throughout the work day. These fragments were clearly not sufficient for any deep work. I tried to catch up and get focused work done in the evenings and weekends. It worked, but only for a while.

Few Lessons

Like most others, I too took meeting overload as a consequence of increased collaboration, scope of work, and responsibilities. I thought that I just needed to master the art of multi-tasking and being effective despite frequent interruptions and context switching. A few resources helped me realize that this notion is just a fallacy.

First, thanks to Susan Cain’s Quiet, I learned that excessive meetings and frequent context switching over-stimulate my mind, and that each of us “need very different levels of stimulations to function at (our) best”. On any given day, the more think time I take, the less stressed I am. The more time I spend in meetings and context switching, the more tired I become at the end of the day.

To further quote from Quiet,

What looks like multitasking is really switching back and forth between multiple tasks, which reduces productivity and increases mistakes by up to 50 percent.

Say no more to multi-tasking.

Second, thanks to an external speaker series at Expedia, I had a chance to listen first hand to Eduardo Briceño talk about growth mindset, and “learning and performance zones”. The gist of his talk was that, as we spend years in our careers, we get stuck in the performing zone, and spend little time, if any, in the learning zone. In the performing zone, we repeat and practice what we already know. We continue to demonstrate the same set of capabilities intent on minimizing mistakes and being effective.

Learning zone, on the other hand, is where we “engage in activities designed for improvement, fully concentrated on what (we) haven’t mastered yet, and expecting to make mistakes from which (we) can learn”.

To quote from Eduardo’s TED talk How to get better at the things you care about,

The reason many of us don’t improve much despite our hard work is that we tend to spend almost all of our time in the performance zone. This hinders our growth, and ironically, over the long term, also our performance.

However, time for the learning zone does not create itself. I needed to explicitly carve out time to be in the learning zone. The simplest way to carve out time for learning is to limit the number of meetings where I’m expected to be just in the performing zone.

The third source that is helping calibrate my time in meetings is Cal Newport’s Deep Work. Cal describes Deep Work and Shallow Work as follows.

Deep Work: Professional activities performed in a state of distraction-free concentration that push your cognitive capabilities to their limit. These efforts create new value, improve your skill, and are hard to replicate.

Shallow Work: Non-cognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend to not create much new value in the world and are easy to replicate.

Though being in the performing zone does not necessarily mean shallow work, deep work does require long stretches of distraction-free work time.

So, what’s the most important work I do at the end of every work week? It is curating next week’s calendar.

Paying it Forward

2017-07-31T22:57:18+00:00

I recently had an opportunity to volunteer for an Indonesian company called Vasham through the RippleWorks Foundation. Over a period of four months, I worked with Vasham’s IT and leadership teams to observe, review, and help define a roadmap, and to teach some habits for practicing execution. My role was more that of a coach than that of a problem solver. This four-month experience allowed me to peek out of the West Coast tech bubble and work with a passionate group of individuals committed to a cause in Indonesia.

RippleWorks pairs up volunteer experts with social ventures like Vasham throughout the world. Vasham is a four-year-old Indonesian venture that wants to bring small Indonesian farmers above the poverty line by providing financing, expertise, and income security throughout the farming cycle. There are at least 18 million farmers living below the global poverty line in Indonesia. Most of these farmers own small pieces of land, often about 2 acres. The farming cycle includes procuring farm inputs, farming, harvesting, and selling the yield. Vasham plays a role in most of these activities through field operations, IT, financing, and procurement.

Making People Work Together

Vasham originally requested me to review their IT roadmap, improve efficiency, and help create a short-term (about a year) and a long-term (three years) IT roadmap. Vasham’s leadership walked me through their vision for the future, expected growth, and their technology needs. Given that the scope is an IT roadmap, I started my work with a technology lens.

Over the last four years, Vasham has been iterating on a few business processes to support the farming cycle, and has built some IT systems to support those processes. It therefore made good sense for me in the beginning to shine a spotlight on integration of their systems, automation of manual processes, and modernization of their technology stack.

However, this approach didn’t take me far enough. The more time I spent understanding the team, their organization, efficiencies and inefficiencies, and their operating constraints, the more I realized that lack of a technology architecture and a roadmap is not the problem. What Vasham needed was an ability to break down ambiguity, learn to be data driven, improve partnership between IT and operations, practice incremental execution, and deliberately turn unplanned work into planned work. These are common challenges for any growing organization.

Once this became clear, we switched gears to the following:

How to measure efficiency of their business processes through certain KPIs, and identify poorly performing areas
How to do agile planning to define a few work streams, and show how to prioritize
How to practice daily stand-ups, biweekly planning and retrospective sessions
How to let the leadership team facilitate feedback loops between their IT, operations, and finance teams through regular operations reviews

My four-month volunteering experience with Vasham reinforced a few points for me.

First, irrespective of how an organization is structured, it is important to establish, nurture and maintain feedback loops between teams to make the organization learn and function efficiently.
Second, roadmaps are relatively easy. The hard part is forming sustainable habits that get you there. These habits should let the teams measure, summarize, share, take feedback, and improve.
Third, the belief that tech talent is not acquirable is pervasive. I had to challenge Vasham’s IT team to believe that they too can be good at Python (one of their internal systems was built in Python) by just making continuous learning a part of their jobs.

Pay it Forward

There are many ventures like Vasham that are trying to solve problems off the beaten path. The problems are ambiguous. Constraints are severe. Sponsorship is hard to get. These problems need a mixture of technology and human operations to create an impact. Attracting skilled and experienced individuals to work for such organizations is a constant struggle.

Here is my appeal. Seek the opportunity to volunteer for organizations like the RippleWorks Foundation, and pay it forward. I know I’m looking forward to my next one. It’s the least we can do. Bersama kita bisa. Together we can.

DevOps, Postmortems and Cloud Spend

2017-07-27T15:45:34+00:00

As I wrote previously here, here, and most recently here, I’m a strong advocate and practitioner of “cost awareness as part of DevOps culture”. Cost-related activities like choosing the architecture with cost in mind, forecasting for scale, and optimizing to reduce waste are among the activities that DevOps teams need to conduct in order for autonomous teams to succeed.

In this culture, spikes or unexpected patterns in cloud spend are like production incidents. What do you do when you’ve a production incident? Once you restore the system, you conduct a postmortem, ask the whys, make observations, note lessons learned, and take corrective actions for the future.

That’s exactly what we did yesterday as we were analyzing numbers for a prior month. We observed that the spend was not in line with what we expected. We conducted a postmortem asking the whys. Here is a simplified version.

Issue: We spent more than we expected by a certain amount.

Metrics observed:

Ratio of reserved (EC2 and non-EC2 combined) to on-demand instances: Fell from 71% to 64%

Utilization of reserved instances: Remained at 96%

Volume of compute: Increased as expected with forecast

Price per unit of compute (an aggregate metric to spot trends): Increased from $x to $y

Compute vs network costs: Marginally increased in line with forecast

Whys:

1. Why did the ratio of reserved to on-demand instances fall?

Because we ran out of reserved instances for certain instance types, and paid on-demand price.

2. Why did the reserved instance utilization remain the same?

Probably because some teams switched instance types.

3. What are the most expensive instance types in the month?

r4.2xlarge, …

4. What is the reservation coverage and utilization of the most expensive instance type?

39% coverage, and 100% utilization.

5. Which teams use that instance type, when did they switch, and what were they using before?

Team X switched from another instance type with 74% coverage with 100% utilization during the middle of the month.

… …

6. Why did we not observe this sooner to take corrective action?

Due to billing delays, our weekly review cycle did not spot the increase. Since the purchasing cycle is also time consuming, we could not have the corrective actions in time to influence the current month.

7. Why did those teams change from instance types that have more coverage to instance types that have less coverage?

They didn’t know. One of the teams thought that they were saving money by switching instance types while getting better CPU to memory ratio that they needed. They didn’t have access to the reservation pool, and even those with access to data could not tell if/how their usage has an impact on the overall reservation pool.
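The two reservation metrics used in this postmortem, coverage and utilization, are easy to confuse. Here is a minimal sketch of how they might be computed for one instance type. The `Usage` record, its field names, and the hour figures are made up for illustration; they are not our actual numbers, and this is not any AWS billing API.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical monthly numbers for one instance type, in instance hours."""
    instance_type: str
    used_hours: float      # hours actually consumed by teams
    reserved_hours: float  # hours purchased as reservations


def coverage(u: Usage) -> float:
    """Share of consumed hours that were paid for at the reserved rate."""
    if u.used_hours == 0:
        return 1.0  # nothing consumed, so nothing was billed on-demand
    return min(u.used_hours, u.reserved_hours) / u.used_hours


def utilization(u: Usage) -> float:
    """Share of purchased reserved hours that were actually consumed."""
    if u.reserved_hours == 0:
        return 1.0  # nothing purchased, so nothing was wasted (a convention)
    return min(u.used_hours, u.reserved_hours) / u.reserved_hours


# Illustrative values only: demand outgrew the reservations for this type.
hot = Usage("r4.2xlarge", used_hours=10_000, reserved_hours=3_900)
print(f"{hot.instance_type}: coverage={coverage(hot):.0%}, "
      f"utilization={utilization(hot):.0%}")
```

With these made-up hours, every reserved hour is consumed, yet most of the consumption is still billed on-demand. That is why high utilization alone can hide a drop in coverage.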

We still had more questions and some hypotheses to validate, but we also found some smoking guns that could explain the change. What did we learn from this?

On any cloud, as a team using cloud services, you are still responsible for designing for, forecasting, and optimizing cost.
The shorter the feedback loop between spend and the teams creating and running software, the better. Our current feedback loops are not short enough.
Savings through reserved instance purchase is a complex game, and you can only play it for a while. There are not enough tools in the world to play this game efficiently and forever.
Regardless of reserved instances, the portfolio of services on public clouds is still evolving. The pricing models have room to evolve.
As we see in phone and cable bills, price complexity helps the providers, and not the consumer. Whenever you look at these bills, you would always ask, am I paying more than I should, and am I subscribing to services that I don’t need. Definitely not a good feeling.
Stringent measures like governance committees and capacity approval boards are not an option. They spoil the culture.
Simplicity, where are you?

DevOps is not DevOps

2017-06-21T14:05:05+00:00

I learned DevOps the hard way about five years ago when I inherited some infrastructure and software that had zero automation and monitoring. Since then, I’ve run into many teams that called themselves a “DevOps team”. My most recent encounter was with a team that wanted me to review their “DevOps” org structure and portfolio. By the end of that conversation, that team thankfully changed their name to something more appropriate. They were just building a set of automation tools for other teams to use.

During this time, I’ve also run into some really good engineering teams that simply didn’t want to have anything to do with DevOps. Their rationale was that operations work is menial, and they didn’t want to waste their precious time deploying, upgrading, monitoring and performing other upkeep tasks. Let someone else deal with those was their answer. They just wanted to keep coding.

I’ve also come across another kind of engineering team that originally embraced the DevOps mindset, built some automation, practiced some level of CI and CD tasks, but failed to keep the automation up to date with their software architecture. I recently met someone from one such team, who was hoping that another team would take over some of their core components and their operational responsibilities. His team was not able to keep up with those responsibilities. The other team in his case was a centralized team that manages an odd assortment of tools and systems that nobody else wants to manage.

Another team that started with a similar DevOps mindset ended up hiring a “DevOps” manager and creating a “DevOps team” to take care of operations as they could not improve their automation and operational processes to meet the increased scale.

I feel sorry for such teams. They were all missing an important element called feedback, and potentially paying a price. However, lecturing such teams that they must own and take care of their software themselves may not be effective.

Think Feedback

A better approach may be to probe about how they intend to maintain feedback between all aspects of software lifecycle such as architecting, coding, shipping, running, and operating their software so that they can be agile. Ask about how long any existing feedback processes take.

Here is why. DevOps is nothing but a culture of maintaining short and healthy feedback loops between all lifecycle activities related to creating and running software and systems.

The shorter the feedback loop between these activities is, the more agile the team can be. Agile teams have a better chance to learn quickly from mistakes. Their architectures can evolve fast. They are less likely to pick architectures or processes that do not support reasonably quick feedback.

When feedback loops through these activities are long, mistakes take longer to observe. Corrective actions take even longer to yield results. More important, hypotheses remain untested for months. Ground can shift in the interim.

Below is an example that shows potential interdependencies between architecture and code, automated delivery, monitoring and SLA management, chaos experiments, and cost forecasting and optimization.

Each of these functions impacts one or more of the others. For instance, analysis of the architecture tells how to test for failures, while a chaos experiment may reveal some inherent weaknesses in the architecture or the absence of some resiliency patterns in the code. Similarly, the architecture chosen may be prohibitively expensive to run when fully adopted, and thus may require changes to the architecture itself.

You therefore want the arrows between various functions to take as little time as possible, so that you can apply lessons learned and iterate. A team that practices all these feedback loops is more likely to succeed than a team that does not. Org structures and team names don’t matter as long as there are healthy and short feedback loops between various functions.

How do We Manage Cloud Spend at Expedia

2017-05-20T16:18:03+00:00

Since writing State of AWS Compute Pricing, several people asked me how we manage the cloud spend at Expedia. My colleague Abiade Adedoyin and I just published a detailed article on our tech blog, titled “Cloud and Finance — Lessons learned”. Not so surprisingly, the central theme of our cloud cost management is to make cost awareness as a part of our DevOps culture. Read the article to find more.

There is No Pendulum

2017-04-13T23:29:27+00:00

As public cloud spend goes up, will the pendulum swing back to enterprise data centers? Sorry, but there is no pendulum to swing back.

A pendulum exists only when you treat a public cloud as a compute center and not as a platform.

The top three clouds offer 60–100 services each today. These are well integrated to help get a lot of work done. Each of these has an ecosystem of services built around it, and is growing strong as a platform. The fact that each of these platforms runs in compute centers is just a necessary detail. Once you start building businesses on top of these platforms and are generating value, there is no going back.

Here is an analogy. Smart phones are certainly more expensive than feature phones. But feature phones are not coming back. Today’s smart phones are built around platforms.

How to Think About Multi-Cloud

2017-03-12T15:32:45+00:00

Public clouds are no longer equivalent. These are platforms with some overlapping basic capabilities and yet different in many respects. Here are a few factors I would consider when thinking of multiple clouds.

Multi-cloud is not a resiliency play. It’s not necessarily going to shield you from any particular provider having a bad day. Any tech company undergoing a rapid rate of change is going to have bad days. The recent S3 outage did prompt some questions in online and offline conversations, but multi-cloud is not the answer to surviving such incidents. Design for fault domains and fault isolation, and practice chaos engineering instead.
Don’t worry too much about lock-in at this time. The differences between public clouds are significant. The value of any public cloud is in the platform aspects and not the basic primitives. Abstractions to prevent lock-in introduce operational complexity and limit you to common denominator primitives like virtual machines.
A better strategy may be to use different providers for different types of workloads based on suitability of the provider’s platform capabilities for that workload. I would also keep any cross-provider data exchange asynchronous.
Above all, maintain fault domain boundaries. When you scatter your critical workloads across multiple providers’ regions, you might end up with large fault domains across those regions. This can make reasoning about resiliency hard and lead to longer times to recover. For instance, I would not put the front-end in one provider’s region, and the back-end in another provider’s region except when availability of that application is not critical.

The same considerations apply for hybrid-cloud too.

State of AWS Compute Pricing

2017-02-25T23:17:09+00:00

As flexible as it is, compute in AWS is optimized for the old capex world. It is a world where demands are predictable, consumption can be planned upfront, and things don’t change often. A culture of centralization for forecasting and budgeting, and approvals to determine who can use what and for how long, helps remain optimal in this world. In other words, you have to think and act like an efficient enterprise data center operator, and not as a cloud user to get the most bang for the buck. On the other hand, if you have bought into elasticity, on-demand consumption patterns, seasonal highs and lows, a culture of choice, and incremental test-and-learn development, you have to brace yourself for some complexity.

Instance Reservation is an Anti-Pattern

I’m not the first one to say this. Reserving compute instance hours for a year or three years with or without capacity guarantees for a given instance family and type is just like buying servers in data centers. It is a cloud anti-pattern, as it requires you to plan your capacity needs upfront. You need to think in terms of capex and not opex to use instance reservations efficiently.

Below is an example of what happens when you try to use instance reservations with an opex mindset (aka the cloud mindset).

Each bar in the above diagram represents number of instance hours for a given instance family/type in a given region. The number of bars grows as you consume more instance types in various regions.

Imagine you purchase reserved instances for certain types on a given day. A month later you look at the invoice and notice two patterns.

For some instance types in certain regions, you’ve run out of reservations and are paying on-demand prices for additional instance hours. These are the bars above the line.
For other instance types in certain regions, you’ve not used all the reserved hours for that month. These are under-utilized reservations shown with bars below the line.

Both types of bars represent inefficiency. Above the line, you’re paying on-demand prices. Below the line, you’re leaving money on the table.
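To make the two kinds of bars concrete, here is a rough sketch that computes them from monthly instance hours. The regions, instance types, and hour counts are hypothetical, and the function name is mine; real numbers would come from your billing data.

```python
def reservation_gaps(used, reserved):
    """Return {(region, instance_type): (on_demand_hours, unused_reserved_hours)}.

    on_demand_hours is the bar above the line (usage beyond reservations).
    unused_reserved_hours is the bar below the line (reservations left idle).
    """
    gaps = {}
    for key in sorted(set(used) | set(reserved)):
        u = used.get(key, 0)
        r = reserved.get(key, 0)
        gaps[key] = (max(u - r, 0), max(r - u, 0))
    return gaps


# Hypothetical instance hours for one month.
used = {
    ("us-west-2", "m4.large"): 1200,
    ("us-west-2", "r4.2xlarge"): 900,
    ("us-east-1", "c4.xlarge"): 300,
}
reserved = {
    ("us-west-2", "m4.large"): 1000,
    ("us-west-2", "r4.2xlarge"): 900,
    ("us-east-1", "c4.xlarge"): 744,
}

gaps = reservation_gaps(used, reserved)
for (region, itype), (above, below) in gaps.items():
    print(f"{region} {itype}: on-demand={above}h unused-reserved={below}h")
```

Only the type whose usage exactly matches its reservation has no bar at all, which is the state that upfront planning is trying to reach for every bar at once.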

The only way to reduce the sizes of these bars is by upfront planning. However, in a dynamic and elastic world, as teams figure out the right computing needs, these bars will always go up and down, and it takes continual tweaking to minimize the spread. The following are the options.

Monitor reservations on a monthly basis, and purchase new reservations to avoid on-demand pricing. This helps shrink the bars above the line.
Create an internal spot market as once shared by Netflix to increase utilization of unused reservations.
Sell unused reservations in the spot market. However, as Jan Wiersma shows, the spot market to sell reservations is a ghost town.

All these options take engineering effort and involve some operational complexity. This is a result of forcing capex mindset into an opex centric cloud world.

Spot Instances are Poor Abstractions

An often suggested answer to reduce compute costs is spot instances (or Google’s preemptible virtual machines).

Spot instances shift the complexity of dynamic placement, task preemption, rescheduling, etc. from the cloud provider to the user. Spot instances work best when your workloads are idempotent, and you’ve a scheduler (like Mesos and Yarn) that can place workloads on instances, and move them around when instances go away due to changing demands in the spot market. Spot instances are not the best choice as general purpose compute.
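As a sketch of the work that lands on the user, here is a toy scheduler that re-queues a task when its instance is reclaimed. The class and its methods are hypothetical and far simpler than Mesos or YARN; it assumes one task per instance and tasks that can safely be restarted from scratch.

```python
from collections import deque


class Scheduler:
    """Toy scheduler: one task per instance, tasks re-queued on preemption."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.running = {}  # instance id -> task
        self.done = set()

    def place(self, instances):
        """Assign pending tasks to instances that are currently idle."""
        for inst in instances:
            if inst not in self.running and self.pending:
                self.running[inst] = self.pending.popleft()

    def preempt(self, inst):
        """The provider reclaimed the instance: put its task back in the queue."""
        task = self.running.pop(inst, None)
        if task is not None:
            self.pending.append(task)

    def finish(self, inst):
        """Record the task on this instance as completed."""
        task = self.running.pop(inst, None)
        if task is not None:
            self.done.add(task)


s = Scheduler(["a", "b", "c"])
s.place(["i-1", "i-2"])  # "a" and "b" start, "c" waits
s.preempt("i-1")         # "a" loses its instance and goes back to the queue
s.finish("i-2")          # "b" completes
s.place(["i-2", "i-3"])  # "c" and "a" are placed on available instances
s.finish("i-2")
s.finish("i-3")
print(sorted(s.done))
```

Restarting "a" is only safe because the task is assumed to be idempotent. Without that property, the preemption would need checkpointing or cleanup logic that this sketch leaves out.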

It’s Time for Change

Capacity guarantees may be relevant for certain critical workloads for certain users. But reservations and spot instances for cost savings don’t make sense. These are optimized for the seller and not the typical buyer. It’s time for these to die.

Fault Domains and the Vegas Rule

2017-02-17T03:50:26+00:00

(Cross posted at https://techblog.expedia.com/2017/02/16/fault-domains-and-the-vegas-rule/)

Our teams at Expedia are active public cloud users. We use a simple design principle called the “Vegas Rule”, and a complementary concept called “fault domain” to make resiliency-related decisions.

Fault Domain

A fault domain is a coarse-grained enclosure of apps, data and all the dependent infrastructure. The primary property of a fault domain is that any fault inside the fault domain does not cascade outside. All components inside the fault domain share the same fate. Below is an example. The outer circle represents a fault domain.

Any external dependency is soft, and the services in the fault domain may, in the worst case, run in a degraded mode when that external dependency fails.

In order for a fault domain to be effective, a rule of thumb to apply is the Vegas Rule.

Vegas Rule

Our version of the Vegas rule states that “any request that enters a fault domain is fully served inside the fault domain”. When a fault domain does not honor this rule and the request goes through services outside the fault domain, those external services automatically become part of the fault domain thus extending its size. Potential exclusions for this rule include asynchronous communication (say, for database replication) between two fault domains or with an external service.

Below is an example. The red arrows violate the Vegas rule thus merging two fault domains into one large fault domain.
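The merging effect can also be checked mechanically. The following is a minimal sketch, not tooling we run: given a hypothetical map of services to declared fault domains and a list of synchronous calls, it reports the calls that break the Vegas rule and the fault domains that share fate as a result. Asynchronous calls are assumed to be left out of the input, in line with the exclusion above.

```python
def effective_fault_domains(domain_of, sync_calls):
    """Merge declared fault domains that are linked by synchronous calls.

    domain_of:  service name -> declared fault domain
    sync_calls: iterable of (caller, callee) synchronous request paths
    Returns (groups, violations): groups is a list of sets of declared
    domains that share fate in practice; violations lists offending calls.
    """
    parent = {d: d for d in set(domain_of.values())}

    def find(d):
        # Union-find lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    violations = []
    for caller, callee in sync_calls:
        if domain_of[caller] != domain_of[callee]:
            violations.append((caller, callee))
        a, b = find(domain_of[caller]), find(domain_of[callee])
        if a != b:
            parent[a] = b  # the two domains now fail together

    groups = {}
    for d in parent:
        groups.setdefault(find(d), set()).add(d)
    return sorted(groups.values(), key=sorted), violations


# Hypothetical services and declared fault domains A, B and C.
domain_of = {"web-a": "A", "api-a": "A", "web-b": "B",
             "api-b": "B", "web-c": "C", "api-c": "C"}
sync_calls = [("web-a", "api-a"), ("web-b", "api-b"),
              ("web-a", "api-b"), ("web-c", "api-c")]

groups, violations = effective_fault_domains(domain_of, sync_calls)
print(groups)      # A and B collapse into one effective fault domain
print(violations)  # the single cross-domain synchronous call
```

A single cross-domain synchronous call is enough to merge two domains, so three declared fault domains behave like two.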

All About Time to Recovery

The real resiliency benefit of fault domains and the Vegas rule comes from having two or more fault domains. The Vegas rule simplifies incident recovery procedures. Instead of identifying and fixing failing services, you can shift traffic away from a failing fault domain to other healthy fault domains.

The Vegas rule works for services that use active-passive databases too.

When you don’t consciously identify fault domains for apps, services and databases, or break the Vegas rule, traffic shifting may not help reduce MTTR. You may have to identify and fix failing services which increases time to recover.
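Traffic shifting itself can be as simple as re-weighting. Below is a rough sketch, assuming traffic is split across fault domains by fractional weights (the domain names and weights are invented) and that the healthy domains have enough headroom to absorb the extra load.

```python
def shift_traffic(weights, failing):
    """Redistribute a failing fault domain's traffic share across healthy ones.

    weights: fault domain -> fraction of traffic (fractions sum to 1.0)
    failing: name of the fault domain to drain
    """
    healthy = {d: w for d, w in weights.items() if d != failing}
    total = sum(healthy.values())
    if total == 0:
        raise ValueError("no healthy fault domain left to take traffic")
    # Scale the healthy domains proportionally so the shares sum to 1.0 again.
    shifted = {d: w / total for d, w in healthy.items()}
    shifted[failing] = 0.0
    return shifted


# Hypothetical weights across three fault domains.
weights = {"us-west": 0.5, "us-east": 0.25, "eu": 0.25}
after = shift_traffic(weights, "us-west")
print(after)
```

This only works as a recovery step when each fault domain honors the Vegas rule. Otherwise the drained domain may still sit in the request path of the domains that took over its traffic.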

I learned about this concept when working on provisioning and scheduling in an IaaS layer. You will find references to this concept in Azure and VMware documents. The same concept works for applications and services too.

Taking the Time

2016-12-27T15:31:01+00:00

Linear thinking is effortless and easy. All it takes is processing events as they occur as a sequence of actions and reactions with no regard to any interrelationships. Systems thinking, on the other hand, is a “discipline for seeing wholes”. As Peter Senge describes in his Fifth Discipline, systems thinking “is a framework for seeing interrelationships rather than things, for seeing patterns of change rather than static snapshots” and “starts with understanding a simple concept called feedback that shows how actions can reinforce or counteract (balance) each other.” Systems thinking relies on observing existing mental models of reality, constructing new mental models, recognizing the feedback loops that exist between events, and building learning organizations.

A related framework is John Boyd’s Observe-Orient-Decide-Act (OODA) loop, which is a feedback decision loop to continually observe, orient and decide before acting.

These frameworks are not new. Similar patterns exist in control theory, agile development, and even in cluster managers.

What is implicit in all these is taking the time to deal with models and feedback loops, and with observations, orientation and decisions, before taking actions.

Busyness

While computers are good at executing feedback loops once those are implemented, it turns out that most of us are really poor at taking the time to deal with feedback loops at work. As email threads grow, and as meetings fill up working hours, our work lives force us to jump from event to event. Decisions happen on the fly sometimes ignoring their side effects, feedback processes, and any natural and essential delays between actions and consequences.

When you short-circuit the time for the feedback path, the closed loop collapses to a sequence of events in the feed-forward path, which is nothing but linear thinking. This style of working creates an illusion of increased productivity and busyness. It can simultaneously make you feel that your time is consumed by things outside your control.
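The collapse from a closed loop to a feed-forward sequence can be shown with a toy control loop. This is only an illustration with invented numbers: a value is pushed toward a target while an unseen drift pulls it back, once using the original plan alone and once using what is observed at each step.

```python
def run(steps, target, gain, feedback, drift=-3.0):
    """Simulate pushing a value toward a target while an unseen drift pulls it away."""
    value = 0.0
    planned_step = target / steps  # the upfront plan: equal steps, no checking
    for _ in range(steps):
        if feedback:
            # Closed loop: observe the remaining gap, then act on it.
            action = gain * (target - value)
        else:
            # Feed-forward only: follow the original plan regardless of results.
            action = planned_step
        value += action + drift
    return value


open_loop = run(steps=10, target=100.0, gain=0.5, feedback=False)
closed_loop = run(steps=10, target=100.0, gain=0.5, feedback=True)
print(f"without feedback: {open_loop:.1f}")
print(f"with feedback:    {closed_loop:.1f}")
```

Both runs take the same number of actions and look equally busy. Only the run that spends each step observing the gap ends up close to the target.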

What suffers most when you don’t take the time? Casualties include curation, narration, summarization, development of mental models, observing patterns, learning, dialog and sharing. These are all behaviors necessary for a learning organization.

Taking the Time is not Slacking

Systems thinking is a developmental activity. Development starts with taking the time. But taking the time is not slacking. It is not relaxing. It is not taking time off from work. It is about carving out the time for slow cycles that include curation, summarization, dialog, and sharing. Being fast really depends on spending some slow cycles with such activities.

Cloud Lock-in and Change Agility

2016-12-12T04:33:02+00:00

We are entering into a future of lock-in. Public clouds are no longer someone else’s computers or hosted services. These are fast becoming proprietary platforms with a bit of open source sprinkled in here and there. Almost everything you need to run an enterprise is now available as a pay-as-you-go platform on AWS, and Azure and GCP are not staying put.

As difficult as it may be to swallow and accept this trend, it is important to recognize that anything that needs to be procured or downloaded, built, run, operated and maintained is up for lock-in. In this future, all the infrastructure primitives, software lifecycle automation services (including containers and cluster managers), what was once known as middleware, data processing platforms, the mechanics used for security and operations, as well as all the enterprise data will be locked into a few public clouds.

I don’t expect any enterprise to successfully operate on, or migrate to, a public cloud and yet remain insulated from the cloud provider’s platform lock-in. Sure, there are several open-source and closed solutions (for instance, those in the container, cluster management and PaaS areas, big data processing engines, and various relational and non-relational databases) that aim to help you stay agnostic. But such platforms insulate you at best from just the mature aspects of these cloud platforms, such as the IaaS layer. None of these help you participate in the massive platform shift happening in public clouds. When you consider this shift, lock-in insulation is like having the cake and eating it too. You can’t take advantage of all the new capabilities to innovate for your business while staying agnostic to the platform.

In this lock-in future, techniques of the past decade and half, such as open source and abstraction layers, won’t insulate us from lock-in. This does not mean that open source, open interfaces and open protocols don’t matter. They do, and will continue to matter to drive a culture of open coding, sharing, learning, collaboration, and interoperability. But not necessarily for lock-in insulation.

So, what’s the answer then?

The answer, in my view, is practicing change agility. Change agility is an organization’s culture of continually practicing significant changes.

During the last fifteen years, most enterprises reacted to lock-in concerns by over-investing in abstractions such as programming frameworks, parsers, databases, communication protocols, application servers, cloud providers, you name it. At the end of it all, what was left were past regrets and large, difficult-to-change monoliths. Those abstractions, in effect, did nothing other than contribute to a culture of change resistance.

A culture of change agility, on the other hand, deals with changes as a matter of course and not as impediments. It embraces techniques like service orientation, asynchronous and decoupled communication patterns, micro-architectures, experimentation, failing fast, tolerance for mistakes, chaos engineering, constant feedback and continuous learning.

An organization adept at change agility does not see lock-in as an obstacle. It sees it as an opportunity to learn, experiment, and partake in creating its own future thus moving up the value chain. Not just once, but continually.

(Cross posted at https://techblog.expedia.com/2016/12/11/cloud-lock-in-and-change-agility/.)

Don’t Build Private Clouds

2016-11-24T23:54:38+00:00

I’ve been noodling on this post for over a year. I discussed, debated and explained parts of what I write below with several folks during this time. I also changed jobs this year. From mid 2012 to early this year, I led a team that built one of the largest mid-sized fairly successful private clouds. I now lead an effort to migrate several large-scale mission critical systems from on-prem enterprise data centers to a public cloud. This transition gave me the time and opportunity to refine and expand the scope of my thinking. So, here is my appeal. Slow down on your private cloud projects, and get out of enterprise data centers as fast as you can. You may be shooting for a local optimum with your private cloud strategy, and not the global maximum for the business.

You don’t need to own data centers unless you’re special

There are very few enterprises on the planet right now that need to own, operate and automate data centers. Unless you have at least 200,000 servers in multiple locations, or you’re in specific technology industries like communications, networking, media delivery, power, etc., you shouldn’t be in the data center and private cloud business. If you’re below this threshold, you should be spending most of your time and effort in getting out of the data center and not on automating and improving your on-prem data center footprint.

While the overall demand for compute footprint grew across the board in the industry, the number of enterprises that need to build and operate data centers to host that compute has been steadily shrinking. There are multiple factors at play behind this trend.

The scale, quality and breadth of cloud services have increased manifold in the last few years. There are very few use cases that the big three public clouds can’t deal with today.
You no longer go to a public cloud because you needed virtual machines on demand. You go to a public cloud to consume a large buffet of services.
Physical compute, storage and network infrastructure is brittle, prone to failure and is not malleable. Automating these infrastructure primitives and making them ready to host apps and data is an as-a-service exercise. These services are large distributed systems that require talent, focus, trial-and-error and years of learning and operational experience. Typical enterprise IT departments are not set up to attack such problems. Trying to emulate the same within your data centers takes years, and most likely shifts your focus away from your core business. More about that below.
Despite what infrastructure vendors claim in their brochureware, there is no single vendor that can provide you with a full stack of capabilities that meet or exceed what a public cloud can provide.
There are fewer snowflake workloads that require special purpose-built hardware today than there were a decade ago. In most cases, the choice you get with designing servers is illusionary and likely backwards looking. With each passing year, it is getting cheaper and less time-consuming to solve problems using commodity software building blocks running on commodity compute.
Despite hundreds of millions of dollars of capex investments, most private clouds are not resilient to common infrastructure or software failures. Services to enable modern resiliency patterns rarely exist in private clouds. Consequently, resiliency remains a pipe dream.

Private cloud makes you procrastinate doing the right things

When executed to its completion, a typical private cloud journey involves four key phases:

Phase 1: Build private cloud, starting with compute, and then storage and network, then scale out to several independent fault domains (like public cloud regions), automate the network to make it possible to implement load balancing, DNS, and various failover patterns.
Phase 2: Move your stateless monoliths to the private cloud. Most enterprises have at least one generation of such monoliths.
Phase 3: Then deal with the stateful monoliths. These are your large monolithic databases running on handcrafted hardware. This is usually where the private cloud journey hits the wall due to the risk and complexity in making such monoliths cloud native.
Phase 4: Then transform your culture to operate as a cloud native organization.

This is a multi-year journey with each phase involving several hurdles and taking years to execute.

Would you start with Phase 1 in an on-prem data center, or go directly to Phase 2 on a public cloud?

Private cloud cost models are misleading

A typical server with modern specs can cost between $5000 and $10,000 and can last for 4 years. A public cloud virtual machine with comparable specs can cost between $1000 and $1500 per month. Such comparisons make private cloud strategies compelling. However, there are additional costs to add.

Engineering costs to build and operate cloud services
Cost of automating the network (note that no network vendor wants you to automate with open APIs)
Cost of lost agility due to long planning, procurement and on-boarding cycles
Cost of lost business opportunity due to time spent building a private cloud
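
To get a feel for how much room these additional costs have before the comparison flips, here is a rough sketch that uses only the figures quoted above (server price, a 4-year life, and the monthly VM price). The function names are hypothetical, and the model deliberately ignores everything except those three inputs.

```python
def amortized_monthly_cost(server_price, lifetime_years=4):
    """Spread the purchase price of a server evenly over its lifetime."""
    return server_price / (lifetime_years * 12)


def break_even_overhead(server_price, vm_monthly_price, lifetime_years=4):
    """Monthly non-hardware cost per server at which on-prem stops being cheaper."""
    return vm_monthly_price - amortized_monthly_cost(server_price, lifetime_years)


if __name__ == "__main__":
    # Best and worst cases for on-prem, built from the ranges quoted in the text.
    for server_price, vm_price in [(5000, 1500), (10000, 1000)]:
        hardware = amortized_monthly_cost(server_price)
        slack = break_even_overhead(server_price, vm_price)
        print(f"${server_price} server: ${hardware:.2f}/month in hardware, "
              f"break-even overhead ${slack:.2f}/month against a ${vm_price}/month VM")
```

The hardware line item looks small on its own. The question the list above raises is whether engineering, network automation, lost agility and lost opportunity, counted per server per month, stay below the break-even figure.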

Don’t underestimate on-prem data center influence on your organization’s culture

The state of infrastructure influences your organizational culture. A modern enterprise running on programmable cloud contributes to autonomous teams, rapid learning, and faster iterations of ideas. Brittle, time-consuming, human-operator driven, ticket-based on-premises infrastructure, on the other hand, brews a culture of mistrust, centralization, dependency and control.

Say, for instance, a team wants to enable TLS all the way from load balancers to their app servers. Such a team will likely have to deal with networking teams, security teams, and potentially several middle managers to execute the change over a period of several weeks if not months. The same team could execute this change on a public cloud in under a week and move on to the next thing. There are numerous examples like this.

These differences between on-premises data centers and public clouds influence how teams think, plan and execute. These are nothing but attributes of culture.

On Influencing

2016-08-07T18:04:02+00:00

“You need to be visible and influence more to grow in your career” was the gist of feedback that I had received and ignored a few times during my career as an individual contributor. I didn’t find such feedback useful the first few times, as I didn’t really understand what it means to influence, why it matters, or how to influence.

I was also not really open and willing to understand why influence matters at that time. For all I cared, influence was a jargon word. My attitude was that I just had to work hard and try to excel as much as I could, and everything else would follow. I couldn’t have been more wrong.

Two things had to happen to change my perspective. First, I was entrusted with opportunities to lead efforts that required formation of large teams, and collaboration with several talented individuals. Second, I also got a chance to work with and closely observe a few influential leaders lead through conflict and change. Both meant a personal struggle to understand how to work with others, how to seek collaboration and support, how to make compromises, and more importantly, how to make a positive dent on initiatives that no single individual can do alone.

Individual contributor roles in tech organizations share a few common expectations like (a) being able to create, such as coding, (b) being able to architect and lead, and (c) being able to make a positive impact for the business. Other than the ability to create, these expectations demand influencing and leadership. However, sadly, individual contributors get neither the coaching nor the mentoring to help them lead others.

Most individual contributors consequently end up making some mistakes:

Assume that only managerial roles carry enough influence to command collaboration and alignment
Focus a lot on ideas to solve technical problems, but struggle to get the support they need to execute on those ideas
Take a passive advisory attitude towards outcomes, thinking that outcomes are the so-called management’s problem

Here are a few realizations that helped me unravel this puzzle of influencing.

First, realize that leadership is not a title, but is a role others permit you to play. It does not matter whether you are an individual contributor or whether your title at work includes “manager” or a “director”. You are at the mercy of your team to give you the permission to lead them. Your success as a leader depends on getting and keeping that permission. If you’re unsure about this, read John Maxwell’s The Five Levels of Leadership.

Second, as important as sound technical ideas are, you can’t always be the fastest horse to generate the best ideas and execute on them. Many others are potentially looking at the same set of problems as you are, and will very likely come to similar conclusions. You don’t have a monopoly on ideas. Period.

Third, empathize with the business and develop the context. I’ve heard some very senior individual contributors proclaim that it is the management’s problem to implement their ideas, or to adopt their creation. However, you can’t influence change unless you empathize with the things you want to change.

On the subject of empathy, I’ve an experience to share. Years ago, I wrote a piece of software called ql.io. I was extremely passionate about the ideas, and executed hard and fast. I orphaned the project about a year later as it failed to gain adoption. In retrospect, though my ideas were valid and the problems my ideas were addressing were real, I failed to develop the context under which teams were facing those problems. I was furious at the problems I was trying to solve, but failed to empathize with the people dealing with those problems. Teams that looked at ql.io liked what they saw, but could not quite figure out how to incorporate it into their apps. Consequently, I could not get the adoption that I depended on. The lesson I learned from that experience was that I need to understand the context, and empathize with the people and their current technical environment. No matter how great the ideas are, you can’t balloon-drop solutions.

Fourth, instead of trying to mint solutions to solve problems, help others develop better mental models to solve problems on their own. As Peter Senge says in his The Fifth Discipline, teams and organizations trap themselves in defensive routines that insulate their mental models from examination. It is often the established mental models, including your own, that prevent change from happening. Inquire into those mental models first.

Finally, resist the temptation to intervene in everything. As you grow up in the hierarchy and stay long enough in any organization, you will get an opportunity to see most things as they happen. This gives you a chance to jump into every conversation and voice your opinions. Resist that urge to intervene, let things pass, take an observer role, and intervene only when absolutely necessary. Inquiry into mental models starts with observation and not intervention.

As John Maxwell says in his book, “Leadership is much less about what you do, and much more about who you are”.

Originally published at https://www.subbu.org/blog/2016/08/on-influencing.

Turning Containers into Cattle

2016-02-21T17:23:56+00:00

Below are slides from my talk at Container World on Feb 17 2016 at the Santa Clara Convention Center.

Turning Containers into Cattle (Subbu Allamaraju, Container World, Feb 2016), slides hosted on docs.google.com

It’s the Manageability Stupid

2015-11-10T06:29:23+00:00

We’re living in interesting times. Till 4–5 years ago, infrastructure used to be locked by review boards and tickets. Open source has been slowly changing this world. Take OpenStack, Docker, Mesos, Kubernetes, you name it.

But here is a message for everyone jumping into these. If you don’t understand manageability you will fail. Failure won’t happen immediately, but you will go through some phases over a period of time.

I’ll give a few simple reasons:

Infrastructure is brittle. It takes a lot of automation to make it malleable.
Most infrastructure software is distributed. Understanding their failure modes and designing for failure are non-trivial exercises.
It takes a lot of software to manage these systems at scale, and you need to apply software engineering to operational tasks.
There is no free lunch.

Don’t pick X, where X is any of the above tech, just because some name brand company picked it or even built it. Pick X only if you’re prepared for the long haul and ready to build a ton of software to manage X.

Lessons from the Cloud Bunker

2015-08-16T21:35:30+00:00

It’s been just over three years since I started working on building a fairly large cloud infrastructure at eBay. Our initial focus was to make infrastructure programmable and to enable agility through self-service APIs with some checks and balances for efficiency, security and availability. This model put infrastructure basics like compute, network and block/object storage in front of every developer at eBay. Though this allowed adoption of a wide variety of workloads in our data centers, this journey also taught me a couple of very valuable lessons.

Don’t bet on ephemeral cloud abstractions

Ephemeral abstractions are things that fail, and that may not recover from failures. The best example is a compute instance (e.g. a VM) with a local disk, an IP address and a hostname.

At the bottom of every public or private cloud stack diagram is the IaaS layer consisting of such ephemeral abstractions. Almost all IaaS layers implement a decade-old playbook from AWS. This playbook has three steps:

(1) Automate infrastructure abstractions like compute, block storage, and various network functions and offer APIs to manipulate those abstractions.
(2) Provide additional services for metrics, monitoring, orchestration etc.
(3) Expect users to put together a closed-loop automation layer using (1) and (2) to make their apps resilient to infrastructure failures.

As important and hard as steps (1) and (2) are, the basic flaw with this approach is that the amount of engineering it takes to implement step (3) is non-trivial. Very few get it right. Consequently, most users of these abstractions subject their apps to infrastructure failures and remain concerned about cloud not meeting their expectations.

Here is why.

In order to make apps resilient to failures at the bottom of the stack (which includes ephemeral abstractions), the user is expected to monitor and detect failures, and bring each app back to its desired state as quickly as possible. This is not an operational or even a dev-ops problem. This is a software engineering problem consisting of what we call at work the closed-loop P-D-M-R cycle.

The steps in this closed loop are self-explanatory. Without the “monitor and detect”, and “remediate” phases, as infrastructure failures happen, apps drift from their desired state.
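
As a minimal sketch of the two phases named above, “monitor and detect” and “remediate”, consider the following Python. The state model (an app name mapped to a count of healthy instances) and all function names are assumptions made for illustration; they do not come from any particular cloud API, and the other phases of the cycle are left out.

```python
def detect_drift(desired, observed):
    """Monitor and detect: return {app: missing_instances} for apps below target."""
    drift = {}
    for app, want in desired.items():
        have = observed.get(app, 0)
        if have < want:
            drift[app] = want - have
    return drift


def remediate(observed, drift):
    """Remediate: return a new observed state with the missing instances restored."""
    repaired = dict(observed)
    for app, missing in drift.items():
        repaired[app] = repaired.get(app, 0) + missing
    return repaired


def run_once(desired, observed):
    """One pass of the closed loop; a real system would repeat this continuously."""
    drift = detect_drift(desired, observed)
    return remediate(observed, drift), drift


if __name__ == "__main__":
    desired = {"checkout": 4, "search": 6}
    observed = {"checkout": 4, "search": 3}  # three search instances were lost
    observed, drift = run_once(desired, observed)
    print("drift found:", drift)
    print("state after remediation:", observed)
```

Even in this toy form, the loop needs a trustworthy source of observed state and a safe way to act on it, which is where most of the engineering effort goes.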

This playbook worked out well for those in the industry that understood this ephemeral nature, and invested engineering efforts to build software for step (3) above. For the rest, this playbook is a breeding ground for pets. Consumers of pets of course will want pet-friendly solutions like live-migration, VMs that run on shared storage, IP mobility, or their own “highly-available racks” (pun intended). This is a slippery slope.

Moreover, most cloud providers, including open-source cloud controller software like OpenStack, don’t even offer all the building blocks necessary to implement a closed loop P-D-M-R cycle. In those environments, remediation exercises tend to be human/ticket driven.

Here is the net lesson. Having things that fail as the primary interface to cloud may have been an acceptable cloud strategy in 2005, but not anymore.

The future is durable and declarative abstractions

It is time to think of cloud as a provider of durable cloud native abstractions that are resilient to failures. The term “cloud native” is fairly new with no clear definition. I would describe a cloud native abstraction as one with two fundamental characteristics:

Durable: The abstraction may be built using ephemeral parts, but the abstraction itself is durable in the sense that it survives failure of its parts.
Declarative: The user of the abstraction declares the desired state, and the service providing that abstraction attempts to maintain the abstraction in that desired state.

A virtual or a physical machine is neither durable nor cloud-native. Neither is a container. But a cluster of Kubernetes pods is a durable and declarative abstraction. To a lesser extent, a Marathon managed cluster of containers is also durable and declarative.

In the world of cloud native abstractions, you wouldn’t put together the P-D-M-R closed loop using the IaaS primitives. You would instead create durable abstractions that automate and shield you from having to deal with the complexities of the P-D-M-R cycle. You set a desired state when creating that abstraction, and the provider will try to maintain the desired state. The desired state could be as simple as the size of the cluster, or as sophisticated as some application KPIs.
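
To make the contrast concrete, here is a small Python sketch of a durable, declarative abstraction whose desired state is simply the cluster size. `DurableCluster` and its methods are hypothetical names invented for this illustration; they are not the API of Kubernetes, Marathon, or any other system mentioned here.

```python
import itertools


class DurableCluster:
    """Hypothetical provider-side abstraction: the user declares a size,
    and the provider replaces failed members to hold that size."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self._ids = itertools.count(1)
        self.members = set()
        self.reconcile()

    def fail(self, member):
        """Simulate the loss of an ephemeral part (a VM, a container)."""
        self.members.discard(member)

    def reconcile(self):
        """Provider-side step: converge the actual state to the desired state."""
        while len(self.members) < self.desired_size:
            self.members.add(f"node-{next(self._ids)}")
        while len(self.members) > self.desired_size:
            self.members.pop()


if __name__ == "__main__":
    cluster = DurableCluster(desired_size=3)
    cluster.fail("node-2")   # a part fails ...
    cluster.reconcile()      # ... and the abstraction survives it
    print(sorted(cluster.members))
```

The user only ever touches `desired_size`. Individual nodes come and go, but the cluster as an abstraction persists.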

Having closely dealt with infrastructure failures and the complexities of making applications resilient to such failures, I’m confident in saying that the end of the era of ephemeral abstractions has finally begun. Durable and declarative cloud native abstractions are the future of cloud.

Does it mean that IaaS is dead? I don’t think so. You still need the ability to create parts (e.g. a VM or a network port or a block of storage) when creating durable abstractions, but most users of cloud shouldn’t have to deal with such operations. That layer will eventually become a pure provider-side internal layer.

Originally published at https://www.subbu.org/blog/2015/08/lessons-from-the-cloud-bunker.

OpenStack Summit Keynote Video and Slides

2015-06-01T01:35:57+00:00

At eBay and PayPal, we have been power users of OpenStack, providing infrastructure building blocks via OpenStack services. I had the opportunity to share details of our OpenStack journey during a keynote at the OpenStack Summit in Vancouver on May 19, 2015. Here are the video and the slides.

Journey and future of OpenStack eBay and PayPal from Subbu Allamaraju

Give Me Bare-Metal

2014-11-10T21:43:40+00:00

“Give me bare-metal. I can put together what I need myself. I don’t need a cloud.” This is the gist of some comments I came across in recent months. One of the commenters just experienced Docker. Another managed to spin a cluster of front end apps in a few minutes using the Marathon framework on Mesos. I even heard a similar comment from someone in the YARN community; see Docker & Kubernetes on Apache Hadoop YARN for an example. However naive such a comment sounds, I could not ignore the underlying sense of empowerment from using talk-of-the-day tech like Docker, Mesos, and Kubernetes.

While it is easy to take sides in debates about Docker, containers vs VMs, Mesos vs Kubernetes vs cloud platforms like OpenStack, take note that there are extremely important reasons why some of these are exciting to use. These are driving immense amounts of simplification for the user. What used to take large amounts of code and operational procedures to implement commonly sought after ilities like availability, agility and efficiency is now getting simplified to some declarative assertions. Getting a VM is now considered boring. Dockerizing apps and deployments is new and exciting.

Underneath all this, some of the fundamental assumptions that the industry has made about Infrastructure as a Service, and Platform as a Service are getting disrupted.

Cloud platforms that offer nothing but VMs have been under threat for a while, but Docker has made that threat practicable. If I’m a front-end app developer, it’s much more convenient for me to use a container-based cluster management solution on bare-metal without dealing with VMs. After all, why bother about provisioning and putting VMs together into a cluster when there is a framework that does it for me, in addition to solving some of my app packaging needs?

Most private cloud programs, and even some smaller public clouds still start with compute as the primary offering. This is because compute is usually the first one impeding developer agility and it makes sense to solve it first. However, Docker is now making it easy to question the existence of compute-only clouds.

Platforms that require isolated and dedicated compute and storage clusters for each type of workload too are up for question. Such platforms aren’t able to run heterogeneous workloads on shared infrastructure to improve resource utilization. Though most CIOs love to talk about flexing down online workloads at off-peak hours to flex up batch workloads, the reality in most data centers is painted racks, i.e., dedicated servers/racks of compute and storage for different types of workloads. In these types of environments, cluster sizes are relatively fixed, utilization remains low, and moving capacity across environments usually requires manual changes. Scheduler frameworks like Mesos are set to solve resource sharing across heterogeneous workload types thus eliminating the need for separate environments. Note that true sharing still requires compute and storage with predictable performance characteristics with little or no interference, and a multi-tenant infrastructure layer to isolate workloads from one another for security reasons.

What is getting disrupted here is not the core building blocks that advanced cloud platforms provide, but how we’re putting together infrastructure building blocks for agility, efficiency, availability, heterogeneity etc. The patterns we’ve been using to arrive at these ilities are now becoming commodities. The likes of Kubernetes and Mesos are abstracting out the complexity involved in designing for these ilities. The era of commoditization of infrastructure patterns is finally here. The rewriting exercises that both CloudFoundry and OpenShift are going through in 2014 reflect this shift.

Let me start with one of the most basic patterns used to deploy stateless front end apps for scalability and availability. Here is an example topology.

This topology consists of several compute resources in a cluster behind a VIP attached to a load balancer or a proxy server deployed in two or more availability zones (coarse grained fault domains), and some kind of DNS based routing to send traffic to nodes in these clusters. You can scale this pattern by adding more nodes behind the VIP, and increase availability by spreading the resources across several availability zones. The larger the number of availability zones you use, the less spare capacity you need to survive the loss of any one availability zone.
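
The spare capacity point is simple arithmetic, sketched below in Python. It assumes traffic is spread evenly across zones and that the goal is to survive the loss of exactly one zone at peak load; the function name and the load figure are made up for the example.

```python
def capacity_to_survive_zone_loss(peak_load, zones):
    """Return (per_zone, total, spare) capacity so that the remaining
    zones can still carry peak_load after any single zone is lost."""
    if zones < 2:
        raise ValueError("need at least two zones to survive the loss of one")
    per_zone = peak_load / (zones - 1)   # the surviving zones must carry the peak
    total = per_zone * zones
    return per_zone, total, total - peak_load


if __name__ == "__main__":
    peak = 600  # arbitrary units, e.g. number of nodes needed at peak
    for zones in (2, 3, 4, 6):
        per_zone, total, spare = capacity_to_survive_zone_loss(peak, zones)
        print(f"{zones} zones: {per_zone:.0f} per zone, {total:.0f} total, "
              f"{spare:.0f} spare ({spare / peak:.0%} over peak)")
```

Under these assumptions the spare fraction is 1/(N-1) for N zones: 100% with two zones, 50% with three, and 20% with six.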

Most home-grown PaaS layers start by automating ways to provision and manage apps across such a topology. The pattern usually maximizes developer agility, availability and scalability. This pattern requires a service to provision compute resources, one to manage clusters of nodes behind VIPs, and another to manage DNS. You can automate the implementation of this pattern by making these three services extremely efficient. An orchestration service would help simplify the implementation.

The bottom layer here consists of the building blocks needed to implement the pattern shown in the middle layer. In fact, the first generation of cloud at eBay looked exactly like the above. The infrastructure layer in that cloud was built to suit the implementation of the above pattern.

Services like AWS CloudFormation and AWS CloudWatch make this topology resilient to failures and workload changes with auto-scaling. With the addition of such services to our building blocks layer, we can refine our pattern for resiliency.

Kubernetes arrives at a similar outcome for containerized applications, with cAdvisor providing for monitoring. CloudFoundry’s Diego is treading along the same lines.

Consider Mesos for another example. It can help get the most work done out of a given fixed pool of compute resources by finely slicing and dicing compute and allocating to different types of workloads. You can, for instance, run front-end, batch, and YARN with the help of Marathon, Aurora, and Myriad frameworks respectively on a single pool of compute resources. Depending on your security and data classification needs, you may need an underlying layer to provide multi-tenancy with compute, network and storage isolation.

The general theme to notice here is a layered cake consisting of a layer of platforms enforcing certain patterns, utilizing a layer of foundational building blocks to implement those patterns.

In essence, the complexity of dealing with raw infrastructure services, and that of implementing various patterns, is getting restructured for the better. As Mårten Mickos puts it in a recent Newstack article, “it may seem like you are removing the complexity but really it is just is pushing it somewhere else.” Layers help shuffle complexity better. By keeping the lower layers generic, we can let new patterns emerge.

An effective cloud platform needs both a strong layer of infrastructure building blocks, and platforms enforcing certain patterns. It is not a question of one versus the other. I’m not attempting to list the basic building blocks here, but a quick scan of AWS should give a hint of what the most common ones are. In fact, without the core building blocks offered as services with 4 nines of availability, your infrastructure may remain as pets and not cattle. More about that later.

OpenStack on Diet

2014-11-04T06:45:13+00:00

Scanning the list of OpenStack projects on GitHub, here is my pick of projects in alphabetical order that don’t need to be there at present in their current location. This is not a reflection on their quality or utility. This list is purely based on their relevance as the foundational building blocks for offering infrastructure services.

This leaves the following essentials

Relevance of projects like OpenStack has been, and shall always be based on their merit and depth, but not on the length of their portfolios. To give an analogy, Linux is relevant because of its strong core, and not because of KDE and OpenOffice.

Automate Everything — But Don’t Ignore Drift

2014-10-06T13:38:49+00:00

Treat infra as code. This is one of the key maxims in devops circles. This principle advocates treating infrastructure as code, automating everything, and applying the same engineering processes and principles as you would apply to software development. While it is extremely important to practice this principle, drift is bound to catch you by surprise when operating at scale. That’s what I learned while building and operating the cloud at work.

Here is the ideal outcome of automating everything. Every node in the system is automated, has the right configuration, gets updated in synchrony with others, and everything is as it is supposed to be, right from day one. But this is a naive view.

Treating infrastructure as code, and automating everything does not mean that everything agrees with that desire and behaves as told in practice. Though it is less difficult to control drift at small scale, when you’re operating thousands of nodes of different types in multiple locations, the chances for configuration drift are high, and it requires conscious effort to be aware of, to manage, and to mitigate drift.

To give an example, the system I deal with is a private cloud deployed across several geographically distributed availability zones. Overall, there are over 30 types of nodes that are managed through automation. While some of these nodes are stateless running in clusters behind VIPs, a few are stateful nodes with active/active or active/passive configurations. A large subset of nodes are hypervisors with some variations in their configuration to support different types of workloads. As the hypervisors are long-lived, they go through more changes during their lifetime than other types of nodes. A few nodes are software appliances.

All these nodes go through different rates of change made by different teams. Moreover, not all nodes manage their configuration state the same way. Though most load configuration from configuration files, some types of nodes are initialized through databases, and some through API calls.

Why bother

Your experience may vary, but in the systems that I deal with, drift is a bigger deal than originally imagined. Here are some of the consequences of drift.

Bad user experience: For instance, a few badly configured nodes out of hundreds or thousands may only impact a few customer interactions. Such silent bugs are hard to detect as they may not show up in overall system metrics and KPIs.
Incidents waiting to happen: Incompletely or inconsistently applied changes can mask problems and eventually lead to incidents. The specifics vary from system to system, but I’ve my stories to tell.
Impact on time to recovery: Even worse, drift can impact time to detect failures and time to recovery from those. Most unplanned drift discovery also happens during incidents.

Where does drift come from

There are four key contributors to drift.

Automation gaps and bugs

This is the most natural source of drift. Every automation gap is a potential source of drift. Like regular software development, automation too goes through an iterative development process, and gaps are a natural consequence of that iterative process.

Moreover, the act of mutating a node’s configuration in place may leave cruft behind, eventually leading to drift. Immutability can help mitigate such drift in some cases, but immutability is not an answer for everything. It is also expensive to implement in certain cases.

Human error and (bad) habits

This is largely an issue of culture and past habits. The causes in this category include debugging on live systems, and people making ad hoc changes that bypass configuration management and change control. Such drift is likely to remain uncaught until it leads to a noticeable issue or an incident.

Incidents

During incident management, the focus is on time to recovery and not automation and change control. The act of recovery is likely to introduce drift.

Transitional

When operating at scale, not every node and not every service is likely to be updated at the same time. You may also need to stagger certain changes over weeks or months to leave room to observe, tweak or even roll back.

How to manage drift

First, don’t deny drift. Acknowledge that drift is a possibility and that automation may be incomplete or buggy.

Second, build tools to regularly audit for drift. Automation is expected to reduce drift, but like most things, automation too may be a work in progress and will have bugs. Use audits to discover the state of drift. Awareness is a prerequisite for mitigation.

Third, extend the “measure everything” maxim to include drift. At any given point in time, be able to know what nodes/systems are in drift, and assign severity based on their potential impact. Wire up these metrics to your alerting systems so that the team gets alerted when drift is discovered.

Fourth, make drift mitigation a planned activity, sprint after sprint. Use drift metrics to track the mitigation progress.

Finally, reward the right habits.

Monitoring and Alerting for OpenStack

2013-10-17T08:06:39+00:00

OpenStack is a loosely coupled distributed system. Given the number of moving parts in OpenStack, the particular configuration of an OpenStack deployment, and the underlying layers in the data plane, failure detection and debugging can get non-trivial. In particular, some components like DHCP agents, Quantum/Neutron APIs, their drivers, the Nova metadata API, volume backend devices, Swift etc. have a higher availability bar due to their involvement in certain flows in the cloud data plane. OpenStack deployments need significant investments in log aggregation, metrics collection, monitoring and alerting for early problem detection and recovery. In this post, I would like to show some of the practices and open source tools that you can use for problem detection and troubleshooting.

To set the context, take a common task like creating a new virtual machine instance. The diagram below shows all the components that must cooperate to bring up the virtual machine into a state that a tenant is able to use. The flow starts from a tenant request for a new virtual machine and ends with the successful completion of the cloud-init process. Boxes with thin borders show the control plane, while boxes with bold borders show components that participate during the boot process and later. In this example, quantum is backed by the NVP plugin, and keystone is set to use PKI for tokens. I’ve ignored the database in this diagram.

The best possible outcome for the requesting tenant is a virtual machine ready to use in a couple of minutes or less. Other possible outcomes include performance degradation of APIs like nova, glance, or keystone; nova scheduler failures due to hypervisor or network capacity; timeouts due to performance degradation of any request along the flow; or partial failures such as a virtual machine that could not get its network interfaces up or a virtual machine that the tenant cannot access due to metadata API failures.

In this flow, a 202 response code from nova API or even a virtual machine with the power state ACTIVE does not actually mean that the tenant got what it asked for. This is true for several other common flows as well.

Fortunately, most of the OpenStack components generate raw signals to debug failures. But it is up to the operator to build a system to collect those raw signals, and process them into metrics to monitor and alert upon.

Below is a diagram that shows common signals and a schematic of how to collect, process and generate useful metrics and events for tracking and monitoring.

The key signals to look for include the following:

Basics: These include usual signals like process health, restarts, CPU, memory etc. across all nodes. In the diagram above, this is done via Zabbix, an open source solution for monitoring and alerting.

Log files: This is one of the first sources to look at for most problem resolutions. In addition to a detailed trace of warnings and errors, log files in OpenStack include useful information about how various OpenStack APIs are behaving. For instance, here is a message from the nova-api log.

2013-10-13 18:56:11 INFO nova.osapi_compute.wsgi.server [req-69b8fce4-b18c-4a52-8c6f-4256d4e45db2 991a8f5572cdeb7fb04a91aca1c4cb7b 991a8f5572ca4b7fb04a91aca1c4cb7b] xx.xx.xx.xx,xx.xx.xx.xx - - [13/Oct/2013 18:56:11] "POST /v2/991a8f5572cdeb7fb04a91aca1c4cb7b/servers/c4ba5bba-228d-4997-b077-92507238c37a/action HTTP/1.1" 202 121 0.109871

This message tells you many things:

The component label nova.osapi_compute.wsgi.server tells you that this message is due to an API request with a request ID, project ID and user ID. By aggregating all log files into a central store that can index them and provide a search interface, you can determine which components and hosts the request went through and the outcome at each stage.
This message also tells you that it was a POST request for an action on a virtual machine instance, that the response code was 202, the client IP addresses (masked in this example), and that it took 0.109871 seconds. By extracting these bits of data, you can trend traffic, failures, and response latencies over time, and use those metrics for KPI tracking as well as for generating alerts when those metrics deviate from the guarantees you provide as a cloud operator.
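To make the extraction concrete, here is a minimal Python sketch that pulls the request ID, method, path, status and latency out of a line shaped like the one above. The regular expression is written against that single sample, not against any documented nova log format, and the field names (including treating the number after the status as a response length) are my own assumptions. In practice logstash’s grok filter does this job.

```python
import re

# Pattern fitted to the sample nova-api line; other components log differently.
API_LINE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+) (?P<component>\S+) '
    r'\[(?P<request_id>req-[0-9a-f-]+) [^\]]*\] '
    r'(?P<clients>\S+) - - \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[0-9.]+" '
    r'(?P<status>\d{3}) (?P<length>\d+) (?P<seconds>\d+\.\d+)$'
)

SAMPLE = (
    '2013-10-13 18:56:11 INFO nova.osapi_compute.wsgi.server '
    '[req-69b8fce4-b18c-4a52-8c6f-4256d4e45db2 '
    '991a8f5572cdeb7fb04a91aca1c4cb7b 991a8f5572ca4b7fb04a91aca1c4cb7b] '
    'xx.xx.xx.xx,xx.xx.xx.xx - - [13/Oct/2013 18:56:11] '
    '"POST /v2/991a8f5572cdeb7fb04a91aca1c4cb7b/servers/'
    'c4ba5bba-228d-4997-b077-92507238c37a/action HTTP/1.1" 202 121 0.109871'
)


def parse_api_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = API_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert numeric fields so they can be counted and timed downstream.
    fields["status"] = int(fields["status"])
    fields["length"] = int(fields["length"])
    fields["seconds"] = float(fields["seconds"])
    return fields


parsed = parse_api_line(SAMPLE)
```

With `status` and `seconds` as numbers, counting responses by status class and timing requests per method becomes a simple aggregation.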

Having used logstash, ElasticSearch, and Kibana for nearly a year in production in our cloud at eBay, I would not run an OpenStack cloud without this trio for log collection and processing. logstash is quite flexible for log collection and processing. For OpenStack deployments, logstash is particularly useful since not all components follow the same format for log messages. logstash’s grok filter helps normalize various log messages into a format that can be indexed and searched.

Once collected and parsed, you can send logs to ElasticSearch for indexing using the ElasticSearch output. With Kibana, which is now an ElasticSearch plugin, as a front-end to ElasticSearch, you can search logs in near realtime. Input/output plugins like zeromq help transport logs from their source to an ElasticSearch cluster.

Log searching is useful for debugging, but how do you mine health indicators like request rates, error rates, latencies etc.? This is where Etsy’s statsd comes in. By funneling parsed logs into statsd via logstash’s statsd plugin, you can extract timing metrics, counters and rates from logs. Once extracted, you can send them to a system like Graphite for metrics tracking and graphing, and a system like Zabbix for alerting.

Database tables: OpenStack database tables have a ton of useful data that an operator must look at periodically. In addition to showing usage growth and impending capacity issues, tables like nova.instance_faults and nova.instances include some key performance indicators. For instance, the nova.instance_faults table keeps track of scheduling failures, which is an important indicator of system health. Similarly, instances in the nova.instances table whose task_state remains non-NULL beyond a reasonable time after a task starts point to potential health issues.

How do you extract this information for monitoring and alerting? All it takes is some code to periodically poll those tables, count various indicators, and then write to systems like Graphite and Zabbix.
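Here is a minimal sketch of such a poller in Python. To stay self-contained it uses an in-memory SQLite table as a stand-in for nova.instances, stores updated_at as epoch seconds (an assumption; a real deployment would query the actual nova database with its own driver and date handling), and uses a made-up metric path and threshold. The output line follows Graphite’s plaintext format of path, value and timestamp.

```python
import sqlite3


def count_stuck_instances(conn, now, max_age_seconds=600):
    """Count instances whose task_state is still set after max_age_seconds."""
    cutoff = now - max_age_seconds
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM instances "
        "WHERE task_state IS NOT NULL AND updated_at < ?",
        (cutoff,),
    ).fetchone()
    return count


def graphite_line(path, value, timestamp):
    """Format one metric for Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


# Stand-in data: one idle instance, one stuck in a task, one recently updated.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (uuid TEXT, task_state TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?)",
    [("a", None, 1000), ("b", "spawning", 1000), ("c", "deleting", 1900)],
)
now = 2000
line = graphite_line("nova.instances.stuck", count_stuck_instances(conn, now), now)
```

A cron job or small daemon could run this on an interval and write each line to the Graphite listener over a socket.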

Synthetic Flows: What we found during a year+ of operating an OpenStack private cloud at work is that the above are not sufficient to help us proactively find issues before our users see them, and to reduce time to recovery. Here is why.

The signals above do not reflect all potential failures. For instance, none of the above sources may reveal that, though the virtual machine creation succeeded, the tenant was unable to use the instance due to a failure during cloud-init. Such a failure may have happened due to a failure in the DHCP agent, metadata API, Quantum or even just a slow database query.
Some failures may be due to bad user input such as a broken user-data script, or a bad image. Though we can filter out some bad user inputs for all API requests by ignoring responses with 4xx response codes, bad user inputs that get used during the boot process are hard to filter out.
Finally, the absence of errors in logs, database tables, or other sources does not mean that all components are functioning as expected. It may just be that no user is exercising certain critical components of the cloud at the moment.

One way to account for these blind spots is to select a few key flows like the ones below that count towards the operator’s SLA, and synthesize those flows with controlled inputs such that you can programmatically assert the outcomes against expected.

1. Launch a virtual machine instance and verify that it is usable by the tenant.
2. Delete an instance and see that the instance is gone and the tenant’s quota is back.
3. Attach a volume and write to it.
4. Take instance and volume snapshots.
5. Publish a new image, and repeat (1) with that image.

Any deviations show health issues. In the schematic above, there is a synthetic test bot that is continually exercising the cloud and writing signals to Graphite and Zabbix.
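The shape of such a test bot can be sketched as follows. This is a skeleton under assumptions, not the bot we run: each step is a hypothetical callable that would wrap real API calls (boot an instance, SSH in, and so on), and the result dictionary is a made-up format that you would translate into Graphite and Zabbix signals.

```python
import time


def run_flow(name, steps, clock=time.monotonic):
    """Run the steps in order, stopping at the first failure.

    steps is a list of (step_name, callable) pairs; a callable signals
    failure by returning a falsy value or raising an exception.
    """
    started = clock()
    failed_step = None
    for step_name, check in steps:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed_step = step_name
            break
    return {
        "flow": name,
        "ok": failed_step is None,
        "failed_step": failed_step,
        "seconds": clock() - started,
    }


# Placeholder steps standing in for real calls against the cloud.
result = run_flow(
    "launch_instance",
    [
        ("boot", lambda: True),
        ("cloud_init_done", lambda: False),  # simulate a cloud-init failure
        ("ssh_login", lambda: True),
    ],
)
```

Because the inputs are controlled, a failed step points at the cloud rather than at bad user data, which is exactly what the log and database signals cannot tell you.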

There it is. Since OpenStack is not a closed system, you can use open-source tools like logstash, ElasticSearch, StatsD, Graphite, Zabbix etc. to tame OpenStack into an operable cloud. Remember, OpenStack is not cloud.

Private Cloud Operating Principles

2013-09-16T19:28:30+00:00

Given the modular architecture of OpenStack as cloud controller software, operators of OpenStack based private clouds have many choices on how to shape cloud as a service. For public cloud operators the answer is clear: emulate AWS. The answer can be difficult for companies building private clouds, particularly if those companies have been around for a while. What operating principles would you choose in such cases and why?

When we faced the same question at eBay last year, we chose a simple operating principle that we continue to practice every day.

Think and act like a public cloud provider to provide unfettered self-service access to native OpenStack capabilities.

Why should a private cloud operator choose such an operating principle? Isn’t it constraining? Yes, it can be constraining in the short term, but in the long run, this simple principle allows for more innovation and agility, as well as increased operational efficiency. Here is why.

Self-service: A public cloud mindset forces you to care about self-service. No capability exists in a public cloud without an API that is integrated with all other cloud capabilities and works the same for every user. Users get access to those capabilities on their own with no tickets and approvals. For instance, in our cloud, users not only get compute, network and storage on their own, they can also bring in their own images and customize those to meet their needs unbeknownst to us. We in fact chose to put a lightly customized version of the Horizon dashboard along with all the public APIs in front of users and let them figure out what to do with those. This turned out to be infectious.

In the private cloud context, self-service brings in some challenges, the biggest of which is compliance with the various policies and processes in place. Most hurdles that users in large enterprises face do start with processes designed to enforce those policies. But policy enforcement is solvable. Instead of putting a person or committee in charge of approving and verifying compliance, you will need to build software and put it in charge of compliance.

In addition to productivity gains due to agility, self-service helps democratize capacity planning and increase efficiency. I once worked in a company where teams used to go in front of a committee a few months in advance to get their capacity. When you’re faced with such a process, you tend to over-estimate capacity needs out of worry that you may not get the capacity you need when you need it. This is capacity hoarding. Self-service, on the other hand, improves data center utilization and reduces waste since users know they will get the capacity that they need when they need it.

Abstractions: Self-service also forces a well-defined API layer abstracting the infrastructure from applications above. In the case of OpenStack, users don’t need to ask how to use those APIs since they can google for help on their own. Consequently, we get to focus on building and operationalizing the cloud while our users get to take charge of their use cases on top of the APIs. The challenge for the operator is to ensure compatibility with documented public APIs. For the opportunist, this challenge helps improve the quality of OpenStack customizations needed to suit the business.

Unified Control Plane: One of the tempting patterns for private cloud operators is to spin up a separate instance of the cloud for each use case, such as a cloud for developers, a different cloud for QA, and an entirely different and isolated cloud for production use cases. In this model, cloud is treated as software and tenants go to different places for different needs. Each cloud will have a different user experience, a different set of capabilities, and a different set of rules to play by. Such a model is less attractive for a public cloud operator due to user confusion, increased cost of operations and reduced flexibility to move capacity around. When you think and act like a public cloud provider, users don’t see multiple clouds. They see multiple regions and availability zones behind a unified control plane and can pick regions and availability zones to meet performance and availability constraints.

In the end it turns out that what’s good for business is not different from what is good for cloud users, even when the cloud is private with the control plane behind a firewall.

OpenStack is not Cloud

2013-07-25T19:37:31+00:00

In recent weeks, I’ve heard a number of opinions about OpenStack. Some want OpenStack to get compatible with AWS while others don’t care. Some say that you need vendor-built production-grade OpenStack distributions to succeed while others want you to buy solutions — with software and people — to help you build and operate an OpenStack cloud. Some others want you to buy racks of OpenStack. But here is the crux. OpenStack is not cloud. AWS is cloud. The difference is extremely significant.

My team at eBay builds and operates an OpenStack based private cloud. Note that the opinions I express here are my own and do not represent eBay. Except for the network virtualization layer, we use publicly available OpenStack and other open-source software. We offer cloud primitives such as virtual compute, network and storage on demand directly to anyone who wants them. Like any OpenStack power-user, we’ve got our cuts and bruises amidst happy user testimonials, but this post is not about those experiences.

A cloud is a service, and not just software. As far as the users of the service are concerned, a cloud is a set of APIs and tools backed by an elastic infrastructure that offers what the APIs and tools promise. Users care about availability of the cloud, elasticity of infrastructure, and on-demand self-service access to maintain business agility. APIs and dashboards are critical components of user experience, but they are just a small part of a cloud.

AWS is certainly a cloud. It includes things that drive user experience such as APIs, dashboards, and an ecosystem around those. Behind the scenes it includes an elastic infrastructure. Users remain agnostic of how that infrastructure is built and managed except for certain qualities like availability, performance, scalability, elasticity, and efficiency.

However, OpenStack is cloud controller software. Though the community did a nice job at putting together this software, an instance of an OpenStack installation does not make a cloud. As an operator you will be dealing with many additional activities, not all of which users see. These include infra on-boarding, bootstrapping, remediation, config management, patching, packaging, upgrades, high availability, monitoring, metrics, user support, capacity forecasting and management, billing or chargeback, reclamation, security, firewalls, DNS, integration with other internal infrastructure and tools, and on and on and on. These activities are bound to consume a significant amount of time and effort. OpenStack gives some very key ingredients to build a cloud, but it is not cloud in a box.

Since a cloud is a service, you can’t approach it as if it were boxed software, as some pundits want us to believe. When you run a service, stuff happens. For instance, your hypervisors clog up disk space late in the night due to an obscure bug in the version of the OpenStack distribution you’re using. Or RabbitMQ gets into a split brain problem and the control plane freezes over. These are not OpenStack problems, but operational incidents that are bound to happen.

Is AWS API compatibility what an operator should worry about above all these? Nah. You can fix API incompatibility with glue code.

ClickOps

2013-06-23T15:37:47+00:00

n (klɪk**.**ops)

The act of performing systems administration and configuration by pointing and clicking on proprietary tools. ClickOps engineers usually receive requests to perform operations via ticketing systems, queue them for hours to days, and perform those operations in sequence.

ClickOps engineers often confuse their role with DevOps although their intended responsibility is to increase work-in-progress by queuing up automatable tasks.

Thanks to @bsletten for suggesting this term in a conversation on Twitter.

Code the Infra

2013-04-13T23:31:55+00:00

Code the infra. There is no other way to make operations predictable and repeatable. The opposite of coding the infra is what I call “box hugging”. If you log into boxes to configure, install packages, start/stop services, or do any maintenance, you are a box hugger. Coding the infra requires that you treat automation artifacts (shell scripts, puppet manifests, fabric scripts etc.) and configuration as code. If you have no repeatable code to bring up bare infra into a desirable operational state, then you are a box hugger. Box hugging is a bad habit, and is bad for the business. It makes recovery from failures time-consuming. It does not scale with needs. Most fat-finger and admin cockup related outages start with box hugging. Sure, it may have worked 100 times, but just one fat-finger mistake is enough to make your team’s life miserable.

Two steps cure box hugging. First, internalize the idea that the box you’ve just finished setting up meticulously is going to burst into flames the very next minute. Second, treat operations the same way as you would treat software development.

Coding the infra is not hard.

1. Treat infra as ephemeral

Infra is not permanent. It will fail. You can estimate MTBF with some assumptions, but failures won’t follow estimates. MTTR is more important than MTBF. When you treat infra as ephemeral, the act of bringing up new infra to a desired operational state becomes a normal and known practice. You ignore the dead nodes, and focus on bringing up new nodes as quickly as possible.

2. Think of system setup as a series of state changes

Start with the basic infra, and apply a sequence of steps to change the state of the infra to bring it to a desired state. The steps could be installing packages, configuring them, starting servers, setting cron jobs and so on. This is no different from most coding exercises — start from a known state, apply some computations, and arrive at a new state.

3. Make the steps repeatable

This is like coding any math problem. You first solve it on paper to arrive at an algorithm. Then you code the algorithm so that you can repeat it every time you need to solve the same math problem again. It is the same with operational changes. It might seem time-consuming to treat operations this way, but unless you make the steps repeatable through automation, you can’t recover from failures easily. Repeatability is a way of rehearsing recovery. Node died? Cool, just run the automation to bring up a new node. You’re back in business.

4. Implement idempotency

Repeatability alone is not sufficient when the state changes are numerous. You need to make each state change idempotent. Apply the same change again, and the system should not burn up. Practicing idempotency makes the outcome certain. If something breaks in the middle, you can replay the whole sequence of changes, knowing that each step is idempotent.
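As a small illustration of an idempotent state change, the sketch below ensures that a configuration file contains a given line. The helper name and the example setting are hypothetical; tools like puppet provide the same guarantee at a higher level. The point is that the step checks the current state before changing it, so replaying it is harmless.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file; return True if a change was made."""
    target = Path(path)
    existing = target.read_text().splitlines() if target.exists() else []
    if line in existing:
        # Already in the desired state: do nothing, so reruns are safe.
        return False
    existing.append(line)
    target.write_text("\n".join(existing) + "\n")
    return True
```

Calling `ensure_line("/tmp/demo.conf", "max_connections = 512")` twice changes the file only the first time. A naive "append this line" step, by contrast, would add a duplicate on every replay.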

5. Review, test and version control

Finally, apply the same engineering rigor to automation artifacts as you would apply to software development — that is, ensure that the automation scripts are peer-reviewed, tested and maintained in source control. There should be no difference.

DevOps is not just about integrating dev and ops, but is about treating operations as development, and development as operations.

Nodejs vs Play for Front-End Apps

2011-03-25T15:37:47+00:00

I’m resuscitating this old article to support some inbound traffic.

Mar 29, 2011: The source used for these tests is now available at https://github.com/s3u/ebay-srp-nodejs and https://github.com/s3u/ebay-srp-play.

Mar 27, 2011: I updated the charts based on new runs and some feedback. If you have any tips for improving numbers for either Nodejs or Play, please leave a comment, and I will rerun the tests.

We often see “hello world” style apps used for benchmarking servers. A “hello world” app can produce low-latency responses under several thousands of concurrent connections, but such tests do not help make choices for building real world apps. Here is a test I did at eBay recently comparing a front-end app built using two different stacks:

nodejs (version 0.4.3) as the HTTP server, using Express (with NODE_ENV=production) as the web framework with EJS templates and cluster for launching node instances (cluster launches 8 instances of nodejs for the machine I used for testing)
Play framework (version 1.1.1) as the web framework in production mode on Java 1.6.0_20.

The intent behind my choice of the Play framework is to pick a stack that uses rails-style controller and view templates for front-end apps, but runs on the JVM. The Java-land is littered with a large number of complex legacy frameworks that don’t even get HTTP right, but I found Play easy to work with. I spent nearly equal amounts of time (under two hours) to build the same app on nodejs and Play.

The test app is purpose built. It includes a single search results page that renders search results fetched from a backend source. The flow is simple — the user submits some text, the front-end fires off a request to the backend, the backend responds with JSON, the front end parses it, and renders the results using a set of HTML templates. The idea of this app is to represent front-end apps that produce markup with/without backend IO.

In my test setup, the average result from backend is about 150k — it is JSON formatted and not compressed. The results page consists of 8 templates — each for different parts of the page like header, footer, sidebar etc. The sizes of the template files range from 250 bytes to under 2k. In order to ensure that backend latency does not influence testing, search requests are proxied through Apache Traffic Server acting as a forward proxy. The cache is tuned to always generate a hit. Such a high cache hit is not realistic, but it helped me isolate the cost of having to go through uncontrolled public Internet to get search results for my testing.

Note that the test environment is not ideal: the test client, the server, and the cache were all running on the same box. The box is a quad-core Xeon with 12GB of RAM running Fedora 14 (2.6.35.6-45.fc14.x86_64 kernel).

I ran the tests using ab.

ab -k -c 300 -n 200000 {URI}

The tests include the following configurations:

Render — No IO: Render the page without any IO — this configuration generates HTML from the templates with empty results.
IO + Render: Render the page with results.
IO — No Render: Fetch results but don’t render — this is an unrealistic case, but it helps highlight the cost of IO vs cost of template processing.

The charts below show requests per second and mean response time.

From these, you can see that Nodejs beats Play on throughput as well as response time. However, in the pure IO case, I would not discount non-blocking IO on the JVM. I plan to post more results dealing with IO + computation scenarios.

The charts below show the percentage of requests completed within a certain amount of time in msec. The shorter the bars the better. Also less variance as you read from left to right on each chart is better — I would ignore the last set of bars on the right (time to complete 100% of the requests) as they may contain outliers.

When the workload involves generating HTML from templates off the file system without performing any other IO, nodejs performs twice as well as JVM based Play. As we introduce IO, performance across the board suffers, but more so with blocking IO on Play. But Play is able to catch up with non-blocking IO (via continuations).

I’m unable to make the source code for the test apps available publicly at this time. But I plan to create and post some new tests on github soon.

Performance of RESTful Apps

2011-03-01T15:37:47+00:00

I’m resuscitating this old article to support some inbound traffic.

A while ago I showed how chatty some well-known apps are on my iPhone. But this issue is neither new nor unique to apps on phones and similar devices. Efficient data retrieval from distributed/decentralized servers is a well-recognized problem in distributed computing. For instance, in the abstract of his November 1994 paper A Note on Distributed Computing, Jim Waldo notes the following (emphasis mine).

We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.

Most front-end developers by now know and follow the best practices that Yahoo!’s Exceptional Performance team documented a few years ago. However, the REST community may have missed the bus and has some catching up to do. Performance of RESTful Apps is not one of the most frequently talked about topics online or in print. From talking to various teams, it often seems that a great deal of time is spent on URI/representation design, schemas, use of the uniform interface for CRUD, using the hypertext constraint etc. No doubt — these topics are all very important, but understanding and accounting for the performance characteristics in the design and implementation of server and client apps is no less crucial.

Here are some techniques to help build high-performance RESTful apps.

Composites for Performance

The best of all web performance techniques is to minimize the number of HTTP requests. However, RESTful Apps rarely follow this practice. This difference stems from how each side sees resources.

On one side, front-end folks optimize their servers to serve bulk representations for CSS, JavaScript or image sprites, or even data URIs for images to reduce the number of HTTP requests and thereby latency. On the front-end, most resources are in fact composites.

On the other side, API/service developers prefer clean looking resources and URIs (shown on the right side above). Though this can lead to chatty network usage, there is one specific advantage in offering a set of resources that are independent and less coupled with other resources — it leaves room for clients to innovate. It lets them combine data from multiple resources in numerous ways that the resource developers could not possibly think of.

Loose coupling is another benefit of this approach, as clients can evolve rapidly on their own.

The expense is of course latency, particularly when those client apps are not very close to the servers. Each client may need to submit several requests to the server in order to get its job done. So, how do we go about fixing this without sacrificing the flexibility of less-coupled resources? One answer is to use composite resources. See my RESTful Web Services Cookbook for details.

With a composite, instead of sending n HTTP requests over 1 to n connections, the client can open just one TCP connection to send an HTTP request to retrieve the data it needs, just like a browser getting CSS or JavaScript bundles in the front-end. A composite changes a pattern like

GET /something HTTP/1.1  
Host: www.example.org

GET /something-else?params HTTP/1.1  
Host: www.example.org

GET /some-other-thing-related-to-something?params HTTP/1.1  
Host: www.example.org

into

GET /get-all-things-i-need-about-something?params HTTP/1.1  
Host: www.example.org

Each composite can generate a projection of state required for one or more clients. These composites can be more specialized than the resources they aggregate - as each composite can cater to particular client needs.

(P.S.: My usage of the term “resource” is not precise here as a “composite” is also a resource.)

This approach also shifts issues related to concurrency (such as ordering of requests based on success or failure), CPU (for generating projections, correlating related data from across representations etc.) and I/O (to fetch back-end resource representations in a serial/parallel fashion depending on dependencies) workloads from the client to the server.
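A minimal sketch of what the server does on behalf of the client, assuming the underlying resources are independent of each other: fetch them in parallel and merge the results into one projection. The fetcher callables here are placeholders for real HTTP calls to the resource servers, and the resource names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def build_composite(fetchers):
    """Fetch independent resources in parallel and merge them into one dict.

    fetchers maps a name in the composite to a zero-argument callable
    that returns that resource's representation.
    """
    if not fetchers:
        return {}
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {name: pool.submit(fetch) for name, fetch in fetchers.items()}
        # result() re-raises any fetch error, so a partial failure surfaces here.
        return {name: future.result() for name, future in futures.items()}


# Placeholder fetchers standing in for calls to the resource servers.
composite = build_composite({
    "something": lambda: {"id": 1},
    "something_else": lambda: ["a", "b"],
    "related": lambda: {"count": 2},
})
```

Resources that depend on each other would need sequencing instead, and a real broker also has to decide what to return when only some of the fetches succeed.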

On the server side, the server hosting composites can also optimize its connection handling to resource servers to reduce TCP handshake and slow-start overhead. For instance, it can maintain pools of persistent/long-lasting connections (e.g., with keep-alive) between servers hosting composites and the resources (shown by bold arrows above).

In this post, I’m not going to discuss software choices to serve composites, but you may need to account for several features:

multi-tenancy or isolation of code execution, configuration and deployment so that different teams can build composites
data or control flow for fetching representations in parallel or sequentially based on inter-dependencies
query languages (such as YQL) to normalize data formats and to easily create projections
non-blocking or asynchronous I/O to better tackle I/O workloads

I’m personally excited about nodejs and async I/O support in Java 7 as both would let us build small and nimble broker apps to serve composites.

Of course - the idea of a composite is to add an extra layer of indirection on the server side to offset network overhead when performance is at stake. It is not meant to replace loosely coupled resources that can be manipulated using HTTP and linked using hypertext controls like links in representations.

Better Connection Reuse

Long-lasting TCP connections help reduce connection-establishment overhead and help the TCP stack settle on an appropriate congestion window size. Reusing connections is usually trivial, and pooling is often part of client libraries. But there are a few precautions to take at the application level.

Avoid Explicit Connection Closing

\# Don't do this  
HTTP/1.1 200 OK  
Content-Type: application/json  
Content-Length: 1234  
Connection: close

{ ... body ... }

The first precaution to take is not to add Connection: close to requests or responses by default. Carelessly adding this header to requests or responses will prevent connection reuse. There better be a good reason to add this header such as a server that can’t handle too many open connections, or to prevent abuse.

Avoid Close Delimited Messages

\# Avoid this  
HTTP/1.1 200 OK  
Content-Type: application/json

{ ... body ... }

Make sure to include Content-Length or use Transfer-Encoding: chunked so that the recipient of an HTTP message can know when one HTTP request/response message ends and when the next one starts. If a response has neither of these two, then the recipient will need to read the stream till the connection is closed, which means that the connection cannot be reused. Close-delimited HTTP request/response messages are bad for performance.
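The rule above can be sketched as a small check over a message's headers. The function names are hypothetical, and the sketch ignores messages that never carry a body (such as 204 and 304 responses).

```javascript
// Lower-case header names and values so lookups are case-insensitive.
function normalize(headers) {
  const out = {};
  for (const name of Object.keys(headers)) {
    out[name.toLowerCase()] = String(headers[name]).trim().toLowerCase();
  }
  return out;
}

// Report how the end of the message body is signaled.
function bodyFraming(headers) {
  const h = normalize(headers);
  const codings = (h['transfer-encoding'] || '').split(',').map((c) => c.trim());
  if (codings[codings.length - 1] === 'chunked') return 'chunked'; // ends at the zero-sized chunk
  if (h['content-length'] !== undefined) return 'content-length';  // ends after N bytes
  return 'close-delimited';                                        // ends when the connection closes
}

// A connection can carry another message only if this one has a clear end
// and neither side asked to close.
function allowsReuse(headers) {
  const h = normalize(headers);
  const tokens = (h['connection'] || '').split(',').map((t) => t.trim());
  return bodyFraming(headers) !== 'close-delimited' && !tokens.includes('close');
}
```

For the "Avoid this" response above, `bodyFraming` reports `close-delimited`, which is exactly the case that prevents reuse.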

Read Messages Completely

Incomplete reads will also prevent reuse of the connection. Incomplete reads usually happen when a client receives an error or a redirect response. These kinds of responses can have a body but clients can determine what to do by just looking at the response line and headers. For instance, a 409 response may have a body that explains why.

HTTP/1.1 409 Conflict  
Content-Type: text/html  
Content-Length: 1234

 ...

But not reading the body from the connection may prevent reuse in some frameworks - Java is known for this. Also be wary of libraries that translate 4xx and 5xx response codes into exceptions - in this case, in addition to catching the exception, the client will need to read the body.
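The same care applies in Node.js, where an unread response body keeps the socket busy. Here is a minimal sketch, assuming `res` is an `http.IncomingMessage`; the `handleResponse` helper and its return shape are made up for illustration.

```javascript
// Hypothetical response handler. Redirects and errors are decided from the
// status line and headers alone, but the body is still drained.
function handleResponse(res, onBody) {
  if (res.statusCode >= 300) {
    // resume() switches the stream to flowing mode and discards the data,
    // so the connection can be reused once the body has been read.
    res.resume();
    return { ok: false, status: res.statusCode };
  }
  res.on('data', onBody);
  return { ok: true, status: res.statusCode };
}
```

The point is only that the error path consumes the body too, even when the client has no use for it.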

Tune Idle Connection Timeouts

Servers and clients usually close idle connections upon a timeout to conserve resources. Settings for idle connection timeout may be hard to find or even not exposed in client/server frameworks. When tuning is possible, ensure that the defaults are reasonable.

Try Proxies for Long-Haul Traffic

In some cases a configuration like the following can help:

Client apps that are short-living (like an app on a mobile/tablet or even on desktops), and hence connections can’t be persistent. Instead, the proxy can keep connections persistent which limits connection establishment cost to the first leg from client app to the nearest proxy.
Client apps that can’t maintain too many persistent connections — which is still the case for browsers today — though it is slowly changing.

Of course, this approach also lets the server distribute responses to caches on those proxies to further reduce network cost. Many variations of this approach are possible depending on how your servers are distributed and how far client apps are from servers. If you’re new to the idea of REST and are still wondering why HTTP’s uniform interface is such a big deal, here is why - once you implement HTTP reasonably correctly, you can reconfigure servers, proxies and caches as necessary without code changes.

Progressive Serving of Representations

Sometimes the bottleneck is not the network but generating the response. This is particularly true for composite resources or resources that rely on a number of data sources to generate a response to the client. The typical flow in such cases is as follows:

read the request data such as the path and query string
decide what to fetch
fetch data from each dependent source in sequence or concurrently to the extent possible (which depends on dependencies)
prepare data for the response
write the data to the response

Of these steps, when I/O for n dependent sources is done sequentially and each source takes about t time, the server takes at least n*t time to generate a response. If all the I/O can be done in parallel, it takes max(t) time, i.e., the response is at least as slow as the slowest source.
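To make the arithmetic concrete, here is a small sketch that computes both lower bounds from a list of per-source latencies. The latency values in the usage note are made up.

```javascript
// Lower bound when sources are fetched one after another: the sum of all latencies.
function sequentialTime(latencies) {
  return latencies.reduce((sum, t) => sum + t, 0);
}

// Lower bound when all sources are fetched concurrently: the slowest source.
function parallelTime(latencies) {
  return latencies.length ? Math.max(...latencies) : 0;
}
```

With hypothetical latencies of 20, 50 and 500 milliseconds, `sequentialTime` gives 570 while `parallelTime` gives 500, so the slowest source dominates either way.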

On the front-end side, when it takes time to generate a page, a common practice is to turn to XMLHttpRequest or iframes to split the page into fragments and defer loading of slower parts of the page. Both these techniques potentially use additional connections. In a multi-tiered setup, this causes a flood of new requests from the browser to front-end servers, and from there to backend servers and so on. This also introduces new state management and security problems as the server may need to push state first to the browser only to get it back via XMLHttpRequest immediately.

An alternative is to progressively render the page over a single connection. In this case, the flow would be

read the request data such as the path and query string
decide what to fetch
fetch data from fast sources
initiate requests for slow sources
serve partial page based on response from fast sources
as and when a slow source responds, prepare a partial response and write to the client
after all the sources respond (or after some timeout), write additional chunks and finally end the page

Here, by “chunk” I mean “part of a message” and not an HTTP chunk.

The goal of this technique is to reduce user-perceived latency without using more network connections from the browser. In this flow, the browser makes an HTTP request to the front-end server, which writes snippets of markup and script over a period of time before ending the response. Since the server does not know the Content-Length of the page in advance, it uses chunked transfer encoding, where the end of the response is signaled by a zero-sized chunk.
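For readers who have not looked at chunked transfer encoding on the wire, here is a minimal sketch of the framing. Node's http module does this automatically when no Content-Length is set; the `encodeChunk` helper below is only an illustration.

```javascript
// Frame one piece of the body as an HTTP/1.1 chunk:
// size in hex, CRLF, the data, CRLF.
function encodeChunk(data) {
  const body = Buffer.from(data, 'utf8');
  if (body.length === 0) {
    // An empty chunk would look like the end of the response, so skip it.
    return Buffer.alloc(0);
  }
  return Buffer.concat([
    Buffer.from(body.length.toString(16) + '\r\n', 'ascii'),
    body,
    Buffer.from('\r\n', 'ascii'),
  ]);
}

// The zero-sized chunk that tells the recipient the response is complete.
const LAST_CHUNK = Buffer.from('0\r\n\r\n', 'ascii');
```

A server can emit any number of chunks over time and then `LAST_CHUNK`, which is what lets it start responding before the whole page is known.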

This is called “progressive rendering”. This technique is well-known in front-end circles and Facebook calls this technique BigPipe. Progressive rendering depends on two things:

Server being able to write chunks over lasting connections — asynchronous I/O based servers like nodejs are very attractive for this (see my nodejs example or Bruno’s example using continuations).
Clients being able to process response as it arrives — in Javascript capable browsers, this capability is already present.

We can apply this technique for non-front-end resources as well provided (a) it is possible to retrieve data from fast sources before slow sources, and (b) data from fast sources is meaningful to clients. For instance, think of a personalized product resource that includes data about a product plus IDs, links, and brief summaries of related products. In this case, product data can be looked up from storage in near-constant time (say, about 20 milliseconds) while finding related products may involve performing some computations on the user profile, past purchase history and other derived data which can be time consuming - say, taking up to 500 milliseconds. Here is an example of a progressive representation of such a product resource.

HTTP/1.1 200 OK  
Content-Type: multipart/mixed; boundary=abcdef  
Transfer-Encoding: chunked

\--abcdef  
Content-Type: application/json

{ ... product data here ... }

\--abcdef  
Content-Type: application/json

{ ... related products here ... }      
\--abcdef--

In this example I used a multipart media type as it provides a visible boundary between different portions of the representations, and the client can read the representation part by part.
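As a rough sketch of the client side, the function below splits a multipart body into its parts using the boundary. It works on a complete string for brevity; a real client would parse incrementally as data arrives, which is the whole point of the technique. The `splitMultipart` name and the returned shape are assumptions.

```javascript
// Split a multipart body into parts, each with lower-cased headers and a body.
function splitMultipart(body, boundary) {
  const sections = body.split('--' + boundary);
  const parts = [];
  // sections[0] is whatever precedes the first delimiter; skip it.
  for (const section of sections.slice(1)) {
    if (section.startsWith('--')) break; // closing delimiter: --boundary--
    const separator = section.search(/\r?\n\r?\n/); // blank line after the part headers
    if (separator === -1) continue;                  // malformed part, ignore
    const headers = {};
    for (const line of section.slice(0, separator).trim().split(/\r?\n/)) {
      const colon = line.indexOf(':');
      if (colon > 0) {
        headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
      }
    }
    const content = section
      .slice(separator)
      .replace(/^\r?\n\r?\n/, '') // drop the blank line
      .replace(/\r?\n$/, '');     // drop the line break before the next delimiter
    parts.push({ headers: headers, body: content });
  }
  return parts;
}
```

Applied to the response above, the first part carries the product data and the second the related products, so the client can act on the first without waiting for the second.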

If the client is a front-end app that generates an HTML product page for browsers, it can progressively render the product page as soon as it receives the first part, and then render markup for list of related products when the second part arrives.

HTTP/1.1 200 OK  
Content-Type: text/html  
Transfer-Encoding: chunked

  
  ...  
    
    ... HTML for the product data ...

This shows how progressive generation of arbitrary representations can be combined with progressive serving of the front-end to reduce perceived latency.

Epilogue

One of the patterns to notice from this post is that design considerations between HTML serving front-end apps and JSON/XML/whatever-speaking RESTful apps are not entirely different. Both rely on the same set of core architectural principles such as the uniform interface, visibility, hypertext, and so on. Whatever lessons we learn on the front end are certainly applicable for the so-called API servers.

Finally, it goes without saying that premature optimization is evil. My goal in this post is to point out the techniques you may already have in your toolkit. Apply them based on need and experimentation.

If you find this post useful, try my book: RESTful Web Services Cookbook.

Can Pipelining Help?

2011-02-05T15:37:47+00:00

HTTP pipelining is often suggested as a way to dramatically improve page load times, or to solve multi-GET use cases for RESTful applications. Whether pipelining can achieve the intended effect or not truly depends on what gets pipelined and how the server implements pipelining.

When using pipelining, an HTTP client sends idempotent HTTP requests (such as GET) without waiting for the responses to previous requests, and expects responses to arrive in the same order from the server. HTTP 1.1 says nothing about the order in which the server processes requests: servers can process each request in sequence or in parallel. All that matters is the order of responses. However, in the real world, pipelining is not often used due to a number of interoperability issues. Mark Nottingham recently captured some of these issues in an internet draft:

Anecdotal evidence suggests there are a number of reasons why clients don’t use HTTP pipelining by default. Briefly, they are:

1. Server implementations may stall pipelined requests, or close their connection. This is one of the most commonly cited problems.

2. Server implementations may pipeline responses in the wrong order. Some implementations mix up the order of pipelined responses; e.g., when they hit an error state but don’t “fill” the response pipeline with a corresponding representation.

3. A few server implementations may corrupt pipelined responses. It’s been said that a very small number of implementations actually interleave pipelined responses so that part of response A appears in response B, which is both a security and interoperability problem.

4. Clients don’t have enough information about what is useful to pipeline. A given response may take an inordinate amount of time to generate, and/or be large enough to block subsequent responses. Clients who pipeline may face worse performance if they stack requests behind such an expensive request.

Even if we fix all the interoperability issues (such as 1, 2, and 3 above), pipelining will not necessarily improve anything. Unlike non-pipelined requests, clients need to know a bit about the server’s implementation before deciding to pipeline requests. Here is why.

The key constraint in pipelining is that the server must send responses in order. This leads to the so-called head-of-line blocking problem.

Assume that the client opens a connection and sends three GET requests, g1, g2, and g3. Of these, let’s say that g1 takes longer to process than g2 and g3. But the server is still required to return responses in the sequence of g1, g2, and g3. Here is one possible implementation in a multi-threaded server.

Server receives a connection, and it gives the associated channel/stream to a thread t0
Server starts parsing the data in t0
Server finds g1, and hands it off to an application handler h1 in thread t1
Server finds g2, and hands it off to an application handler h2 in thread t2
Server finds g3, and hands it off to an application handler h3 in thread t3
h2 finishes first and wants to write response — server blocks it since h1 has not finished yet
h3 finishes next and wants to write response — server blocks it too since h1 has not finished yet
h1 wants to write response — since g1 is the first request, the server lets it
Server unblocks h2, and it writes response
Server unblocks h3, and it writes its response

In this model, the server explicitly blocks application handlers from writing response until it is their turn. Alternative implementations are possible:

The server can wait to read the next request (i.e., the request line, headers and any body) until the previous request is completely processed.
The server can buffer responses of application handlers (at least of those that finish earlier than previous requests) and write them in order to the client.
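The second alternative, buffering responses that finish early, can be sketched as follows. This is not how any of the servers mentioned here implement it; the `OrderedResponseWriter` class is a hypothetical illustration of the ordering constraint.

```javascript
// Buffers responses that finish out of order and releases them in request order.
class OrderedResponseWriter {
  constructor(write) {
    this.write = write;       // function that sends one response to the client
    this.nextToSend = 0;      // index of the oldest request not yet answered
    this.pending = new Map(); // finished responses waiting for their turn
  }

  // Called by a handler when the response for request number `index` is ready.
  complete(index, response) {
    this.pending.set(index, response);
    // Flush every response that is now at the head of the line.
    while (this.pending.has(this.nextToSend)) {
      this.write(this.pending.get(this.nextToSend));
      this.pending.delete(this.nextToSend);
      this.nextToSend += 1;
    }
  }
}
```

If the handlers for g2 and g3 complete first, nothing is written until g1 completes, which is the head-of-line blocking described above.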

Over some limited tests during the weekend, I found that both Netty and Tomcat follow the first approach while Nodejs follows the second approach. Both approaches have their limitations, in particular when one of the requests early in the pipeline takes time to complete. In such cases, the client is better off sending g1 over one connection and pipelining g2 and g3 on a second connection. This will reduce the serialization window on the server. However, in order to make such a choice, the client needs to have some prior idea about the workloads involved in processing each request. When such information is difficult to ascertain (e.g., for a browser sending requests to arbitrary servers), connection reuse via keep-alive is a safer bet than pipelining. In any case, it is better to test before enabling pipelining in clients.

Chatty Apps

2011-01-25T15:37:47+00:00

I’m resuscitating this old article to support some inbound traffic.

We know that the first practice to speed up performance of a site is to minimize the number of HTTP requests. The same should be true for mobile apps too, but the results from some of the apps I commonly use on my iPhone show that these apps have not paid enough attention to this practice. I used Charles Proxy for my tests. I cannot vouch for the correctness of some of the headers reported by this proxy, but the common pattern I noticed is that the number of HTTP requests these apps fire is not small. This pattern has an obvious impact on user-perceived latency and even battery life.

Here is a summary.

Bing

Task: Search

1-n GET requests for auto-completing search string.
One POST with XML for fetching search results
A couple of POST requests for instrumentation — these are like beacon requests with zero length responses.

Google

Task: Search

Seven short POST requests exchanging some application/binary content — not sure what their purpose is
1-n GET requests for auto-completing search string. The response is labeled application/json, but it is actually JavaScript (and so the Content-Type should be application/javascript).
Four GET requests to log suggestions. These are again like beacon requests with zero-length responses.
One GET for search results

Mint

Task: Open the app

26 requests over TLS — I suspect that the number of requests depends on the number of accounts — now I know why this app is so slow on my iPhone.

Netflix

Task: Open the app

One request over TLS, probably for some token exchange.
One GET to fetch some config data
One GET to fetch some policy related info
One GET to get date-time from server (why not Date from a previous response?)
One GET to get rental history, two GETs to get rental queue, two POSTs for ratings, one GET for account info, another GET config request, one GET for catalog, and a POST for some misc data
A number of unconditional GET requests for static assets with expires set to a day later

Amazon

Task: Open the app

Just two POST requests with application/octet-stream encoded data.

LinkedIn

Task: Open the app

One GET for the profile
One GET for messages
One GET for some alerts
One GET for favorites
One POST to report metrics

The order and number of these requests depend on the UI state of the app.

Facebook

Task: Open the app

Six pre-flight GET requests — not sure what the purpose is — there is not much that I could discern from responses
One request over TLS
A POST multipart/form-data request with an XML response (falsely advertised as text/html) with some profile data
One multipart/form-data POST request — the response is XML encoded within XML.
A number of unconditional GETs for static assets — expiry set for a few months

Ignoring some funny use of HTTP, my key observation is that most apps are built on top of existing “APIs”. The APIs are providing access to different types of data, and the app is aggregating that data from the client (the phone) side. So, even simple actions like opening an app cause a number of requests from the client. All the apps I tested are branded and not built by third-parties and hence do have every chance to optimize the traffic.

BigPipe Done in Node.js

2010-07-25T15:37:47+00:00

Here is a quick hack

Stephan Schmidt says

I’ve implemented a proof of concept of BigPipe in Java (should run as-is in every servlet container):

See his blog post for the Java servlet class.

Here is the same (or more?) written in Node.js.

var http = require('http');  
var url = require('url');

// Front-end server: writes the page skeleton first, then streams each  
// module's snippet to the browser as soon as the back end returns it.  
http.createServer(function(request, response) {  
    // Write the document  
    response.writeHead(200, {"Content-Type" : "text/html"});  
    response.write("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"" +  
            "   \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">");  
    response.write("<html>");  
    response.write("<head><title>Progressive Loading</title></head><body>");  
    // Placeholder divs that the snippets below fill in (markup is illustrative)  
    for(var i = 0; i < 6; i++) {  
        response.write("<div id='" + i + "'></div>");  
    }  
    response.write("\n");

    // Now the snippets  
    var down = 6;  
    for (i = 0; i < 6; i++) {  
        http.get({host : "localhost", port : 2000, path : "/?id=" + i},  
                function (proxyResponse) {  
            proxyResponse.on('data', function(chunk) {  
                response.write(chunk);  
            });  
            proxyResponse.on('end', function() {  
                // Count a module as done only once its body has fully arrived  
                if(--down == 0) {  
                    response.end("</body></html>");  
                }  
            });  
        });  
    }  
}).listen(8080);

// Back-end server: returns a script snippet that fills in one module  
http.createServer(function(request, response) {  
    // Some delay up to 2 seconds  
    var delay = Math.round(Math.random() * 2000);

    setTimeout(function() {  
        var params = url.parse(request.url, true);  
        var id = params.query.id;  
        response.writeHead(200, {"Content-Type" : "text/html"});  
        var content = "<span>Content of Module " + id + "</span>";  
        response.write("<script>document.getElementById('" + id +  
                "').innerHTML = \"" + content + "\";</script>");  
        response.end();  
    }, delay);  
}).listen(2000);

I’ve no comments on the technique itself. The basic idea may have been implemented a number of times in different languages. The mileage varies based on the language/environment used.

POST Caching Example

2008-11-15T15:37:47+00:00

Here is a good example of caching of POST responses by Henrik Nordström in the HTTP WG.

The classic example of where a cacheable response to POST makes sense is the guestbook example (or unmoderated blog comments for those needing a more modern example, basically the same thing in a different era) where the visitor POSTs an addition to the page currently viewed, or “separate entity that accepts annotations” as it’s expressed in the RFC. The response given to the POST is the new representation of the page. Both GET and POST uses the same URL in this example.

followed by

Note that the POST request as such can never be satisfied from cache. A repeated POST with the same form content will not yield the same result even if the response is cacheable.

The context of this thread was whether the HTTP method should be included in the cache key. See this thread for a complete summary.
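The behavior described in the two quotes can be sketched with a toy cache keyed by URL only. This is an illustration under simplifying assumptions: it ignores freshness, validation and Vary, and the `PageCache` class and the `cacheable` flag are made up.

```javascript
// Toy cache: entries are keyed by URL, not by method.
class PageCache {
  constructor() {
    this.entries = new Map();
  }

  // `origin` is a function (method, url) -> { body, cacheable }.
  handle(method, url, origin) {
    if (method === 'GET' && this.entries.has(url)) {
      return { body: this.entries.get(url), fromCache: true };
    }
    // A POST is never satisfied from the cache; it always reaches the origin.
    const response = origin(method, url);
    if (response.cacheable) {
      // The response to the POST is the new representation of the page,
      // so it can answer later GETs for the same URL.
      this.entries.set(url, response.body);
    }
    return { body: response.body, fromCache: false };
  }
}
```

Each POST to the guestbook URL reaches the origin and replaces the stored page, while GETs in between are served from the cache.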

Resource Identity and Cool URIs

2008-10-28T15:37:47+00:00

In response to my InfoQ article on Describing RESTful Applications, some of the comments I received so far dealt with resource identity. When I sent a draft to Stefan in late October, he was curious to see why I used ID elements to capture a unique identifier of each resource.

Here is a snippet from one of the examples I used in that article.

JJ, in his latest post on the same article makes a similar comment.

Interestingly Subbu also defined a proprietary ID mechanism reinforcing the idea that a URI is not generally used for identity purposes. I would have preferred a “link” to a unique resource identifier.

A similar thought was expressed by Nick Gall in a thread on rest-discuss where he says

So ultimately, I’d prefer to see all identifiers as URLs (not just URIs) and have such URLs be permanent.

Since URIs are supposed to be permanent, i.e., since cool URIs don’t change, we should be able to use URIs to identify a given resource, and ideally there should be no need for proprietary identifiers. However, in reality, URIs are unreliable substitutes for identifiers for client applications to rely upon. Allow me to elaborate.

Look at the following HTTP GET requests.

GET /person/abc  
Host: [www.example.org](http://www.example.org)

200 OK  
Content-Type: ...

  
    
      rel="http://www.example.org/rels/person-with-addressbook"/>  
  Subbu  
  Allamaraju  
  subbu@nospam.com  
  ...

GET /myapp/person/abc?include=addressbook  
Host: [www.example.org](http://www.example.org)

200 OK  
Content-Type: ...

  
    
  Subbu  
  Allamaraju  
    
      
      ...  
      
    ...

GET /myapp/people?like=subbu  
Host: [www.example.org](http://www.example.org)

200 OK  
Content-Type: ...

      
    
    
      
    Subbu  
    Allamaraju  
    
    
      
    Subbu  
    Somebody

In each response, the client is receiving information about the same person. In the first case, it is receiving the first name and last name, in the second case, it is receiving first name, last name, and the person’s address book, and in the third case, it is finding the same person through a search.

Now, let us think of an on-line game review site that uses the server at http://www.example.org for all user data.

Here are some possible user scenarios.

I log into the game review site, and upon login, it greets me with my first name and last name.
I click on a link to view my address book.
One of my friends logs into this site, types in “subbu” in some search box, finds my name in the search results, and then clicks on a link to view reviews posted by me and all the contacts in my address book.

To implement these scenarios, the client needs to be able to (a) recognize that all responses refer to the same user, and (b) store additional data in its database using the user’s identity as a foreign key. What can the client rely upon?

Let me start with the “self” links. The person has a self link in each case, but they are all different. The client cannot determine that the person with name Subbu Allamaraju found in the search results is the same as the one in the first or the second response. So, self links are useless to implement these scenarios.

There are three possible solutions to fix this problem.

Let the client guess that they all refer to the same person by trying to parse the URI.
Introduce another link with a relation value of, say, http://www.example.org/rels/identity and a URI that uniquely identifies the entity in question.
Introduce an identifier in each representation that uniquely identifies the thing in question.

The first is an obvious no-no since it breaks URI opacity.

Of the remaining two options, I prefer the third one since what the client application needs is an identifier that uniquely identifies the entity, although the second option will work as well.

The key point is this. URIs uniquely identify resources but a URI used to fetch something is not always a good candidate to serve as a unique identifier in client applications. As I showed in the above example, there can be several URIs to fetch different kinds of information about the same entity. As far as HTTP is concerned, for the above example, there are three resources, each with a different URI. But as far as the client and server applications are concerned, we are talking about the same entity, which is a person. The URI that can be used to fetch these does not tell the client that they are the same. We need identifiers for that.

My design choice therefore is to include an identifier in every representation to uniquely identify the entity in question. I prefer using a URN as the value of these identifiers, since URNs are intended to serve as “persistent, location-independent, resource identifiers”.
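Here is a small sketch of how a client might use such identifiers. The parsed representations are plain objects with hypothetical `id` and `self` fields, and the URN and self links in the usage note are invented for illustration.

```javascript
// Group representations by their identifier so the client can tell that
// responses fetched from different URIs describe the same entity.
function indexById(representations) {
  const byId = new Map();
  for (const rep of representations) {
    const entry = byId.get(rep.id) || { id: rep.id, selfLinks: [] };
    entry.selfLinks.push(rep.self); // self links differ; the id does not
    byId.set(rep.id, entry);
  }
  return byId;
}
```

Given three representations that share the id `urn:example:person:abc` but carry different self links, the index holds one entry with three self links, and that id can serve as the foreign key in the client's database.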

Explaining State in HATEOAS

2008-10-15T15:37:47+00:00

I’m resuscitating this old article to support some inbound traffic.

Explaining “state” in “Hypermedia as the Engine of Application State” (HATEOAS) is a bit tricky, particularly when you have to do it under two minutes.

The problem is that the word “state” means different things to different people. For most of us coming from some background in web development, state usually involves numbers, strings, booleans, and other objects stored in some place, say, in an in-memory session. For instance, every beginner-level book on web development includes a shopping cart style sample that stores the cart in an in-memory session. If we extend that notion to understand or explain HATEOAS, it would make us jump to the conclusion that, to make hypermedia the engine of application state, the server will have to encode similar objects into some XML or such form of representation in response to each request. This line of thinking is a trap.

Here is why. Once the server includes such state in a representation, the next step for the client is to replay the state in future requests to the server. Then we are talking about exchanging those objects back and forth, and not every HTTP verb has enough room to carry that state. This line of thinking will then start to shoot holes in the notion of a uniform interface because, to fit the state in a request, the client may have to resort to POST. I can almost see message passing over POST as the next logical step. At this stage, whoever is trying to explain HATEOAS may have to make some lame excuses and move on. Whoever is listening will then conclude that “yeah, this won’t work for my apps”.

Here is an example that I find most useful to explain the “state” in HATEOAS.

There are three pages in a UI. The first page has a link to go to the second page. The second page has a link to go to the previous page as well as the third page. The third has a link to the second page and another link to the first page.

A client starts from the first page, and then through the link on that page, goes to the second page. The fact that this page has one link to the first page and another to the third page implies that the current state of the application (i.e. the interactions) is that “the client is viewing the second page”. That is what is meant by hypermedia as the engine of application state. It does not necessarily mean serializing application state, such as “2”, into representations.

I admit that I am simplifying this a bit. The point is that state does not necessarily mean some data stored in representations. HATEOAS means that representations reflect the current state of the app through links with known relations. Those links may contain opaque references to some persistent state on the server.
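The three-page example can be sketched in a few lines. The URIs, relation names and the `follow` helper are assumptions; the point is that no page number is stored anywhere, and the links in the current representation are all the client has.

```javascript
// Hypothetical representations of the three pages. Each carries only links.
const pages = {
  '/pages/1': { links: { next: '/pages/2' } },
  '/pages/2': { links: { prev: '/pages/1', next: '/pages/3' } },
  '/pages/3': { links: { prev: '/pages/2', first: '/pages/1' } },
};

// Move to another state by following a link with a known relation.
function follow(representation, rel) {
  const target = representation.links[rel];
  if (!target) {
    throw new Error('No "' + rel + '" link in the current state');
  }
  return pages[target]; // stands in for a GET on the target URI
}
```

After following `next` from the first page, the client holds a representation with `prev` and `next` links, and that pair of links is what tells it that it is on the second page.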

Location vs. Content-Location

2008-10-10T15:37:47+00:00

Here is a quick note on the purposes of and differences between Location and Content-Location response headers.

Here is a quick note on the purposes of and differences between Location and Content-Location response headers. The question came up several times, and more recently in Bill Burke’s post on Atom too SOAPy for me.

Here is how HTTP 1.1 defines the Location header.

Location:

The Location response-header field is used to redirect the recipient to a Location other than the Request-URI for completion of the request or identification of a new resource.

The use of the Location header is straightforward. The server can use this header when it creates a new resource in response to a POST request, i.e. while returning response code 201. Or, it can use this header to redirect the client to a different Location with one of the 3xx codes. The use of the Location header otherwise is unspecified. In particular, its use is not defined for GET.

Content-Location, on the other hand, has a much narrower usage.

Content-Location:

The Content-Location entity-header field MAY be used to supply the resource Location for the entity enclosed in the message when that entity is accessible from a Location separate from the requested resource’s URI.

The name of this header could be a bit confusing. One way to understand this is to relate it to other Content-xxx headers such as Content-Type, Content-Language, and Content-Encoding. Just like the way Content-Type header declares the media type of the entity in the response, the Content-Location header declares a URI for the entity in the response. Also note that the Content-Location header is not defined for PUT and POST.

Content-Location header is useful for content-negotiated responses when both server-driven and agent-driven negotiations are in use by the server. Here is an example.

GET /myResource  
Accept: application/xml

200 OK  
Content-Type: application/xml  
Content-Location: [http://example.org/myResource?format=xml](http://example.org/myResource?format=xml)

...

Here the server is telling the client that the same content (i.e. a variant of media type application/xml) is also available at http://example.org/myResource?format=xml at the time of the request. The client can use that URI in the future to directly fetch the negotiated response. Caches can use this URI to associate the requested URI (i.e. http://example.org/myResource) with the URIs of variants (such as http://example.org/myResource?format=xml), and flush those variants while flushing the resource, or flush a previously cached representation at the variant URI if that representation is stale.
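The bookkeeping a cache might do with this header can be sketched as follows. This is a simplified illustration, not a description of any particular cache; the `VariantIndex` class is hypothetical and expects header names in lower case.

```javascript
// Remembers which variant URIs were announced for each requested URI.
class VariantIndex {
  constructor() {
    this.variants = new Map(); // requested URI -> Set of variant URIs
  }

  // Record the Content-Location of a response to a request for requestUri.
  record(requestUri, headers) {
    const location = headers['content-location'];
    if (!location || location === requestUri) return;
    if (!this.variants.has(requestUri)) {
      this.variants.set(requestUri, new Set());
    }
    this.variants.get(requestUri).add(location);
  }

  // Everything to flush when the resource at requestUri is flushed.
  urisToFlush(requestUri) {
    return [requestUri].concat(Array.from(this.variants.get(requestUri) || []));
  }
}
```

After recording the response above, flushing http://example.org/myResource would also flush http://example.org/myResource?format=xml.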

Finally, neither header is meant for general-purpose linking.

Thanks to my colleague, Mark Nottingham, for clarifying some of the finer points over several emails a few months ago.