How to Run a Better Engineering Organisation

First of all, forgive the brazen and somewhat presumptuous title; I have never run an engineering organisation, so you might want to take my opinions with a pinch of salt. However, I do believe that I can offer a useful perspective on what ICs (individual contributors) want out of the organisation they work in.

When you work at one of the fastest growing companies in the history of ever, doing a lot of interviews is inevitable. I must have done something like 200 interviews in my almost 4 years at Uber (the large majority being in the first 2 years when I lived in SF) and the most common question I get asked at the end of each interview is 'What do you like the most about working there?'. Softball question aside (hint: asking good questions is an important part of an interview), this gives me a good chance to introspect and think about why I say that I love working here. That answer naturally changes over time, but most recently I've settled on what I think is the closest to the truth. What it comes down to is that it's a fantastic place to work as an Engineer. In this post, I'll expound on that answer and lay out a series of points which I believe may improve the engineering in your company.

Before we get started, make a mental note that this post isn't titled 'X simple steps to...'. Some of these things are extremely difficult to do. Engineers might claim to be more logical and objective, but in reality, they're just as scared of change as everyone else.

1. Empower Engineers

There's nothing that engineers hate more than not being allowed to do something that they want to do. Not only is putting restrictions on access futile (good engineers will find a way to do it anyway), but it also sends a damaging message about how much you trust and value them.

If you're going to pay me thousands of dollars to do my job, then let me do my fucking job.

The simple way to empower engineers is to give everybody access to everything. That means, at the very least:

  • Read/write access to every code repository across the company.
  • SSH access to every server, including production.
  • Ability to deploy code to production (and roll it back if needed).
  • Access to critical business data (e.g. users/sessions, revenue, profit, etc.).

If you want to go one step further - as Uber does - then you can also give everybody access to all the data too. At Uber, engineers have access to all of the company's intimate business metrics through multiple dashboards (one of which was infamously leaked the day before I started).

Some of these things might sound absolutely crazy. Access for everyone? Won't junior engineers do bad deploys or accidentally drop our tables? Won't our metrics get leaked?! The answer to these questions is, of course, probably. Fun side note, a screenshot of Uber's internal business metrics was leaked on the day I joined the company.

That leads me to point 2, which will help you avoid these things as much as possible.

2. Invest in Tools

If the idea of letting a junior engineer in their first week deploy code to production scares you, then it simply means that your tooling is not good enough. Let's think through some of the things that could go wrong:

  • They push bad code to production, causing an outage.
  • They deploy with no visibility and can't monitor the results.
  • They mix up steps 7 and 9 of a DB follower promotion, causing corrupt data.
  • They deploy code which was not reviewed or tested.

In a customer-obsessed organisation, these are all bad things which can have a big impact. I've used the example of a junior engineer to hit the point home, but no matter how senior the person doing these actions is, everyone will make a mistake at some point. You have to invest in tools to make sure that the impact of a senior person making these mistakes is exactly the same as that of a junior person making them. Once you've done that, it doesn't matter who's pressing the big red button and you can give access to everybody.

So how do you minimise the impact of these problems? Let's take a look at each of the examples:

Pushing bad code - We know that everyone is going to push bad code at some point. The key here is to invest in exception monitoring tools and tie these into your deploy system. If a deploy goes out and errors spike, then the tool should automatically rollback to the previous deploy and notify the engineers responsible for that deploy and the lines of code causing the problems.
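
To make the "errors spike, roll back" rule concrete, here is a minimal sketch of the decision itself. It is not Uber's deploy system: the `DeployHealth` shape, the function name and the thresholds are all assumptions made up for illustration, and a real tool would pull these numbers from your exception monitoring.

```python
from dataclasses import dataclass


@dataclass
class DeployHealth:
    """Error and request counts before and after a deploy (hypothetical shape)."""
    baseline_errors: int
    baseline_requests: int
    current_errors: int
    current_requests: int


def should_roll_back(health: DeployHealth,
                     max_ratio: float = 2.0,
                     min_requests: int = 100) -> bool:
    """Return True when the post-deploy error rate has spiked.

    max_ratio and min_requests are illustrative values, not recommendations.
    """
    # Too little traffic to judge: keep watching rather than act on noise.
    if health.current_requests < min_requests:
        return False

    baseline_rate = health.baseline_errors / max(health.baseline_requests, 1)
    current_rate = health.current_errors / health.current_requests

    # Floor the baseline so a perfectly clean baseline does not turn a
    # single error into an apparent spike.
    floor = 0.001
    return current_rate > max(baseline_rate, floor) * max_ratio
```

The deploy system would call something like this on a timer after each roll-out and, when it returns True, redeploy the previous build and notify whoever shipped the change.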

Lack of monitoring - Knowing that your change is working is crucial to avoid problems. Invest in a robust monitoring system with simple-to-create dashboards (e.g. Grafana), a simple-to-search logging stack (e.g. ELK) and a solid error reporting system (e.g. Sentry, Rollbar).

Process mix-up - This is a real thing that has happened at Uber. What did we learn from it? Any process which has more than 3 manual steps is a nightmare waiting to happen. Even a seasoned veteran can make a mistake under pressure. Invest in tools to automate any important processes and take humans out of the equation. That can be as simple as a couple of bash scripts or as smart as a system which uses machine learning to automatically make the call on when to re-balance/fail-over.
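
At the "couple of scripts" end of that spectrum, the point is simply that the order of the steps lives in code rather than in somebody's head. A rough sketch, assuming each step can be wrapped in a function (`run_steps` and the step names are hypothetical, not a real tool):

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus the action that performs it.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run each step in the declared order and stop at the first failure.

    Returns the names of the completed steps. On failure, raises a
    RuntimeError that says which step broke and what had already run.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"step '{name}' failed after {completed}: {exc}"
            ) from exc
        completed.append(name)
    return completed
```

Because the list is the only way to run the procedure, nobody can mix up steps 7 and 9 under pressure. The worst case is a clear failure at a known step.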

Un-reviewed or untested code - Adopting new tools or adding simple extensions to existing ones can easily solve this problem. For instance, Phabricator will warn an engineer if they're trying to merge code which has not been reviewed or if the tests don't pass. Pair this up with your linters and code coverage tools to ensure that code can't make it to master unless it meets your criteria.
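
The criteria themselves are usually trivial to express. Here is a minimal sketch of a merge gate, assuming your review and CI tools can report the four fields below. `ChangeStatus`, `merge_blockers` and the 80% coverage default are illustrative assumptions, not Phabricator's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeStatus:
    """What the review and CI tools report about a change (hypothetical shape)."""
    approved_reviews: int
    tests_passed: bool
    lint_passed: bool
    coverage_percent: float


def merge_blockers(change: ChangeStatus, min_coverage: float = 80.0) -> List[str]:
    """Return the reasons a change may not land; an empty list means it can merge."""
    blockers: List[str] = []
    if change.approved_reviews < 1:
        blockers.append("needs at least one approving review")
    if not change.tests_passed:
        blockers.append("tests are failing")
    if not change.lint_passed:
        blockers.append("linter reported problems")
    if change.coverage_percent < min_coverage:
        blockers.append(f"coverage is below {min_coverage:.0f}%")
    return blockers
```

Returning the reasons rather than a bare yes/no means the engineer is told exactly what to fix, which matters when the gate applies to everybody.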

3. Foster Responsibility

So you've given everyone access to wreck your company and put some tools in place to make that less likely to happen. What next? We all know the adage about what comes with great power. Fostering a responsible culture is about allowing people to own up to their mistakes and moving forward. Blameless post-mortems are a great way to do this. By encouraging people to put their hands up, own their mistakes and then work as a team to fix them, you're much less likely to find yourself in the situation of a costly repeat mistake.

There are three important parts to this process:

  1. Lead by example. You have to show junior engineers and people new to the team that it's OK to make mistakes as long as you learn from them. The best way to do that is by owning and over-communicating your own mistakes.
  2. Don't over-react. Someone accidentally ran a command in production and took down your service. You'll undoubtedly hear your infra/security people calling for revoking access. Don't over-react. Remember: invest in your tools. It turns out that the person made the mistake whilst trying to restart their service. A better solution would be to add that capability to a simple UI rather than removing access and forcing everybody through a cumbersome procedure.
  3. Have a concrete outcome. A post-mortem with the outcome of "we made a mistake and we'll try not to do it again" is useless. You need concrete tasks to come out of them which will stop the same mistake being made again.

So you had your outage, and in the postmortem, you realised that the problem should've been caught before deploying to production! Why didn't it get caught in QA? Get ready for what is probably my most controversial recommendation:

4. Ditch QA and Staging

That's right. No QA and no staging environment. I know, you've always done it this way because quality is so important - getting rid of QA is a preposterous idea. Bear with me here, because this is a hard one to swallow, but I truly believe that once you remove QA and staging you'll see a huge productivity increase across the board.

QA comes in many different shapes and sizes. I've seen the worst of it in the financial sector - let me explain what the typical process was there (and I should point out that I was working on a system which accounted for hundreds of millions of dollars a day in trading volume):

  1. I'd write some code, check that it worked and cut a build.
  2. I'd pass the build on to the QA team, and explain what feature I'd added.
  3. They'd run through days of manual and automated testing to ensure that the code behaved exactly as specified in the requirement docs.
  4. They'd either find a bug, in which case we'd repeat the whole process, or they'd approve the change.
  5. Some number of weeks later, somebody somewhere would deploy the code to staging.
  6. Things looked fine because staging isn't the real world. After a few weeks, the code would be deployed to production.
  7. The inevitable bug they didn't catch would show up in production.

This is a shitty situation to be in for so many reasons. Firstly, the QA team would take a bit of the blame for letting the bug through. I can't imagine what it must feel like to be blamed for somebody else's shitty code, but I doubt it's fun. Secondly, because QA was there and their job was to ensure the requirements were met, I barely did any testing. I had a safety net which made me complacent. Finally, the whole process made a change that should've taken an hour take a whole week.

Now let me explain how the process typically works at Uber:

  1. I write some code. I write a lot of tests. I spend time trying to break the code and ensure that I've covered edge and failure cases.
  2. Somebody reviews my code and checks for the same things.
  3. I hit deploy. I monitor the roll-out. I watch the graphs.
  4. If there are errors, the system automatically rolls back the changes.

The difference here is that I've taken on full responsibility for my work. I own the code through every step of the process - if it goes wrong, there's nobody else to blame. That's fine, because we invested in tools which minimise problems, and we foster a culture of learning from our mistakes. The outcome of this is that I put so much more effort into catching issues before they ever reach the eyes of another engineer.

You'll also notice that I'm the person doing the deploys and monitoring the system. It doesn't make sense for somebody who doesn't understand my change to do that, and the added benefit is that when writing code, I care a lot more about good logging and instrumentation because I know that I might need it later.

This is genuinely how it works at Uber. This really is one of the things that let us move faster than I was able to at a 20 person start-up which had a shitty attitude towards testing and a cumbersome staging and deployment process.

This section is already too long and I haven't really spoken about staging yet, so I'll keep it brief; staging is shitty because it is not production. No plan survives first contact with the enemy, so the faster you can get your code out there, the quicker you'll know if you need to change something. Staging only slows you down. Instead, dogfood with employees, but do it in production. Invest in a phased rollout system, targeted to employees only. Use A/B tests and feature flags.
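
For a sense of how little is needed to get started, here is a minimal sketch of an employee-only phased rollout check. It hashes the flag name and user ID into a stable bucket, a common way to do percentage rollouts. The function names and parameters are hypothetical and this is not a description of Uber's system.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, is_employee: bool,
               employee_only: bool, percent: int) -> bool:
    """Decide whether a user sees a feature in production.

    employee_only restricts the flag to staff (dogfooding); percent is the
    share of eligible users, from 0 to 100, who get the feature.
    """
    if employee_only and not is_employee:
        return False
    # The same user always lands in the same bucket, so raising percent
    # only ever adds users and never flips existing ones off.
    return rollout_bucket(flag_name, user_id) < percent
```

A roll-out then becomes a series of config changes: employees at 10%, employees at 100%, then everyone at 1% and upwards, all in production and with the graphs in front of you.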

5. Let Builders Build

This one is cribbed directly from Uber's values. The idea is to set engineers up to execute, then get out of the way and let people build. If somebody comes up with an idea to solve problem X, don't stop it being built because it doesn't also solve problems Y and Z.

I see this happen a lot with technology choices too. When I was at Expensify, we had an issue where our support and sales teams didn't know when a particular change or fix had moved from development into production. Engineers would routinely close issues and the sales team would tell a customer that the issue was fixed and that they should try again - only to have it continue to fail because a deploy got held up.

One afternoon, I decided that this needed fixing and set out to write a simple script which would take the diff between the previous deployed hash and the new one, scour the commit messages for issue numbers, then use the GitHub API to post a comment saying 'deployed to staging' or 'deployed to production' so that the team would know the status of the fix.
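
The core of a script like that is tiny. Below is a rough reconstruction of the idea in Python, not the original script. It assumes commit messages reference issues as `#123`, and it leaves out both the git step that collects the messages between two hashes and the GitHub API call that posts the comment.

```python
import re
from typing import Dict, List

# Assumed convention: issues are referenced in commit messages as "#123".
ISSUE_REF = re.compile(r"#(\d+)")


def issues_in_commits(messages: List[str]) -> List[int]:
    """Collect the distinct issue numbers mentioned across commit messages."""
    found = set()
    for message in messages:
        found.update(int(number) for number in ISSUE_REF.findall(message))
    return sorted(found)


def deploy_comments(messages: List[str], environment: str) -> Dict[int, str]:
    """Build the comment to post on each issue touched by a deploy."""
    comment = f"deployed to {environment}"
    return {issue: comment for issue in issues_in_commits(messages)}
```

For example, `deploy_comments(["Fix login crash (#12)", "Refs #7 and #12"], "production")` yields one 'deployed to production' comment each for issues 7 and 12, ready to hand to whatever posts them.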

Now, the majority of the Expensify backend is written in PHP, and this was 2012 and Node was the new hotness. I decided that I wanted to learn about Node and thought that this little script would be a good way to practice, so I just did it. It took about an hour to write and saved the sales team a whole bunch of trouble. I'd fixed a problem which everybody had known about for months and done nothing about, but when I announced it to the team all I got was a bunch of flak from some of the engineers about my choice of language.

This is the antithesis of Let Builders Build. Embrace things that move your company forwards and empower engineers to make those decisions. You can always rewrite things if they don't work, but stifling creativity for the sake of it is a real morale killer for your team. If you're using technology that is considered old and clunky, take heed. Good engineers are always looking to learn new things and keep up to date.

A word of warning; apparently some teams/people at Uber used this phrase verbatim as an excuse to push out poor quality code. I see this as a mantra for cutting out bureaucracy, but never quality. Make sure it isn't misused.

6. Go Cross-functional

In my experience, the 'traditional' engineering team setup has been to divide people up by their discipline. The chances are you've worked in a company which does this. Typically, you'd have a 'backend' team, a 'frontend' team, a web team, maybe Android and iOS teams, a data science team, an infrastructure team, etc. This works fine when the company is small, but at some point (I estimate at around 50 engineers) you find that overhead starts to account for more time than actual work.

If you want to ship a feature which requires work across all of these disciplines, you need to align multiple roadmaps and get 'resources' from another team to work on your project. More often than not, the people on the other team don't really care about your feature and they have something else which they're already working on. You also have the problem that it takes a lot of time to familiarise somebody enough with a feature that they actually implement it properly.

The best solution to this problem that I've seen is to go completely cross-functional. If a team is responsible for shipping new Widgets across all platforms, you should make a Widget team which has engineers who specialise in frontend, backend and mobile, plus a data scientist, an engineering manager and a product manager. You might even want to include what are traditionally 'non-tech' families such as sales, operations or support.

What you end up with is a team that is laser focussed on shipping Widgets. Most importantly, there is no point where that team is dependent on another team to get their work done. In my opinion, the best thing that Jeff Holden (CPO of Uber) ever brought to our team was a push to implement and double down on this team paradigm.

7. Parallelise Promotion

Finally, a word of advice on engineer growth. In 'traditional' companies, engineers are hired because they show promise in engineering talent. As they grow in technical ability, they also grow in the organisation and are promoted to more senior roles. Unfortunately, many companies have a single ladder for promotion and the only place to go after 'Senior Engineer' is 'Engineering Manager'.

I can't stress this enough. Don't promote into management.

Being an engineer and managing people are two completely different skills. Promoting an engineer into management is like promoting your best football player into the head coach. Not only do they not have the skills necessary to succeed in management, but you also just stripped your team of your best player. In many cases, both the individual and the team are going to end up unhappy.

Instead, parallelise your promotion tracks. Have a ladder for engineering which runs parallel to the ladder for management. Use identical pay bands for the two ladders and promote at the same rate. This incentivises people to move sideways only when they feel they'll be better at the job, rather than for monetary reasons. Also, be sure to squash any 'congratulations!' messages people send when they see someone become a manager to really hit this one home.

Note that the other - more radical - option is to get rid of titles altogether. In theory, this sounds interesting, but I've never worked at a company that's done this so I can't speak about the inevitable social nuances that arise.

Well, that was a lot of pontification for somebody who's never run an engineering organisation, but I'm pretty confident that it's decent advice based on my own personal experience. Of course, maybe this stuff is already common knowledge. I've always been an observer and the hardest part is always executing and bringing change rather than just talking about it. If you do decide to change something about your org, I'd love to hear about how it went.