There's a specific kind of dread that hits when you take over a system and you can't tell whether it's fragile or simply unfamiliar.
The dashboards look "fine." The on-call doc exists. The team is busy. But every small change feels like you're pulling on a thread you can't see.
I've walked into environments like this more than once. The pattern is consistent: the mess is rarely just technical. The code is only the most visible symptom.

The shapes a mess actually takes
Most inherited messes are not a single problem. They are a stack of decisions made under pressure, in a system that nobody had the time or authority to redesign. By the time you arrive, they tend to settle into one of a few recognizable shapes.
- The hero-dependent system. One or two engineers carry the load. Deploys go through them. Incidents are routed to them. When they take vacation, the team gets quieter and more careful. The code might be fine. The bus factor is what's broken.
- The fear-driven team. People know how to ship. They've stopped believing they're allowed to. Reviews take forever. Changes get smaller. Risk migrates outside the team because nobody wants to be the one who broke production again.
- The frozen architecture. A central system that everyone agrees needs to change, and nobody is allowed to touch. Sometimes because a previous rewrite failed. Sometimes because the team that owned it left. Sometimes because the customer impact of being wrong is so severe that "do not touch" became policy by attrition.
- The leadership mismatch. The team has been optimizing for one thing (speed, optics, predictability) and the new leadership wants something else. The codebase is fine. The incentives no longer match the goals.
You will usually find more than one of these layered on top of each other. Knowing which shape you're inheriting matters more than knowing which language the code is in.
Step 1: Assess the real damage (technical and cultural)
If you try to "fix the code" before you understand the forces that created it, you'll spend your political capital on the wrong fights.
One example: I once inherited a system where every deploy was treated like a mini-incident. People scheduled releases around specific individuals being online, because nobody trusted the process. Before we touched architecture, we made deploys boring: tightened the release path, clarified ownership, and killed the top recurring alert. That single change reduced the ambient stress enough that real improvements became possible.
I like to run two assessments in parallel:
- Technical reality
  - What breaks in production, and how often?
  - Where does delivery slow down? (build times, review cycles, deploy friction)
  - Where is risk concentrated? (single owner, unclear interfaces, untested critical paths, black box files)
  - What's truly unknown? (no one can explain it, no one wants to touch it)
- Organizational reality
  - What does leadership reward: speed, correctness, predictability, optics?
  - What did the team learn to optimize for?
  - Who is afraid of what? (reliability incidents, missed dates, blame, audits)
This is also where you should look for "quiet constraints" that don't show up in Jira: compliance obligations, vendor lock-in, customer promises, or a burned-out team that has stopped believing change is possible.
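One way to put a number on "where is risk concentrated" is to look at who has actually been changing each file. The sketch below is a rough illustration, not a definitive method: it assumes you have already exported your version control history as (path, author) pairs, and the function name, the 80% threshold, and the five-commit minimum are all placeholders I picked for the example.

```python
from collections import Counter, defaultdict


def single_owner_paths(commits, threshold=0.8, min_commits=5):
    """Return paths where one author made at least `threshold` of the commits.

    `commits` is an iterable of (path, author) pairs. How you export them
    from version control is up to you; the shape is an assumption here.
    """
    authors_by_path = defaultdict(Counter)
    for path, author in commits:
        authors_by_path[path][author] += 1

    flagged = []
    for path, counts in authors_by_path.items():
        total = sum(counts.values())
        if total < min_commits:
            continue  # too little history to say anything useful
        top_author, top_count = counts.most_common(1)[0]
        share = top_count / total
        if share >= threshold:
            flagged.append((path, top_author, share))

    # Most concentrated ownership first
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# Made-up history: one file is effectively owned by a single person.
history = (
    [("billing/ledger.py", "ana")] * 9
    + [("billing/ledger.py", "raj")]
    + [("api/routes.py", "ana")] * 3
    + [("api/routes.py", "raj")] * 3
)
for path, owner, share in single_owner_paths(history):
    print(f"{path}: {owner} wrote {share:.0%} of recent changes")
```

Treat the output as a list of conversations to have, not a verdict. Commit counts say nothing about who reviewed the work or who could pick it up tomorrow.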
The first two weeks: a listening tour, not a verdict
Before you start fixing things, you have to be able to see clearly. The most useful work you can do in your first two weeks looks like nothing. It's listening.
The pattern I run when I take over a system:
- 30-minute conversations with everyone on the team. Same three questions: what's working, what's painful, and what would you change if you had a free hand for a week? You're not gathering opinions. You're building a map of where people are willing to spend energy. Block 45 minutes on your calendar. Some of these will run long, and cutting off someone who's finally opening up is worse than running late to your next meeting.
- Read the last six months of incidents. Not just the post-mortems. Read the Slack threads, the on-call notes, the stuff that never made it into a ticket. The way a team talks about incidents tells you whether they trust the system, each other, or neither.
- Trace one real customer flow end to end. Pick a critical path. Money movement, auth, the thing that breaks first when something is wrong. Walk the code, the services, the data, the runbooks. You'll usually find the gap between "how it's documented" and "how it actually works" inside the first hour.
- Sit through a deploy and an on-call rotation. Just observe. Don't intervene. The friction the team has stopped noticing is the friction you need to fix first.
The temptation in week one is to make decisions to prove you're capable. Resist it. Premature decisions are the cheapest way to spend trust you don't have yet.
Step 2: Choose battles (what to fix vs. what to replace)
A turnaround fails when you try to modernize everything at once, or when you only chase visible problems.
A simple heuristic:
- Fix when the system is salvageable and the constraints are known.
- Replace when the system is structurally misaligned with your needs (and you can afford the migration).
- Fence off when the system is ugly but stable, and the risk is in the touching, not the existing.
I'm deliberate about one thing: the first major initiative should reduce uncertainty, not increase it. "Rewrite the platform" is usually an uncertainty amplifier.
Step 3: Build momentum with early wins
Early wins are not about shipping something flashy. They are about proving that:
- work can ship predictably
- incidents can be handled without heroics
- the system can change without breaking trust
Good early wins tend to look boring:
- adding a missing health check that prevents a class of outages
- cutting a 45-minute deploy down to 15 minutes
- eliminating a recurring production alert by fixing the real root cause
- documenting a critical workflow so a single person is no longer a bottleneck
The point is to make the team feel the ground get more stable under their feet. This isn't just intuition. Research on progress and engagement found that progress on meaningful work is the single strongest driver of positive inner work life [1]. Boring wins that reduce chaos are meaningful work.
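If you're not sure which recurring alert to go after first, a simple count over the paging history is usually enough to start the conversation. This is a minimal sketch under assumptions: the (alert name, date) input shape and the function name are hypothetical, and your paging tool's export will look different.

```python
from collections import Counter


def top_recurring_alerts(pages, limit=3):
    """Rank alerts by how often they paged someone.

    `pages` is an iterable of (alert_name, iso_date) pairs; the shape is
    an assumption for this example.
    """
    pages = list(pages)
    counts = Counter(name for name, _ in pages)

    # Track distinct days so one noisy afternoon doesn't look like a trend.
    days = {}
    for name, day in pages:
        days.setdefault(name, set()).add(day)

    total = len(pages)
    return [
        {
            "alert": name,
            "pages": n,
            "distinct_days": len(days[name]),
            "share": n / total,
        }
        for name, n in counts.most_common(limit)
    ]


# Made-up paging history
pages = [
    ("disk_full", "2024-03-01"),
    ("disk_full", "2024-03-01"),
    ("disk_full", "2024-03-02"),
    ("queue_lag", "2024-03-02"),
]
for row in top_recurring_alerts(pages):
    print(row)
```

An alert that fires on many distinct days is a better candidate for a root-cause fix than one that fired many times during a single bad incident.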
Step 4: The 90-day turnaround framework
Turnarounds need urgency, but they also need sequencing.
Days 1 to 30: Reduce uncertainty
- Map critical flows (money movement, auth, core customer actions).
- Establish a baseline for reliability and delivery using the DORA four key metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service [2]. These aren't vanity metrics. The decade of research behind them consistently shows they predict both engineering health and business outcomes.
- Identify the "do not touch" zones and the "safe to improve" zones.
- Create an escalation path that does not require you to be the router for every decision.
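The baseline in the second item above doesn't need tooling to get started. Here is a rough sketch of computing the four metrics from a list of deploy records. The `Deploy` fields and the `dora_baseline` name are invented for this example, and it assumes lead time runs from merge to production and restore time runs from the failed deploy to recovery. Your own definitions may differ, and that's fine as long as they stay consistent between measurements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # when the change was merged (assumed definition)
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_baseline(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize the four key metrics over a fixed observation window."""
    if not deploys:
        raise ValueError("need at least one deploy to compute a baseline")

    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        # None means no failed deploy in the window had a recorded recovery
        "median_restore_hours": (
            median(rt.total_seconds() for rt in restore_times) / 3600
            if restore_times
            else None
        ),
    }


# Made-up two-week window with four deploys, one of which failed
start = datetime(2024, 1, 1)
deploys = [
    Deploy(start, start + timedelta(hours=2)),
    Deploy(start, start + timedelta(hours=4), failed=True,
           restored_at=start + timedelta(hours=5)),
    Deploy(start, start + timedelta(hours=6)),
    Deploy(start, start + timedelta(hours=8)),
]
print(dora_baseline(deploys, window_days=14))
```

Medians are used here so a single terrible deploy doesn't distort the picture. The absolute numbers matter less in the first 30 days than having a starting point you can compare against at day 90.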
Days 31 to 60: Remove the biggest sources of drag
- Attack one bottleneck in the delivery system (CI, environments, approvals, release process).
- Make ownership legible (clear on-call, clear domain ownership, clear interfaces).
- Reduce repeated work (manual runbooks, tribal knowledge workflows).
Days 61 to 90: Invest in leverage
- Pay down the debt that blocks product work (not the debt that annoys engineers).
- Make tradeoffs explicit (what you are not fixing and why).
- Put a roadmap in front of leadership that has risk and sequencing, not just feature dates.
If you do this well, the team comes out of the first 90 days with fewer surprises, fewer emergencies, and more confidence.
The people equation
A turnaround framework that ignores the people side gets a clean diagram and a terrible outcome. The same archetypes show up almost everywhere.
The hero engineer. This is the person who keeps the system standing. They know the most, ship the most, and carry the most invisible weight. They are also usually exhausted, slightly defensive, and convinced nobody else can be trusted with the hard parts. Your job is not to remove them. It's to redistribute what they hold. Pair them on the work nobody else understands. Get the runbooks out of their head. Make their knowledge a team resource rather than a single point of failure.
The burned-out engineer. Often a strong performer who has stopped pushing back. They show up, do the work, and have given up on the system getting better. The wrong move is to demand more energy from them. The right move is to ship one visible improvement to the thing they're most tired of, and to do it without asking them to lead it. They reengage when they see the ground move.
The skeptic. Someone (often senior) who has seen leaders come in with big plans before and watched them fail. They will challenge your framing. They are not wrong to. The most useful thing you can do is be honest about what you don't know yet, name the constraints you've inherited, and let your first 30 days demonstrate that this time is different. Skeptics turn into your strongest allies once they trust the process is real.
The predecessor's loyalists. People who worked closely with whoever was in your seat before you. If you trash the previous work, you trash them. Make it explicit that you are not here to judge prior decisions; you are here to operate inside the current reality. The system reflects the constraints it grew up under. So do the people in it.
You cannot do a turnaround alone. The framework reduces uncertainty in the system. The people choices reduce uncertainty in the team.
What to communicate upward without trashing the past
If you inherited a mess, it's tempting to narrate the story as "we're fixing what the previous team broke." Don't. It's usually wrong, and it creates enemies you don't need.
A more defensible framing:
- The system reflects the incentives it grew up under.
- Our goal is to improve reliability and delivery without stopping the business.
- We are choosing a sequence that reduces risk, not just a list of improvements.
I also like to say this explicitly: some of what looks like "bad engineering" is actually "engineering under constraints." Your job is to decide which constraints still exist, and which ones you can remove.
Where people will push back
If you publish a turnaround strategy, people will challenge it. You can preempt most objections by acknowledging the messy reality up front.
"We can't take 90 days. We need features now."
That's real. The answer is not to pause product work. The answer is to pick improvements that increase throughput quickly (deploy friction, ownership clarity, incident load). The best turnarounds are not detours. They are force multipliers [3].
"This is just a rewrite pitch."
It isn't. A rewrite is a last resort, not a starting point. You reach for it only after you've stabilized the system, understood the actual constraints, and confirmed that incremental improvement won't get you where you need to go. Most turnarounds never need one. A good turnaround starts with risk reduction, not new architecture.
"You're over-indexing on process."
Only if the process does not change outcomes. The goal is fewer emergencies and more predictable delivery. If a ritual does not reduce chaos, remove it.
"This doesn't apply in regulated environments."
It does, but the sequencing changes. In regulated contexts, you still reduce uncertainty first. The difference is you often start with auditability, access controls, and change management because reliability is not only uptime. It's provability.
"You're blaming the team."
If the strategy reads like blame, it will lose trust. Make it clear that you assume people were doing their best under constraints, and that your job is to change the system those people operate inside.
What day 90 actually looks like
If the turnaround is working, the signals are usually quiet, not loud.
- Deploys stopped being scheduled around specific people being online.
- The on-call pager is going off for fewer, smaller things.
- The team is making changes in parts of the code they used to avoid.
- Leadership stopped asking when things would stop being on fire, because they noticed the change before you said anything.
- You have a roadmap that names what you're not doing, and why.
You will not be "done" at day 90. You will have replaced a vague crisis with a known operating reality. That trade is the whole point. A team that knows what it's dealing with can plan. A team in ambient crisis can only react.
The takeaway
A technical turnaround is not a heroic rewrite. It's a sequence of choices that reduces uncertainty, buys back time, and restores trust.
If you've inherited a system like this, start by asking: What is the one change that would make the next month calmer? Then do that. Momentum follows stability.
The work is almost always less glamorous than people expect. Boring deploys. Fewer alerts. Cleaner ownership. A team that no longer flinches when leadership asks how things are going. The point is not to look like a turnaround. The point is to stop needing one.