Working on Existing Systems

Freelancing in infrastructure means you don’t build clean systems—you inherit them. Usually at the exact moment they’ve become too fragile, too confusing, or too risky for the internal team to keep pretending everything is fine.

Over time, certain patterns repeat. Not vague “lessons learned”, but very specific, concrete situations that keep showing up in different companies, across different stacks, with different people. The details change. The underlying problems don’t.

Here are some of the more memorable ones.

The single point of failure labeled “temporary”

One environment had exactly one VM doing everything: API, background jobs, database, file storage. There was technically a plan to split it out “soon”, but that plan was already over a year old.

The interesting part wasn’t the architecture—it was the operational behavior around it. Deployments were done via SSH. There was no staging environment, so production was the test environment. If something broke, the fix was applied directly, often while users were actively hitting the system.

At some point, the VM ran out of disk space. Not because of data growth, but because logs had been accumulating for months without rotation. The application went down, and the immediate fix was to manually delete files until things started working again.

No monitoring had alerted anyone. The system failed in a completely predictable way that no one had set up a guardrail for.
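
Unbounded log growth is one of the cheaper failure modes to rule out. As a minimal sketch, assuming the application is written in Python and logs through the standard `logging` module (the source does not say what the stack was), size-capped rotation could look like this. The function name, logger name, and size limits are illustrative, not taken from the system described.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(log_path: str,
                          max_bytes: int = 10_000_000,
                          backups: int = 5) -> logging.Logger:
    """Return a logger whose file output is capped at roughly
    max_bytes * (backups + 1) bytes on disk.

    The limits are placeholder values; pick ones that fit the disk.
    """
    logger = logging.getLogger(f"app.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

For services that write to plain files outside Python, a system-level tool such as logrotate fills the same role. The point is only that some upper bound exists.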

Credentials shared across everything

In one project, every service used the same database credentials. Not similar credentials—the exact same username and password.

The reasoning was convenience. Rotating credentials would have required coordinating changes across multiple services, and that was seen as risky. So instead, nothing was ever rotated.

Those credentials existed in environment variables, config files, CI pipelines, and at least one internal wiki page. At some point, they were also pasted into a support ticket for debugging purposes.

When I asked how access was revoked when someone left the company, the answer was essentially: “we don’t, unless there’s a specific reason”.

Backups that had never been restored

A client confidently told me they had daily backups of their database and file storage. The setup looked reasonable on paper. Jobs were running, files were being written, and retention policies were configured.

The problem was that no one had ever actually tried to restore from them.

We tested it. The database backup restored, but with missing tables due to a misconfiguration that had been there for months. The file backups existed, but permissions were wrong, so the application couldn’t read them without manual fixes.

Everything looked green until the moment it needed to work.
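
A restore test does not have to be elaborate to catch this class of problem. The sketch below checks a restored database for missing tables. It uses SQLite purely as a stand-in so the example is self-contained; the client's actual database is not named in this story, and the `expected` table list is something you would have to maintain yourself.

```python
import sqlite3
from typing import Set


def missing_tables(conn: sqlite3.Connection, expected: Set[str]) -> Set[str]:
    """Return the expected table names that are absent from the restored DB."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return expected - present


if __name__ == "__main__":
    # Simulate a restore that silently dropped a table.
    restored = sqlite3.connect(":memory:")
    restored.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    print(missing_tables(restored, {"users", "orders"}))
```

Run on a schedule against a scratch restore, a check like this turns "the backup job is green" into "the backup can be restored and contains what we expect". Row counts and file permission checks would be natural additions.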

A CI pipeline nobody wanted to touch

One company had a CI/CD pipeline that technically worked, but only under very specific conditions. It had grown over time, accumulating steps, exceptions, and environment-specific behavior.

Builds would fail intermittently. When they did, the common solution was to re-run them until they passed. No one investigated deeply because failures were inconsistent and hard to reproduce.

The original person who set it up had left. Documentation was minimal. Every change felt risky, so changes weren’t made.

At some point, the pipeline stopped being a safety mechanism and became a formality—something you had to get through, not something you could trust.

Monitoring that didn’t help

I’ve seen multiple setups where monitoring existed, dashboards were in place, and metrics were being collected—but none of it actually helped during incidents.

In one case, there were plenty of graphs, but no clear thresholds or alerts tied to real problems. CPU usage could spike, memory could climb, error rates could increase, and nothing would trigger a meaningful notification.

In another case, alerts did exist, but there were so many of them that people had learned to ignore the noise. Important signals were buried in a constant stream of low-priority warnings.

In both situations, the tooling wasn’t the issue. The issue was that no one had taken the time to define what “bad” actually looked like and how to respond when it happened.
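
Defining "bad" can start very small. As an illustration, here is one possible rule: alert only when the error rate has stayed above a threshold for several consecutive samples, which filters out single spikes. The threshold of 5% and the window of three samples are made-up values, not a recommendation.

```python
from typing import Sequence


def should_alert(error_rates: Sequence[float],
                 threshold: float = 0.05,
                 sustained: int = 3) -> bool:
    """True if the most recent `sustained` samples all exceed `threshold`.

    error_rates is ordered oldest to newest, each value in 0.0 to 1.0.
    """
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    if len(error_rates) < sustained:
        return False  # not enough data to call it sustained
    return all(rate > threshold for rate in error_rates[-sustained:])
```

A rule this explicit is easy to argue about, which is the point: the team has to agree on the numbers and on who gets paged when it fires.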

Docker used as a black box

Containers are often introduced to standardize environments. In practice, I’ve seen them used as a way to avoid understanding what’s inside.

One project relied on a set of Docker images that had been built manually and passed around between team members. There was no reliable way to rebuild them from scratch. The Dockerfiles either didn’t exist or didn’t reflect the current state.

When something broke inside a container, the fix was to modify it interactively and then commit the changes into a new image with a slightly different tag.

Over time, this created a chain of images where each one depended on undocumented manual steps. Reproducibility—the main benefit of containers—was effectively gone.

“We can’t restart that”

There is almost always at least one component that no one wants to touch.

In one environment, it was an application server that had been running continuously for years. It handled critical traffic, and no one was entirely sure how it initialized or what state it relied on.

Restarting it might have been fine. Or it might have caused an outage that took hours to recover from. Since no one knew which outcome was more likely, the default decision was to avoid finding out.

This created a situation where even routine maintenance became complicated. Security updates were delayed. Configuration changes were avoided. The system stayed stable by being left alone, which made it increasingly fragile over time.

The cron job that ran everything

One system depended on a single cron job that ran every minute. That job acted as a scheduler, queue processor, and recovery mechanism all at once.

It checked for new work, retried failed tasks, cleaned up stale state, and occasionally triggered deployments. Over time, more responsibilities were added because it was “already running anyway”.

The problem wasn’t just that it was overloaded—it was that it had no visibility. If something failed inside it, there was no structured logging, no alerting, and no easy way to see what had or hadn’t run.

At one point, the cron daemon itself stopped running after a system update. Nothing happened immediately. Things just slowly stopped processing.

It took hours before anyone noticed, because there was no single signal that clearly said: “the system has stopped doing work”.
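
The missing signal here is usually called a heartbeat: the job records when it last finished, and something outside the job complains when that record gets too old. A minimal sketch, assuming a file-based heartbeat (the path and the age limit are hypothetical, and the checker has to run somewhere other than the same cron daemon):

```python
import time
from pathlib import Path
from typing import Optional


def record_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Called by the job at the end of each successful run."""
    path.write_text(str(time.time() if now is None else now))


def is_stale(path: Path, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """True if the job has not reported within max_age_seconds.

    A missing or unreadable heartbeat file counts as stale.
    """
    current = time.time() if now is None else now
    try:
        last_run = float(path.read_text())
    except (FileNotFoundError, ValueError):
        return True
    return current - last_run > max_age_seconds
```

For a job that runs every minute, a limit of a few minutes would have surfaced the stopped cron daemon almost immediately instead of hours later.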

The reverse proxy with undocumented rules

One client had a reverse proxy in front of their application stack. It handled routing, TLS termination, and a handful of redirects.

Over time, more rules were added. Some for legacy endpoints, some for temporary migrations, some to work around bugs in upstream services.

At some point, no one fully understood the rule set anymore.

Requests would behave differently depending on subtle details—headers, paths, even trailing slashes. Debugging issues required stepping through configurations line by line and making educated guesses about intent.

There was no documentation explaining why certain rules existed. Removing any of them felt risky, because it wasn’t clear what would break.

Disk space as an afterthought

In one environment, disk usage wasn’t monitored at all. Not loosely—literally not tracked.

Applications wrote logs, uploaded files, and temporary data without any constraints or cleanup policies. The assumption was that storage was “large enough”.

When the disk filled up, the failure mode was messy. Writes started failing in different parts of the system, causing partial errors rather than a clean outage.

Some services crashed. Others kept running but behaved unpredictably.

The fix was to delete data manually and restart affected services. No structural changes were made afterward, so the same thing happened again a few months later.
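
Even a crude check would have given a warning weeks ahead. The sketch below uses Python's standard `shutil.disk_usage`; the 80% and 90% thresholds are arbitrary examples, and how the result reaches a human (email, chat, pager) is left out.

```python
import shutil


def classify(used_fraction: float,
             warn_at: float = 0.80,
             crit_at: float = 0.90) -> str:
    """Map a used-space fraction (0.0 to 1.0) to a severity label."""
    if used_fraction >= crit_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def disk_used_fraction(path: str = "/") -> float:
    """Used fraction of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    print(f"/ is {fraction:.0%} full: {classify(fraction)}")
```

Pairing a check like this with a cleanup or retention policy is the structural change that was missing after the first incident.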

The manual deploy step everyone forgot

A deployment process included one manual step that wasn’t part of the automated pipeline. It was documented—somewhere—but easy to miss.

After the main deployment completed, someone had to run a command on a specific server to update a cache. If they didn’t, the application would continue serving outdated data.

Most of the time, someone remembered. Occasionally, they didn’t.

When that happened, the system looked partially broken. Some users saw new behavior, others saw old behavior, depending on which layer they hit.

It wasn’t obvious that a deploy had gone wrong. It just looked inconsistent.
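
One way to make this kind of drift visible is a post-deploy check that asks each layer which version it is serving and compares the answers. The sketch below covers only the comparison. It assumes each layer can report a version string, which was not the case in this system and would have to be built; the layer names are invented.

```python
from typing import Dict, List


def stale_layers(expected_version: str,
                 reported: Dict[str, str]) -> List[str]:
    """Return the names of layers not serving the expected version."""
    return sorted(
        name for name, version in reported.items()
        if version != expected_version
    )


if __name__ == "__main__":
    # Hypothetical state after a deploy where the cache step was skipped.
    print(stale_layers("v42", {"api": "v42", "cache": "v41", "web": "v42"}))
```

The better fix is to move the manual step into the pipeline, but a check like this at least turns "it looks inconsistent" into a named cause.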

Timezones handled “later”

One application stored timestamps without timezone information and assumed everything was in UTC. Except parts of the system weren’t actually using UTC.

Some services wrote local time. Others converted inconsistently. The database just stored whatever it received.

This worked fine until reporting and scheduling features became important. Then it became a problem.

Events appeared out of order. Scheduled jobs ran at the wrong times. Fixing it required tracing where each timestamp originated and how it had been transformed along the way.

There was no single place to correct it, because the inconsistency was spread across the system.
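
The repair work tends to come down to one operation applied many times: take a naive timestamp, attach the offset its source was actually using, and convert to UTC. A minimal sketch with Python's standard `datetime` module follows. It uses fixed offsets to stay self-contained; a fixed offset ignores daylight saving changes, so a real migration would need proper zone rules (for example via `zoneinfo`) for any source that observed them.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive: datetime, source_offset: timedelta) -> datetime:
    """Interpret a naive timestamp as local time at `source_offset`
    and return the equivalent timezone-aware UTC timestamp."""
    if naive.tzinfo is not None:
        raise ValueError("expected a naive timestamp")
    local = naive.replace(tzinfo=timezone(source_offset))
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    # A service that wrote local time at UTC+2:
    written = datetime(2024, 1, 1, 12, 0)
    print(to_utc(written, timedelta(hours=2)).isoformat())
```

The hard part is not the conversion but knowing which offset applies to which rows, which is why the tracing described above cannot be skipped.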

The “just this once” firewall rule

A firewall rule had been added to allow access from a specific external IP for debugging.

It was meant to be temporary.

Over time, more exceptions were added in the same way. Different IPs, different ports, different services. None of them were cleaned up.

The result was a ruleset that technically restricted access, but in practice allowed far more than intended.

No one had a clear view of which rules were still necessary. Auditing them required reconstructing the context in which each one was added—which no longer existed.
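
The context is much cheaper to record at the time than to reconstruct later. As an illustration, here is one possible shape for that record: every exception carries a reason, an owner, and an expiry date, and a periodic check lists the ones that are overdue. The field names are hypothetical, and this tracks metadata only; it does not talk to any actual firewall.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class FirewallException:
    source_ip: str
    port: int
    reason: str   # why the rule was added
    owner: str    # who to ask before removing it
    expires: date  # when it should be reviewed or removed


def expired(rules: Iterable[FirewallException],
            today: date) -> List[FirewallException]:
    """Return the exceptions whose expiry date has passed."""
    return [rule for rule in rules if rule.expires < today]
```

A "temporary" rule with a date attached has a chance of actually being temporary.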

The system nobody understood

One company built a critical part of their infrastructure on top of a stack that no one on the team actually understood.

It wasn’t an obscure technology. It had good documentation and a large community. The problem was that the person who originally set it up had left, and no one else had taken the time to learn how it worked.

From that point on, every change followed the same pattern: search for the issue, find an example online, and try to apply it. Sometimes it worked. Sometimes it made things worse. Either way, no one was confident enough to remove or simplify anything afterward.

Over time, the configuration turned into a collection of copied fixes and partially understood settings.

Even small changes became slow and risky. There was no intuition, no clear mental model—just trial and error in production.

The system itself wasn’t broken. It ran. But no one was really in control of it anymore.

━━━━━━ ❖ ━━━━━━