Kaniko, NGINX Unit, Ingress-NGINX... The more operator tools you use, the more time you will spend replacing them.

The more operator tools you use, the more time you will spend replacing them after deprecation. Your processes might be well optimized, but a chain of deprecations forces you to re-solve problems you had already solved, and the changes you make along the way can leave your system less stable than it was.

The year 2025 was a year of deprecations:

  1. Kaniko was deprecated (link). Our team spent quite some time finding a replacement with comparable performance so that pipeline build times would not increase.

  2. NGINX Unit (link) was discontinued. Similarly, we had to find an application server that could handle high loads without slowing down.

  3. Ingress-NGINX (link) was discontinued, the most impactful of the three. The options were either to migrate to another ingress controller or to start using an API gateway.

Finding an “ideal” solution that fits your current needs doesn’t guarantee stability in the long term. One day, you might have to migrate to something new, introducing potential instability to your system.

Martin Fowler highlights studies showing AI degrades code quality. Paradoxically, this might be the best news for skilled engineers.

Martin Fowler shared a “fragment” referencing a Carnegie Mellon study on AI’s impact on open-source projects. The findings are not optimistic for code quality, but they offer a surprising silver lining for professional engineers.

“The key point is that the AI code probably reduced the quality of the code base… If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends.” — Martin Fowler

Signal Analysis

If this trend holds, tech professionals are in a stronger position than the hype suggests. We are not facing replacement, but a shift toward higher-level maintenance and architecture.

  1. The “Mess” Factor: Someone needs to clean up the technical debt AI is generating at scale.
  2. Long-term Maintainability: AI writes for the “now”; engineers write for the “future”. The demand for deep understanding of system architecture will likely increase.
  3. The Guardian Role: We are moving from “Code Writers” to “Code Reviewers” and “System Guardians”.

Read the full fragment at MartinFowler.com →

Flux Operator provides a native solution for ephemeral environments in GitLab. Fast setup, automatic cleanup, and simple configuration.

One of the simplest approaches I’ve used for managing ephemeral environments is the one provided by Flux Operator.

The configuration is straightforward, works out of the box, and can describe even complex environments. It makes an environment available for each merge request and, at the same time, provides fast termination and cleanup of resources.

Read the docs: ResourceSets for GitLab Merge Requests →

Stop watching videos. This open-source platform is the 'LeetCode' for DevOps: real AWS clusters, broken scenarios, and automated checks for CKA/CKS.

Preparing for CKA, CKS, or CKAD usually involves expensive courses or limited simulators. The SRE Learning Platform (ViktorUJ/cks) is an open-source alternative that provisions real environments on AWS.

It uses Terraform and Terragrunt to spin up clusters on Spot Instances (to keep costs low) and provides:

  • “LeetCode-style” Drills: Specific scenarios for CKA/CKS/CKAD.
  • Mock Exams: Full-scale practice exams with checking scripts (check_result).
  • Infrastructure as Code: You learn Terraform while setting up your learning environment.

Check out the repository on GitHub →

Flux finally gets a Web UI via Flux Operator. This solves the main adoption blocker—lack of visibility—without building custom tools.

Finally, FluxCD has a GUI (via Flux Operator). People say it looks like ArgoCD. I’ve never used Argo, but if that’s true, it’s a massive move for the ecosystem.

The main reason for Flux’s lower adoption was the lack of out-of-the-box visibility. Many teams want to see the status of resources directly, rather than relying on custom notifications or parsing logs when an update hangs.

At my current workplace, I had to build a feedback system that reports state back into the GitLab pipeline. But this approach is inefficient. It doesn’t make sense for every company to build its own solution just because the tool lacks a default feedback mechanism.

Check out the Flux Web UI →

Julius Volz explains why native Prometheus instrumentation is safer than OTel: you keep target health monitoring and gain significant performance.

With the massive hype around OpenTelemetry, it’s tempting to use its SDKs for everything. However, Julius Volz (co-founder of Prometheus) argues that for metrics, native Prometheus instrumentation is often superior.

Key takeaways:

  1. Target Health: OTel’s push model loses the context of “what should be running.” You lose the ability to detect silent failures (absent targets) that Prometheus’s Service Discovery + Pull model provides out of the box.
  2. Performance: Benchmarks show the Prometheus Go SDK can be up to 30x faster than the OTel SDK for simple counter increments (see the sketch after this list).
  3. Complexity: Mapping OTel attributes to Prometheus labels often requires complex target_info joins or awkward metric renaming strategies.
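
For context on points 1 and 2, here is a minimal sketch of native instrumentation with the prometheus/client_golang library. The metric name, port, and handler are placeholders, not anything from the linked post: a single counter on the hot path, exposed over /metrics so the pull model keeps working.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter registered with the default registry; promauto handles registration.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total",
	Help: "Total number of handled requests.",
})

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Inc() // the hot-path operation the benchmarks compare
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	// Exposing /metrics keeps the pull model: Prometheus scrapes this target,
	// so an absent instance shows up as up == 0 instead of silently missing data.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because the process is a scrape target discovered via service discovery, a dead instance is an explicit signal (up == 0) rather than a quiet gap in the data, which is exactly the target-health argument in point 1.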

Read the full analysis: Why I recommend native Prometheus instrumentation →

A strategy for zero-downtime upgrades: use physical replication for the bulk transfer, then switch to logical replication for the final cutover.

Upgrading a critical PostgreSQL cluster (e.g., v13 to v16) usually involves a choice: fast but risky pg_upgrade, or slow logical replication. Palark shared a practical method to get the best of both worlds.

The core strategy involves a clever pivot:

  1. Start with Physical Replication to bulk transfer data quickly (10x faster than logical).
  2. Switch to Logical Replication by advancing a replication slot to a specific LSN (Log Sequence Number) from the physical replica logs.
  3. Upgrade the Logical Replica and cut over with only seconds of downtime.

This method solves the “catch-up” problem of logical replication on large datasets.
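
Step 2 is the non-obvious part. Below is a minimal sketch of it, under assumptions that are mine rather than the article’s: a logical replication slot named upgrade_slot already exists on the old primary, and cutoverLSN is the position at which the physical replica stopped replaying WAL. Connection strings and values are placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

func main() {
	// Placeholder connection string for the old primary, which owns the slot.
	db, err := sql.Open("postgres", "host=old-primary dbname=app sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	cutoverLSN := "0/3000060" // placeholder LSN taken from the physical replica's logs

	// Fast-forward the logical slot so decoding starts exactly where physical
	// replication left off, instead of re-streaming the entire history.
	var slotName, endLSN string
	err = db.QueryRow(
		"SELECT slot_name, end_lsn FROM pg_replication_slot_advance($1, $2::pg_lsn)",
		"upgrade_slot", cutoverLSN,
	).Scan(&slotName, &endLSN)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("slot %s advanced to %s; the upgraded replica can now subscribe to this slot without an initial copy", slotName, endLSN)
}
```

The design idea is that the expensive initial copy happens over physical replication, so the logical subscription only has to stream the short tail of changes produced during the upgrade window before the cutover.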

Read the full guide: Upgrading PostgreSQL with no data loss →