Frustrated with YAML's limitations, Kubernetes developers created KYAML—their own HCL-like configuration language that addresses YAML's indentation, scaling, and debugging issues.


Today I read the article “What Would a Kubernetes 2.0 Look Like?” (thoughts on what the next major version might be) and found this :)

"YAML is just too much for what we're trying to do with k8s and it's not a safe enough format. Indentation is error-prone, the files don't scale great (you really don't want a super long YAML file), debugging can be annoying. YAML has so many subtle behaviors outlined in its spec."

"HCL is already the format for Terraform, so at least we'd only have to hate one configuration language instead of two. It's strongly typed with explicit types. There's already good validation mechanisms. It is specifically designed to do the job that we are asking YAML to do and it's not much harder to read."

and realized that Kubernetes developers had the same thoughts about YAML. But instead of adopting HCL, they invented their own HCL-like language: KYAML.
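For a taste of the format, here is a small manifest in KYAML style: flow-style braces instead of significant indentation, always-quoted string values, and trailing commas. The object names and values below are made up, and the exact syntax may still shift as the format evolves:

```yaml
# KYAML is still valid YAML (flow style), so existing YAML parsers
# keep working; names and values here are purely illustrative.
{
  apiVersion: "v1",
  kind: "ConfigMap",
  metadata: {
    name: "demo-config",
    namespace: "default",
  },
  data: {
    LOG_LEVEL: "debug",
  },
}
```

Because it avoids significant whitespace and bare scalars, the usual YAML footguns (indentation drift, the "Norway problem") simply don't apply.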

Platform Engineering focuses on velocity and Developer Experience by building golden paths, while SRE ensures stability and production health. Both must work together for success.

The article clarifies the distinction between Platform Engineering (focused on velocity and Developer Experience/DevEx) and Site Reliability Engineering (focused on stability and production health). It argues that while their daily tasks differ, they must be integrated: Platform Engineers build the “golden paths” that abstract infrastructure complexity, while SREs ensure those paths are robust, scalable, and monitored.

Read →

CRDs Catalog

Jan 26, 2026
A comprehensive catalog of popular Kubernetes CustomResourceDefinitions in JSON schema format, perfect for linting GitOps repositories.


If you, like me, use linters in the pipeline for GitOps repositories, this repo is the best thing you can use. It contains popular Kubernetes CRDs (CustomResourceDefinition) in JSON schema format.
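One way to wire the catalog into a pipeline, assuming you lint with kubeconform (the schema-location URL template follows the catalog's layout; double-check it against the repo's README):

```shell
# Validate a GitOps repo against upstream Kubernetes schemas plus the
# CRDs-catalog, so CRD-based manifests are no longer skipped.
kubeconform \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  -summary \
  ./manifests
```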

Repo →

A combination of AWS CodeBuild misconfiguration and predictable GitHub identifiers allowed admin access to the AWS GitHub account, as reported by Wiz.


You can do everything right and still be hacked through an official SDK. A couple of mistakes in AWS's own CodeBuild configuration (a CI/CD misconfiguration plus unanchored regular expressions), combined with predictable identifier generation in GitHub, granted admin access to the AWS GitHub organization. The Wiz team reported this case responsibly. But how many companies have made similar mistakes, and how many attackers may have already injected vulnerabilities into widely used libraries?
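The "unanchored regular expression" class of bug is worth internalizing. Here is a simplified Python illustration (not the actual AWS CodeBuild configuration): a webhook filter that intends to trust only the main branch, but matches the trusted ref anywhere in an attacker-controlled branch name.

```python
import re

# Intent: only allow CI webhook events for the protected main branch.
TRUSTED = r"refs/heads/main"

def is_trusted_unanchored(ref: str) -> bool:
    # re.search matches the pattern ANYWHERE in the string.
    return re.search(TRUSTED, ref) is not None

def is_trusted_anchored(ref: str) -> bool:
    # re.fullmatch requires the whole string to match.
    return re.fullmatch(TRUSTED, ref) is not None

# An attacker-controlled branch name that embeds the trusted string:
evil = "refs/heads/main-attacker/refs/heads/main"

print(is_trusted_unanchored(evil))             # True  -> pipeline runs attacker code
print(is_trusted_anchored(evil))               # False
print(is_trusted_anchored("refs/heads/main"))  # True
```

The fix is one line, which is exactly why this class of bug survives review so often.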

Read the Wiz Research: CodeBreach Vulnerability →

AWS now supports creating ECR repositories on push. A massive quality-of-life improvement that eliminates the need for 'workaround' infrastructure code.

Not a big feature, but a massive quality-of-life improvement. Automatic ECR repository creation is one of those features we’ve needed for a long time.

Literally a couple of weeks ago, we discussed with the team how to automate this to simplify life for both us and the developers. Now it’s native, and we won’t have to spend time building “workarounds” or custom Lambda triggers just to docker push a new service.

Read the announcement: Creating repositories on push →

Kaniko, NGINX Unit, Ingress-NGINX... The more third-party tools you operate, the more time you will spend replacing them.


The more tools you operate, the more time you will spend replacing them after deprecation. Your processes might be well-optimized, but a chain of deprecations forces you to re-solve problems you had already solved, and the required changes can leave your system less stable than before.

2025 was a year of deprecations:

  1. Kaniko was deprecated (link). Our team spent quite some time finding a solution with similar performance to avoid increasing pipeline build times.

  2. NGINX Unit (link) was discontinued. Similarly, we had to find an application server that could handle high loads without slowing down.

  3. Ingress-NGINX (link) was discontinued—the most impactful. The options were either to migrate to another solution or start using an API gateway.

Finding an “ideal” solution that fits your current needs doesn’t guarantee stability in the long term. One day, you might have to migrate to something new, introducing potential instability to your system.

Martin Fowler highlights studies showing AI degrades code quality. Paradoxically, this might be the best news for skilled engineers.

Martin Fowler shared a “fragment” referencing a Carnegie Mellon study on AI’s impact on open-source projects. The findings are not optimistic for code quality, but they offer a surprising silver lining for professional engineers.

“The key point is that the AI code probably reduced the quality of the code base… If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends.” — Martin Fowler

Signal Analysis

If this trend holds true, tech professionals are protected. We are not facing replacement, but a shift toward higher-level maintenance and architecture.

  1. The “Mess” Factor: Someone needs to clean up the technical debt AI is generating at scale.
  2. Long-term Maintainability: AI writes for the “now”, engineers write for the “future”. The demand for deep understanding of system architecture will likely increase.
  3. The Guardian Role: We are moving from “Code Writers” to “Code Reviewers” and “System Guardians”.

Read the full fragment at MartinFowler.com →

Flux Operator provides a native solution for ephemeral environments in GitLab. Fast setup, automatic cleanup, and simple configuration.

One of the simplest solutions I’ve used for managing temporary (ephemeral) environments is the one provided by Flux Operator.

The configuration is straightforward and offers an out-of-the-box solution that can describe even complex environments. This setup makes environments available for a merge request and, at the same time, provides fast termination and cleanup of resources.
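Roughly, the setup pairs an input provider watching GitLab merge requests with a templated ResourceSet. The sketch below follows the Flux Operator docs, with the project URL, label, and resource names invented for illustration; check the docs for exact field names:

```yaml
# Watches open MRs on a GitLab project (hypothetical URL and label).
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSetInputProvider
metadata:
  name: app-merge-requests
  namespace: apps
spec:
  type: GitLabMergeRequest
  url: https://gitlab.com/my-group/my-app
  secretRef:
    name: gitlab-token          # token with API read access
  filter:
    labels:
      - "deploy/preview"        # only MRs labeled for preview
---
# Templates one set of resources per open MR; resources are garbage-
# collected when the MR is merged or closed.
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSet
metadata:
  name: app-previews
  namespace: apps
spec:
  inputsFrom:
    - kind: ResourceSetInputProvider
      name: app-merge-requests
  resources:
    - apiVersion: v1
      kind: Namespace
      metadata:
        name: app-preview-<< inputs.id >>
```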

Read the docs: ResourceSets for GitLab Merge Requests →

Stop watching videos. This open-source platform is the 'LeetCode' for DevOps: real AWS clusters, broken scenarios, and automated checks for CKA/CKS.

Preparing for CKA, CKS, or CKAD usually involves expensive courses or limited simulators. The SRE Learning Platform (ViktorUJ/cks) is an open-source alternative that provisions real environments on AWS.

It uses Terraform and Terragrunt to spin up clusters on Spot Instances (to keep costs low) and provides:

  • “LeetCode-style” Drills: Specific scenarios for CKA/CKS/CKAD.
  • Mock Exams: Full-scale practice exams with checking scripts (check_result).
  • Infrastructure as Code: You learn Terraform while setting up your learning environment.

Check out the repository on GitHub →

Flux finally gets a Web UI via Flux Operator. This solves the main adoption blocker—lack of visibility—without building custom tools.

Finally, FluxCD has a GUI (via Flux Operator). People say it looks like ArgoCD. I’ve never used Argo, but if that’s true, it’s a massive move for the ecosystem.

The main reason for Flux’s lower adoption was the lack of out-of-the-box visibility. Many teams want to see the status of resources directly, rather than relying on custom notifications or parsing logs when an update hangs.

At my current workplace, I had to build a feedback system that reports state back into the GitLab pipeline. But this approach is inefficient. It doesn’t make sense for every company to build its own solution just because the tool lacks a default feedback mechanism.

Check out the Flux Web UI →

Julius Volz explains why native Prometheus instrumentation is safer than OTel: you keep target health monitoring and gain significant performance.

With the massive hype around OpenTelemetry, it’s tempting to use its SDKs for everything. However, Julius Volz (co-founder of Prometheus) argues that for metrics, native Prometheus instrumentation is often superior.

Key takeaways:

  1. Target Health: OTel’s push model loses the context of “what should be running.” You lose the ability to detect silent failures (absent targets) that Prometheus’s Service Discovery + Pull model provides out of the box.
  2. Performance: Benchmarks show Prometheus Go SDKs can be up to 30x faster than OTel SDKs for simple counter increments.
  3. Complexity: Mapping OTel attributes to Prometheus labels often requires complex target_info joins or awkward metric renaming strategies.
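To make the first point concrete, here is a stdlib-only Python sketch of the pull model: the process keeps counters in memory and exposes them at /metrics in the Prometheus text exposition format, so a target that stops answering scrapes is itself a detectable failure (`up == 0`). This is not the official prometheus_client API; real code should use a client library, and the metric name here is illustrative.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process state; a real client library manages this for you.
COUNTERS = {"http_requests_total": 0}

def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(COUNTERS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

# To expose for scraping:
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

The contrast with OTel is that nothing here pushes anywhere: Prometheus's service discovery knows the target should exist, so silence is a signal rather than a blind spot.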

Read the full analysis: Why I recommend native Prometheus instrumentation →

A strategy for zero-downtime upgrades: use physical replication for the bulk transfer, then switch to logical replication for the final cutover.

Upgrading a critical PostgreSQL cluster (e.g., v13 to v16) usually involves a choice: fast but risky pg_upgrade, or slow logical replication. Palark shared a practical method to get the best of both worlds.

The core strategy involves a clever pivot:

  1. Start with Physical Replication to bulk transfer data quickly (10x faster than logical).
  2. Switch to Logical Replication by advancing a replication slot to a specific LSN (Log Sequence Number) from the physical replica logs.
  3. Upgrade the Logical Replica and cut over with only seconds of downtime.

This method solves the “catch-up” problem of logical replication on large datasets.
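In SQL terms, the pivot looks roughly like the following sketch. Slot and publication names, the connection string, and the LSN are illustrative, and the article's real procedure includes more safety steps:

```sql
-- 1) On the old (v13) primary: a publication and a logical slot.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

-- 2) Let the physical replica catch up, detach it, note the LSN it
--    stopped at, then fast-forward the logical slot to that point so
--    logical decoding starts exactly where physical replication ended.
SELECT pg_replication_slot_advance('upgrade_slot', '0/5D000110'::pg_lsn);

-- 3) On the promoted, pg_upgrade'd replica: subscribe without the
--    initial copy (the data is already there from physical replication).
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=app user=replicator'
  PUBLICATION upgrade_pub
  WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
```

The `copy_data = false` / `pg_replication_slot_advance` pair is the trick: logical replication only has to carry the delta, not the whole dataset.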

Read the full guide: Upgrading PostgreSQL with no data loss →