Blog

Modernizing Public Safety Networking: Spearheading eBPF with Cilium

As part of the current move to Kubernetes, the adoption of Cilium and eBPF is underway. This strategic modernization aims to replace legacy iptables routing with a high-performance, kernel-native networking foundation, providing the sub-millisecond latency and deep observability required for real-time crisis management.

The Image is the OS: Scaling Immutable Infrastructure with Bootable Containers (bootc)

The future of the operating system involves bootc (Bootable Containers), representing the final frontier of Platform Engineering. This transition enables managing the entire OS lifecycle using OCI-compliant GitOps workflows similar to those for applications, aiming for a declarative, immutable foundation.

Unifying Observability: Eliminating Monitoring Silos with OpenTelemetry

In mission-critical systems, fragmented monitoring is a liability. OpenTelemetry represents the production observability standard, unifying traces, metrics, and logs into a single, vendor-neutral pipeline. This provides high-fidelity data required to maintain public safety infrastructure with absolute confidence.

Scalable Monitoring: Why VictoriaMetrics is the Modern Alternative to Prometheus

As infrastructure grows, so does the volume of metrics. VictoriaMetrics offers a high-performance, cost-effective monitoring solution that remains compatible with the Prometheus ecosystem while providing better compression, lower resource usage, and simpler horizontal scaling.

Infrastructure as Data: Building a Self-Service Platform with Crossplane

As mission-critical infrastructure moves to Kubernetes, Crossplane is being used to transform cloud resources into declarative data. This transition is the foundation of an Internal Developer Platform (IDP), aiming to eliminate provisioning bottlenecks and empower engineering teams with a self-service model.

Visual Systems Thinking: Mapping Complex Kubernetes Architectures

In complex distributed systems, YAML is not a substitute for a mental model. Automated visualization was used to transform "Black Box" manifests into clear, navigable architectural diagrams, ensuring a shared understanding of mission-critical public safety infrastructure.

Secure AI Adoption: Local LLMs with Ollama for Enterprise Privacy

GenAI promises a large productivity boost, but for mission-critical SRE and development, public APIs are a non-starter. Running local LLMs with Ollama brings AI-assisted coding to teams without leaking proprietary code or public safety data.

Database Performance at Scale: Tuning CloudNativePG for High-Throughput Workloads

Moving mission-critical databases to Kubernetes is only half the battle. CloudNativePG clusters were tuned to support a 10x increase in connection volume and high-throughput ETL pipelines, showing that self-managed database performance can rival, and in some workloads exceed, managed cloud services.

AI-Assisted SRE: Spearheading Triage with K8sGPT and Local Inference

As critical infrastructure migrates to Kubernetes, work is underway to integrate K8sGPT with local LLMs. The initiative aims to automate the initial triage of cluster failures, turning cryptic logs into actionable remediation steps while keeping all data inside production environments.

Beyond the Theory: Implementing Google SRE Principles in High-Stakes Environments

Google's SRE principles provide the blueprint, but real-world execution is a cultural challenge. During past projects, "The SRE Book" was translated into actionable engineering practices, using SLOs as Code and Error Budgets to balance rapid deployment with the uncompromising uptime required for public safety.
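The error-budget arithmetic behind that trade-off is simple enough to sketch. The snippet below is a minimal illustration, not the tooling used in the projects described; the function names and the 30-day window are assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    `slo` is a fraction such as 0.999 (hypothetical example target).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining_fraction(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, 30)
    print(f"30-day budget at 99.9%: {budget:.1f} minutes")
    left = budget_remaining_fraction(0.999, 30, downtime_minutes=10.0)
    print(f"budget remaining after 10 minutes of downtime: {left:.0%}")
```

A team might gate risky deployments on `budget_remaining_fraction` staying above some agreed threshold; the threshold itself is a policy decision, not something the arithmetic provides.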

Ransomware-Proof Infrastructure: Immutable Backups with MinIO and S3 Object Lock

In mission-critical SRE, "having a backup" is no longer enough. A production-grade immutable storage vault has been implemented using MinIO and S3 Object Lock. This provides a "Write Once, Read Many" (WORM) guarantee for offsite backups, ensuring that public safety data remains impervious to ransomware or accidental deletion.
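To make the WORM guarantee concrete, here is a toy in-memory model of retention-based locking. It is a conceptual sketch only: it does not call MinIO or the S3 API, and the class and method names (`WormVault`, `put`, `delete`) are hypothetical. It assumes a strict, compliance-style retention in which nobody can delete an object version before its retain-until date.

```python
from datetime import datetime, timedelta, timezone


class RetentionError(Exception):
    """Raised when a delete is attempted before the retain-until date."""


class WormVault:
    """Toy model: every write creates a new version with its own retention date."""

    def __init__(self) -> None:
        # Maps (key, version) to the moment the version becomes deletable.
        self._versions: dict[tuple[str, int], datetime] = {}
        self._next_version = 0

    def put(self, key: str, retention_days: int, now: datetime) -> int:
        """Store a new version and return its version number."""
        version = self._next_version
        self._next_version += 1
        self._versions[(key, version)] = now + timedelta(days=retention_days)
        return version

    def delete(self, key: str, version: int, now: datetime) -> None:
        """Delete a version, but only once its retention period has passed."""
        retain_until = self._versions[(key, version)]
        if now < retain_until:
            raise RetentionError(f"{key} v{version} is locked until {retain_until.isoformat()}")
        del self._versions[(key, version)]


if __name__ == "__main__":
    vault = WormVault()
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    v = vault.put("backups/db.dump", retention_days=30, now=start)
    try:
        vault.delete("backups/db.dump", v, now=start + timedelta(days=1))
    except RetentionError as exc:
        print("delete blocked:", exc)
```

The point of the model is that the lock is enforced by the store rather than by the client, so compromised backup credentials cannot shorten the retention window.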

Secure Data Governance: Scalable Database Management with pgAdmin and OAuth2

Granting developers access to production data shouldn't mean compromising on security. pgAdmin was deployed on Kubernetes with OIDC integration, replacing insecure port-forwarding with a centralized, identity-aware portal for the entire CloudNativePG fleet.

Enterprise PostgreSQL on Kubernetes: Achieving High Availability with CloudNativePG

Managing stateful workloads on Kubernetes is a Tier-1 SRE challenge. Implementing CloudNativePG (CNPG) as a production standard has enabled the creation of a self-healing, highly available PostgreSQL platform that matches RDS in reliability while surpassing it in portability and control.

Automating High-Availability PostgreSQL on AWS: A Deep Dive into Trusted Postgres Architect (TPA)

Deploying production-ready PostgreSQL clusters requires more than just `apt-get install`. Trusted Postgres Architect (TPA) by EDB brings Infrastructure as Code (IaC) principles to database orchestration, allowing you to provision, configure, and manage highly available clusters on AWS EC2 with Ansible-driven automation.

The Developer Interface: Optimizing the Physical-to-Digital Bridge with QMK

For an SRE, the most important tool isn't the terminal; it's the interface that connects the mind to the machine. Over the years, QMK Firmware has been used to build customized, programmable keyboards that reduce repetitive strain and accelerate mission-critical workflows.

Converged Reliability: Strategic Hybrid Cloud with Harvester and Rancher

Cloud-Native doesn't always mean "In the Cloud." During recent infrastructure projects, Harvester was used to transform bare-metal hardware into a private cloud, bridging the gap between legacy VMs and modern Kubernetes workloads with a single, unified control plane.

The SRE Knowledge Graph: Building a Second Brain for Mission-Critical Operations

In an SRE career spanning 26 years, the most valuable asset isn't just the code; it's the accumulated knowledge of how systems fail and recover. Throughout that career, Logseq has been used to build a private, graph-based "Second Brain," transforming scattered notes into a searchable, interconnected knowledge base for mission-critical operations.

Mapping the Monolith: Visualizing 26 Years of Systems Integration with yEd

In an era of browser-based tools, yEd remains the SRE's secret weapon for mapping chaotic systems. Algorithmic layouts are consistently used to transform hundreds of undocumented database relationships into clear, hierarchical maps, providing the clarity needed to scale complex platforms.