DevOps/Platform Engineer

Permanent employee, Full-time · London

Your role
Overview:
  • Build and run a reliable platform for services and data workflows across Kubernetes and Prefect.
  • Own CI/CD, observability, security, and developer experience for Python/Go/Rust services.
Responsibilities:
  • Design, provision, and operate Kubernetes workloads (deployments, networking, autoscaling, storage).
  • Build and maintain GitLab CI/CD pipelines for Python, Go, and Rust services (build, test, scan, release).
  • Operate Prefect (agents, work queues, deployments, concurrency limits, task execution environments).
  • Implement environment strategy and promotion flow (dev/staging/prod) with clear release gates.
  • Create golden paths and templates for FastAPI microservices and Prefect flows.
  • Manage secrets, configuration, and access (e.g., GitLab CI/CD variables, Kubernetes Secrets).
  • Establish observability: logging, metrics, traces, alerting, runbooks, and SLOs.
  • Operate data stores (MySQL, PostgreSQL, Redis): provisioning, backups, migration execution, monitoring, and capacity planning.
  • Optimise build and runtime costs (container images, caching, autoscaling, resource requests/limits).
  • Lead incident response, postmortems, and reliability improvements.
Your profile
You have:
  • 4+ years in DevOps/SRE/Platform roles with production Kubernetes.
  • Strong GitLab CI/CD experience (pipelines, runners, caching, artifact management).
  • Proficiency with containers and image optimisation; comfortable with Linux internals and networking.
  • Hands-on experience with Prefect in production (deployments, flow orchestration, storage, results).
  • Familiarity with operating MySQL/PostgreSQL/Redis in production (availability, performance, backups).
  • Scripting/automation with Python or Go; ability to read Rust build pipelines.
  • Solid understanding of security fundamentals (least privilege, image scanning, SBOM, secret hygiene).
  • Experience instrumenting systems and creating actionable alerts.
Nice to have:
  • Helm/Kustomize, policy-as-code (OPA), and basic gRPC.
  • Performance tuning for high‑throughput data or API services.
  • Experience in multi‑tenant or multi‑cluster environments.
About us
Stelia is building a global distributed intelligence platform empowering enterprises to integrate and scale AI’s limitless potential. By optimising data mobility and connecting diverse AI resources, Stelia simplifies distributed AI operations, making them accessible anywhere. Committed to innovation, collaboration, and simplicity, we enable enterprises to lead the AI revolution and drive transformative change across all sectors.
We look forward to hearing from you!