Requirements
What you’ll need
- Significant experience operating production systems and improving on-call health
- Depth in observability tooling (metrics, logs, traces) and alert design
- Automation skills in Python, Go, or similar to codify runbooks and guardrails
- Proficiency with incident management practices and stakeholder communication
- Understanding of scalability basics: load patterns, capacity planning, and failure modes
- Based in New York City or remote
Nice to have
- Experience running chaos experiments or game days and applying the lessons learned
- Hands-on experience with Kubernetes platform operations or service mesh reliability
- Familiarity with database and queue resilience patterns (replication, backpressure, retries)
- Security-minded approach to operations (secrets rotation, least privilege, hardened pipelines)
- Track record mentoring teams on SLOs, on-call hygiene, and incident leadership
Apply