Full‑time

Site Reliability Engineer

New York City / Remote

Description

About the role

Build and govern SLOs, automate operations, and improve resilience across services and platforms.

You will define and measure reliability, lead incident response improvements, and automate away toil. You’ll partner with platform and product teams to design observable, failure-tolerant systems with clear ownership and runbooks.

Responsibilities

Establish and evolve SLOs/error budgets, and align alerting to customer-impacting signals
Build high-signal dashboards, logging, and tracing for critical user journeys
Automate runbooks and chaos/failover drills to reduce toil for on-call teams
Lead post‑incident reviews, drive actions to completion, and track reliability trends
Partner on architecture and capacity plans that balance reliability, latency, and cost

Requirements

What you’ll need

Significant experience operating production systems and improving on-call health
Depth in observability tooling (metrics, logs, traces) and alert design
Automation skills in Python, Go, or similar to codify runbooks and guardrails
Proficiency with incident management practices and stakeholder communication
Understanding of scalability basics: load patterns, capacity planning, and failure modes
Based in New York City or remote

Nice to have

Experience running chaos experiments or game days and integrating learnings
Hands-on experience with Kubernetes platform operations or service mesh reliability
Familiarity with database and queue resilience patterns (replication, backpressure, retries)
Security-minded operations (secrets rotation, least privilege, hardened pipelines)
Track record mentoring teams on SLOs, on-call hygiene, and incident leadership
