Skybitra
Full‑time

Site Reliability Engineer

New York City / Remote

Description

About the role

Build and govern SLOs, automate operations, and improve resilience across services and platforms.

You will define and measure reliability, lead incident response improvements, and automate away toil. You’ll partner with platform and product teams to design observable, failure-tolerant systems with clear ownership and runbooks.

Responsibilities

  • Establish and evolve SLOs/error budgets, and align alerting to customer-impacting signals
  • Build high-signal dashboards, logging, and tracing for critical user journeys
  • Automate runbooks and chaos/failover drills, and reduce toil for on-call teams
  • Lead post‑incident reviews, drive actions to completion, and track reliability trends
  • Partner on architecture and capacity plans that balance reliability, latency, and cost

Requirements

What you’ll need

  • Significant experience operating production systems and improving on-call health
  • Depth in observability tooling (metrics, logs, traces) and alert design
  • Automation skills in Python, Go, or similar to codify runbooks and guardrails
  • Proficiency with incident management practices and stakeholder communication
  • Understanding of scalability basics: load patterns, capacity planning, and failure modes
  • Based in New York City or able to work remotely

Nice to have

  • Experience running chaos experiments or game days and integrating learnings
  • Hands-on experience with Kubernetes platform operations or service mesh reliability
  • Database and queue resilience patterns (replication, backpressure, retries)
  • Security-minded operations (secrets rotation, least privilege, hardened pipelines)
  • Track record mentoring teams on SLOs, on-call hygiene, and incident leadership
