Job Description

About the Role

We are seeking a Site Reliability Engineer (SRE) to own the availability, resilience, and operational readiness of our cloud-native platform running on AWS. This role is responsible for ensuring our systems are designed to tolerate failure, recover quickly, and support safe, continuous delivery.

As an SRE, you will apply software engineering principles to infrastructure, operations, and incident response. You will partner closely with application engineers to balance delivery velocity with reliability, and you will have clear ownership of high availability (HA), disaster recovery (DR), and incident management preparedness.

Core Responsibilities

Reliability, Availability & Disaster Recovery

Own and continuously evolve the High Availability (HA) and Disaster Recovery (DR) strategy across all production systems
Define, document, and enforce service reliability targets, including availability objectives and recovery expectations
Design and maintain resilient architectures using AWS managed services
Establish and validate RTO and RPO targets for applications and data stores
Design, document, and execute disaster recovery simulations, game days, and real failover testing
Implement backup, restore, replication, and failover strategies for managed databases, Kafka, and OpenSearch
Identify and eliminate single points of failure across infrastructure, pipelines, and operational processes
Ensure DR plans are tested, trusted, and continuously improved, not just documented

Incident Management & Operational Readiness

Own incident management preparedness, including tooling, runbooks, escalation paths, and communication practices
Participate in and lead incident response for availability-impacting events
Conduct blameless post-incident reviews and drive corrective actions to completion
Improve systems and processes based on incident learnings
Ensure on-call rotations, alerts, and monitoring are actionable and sustainable
Design systems so that failures are expected, detected quickly, and recoverable

Release Engineering & Continuous Delivery

Own the reliability aspects of continuous delivery and release management
Design, build, and improve CI/CD pipelines using AWS CodePipeline and related services
Define and implement safe deployment strategies (e.g., rolling, blue/green, canary)
Build automated validation, rollback, and deployment safety mechanisms
Partner with engineering teams to reduce deployment risk, downtime, and mean time to recovery
Balance release velocity against reliability using data-driven decision-making

Platform Engineering & DevSecOps

Build and maintain infrastructure using CloudFormation as infrastructure as code
Deploy and operate Docker-based workloads on ECS Fargate
Embed security controls into build, deploy, and runtime stages (DevSecOps)
Secure dependencies and artifacts using AWS CodeArtifact
Collaborate with development teams using Bitbucket-based workflows
Implement observability best practices, including metrics, logs, tracing, and alerts
Apply AWS best practices for IAM, networking, encryption, and secrets management

Required Qualifications

Strong production experience operating systems on AWS
Hands-on experience with containerized workloads on ECS Fargate
Proven experience owning system reliability, availability, and recovery
Experience designing and executing disaster recovery tests and failover simulations
Experience participating in or leading incident response
Strong understanding of CI/CD, release engineering, and deployment strategies
Hands-on experience with CloudFormation or equivalent infrastructure-as-code tools
Experience working with Bitbucket or similar source control systems
Familiarity with managed databases, Kafka, and OpenSearch
Strong scripting and automation skills (e.g., Python, Bash)

Preferred Qualifications

Experience defining and operating with SLOs, SLIs, and error budgets
Experience running DR game days in production environments
Familiarity with SRE or production-readiness review practices
Experience integrating security scanning and controls into CI/CD pipelines
AWS certifications (DevOps Engineer, Solutions Architect, or Security Specialty)
Experience supporting compliance or audit-driven environments

What Success Looks Like

Systems consistently meet defined availability and recovery objectives
Disaster recovery plans are regularly tested and trusted
Incidents are handled calmly and efficiently, and lead to lasting improvements
Deployments are routine, low-risk, and automated
Engineering teams ship faster with confidence in platform reliability

What We Offer

Clear ownership of platform reliability and operational excellence
Modern AWS-native architecture using managed services
A culture that values engineering rigor, resilience, and learning from failure
Competitive compensation and benefits
Flexible work environment

Important Note for Candidates

This role includes shared on-call responsibilities and active participation in incident response. We believe reliable systems are built by engineers who are empowered to improve them.

Benefits

Competitive salary and a stock option plan (subject to approval).
5 weeks of PTO.
5 sick leave days.
Multisport card.
Flexible work hours and a hybrid work setup.
Professional growth and development opportunities.
Global, collaborative, and inclusive company culture.
