Job Description
About the Role
We are seeking a Site Reliability Engineer (SRE) to own the availability, resilience, and operational readiness of our cloud-native platform running on AWS. This role is responsible for ensuring our systems are designed to tolerate failure, recover quickly, and support safe, continuous delivery.
As an SRE, you will apply software engineering principles to infrastructure, operations, and incident response. You will partner closely with application engineers to balance delivery velocity with reliability, and you will have clear ownership of high availability (HA), disaster recovery (DR), and incident management preparedness.
Core Responsibilities
Reliability, Availability & Disaster Recovery
- Own and continuously evolve the High Availability (HA) and Disaster Recovery (DR) strategy across all production systems
- Define, document, and enforce service reliability targets, including availability objectives and recovery expectations
- Design and maintain resilient architectures using AWS managed services
- Establish and validate RTO and RPO targets for applications and data stores
- Design, document, and execute disaster recovery simulations, game days, and real failover testing
- Implement backup, restore, replication, and failover strategies for managed databases, Kafka, and OpenSearch
- Identify and eliminate single points of failure across infrastructure, pipelines, and operational processes
- Ensure DR plans are tested, trusted, and continuously improved, not just documented
Incident Management & Operational Readiness
- Own incident management preparedness, including tooling, runbooks, escalation paths, and communication practices
- Participate in and lead incident response for availability-impacting events
- Conduct blameless post-incident reviews and drive corrective actions to completion
- Improve systems and processes based on incident learnings
- Ensure on-call rotations, alerts, and monitoring are actionable and sustainable
- Design systems so that failures are expected, detected quickly, and recoverable
Release Engineering & Continuous Delivery
- Own the reliability aspects of continuous delivery and release management
- Design, build, and improve CI/CD pipelines using AWS CodePipeline and related services
- Define and implement safe deployment strategies (e.g., rolling, blue/green, canary)
- Build automated validation, rollback, and deployment safety mechanisms
- Partner with engineering teams to reduce deployment risk, downtime, and mean time to recovery
- Balance release velocity against reliability using data-driven decision-making
Platform Engineering & DevSecOps
- Build and maintain infrastructure using CloudFormation as infrastructure as code
- Deploy and operate Docker-based workloads on ECS Fargate
- Embed security controls into build, deploy, and runtime stages (DevSecOps)
- Secure dependencies and artifacts using AWS CodeArtifact
- Collaborate with development teams using Bitbucket-based workflows
- Implement observability best practices, including metrics, logs, tracing, and alerts
- Apply AWS best practices for IAM, networking, encryption, and secrets management
Required Qualifications
- Strong experience operating production systems on AWS
- Hands-on experience with containerized workloads on ECS Fargate
- Proven experience owning system reliability, availability, and recovery
- Experience designing and executing disaster recovery tests and failover simulations
- Experience participating in or leading incident response
- Strong understanding of CI/CD, release engineering, and deployment strategies
- Hands-on experience with CloudFormation or equivalent infrastructure-as-code tools
- Experience working with Bitbucket or similar source control systems
- Familiarity with managed databases, Kafka, and OpenSearch
- Strong scripting and automation skills (e.g., Python, Bash)
Preferred Qualifications
- Experience defining and operating with SLOs, SLIs, and error budgets
- Experience running DR game days in production environments
- Familiarity with SRE or production-readiness review practices
- Experience integrating security scanning and controls into CI/CD pipelines
- AWS certifications (DevOps Engineer, Solutions Architect, or Security Specialty)
- Experience supporting compliance or audit-driven environments
What Success Looks Like
- Systems consistently meet defined availability and recovery objectives
- Disaster recovery plans are regularly tested and trusted
- Incidents are handled calmly, efficiently, and lead to lasting improvements
- Deployments are routine, low-risk, and automated
- Engineering teams ship faster with confidence in platform reliability
What We Offer
- Clear ownership of platform reliability and operational excellence
- Modern AWS-native architecture using managed services
- A culture that values engineering rigor, resilience, and learning from failure
- Competitive compensation and benefits
- Flexible work environment
Important Note for Candidates
This role includes shared on-call responsibilities and active participation in incident response. We believe reliable systems are built by engineers who are empowered to improve them.
Benefits
- Competitive salary and stock option plan (subject to approval).
- 5 weeks of PTO.
- 5 sick leave days.
- Multisport card.
- Flexible work hours and a hybrid work setup.
- Professional growth and development opportunities.
- Global, collaborative, and inclusive company culture.