TruU

Site Reliability Engineer

Job Description

About the Role

We are seeking a Site Reliability Engineer (SRE) to own the availability, resilience, and operational readiness of our cloud-native platform running on AWS. This role is responsible for ensuring our systems are designed to tolerate failure, recover quickly, and support safe, continuous delivery.

As an SRE, you will apply software engineering principles to infrastructure, operations, and incident response. You will partner closely with application engineers to balance delivery velocity with reliability, and you will have clear ownership of high availability (HA), disaster recovery (DR), and incident management preparedness.

Core Responsibilities

Reliability, Availability & Disaster Recovery

  • Own and continuously evolve the High Availability (HA) and Disaster Recovery (DR) strategy across all production systems
  • Define, document, and enforce service reliability targets, including availability objectives and recovery expectations
  • Design and maintain resilient architectures using AWS managed services
  • Establish and validate RTO and RPO targets for applications and data stores
  • Design, document, and execute disaster recovery simulations, game days, and real failover testing
  • Implement backup, restore, replication, and failover strategies for managed databases, Kafka, and OpenSearch
  • Identify and eliminate single points of failure across infrastructure, pipelines, and operational processes
  • Ensure DR plans are tested, trusted, and continuously improved, not just documented

Incident Management & Operational Readiness

  • Own incident management preparedness, including tooling, runbooks, escalation paths, and communication practices
  • Participate in and lead incident response for availability-impacting events
  • Conduct blameless post-incident reviews and drive corrective actions to completion
  • Improve systems and processes based on incident learnings
  • Ensure on-call rotations, alerts, and monitoring are actionable and sustainable
  • Design systems so that failures are expected, detected quickly, and recoverable

Release Engineering & Continuous Delivery

  • Own the reliability aspects of continuous delivery and release management
  • Design, build, and improve CI/CD pipelines using AWS CodePipeline and related services
  • Define and implement safe deployment strategies (e.g., rolling, blue/green, canary)
  • Build automated validation, rollback, and deployment safety mechanisms
  • Partner with engineering teams to reduce deployment risk, downtime, and mean time to recovery
  • Balance release velocity against reliability using data-driven decision-making

Platform Engineering & DevSecOps

  • Build and maintain infrastructure using CloudFormation as infrastructure as code
  • Deploy and operate Docker-based workloads on ECS Fargate
  • Embed security controls into build, deploy, and runtime stages (DevSecOps)
  • Secure dependencies and artifacts using AWS CodeArtifact
  • Collaborate with development teams using Bitbucket-based workflows
  • Implement observability best practices, including metrics, logs, tracing, and alerts
  • Apply AWS best practices for IAM, networking, encryption, and secrets management

Required Qualifications

  • Strong production experience operating systems on AWS
  • Hands-on experience with containerized workloads on ECS Fargate
  • Proven experience owning system reliability, availability, and recovery
  • Experience designing and executing disaster recovery tests and failover simulations
  • Experience participating in or leading incident response
  • Strong understanding of CI/CD, release engineering, and deployment strategies
  • Hands-on experience with CloudFormation or equivalent infrastructure-as-code tools
  • Experience working with Bitbucket or similar source control systems
  • Familiarity with managed databases, Kafka, and OpenSearch
  • Strong scripting and automation skills (e.g., Python, Bash)

Preferred Qualifications

  • Experience defining and operating with SLOs, SLIs, and error budgets
  • Experience running DR game days in production environments
  • Familiarity with SRE or production-readiness review practices
  • Experience integrating security scanning and controls into CI/CD pipelines
  • AWS certifications (DevOps Engineer, Solutions Architect, or Security Specialty)
  • Experience supporting compliance or audit-driven environments

What Success Looks Like

  • Systems consistently meet defined availability and recovery objectives
  • Disaster recovery plans are regularly tested and trusted
  • Incidents are handled calmly, efficiently, and lead to lasting improvements
  • Deployments are routine, low-risk, and automated
  • Engineering teams ship faster with confidence in platform reliability

What We Offer

  • Clear ownership of platform reliability and operational excellence
  • Modern AWS-native architecture using managed services
  • A culture that values engineering rigor, resilience, and learning from failure
  • Competitive compensation and benefits
  • Flexible work environment

Important Note for Candidates

This role includes shared on-call responsibilities and active participation in incident response. We believe reliable systems are built by engineers who are empowered to improve them.

Benefits

  • Competitive salary and a stock option plan (subject to approval).
  • 5 weeks of PTO.
  • 5 sick leave days.
  • Multisport card.
  • Flexible work hours and a hybrid work setup.
  • Professional growth and development opportunities.
  • Global, collaborative, and inclusive company culture.