Job description

Staff Site Reliability Engineer (Remote, US)

Compensation: $200K–$250K + Equity

Full-Time | Remote | Infrastructure Team

We’re hiring a Staff Site Reliability Engineer to help scale and maintain the massive GPU infrastructure that powers our cutting-edge AI systems. If you’re passionate about building robust, scalable systems and solving deep infrastructure challenges, this role is for you.

What You’ll Be Doing

Work closely with engineers and researchers to define and meet system performance, availability, and efficiency requirements.
Operate and manage thousands of GPUs distributed across multiple cloud providers and clusters.
Design scalable solutions to support rapid growth in compute demands for AI model training, data processing, and inference.
Build resilient, fault-tolerant systems to ensure continuous uptime and seamless performance.
Develop automation tools to eliminate toil and streamline infrastructure operations.
Set up and maintain monitoring systems to proactively detect issues and drive performance improvements.
Define and track SLOs and SLIs that uphold system reliability standards.
Participate in an on-call rotation to ensure 24/7 system availability.

Qualifications

7+ years of proven experience as a reliability engineer, infrastructure engineer, or production engineer in fast-paced, high-growth environments.
Deep knowledge of GPU infrastructure, including scheduling, scaling, cloud networking, storage, and security.
Proficiency in one or more scripting or programming languages.
Strong experience with Kubernetes or similar container orchestration systems.
Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.
Experience working with observability tools like Prometheus, Grafana, DataDog, ELK, or Splunk.
Excellent troubleshooting, debugging, and systems-thinking skills.
Strong communication skills and a collaborative mindset.
Bonus: Experience in AI/ML infrastructure, or managing large-scale GPU clusters.

What We’re Building

We’re developing highly complex infrastructure to support advanced AI research and production systems running on thousands of GPUs. This is an opportunity to work at scale on some of the most demanding reliability and performance challenges in tech today. You’ll have a direct impact on how infrastructure supports foundation model development and deployment.

Compensation & Benefits

Base Salary: $200K–$250K/year

Competitive equity package (stock options)

Comprehensive health benefits

Generous PTO and flexible work policies

Support for ongoing professional development
