Site Reliability Engineer

🇺🇸 United States - Remote
🔧 DevOps 🔵 Mid-level

Job description

Position overview:

We are seeking a Site Reliability Engineer to join our Operations Group. This role plays a key part in advancing scientific discovery by supporting high-performance computing (HPC) and data analysis for the organization.

Our center provides essential HPC and data systems to more than 10,000 researchers working in areas such as alternative energy, climate science, energy efficiency, environmental science, and other mission areas.

As a Site Reliability Engineer, you will be part of a 24/7 operations team that ensures our systems are accessible, reliable, secure, and available to the scientific community. You will work with a state-of-the-art data collection and monitoring system to maintain and optimize performance across complex HPC and data environments.

What You Will Do at Level 2

  • Work five shifts per week monitoring a large HPC facility, including 2–3 overnight shifts (midnight–8 a.m.) per week.
  • Split time between on-site and off-site shifts depending on staffing needs.
  • Review and respond to alerts from computing systems, storage, networks, and other data center/facility systems by triaging or escalating to on-call staff.
  • Develop solutions to improve processes, prevent recurrence of issues, and automate responses to routine service conditions.
  • Identify areas for improved monitoring and automation; propose and implement solutions.
  • Respond to monitoring alerts to ensure continuous 24/7 data collection for real-time diagnoses.
  • Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team.
  • Create software programs to provide alerts and notifications from HPC system APIs into the monitoring pipeline.
  • Configure software and solve technical issues to ensure programs scale reliably with increasing data volume.
  • Collaborate with other groups to ensure workflows are understood and maintained.
  • Assign technical tasks to other monitoring team members as needed.
  • Coordinate system maintenance activities and manage diagnostic and notification software during outages.
  • Provide accurate documentation in ticketing systems for outages, updates, and incidents.
  • Work on and resolve problems of diverse scope where analysis requires evaluation of identifiable factors.

Additional Responsibilities (For Level 3 only)

  • Provide leadership in developing monitoring and alerting pipelines, documentation, and software.
  • Contribute to the design and deployment of the monitoring cluster.
  • Partner with other technical groups to improve monitoring experiences.
  • Tackle complex problems requiring in-depth evaluation of variable factors.
  • Determine methods and procedures for new assignments and, where needed, coordinate the activities of other team members.

Required Qualifications (Level 2)

  • Typically requires 5+ years of related experience with a Bachelor’s degree, or 3+ years with a Master’s degree, or equivalent work experience.
  • Strong hands-on knowledge of Linux shell and command-line environments.
  • Experience developing tools using languages such as C, C++, Perl, Java, or Python.
  • Knowledge of IT infrastructure and large data communication networks supporting highly available systems.
  • Ability to learn and work with data center management technologies (e.g., Kubernetes, Prometheus, alerting/monitoring tools, building management software, cooling/power systems).
  • Strong communication skills and ability to collaborate across multiple technical teams.
  • Experience working in a 24/7 operations team managing large data centers or installations.
  • Knowledge of network security, ACLs, firewalls, and protocols.
  • Relevant certifications in system administration or related areas.

Required Qualifications (Level 3)

  • Typically requires 8+ years of related experience with a Bachelor’s degree, or 6+ years with a Master’s degree, or equivalent.
  • Advanced expertise in one or more programming languages such as C, C++, Perl, Java, or Python.
  • Demonstrated excellence with monitoring and automation tools.
  • Experience leading technical projects.
  • Strong ability to respond proactively to complex issues.

Additional Details

  • Shift: Includes overnight “Owl” shifts (12 a.m. – 8 a.m.), primarily on-site.
  • This is a full-time, exempt position (paid monthly).
  • A background check is required. Convictions are reviewed in relation to job responsibilities and do not automatically disqualify applicants.
  • This position requires substantial on-site presence, but hybrid schedules may be available depending on business needs. Candidates must reside within 150 miles of the work site.
Please let LTD Global know you found this job on Remote First Jobs 🙏