Lead Site Reliability Engineer

💰 $142k-$198k
🇺🇸 United States - Remote
🔧 DevOps 🟣 Senior

Job description

Company Description

Hitachi Solutions is a global Microsoft solutions integrator passionate about developing and delivering industry-focused solutions that help our clients achieve their business transformation goals. Our industry focus, expertise, and intellectual property are what truly set us apart. We have earned, and continue to maintain, a strategic relationship with Microsoft. Teaming with our clients to deliver innovative digital solutions and services is how we have achieved year-after-year recognition.

As their trusted advisor, we help our clients deliver on their strategic business initiatives as they unify, automate, and modernize their data and operations to increase efficiency, reduce costs, and enhance their customers' experience. Our more than 3,000 team members across 14 countries, and our 18 years of 100% focus on Microsoft technologies and business applications, are how we deliver excellence through expert services and industry-focused cloud solutions.

A part of Hitachi, Ltd., our company shares the long and rich history of innovation, financial strength, and international presence of one of the world’s largest companies. Since 1910, Hitachi, Ltd. has been a leader in manufacturing innovative products and solutions that support industry and social infrastructure around the globe, backed by 303,000 employees in over 100 countries and across 864 companies.

Job Description

This is a full-time role in our product organization for an expert in systems design with considerable skill and expertise in large-scale software development in an Azure dev environment. The role designs and implements Continuous Integration/Continuous Deployment (CI/CD) tooling using GitHub Actions, Azure DevOps, and related technologies. This includes defining and implementing build and test pipelines for containerized architectures, infrastructure as code (IaC) for the stateful deployment of environments, Role-Based Access Control (RBAC), linting and other code quality controls, and GitOps and Kubernetes pipelines, as well as managing SaaS deployment APIs.

Individuals in this role will assist in the design, engineering, development, planning, and administration of Azure Kubernetes Service (AKS) clusters for a set of critical business applications. This role will work closely with application, engineering, security, and operations teams to engineer and build Kubernetes and Azure PaaS and IaaS solutions within an agile, modern, enterprise-grade operating model. Qualified applicants will have a demonstrated capability to learn new concepts quickly and/or robust domain expertise.

Qualifications

Key Responsibilities:

  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; setting and maintaining SLOs, SLIs, and error budgets; and creating dashboards.

  • Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs.

  • Manage site stability, performance, reliability, and maintain uptime for production environments.

  • Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on the usage patterns.

  • Strive for automation to reduce toil and increase development velocity.

  • Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.

  • Identify changes to the product architecture from a reliability, performance, and availability perspective, using a data-driven approach.

  • Analyze and address complex technical challenges and issues that arise during the software development & run lifecycle. Debug, troubleshoot, and resolve technical problems efficiently.

  • Create and maintain technical documentation, including design specifications, user guides, run books and best practice guidelines.

  • Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.

  • Collaborate with software development teams on the release management process, help shape the future roadmap, and establish strong operational readiness across teams.

  • Participate in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives.

  • Collaborate with product managers, designers, and other engineers to ensure alignment and efficient project execution.

  • Share your expertise and mentor engineers, helping them grow and develop their skills. Foster a culture of continuous learning and improvement within the team.

  • Stay current with the latest technologies, tools, and developments in cloud computing. Proactively learn and adapt to new technologies to drive innovation.

  • Collaborate with customers to understand their needs, gather feedback, and provide technical support and guidance as needed.

  • Triage incoming Web Support escalation requests, routing them to the applicable internal teams.

  • Contribute to incident root cause analysis, service restoration, and serve as an incident commander during outage events.

  • Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.

  • Solid experience with monitoring/APM/observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.).

  • Strong background with Azure resources such as Key Vault, Data Factory, Azure Databricks, and Storage Accounts.

  • Experience implementing observability plans around logs, metrics, and traces.

  • Experience in an agile development team developing software.

  • Implement and participate in exercising best practices for CI/CD.

  • Experience with cloud infrastructure environments, preferably Azure, and infrastructure as code (Terraform, Bicep, ARM).

  • Design, develop, and maintain infrastructure using popular IaC tools and technologies such as Terraform, Helm, and others.

  • Strong experience with containerization technology and/or Kubernetes.

  • Experience with release automation, system administration, and configuration management.

  • Experience with programming languages (Python, Go, etc.).

  • Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.

  • Strong interpersonal and teaming skills - ability to set and enforce process and influence engineers who are not direct reports.

  • Strong analytical and programming skills (Python, Go, etc.).

  • (Bonus) Experience with MLflow and other MLOps pipeline technology.

Practices, Principles, Techniques

  • Continuous Integration/Continuous Deployment (CI/CD)
  • Instrumentation strategy and Site Reliability Engineering (SRE)
  • Release Communication and Collaboration
  • Security and Compliance
  • TDD (Test Driven Development, especially with respect to CI/CD and DevOps)

Additional Information

Base Salary Pay Range*: $ 142,500 - $ 198,750 USD

*The current applicable Base Salary Pay Range for this role is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills relevant to the role, internal equity, alignment with market data, or other law.

NOTE: WHILE THIS ROLE IS REMOTE, YOU MUST BE A US CITIZEN OR ABLE TO WORK WITHIN THE UNITED STATES WITHOUT SPONSORSHIP.

We are an equal opportunity employer. All applicants will be considered for employment without attention to age, race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Other Compensation / Benefit Overview

In addition to Base Salary, the successful candidate may be eligible to participate in the following plans / programs, upon satisfying all hiring requirements:

  • Bonus Plan

  • Medical, Dental and Vision Coverage

  • Life Insurance and Disability Programs

  • Retirement Savings with Company Match

  • Paid Time Off

  • Flexible Work Arrangements including Remote Work

All your information will be kept confidential according to EEO guidelines.

#REMOTE

#LI-JH1

Beware of scams

Our recruiting team may communicate with candidates via our @hitachisolutions.com domain email address and/or via our SmartRecruiters (Applicant Tracking System) [email protected] domain email address regarding your application and interview requests.

All offers will originate from our @hitachisolutions.com domain email address. If you receive an offer or information from someone purporting to be an employee of Hitachi Solutions from any other domain, it may not be legitimate.
