MLOps Lead

🇺🇸 United States - Remote
🔧 DevOps · 🟣 Senior

Job description

The Chan Zuckerberg Initiative was founded by Priscilla Chan and Mark Zuckerberg in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education to addressing the needs of our local communities. Our mission is to build a more inclusive, just, and healthy future for everyone.

The Team

Founded by Priscilla Chan and Mark Zuckerberg in 2015, the Chan Zuckerberg Initiative (CZI) is a new kind of philanthropy that’s leveraging technology to help solve some of the world’s toughest challenges, from eradicating disease to improving education to reforming the criminal justice system. Our mission is to create a future for everyone. Across our core Initiative focus areas of Science and Education, we’re pairing engineering with grantmaking, impact investing, policy work, and movement building to help build an inclusive, just, and healthy future for everyone.

Our Values

  • We believe we can help build a future for everyone.
  • We aim to be daring, but humble:  We look for bold ideas — regardless of structure and stage — and help them scale by pairing engineers with subject matter experts to build tools that accelerate the pace of social progress.
  • We want to learn fast, but build for the long-term: We want to iterate fast and help bring new solutions to the table, but we also realize that important breakthroughs often take decades, or even centuries.
  • Stay close to the real problems: We engage directly in the communities we serve because no one understands our society’s challenges like those who live them every day.

Our success is dependent on building teams that include people from different backgrounds and experiences who can challenge each other’s assumptions with fresh perspectives. To that end, we look for a diverse pool of applicants including those from historically marginalized groups — women, people with disabilities, people of color, formerly incarcerated people, people who are lesbian, gay, bisexual, transgender, and/or gender nonconforming, first and second generation immigrants, veterans, and people from different socioeconomic backgrounds.

The Opportunity

Our Central Tech team provides technology and security support for CZI, the Biohub Network, and our grantees. We believe that Engineering and Security are most effective when in sync and learning from each other on a daily basis. Our AI Infrastructure Engineering team enables our AI Research teams to achieve their goals faster and more securely. We leverage technology to automate manual processes, constantly innovate to optimize operations, provide first-class support, and build solutions to enable the scale and execution of our business partners’ strategies and initiatives.

The AI/ML and Data Engineering Infrastructure organization builds shared tools and platforms for use across all of the Chan Zuckerberg Initiative, partnering with and supporting a wide range of Research Scientists, Data Scientists, and AI Research Scientists, as well as Engineers focusing on Education and Science domain problems. Members of the shared infrastructure engineering team have an impact on all of CZI’s initiatives by enabling the technology solutions used by other engineering teams at CZI to scale. A person in this role will build these technology solutions and help cultivate a culture of shared best practices and knowledge around core engineering.

What You’ll Do

  • Provide technical MLOps leadership: manage and lead a team of MLOps Engineers in operating our heterogeneous AI training and inference systems, and collaborate on the design and build of our AI platform components.
  • Drive the application of MLOps and DevOps principles: across our multiple platforms, ensure the operational efficiency and process automation necessary for a world-class, large-scale AI model training environment.
  • Provide instrumentation and observability technical leadership: define the MLOps team’s end-to-end metrics program, including proactive monitoring and alerting systems.
  • Facilitate model training through collaboration with our AI Researchers: work alongside the rest of the AI Infrastructure Eng team to make sure that the models we train and release to inference follow machine learning and deep learning best practices and, through code automation libraries, are fully resilient to restarts and checkpoint recoveries.
  • Continuously optimize our Kubernetes-based AI Lifecycle platform: apply our IaC-based practices, integrate our MLOps AI Lifecycle platform tooling, and combine the platform with our On-Prem HPC systems into a cohesive heterogeneous platform.
  • Collaborate on data systems for our AI model training: work with our Data Infrastructure Eng team and the Science data teams on the end-to-end data usage that drives our AI model training.
  • Lead our MLOps team in supporting our on-call rotation: focus on automation and proactive alerting to reduce on-call load and improve self-healing AI system operations. On-call volume will be low, but we do have 24/7 coverage, and the rotation will include members of the rest of the AI team for escalation and on-call coverage.

What You’ll Bring

  • BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience
  • 7+ years of relevant coding and systems experience
  • 5+ years of systems Architecture and Design experience, with a broad range of MLOps experience across Data Infrastructure and AI/ML platforms
  • Proven technical leadership in SRE and MLOps, along with direct or indirect people management experience
  • Proven SRE and MLOps knowledge and related experience
  • Strong experience scaling containerized applications on Kubernetes or Mesos (Kubernetes preferred), including expertise with creating custom containers using secure AMIs and with continuous deployment systems that integrate with Kubernetes or Mesos
  • Cloud Platform proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments
  • MLOps experience working with medium to large scale GPU clusters in Kubernetes (Kubeflow), HPC environments, or large scale Cloud based ML deployments
  • Working knowledge of NVIDIA CUDA and AI/ML custom libraries
  • Knowledge of Linux systems optimization and administration
  • Solid coding experience:
    • Proven coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala
    • Expertise with a scripting language such as Python (preferred), PHP, or Ruby
  • Experience in integrating Data with the AI Lifecycle:
    • AI/ML Platform Operations experience in an environment with demanding data and systems platform requirements
    • Large scale Streaming data systems integration experience (Hadoop, Spark, and/or Kafka deployments, or their counterparts such as Pulsar, Flink, and/or Ray)
    • Workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam
    • Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms
  • PyTorch, Keras, or TensorFlow experience is a strong nice-to-have
  • HPC and Slurm experience is a strong nice-to-have

Compensation

The Redwood City, CA base pay range for this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process.

Work Mode

As we grow, we’re excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager. The exact schedule will be at the hiring manager’s discretion and communicated during the interview process.

Benefits for the Whole You

We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.

  • CZI provides a generous employer match on employee 401(k) contributions to support planning for the future.
  • Annual benefit for employees that can be used most meaningfully for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.
  • CZI Life of Service Gifts are awarded to employees to “live the mission” and support the causes closest to them.
  • Paid time off to volunteer at an organization of your choice.
  • Funding for select family-forming benefits.
  • Relocation support for employees who need assistance moving to the Bay Area.
  • And more!

If you’re interested in a role but your previous experience doesn’t perfectly align with each qualification in the job description, we still encourage you to apply as you may be the perfect fit for this or another role.

Explore our work modes, benefits, and interview process at www.chanzuckerberg.com/careers.
