Summary
The job is for a senior infrastructure engineer at CentML, a company focused on reducing the cost of developing and deploying ML models. The role involves designing, developing, and maintaining the CentML platform, which provides cost-effective infrastructure for serving and training large-scale machine learning models across multiple cloud service providers.
Requirements
- 4+ years of experience working with containerized deployment systems (e.g., Kubernetes, OpenShift, Terraform)
- Experience with deploying and managing cloud infrastructure on AWS, GCP, Azure
- Knowledge of GPU architecture and NVIDIA GPU virtualization technologies is highly desirable
- Strong coding skills in languages like Python, Java, Go, and/or C/C++
Responsibilities
- Design and lead the development of the deployment infrastructure of the CentML platform
- Implement GPU cluster scheduling solutions for large-scale ML training and inference workloads
- Communicate with product teams and define new features and goals for improving the CentML platform
Preferred Qualifications
- Contributions to Kubernetes and expertise in container runtime technologies such as Docker Engine, containerd, or CRI-O
- Experience building GPU clusters for large-scale ML training and inference
Benefits
- An open and inclusive culture and work environment
- Fully stocked kitchen at the office
- Full health and dental benefits
- Parental leave top-up for 6 months
- Continuous education budget
- Generous vacation - we're not saying unlimited, but if you need extra time to recharge, just ask