Summary
Join us in shaping the future at Together AI! We are seeking a Distributed ML Systems Engineer to design and build scalable machine learning systems that power our accelerated AI initiatives. This role involves developing large-scale, fault-tolerant distributed systems that sustain high load and meet high-performance requirements.
Requirements
- 3+ years of experience in building large-scale, fault-tolerant, high-performance distributed systems
- Strong programming skills in one or more of Python, Go, Rust, or C/C++
- Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, and storage, as well as performance and scale
- Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure
- Strong problem-solving skills and ability to work in a fast-paced environment
Responsibilities
- Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance
- Develop and optimize distributed processing frameworks and storage systems
- Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure
- Conduct architecture and design reviews to ensure best practices in system design
- Implement robust monitoring and logging systems to ensure the health and performance of our ML systems
Preferred Qualifications
- Experience with Kubernetes
- Experience with PyTorch
Benefits
- Health insurance
- Competitive compensation