Distributed ML Systems Engineer

  • $160k-$230k
  • Remote - United States

Remote

Software Development

Mid-level

Summary

Join us in shaping the future at Together AI! We are seeking a Distributed ML Systems Engineer to design and build the scalable machine learning systems that power our accelerated AI initiatives. The role involves developing large-scale, fault-tolerant distributed systems that sustain high load while meeting strict performance requirements.

Requirements

  • 3+ years of experience building large-scale, fault-tolerant, high-performance distributed systems
  • Strong programming skills in one or more of Python, Go, Rust, or C/C++
  • Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, and storage, as well as performance and scale
  • Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure
  • Strong problem-solving skills and ability to work in a fast-paced environment

Responsibilities

  • Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance
  • Develop and optimize distributed processing frameworks and storage systems
  • Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure
  • Conduct architecture and design reviews to ensure best practices in system design
  • Implement robust monitoring and logging systems to ensure the health and performance of our ML systems

Preferred Qualifications

  • Experience with Kubernetes
  • Experience with PyTorch

Benefits

  • Health insurance
  • Competitive compensation