Summary
This is a remote Site Reliability Engineer position lasting 6 months to 2 years. The role involves managing infrastructure services, maintaining service levels, providing user support and incident response, and participating in the on-call rotation. Required qualifications include a Bachelor's degree in Computer Science or a related field, 2+ years of experience with one or more of the listed systems at their latest versions, strong problem-solving and communication skills, and a team-player mentality.
Requirements
- Participate in on-call rotation
- Bachelor's degree or above, majoring in Computer Science or related fields
- 2+ years of experience with one or more of the following types of systems at their latest versions:
  - Kubernetes and Docker
  - Redis and/or MongoDB
  - Kafka and/or RocketMQ
  - Flink
  - MySQL
  - Elasticsearch
  - HDFS
  - Mesos and/or YARN
  - Spark and/or Hive
- Familiarity with Unix/Linux operating systems
- Experience in debugging and automating routine tasks
- Strong problem-solving and communication skills
- Excellent team player
Responsibilities
- Manage infrastructure services, including deployment, operation, and troubleshooting
- Maintain services to meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring availability, performance, and overall system health
- Provide user support, incident response, and post-mortems
Preferred Qualifications
Experience supporting or managing systems at scale (tens of thousands to hundreds of thousands of instances) is a big plus