Summary
Syndica, a Web3 RPC infrastructure company, is hiring a Site Reliability Engineer. The candidate should have 5+ years of DevOps or SRE experience, proficiency in scripting languages and at least one modern programming language, and experience with Kubernetes, web and network protocols, information security, automation tooling, capacity planning, and monitoring tools.
Requirements
- Great collaborator with 5+ years of experience in a DevOps or SRE role
- Proficiency in scripting languages (Python, Shell) and experience with at least one modern programming language (Go, Rust, TypeScript, etc.)
- Experience deploying large-scale systems reliably
- Experience using Kubernetes
- Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)
- Working knowledge of information security issues
- Experience writing automation tools and eagerness to 'automate all the things'
- Commitment to implementing reliability and security best practices
- Capacity planning experience, including resource optimization and load testing
- Systematic problem-solving approach, combined with a strong sense of ownership and drive
Responsibilities
- Administer overall site availability, security, latency, and system health
- Provision, install, configure, operate, and maintain services, system software, and related infrastructure
- Develop comprehensive monitoring solutions that provide full visibility into the different system components, using tools like Kubernetes, Prometheus, Grafana, ELK, Datadog, New Relic, etc.
- Enable the development team to release code quickly and reliably by ensuring full observability of systems and automated detection of performance and integration issues
- Formulate technical performance measures and implement them using queries, logs, code instrumentation and other analytics tools
- Design dashboards and visualizations that effectively convey technical measures
- Troubleshoot issues at multiple layers of the deployment, from hardware to operating environment, network, and application, conducting root cause analysis and making recommendations based on your findings
- Work with development teams to ensure best practices for scalability, reliability, and security are designed and implemented from the start
- Forecast changes in demand and capacity to establish appropriate scalability plans and drive decisions on the right-sizing of servers, storage and other resources
- Design and perform high-throughput stress testing to determine system capacity limits and identify points of failure
- Troubleshoot critical customer issues related to Syndica’s RPC, APIs, and App Deployments
Preferred Qualifications
- Experience with Prometheus/Grafana for metrics aggregation/visualization and other monitoring and alerting tools
- Experience with infrastructure-as-code tools such as Terraform, Ansible, or Chef
- Experience building and managing virtualized systems (KVM, OVM, containers/Docker) and the ability to read and understand source code
- Knowledge of one or more load testing tools (k6, Locust, JMeter, etc.)
- Experience with configuration of CI/CD pipelines