Site Reliability Engineer

  • Remote - Worldwide

Remote

DevOps

Senior

Summary

This role is for a Site Reliability Engineer at Syndica, a Web3 RPC infrastructure company. Candidates should have 5+ years of DevOps or SRE experience, proficiency in scripting languages and at least one modern programming language, and experience with Kubernetes, web protocols, information security, automation tooling, capacity planning, and monitoring tools.

Requirements

  • Great collaborator with 5+ years of experience in a DevOps or SRE role
  • Proficiency in scripting languages (Python, Shell) and experience with at least one modern programming language (Go, Rust, TypeScript, etc.)
  • Experience deploying large-scale systems reliably
  • Experience using Kubernetes
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)
  • Working knowledge of information security issues
  • Experience writing automation tools and eagerness to 'automate all the things'
  • Commitment to implementing reliability and security best practices
  • Capacity planning experience, including resource optimization and load testing
  • Systematic problem-solving approach, combined with a strong sense of ownership and drive

Responsibilities

  • Administer overall site availability, security, latency, and system health
  • Provision, install, configure, operate, and maintain services, system software, and related infrastructure
  • Develop comprehensive monitoring solutions that provide full visibility into the different system components, using tools like Kubernetes, Prometheus, Grafana, ELK, Datadog, New Relic, etc.
  • Enable the development team to release code quickly and reliably by ensuring full observability of systems and automated detection of performance and integration issues
  • Formulate technical performance measures and implement them using queries, logs, code instrumentation and other analytics tools
  • Design dashboards and visualizations that effectively convey technical measures
  • Troubleshoot issues at multiple layers of the deployment, from hardware to operating environment, network, and application; conduct root cause analysis and make recommendations based on your findings
  • Work with development teams to ensure best practices for scalability, reliability, and security are designed and implemented from the start
  • Forecast changes in demand and capacity to establish appropriate scalability plans and drive decisions on right-sizing servers, storage, and other resources
  • Design and perform high-throughput stress testing to determine system capacity limits and identify points of failure
  • Troubleshoot critical customer issues related to Syndica’s RPC, APIs, and App Deployments

Preferred Qualifications

  • Experience with Prometheus/Grafana for metrics aggregation/visualization and other monitoring and alerting tools
  • Experience with infrastructure-as-code tools such as Terraform, Ansible, Chef
  • Experience building and managing virtualized systems (KVM, OVM, containers/Docker) and the ability to read and understand source code
  • Knowledge of one or more load testing tools (K6, Locust, JMeter, etc.)
  • Experience with configuration of CI/CD pipelines