Data & Reporting SRE

at ZILO
🇬🇧 United Kingdom - Remote
🔧 DevOps 🔵 Mid-level

Job description

About:

Step forward into the future of technology with ZILO™.

We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.

At ZILO™, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.

We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.

We are seeking an experienced Site Reliability Engineer (SRE) with deep subject-matter expertise in data processing and reporting. In this role, you will own the reliability, performance, and operational excellence of our real-time and batch data pipelines built on AWS, Apache Flink, Kafka, and Python. You’ll act as the first line of defense for data-related incidents, rapidly diagnose root causes, and implement resilient solutions that keep critical reporting systems up and running.

Incident Management & Triage

  • Serve as on-call escalation for data pipeline incidents, including real-time stream failures and batch job errors.
  • Rapidly analyze logs, metrics, and trace data to pinpoint failure points across AWS, Flink, Kafka, and Python layers.
  • Lead post-incident reviews: identify root causes, document findings, and drive corrective actions to closure.

Reliability & Monitoring

  • Design, implement, and maintain robust observability for data pipelines: dashboards, alerts, and distributed tracing.
  • Define SLOs/SLIs for data freshness, throughput, and error rates; continuously monitor and optimize.
  • Automate capacity planning, scaling policies, and disaster-recovery drills for stream and batch environments.

Architecture & Automation

  • Collaborate with data engineering and product teams to architect scalable, fault-tolerant pipelines using AWS services (e.g., Step Functions, EMR, Lambda, Redshift) integrated with Apache Flink and Kafka.
  • Troubleshoot and maintain Python-based applications.
  • Harden CI/CD for data jobs: implement automated testing of data schemas, versioned Flink jobs, and migration scripts.

Performance Optimization

  • Profile and tune streaming jobs: optimize checkpoint intervals, state backends, and parallelism settings in Flink.
  • Analyze Kafka cluster health: tune broker configurations, partition strategies, and retention policies to meet SLAs.
  • Leverage Python profiling and vectorized libraries to streamline batch analytics and report generation.

Collaboration & Knowledge Sharing

  • Act as SME for the data and reporting stack: mentor peers and lead brown-bag sessions on best practices.
  • Contribute to runbooks, design docs, and on-call playbooks detailing common failure modes and recovery steps.
  • Work cross-functionally with DevOps, Security, and Product teams to align reliability goals and incident response workflows.

Benefits

  • Enhanced leave: 38 days, inclusive of 8 UK public holidays
  • Private health care, including family cover
  • Life assurance: 5x salary
  • Flexible working: work from home and/or in our London office
  • Employee Assistance Program
  • Company pension (salary sacrifice options available)
  • Access to training and development
  • Buy and sell holiday scheme
  • The opportunity for “work from anywhere/global mobility”
