LLM Ops Engineer - Serverless & CI/CD

at Expedite Commerce
🇮🇳 India - Remote
🔧 DevOps 🔵 Mid-level

Job description

This isn’t your average DevOps role, and it isn’t just about pipelines or cloud provisioning. It’s about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS, where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS serverless infrastructure, with deep integration into Salesforce and cross-agent protocols.

This is for builders with something to prove: engineers who’ve gone beyond cloud fluency to orchestrate complex multi-agent ecosystems and who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time.

If you’re driven by deep automation, passionate about building fault-tolerant agentic systems, and at your best where innovation is the expectation rather than the exception, you’re in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.

The Role

We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.

The Responsibilities

CI/CD Infrastructure for Agentic AI

  • Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
  • Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.

Agent Development & Debugging

  • Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
  • Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
  • Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.

Monitoring, Tracing & Reliability

  • Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
  • Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
  • Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.

Cost Optimization & Usage Analysis

  • Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
  • Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.

Collaboration & Continuous Improvement

  • Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.

  • Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.

  • Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.

The Requirements

  • 2+ years of experience in DevOps, MLOps, or cloud infrastructure, with exposure to AI/ML systems.

  • Deep expertise in AWS serverless architecture, including hands-on experience with:

    • AWS Lambda – function design, performance tuning, cold-start optimization.
    • Amazon API Gateway – managing REST/HTTP APIs and integrating with Lambda securely.
    • Step Functions – orchestrating agentic workflows and managing execution states.
    • S3, DynamoDB, EventBridge, SQS – event-driven and storage patterns for scalable AI systems.
  • Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates.

  • Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions.

  • Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production.

  • Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems.

  • Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry).

  • Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines.

  • Excellent debugging, documentation, and cross-team communication skills.

The Benefits

  • Equity participation program

  • Health insurance, PTO, and leave time

  • Ongoing paid professional training and certifications

  • Fully remote work opportunity

  • Strong onboarding and training program

Work hours: 1 pm to 10 pm IST

Next Steps

We’re looking for someone with the spirit of a boundary-pushing Principal Architect, someone ready to own ambitious projects, craft scalable multi-cloud solutions, and skillfully integrate AI where it truly elevates outcomes.

  1. Apply Now: Send us your resume and a brief summary of your experience leading teams, including notable multi-platform or AI-driven projects.
  2. Show Us Your Ingenuity: Be prepared to discuss your boldest cross-platform solutions, how you integrated new technologies, and how you overcame tough technical hurdles.
  3. Collaborate & Ideate: If selected, you’ll workshop a real-world scenario with our leadership, so we can see firsthand how you approach challenges across AWS, AI, and beyond.

This is your opportunity to shape the future of enterprise solutions across AWS, emerging AI platforms, and, occasionally, the Salesforce ecosystem. We can’t wait to hear from you!

Our Belief

We believe extraordinary things happen when technology and human creativity unite. By empowering teams with cloud solutions, AI insights, and thoughtful architecture, we free them to focus on meaningful relationships, innovative strategies, and real impact. This is about more than code. It’s about sparking a revolution in how people interact with systems, solve problems, and propel businesses forward.

If this resonates with you, and you’re driven, daring, and ready to build the next wave of multi-platform innovation, let’s do this. Apply now and help us shape the future.

About Expedite Commerce

At Expedite Commerce, we believe that people achieve their best when technology enables them to build relationships and explore new ideas. So we build systems that free you up to focus on your customers and drive innovations. We have a great commerce platform that changes the way you do business!

See more about us at expeditecommerce.com. You can also read about us on G2/products/expedite-commerce, and on Salesforce Appexchange/ExpediteCommerce.

EEO Statement

All qualified applicants to Expedite Commerce are considered for employment without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other protected characteristic.

Expedite Commerce

  • 51-200 employees
  • Founded in 2008