Job description

This isn’t your average DevOps role. This isn’t just about pipelines or cloud provisioning. This is about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS—where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS Serverless infrastructure, with deep integration into Salesforce and cross-agent protocols.

This is for builders with something to prove. For engineers who’ve gone beyond cloud fluency to orchestrate complex, multi-agent ecosystems—who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time.

If you’re driven by deep automation, passionate about creating fault-tolerant agentic systems, and thrive where innovation is the expectation—not the exception—you’re in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.

The Role

We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.

The Responsibilities

CI/CD Infrastructure for Agentic AI

Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.

Agent Development & Debugging

Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.

Monitoring, Tracing & Reliability

Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.

Cost Optimization & Usage Analysis

Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.

Collaboration & Continuous Improvement

Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.
Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.
Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.

2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems.
Deep expertise in AWS serverless architecture, including hands-on experience with:
- AWS Lambda – function design, performance tuning, cold-start optimization.
- Amazon API Gateway – managing REST/HTTP APIs and integrating with Lambda securely.
- Step Functions – orchestrating agentic workflows and managing execution states.
- S3, DynamoDB, EventBridge, SQS – event-driven and storage patterns for scalable AI systems.
Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates.
Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions.
Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production.
Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems.
Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry).
Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines.
Excellent debugging, documentation, and cross-team communication skills.

Equity participation program
Health insurance, PTO, and leave time
Ongoing paid professional training and certifications
Fully remote work opportunity
Strong onboarding and training program

Work Timings: 1 pm to 10 pm IST

Next Steps

We’re looking for someone who embodies the spirit of a boundary-pushing Principal Architect—ready to own ambitious projects, craft scalable multi-cloud solutions, and skillfully integrate AI where it truly elevates outcomes.

Apply Now: Send us your resume and a brief summary of your experience leading teams, including notable multi-platform or AI-driven projects.
Show Us Your Ingenuity: Be prepared to discuss your boldest cross-platform solutions, how you integrated new technologies, and how you overcame tough technical hurdles.
Collaborate & Ideate: If selected, you’ll workshop a real-world scenario with our leadership—so we can see firsthand how you approach challenges across AWS, AI, and beyond.

This is your opportunity to shape the future of enterprise solutions—across AWS, emerging AI platforms, and the occasional Salesforce ecosystem. We can’t wait to hear from you!

Our Belief

We believe extraordinary things happen when technology and human creativity unite. By empowering teams with cloud solutions, AI insights, and thoughtful architecture, we free them to focus on meaningful relationships, innovative strategies, and real impact. It’s more than just code—it’s about sparking a revolution in how people interact with systems, solve problems, and propel businesses forward.

If this resonates with you—if you’re driven, daring, and ready to build the next wave of multi-platform innovation—then let’s do this. Apply now and help us shape the future.

About Expedite Commerce

At Expedite Commerce, we believe that people achieve their best when technology enables them to build relationships and explore new ideas. So we build systems that free you up to focus on your customers and drive innovations. We have a great commerce platform that changes the way you do business!

See more about us at expeditecommerce.com. You can also read about us on G2 (products/expedite-commerce) and on Salesforce AppExchange (ExpediteCommerce).

EEO Statement

All qualified applicants to Expedite Commerce are considered for employment without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other protected characteristic.

LLM Ops Engineer - Serverless & CI/CD
