[Remote] Senior Software Engineering Manager (DevOps)

Remote · USA Full-time

Note: This is a remote role open to candidates in the USA. TrueML is seeking a highly experienced, strategic Senior Software Engineering Manager to lead its infrastructure and platform engineering efforts. The role is critical to driving cloud architecture strategy and ensuring the scalability and reliability of machine learning-driven products, and it combines technical leadership with people management.

Responsibilities

  • Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs
  • Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity
  • Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors
  • Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions
  • Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance
  • Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines
  • Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error
  • Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff
  • Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations
  • Write and maintain Terraform Infrastructure as Code, hands-on, for resource provisioning
  • Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability
  • Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling
  • Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting Node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR
  • Conduct hands-on audits of cloud configurations and IAM policies, implementing "least privilege" access controls and automated remediation scripts
  • Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow)
  • Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning
  • Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck
  • Collaborate with the Quality Engineering and Security leadership to define and enforce "Definition of Done" standards that include automated testing and security gates
  • Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence
  • Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities

Skills

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
  • 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers
  • Expert-level mastery of AWS and experience managing multi-region, high-availability deployments
  • Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment
  • Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus
  • Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash
  • Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets)
  • Experience acting as an Incident Commander for high-severity outages and fostering a 'blameless' post-mortem culture
  • Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams
  • Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery
  • Experience leading organizational platform migration, including the development of rollback strategies, stakeholder communication plans, and post-migration validation
  • Prior experience working with high-velocity, product-driven early-to-mid stage technology companies where reliability, extensibility, and availability were mission-critical to success
  • AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with those technologies in production environments
  • Notable contributions to Open Source projects or communities

Company Overview

  • TrueML Technologies’ family of companies builds technology that aims to transform the experience of consumers seeking financial health and to ensure nobody gets locked out of the financial system. Founded in 2013, it is headquartered in Lenexa, Kansas, USA, and has 51-200 employees. Its website is https://getretain.com.

Company H1B Sponsorship

  • TrueML has a track record of offering H1B sponsorships, with 3 in 2025. Please note that this does not guarantee sponsorship for this specific role.