Senior Site Reliability Engineer (Fleet Management)

Remote · USA · Full-time

Requirements

  • Have 6+ years of experience in software development and operating distributed systems,
  • Are proficient in Go, Python, or a similar language, with a strong commitment to code quality and testing practices (writing unit, integration, and E2E tests),
  • Have deep experience using and extending containerization technologies, preferably Kubernetes,
  • Have a solid understanding of Linux operating system internals and networking concepts (e.g., filesystems, TCP/IP, DNS, TLS),
  • Possess a customer-focused mindset, treating internal developers as your primary users,
  • Have strong operational ownership, including a track record of debugging complex production issues and driving them to resolution,
  • Prefer automation over manual processes ("allergic to ops work"): we are a small team of software engineers with a strong bias toward building software solutions to eliminate toil,
  • (Desirable) Experience designing and implementing secure, multi-tenant runtime environments from first principles,
  • (Desirable) Proficiency with Kubernetes ecosystem tools such as Helm, Kustomize, Gatekeeper, and Kyverno, and with extension mechanisms such as CRDs/Operators, CRI, and CSI,
  • (Desirable) Expertise in cloud infrastructure platforms such as AWS, GCP, or Azure,
  • (Desirable) Proficiency in provisioning infrastructure using tools like Terraform, Crossplane, and AWS Controllers for Kubernetes (ACK),
  • (Desirable) Advanced knowledge of Linux system internals and networking concepts specifically relevant to containers, such as namespaces and cgroups

What the job involves

  • Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization,
  • Among these are our multi-cloud-provider Kubernetes infrastructure, networking, load balancing (including our public-facing edge and internal service mesh), and observability and alerting systems,
  • The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers,
  • We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper),
  • As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model,
  • Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB,
  • Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems,
  • Participate in a 24/7 on-call rotation to resolve critical issues,
  • Prioritize blameless post-mortems and dedicate engineering time to systemic fixes, ensuring you aren’t paged for the same issue twice
