Associate Site Reliability Engineer
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The Associate Site Reliability Engineer will design, develop, and deploy AI-powered solutions, maintain Kubernetes-based infrastructure, and collaborate with developers to automate workflows. This role is essential for ensuring high availability and reliability of applications.
Responsibilities
- Design, develop, and deploy AI-powered solutions using no-code, low-code, and advanced platforms, translating business needs into scalable applications that enhance products, workflows, and decision-making
- Design, deploy, and maintain Kubernetes-based infrastructure to ensure high availability and scalability of applications
- Build and manage CI/CD pipelines using GitHub Actions to enable fast and reliable deployments
- Use Terraform to provision and manage infrastructure in Google Cloud Platform (GCP)
- Manage and optimize Apache Kafka-based systems to ensure reliable message streaming and data processing
- Monitor and improve system performance and reliability using Prometheus and Grafana
- Collaborate with developers to automate workflows and implement best practices for infrastructure-as-code (IaC)
- Write Python scripts for automation and tooling to enhance operational efficiency
- Troubleshoot and resolve system issues to minimize downtime and impact on users
- Participate in on-call rotations and incident response to ensure high service reliability
Skills
- 1+ years of experience with Google Cloud Platform (GCP) services such as Compute Engine, Kubernetes Engine, and Cloud Storage
- 1+ years of hands-on experience with Kubernetes for deploying and managing containerized applications
- 1+ years of experience with GitHub Actions for creating and maintaining CI/CD pipelines
- 1+ years of experience with Python for scripting, automation, and tooling
- 1+ years of experience with Apache Kafka for building, maintaining, and troubleshooting message-driven systems
- 1+ years of experience using Prometheus and Grafana for monitoring and observability
- Basic knowledge of Terraform for infrastructure provisioning and management
- Familiarity with other cloud providers (e.g., AWS or Azure)
- Knowledge of Helm for Kubernetes package management
- Experience with debugging and optimizing distributed systems
- Exposure to security best practices for cloud infrastructure
- Knowledge of Java for developing and troubleshooting backend systems
- Familiarity with DataHub or similar data cataloging and metadata management platforms
- Understanding of Artificial Intelligence (AI) concepts and tools, such as building or managing machine learning pipelines, integrating AI models, or working with ML platforms like TensorFlow, PyTorch, or Vertex AI
- Experience with Golang for developing infrastructure tools or cloud-native applications
Benefits
- A comprehensive benefits package
- Incentive and recognition programs
- Equity stock purchase
- 401k contribution (all benefits are subject to eligibility requirements)
Company Overview