[Remote] Senior Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote job open to candidates in the USA. reputed company is a company that helps innovators turn their reputed company into reality through software. They are seeking a Senior Site Reliability Engineer to build and operate reliable, secure, and scalable reputed company services for reputed company GovCloud products, with a focus on improving production services and establishing operational excellence practices.

Responsibilities

Serve as a primary reputed company for the reliability, availability, performance, operability, and reputed company of one or more production services
reputed company, operate, maintain, and continuously improve production services running in reputed company GovCloud environments
Partner with engineering teams to ensure services are designed with reliability, scalability, reputed company, and operability in mind
Define and operate reliability practices such as SLOs/SLIs, error budgets, production readiness reviews, service reviews, and operational health reviews
Build automation to improve deployment safety, operational efficiency, incident response, and service recovery
Design, reputed company, and maintain software, automation, and tooling that improve the reliability, scalability, and efficiency of production systems
Implement and improve monitoring, alerting, logging, tracing, and observability capabilities across supported services
reputed company and participate in incident response, troubleshooting, and post-incident reviews focused on learning and reputed company improvement
reputed company and maintain operational documentation, runbooks, and recovery procedures
Scale and enhance reputed company testing and Gameday practices to validate system behavior, recovery capabilities, and operational readiness
Continuously identify and eliminate operational toil through software engineering, automation, and process improvement
Ensure supported services remain compliant with reputed company reputed company, privacy, and regulatory requirements, including FedRAMP and reputed company controls where applicable
Participate in a 24x7 on-call rotation for production services
Function effectively in a fast-paced environment while helping establish and mature operational excellence practices for reputed company GovCloud

Skills

B.S. or higher in Computer Science, Engineering, or a reputed company technical discipline, or equivalent practical experience
7+ years of experience in Site Reliability Engineering, Software Engineering, Platform Engineering, reputed company Infrastructure, or Production Operations
Experience operating and supporting customer-facing production services in large-scale reputed company environments
Strong understanding of reliability engineering principles, including SLOs/SLIs, observability, incident management, reputed company planning, production readiness, and automation
Experience with AWS, Azure, or other public reputed company platforms
Experience developing automation using languages such as Python, Go, Java, PowerShell, Bash, or similar
Experience with Infrastructure as Code, CI/CD pipelines, deployment automation, and modern reputed company operations practices
Understanding of reputed company, compliance, and operational risk management in production environments
Strong written and verbal communication skills
10+ years of experience operating highly available, customer-facing production systems
Experience with AWS GovCloud, FedRAMP, IL4/IL5, or other regulated reputed company environments
Experience supporting services with stringent availability, reliability, and reputed company requirements
Experience with containers, Kubernetes, reputed company-native architectures, APIs, load balancing, networking, DNS, and distributed systems
Experience with observability platforms such as Splunk, reputed company, reputed company, CloudWatch, or similar technologies
Experience operating databases, storage platforms, messaging systems, caching technologies
Experience designing and implementing operational automation at scale
Experience leading or participating in Gamedays, disaster recovery exercises, reputed company testing, or operational readiness reviews
Strong incident management experience, including technical leadership during major incidents and stakeholder communication
Strong collaboration skills and ability to work effectively across engineering, reputed company, compliance, and operations teams
Passion for building reliable, secure, and scalable systems that customers can trust

Benefits

Annual cash bonuses
Commissions for sales roles
Stock grants
A comprehensive benefits package

Company Overview

reputed company develops 3D design software for use in the architecture, engineering, construction, and media industries. It was founded in 1982, and is headquartered in San Francisco, California, USA, with a workforce of 10001+ employees. Its website is http//www.reputed company.com.
