Staff Site Reliability Engineer
Posted by Max Blaze
Company Details
Duolingo
Pittsburgh, PA
FTE only
Description
You will…
- Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
- Own core infrastructure (i.e., manage, diagnose, and debug large-scale distributed systems in production)
- Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
- Maintain and document sustainable postmortem/incident response practices
- Understand and resolve potential threats to performance or security
- Monitor and measure latency, availability, and overall system health once systems are live
- Advocate for and implement changes that improve reliability, scalability, and velocity
- Monitor and stress test systems to collect metrics for tuning and capacity planning
- Reduce the burden of toil with iterative development of tooling and automation
- Collaborate with engineering teams to release new features and become an authority on our services
- Participate in on-call rotation
You have…
- Bachelor’s Degree in Computer Science
- 5+ years of experience in site reliability engineering/DevOps on a product with millions of users
- Experience analyzing and troubleshooting large-scale distributed systems
- Proven knowledge of C, C++, Java, Kotlin, Python or Go
- Fluency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
- An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
- Effective communication skills and understanding of best practices around tools/methodologies for Infrastructure, Automation, Capacity Planning, etc.
- Ability to be on-call for critical incident responses