Role Overview
We are seeking an experienced Operations Team Lead with a strong technical background to drive efficiency, reliability, and continuous improvement across our product and platform operations. You will lead a multidisciplinary team of operations engineers, collaborate with Product and Engineering, and own the operational health of critical services. This role demands strategic thinking, hands-on leadership, and a passion for delivering high-velocity, high-quality outcomes in a fast-paced environment.
Key Responsibilities
- Lead and mentor a team of operations engineers, setting clear goals, coaching for growth, and ensuring operational practices are applied consistently.
- Own end-to-end operational health of critical systems, including incident management, on-call readiness, post-incident reviews, and root-cause analysis.
- Collaborate with Product and Engineering to design and implement scalable, reliable, and observable systems; drive the adoption of SRE-like practices where appropriate.
- Develop and maintain robust runbooks, SLAs, SLOs, and incident response playbooks; continually optimize escalation paths and response times.
- Manage capacity planning, performance monitoring, and cost optimization to support business growth while maintaining service quality.
- Implement automation to reduce toil, improve deployment pipelines, and streamline day-to-day operations.
- Partner with Security and Compliance teams to ensure operational controls meet regulatory requirements and security standards.
- Lead continuous improvement initiatives, using metrics and feedback to shape roadmaps and prioritize investments in tooling and processes.
- Communicate clearly with stakeholders at all levels, translating technical concepts into actionable insights and status updates.
Required Qualifications
- 5+ years of experience in technical operations, platform engineering, or site reliability engineering, with 2+ years in a leadership or team lead role.
- Strong technical foundation in systems, networking, cloud infrastructure (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and CI/CD pipelines.
- Proven incident management experience, including on-call leadership, incident triage, and post-incident reporting.
- Demonstrated ability to build scalable processes, runbooks, and automation to reduce toil and improve reliability.
- Excellent problem-solving, organizational, and communication skills; ability to influence cross-functional teams without direct authority.
Preferred Qualifications
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, etc.) and modern APM tools.
- Background in security-focused operations, compliance frameworks, and risk-based decision making.
- Experience in a product-centric environment with a strong product mindset and customer impact awareness.
- Familiarity with agile methodologies and collaborative, fast-paced team environments.