Site Reliability Engineer

ZenHR • Full-time • Amman, JO • 2w ago

We’re looking for a passionate and motivated Site Reliability Engineer to join our team. We want a driven team player who is eager to learn and ready to join us on our mission to play a vital role in the digital transformation sweeping the MENA region.

As a Site Reliability Engineer at ZenHR, you will be responsible for ensuring the reliability, scalability, and high availability of our systems. The ideal candidate has strong expertise in monitoring, observability, and infrastructure automation, with a proactive mindset for identifying and resolving issues before they impact users.

Who we are:

At ZenHR, delighting our customers is our passion! We are an award-winning cloud-based HRMS that caters to the full HR value chain, from the “acquire” stage to the “retire” stage. We are a group of young and passionate people dedicated to providing cutting-edge technology, continuously researching and implementing new HR trends that cater to the needs of employers in the MENA region. We see the numerous obstacles we face as possibilities. We understand that rather than making excuses for the status quo, we must challenge it. If you want to make a difference in the HR world, ZenHR is the place for you. Our people shape ZenHR’s culture; therefore, our strategy and success are built on our employees. In our hiring process, we prioritize equal employment opportunities, diversity, women’s empowerment, and inclusion, ensuring that we attract and retain A-players from various backgrounds.

What we offer:

Flexible working hours and remote/work-from-home option
Health insurance coverage from day one at ZenHR
Access to online and in-person Mental Health sessions
A Zen work atmosphere
Great culture and amazing people to work with and learn from

The Job - Site Reliability Engineer

Maintain and enhance system uptime, reliability, and performance across production and staging environments.
Design, implement, and manage monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK) to detect and analyze system behavior.
Proactively identify and investigate issues, perform root cause analysis (RCA), and implement preventive measures.
Manage and optimize containerized environments using Docker and Kubernetes.
Automate and manage infrastructure provisioning using Terraform and related IaC tools.
Collaborate with development teams to improve system resilience, observability, and CI/CD processes.
Ensure best practices for scalability, fault tolerance, and performance tuning.

Who you are:

Solid experience in monitoring, alerting, and observability systems.
Hands-on expertise with Docker, Kubernetes, and Terraform in production environments.
Knowledge of web application development, preferably Ruby on Rails.
Strong background in Linux systems, cloud platforms (AWS/GCP/Azure), and automation scripting (e.g., Python, Bash).
Excellent analytical and communication skills, with a proactive and investigative approach to problem-solving.