Key Responsibilities
● Lead the Site Reliability Engineering team
● Lay out schedules and shift plans for the team
● Manage tickets and allocate tasks for team members
● Work collaboratively with peers and management
● Ensure transparent communication with the customer
● Provide direction and assistance to team members
● Record and track team SLAs and workflows
● Ensure that the monitoring systems and procedures are aligned with industry best practices, regulatory requirements, and security policies
● Implement metrics-driven processes to ensure service quality
Skill Set
● Knowledge of monitoring tools such as Zabbix and Nagios
● Knowledge of or experience with ticketing systems such as Zoho Desk and JIRA
● Strong problem-solving skills, particularly in investigating and analyzing recurring issues
● Hands-on knowledge of Linux fundamentals, system administration, scripting, and performance tuning
● Ability to think clearly and stay composed under pressure
● Basic knowledge of cloud environments such as AWS, Azure, and Google Cloud
● Basic knowledge of networking, routing and switching
● Communication and documentation skills
Experience: 5 – 7 years of L2 monitoring