Key Responsibilities:
- Lead the Site Reliability Engineering (SRE) team
- Lay out schedules and shift plans for the team
- Manage tickets and allocate tasks for team members
- Work collaboratively with peers and management
- Ensure transparent communication with the customer
- Provide direction and assistance to team members
- Record and track team SLAs and workflows
- Ensure that monitoring systems and procedures are aligned with industry best practices, regulatory requirements, and security policies
- Implement metrics-driven processes to ensure service quality
Skill Set:
- Knowledge of monitoring tools such as Zabbix and Nagios
- Knowledge of or experience with ticketing systems such as Zoho Desk and JIRA
- Strong problem-solving skills, particularly in investigating and analyzing recurring issues
- Hands-on knowledge of Linux fundamentals, system administration, scripting, and performance tuning
- Ability to think clearly and make decisions under pressure
- Basic knowledge of cloud environments such as AWS, Azure, and Google Cloud
- Basic knowledge of networking, routing, and switching
- Communication and documentation skills
Experience:
5 to 7 years of experience in L2 monitoring