Monitoring & Operations Engineer
(Cloud / On-Prem / Kubernetes)
Jr / Mid / Sr
Work Model: Remote, 24/7 shifts
Job Description
We are looking for Monitoring & Operations Engineers at Junior, Mid, and Senior levels to operate and monitor hybrid environments spanning AWS, Azure, on-premises infrastructure, Windows/Linux servers, databases, cloud services, and Kubernetes platforms.
This role focuses on 24/7 monitoring, incident detection, first-level troubleshooting, and operational support. You will work closely with DevOps, SRE, Platform, Infrastructure, and application development teams to ensure high availability and system reliability.
Responsibilities
- Monitor cloud, on-premises, and Kubernetes-based systems in a 24/7 shift-based environment
- Track system health, performance, and availability using:
  - AWS CloudWatch, Azure Monitor
  - Grafana, Prometheus
  - ELK
- Monitor Windows and Linux servers (CPU, memory, disk, services, events)
- Monitor Kubernetes clusters (EKS / AKS / On-Prem K8s):
  - Nodes, pods, deployments, services
  - Cluster events and resource usage
- Analyze alarms and alerts, identify potential root causes, and take first-level actions
- Escalate incidents to relevant teams with clear technical findings and evidence
- Perform end-to-end system checks during incidents (infra, application, network, security, platform)
- Execute operational procedures using runbooks / SOPs
- Log incidents, events, and actions accurately in ticketing systems
- Support maintenance, change, and release activities
- Contribute to improving monitoring coverage, alert quality, and operational processes
Required Skills & Qualifications
Core Technical Skills
- Experience or strong interest in hybrid environments:
  - Cloud (AWS, Azure)
  - On-Prem infrastructure
- Knowledge of Windows Server and Linux fundamentals
- Hands-on experience with monitoring & observability tools:
  - CloudWatch, Azure Monitor
  - Grafana, Prometheus
  - ELK
- Kubernetes monitoring and troubleshooting knowledge
- Understanding of:
  - Networking basics (DNS, TCP/IP, Load Balancers)
  - Application metrics, logs, and events
- Ability to distinguish false alerts from real incidents
- Experience with ticketing and incident management tools (Jira, ServiceNow, Opsgenie, PagerDuty, etc.)