Job Title
 

Monitoring & Operations Engineer

(Cloud / On-Prem / Kubernetes)
Jr / Mid / Sr

Work Model: Remote (24/7 shift-based)

Job Description

We are looking for Monitoring & Operations Engineers at Junior, Mid, and Senior levels to operate and monitor hybrid environments spanning AWS, Azure, on-premises infrastructure, Windows/Linux servers, databases, cloud services, and Kubernetes platforms.

This role focuses on 24/7 monitoring, incident detection, first-level troubleshooting, and operational support, working closely with DevOps, SRE, Platform, Infrastructure, and Application development teams to ensure high availability and system reliability.

Responsibilities

  • Monitor cloud, on-premises, and Kubernetes-based systems in a 24/7 shift-based environment
  • Track system health, performance, and availability using:
    • AWS CloudWatch, Azure Monitor
    • Grafana, Prometheus
    • ELK
  • Monitor Windows and Linux servers (CPU, memory, disk, services, events)
  • Monitor Kubernetes clusters (EKS / AKS / On-Prem K8s):
    • Nodes, pods, deployments, services
    • Cluster events and resource usage
  • Analyze alarms and alerts, identify potential root causes, and take first-level actions
  • Escalate incidents to relevant teams with clear technical findings and evidence
  • Perform end-to-end system checks during incidents (infra, application, network, security, platform)
  • Execute operational procedures using runbooks / SOPs
  • Log incidents, events, and actions accurately in ticketing systems
  • Support maintenance, change, and release activities
  • Contribute to improving monitoring coverage, alert quality, and operational processes

Required Skills & Qualifications

Core Technical Skills

  • Experience or strong interest in hybrid environments
    • Cloud (AWS, Azure)
    • On-Prem infrastructure
  • Knowledge of Windows Server and Linux fundamentals
  • Hands-on experience with monitoring & observability tools:
    • CloudWatch, Azure Monitor
    • Grafana, Prometheus
    • ELK
  • Kubernetes monitoring and troubleshooting knowledge
  • Understanding of:
    • Networking basics (DNS, TCP/IP, Load Balancers)
    • Application metrics, logs, and events
  • Ability to distinguish false alerts from real incidents
  • Experience with ticketing and incident management tools
    (Jira, ServiceNow, Opsgenie, PagerDuty, etc.)