Reliability Operations Engineer (Malaysia)

serverobotics · Software
📍 Penang, Malaysia · Full-Time

About this role

At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses.

The Serve fleet has been delighting merchants, customers, and pedestrians along the way in Los Angeles, Miami, Dallas, Atlanta and Chicago while doing commercial deliveries. We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity.

Who We Are

We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.

The Reliability Operations Engineer supports the operational reliability of robotic and cloud systems by handling Tier 2 escalations, following and improving runbooks, and performing technical investigations during your region’s daytime hours. This role works closely with senior team members, product engineering, and SREs to investigate issues, refine operational workflows, and strengthen system health. This position contributes to incident response by providing triage and clear communication, ensuring timely escalation and effective coordination across teams.

Responsibilities

  • Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.

  • Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.

  • Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.

  • Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.

  • Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.

  • Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.

  • Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.

  • Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.

  • Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent hands-on experience.

  • 2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support function.

  • Experience participating in Tier 1 or Tier 2 investigations, including log review, basic triage, and structured escalation.

  • Exposure to operational environments supporting distributed or cloud-based systems.

  • Participation in incident response workflows and/or on-call rotations.

  • Proficiency with Linux, including navigating systems, reviewing logs, and performing basic diagnostics.

  • Experience using and contributing to runbooks and operational workflows.

  • Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.

  • Familiarity with cloud platforms, preferably Google Cloud Platform (GCP).

  • Ability to follow documented remediation steps, with good judgment around when to escalate.

  • Understanding of CI/CD pipelines and how application deployments affect runtime behavior.

  • Experience using Jira or similar ticketing systems.

  • Clear and effective communicator, especially when providing updates during time-sensitive operational issues.

  • Calm, organized approach to troubleshooting and prioritization.

  • Collaborative mindset, working effectively with senior operations engineers, product teams, and SREs.

  • Strong sense of ownership and accountability for operational responsibilities.

What Makes You Stand Out

  • Prior experience participating in high-severity incident response or supporting operational incidents.

  • Exposure to robot fleets, IoT systems, or other distributed physical device environments.

  • Ability to write or modify lightweight scripts and automations to improve operational workflows.

  • Familiarity with incident management platforms such as PagerDuty, OpsGenie, Jira Service Management, or Grafana IRM.

  • Experience contributing to the creation or improvement of operational runbooks and support documentation.

  • Strong networking fundamentals; familiarity with Tailscale or similar zero-trust networking tools is a plus.

  • Demonstrated ability to learn quickly and contribute to improving operational maturity within a team.

Additional Information

  • As part of maintaining continuous operational coverage, this role also participates in a rotating weekend on-call schedule shared across the Reliability Operations team.

Frequently Asked Questions

Is the salary disclosed for the Reliability Operations Engineer (Malaysia) position at serverobotics?
The salary for this Reliability Operations Engineer (Malaysia) role at serverobotics is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Where is the Reliability Operations Engineer (Malaysia) position at serverobotics located?
This Reliability Operations Engineer (Malaysia) role at serverobotics is based in Penang, Malaysia. The position is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.
Is the Reliability Operations Engineer (Malaysia) role at serverobotics full-time or part-time?
This is listed as a full-time position. It is posted as a Reliability Operations Engineer (Malaysia) role in the Software department at serverobotics.
Which team or department does the Reliability Operations Engineer (Malaysia) at serverobotics belong to?
This Reliability Operations Engineer (Malaysia) position is part of the Software department at serverobotics. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Reliability Operations Engineer (Malaysia) position at serverobotics?
Click the "Apply Now" button on this page. You will be redirected to serverobotics's official application portal, hosted on Ashby, where you can submit your application directly.
When was the Reliability Operations Engineer (Malaysia) job at serverobotics posted?
This Reliability Operations Engineer (Malaysia) position at serverobotics was posted on Apr 28, 2026. Apply as soon as possible — early applications are often reviewed first.