SRE Engineer
redotpay · 11# Central Hub
About this role
Role Overview
As a Site Reliability Engineer (SRE), you will be the guardian of our app and core business systems, ensuring their stability, availability, and recoverability. Through robust monitoring and alerting, incident response, release governance, capacity planning, automation, and disaster recovery drills, you will safeguard our end-user experience and maintain uninterrupted business continuity.
Core Responsibilities
App Stability Assurance
- Own stability monitoring for critical user journeys, including login, homepage, trading, payments, deposits/withdrawals, and core APIs.
- Define and track core Service Level Indicators (SLIs) such as user-side availability, API success/error rates, latency, and crash rates.
- Promptly detect and address issues like app launch failures, API timeouts, service degradation, and regional access anomalies.
Monitoring, Alerting & Observability
- Build and optimize comprehensive observability capabilities encompassing logs, metrics, distributed tracing, business probes, and Real User Monitoring (RUM).
- Refine alerting rules to reduce noise/false positives and improve the accuracy of incident detection.
- Establish and enforce tiered incident classification (P0/P1/P2), alongside clear notification, escalation, and response protocols.
Incident Response & Emergency Handling
- Lead or actively participate in production incident triage, mitigation, recovery, and post-mortem analysis.
- Develop and maintain emergency runbooks for critical scenarios (e.g., app downtime, core API failures, database anomalies, cloud service outages, network disruptions).
- Drive Root Cause Analysis (RCA) and ensure the closed-loop implementation of corrective actions.
Release & Change Stability Governance
- Participate in establishing best practices for production releases, canary/gray deployments, rollbacks, change windows, and post-release monitoring.
- Identify and mitigate stability risks throughout the release pipeline to prevent incidents caused by deployments or configuration changes.
- Champion the adoption of automated deployments, automated rollbacks, and advanced change risk controls.
Capacity, Performance & Resilience
- Contribute to capacity planning, performance stress testing, resource utilization monitoring, and scaling strategies.
- Drive the implementation of reliability patterns, including rate limiting, graceful degradation, circuit breaking, and backup/restore mechanisms.
- Regularly organize or participate in chaos engineering/fault drills, disaster recovery exercises, and restoration validation.
Automation & Toil Reduction
- Develop tools and platforms for automated health checks, alert analysis, and system self-healing.
- Eliminate manual toil to drastically improve the efficiency of production issue resolution.
- Standardize operations by documenting Standard Operating Procedures (SOPs), runbooks, and post-mortem templates.
Qualifications
- Solid understanding of core infrastructure components: Linux, networking, databases, caching, middleware, and cloud services.
- Familiarity with common modern architectures: app backend services, API gateways, load balancing, CDN, and Kubernetes/containerization.
- Hands-on experience with one or more monitoring and observability ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, CloudWatch, APM, distributed tracing).
- Proven track record in handling production incidents, with the ability to independently perform log analysis, trace debugging, performance profiling, and system recovery.
- Strong understanding of SRE workflows, including deployments, canary releases, rollbacks, capacity planning, incident response, and post-mortems.
- Proficiency in scripting or development (Shell, Python, or Go) to build automation tools.
- Preferred: Experience ensuring the stability of global apps, or a background in payments, FinTech, Web3, or cross-border business.
Frequently Asked Questions
Is the salary disclosed for the SRE Engineer position at redotpay?
The salary for this SRE Engineer role at redotpay is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Where is the SRE Engineer position at redotpay located?
This SRE Engineer role at redotpay is based in Lok Ma Chau, Hong Kong. The position is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.
Is the SRE Engineer role at redotpay full-time or part-time?
This is listed as a full-time position. It is posted as an SRE Engineer role in the 11# Central Hub department at redotpay.
Which team or department does the SRE Engineer at redotpay belong to?
This SRE Engineer position is part of the 11# Central Hub department at redotpay. See the full job description for more information about the team structure and responsibilities.
How do I apply for the SRE Engineer position at redotpay?
Click the "Apply Now" button on this page. You will be redirected to redotpay's official application portal, hosted on BambooHR, where you can submit your application directly.
When was the SRE Engineer job at redotpay posted?
This SRE Engineer position at redotpay was posted on May 15, 2026. Apply as soon as possible; early applications are often reviewed first.