Site Reliability Engineer - AI Infrastructure

Andromeda · Engineering
🌍 Remote · 📍 Global Remote / San Francisco, CA · Full-Time

About this role

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as capital flows through the world's financial markets.

We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do

  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.

  • Build automation and tooling to streamline cluster deployments and integrations.

  • Debug customer issues across networking, storage, scheduling, and system layers.

  • Improve reliability and scalability of both training and inference infrastructure.

  • Design and implement monitoring, alerting, and observability for critical systems.

  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services.

  • Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.

  • Strong Linux systems and networking fundamentals.

  • Deep experience with Kubernetes and container orchestration at scale.

  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).

  • Strong automation and scripting skills (Python, Go, or Bash).

  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).

  • Track record of operating production systems and leading incident response.

Nice to Have

  • Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).

  • Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).

  • Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Frequently Asked Questions

Is the salary disclosed for the Site Reliability Engineer - AI Infrastructure position at Andromeda?
The salary for this Site Reliability Engineer - AI Infrastructure role at Andromeda is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Is the Site Reliability Engineer - AI Infrastructure job at Andromeda remote?
Yes, this Site Reliability Engineer - AI Infrastructure position at Andromeda is remote; the listed location is Global Remote / San Francisco, CA. You can work from home or anywhere in the supported regions.
Is the Site Reliability Engineer - AI Infrastructure role at Andromeda full-time or part-time?
This is listed as a full-time position. It is posted as a Site Reliability Engineer - AI Infrastructure role in the Engineering department at Andromeda.
Which team or department does the Site Reliability Engineer - AI Infrastructure at Andromeda belong to?
This Site Reliability Engineer - AI Infrastructure position is part of the Engineering department at Andromeda. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Site Reliability Engineer - AI Infrastructure position at Andromeda?
Click the "Apply Now" button on this page. You will be redirected to Andromeda's official application portal, hosted on Ashby, where you can submit your application directly.
When was the Site Reliability Engineer - AI Infrastructure job at Andromeda posted?
This Site Reliability Engineer - AI Infrastructure position at Andromeda was posted on Nov 6, 2025. Apply as soon as possible; early applications are often reviewed first.