Operations Engineer, HPC Networking

fal · Engineering
๐ŸŒ Remote๐Ÿ“ Remote

About fal

fal is the generative media ecosystem powering the next generation of AI products. We build the infrastructure, tools, and model access that teams need to move from idea to production, and do it at scale without compromise. For developers and enterprises, fal is the foundation that makes generative media not just possible, but practical: a unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.

As generative media reshapes industries across a market projected to grow by hundreds of billions over the next decade, fal is becoming the ecosystem that ambitious teams build on.

About the role

We're hiring an Operations Engineer for HPC Networking to keep our InfiniBand and Ethernet fabrics healthy as we scale.

This is a hands-on role. You'll bring up new fabrics alongside DC ops, monitor the ones in production, and chase down the weird stuff: link flaps, congestion, NCCL stalls, firmware bugs that only show up at scale.

You're a fit if you've:

  • Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
  • Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
  • Brought up new fabrics from cable pull through validation.
  • Scripted your way through repetitive operational work (Bash, Python, Go, whatever).
  • Nice to have: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.

Who you are:

  • Detail-oriented. Cable plant hygiene is a personality trait.
  • Calm under fire. A fabric incident during a customer training run doesn't rattle you.
  • You read vendor release notes for fun, or at least out of self-defense.
  • You'd rather find the root cause than reboot the switch.

Responsibilities:

  • Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
  • Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
  • Support fabric bring-up alongside DC ops and customer-facing teams.
  • Run maintenance and upgrades on switches and control plane components.
  • Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
  • Improve the tooling and runbooks so the next incident resolves faster than the last.

Frequently Asked Questions

Is the salary disclosed for the Operations Engineer, HPC Networking position at fal?
The salary for this Operations Engineer, HPC Networking role at fal is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Is the Operations Engineer, HPC Networking job at fal remote?
Yes, this Operations Engineer, HPC Networking position at fal is remote. You can work from home or anywhere in the supported regions.
Which team or department does the Operations Engineer, HPC Networking at fal belong to?
This Operations Engineer, HPC Networking position is part of the Engineering department at fal. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Operations Engineer, HPC Networking position at fal?
Click the "Apply Now" button on this page. You will be redirected to fal's official application portal, hosted on Greenhouse, where you can submit your application directly.
When was the Operations Engineer, HPC Networking job at fal posted?
This Operations Engineer, HPC Networking position at fal was posted on May 14, 2026. Apply as soon as possible, since early applications are often reviewed first.