Senior Distributed Systems Engineer

ifm-us · Engineering
Apply Now ↗
📍 Sunnyvale, CA · 💰 USD 200K–400K

About this role

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology. This role sits at the core of that effort, driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

· You can explain why communication degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime

Application Requirements

· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered

Academic Qualifications

Master’s, or Bachelor’s + 1 year of relevant experience.

Frequently Asked Questions

What is the salary for the Senior Distributed Systems Engineer role at ifm-us?
The listed salary for this Senior Distributed Systems Engineer position at ifm-us is USD 200K–400K. This is a full-time role.
Where is the Senior Distributed Systems Engineer position at ifm-us located?
This Senior Distributed Systems Engineer role at ifm-us is based in Sunnyvale, CA. The position is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.
Which team or department does the Senior Distributed Systems Engineer at ifm-us belong to?
This Senior Distributed Systems Engineer position is part of the Engineering department at ifm-us. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Senior Distributed Systems Engineer position at ifm-us?
Click the "Apply Now" button on this page. You will be redirected to ifm-us's official application portal, hosted on Lever, where you can submit your application directly.
When was the Senior Distributed Systems Engineer job at ifm-us posted?
This Senior Distributed Systems Engineer position at ifm-us was posted on Mar 3, 2026. Apply as soon as possible — early applications are often reviewed first.