Senior Distributed Systems Engineer

ifm-us · Engineering

📍 Sunnyvale, CA · 💰 USD 200K–400K · 🗓 Posted Mar 3, 2026

About this role

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
 
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
·       Design and optimize expert-parallel and hybrid-parallel communication patterns
·       Drive high-performance hierarchical collectives for MoE workloads
·       Co-design runtime orchestration with communication topology awareness
·       Reduce tail latency and improve determinism across thousands of GPUs
·       Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
·       Communication-compute overlap and topology-aware collective optimization
·       Deep debugging of NCCL, RDMA, and custom communication layers
·       Hybrid expert parallel strategies in modern large-scale MoE systems
·       Elastic and resilient distributed job orchestration concepts
·       Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
·       Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
·       Hybrid expert parallel communication for Mixture-of-Experts training
·       Scaling behavior under network pressure
·       Distributed orchestration for elastic, large-scale training
·       Fault detection and recovery in distributed GPU workloads
·       Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
·       Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
·       Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
·       Deep familiarity with NCCL and/or UCX internals
·       Strong systems programming ability (C/C++, Rust, or Go)
·       Strong familiarity with modern model training frameworks such as PyTorch
·       Ability to troubleshoot and profile training performance issues related to communication bottlenecks
·       Ability to translate research ideas into production-grade optimizations
·       Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
·       You can explain why a communication pattern degrades at scale and how to fix it
·       You have improved real cluster throughput via communication redesign
·       You can trace a distributed hang across ranks and identify the root cause
·       You are comfortable working at the boundary between hardware and runtime
Application Requirements
·       Include a link to your GitHub (required)
·       Provide links to relevant distributed systems, HPC, or large-scale training projects
·       Include a list of publications and/or public technical reports (if applicable)
·       Describe the hardest distributed debugging problem you solved
·       Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.

Frequently Asked Questions

What is the salary for the Senior Distributed Systems Engineer role at ifm-us?

The listed salary for this Senior Distributed Systems Engineer position at ifm-us is USD 200K–400K. This is a full-time role.

Where is the Senior Distributed Systems Engineer position at ifm-us located?

This Senior Distributed Systems Engineer role at ifm-us is based in Sunnyvale, CA. The position is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.

Which team or department does the Senior Distributed Systems Engineer at ifm-us belong to?

This Senior Distributed Systems Engineer position is part of the Engineering department at ifm-us. See the full job description for more information about the team structure and responsibilities.

How do I apply for the Senior Distributed Systems Engineer position at ifm-us?

Click the "Apply Now" button on this page. You will be redirected to ifm-us's official application portal, hosted on Lever, where you can submit your application directly.

When was the Senior Distributed Systems Engineer job at ifm-us posted?

This Senior Distributed Systems Engineer position at ifm-us was posted on Mar 3, 2026. Apply as soon as possible — early applications are often reviewed first.
