LLM Pre-training & Distributed Engineer (AI Infrastructure)

hyphenconnect · Engineering
Apply Now ↗
📍 San Francisco Bay Area, USA

About this role

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate has a deep understanding of GPU clusters and extensive systems engineering experience, and can keep training runs efficient and reliable.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA), and tune memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).

Frequently Asked Questions

Is the salary disclosed for the LLM Pre-training & Distributed Engineer (AI Infrastructure) position at hyphenconnect?
The salary for this role is not publicly listed. Click "Apply Now" to learn more about the compensation package on hyphenconnect's official careers page.
Where is the LLM Pre-training & Distributed Engineer (AI Infrastructure) position at hyphenconnect located?
This role is based in the San Francisco Bay Area, USA, and is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.
Which team or department does the LLM Pre-training & Distributed Engineer (AI Infrastructure) at hyphenconnect belong to?
This position is part of the Engineering department at hyphenconnect. See the full job description for more information about the team structure and responsibilities.
How do I apply for the LLM Pre-training & Distributed Engineer (AI Infrastructure) position at hyphenconnect?
Click the "Apply Now" button on this page. You will be redirected to hyphenconnect's official application portal, hosted on Greenhouse, where you can submit your application directly.
When was the LLM Pre-training & Distributed Engineer (AI Infrastructure) job at hyphenconnect posted?
This position was posted on Apr 24, 2026. Apply as soon as possible, since early applications are often reviewed first.