Machine Learning - Infrastructure

causal · Engineering
📍 San Francisco · Full-time

About this role

Our mission is general causal intelligence: AI that is capable of (1) predicting the future and (2) identifying the optimal actions to change that future.

To achieve this breakthrough, we are building a Large Physics foundation Model (LPM), because domains governed by physics have inherent cause-and-effect relationships, unlike visual or textual data.

Weather is the ideal training ground for an LPM. It is the most well-observed physical system, offering rapid, objective ground-truth feedback from sensory observations, with data at a scale that dwarfs what is used to train today’s LLMs.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics - including Google DeepMind, Cruise, Waymo, Insitro, and Nabla Bio - who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems.

Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities

  • Design, deploy, and maintain large distributed ML training and inference clusters

  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle

  • Research and test training approaches, including parallelization techniques and numerical precision trade-offs, across different model scales

  • Analyze, profile, and debug low-level GPU operations to optimize performance

  • Stay up to date on research and bring new ideas into our work

What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads

  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) for training large foundation models

  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings

  • Familiarity with containerization and orchestration frameworks (e.g., Docker, Kubernetes)

  • Background working on distributed task management systems and scalable model serving & deployment architectures

  • Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.

Frequently Asked Questions

Is the salary disclosed for the Machine Learning - Infrastructure position at causal?
The salary for this Machine Learning - Infrastructure role at causal is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Where is the Machine Learning - Infrastructure position at causal located?
This Machine Learning - Infrastructure role at causal is based in San Francisco. The position is listed as on-site or hybrid. Check the full job description or apply directly to confirm the work arrangement.
Is the Machine Learning - Infrastructure role at causal full-time or part-time?
This is listed as a full-time position. It is posted as a Machine Learning - Infrastructure role in the Engineering department at causal.
Which team or department does the Machine Learning - Infrastructure at causal belong to?
This Machine Learning - Infrastructure position is part of the Engineering department at causal. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Machine Learning - Infrastructure position at causal?
Click the "Apply Now" button on this page. You will be redirected to causal's official application portal, hosted on Ashby, where you can submit your application directly.
When was the Machine Learning - Infrastructure job at causal posted?
This Machine Learning - Infrastructure position at causal was posted on Oct 29, 2025. Apply as soon as possible; early applications are often reviewed first.