AI Infrastructure Engineer

42dot · ENGINEERING

🌍 Remote · 📍 Pangyo (Software Dream Center), South Korea · Full-time · 🗓 Posted Feb 9, 2026

About this role

We are looking for the best

At 42dot, the AI Infrastructure Engineer operates the high-performance AI infrastructure that manages and efficiently orchestrates thousands of GPUs across multiple data centers. You will contribute to the scaling, monitoring, and operational optimization needed to maintain a world-class computing environment.

Responsibilities

Operate and maintain large-scale GPU clusters, totaling thousands of GPUs across multiple data centers, using Kubernetes and Slurm.
Monitor and diagnose failures across the GPU hardware and software stacks to maintain high availability and recover quickly from incidents.
Develop automation tools and scripts in Python or Shell to streamline repetitive infrastructure management tasks.
Manage GPU resource quotas and provide technical support to ML researchers and developers to ensure optimal utilization of computing resources.
Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.

Qualifications

Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.
Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.
Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.
Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.
Logical approach to problem-solving, with the persistence to identify and resolve root causes in complex, large-scale systems.
Strong communication skills for collaborating effectively with cross-functional teams and external partners.

Preferred Qualifications

Experience building observability stacks for large-scale clusters with tools such as Prometheus, Grafana, or Datadog.
Experience building and operating infrastructure on public cloud platforms such as AWS or GCP.
Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.
Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.
Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.
Experience managing complex infrastructure with Infrastructure as Code (IaC) tools such as Terraform.

Interview Process

Resume Screening - Coding Test - Virtual Interview (approximately 1 hour) - Onsite or Virtual Interview (approximately 3 hours) - Final Offer
Please note that the interview process may vary depending on the position and is subject to change based on scheduling and other circumstances.
Interview schedules and results will be communicated individually via the email address provided in your application.

Additional Information

Please upload all required documents in PDF format.
Veterans and applicants eligible for employment protection will receive preferential consideration in accordance with applicable laws and regulations.
In compliance with the Act on Employment Promotion and Vocational Rehabilitation for Persons with Disabilities, registered individuals with disabilities will receive preferential consideration.
42dot does not accept unsolicited resumes from search firms and will not pay any fees for resumes submitted without prior agreement.
A 3-month probationary period may apply.

※ Please be sure to review the following before applying.

Learn more about how we work at 42dot, 42dot Way →

Frequently Asked Questions

Is the salary disclosed for the AI Infrastructure Engineer position at 42dot?

The salary for this AI Infrastructure Engineer role at 42dot is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.

Is the AI Infrastructure Engineer job at 42dot remote?

Yes, this AI Infrastructure Engineer position at 42dot is remote, with team members based in Pangyo (Software Dream Center), South Korea. You can work from home or anywhere in the supported regions.

Is the AI Infrastructure Engineer role at 42dot full-time or part-time?

This is listed as a full-time position. It is posted as an AI Infrastructure Engineer role in the ENGINEERING department at 42dot.

Which team or department does the AI Infrastructure Engineer at 42dot belong to?

This AI Infrastructure Engineer position is part of the ENGINEERING department at 42dot. See the full job description for more information about the team structure and responsibilities.

How do I apply for the AI Infrastructure Engineer position at 42dot?

Click the "Apply Now" button on this page. You will be redirected to 42dot's official application portal, hosted on Ashby, where you can submit your application directly.

When was the AI Infrastructure Engineer job at 42dot posted?

This AI Infrastructure Engineer position at 42dot was posted on Feb 9, 2026. Apply as soon as possible; early applications are often reviewed first.
