AI Infrastructure Engineer
About this role
We are looking for the best
At 42dot, our AI Infrastructure Engineer manages the high-performance AI infrastructure orchestrating thousands of GPUs across multiple data centers. You will contribute to the scaling, monitoring, and operational optimization required to maintain a robust and world-class computing environment.
Responsibilities
Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm.
Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery.
Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency.
Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources.
Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.
Qualifications
Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.
Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.
Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.
Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.
Logical approach to problem-solving with the persistence to identify and resolve root causes in complex, large-scale systems.
Strong communication skills to effectively collaborate with cross-functional teams and external partners.
Preferred Qualifications
Experience building observability stacks for large-scale clusters with tools such as Prometheus, Grafana, or Datadog.
Experience in building or operating infrastructure on public cloud platforms such as AWS or GCP.
Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.
Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.
Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.
Familiarity with Infrastructure as Code (IaC) tools such as Terraform to manage complex infrastructure.
Interview Process
Resume Screening - Coding Test - Virtual Interview (approximately 1 hour) - Onsite or Virtual Interview (approximately 3 hours) - Final Offer
Please note that the interview process may vary depending on the position and is subject to change based on scheduling and other circumstances.
Interview schedules and results will be communicated individually via the email address provided in your application.
Additional Information
Please upload all required documents in PDF format.
Veterans and applicants eligible for employment protection will receive preferential consideration in accordance with applicable laws and regulations.
In compliance with the Act on Employment Promotion and Vocational Rehabilitation for Persons with Disabilities, registered individuals with disabilities will receive preferential consideration.
42dot does not accept unsolicited resumes from search firms. We will not pay any fees for resumes submitted without prior agreement.
A 3-month probationary period may apply.
※ Please review the following before applying.
Learn more about how we work at 42dot: 42dot Way →