AI Infrastructure Engineer

42dot · ENGINEERING
Remote · Pangyo (Software Dream Center), South Korea · Full-time

About this role

We are looking for the best

At 42dot, our AI Infrastructure Engineer manages the high-performance AI infrastructure orchestrating thousands of GPUs across multiple data centers. You will contribute to the scaling, monitoring, and operational optimization required to maintain a robust and world-class computing environment.

Responsibilities

  • Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm.

  • Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery.

  • Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency.

  • Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources.

  • Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.

Qualifications

  • Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.

  • Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.

  • Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.

  • Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.

  • Logical approach to problem-solving with the persistence to identify and resolve root causes in complex, large-scale systems.

  • Strong communication skills to effectively collaborate with cross-functional teams and external partners.

Preferred Qualifications

  • Experience in building observability stacks with Prometheus, Grafana, and Datadog for large-scale clusters.

  • Experience in building or operating infrastructure on public cloud platforms such as AWS or GCP.

  • Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.

  • Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.

  • Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.

  • Familiarity with Infrastructure as Code (IaC) tools such as Terraform to manage complex infrastructure.

Interview Process

  • Resume Screening - Coding Test - Virtual Interview (approximately 1 hour) - Onsite or Virtual Interview (approximately 3 hours) - Final Offer

  • Please note that the interview process may vary depending on the position and is subject to change based on scheduling and other circumstances.

  • Interview schedules and results will be communicated individually via the email address provided in your application.

Additional Information

  • Please upload all required documents in PDF format.

  • Veterans and applicants eligible for employment protection will receive preferential consideration in accordance with applicable laws and regulations.

  • In compliance with the Act on Employment Promotion and Vocational Rehabilitation for Persons with Disabilities, registered individuals with disabilities will receive preferential consideration.

  • 42dot does not accept unsolicited resumes from search firms. We will not pay any fees for resumes submitted without prior agreement.

  • A 3-month probationary period may apply.

※ Please be sure to review the information below before applying.

Frequently Asked Questions

Is the salary disclosed for the AI Infrastructure Engineer position at 42dot?
The salary for this AI Infrastructure Engineer role at 42dot is not publicly listed. Click "Apply Now" to learn more about the compensation package on their official careers page.
Is the AI Infrastructure Engineer job at 42dot remote?
Yes, this AI Infrastructure Engineer position at 42dot is remote, with team members based in Pangyo (Software Dream Center), South Korea. You can work from home or anywhere in the supported regions.
Is the AI Infrastructure Engineer role at 42dot full-time or part-time?
This is listed as a full-time position. It is posted as an AI Infrastructure Engineer role in the ENGINEERING department at 42dot.
Which team or department does the AI Infrastructure Engineer at 42dot belong to?
This AI Infrastructure Engineer position is part of the ENGINEERING department at 42dot. See the full job description for more information about the team structure and responsibilities.
How do I apply for the AI Infrastructure Engineer position at 42dot?
Click the "Apply Now" button on this page. You will be redirected to 42dot's official application portal, hosted on Ashby, where you can submit your application directly.
When was the AI Infrastructure Engineer job at 42dot posted?
This AI Infrastructure Engineer position at 42dot was posted on Feb 9, 2026. Apply as soon as possible; early applications are often reviewed first.