Machine Learning Engineer, Evaluation

HackerRank · Engineering
📍 Hybrid in Bangalore, India

About this role

HackerRank helps companies like NVIDIA, Amazon, and Microsoft hire and upskill the next generation of developers based on skills, not pedigree. Our platform is trusted by over 2,500 of the world’s most innovative companies to build strong engineering teams ready for what’s next.

Software has entered an era where humans and AI build side by side. As this shift accelerates, the definition of strong technical talent is changing. We give companies better ways to identify and invest in next-generation skills.

People at HackerRank care deeply about the impact of their work and sweat the small details so our customers can be wildly successful with products they genuinely love to use. We move with urgency and believe great outcomes come from high standards.

About the role

Developers used to be evaluated on whether they could write functionally correct code. Now they are evaluated on whether they can orchestrate AI to accomplish a task while still having the fundamentals underneath. That shift, between what used to matter and what matters now, is exactly the problem this role exists to solve.

Open Problem
How do you measure skill when AI is already in the room?

Software engineering has moved from writing code to using AI to solve problems. That shift sounds simple. The implications for assessment are not. This is not just a take-home assignment problem. It spans live interviews, async assessments, AI-assisted coding environments, pair programming with agents, and every other context in which someone is trying to figure out how good a developer actually is. The tools developers use are changing fast. The frameworks we use to evaluate them have not kept up.

For over a decade, skills-based hiring relied on deterministic evaluation: a candidate's code either passed test cases or it did not. The score was binary and reproducible. What replaces it is genuinely unsolved. Nobody has cracked how to fairly assess human skill in a world where AI assistance is ambient and invisible, where the question is no longer "can you write this function" but "how effectively do you use AI to solve a real problem."

We are moving from deterministic evaluation to evaluation by a council of LLMs. But making that consistent, scalable, and defensible across hundreds of thousands of assessments is a hard research and engineering problem. How do you ensure the same rubric is applied the same way to the 200,000th candidate as to the first? How do you detect when your evaluation model is drifting? How do you explain a score to a candidate who believes they were assessed unfairly?

HackerRank sits at the center of this problem with a rare combination of scale, longitudinal data, and direct relationships with the companies making hiring decisions. The opportunity here is to define what rigorous, fair, and meaningful skill evaluation looks like in the agentic era. That methodology does not exist yet. This role exists to build it.

What you will do

  • Build LLM-powered evaluation pipelines that assess AI usage skills consistently, fairly, and at production scale.
  • Own the evaluation methodology end to end: what the rubric is, how the model applies it, how you measure whether it is being applied correctly, and how you audit for bias.
  • Design and run experiments to determine what good evaluation actually looks like. The answer is not known. You will be finding it.
  • Build RAG pipelines and fine-tuning workflows that make evaluation models adhere reliably to the rules we set for them.
  • Define the benchmarking infrastructure: how we know when our evaluation quality has improved, and how we catch regressions before candidates do.
  • Translate model behavior into outcomes that product managers, enterprise customers, and candidates can understand and trust.

Who you are

  • You have shipped LLM-powered systems in production where consistency and reliability were hard constraints, not nice-to-haves.
  • You think as rigorously about how you measure your model as about the model itself. A poorly constructed eval is a worse outcome than a weaker model.
  • You have a research mindset. You are comfortable operating in a space where the right methodology does not exist yet and needs to be invented.
  • You think in systems. The data pipeline, the model, the serving layer, and the rubric it enforces are one problem to you.
  • You can defend ML judgment in plain language to people who are not ML engineers, because the translation layer is part of the job.

Even better if you have

  • Experience building evaluation frameworks for generative or conversational AI systems.
  • Background in educational assessment, psychometrics, or human-in-the-loop evaluation at scale.
  • Publications or open-source contributions in LLM evaluation, benchmarking, or alignment.
  • Prior work at the interface of research and product, where you had to ship science, not just publish it.

You will thrive here if

  • You find the measurement problem as interesting as the model problem, maybe more interesting. 
  • You hold evaluation methodology to the same standard as model performance, and you are uncomfortable shipping something you cannot explain. 
  • You want your work to define what good looks like in a field that is just now figuring that out.

Want to learn more about HackerRank? Check out HackerRank.com to explore our products, solutions, and resources, and to dive into our story and mission.

HackerRank is a proud equal employment opportunity and affirmative action employer. We provide equal opportunity to everyone for employment based on individual performance and qualification. We never discriminate based on race, religion, national origin, gender identity or expression, sexual orientation, age, marital, veteran, or disability status. All your information will be kept confidential according to EEO guidelines. 

Notice to prospective HackerRank job applicants:

  • Our Recruiters use @hackerrank.com email addresses.
  • We never ask for payment or credit check information to apply, interview, or work here.

Frequently Asked Questions

Is the salary disclosed for the Machine Learning Engineer, Evaluation position at HackerRank?
The salary for this role is not publicly listed. See HackerRank's official careers page to learn more about the compensation package.
Where is the Machine Learning Engineer, Evaluation position at HackerRank located?
This is a hybrid role based in Bangalore, India. Check the full job description or apply directly to confirm the work arrangement.
Which team or department does the Machine Learning Engineer, Evaluation at HackerRank belong to?
This position is part of the Engineering department at HackerRank. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Machine Learning Engineer, Evaluation position at HackerRank?
Applications are submitted directly through HackerRank's official application portal, which is hosted on Greenhouse.
When was the Machine Learning Engineer, Evaluation job at HackerRank posted?
This position was posted on Apr 9, 2026. Apply as soon as possible, since early applications are often reviewed first.