Senior Software Engineer — AI Evaluation & Benchmarks (Python)

g2i · Software Engineering for AI
Apply Now ↗
Remote · Contract
Locations: Adana, Ankara, Antalya, Antwerp, Arequipa, Arlington, Asunción, Athens, Atlanta, Austin, Banská Bystrica, Barcelona, Barquisimeto, Barranquilla, Bayamón, Belfast, Belgrade, Belo Horizonte, Belém, Berlin, Birmingham, Bogotá, Boise, Boston, Brasília, Bratislava, Bremen, Bristol, Bruges, Brussels, Bucharest, Budapest, Buenos Aires, Bursa, Calgary, Cali, Campinas, Caracas, Cardiff, Cartagena, Charleroi, Charlotte, Chesapeake, Chicago, Chihuahua, Ciudad Juárez, Ciudad del Este, Cluj-Napoca, Cochabamba, Columbus, Cork, Cuenca, Curitiba, Córdoba, Dallas, Debrecen, Denver, Dortmund, Dresden, Dublin, Duisburg, Durrës, Düsseldorf, Ecatepec, Edinburgh, Edmonton, El Alto, El Paso, Fairfax, Fort Worth, Fortaleza, Frankfurt am Main, Funchal, Gaziantep, Gdańsk, Ghent, Glasgow, Goiânia, Graz, Guadalajara, Guarulhos, Guayaquil, Hamburg, Hannover, Heraklion, Houston, Indianapolis, Istanbul, Jacksonville, Kaunas, Kayseri, Konya, Kraków, Köln, La Paz, La Plata, Las Vegas, Leeds, Leipzig, León, Lima, Limerick, Linz, Lisbon, Liverpool, Liège, London, Lyon, Madrid, Manaus, Manchester, Mar del Plata, Maracaibo, Maracay, Marseille, Medellín, Mersin, Mexico City, Miami, Milan, Miskolc, Mississauga, Monterrey, Montevideo, Montreal, Mérida, München, Namur, Nantes, Naples, Nashville, Nice, Norfolk, Nottingham, Novi Sad, Nürnberg, Oklahoma City, Oruro, Ottawa, Palermo, Paris, Patras, Philadelphia, Phoenix, Piraeus, Plovdiv, Podgorica, Porto, Porto Alegre, Poznań, Prague, Prishtinë, Puebla, Pécs, Quito, Recife, Reno, Richmond, Riga, Rio de Janeiro, Rome, Rosario, Salvador, Salzburg, San Antonio, San Juan, Santa Cruz de la Sierra, Santiago de Chile, Santiago de los Caballeros, Santo Domingo, Santo Domingo Este, Santo Domingo Oeste, Santo Domingo de Guzmán, Sarajevo, Savannah, Seattle, Seville, Sheffield, Skopje, Sofia, Southampton, Strasbourg, Stuttgart, Szeged, São Paulo, Tallinn, Tartu, Thessaloniki, Tijuana, Tirana, Toronto, Toulouse, Turin, Valencia, Valletta, Vancouver, Varna, Vienna, Vilnius, Virginia Beach, Warsaw, Washington, D.C., Winnipeg, Wrocław, Zapopan, İzmir, Łódź

About this role

Before Applying

This role is open to contractors in accepted locations only. Please confirm your country is on the list of accepted countries and locations before applying; we're unable to process applications from unlisted locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

  • Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code

  • Build and maintain scalable data pipelines for evaluation workflows

  • Analyze model-generated code for correctness, reliability, and edge-case failures

  • Construct structured evaluation scenarios across large repos and multi-language environments

  • Provide detailed technical feedback on model performance and failure patterns

  • Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
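
To make that loop concrete, here is a minimal Python sketch of one pass through it. It is an illustration under stated assumptions, not g2i's actual tooling: the `Task` structure, the failure labels, and the `toy_model` stand-in for a real model call are all hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One benchmark item (hypothetical shape): a prompt plus checks to pass."""
    task_id: str
    prompt: str
    entry_point: str                   # function name the solution must define
    cases: list[tuple[tuple, object]]  # (args, expected result) pairs


def run_task(task: Task, generate: Callable[[str], str]) -> str:
    """Run one task and return 'pass' or a coarse failure label."""
    source = generate(task.prompt)
    namespace: dict = {}
    try:
        # exec on untrusted model output is unsafe; a real harness
        # would run this step inside a sandbox or container.
        exec(source, namespace)
    except Exception:
        return "does_not_load"
    func = namespace.get(task.entry_point)
    if not callable(func):
        return "missing_entry_point"
    for args, expected in task.cases:
        try:
            if func(*args) != expected:
                return "wrong_answer"
        except Exception:
            return "runtime_error"
    return "pass"


def evaluate(tasks: list[Task], generate: Callable[[str], str]) -> Counter:
    """Tally outcomes across tasks so failure patterns are easy to see."""
    return Counter(run_task(task, generate) for task in tasks)


def toy_model(prompt: str) -> str:
    """Stand-in for a model call: a buggy solution that ignores negatives."""
    return "def absolute(x):\n    return x\n"


TASKS = [
    Task("abs-1", "Write absolute(x).", "absolute", [((3,), 3), ((-3,), 3)]),
    Task("abs-2", "Write absolute(x).", "absolute", [((0,), 0)]),
]

report = evaluate(TASKS, toy_model)
print(report)
```

With the toy model above, the task that includes a negative input is labeled `wrong_answer` while the zero-only task passes. That is the kind of finding that feeds back into the benchmark: a task whose cases miss the edge case cannot separate a correct solution from a buggy one.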

What You'll Need

  • 4+ years of professional software engineering experience (non-negotiable)

  • Expert Python — clean, performant, well-tested code

  • Hands-on experience working in large, complex codebases

  • Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines

  • Strong command of Git and modern development workflows

  • Track record at a high-growth tech company or top-tier software organization

  • Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to have

  • Senior or Lead-level profile with a history of technical ownership

  • Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)

  • Proficiency in additional languages: JavaScript, Go, C++, or others

  • CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)

  • Background in security engineering or significant open-source contributions

  • Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics

  • Location: Fully remote — work from anywhere on the accepted locations list

  • Compensation: $80–$100/hr based on location and seniority

  • Contract length: 3 months, with potential for extension

  • Hours: Full-time availability preferred — hours vary by project and are not guaranteed week to week

  • Engagement: 1099 independent contractor

  • Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.

Frequently Asked Questions

Is the salary disclosed for the Senior Software Engineer — AI Evaluation & Benchmarks (Python) position at g2i?
Yes. The listing states compensation of $80–$100/hr, based on location and seniority.
Is the Senior Software Engineer — AI Evaluation & Benchmarks (Python) job at g2i remote?
Yes, this position at g2i is fully remote. You can work from home or from anywhere on the accepted locations list.
Is the Senior Software Engineer — AI Evaluation & Benchmarks (Python) role at g2i full-time or part-time?
This is a contract position in the Software Engineering for AI department at g2i. Full-time availability is preferred, but hours vary by project and are not guaranteed week to week.
Which team or department does the Senior Software Engineer — AI Evaluation & Benchmarks (Python) at g2i belong to?
This Senior Software Engineer — AI Evaluation & Benchmarks (Python) position is part of the Software Engineering for AI department at g2i. See the full job description for more information about the team structure and responsibilities.
How do I apply for the Senior Software Engineer — AI Evaluation & Benchmarks (Python) position at g2i?
Click the "Apply Now" button on this page. You will be redirected to g2i's official application portal, hosted on Ashby, where you can submit your application directly.
When was the Senior Software Engineer — AI Evaluation & Benchmarks (Python) job at g2i posted?
This Senior Software Engineer — AI Evaluation & Benchmarks (Python) position at g2i was posted on May 13, 2026. Apply as soon as possible — early applications are often reviewed first.