Sonnet Code
AI / ML · 2025 · 6 months

Human-in-the-loop evaluation harness

Summary

Built the platform the team uses to grade model outputs before and after training runs: rubric authoring, blind pairwise comparisons, rater calibration, and exportable scorecards that feed directly into the RLHF preference dataset. Replaced a patchwork of spreadsheets and ad-hoc scripts.
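The blind pairwise flow hinges on two details: raters must not be able to tell which model produced which output, and their vote must map back to the un-blinded labels before it lands in the preference dataset. A minimal sketch of that round trip (field names and functions are hypothetical, not the platform's actual schema):

```python
import json
import random


def make_blind_pair(prompt: str, output_a: str, output_b: str,
                    rng: random.Random) -> dict:
    """Randomly assign sides so raters can't infer which model wrote which output."""
    flipped = rng.random() < 0.5
    left, right = (output_b, output_a) if flipped else (output_a, output_b)
    return {"prompt": prompt, "left": left, "right": right, "flipped": flipped}


def to_preference_row(pair: dict, rater_choice: str) -> str:
    """Undo the blinding flip and emit one JSONL line for the preference dataset."""
    chose_left = rater_choice == "left"
    # If sides were swapped at blinding time, 'left' is really output B.
    if chose_left:
        preferred = "b" if pair["flipped"] else "a"
    else:
        preferred = "a" if pair["flipped"] else "b"
    return json.dumps({"prompt": pair["prompt"], "chosen": preferred})


rng = random.Random(42)
pair = make_blind_pair("Summarize the report.", "draft A", "draft B", rng)
print(to_preference_row(pair, "left"))
```

Keeping the `flipped` bit on the pair record (rather than in the UI layer) is what lets the export stage de-blind votes deterministically.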

What we delivered

  • Rubric authoring UI
  • Blind pairwise comparison flow
  • Rater onboarding + calibration
  • Scorecard export (CSV + JSONL)
  • Admin metrics (inter-rater reliability, drift, throughput)
  • Model-provider adapter for reference outputs
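One of the admin metrics above, inter-rater reliability (IRR), is commonly computed as Cohen's kappa: raw agreement between two raters corrected for the agreement expected by chance. A minimal sketch under that assumption (not the platform's actual implementation):

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: rater agreement corrected for chance agreement."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the raters' marginal frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


print(cohens_kappa(["win", "win", "tie", "loss"],
                   ["win", "tie", "tie", "loss"]))
```

A dashboard would typically bucket kappa per rubric dimension and flag raters who drift below a calibration threshold.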

Stack

Python · FastAPI · PostgreSQL · Next.js · LangGraph · OpenAI API
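The model-provider adapter in the deliverables list suggests a thin interface between the harness and whichever API supplies reference outputs. A sketch of that pattern using a structural `Protocol` (interface and class names are hypothetical):

```python
from typing import Protocol


class ProviderAdapter(Protocol):
    """Common interface so reference outputs can come from any provider."""
    def generate(self, prompt: str) -> str: ...


class CannedAdapter:
    """Test double: replays stored outputs instead of calling a live API."""
    def __init__(self, responses: dict):
        self.responses = responses

    def generate(self, prompt: str) -> str:
        return self.responses[prompt]


def reference_output(adapter: ProviderAdapter, prompt: str) -> str:
    # The harness depends only on the Protocol, never on a concrete provider.
    return adapter.generate(prompt)


adapter = CannedAdapter({"hello": "world"})
print(reference_output(adapter, "hello"))
```

The same seam lets evaluation runs swap a live provider client for canned outputs in tests, which keeps scorecard regression suites deterministic.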


Duration: 6 months
Year: 2025
Industry: AI / ML

Want us to build yours? Book a 15-minute call.