Summary
Built the platform the team uses to grade model outputs before and after training runs: rubric authoring, blind pairwise comparisons, rater calibration, and exportable scorecards that feed directly into the RLHF preference dataset. Replaced a patchwork of spreadsheets and ad-hoc scripts.
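Since blind pairwise comparisons feed the RLHF preference dataset, the core transformation is turning a rater's verdict into a chosen/rejected pair. A minimal sketch of that step — the field names (`prompt`, `output_a`, `winner`, etc.) are illustrative assumptions, not the platform's actual export schema:

```python
import json

def to_preference_record(comparison: dict) -> dict:
    """Convert one blind pairwise comparison into an RLHF-style
    preference record. Hypothetical schema: field names are
    illustrative, not the platform's real export format."""
    winner = comparison["winner"]  # "a" or "b", chosen blind by the rater
    return {
        "prompt": comparison["prompt"],
        "chosen": comparison["output_a"] if winner == "a" else comparison["output_b"],
        "rejected": comparison["output_b"] if winner == "a" else comparison["output_a"],
        "rater_id": comparison["rater_id"],
    }

record = to_preference_record({
    "prompt": "Summarize the article.",
    "output_a": "Short summary.",
    "output_b": "A rambling answer.",
    "winner": "a",
    "rater_id": "r42",
})
print(json.dumps(record))
```

Each record serializes to one JSONL line, which matches the chosen/rejected pair format most preference-tuning pipelines expect.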
What we delivered
- Rubric authoring UI
- Blind pairwise comparison flow
- Rater onboarding + calibration
- Scorecard export (CSV + JSONL)
- Admin metrics (IRR, drift, throughput)
- Model-provider adapter for reference outputs
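The IRR metric in the admin dashboard can be sketched as Cohen's kappa between two raters labeling the same items — a minimal illustration under that assumption; the platform's exact IRR formula may differ:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Inter-rater reliability (Cohen's kappa) between two raters
    who judged the same items. 1.0 = perfect agreement,
    0.0 = no better than chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters agree on 3 of 4 pairwise verdicts
kappa = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
print(round(kappa, 2))  # → 0.5
```

Tracking kappa per rater pair over time is also a simple way to surface the drift metric: a calibrated rater whose kappa against the pool declines is drifting.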
Stack
Python · FastAPI · PostgreSQL · Next.js · LangGraph · OpenAI API
Duration: 6 months
Year: 2025
Industry: AI / ML