Research and product systems
Evaluation Specialist
This role focuses on the hardest part of AI measurement: deciding what good looks like and making that standard repeatable. You will design human evaluation protocols, build review workflows, and help the team distinguish reliable signals from plausible noise in AI-generated content.
Role summary
Own the quality standards for how Chatobserver evaluates AI answers, citations, and visibility signals — and build the human review layer that keeps machine output honest.
Why this role exists
As prompt volume scales, the gap between raw output and trustworthy insight grows. We need someone who treats evaluation quality as a discipline, not a checkbox.
What you'll work on
- Design and maintain evaluation rubrics for answer quality, citation accuracy, and positioning signals.
- Run structured human review workflows to label and audit machine-generated analysis outputs.
- Identify systematic error patterns in the current evaluation pipeline and propose remediation.
- Collaborate with research and product to translate evaluation findings into product improvements.
What a good fit looks like
- Deep experience designing annotation guidelines, evaluation rubrics, or quality review workflows.
- Strong analytical instincts for identifying bias, inconsistency, and labeling noise in structured datasets.
- Comfort working with LLM outputs and an understanding of where they tend to fail in practice.
- Clear writing and the ability to articulate why a quality standard is the right one.
What will excite you here
- Defining what high quality actually means for a product category that lacks established benchmarks.
- Building evaluation infrastructure that improves the entire product's trustworthiness.
- Working at the interface between human judgment and automated analysis.
First 90 days
1. Audit the current evaluation rubrics and identify the top gaps in coverage or consistency.
2. Design a structured review workflow for at least one core analysis type.
3. Ship a measurable improvement to inter-rater reliability on a key evaluation task.
Hiring process
The process is deliberately short, direct, and grounded in real work.
1. Application: Send us your background, your relevant work, and why this role is a fit for you.
2. Core conversation: An exchange focused on your work, your judgment, and the role.
3. Role deep dive: A discussion or exercise that looks more like the real work than a generic interview loop.
4. Founder conversation: A final exchange about standards, ambition, and what success here would look like.
5. Decision: We close the loop clearly and move fast when the conviction is there.
Need context before applying? [email protected]
The role is live on the site, but applications remain closed until the corresponding Dover posting is active. Until then, you can write to [email protected].