Research and product systems

Evaluation Specialist

This role focuses on the hardest part of AI measurement: deciding what good looks like and making that standard repeatable. You will design human evaluation protocols, build review workflows, and help the team distinguish reliable signals from plausible noise in AI-generated content.

Applications opening soon

Role summary

Own the quality standards for how Chatobserver evaluates AI answers, citations, and visibility signals — and build the human review layer that keeps machine output honest.

Why this role exists

As prompt volume scales, the gap between raw output and trustworthy insight grows. We need someone who treats evaluation quality as a discipline, not a checkbox.

What you will work on

  • Design and maintain evaluation rubrics for answer quality, citation accuracy, and positioning signals.
  • Run structured human review workflows to label and audit machine-generated analysis outputs.
  • Identify systematic error patterns in the current evaluation pipeline and propose remediation.
  • Collaborate with research and product to translate evaluation findings into product improvements.

What a strong fit looks like

  • Deep experience designing annotation guidelines, evaluation rubrics, or quality review workflows.
  • Strong analytical instincts for identifying bias, inconsistency, and labeling noise in structured datasets.
  • Comfort working with LLM outputs and an understanding of where they tend to fail in practice.
  • Clear writing and the ability to articulate why a quality standard is the right one.

What will excite you here

  • Defining what 'high quality' actually means for a product category that lacks established benchmarks.
  • Building evaluation infrastructure that improves the entire product's trustworthiness.
  • Working at the interface between human judgment and automated analysis.

First 90 days

  1. Audit the current evaluation rubrics and identify the top gaps in coverage or consistency.
  2. Design a structured review workflow for at least one core analysis type.
  3. Ship a measurable improvement to inter-rater reliability on a key evaluation task (see the sketch below).
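
For context on step 3: inter-rater reliability is commonly summarized with Cohen's kappa, which measures how often two reviewers agree beyond what chance alone would produce. Below is a minimal sketch in Python with hypothetical citation-accuracy labels; it is illustrative only, not Chatobserver's actual pipeline.

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        # Observed agreement: fraction of items where both raters gave the same label.
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance agreement: expected overlap given each rater's label frequencies.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical labels from two reviewers auditing the same six citations.
    reviewer_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
    reviewer_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
    print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.67

Here the raw agreement rate is 83%, yet kappa is only 0.67 once chance agreement is discounted; a "measurable improvement" means moving a number like this, not just impressions.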

Hiring process

The process is intentionally short, direct, and anchored in the real work.

  1. Apply
     Send us your background, relevant work, and why this role makes sense for you.

  2. Foundational conversation
     A focused conversation about your work, your judgment, and the role itself.

  3. Role-specific deep dive
     A discussion or exercise that looks like the actual work more than a generic interview loop.

  4. Founder conversation
     A final conversation on standards, ambition, and what success would look like here.

  5. Decision
     We close the loop clearly and move quickly once there is conviction.

Need context before you apply? Email [email protected].

We are not taking applications for this role yet. We will update this page when it opens.
