VibeScore · Proof of Work

We show our work.
And our misses.

VibeScore is a measurement instrument, not a theory of emotion. We do not claim to read what you truly feel. We report the emotional framing that four independent models agree on, and we show where they disagree. The only honest way to trust a number is to be able to check it. So here is the check: the bar we set before we ran, the one benchmark we have beaten, and the long list of things we have not proven yet.

A number must never launder a decision. Transparency, not promise.

Confirmed

0.569

core-emotion macro-F1

Convergent validity, against the benchmark

On GoEmotions (a Google research dataset; sampled with seed 42, n=500), the engine scores a core-emotion macro-F1 of 0.569, beating both a fine-tuned BERT baseline of ~0.46 and a trivial lexicon. The pre-registered bar was F1 ≥ 0.50, set before the run.

And the honest caveat we publish beside it: across all eight emotions the F1 is 0.437 (mapping eight emotions onto the dataset costs us), and GoEmotions itself has only ~60% human inter-annotator agreement, a ceiling no model can meaningfully exceed. We report the weaker number on purpose.
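For readers who want to check the arithmetic behind a macro-F1 figure, here is a minimal sketch. It assumes one gold label and one predicted label per text (GoEmotions itself allows several labels per item, so the real evaluation may differ), and the toy labels below are made up for illustration. They are not our benchmark data.

```python
from collections import Counter

PRE_REGISTERED_BAR = 0.50  # the bar stated above, fixed before the run


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-label F1, assuming one label per item."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred)) if labels is None else list(labels)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A label that never appears in gold or pred scores 0.0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels (not benchmark data).
toy_gold = ["joy", "joy", "anger", "fear"]
toy_pred = ["joy", "anger", "anger", "fear"]
toy_score = macro_f1(toy_gold, toy_pred)
toy_passes = toy_score >= PRE_REGISTERED_BAR
```

Because macro-F1 weights every label equally, a rare emotion that the engine handles badly drags the average down as much as a common one. That is one reason the eight-emotion number is lower than the core-emotion number.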

The pre-registered bar

1 confirmed · 2 in progress · the rest, not yet

Nine ways an emotional instrument should be able to prove itself. We wrote the pass criteria first. Most are still open, and saying so is the point. An instrument that claims to have passed tests it never ran is the exact thing you should distrust.

Dimension	Data / method	What it confirms	Pre-registered pass	Status
Convergent, categorical	GoEmotions (seed 42, n=500)	agrees with human emotion labels	core-Ekman F1 ≥ 0.50, beats lexicon	✅ F1 0.569
Criterion, predictive	reviews→stars; ads→outcome	score predicts a real outcome	valence predicts rating > baseline	◐ started
Generalization	multi-domain; non-English	holds across domains + languages	no large cross-domain drop	◐ started
Convergent, continuous	EmoBank / NRC-VAD	valence tracks human valence	valence↔gold corr ≥ 0.6	⬜ Not yet proven
Construct (theory)	large cached corpus	reproduces Plutchik geometry	opposites anti-correlate; neighbors co-occur	⬜ Not yet proven
Discriminant	authored minimal pairs	separates what should differ	framing / negation / intensity move monotonically	⬜ Not yet proven
Sequential (arc)	DailyDialog / MELD	tracks emotion across a sequence	arc corr w/ human labels ≥ 0.6	⬜ Not yet proven
Modality (image)	OASIS / IAPS	vision read matches human	image valence↔gold corr ≥ 0.6	⬜ Not yet proven
Reliability (wobble)	re-score same inputs 5×	same input, same score	stddev within the precision gate	⬜ Not yet proven
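Several rows in the table use a correlation threshold of 0.6. The sketch below shows one way such a check could be run, assuming the correlation meant is Pearson's r (the table does not say which coefficient is used). The function names and the sample values are hypothetical, chosen only to illustrate the check.

```python
from math import sqrt

CORR_BAR = 0.6  # threshold quoted in the table


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def passes_corr_bar(model_valence, gold_valence, bar=CORR_BAR):
    """True when the model's valence tracks the gold valence at or above the bar."""
    return pearson(model_valence, gold_valence) >= bar


# Made-up values for illustration only.
gold_valence = [1.0, 2.0, 3.0, 4.0, 5.0]
model_valence = [1.1, 1.9, 3.2, 3.8, 5.0]
toy_result = passes_corr_bar(model_valence, gold_valence)
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear. Whichever is used should be fixed before the run, like the rest of the bar.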

How a reading is made

One input. Four models. Shown disagreement.

A worked example, the Gettysburg Address scored by the live engine:

1 · Input

The full text, chunked into beats.

2 · Four models

Claude, GPT-4o, Gemini, Grok each score the 8 emotions, blind to each other.

3 · Consensus + divergence

Where they agree, the read is clear. Where they split, we flag it. Here: 99% agreement.

4 · vibescore.v3

The vector → 0–1000. Overall 543 (TRUST). The arc runs 296 (FEAR, the war) → 852 (ANTICIPATION, the close).

Same input → same score. Deterministic by design (the precision gate that enforces this is in progress, and we say so).
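To make the consensus and divergence step concrete, here is a rough sketch. It is not the vibescore.v3 formula. It assumes each model returns a score in [0, 1] for each of eight Plutchik emotions, takes the mean as the consensus, and defines agreement as one minus the average per-emotion spread. The agreement definition, the `flag_spread` threshold, the model names, and the numbers are all assumptions made for illustration.

```python
from statistics import mean

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]


def consensus(readings, flag_spread=0.25):
    """Combine per-model readings into a consensus vector.

    readings: {model_name: {emotion: score in [0, 1]}}
    Returns (consensus_vector, agreement, flagged_emotions).
    """
    vector, spreads, flagged = {}, [], []
    for emotion in EMOTIONS:
        values = [scores[emotion] for scores in readings.values()]
        spread = max(values) - min(values)  # how far apart the models are
        vector[emotion] = mean(values)
        spreads.append(spread)
        if spread > flag_spread:
            flagged.append(emotion)  # shown to the reader, not hidden
    agreement = 1.0 - mean(spreads)  # assumed definition, see lead-in
    return vector, agreement, flagged


def flat(fear):
    """Helper for the toy data: 0.1 everywhere except a chosen fear score."""
    return {e: (fear if e == "fear" else 0.1) for e in EMOTIONS}


# Made-up readings: three models agree on fear, one splits off.
readings = {
    "model_a": flat(0.2),
    "model_b": flat(0.2),
    "model_c": flat(0.2),
    "model_d": flat(0.8),
}
vector, agreement, flagged = consensus(readings)
```

In this toy case the only flagged emotion is fear, because one model sits far from the other three. The point of the design is that the split is reported alongside the averaged score instead of being smoothed away.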

The controls behind every number

How we keep ourselves honest

Beat the dumb baseline

Every dimension must clearly beat a trivial word-count lexicon. If it can’t, the four-model engine isn’t justified there.

Determinism / wobble

Score the same input five times, report the spread. No claimed difference may be smaller than that floor. (Hard gate, in progress.)
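A sketch of how the wobble floor could be applied, assuming the spread is measured as the sample standard deviation of the repeated scores (the text above says only "the spread"). The five run values are invented for the example and are not measured output.

```python
from statistics import stdev


def wobble_floor(scores):
    """Sample standard deviation of repeated scores for one input."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return stdev(scores)


def difference_is_reportable(score_a, score_b, floor):
    """A difference only counts if it is larger than the wobble floor."""
    return abs(score_a - score_b) > floor


# Five hypothetical re-scores of the same input.
runs = [543, 541, 545, 543, 543]
floor = wobble_floor(runs)

small_gap = difference_is_reportable(543, 544, floor)  # gap of 1
large_gap = difference_is_reportable(543, 548, floor)  # gap of 5
```

With these invented runs the floor is about 1.4 points, so a one-point gap between two texts would not be claimed as a difference, while a five-point gap would.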

No answer-key leakage

Public datasets may sit in model training data, so we also test on novel, private stimuli the models can’t have seen.

Ground-truth ceiling

We report each dataset’s human inter-annotator agreement (GoEmotions: ~60%). We cannot meaningfully beat the gold’s own noise.

One source of truth

Identical golden vectors run through all three formula copies (app, validation, database) and must return identical scores. A test forbids drift.
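A drift test of this kind might look like the sketch below. The scoring function is a placeholder and is not the vibescore.v3 formula. The copy names mirror the three listed above, and the golden vectors are invented. Only the checking pattern is the point.

```python
def placeholder_score(vector):
    """Stand-in formula for illustration only: mean of the vector scaled to 0-1000."""
    return round(1000 * sum(vector) / len(vector))


def find_drift(golden_vectors, implementations):
    """Return (vector_name, {copy_name: score}) for every vector whose copies disagree."""
    drift = []
    for name, vector in golden_vectors.items():
        scores = {copy: fn(vector) for copy, fn in implementations.items()}
        if len(set(scores.values())) > 1:
            drift.append((name, scores))
    return drift


# Invented golden vectors (eight emotion scores each).
golden_vectors = {
    "all_zero": [0.0] * 8,
    "all_half": [0.5] * 8,
}

# Three copies that agree, as the rule requires.
in_sync = {
    "app": placeholder_score,
    "validation": placeholder_score,
    "database": placeholder_score,
}

# A deliberately broken copy, to show what the test catches.
drifted = dict(in_sync, database=lambda v: placeholder_score(v) + 1)

clean_report = find_drift(golden_vectors, in_sync)
drift_report = find_drift(golden_vectors, drifted)
```

Run in continuous integration, a check like this would fail the build whenever `find_drift` returns a non-empty list, so a formula change in one copy cannot ship without the others.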

Pre-register + publish misses

Pass criteria are fixed before the run. Failures are reported next to successes. A rule, not an aspiration.

The line we hold

We build the instrument to be auditable, consistent, and honest. We do not build the targeting, coercion, or manipulation layer on top of it.

Built to a consequential-use standard: traceable, reliable, governable. A human stays at the wheel; the reading informs a decision, it never makes one. The humility is not branding. It is the safety property.
