VibeScore · Proof of Work
We show our work.
And our misses.
VibeScore is a measurement instrument, not a theory of emotion. We do not claim to read what you truly feel. We report how four independent models read the emotional framing of a text, and we show where they disagree. The only honest way to trust a number is to be able to check it. So here is the check: the bar we set before we ran, the one benchmark we have beaten, and the long list of things we have not proven yet.
A number must never launder a decision. Transparency, not promise.
Confirmed
0.569
core-emotion macro-F1
Convergent validity, against the benchmark
On GoEmotions (a Google research dataset, seed 42, n=500), the engine scores a core-emotion macro-F1 of 0.569, beating a fine-tuned BERT baseline of ~0.46 and a trivial lexicon. The pre-registered bar was F1 ≥ 0.50 — set before the run.
And the honest caveat we publish beside it: across all eight emotions the F1 is 0.437 (mapping eight emotions onto the dataset costs us), and GoEmotions itself has only ~60% human inter-annotator agreement — a ceiling no model can meaningfully exceed. We report the weaker number on purpose.
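For readers who want to check the arithmetic behind a macro-F1 figure, here is a minimal sketch. It assumes one gold label and one predicted label per item (GoEmotions itself allows multiple labels per item, so this is a simplification), and the toy labels below are illustrative, not benchmark data. Only the 0.50 bar comes from this page.

```python
from collections import Counter

PASS_BAR = 0.50  # pre-registered bar quoted on this page


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 (single-label simplification)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    per_label = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Convention chosen here: a label absent from gold and pred scores 0.
        per_label.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Toy data, made up for illustration only.
gold = ["joy", "joy", "anger", "fear", "fear", "sadness"]
pred = ["joy", "anger", "anger", "fear", "joy", "sadness"]
score = macro_f1(gold, pred, ["joy", "anger", "fear", "sadness"])
print(f"macro-F1 = {score:.3f}, passes bar: {score >= PASS_BAR}")
```

Because macro-F1 averages labels with equal weight, rare emotions count as much as common ones, which is one reason a score over a wider label set can come out lower than a score over the core set.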
The pre-registered bar
1 confirmed · 2 in progress · the rest, not yet
Nine ways an emotional instrument should be able to prove itself. We wrote the pass criteria first. Most are still open — and saying so is the point. An instrument that claims to have passed tests it never ran is the exact thing you should distrust.
| Dimension: test material | What it confirms | Pre-registered pass | Status |
|---|---|---|---|
| Convergent (categorical): GoEmotions (seed 42, n=500) | agrees with human emotion labels | core-Ekman F1 ≥ 0.50, beats lexicon | ✅ F1 0.569 |
| Criterion (predictive): reviews→stars; ads→outcome | score predicts a real outcome | valence predicts rating > baseline | ◐ started |
| Generalization: multi-domain; non-English | holds across domains + languages | no large cross-domain drop | ◐ started |
| Convergent (continuous): EmoBank / NRC-VAD | valence tracks human valence | valence↔gold corr ≥ 0.6 | ⬜ Not yet proven |
| Construct (theory): large cached corpus | reproduces Plutchik geometry | opposites anti-correlate; neighbors co-occur | ⬜ Not yet proven |
| Discriminant: authored minimal pairs | separates what should differ | framing / negation / intensity move monotonically | ⬜ Not yet proven |
| Sequential (arc): DailyDialog / MELD | tracks emotion across a sequence | arc corr w/ human labels ≥ 0.6 | ⬜ Not yet proven |
| Modality (image): OASIS / IAPS | vision read matches human | image valence↔gold corr ≥ 0.6 | ⬜ Not yet proven |
| Reliability (wobble): re-score same inputs 5× | same input, same score | stddev within the precision gate | ⬜ Not yet proven |
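Several rows above share the same shape of pass criterion: a correlation with gold labels of at least 0.6. A minimal sketch of such a gate follows. It assumes Pearson correlation (the table does not say which coefficient is used), and the valence values are invented for illustration, not results.

```python
import math

CORR_BAR = 0.6  # threshold quoted in the table above


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values: engine valence vs. human gold valence for five texts.
engine_valence = [0.10, 0.40, 0.35, 0.80, 0.90]
gold_valence = [0.20, 0.30, 0.50, 0.70, 0.95]

r = pearson(engine_valence, gold_valence)
passed = r >= CORR_BAR
print(f"r = {r:.3f}, passes bar: {passed}")
```

The point of fixing `CORR_BAR` in code before any data is loaded is the same as pre-registration: the threshold cannot quietly move after the result is known.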
How a reading is made
One input. Four models. Shown disagreement.
A worked example — the Gettysburg Address, scored by the live engine:
1 · Input
The full text, chunked into beats.
2 · Four models
Claude, GPT-4o, Gemini, Grok each score the 8 emotions — blind to each other.
3 · Consensus + divergence
Where they agree, the read is clear. Where they split, we flag it. Here: 99% agreement.
4 · vibescore.v3
The vector → 0–1000. Overall 543 (TRUST). The arc runs 296 (FEAR, the war) → 852 (ANTICIPATION, the close).
Same input → same score: deterministic by design, though not yet verified (the precision gate that enforces this is in progress, and we say so).
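Steps 2 and 3 of the worked example can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the emotion names assume Plutchik's eight, consensus is taken as the plain mean across models, a split is flagged when the population standard deviation exceeds a hypothetical threshold of 0.15, and the model readings are invented. The page does not specify how the agreement percentage or the 0–1000 mapping is computed, so neither is reproduced here.

```python
from statistics import mean, pstdev

# Assumption: the eight emotions are Plutchik's basic eight.
EMOTIONS = ("joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation")


def consensus(readings, split_threshold=0.15):
    """Average each emotion across models and flag emotions where models split.

    readings: {model_name: {emotion: score in [0, 1]}}
    split_threshold: hypothetical cut-off on the spread between models.
    """
    vector, flagged = {}, []
    for emotion in EMOTIONS:
        scores = [reading[emotion] for reading in readings.values()]
        vector[emotion] = mean(scores)
        if pstdev(scores) > split_threshold:
            flagged.append(emotion)
    return vector, flagged


# Invented readings: four models score the same beat, blind to each other.
base = dict.fromkeys(EMOTIONS, 0.1)
readings = {
    "model_a": {**base, "trust": 0.80, "fear": 0.2},
    "model_b": {**base, "trust": 0.70, "fear": 0.3},
    "model_c": {**base, "trust": 0.75, "fear": 0.9},  # the outlier on fear
    "model_d": {**base, "trust": 0.75, "fear": 0.2},
}

vector, flagged = consensus(readings)
print("consensus trust:", round(vector["trust"], 3))
print("flagged as divergent:", flagged)
```

In this toy case the models agree closely on trust, so it enters the consensus vector unflagged, while fear is flagged because one model reads it very differently from the other three.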
The controls behind every number
How we keep ourselves honest
Beat the dumb baseline
Every dimension must clearly beat a trivial word-count lexicon. If it can’t, the four-model engine isn’t justified there.
Determinism / wobble
Score the same input five times, report the spread. No claimed difference may be smaller than that floor. (Hard gate — in progress.)
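A minimal sketch of that rule, assuming the spread is measured as the sample standard deviation of the five re-runs (the page says "spread" without naming a statistic). The re-run scores are made up for illustration.

```python
from statistics import stdev


def wobble_floor(rerun_scores):
    """Spread of repeated scores for one identical input (sample std dev)."""
    if len(rerun_scores) < 2:
        raise ValueError("need at least two re-runs to measure spread")
    return stdev(rerun_scores)


def difference_is_reportable(score_a, score_b, floor):
    """A claimed difference must not be smaller than the wobble floor."""
    return abs(score_a - score_b) >= floor


# Made-up scores from scoring one identical input five times.
reruns = [500, 502, 498, 500, 501]
floor = wobble_floor(reruns)

print(f"wobble floor = {floor:.2f}")
print("1-point gap reportable: ", difference_is_reportable(500, 501, floor))
print("20-point gap reportable:", difference_is_reportable(500, 520, floor))
```

With these numbers a one-point gap between two texts falls inside the noise and should not be reported as a difference, while a twenty-point gap clears the floor.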
No answer-key leakage
Public datasets may sit in model training data, so we also test on novel, private stimuli the models can’t have seen.
Ground-truth ceiling
We report each dataset’s human inter-annotator agreement (GoEmotions: ~60%). We cannot meaningfully beat the gold’s own noise.
One source of truth
Identical golden vectors run through all three formula copies — app, validation, database — and must return identical scores. A test forbids drift.
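A drift test of this kind might look like the sketch below. Everything in it is hypothetical: `placeholder_formula` is a stand-in, not the vibescore.v3 formula (which this page does not specify), and in practice the three copies would live in different codebases, with the database copy likely called through a query rather than a Python function.

```python
def placeholder_formula(vector):
    # Stand-in only: NOT the real scoring formula.
    return round(1000 * sum(vector) / len(vector))


# Hypothetical handles to the three formula copies named above.
FORMULA_COPIES = {
    "app": placeholder_formula,
    "validation": placeholder_formula,
    "database": placeholder_formula,
}

# Fixed eight-emotion input vectors, chosen once and never edited.
GOLDEN_VECTORS = [
    (0.0,) * 8,
    (1.0,) * 8,
    (0.8, 0.7, 0.1, 0.2, 0.1, 0.0, 0.1, 0.6),
]


def find_drift(copies, vectors):
    """Return every golden vector on which the copies disagree."""
    drift = []
    for vector in vectors:
        results = {name: formula(vector) for name, formula in copies.items()}
        if len(set(results.values())) > 1:
            drift.append((vector, results))
    return drift


drift = find_drift(FORMULA_COPIES, GOLDEN_VECTORS)
assert drift == [], f"formula copies have drifted: {drift}"
print("all copies agree on", len(GOLDEN_VECTORS), "golden vectors")
```

Run in continuous integration, a check like this fails the build as soon as one copy changes without the others.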
Pre-register + publish misses
Pass criteria are fixed before the run. Failures are reported next to successes. A rule, not an aspiration.
The line we hold
We build the instrument to be auditable, consistent, and honest. We do not build the targeting, coercion, or manipulation layer on top of it.
Built to a consequential-use standard: traceable, reliable, governable. A human stays at the wheel; the reading informs a decision, it never makes one. The humility is not branding. It is the safety property.