VibeScore · Proof of Work

We show our work.
And our misses.

VibeScore is a measurement instrument, not a theory of emotion. We do not claim to read what you truly feel — we report what four independent models agree the emotional framing of text is, and we show where they disagree. The only honest way to trust a number is to be able to check it. So here is the check: the bar we set before we ran, the one benchmark we have beaten, and the long list of things we have not proven yet.

A number must never launder a decision. Transparency, not promise.

Confirmed

0.569

core-emotion macro-F1

Convergent validity, against the benchmark

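On GoEmotions (a Google research dataset, seed 42, n=500), the engine scores a core-emotion macro-F1 of 0.569, ahead of a trivial lexicon and of the ~0.46 macro-F1 published for a fine-tuned BERT baseline on the full GoEmotions taxonomy (a finer label set than ours, so not a like-for-like comparison). The pre-registered bar was F1 ≥ 0.50, set before the run.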

And the honest caveat we publish beside it: across all eight emotions the F1 drops to 0.437 (mapping eight emotions onto the dataset costs us), and GoEmotions itself has only ~60% human inter-annotator agreement, a ceiling no model can meaningfully exceed. We publish the weaker number next to the stronger one on purpose.
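For readers who want to check the arithmetic, here is a minimal sketch of macro-F1: the unweighted mean of per-label F1 over a chosen label set. It assumes one gold label and one predicted label per text (GoEmotions itself allows several labels per text), and the toy labels and label sets are illustrative, not our evaluation data. Scoring the same predictions over a wider label set can lower the number, which is the effect behind the 0.569 versus 0.437 gap.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over `labels` (single-label case)."""
    per_label = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_label.append(2 * precision * recall / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Toy data (hypothetical): six texts, four labels.
gold = ["joy", "joy", "fear", "anger", "trust", "trust"]
pred = ["joy", "fear", "fear", "anger", "joy", "trust"]

core_labels = ["joy", "fear", "anger"]   # stand-in for a "core" subset
all_labels = core_labels + ["trust"]     # stand-in for the wider set

core_score = macro_f1(gold, pred, core_labels)
wide_score = macro_f1(gold, pred, all_labels)
print(f"core macro-F1: {core_score:.3f}, wider macro-F1: {wide_score:.3f}")
```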

The pre-registered bar

1 confirmed · 2 in progress · the rest, not yet

Nine ways an emotional instrument should be able to prove itself. We wrote the pass criteria first. Most are still open — and saying so is the point. An instrument that claims to have passed tests it never ran is the exact thing you should distrust.

| Dimension | What it confirms | Pre-registered pass | Status |
|---|---|---|---|
| Convergent, categorical: GoEmotions (seed 42, n=500) | agrees with human emotion labels | core-Ekman F1 ≥ 0.50, beats lexicon | Confirmed (F1 0.569) |
| Criterion, predictive: reviews→stars; ads→outcome | score predicts a real outcome | valence predicts rating > baseline | Started |
| Generalization: multi-domain; non-English | holds across domains + languages | no large cross-domain drop | Started |
| Convergent, continuous: EmoBank / NRC-VAD | valence tracks human valence | valence↔gold corr ≥ 0.6 | Not yet proven |
| Construct (theory): large cached corpus | reproduces Plutchik geometry | opposites anti-correlate; neighbors co-occur | Not yet proven |
| Discriminant: authored minimal pairs | separates what should differ | framing / negation / intensity move monotonically | Not yet proven |
| Sequential (arc): DailyDialog / MELD | tracks emotion across a sequence | arc corr w/ human labels ≥ 0.6 | Not yet proven |
| Modality (image): OASIS / IAPS | vision read matches human | image valence↔gold corr ≥ 0.6 | Not yet proven |
| Reliability (wobble): re-score same inputs 5× | same input, same score | stddev within the precision gate | Not yet proven |
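Several of the open rows share the same shape of test: a correlation against human gold values with a bar of 0.6 fixed in advance. A minimal sketch of such a check follows. The table does not say which correlation is meant, so Pearson is an assumption here, and the function names are hypothetical.

```python
import math

# The bar is a constant written down before any scores are computed.
PREREGISTERED_MIN_CORR = 0.6


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def passes_bar(predicted, gold, bar=PREREGISTERED_MIN_CORR):
    """True only if the correlation meets the bar set before the run."""
    return pearson(predicted, gold) >= bar
```

Keeping the threshold as a named constant, separate from the scoring code, makes it easy to show in version history that the bar predates the result.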

How a reading is made

One input. Four models. Shown disagreement.

A worked example — the Gettysburg Address, scored by the live engine:

1 · Input

The full text, chunked into beats.

2 · Four models

Claude, GPT-4o, Gemini, Grok each score the 8 emotions — blind to each other.

3 · Consensus + divergence

Where they agree, the read is clear. Where they split, we flag it. Here: 99% agreement.

4 · vibescore.v3

The vector → 0–1000. Overall 543 (TRUST). The arc runs 296 (FEAR, the war) → 852 (ANTICIPATION, the close).

Same input → same score. Deterministic by design (the precision gate that enforces this is in progress, and we say so).
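Step 3 above (consensus plus shown disagreement) can be sketched as follows. This is an illustration under stated assumptions, not the production formula: scores are taken to lie in 0–1, consensus is a plain per-emotion mean, and an emotion is flagged when the gap between the highest and lowest model exceeds a hypothetical threshold of 0.2. The model names and numbers are placeholders.

```python
from statistics import mean


def consensus_and_divergence(readings, split_threshold=0.2):
    """Average each emotion across models and flag emotions where models split.

    `readings` maps a model name to a dict of emotion -> score in [0, 1].
    Returns (consensus dict, list of flagged emotions).
    """
    emotions = list(next(iter(readings.values())).keys())
    consensus, flagged = {}, []
    for emotion in emotions:
        values = [scores[emotion] for scores in readings.values()]
        consensus[emotion] = mean(values)
        # Divergence here is the range across models; other measures
        # (standard deviation, pairwise distance) would also work.
        if max(values) - min(values) > split_threshold:
            flagged.append(emotion)
    return consensus, flagged


# Placeholder readings: two emotions shown for brevity, four blind models.
readings = {
    "model_a": {"trust": 0.6, "fear": 0.1},
    "model_b": {"trust": 0.7, "fear": 0.5},
    "model_c": {"trust": 0.6, "fear": 0.2},
    "model_d": {"trust": 0.7, "fear": 0.2},
}
consensus, flagged = consensus_and_divergence(readings)
print(consensus, flagged)
```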

The controls behind every number

How we keep ourselves honest

Beat the dumb baseline

Every dimension must clearly beat a trivial word-count lexicon. If it can’t, the four-model engine isn’t justified there.
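To make "trivial word-count lexicon" concrete, here is a minimal sketch of that kind of baseline: count how many words from each emotion's word list appear and pick the emotion with the most hits. The tiny word lists are hypothetical and only for illustration; a real baseline would use a published lexicon.

```python
# Hypothetical miniature lexicon; a real baseline would load a full one.
LEXICON = {
    "joy": {"happy", "glad", "delighted"},
    "fear": {"afraid", "scared", "terrified"},
    "anger": {"angry", "furious", "outraged"},
}


def lexicon_predict(text):
    """Return the emotion with the most lexicon hits, or None if no word matches."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = {
        emotion: sum(1 for w in words if w in vocab)
        for emotion, vocab in LEXICON.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

The baseline's predictions are then scored with the same metric as the engine's, and the engine has to come out clearly ahead.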

Determinism / wobble

Score the same input five times, report the spread. No claimed difference may be smaller than that floor. (Hard gate — in progress.)
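A minimal sketch of that gate, assuming a hypothetical `score_fn` that returns one 0–1000 score per call and using the sample standard deviation as the spread (the choice of spread measure is an assumption). The stand-in scorer simply replays fixed numbers so the example runs without any model calls.

```python
from statistics import stdev


def wobble_floor(score_fn, text, runs=5):
    """Score the same input `runs` times and return the spread of the scores."""
    scores = [score_fn(text) for _ in range(runs)]
    return stdev(scores)


def difference_is_reportable(score_a, score_b, floor):
    """A difference may be claimed only if it is not smaller than the floor."""
    return abs(score_a - score_b) >= floor


# Stand-in scorer (hypothetical): replays five fixed scores for any input.
replayed = iter([540, 543, 546, 543, 543])
floor = wobble_floor(lambda text: next(replayed), "same input every time")

print(f"floor: {floor:.2f}")
print(difference_is_reportable(543, 544, floor))  # 1-point gap vs. the floor
print(difference_is_reportable(543, 560, floor))  # 17-point gap vs. the floor
```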

No answer-key leakage

Public datasets may sit in model training data, so we also test on novel, private stimuli the models can’t have seen.

Ground-truth ceiling

We report each dataset’s human inter-annotator agreement (GoEmotions: ~60%). We cannot meaningfully beat the gold’s own noise.

One source of truth

Identical golden vectors run through all three formula copies — app, validation, database — and must return identical scores. A test forbids drift.
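A drift test of this kind can be sketched as below. The scoring function is a loud placeholder, not the vibescore.v3 formula, and all three "copies" are modeled as Python callables for the sake of a runnable example; in practice the database copy would be called through a query.

```python
def check_no_drift(golden_vectors, implementations):
    """Run every golden vector through every copy; return any disagreements.

    Each entry of the returned list is (vector index, {copy name: score}).
    An empty list means all copies agree on all vectors.
    """
    drift = []
    for index, vector in enumerate(golden_vectors):
        scores = {name: fn(vector) for name, fn in implementations.items()}
        if len(set(scores.values())) != 1:
            drift.append((index, scores))
    return drift


def placeholder_score(vector):
    """PLACEHOLDER formula for illustration only: mean of the vector, scaled to 0-1000."""
    return round(1000 * sum(vector) / len(vector))


golden_vectors = [(0.5, 0.25), (0.0, 1.0), (0.125, 0.375)]

in_sync = {
    "app": placeholder_score,
    "validation": placeholder_score,
    "database": placeholder_score,
}
drifted = dict(in_sync, database=lambda v: placeholder_score(v) + 1)

assert check_no_drift(golden_vectors, in_sync) == []
print(check_no_drift(golden_vectors, drifted))
```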

Pre-register + publish misses

Pass criteria are fixed before the run. Failures are reported next to successes. A rule, not an aspiration.

The line we hold

We build the instrument to be auditable, consistent, and honest. We do not build the targeting, coercion, or manipulation layer on top of it.

Built to a consequential-use standard: traceable, reliable, governable. A human stays at the wheel; the reading informs a decision, it never makes one. The humility is not branding. It is the safety property.

Ad+Verb Labs, parent company. Legal, support, and identity live here once, referenced everywhere. All ideas, brands, and IP of Trent McNelly.