Methodology

Library science has a 150-year tradition of evaluating sources. Cerulean operationalizes that tradition in a search box.

The lineage

Source evaluation predates the web. Library and information science has spent more than a century building frameworks for distinguishing kinds of sources, evaluating authority, and tracing claims back toward original evidence. The Association of College and Research Libraries (ACRL) publishes the Framework for Information Literacy for Higher Education, which is the current professional standard. The CRAAP test (Currency, Relevance, Authority, Accuracy, and Purpose) is among the most widely taught checklists for source evaluation in undergraduate research instruction. The BEAM framework (Background, Exhibit, Argument, and Method) describes how sources function inside a research argument. Underneath all of them sits the foundational distinction between primary, secondary, and tertiary sources, which dates to nineteenth-century historiography.

None of this is novel. What is novel is applying it to web search at scale, and surfacing the results to the user instead of hiding them behind relevance ranking.

What web search lost

Relevance ranking collapsed source-type distinctions because it did not need them. A search engine optimizing for click-through can put an SEO listicle, a peer-reviewed journal article, and a Wikipedia entry side by side as long as all three contain the query terms. The user is expected to evaluate the sources after the click. In practice most users do not. The structural signals a research librarian would foreground (who produced this, how close to the original evidence, with what editorial process) disappeared from the result page.

This was not malicious. Relevance ranking is a tractable engineering problem. Source evaluation is a harder one. The engines that won did the tractable thing, and the harder thing waited.

Two axes

Every classified result on Cerulean gets two tags. The axes are orthogonal, and both are load-bearing.

Source role

How close is the source to the original evidence?

Source type

What kind of entity produced it?

Why both

Role answers the research question. Type answers the trust question. A government site can be primary (raw dataset release), secondary (policy analysis paper), or tertiary ("about this agency" page). Journalism can be primary (original 2008 financial-crisis reporting) or secondary (2015 retrospective). Collapsing both into a single tag loses information that matters at the moment a user decides whether to click.
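
A minimal sketch of what a two-axis tag could look like as a data structure. The names (SourceRole, SourceType, Classification) are hypothetical, and the type values are only the examples mentioned on this page, not Cerulean's full taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class SourceRole(Enum):
    """How close the source is to the original evidence."""
    PRIMARY = "primary"
    SECONDARY = "secondary"
    TERTIARY = "tertiary"
    UNCLASSIFIED = "unclassified"


class SourceType(Enum):
    """What kind of entity produced it (illustrative subset, assumed)."""
    GOVERNMENT = "government"
    JOURNALISM = "journalism"
    ACADEMIC = "academic"
    INDIE = "indie"
    COMMERCIAL = "commercial"
    UNCLASSIFIED = "unclassified"


@dataclass(frozen=True)
class Classification:
    role: SourceRole
    type: SourceType


# The government examples from the text: same type, three different roles.
dataset_release = Classification(SourceRole.PRIMARY, SourceType.GOVERNMENT)
policy_paper = Classification(SourceRole.SECONDARY, SourceType.GOVERNMENT)
about_page = Classification(SourceRole.TERTIARY, SourceType.GOVERNMENT)
```

Keeping the axes as separate fields means a filter on one (say, primary sources only) never has to guess at the other.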

How classification works

Cerulean classifies in three tiers. None of them run an LLM at query time.

Tier 1: bundled index. A curated JSON file shipped with the deploy, holding domain entries with source-type and stable-default source-role labels. Lookups are sub-millisecond. Source data includes Tranco top sites, Crossref DOI prefixes, the IndieWeb directory, public .gov and .edu lists, Common Crawl frequent domains, and hand-curated additions. The index is git-versioned and publicly auditable in the repository.
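
A Tier 1 lookup might look roughly like the sketch below. The entry schema (type, default_role, anchor), the sample domains, and the parent-domain fallback are all assumptions for illustration; the real index file may be organized differently.

```python
import json
from urllib.parse import urlsplit

# Hypothetical index contents using reserved example domains.
INDEX_JSON = """
{
  "example.org": {"type": "academic", "default_role": "primary", "anchor": "crossref_prefix"},
  "example.com": {"type": "commercial", "default_role": "secondary", "anchor": "hand_curated"}
}
"""
INDEX = json.loads(INDEX_JSON)


def lookup(url):
    """Return the index entry for a URL's host, or None if the host is unknown.

    Tries the full host first, then each parent domain, so that
    "data.example.org" resolves to the "example.org" entry.
    The bare top-level domain is never tried.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in INDEX:
            return INDEX[candidate]
    return None
```

A miss (None) is what would hand the result to the Tier 2 heuristic classifier.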

Tier 2: heuristic classifier. For domains not in the bundled index, structural features get extracted at query time: TLD class, URL path patterns, schema.org markup, byline presence, editorial-process markers, citation density, document-type signals, and several others. Classification requires two or more features to co-fire before a strong label is assigned. Otherwise the result falls back to "unclassified."
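
The co-firing rule can be sketched as a simple vote count. The feature names, the idea that each feature votes for one label, and the tie-breaking behavior are assumptions here; only the "two or more features, otherwise unclassified" rule comes from the description above.

```python
from collections import Counter

MIN_COFIRING = 2  # a strong label needs at least this many agreeing features


def classify_type(signals):
    """Assign a label only when enough structural features agree.

    `signals` maps a feature name to the label it votes for,
    or to None when the feature did not fire.
    """
    votes = Counter(label for label in signals.values() if label is not None)
    ranked = votes.most_common(2)
    if not ranked:
        return "unclassified"
    top_label, top_count = ranked[0]
    # Assumed policy: a tie between the two leading labels is treated as no decision.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count >= MIN_COFIRING and not tied:
        return top_label
    return "unclassified"
```

For example, a page where only the byline feature fires stays "unclassified", while a page where both the TLD class and citation density point to "academic" gets that label.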

Tier 3: background batch. For long-tail unknowns that keep surfacing in real queries, an LLM classifies them in a nightly batch, and the results are promoted to the bundled index on the next deploy. This is the only place an LLM touches the classification pipeline. It runs offline, on unknowns the structural classifier could not resolve, and never on a query path.
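
One way to picture how the three tiers fit together: the query path only consults the index and the heuristic, and records what it could not resolve for the offline batch. Everything named here (classify, select_for_batch, promote, the min_hits threshold) is a hypothetical sketch, not the project's actual code.

```python
from collections import Counter


def classify(domain, index, heuristic, unknown_hits):
    """Resolve a domain on the query path. No model is called here."""
    entry = index.get(domain)
    if entry is not None:
        return entry, "tier1"
    label = heuristic(domain)
    if label is not None:
        return label, "tier2"
    # Unresolved: count the miss so the nightly batch can find it later.
    unknown_hits[domain] += 1
    return "unclassified", "unresolved"


def select_for_batch(unknown_hits, min_hits=3):
    """Pick unknowns that keep surfacing (threshold is an assumed value)."""
    return sorted(d for d, n in unknown_hits.items() if n >= min_hits)


def promote(index, batch_labels):
    """Build the next deploy's index without mutating the live one."""
    merged = dict(index)
    merged.update(batch_labels)
    return merged


def never_fires(domain):
    """Stand-in heuristic that resolves nothing."""
    return None


index = {"example.org": "academic"}
hits = Counter()
for _ in range(3):
    classify("unknown.example", index, never_fires, hits)
```

In this sketch the offline LLM step would sit between select_for_batch and promote; it is left out because it never runs on a query path.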

How labels are generated

This is the section information-literate readers want to see, so it sits in the open.

Labels in the bundled index come from a documented pipeline, not from a credentialed human labeler. Cerulean is an indie project without institutional access to trained library professionals at scale. Pretending otherwise would be worse than being honest. The pipeline does the following:

  1. Anchor-first. Entries with a clean upstream anchor (a .gov TLD, a Crossref DOI prefix, an IndieWeb directory listing, a known academic publisher domain) inherit the validated upstream classification. These cases are deterministic, not judgment calls.
  2. LLM for the no-anchor remainder. Domains without a clean anchor get classified by an LLM using a published prompt that operationalizes the criteria defined above. The model is asked to apply the documented role and type definitions, cite the structural signals it relied on, and return "unclassified" when confidence is too low.
  3. Every label cites its anchor. Each entry records what drove the classification: TLD, Crossref prefix, IndieWeb listing, structural-signal pattern, or "no anchor, model judgment." Users can filter by confidence in the result UI, and auditors can see exactly which entries rest on which kind of evidence.
  4. Reproducibility. Given the published prompt, model version, and seed, classifications are intended to be deterministic, to the extent the model backend honors a fixed seed. Anyone can rerun the pipeline against the index and check the labels independently. This is a stronger guarantee than hand-labeling, which cannot be rerun mechanically.
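
The anchor-first order and the per-label evidence record can be sketched as follows. The Label fields, the evidence strings, and the use of plain domain sets to stand in for Crossref and IndieWeb lookups are assumptions; the LLM step is represented by a caller-supplied function rather than any real model API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    domain: str
    source_type: str
    evidence: str  # what drove the classification


def label_domain(domain, crossref_domains, indieweb_domains, llm_fallback):
    """Apply deterministic anchors first; fall back to the model only when none match."""
    if domain.endswith(".gov"):
        return Label(domain, "government", "tld")
    if domain in crossref_domains:
        return Label(domain, "academic", "crossref_prefix")
    if domain in indieweb_domains:
        return Label(domain, "indie", "indieweb_listing")
    # No clean anchor: the label rests on model judgment and says so.
    return Label(domain, llm_fallback(domain), "no anchor, model judgment")


def stub_llm(domain):
    """Placeholder for the offline LLM call (hypothetical)."""
    return "unclassified"


anchored = label_domain("agency.example.gov", set(), set(), stub_llm)
unanchored = label_domain("blog.example", set(), set(), stub_llm)
```

Because the evidence field travels with every label, an auditor can separate anchored entries from model-judgment entries with a single filter.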

This is not the classical library science benchmark methodology. Classical methodology uses trained labelers applying criteria manually against a gold-standard set. Cerulean uses a documented LLM pipeline anchored on validated upstream sources, with audit surfaces and a public correction channel. The choice is appropriate for an indie project that cannot field trained labelers, and the audit surfaces compensate for the absence of credentialed review. The defense is not "we did the classical thing." It is "we did something different, here is exactly what, here is how to verify it, and here is how to challenge it."

How the benchmark works

The benchmark suite that gates production release is also LLM-graded. The same honesty principle that applies to the bundled index applies here: hand-labeling at the required scale is not feasible for an indie project, and the alternative is to be explicit about what the benchmark is and what it tells you.

The pipeline:

  1. Query generation. An LLM produces the query set, distributed across research, news, technical, commercial, and controversial categories. The generator prompt selects for queries that probe the classifier's decision boundaries (queries where role and type assignment are non-trivial), not just easy cases. The prompt is published and the generated query set is git-versioned. The current benchmark contains 1000 queries, 200 per category.
  2. Label generation. A separate prompt, with deliberately heavier compute (chain-of-thought reasoning, multi-pass deliberation, self-critique, possibly a different model than the Tier 3 classifier), produces expected role and type distributions for the top 10 results of each query. The deliberation gap between the labeler and the runtime classifier is the operational substitute for human review: the benchmark asks whether the fast runtime classifier can match the output of a slower, more deliberate LLM applying the same criteria.
  3. Scoring. The classifier runs against each query through real search providers. Its output distribution is compared to the labeler's expected distribution. A query passes if the actual distribution falls within the constraint bounds the labeler defined. Accuracy per axis is reported.
  4. Reporting. Accuracy numbers are published on this page when the benchmark runs. Production release of new classifier features requires 85% role accuracy and 90% type accuracy.
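
The scoring and release-gate steps above can be sketched in a few functions. The bounds format (label mapped to a minimum and maximum share) and the reading of per-axis accuracy as the fraction of queries that pass on that axis are assumptions; the 85% and 90% thresholds are the ones stated above.

```python
from collections import Counter

ROLE_GATE = 0.85
TYPE_GATE = 0.90


def distribution(labels):
    """Share of each label among a query's results."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def query_passes(actual_labels, bounds):
    """True if every bounded label's share falls inside its (min, max) range.

    `bounds` is an assumed format: label -> (min_share, max_share).
    """
    dist = distribution(actual_labels)
    return all(lo <= dist.get(label, 0.0) <= hi for label, (lo, hi) in bounds.items())


def axis_accuracy(pass_flags):
    """Fraction of queries that passed on one axis (assumed definition)."""
    return sum(pass_flags) / len(pass_flags)


def release_allowed(role_accuracy, type_accuracy):
    return role_accuracy >= ROLE_GATE and type_accuracy >= TYPE_GATE


# Top 10 results for one hypothetical query, role axis only.
roles = ["primary"] * 3 + ["secondary"] * 6 + ["tertiary"]
expected = {"primary": (0.2, 0.5), "tertiary": (0.0, 0.2)}
passed = query_passes(roles, expected)
```

Bounds rather than exact expected labels leave room for search providers returning slightly different results from run to run.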

Circularity is the honest concern. If the labeler LLM and the Tier 3 classifier share a substrate, the benchmark partially measures an LLM against an LLM. The mitigations: different prompting between labeler and classifier; possibly a different model; deliberately heavier deliberation for the labeler. These narrow the circularity without eliminating it. The benchmark tells you whether the fast classifier can replicate careful LLM reasoning. It does not tell you whether either matches the judgment of a trained library professional.

Why this is acceptable for an indie project. A hand-labeled benchmark by trained labelers is the classical methodology and the stronger one. It is also out of reach for a hobby project. The alternatives are: ship without a benchmark (worse), ship with a thin benchmark (worse), or ship with a documented LLM-graded benchmark and be explicit about the limits (this). The benchmark is reproducible. The prompts are public. The query set and label set are public. Anyone can rerun it, propose corrections, or run their own benchmark with stricter labelers.

Honest limits

Source role is partially context-dependent. A historian writing in 2010 about the French Revolution is secondary for the Revolution itself and primary for 21st-century historiography. Cerulean classifies at the document level using stable structural defaults, which is correct often enough to be useful, and wrong sometimes. The methodology cannot eliminate that.

Structural signals do not capture semantic context. A page that looks like editorial journalism by every structural measure can still be a press release with a byline glued on. A page that looks like an SEO farm can be a legitimate small publisher with weak design conventions. The classifier reports its best structural read; it does not verify intent.

The bundled index is not exhaustive. New domains appear constantly. Long-tail entries fall through to the heuristic classifier, and a fraction of those fall through to "unclassified." Surfacing "unclassified" honestly is better than overclaiming.

The LLM in the pipeline carries its own biases. Models trained on internet-scale text reflect their training data's view of what counts as authoritative. The pipeline mitigates this by anchoring labels on upstream validated sources wherever possible, by publishing the prompt and the model version, and by keeping corrections public and versioned. It does not eliminate the bias. Readers who want to inspect the prompt and rerun the classification can.

Current rollout status

In production today: source-type tags. Results currently display the source-type axis (journalism, academic, indie, commercial, and so on). The legacy classifier underneath uses an earlier taxonomy that is being replaced.

Rolling out: source-role tags. The two-axis system described above is in active development. The bundled index is being rebuilt against the new schema. The heuristic classifier is being extended with role features. Role tags will ship to production when the pipeline meets internal quality gates, not before.

This page documents the target system. It will be updated when production catches up to it.

How to challenge a classification

The bundled index is public and versioned. It lives in the project repository on GitHub at github.com/garretblanchette/Cerulean-search. Anyone can read it, audit it, and propose corrections.

To challenge a label, open a GitHub issue against the repository describing the domain, the current label, the proposed label, and the reasoning. Corrections are reviewed, batched, and folded into the next index version. Egregious misclassifications get fixed faster than minor disagreements. Disagreements about edge cases are documented in the issue and left visible, even if the label does not change, so that future readers can see the reasoning.

Rate-limiting and trust weighting will be added before any in-app flag UI ships, to prevent mass-flagging abuse. This page will be updated when that happens.

What a research librarian would do, in a search box.