How It Works

Cerulean is a wrapper around real search providers (currently Brave Search, Serper, and DuckDuckGo). It does not crawl the web itself; that is prohibitively expensive for an indie project. What it adds on top is source classification, applied to whatever the providers return. The framework underneath this page is on the Methodology page. What follows is the implementation.

The three-tier classifier

Every result that the providers return runs through a three-tier classification pipeline before it reaches the page.

Tier 1: bundled index. A curated JSON file shipped with each deploy, holding domain entries with source-type and stable-default source-role labels. Sub-millisecond dict lookup at runtime. Sources include Tranco top sites, Crossref DOI prefixes, the IndieWeb directory, public .gov and .edu lists, Common Crawl frequent domains, and hand-curated additions. The index is git-versioned and publicly auditable.
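A minimal sketch of what a Tier 1 lookup might look like, assuming the index is a JSON object keyed by bare domain. The field names (source_type, source_role), the label values, the example domains, and the helper names are all hypothetical; the real schema is in the git-versioned index file.

```python
import json
from urllib.parse import urlsplit

# Hypothetical index content standing in for the bundled JSON file.
BUNDLED_INDEX_JSON = """
{
  "journal.example": {"source_type": "academic", "source_role": "primary"},
  "agency.example": {"source_type": "government", "source_role": "primary"}
}
"""

# Parsed once at startup, so each query-time lookup is a plain dict access.
BUNDLED_INDEX = json.loads(BUNDLED_INDEX_JSON)


def normalize_domain(url: str) -> str:
    """Return the lowercased hostname with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def tier1_lookup(url: str):
    """Return the index entry for the URL's domain, or None if it is absent."""
    return BUNDLED_INDEX.get(normalize_domain(url))
```

A None return is the signal to fall through to Tier 2.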

Tier 2: heuristic classifier. For domains not in the bundled index, structural features get extracted at query time: TLD class, URL path patterns, schema.org markup, byline presence, editorial-process markers, citation density, document-type signals, and several others. Two or more features must co-fire before a strong label is assigned; otherwise the result falls back to "unclassified."
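The co-firing rule can be sketched as a simple vote count. This assumes "co-fire" means that at least two extracted features point at the same label, which is one reading of the rule; the feature names and labels below are illustrative, not the production set.

```python
from collections import Counter

# Hypothetical feature -> label votes.
FEATURE_VOTES = {
    "tld_gov": "government",
    "path_has_doi": "academic",
    "schema_scholarly_article": "academic",
    "has_byline": "news",
    "schema_news_article": "news",
}

MIN_COFIRING = 2  # from the text: two or more features must co-fire


def tier2_classify(fired_features):
    """Return a label only when enough fired features agree on it."""
    # De-duplicate and sort so the result does not depend on input order.
    known = sorted(f for f in set(fired_features) if f in FEATURE_VOTES)
    votes = Counter(FEATURE_VOTES[f] for f in known)
    if not votes:
        return "unclassified"
    label, count = votes.most_common(1)[0]
    return label if count >= MIN_COFIRING else "unclassified"
```

A single strong-looking signal, such as a byline alone, is deliberately not enough to assign a label.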

Tier 3: background batch. For long-tail unknowns that keep surfacing in real queries, an LLM classifies them in a nightly batch, and the results are promoted into the bundled index on the next deploy. The LLM never runs at query time. It runs only in offline batches, and only on unknowns the structural classifier could not resolve.
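Put together, the query-time path might look like the sketch below: Tier 1, then Tier 2, and for anything still unresolved, only a counter increment. The function names, the simplified index (domain mapped straight to a label), and the min_hits threshold are assumptions for illustration; the page does not say how often a domain must recur before it enters the batch.

```python
from collections import Counter


def classify(domain, bundled_index, heuristic, unknown_counts):
    """Run Tier 1, then Tier 2; record unresolved domains for the offline batch."""
    label = bundled_index.get(domain)
    if label is not None:
        return label, "tier1"
    label = heuristic(domain)
    if label != "unclassified":
        return label, "tier2"
    # No LLM call here: the domain is only counted for the nightly job.
    unknown_counts[domain] += 1
    return "unclassified", "queued"


def nightly_batch_candidates(unknown_counts, min_hits=3):
    """Domains that surfaced at least min_hits times (hypothetical threshold)."""
    return [d for d, n in unknown_counts.most_common() if n >= min_hits]
```

The offline job would then label the returned candidates and write them into the index file for the next deploy.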

AI-content scoring

Each result also gets a score from 0.0 to 1.0 estimating how likely the page is to be AI-generated content. The score is computed from the title and snippet only; the destination page is not fetched. Inputs include domain blocklist hits, listicle title patterns, LLM-tell phrases, em-dash density, and absence of dates or named entities. In the two-axis taxonomy, high scores feed the "SEO farm" and "aggregator" source-type classifications.
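As a rough sketch, a score like this can be an additive heuristic clamped to the 0.0 to 1.0 range. Every weight, regex, phrase, threshold, and blocklist entry below is an illustrative assumption rather than the production value, and the named-entity check is left out because it would need an NER step.

```python
import re

# All values below are hypothetical.
BLOCKLIST = {"contentfarm.example"}
LISTICLE_RE = re.compile(r"^\s*(top\s+)?\d+\s+(best|ways|tips|reasons|things)\b", re.I)
TELL_PHRASES = ("in today's fast-paced world", "delve into", "it is important to note")
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")  # crude stand-in for "has a date"
EM_DASH = "\u2014"


def ai_content_score(domain, title, snippet):
    """Heuristic 0.0-1.0 score from the domain, title, and snippet only."""
    text = f"{title} {snippet}"
    lowered = text.lower()
    score = 0.0
    if domain in BLOCKLIST:
        score += 0.4
    if LISTICLE_RE.search(title):
        score += 0.2
    if any(phrase in lowered for phrase in TELL_PHRASES):
        score += 0.2
    word_count = max(len(text.split()), 1)
    if text.count(EM_DASH) / word_count > 0.03:
        score += 0.1
    if not YEAR_RE.search(text):
        score += 0.1
    return round(min(score, 1.0), 2)
```

Nothing here fetches the destination page, which is why the score can only reflect what the provider's title and snippet reveal.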

The scoring is a heuristic, not a measurement. It is wrong sometimes. Real page-level detection of the AI-generated fraction requires fetching destination pages, which is slower; that is on the roadmap.

Filter controls

Client-side, on the result page:

Top sentences

Top sentences is extractive. It pulls representative sentences from result snippets and shows them as bullets. No language model involved. The intent is triage across sources, not replacement of them. This is the deliberate counterpoint to Google's AI Overview.
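One way to do extractive selection without a language model is plain word-frequency scoring across the snippets. The sketch below assumes that approach; the actual ranking method, the stopword list, and the function names are not taken from the Cerulean codebase.

```python
import re
from collections import Counter

# Small illustrative stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that"}


def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]


def top_sentences(snippets, k=3):
    """Return up to k sentences, copied verbatim, ranked by average word frequency."""
    sentences = []
    for snippet in snippets:
        for sentence in split_sentences(snippet):
            if sentence not in sentences:  # drop exact duplicates
                sentences.append(sentence)
    freq = Counter(w for s in sentences for w in tokens(s))

    def score(sentence):
        toks = tokens(sentence)
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    return sorted(sentences, key=score, reverse=True)[:k]
```

Because sentences are only selected and never rewritten, every bullet can be traced back to the snippet it came from.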

No personalization

Results do not depend on who you are, where you are, or what you have searched before. Two people running the same query get the same results.

What it does not do

Known limitations