How It Works
Cerulean is a wrapper around real search providers (currently Brave Search, Serper, and DuckDuckGo). It does not crawl the web itself; that is prohibitively expensive for an indie project. What it adds on top is source classification, applied to whatever the providers return. The conceptual framework behind this page is described on the Methodology page. What follows is the implementation.
The three-tier classifier
Every result that the providers return runs through a three-tier classification pipeline before it reaches the page.
Tier 1: bundled index. A curated JSON file shipped with each deploy, holding domain entries with source-type and stable-default source-role labels. Sub-millisecond dict lookup at runtime. Sources include Tranco top sites, Crossref DOI prefixes, the IndieWeb directory, public .gov and .edu lists, Common Crawl frequent domains, and hand-curated additions. The index is git-versioned and publicly auditable.
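As a rough illustration of the Tier 1 lookup, here is a minimal Python sketch. The JSON schema, the field names (`source_type`, `source_role`), the example domains, and the choice to fall back to parent domains are all assumptions made for the example; the real index file may be laid out differently.

```python
import json
from urllib.parse import urlsplit

# Hypothetical index contents. The real bundled file's schema is not shown
# on this page, so these keys and labels are illustrative only.
BUNDLED_INDEX_JSON = """
{
  "nature.com": {"source_type": "academic", "source_role": "primary"},
  "example-blog.net": {"source_type": "personal", "source_role": "commentary"}
}
"""

# Parsed once at startup; lookups afterwards are plain dict accesses.
INDEX = json.loads(BUNDLED_INDEX_JSON)


def normalize_host(url: str) -> str:
    """Lowercase the hostname and drop a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def lookup(url: str):
    """Return the index entry for the URL's domain, or None if unknown.

    Assumption: subdomains inherit the label of the nearest indexed parent.
    """
    parts = normalize_host(url).split(".")
    # Try the full host first, then each parent, stopping before the bare TLD.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in INDEX:
            return INDEX[candidate]
    return None
```

A `None` return is what would hand the result on to Tier 2.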
Tier 2: heuristic classifier. For domains not in the bundled index, structural features get extracted at query time: TLD class, URL path patterns, schema.org markup, byline presence, editorial-process markers, citation density, document-type signals, and several others. Two or more features must co-fire before a strong label is assigned; otherwise the result falls back to "unclassified."
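The co-firing rule can be sketched as follows. This is a simplified illustration, not the production classifier: the `Features` fields, the rule table, the label names, and the citation threshold are hypothetical, and only the "two or more features must fire" requirement comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Features:
    # A small, hypothetical subset of the structural features.
    tld_class: str            # e.g. "edu", "gov", "com"
    has_byline: bool
    has_schema_article: bool  # schema.org Article markup present
    citation_density: float   # citations per 1000 words (assumed unit)


# Hypothetical rule table: each label lists predicates that count as evidence.
RULES = {
    "academic": [
        lambda f: f.tld_class == "edu",
        lambda f: f.citation_density >= 5.0,
        lambda f: f.has_schema_article,
    ],
    "news": [
        lambda f: f.has_byline,
        lambda f: f.has_schema_article,
    ],
}

MIN_COFIRING = 2  # two or more features must agree before a label is assigned


def classify(features: Features) -> str:
    """Return the label with the most fired features, or 'unclassified'."""
    best_label, best_hits = "unclassified", 0
    for label, predicates in RULES.items():
        hits = sum(1 for p in predicates if p(features))
        if hits >= MIN_COFIRING and hits > best_hits:
            best_label, best_hits = label, hits
    return best_label
```

Requiring agreement between independent signals trades coverage for precision: a lone byline or a lone `.edu` TLD is not enough to commit to a label.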
Tier 3: background batch. For long-tail unknowns that keep surfacing in real queries, an LLM classifies them in a nightly batch, and the results are promoted into the bundled index on the next deploy. The LLM never runs at query time. It runs only in offline batches, and only on unknowns the structural classifier could not resolve.
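Putting the three tiers together, the control flow might look roughly like this. The function names, the `min_hits` threshold for "keeps surfacing", and the `llm_classify` callable are assumptions for the sketch; the LLM call is stubbed as a plain function and appears only in the offline path.

```python
from collections import Counter


def classify_result(domain, index, heuristic, unknown_counts):
    """Query-time path: Tier 1, then Tier 2, else record the unknown.

    No LLM call happens here; unresolved domains are only counted.
    """
    entry = index.get(domain)
    if entry is not None:
        return entry, "tier1"
    label = heuristic(domain)
    if label != "unclassified":
        return label, "tier2"
    unknown_counts[domain] += 1
    return "unclassified", "queued"


def nightly_batch(unknown_counts, llm_classify, min_hits=3):
    """Offline path: label recurring unknowns for the next deploy's index.

    min_hits is a hypothetical cutoff for how often a domain must surface.
    """
    return {
        domain: llm_classify(domain)
        for domain, hits in unknown_counts.items()
        if hits >= min_hits
    }
```

For example, with `counts = Counter()`, three query-time misses on the same domain would make it eligible for the next `nightly_batch` run, whose output would then be merged into the bundled index file.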
AI-content scoring
Each result also gets a 0.0 to 1.0 score estimating how likely the page is to be AI-generated content. The score is based on the title and snippet only (no destination-page fetch). Inputs include domain blocklist hits, listicle title patterns, LLM-tell phrases, em-dash density, and absence of dates or named entities. In the two-axis taxonomy, high scores feed the "SEO farm" and "aggregator" source-type classifications.
The scoring is a heuristic, not a measurement. It is wrong sometimes. Real page-level AI fraction detection requires fetching destination pages and is slower; that is on the roadmap.
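To make the shape of the heuristic concrete, here is a minimal sketch of a title-and-snippet scorer. The blocklist entry, regex patterns, phrase list, weights, and density threshold are invented for illustration and are not Cerulean's actual values; the named-entity check is left out to keep the example dependency-free.

```python
import re

# All constants below are hypothetical placeholders.
BLOCKLIST = {"content-mill.example"}
LISTICLE = re.compile(r"^\s*(top\s+)?\d+\s+(best|ways|tips|reasons|things)\b", re.I)
TELL_PHRASES = ("delve into", "in today's fast-paced world", "it's important to note")
YEAR = re.compile(r"\b(19|20)\d{2}\b")
EM_DASH = "\u2014"


def ai_score(domain: str, title: str, snippet: str) -> float:
    """Additive heuristic score in [0.0, 1.0] from title and snippet only."""
    text = f"{title} {snippet}"
    lower = text.lower()
    score = 0.0
    if domain in BLOCKLIST:
        score += 0.4
    if LISTICLE.search(title):
        score += 0.2
    if any(phrase in lower for phrase in TELL_PHRASES):
        score += 0.2
    # Em-dash density: at least one em-dash per 100 characters.
    if text.count(EM_DASH) / len(text) * 100 >= 1.0:
        score += 0.1
    # Absence of any four-digit year stands in for "no dates".
    if not YEAR.search(text):
        score += 0.1
    return min(1.0, round(score, 2))
```

Because every input is a surface feature of the title and snippet, a score like this can only suggest, never confirm, that the destination page is AI-generated.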
Filter controls
Client-side, on the result page:
- Ranking mode toggle switches between Unfiltered (raw provider results) and Quality Boost (the classifier reranks toward higher-confidence sources).
- Source-type chips narrow results to selected categories.
- Provider selector switches between Brave, Serper, and DuckDuckGo.
- Top sentences toggle shows extractive top sentences pulled from result snippets, off by default.
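The ranking toggle and source-type chips amount to a filter followed by an optional re-sort. The sketch below shows that logic in Python for consistency with the other examples on this page; since these controls run client-side, the real code is presumably JavaScript, and the `Result` fields and mode names here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    source_type: str
    confidence: float  # hypothetical classifier confidence, 0.0 to 1.0


def apply_filters(results, mode="unfiltered", selected_types=None):
    """Apply source-type chips, then the ranking mode.

    'unfiltered' keeps provider order; 'quality_boost' sorts by confidence.
    """
    kept = [
        r for r in results
        if not selected_types or r.source_type in selected_types
    ]
    if mode == "quality_boost":
        # sorted() is stable, so ties keep the provider's original order.
        kept = sorted(kept, key=lambda r: r.confidence, reverse=True)
    return kept
```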
Top sentences
Top sentences is extractive. It pulls representative sentences from result snippets and shows them as bullets. No language model involved. The intent is triage across sources, not replacement of them. This is the deliberate counterpoint to Google's AI Overview.
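One simple way to do extractive selection without a language model is to score each sentence by how many snippets share its words. The sketch below uses that approach as an assumption; the page does not specify Cerulean's actual scoring, and the function name and regexes are illustrative.

```python
import re
from collections import Counter

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"[a-z0-9]+")


def top_sentences(snippets, k=3):
    """Return up to k sentences, verbatim from the snippets, ranked by
    the average number of snippets that contain each of their words."""
    sentences = []
    for snippet in snippets:
        for part in SENTENCE_SPLIT.split(snippet.strip()):
            if part and part not in sentences:
                sentences.append(part)

    # Document frequency: in how many snippets does each word appear?
    doc_freq = Counter()
    for snippet in snippets:
        doc_freq.update(set(WORD.findall(snippet.lower())))

    def score(sentence):
        words = WORD.findall(sentence.lower())
        if not words:
            return 0.0
        return sum(doc_freq[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:k]
```

Every returned bullet is a substring of some provider snippet, so nothing is paraphrased or generated.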
No personalization
Results do not depend on who you are, where you are, or what you have searched before. Two people running the same query get the same results.
What it does not do
- Page-level AI fraction estimation on destination pages (roadmap).
- Bias scoring on news outlets.
- User accounts, history, saved searches.
Known limitations
- Video search is rough. YouTube and Vimeo results that appear in normal text search get classified alongside everything else, but Cerulean does not call provider-specific video endpoints. Dedicated video queries return thin results.
- Image search is not implemented. Use Google or DuckDuckGo for images.
- The Brave provider caps results at 20 per query (provider limit). Serper and DuckDuckGo can return up to 50.
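The per-provider result caps can be expressed as a small lookup table. The keys and the function name below are hypothetical; only the limits of 20 and 50 come from the list above.

```python
# Caps taken from the limits stated above; the key names are assumptions.
PROVIDER_CAPS = {"brave": 20, "serper": 50, "duckduckgo": 50}


def effective_count(provider: str, requested: int) -> int:
    """Clamp a requested result count to what the provider can return."""
    return max(0, min(requested, PROVIDER_CAPS[provider]))
```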