Methodology

The CiteLayer Index

Five structural signals that, in published AI research, correlate with citation probability in generative answer engines. Each dimension maps to a documented stage in the Retrieval-Augmented Generation (RAG) pipeline that powers ChatGPT, Perplexity, Google AI Overviews, and Claude.

Why This Matters Now

AI-referred website sessions grew 527% year-over-year in 2025. Traditional search volume is projected to drop 25% by 2026 and 50% by 2028. The question isn’t whether AI answer engines will replace search — it’s whether your business is structured to be cited when they do.

The CiteLayer Index doesn’t measure SEO. It measures whether AI systems can find, understand, extract, compare, and recommend your business when someone asks a relevant question. These are different problems with different solutions.

527%: YoY growth in AI-referred website sessions (2025)
88%: of AI citations come from non-Google-page-one sources
2.8x: higher citation rate for pages with structured data

How AI Selects Sources

Every major AI answer engine uses Retrieval-Augmented Generation (RAG) — a two-stage process where the system first retrieves relevant documents from an index, then synthesizes an answer citing the sources it relied on. The CiteLayer Index measures readiness at each stage of this pipeline.

01 Crawl: Findability
02 Parse: Describability
03 Chunk: Summarizability
04 Retrieve: Comparability
05 Cite: Recommendability
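The two-stage process above can be sketched in miniature. The keyword-overlap scorer, toy corpus, and citation format below are illustrative assumptions, not any engine's actual implementation:

```python
# Minimal sketch of a two-stage RAG pipeline: retrieve, then synthesize
# an answer citing the sources used. All data here is placeholder.

def retrieve(query, corpus, k=2):
    """Stage 1: rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc["text"].lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def synthesize(query, sources):
    """Stage 2: compose an answer that cites the sources it relied on."""
    citations = ", ".join(doc["url"] for doc in sources)
    return f"Answer to {query!r} drawing on: {citations}"

corpus = [
    {"url": "https://example.com/a", "text": "structured data helps AI citation"},
    {"url": "https://example.com/b", "text": "gardening tips for spring"},
]
sources = retrieve("how does structured data affect AI citation", corpus)
print(synthesize("how does structured data affect AI citation", sources))
```

Pages that never make it into the stage-1 candidate set can never be cited in stage 2, which is why the Index starts with findability.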

Five Dimensions, Five Pipeline Stages

🔍 Findability

Can AI crawlers access and index your content?

Maps to the RAG ingestion stage. If GPTBot, ClaudeBot, or PerplexityBot is blocked in robots.txt — or if your content requires JavaScript to render — nothing downstream matters. These AI crawlers generally do not execute JavaScript, so content must be visible in raw HTML.

We check: crawler permissions (robots.txt), redirect chains, sitemap presence, server-side rendering, and content accessibility without JavaScript execution.

Research basis: 60% of ChatGPT queries are answered from parametric knowledge alone — but the other 40% rely on real-time retrieval. If your pages aren’t crawlable, you’re excluded from that 40% entirely (Princeton GEO study).
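One way to run the robots.txt portion of this check with Python's standard library. The bot names come from the text above; the example rules and URL are placeholders:

```python
# Check whether the major AI crawlers may fetch a given page,
# using only stdlib urllib.robotparser.
from urllib import robotparser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

def parse_ai_permissions(robots_txt, page_url):
    """Return {bot: allowed?} for each AI crawler user agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, page_url) for bot in AI_BOTS}

# Placeholder robots.txt: blocks GPTBot, allows everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(parse_ai_permissions(rules, "https://example.com/services"))
```

In real use you would fetch the live robots.txt (e.g. via `RobotFileParser.set_url` plus `read()`) rather than passing a string.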
🏷️ Describability

Does AI have structured data to understand what you do?

Maps to the knowledge graph construction stage. AI systems build internal entity representations from structured data — Organization schema, LocalBusiness schema, FAQPage markup, product/service attributes. Without this, AI guesses what you offer — or cites a competitor who made it explicit.

We check: JSON-LD schema markup, entity consistency, business attribute completeness (hours, location, services, reviews), and whether schema is server-side rendered.

Research basis: A Relixir study of 50 sites found FAQPage schema produced 2.7x higher citation rates (41% vs. 15%). Microsoft confirmed at SMX Munich (March 2025) that schema markup directly helps LLMs understand content.
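A minimal sketch of the kind of server-side-rendered JSON-LD this check looks for. The FAQPage, Question, and Answer types are real schema.org vocabulary; the business details are placeholders:

```python
# Build FAQPage JSON-LD and wrap it in the script tag that belongs
# in server-rendered HTML. Field values are illustrative only.
import json

faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What services do you offer?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "We offer X, Y, and Z, serving the Example City area.",
            },
        }
    ],
}

snippet = (
    '<script type="application/ld+json">'
    + json.dumps(faq_schema)
    + "</script>"
)
print(snippet)
```

The same pattern applies to Organization and LocalBusiness markup; the key requirement from the check above is that the tag appears in the raw HTML, not injected client-side.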
📄 Summarizability

Can AI extract clean, citable passages from your content?

Maps to the RAG chunking and extraction stage. AI systems break pages into segments and evaluate each for relevance. Content structured as modular, self-contained sections (200-500 words each) with clear headings and direct answers in the first paragraph scores highest.

We check: first-paragraph answer density, heading structure (questions vs. vague labels), content modularity, FAQ presence, and overall extractability.

Research basis: NVIDIA’s chunking research found page-level chunking achieves 0.648 accuracy with the lowest variance. The Princeton GEO study showed well-designed content optimization boosts source visibility by up to 40%. Pages with first-answer paragraphs under 40 words generated 67% more AI citations.
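A rough illustration of heading-based chunking with a word cap. The "#" heading convention and the 500-word cap are assumptions for this sketch, not NVIDIA's method:

```python
# Split a page into heading-delimited sections, then cap each chunk
# at max_words so every chunk stays a self-contained passage.

def chunk_page(text, max_words=500):
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append(" ".join(current))  # close previous section
            current = []
        current.append(line.lstrip("# ").strip())
    if current:
        sections.append(" ".join(current))

    chunks = []
    for section in sections:
        words = section.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

page = "# Pricing\nPlans start at $49/month.\n# FAQ\nYes, we offer refunds."
print(chunk_page(page))
```

Content already written as modular sections with one direct answer per heading survives this kind of segmentation with its meaning intact; sprawling undifferentiated prose does not.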
⚖️ Comparability

Does AI have enough data to compare you against alternatives?

Maps to the retrieval and ranking stage. When a user asks “best X in Y,” AI must compare entities across consistent attributes. Businesses with structured, explicit differentiation data — pricing, specialties, service areas, unique value — win the comparison. Those without it lose by default.

We check: brand entity consistency across platforms, sameAs profile links, category identification, contact detail consistency, and cross-platform presence.

Research basis: Rand Fishkin’s study (2,961 prompts across ChatGPT, Claude, and Google AI) found entities present across knowledge graphs, document indices, AND concept spaces are chosen “far more reliably.” Brands on 4+ platforms are 2.8x more likely to appear in ChatGPT responses.
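A toy version of the entity-consistency portion of this check. The compared fields and platform records are invented for illustration:

```python
# Flag business attributes whose values disagree across platform
# profiles (website, Google Business Profile, social pages, etc.).

def inconsistent_fields(profiles, fields=("name", "phone", "category")):
    """Return the fields whose values differ across platform records."""
    return [
        f for f in fields
        if len({p.get(f) for p in profiles.values()}) > 1
    ]

profiles = {
    "website":  {"name": "Acme Dental", "phone": "555-0100", "category": "Dentist"},
    "gbp":      {"name": "Acme Dental", "phone": "555-0100", "category": "Dentist"},
    "facebook": {"name": "Acme Dental Clinic", "phone": "555-0100", "category": "Dentist"},
}
print(inconsistent_fields(profiles))  # flags the mismatched business name
```

A mismatched name or category across platforms is exactly the kind of inconsistency that fragments an entity across knowledge graphs and weakens comparison-time retrieval.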

Recommendability

Does AI have trust signals to justify citing you by name?

Maps to the citation and attribution stage. AI systems need verifiable trust signals — third-party mentions, review data, content depth, freshness signals — to justify recommending a specific entity. Without these, AI defaults to safer, more documented alternatives.

We check: third-party credibility signals, review data accessibility, content depth and freshness, competitive positioning signals, and citation history across AI platforms.

Research basis: Brand search volume has a 0.334 correlation with LLM citations — the strongest single predictor. Adding statistics to content boosts visibility by 22%. Adding quotations: 37%. Semantic completeness shows 0.87 correlation with citation inclusion.

Scoring Methodology

Each dimension is scored 0-10 based on automated checks that evaluate structural signals. The five dimension scores sum to the CiteLayer AI Score (0-50). Letter grades translate the composite score into an at-a-glance assessment.

Grade | Score Range | What It Means
A+ / A / A- | 42-50 | AI systems can reliably find, describe, extract, compare, and recommend your business.
B+ / B / B- | 33-41 | Most structural signals are in place. Targeted improvements will close remaining gaps.
C+ / C / C- | 24-32 | Partial visibility. AI can find you but lacks enough data to consistently recommend you.
D+ / D / D- | 15-23 | Significant structural gaps. AI defaults to competitors with better-structured data.
F | 0-14 | Structurally invisible to AI answer engines regardless of traditional SEO performance.
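The score-to-grade mapping follows directly from the bands above. This sketch returns only the band letter, leaving the +/- refinement (which the table groups) unspecified:

```python
# Map five 0-10 dimension scores to a composite score and grade band.
BANDS = [(42, "A"), (33, "B"), (24, "C"), (15, "D"), (0, "F")]

def citelayer_grade(dimension_scores):
    """Sum the five dimension scores and return (band letter, total)."""
    assert len(dimension_scores) == 5
    total = sum(dimension_scores)
    for floor, letter in BANDS:
        if total >= floor:
            return letter, total
    return "F", total

print(citelayer_grade([9, 8, 7, 9, 9]))
```

Because the composite is a plain sum, a single weak dimension (say, blocked crawlers dropping Findability to 2) can pull an otherwise strong site down a full band.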

What We Don’t Measure (And Why)

The CiteLayer Index deliberately excludes traditional SEO metrics like keyword rankings, backlink profiles, and domain authority. These are valuable for Google Search — but AI answer engines use a fundamentally different selection process. A page can rank #1 on Google and be completely invisible to ChatGPT if it lacks structured data, isn’t crawlable by AI bots, or can’t be cleanly extracted into a citable passage.

We also don’t claim to predict exact AI responses. Rand Fishkin’s research showed fewer than 1 in 100 identical prompts produced the same brand list from AI systems. What we measure is structural readiness — whether your content has the signals that make citation possible and probable, not guaranteed.

Research Foundation

1. Princeton, Georgia Tech, Allen AI, IIT Delhi — published at KDD 2024. Foundational paper defining generative engine optimization. Well-designed optimizations boost source visibility by up to 40%.
2. Rand Fishkin / SparkToro — 2,961 prompts across ChatGPT, Claude, and Google AI. Multi-representation entities are chosen far more reliably.
3. NVIDIA Research — page-level chunking achieves 0.648 accuracy with the lowest variance across document types. Validates modular content architecture.
4. AirOps — brand search volume has a 0.334 correlation with LLM citations, the strongest single predictor. Cross-platform presence increases citation 2.8x.
5. Search Engine Land — visibility percentage across multiple runs is statistically meaningful. Defines recommendation share as a measurable KPI.
6. Frase — comprehensive analysis of GEO factors. Adding statistics: +22% visibility. Adding quotations: +37% visibility. Semantic completeness: 0.87 correlation.
7. Search Atlas — platform-specific citation patterns across ChatGPT, Perplexity, and Gemini. Wikipedia: 47.9% of ChatGPT citations. Reddit: 46.7% of Perplexity citations.

The CiteLayer Index measures structural signals correlated with AI citation in published research. It does not guarantee citation by any specific AI platform. AI system responses vary by query, region, index state, and model version. The methodology is updated as new research emerges. Signal assessment is point-in-time as of scan date.

CiteLayer AI does not claim authorship of the underlying research. We operationalize published findings into a diagnostic framework. All research sources are cited and linked above.