Pillar 2 of 4

AI Extractability

The highest-weighted pillar. Measures how well a large language model can extract, chunk, and cite your content, from schema markup to paragraph structure.

35% of your total LLM Visibility Score

The difference between being read and being cited

Structural integrity tells an LLM what your page is. AI Extractability determines whether it can actually use your content. These are two different problems, and this pillar, at 35% of the total score, is the more consequential one.

When an LLM processes a page for potential citation or summarization, it doesn't read it linearly. It chunks content into discrete units, identifies patterns that signal factual or instructional value, and evaluates whether the information is structured for reliable extraction. A page full of long, unbroken paragraphs with no schema markup and no date signals is technically readable, but it's not extractable in any meaningful sense.

AI Extractability is the single strongest predictor of whether your content gets cited in an AI-generated answer. A page can have perfect structural integrity and still score poorly here.

This pillar covers eight distinct checks, each targeting a specific behavior in how LLMs parse and prioritize web content. JSON-LD schema markup alone can account for up to 26 points, making it the most impactful individual optimization available.

What gets measured and why

JSON-LD Schema Markup
Up to +26 pts

JSON-LD structured data is the single most impactful optimization in this entire scoring system. It provides LLMs with explicit, machine-readable descriptions of your content's type, structure, and relationships without requiring the model to infer any of this from prose.

The analyzer rewards specific schema types based on their extractability value. FAQPage earns the most points (+10) because it explicitly structures question-answer pairs, the exact format LLMs use when generating responses. HowTo earns +8 for similar reasons. Organization and Person schemas earn +5 as trust and authorship signals. Having more than one schema type on a page earns an additional +3 for breadth.

FAQPage + Organization (+18 pts combined)
HowTo + BreadcrumbList (+11 pts combined)
No JSON-LD schema found (0 pts)

If you implement only one change from this entire pillar, make it JSON-LD schema. A single FAQPage schema on a relevant page can add 10 points to this pillar alone.
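
As an illustration, a minimal FAQPage block looks like the sketch below. The questions, answers, and wording are placeholders, not prescribed values:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "What is LLM visibility?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "LLM visibility is how reliably AI assistants can extract, quote, and cite your content."
          }
        },
        {
          "@type": "Question",
          "name": "How do I improve my score?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Add structured data, keep paragraphs between 40 and 120 words, and format enumerable content as lists."
          }
        }
      ]
    }
    </script>

Pairing a block like this with an Organization block on the same page would also trigger the multiple-schema-types bonus.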

Paragraph Length
+8 pts (or −5 penalty)

LLMs extract content in chunks, and paragraphs are the natural chunking unit in prose-based content. Paragraphs that are too short provide insufficient context for accurate extraction. Paragraphs that are too long force the model to split them in unpredictable ways, often losing nuance or misattributing sentences.

The analyzer measures the average word count across all paragraphs with more than 20 characters. The ideal range is 40–120 words per paragraph. Paragraphs averaging over 200 words receive a −5 penalty. Paragraphs averaging under 40 words receive 0 points: not a penalty, but a missed opportunity.

Avg paragraph: 72 words (ideal 40–120) → +8 pts
Avg paragraph: 240 words (too long) → −5 pts
Bullet & Numbered Lists
+5 pts (or −5 penalty)

Lists are among the most LLM-friendly content formats in existence. They present information in discrete, labeled units that are trivially easy to extract and reformat. When an LLM generates a response that includes "here are five reasons why…", it is almost always drawing from list-formatted source content.

The analyzer checks for the presence of <ul> and <ol> elements. Two or more lists earn +5 points. Zero lists incur a −5 penalty, because a page with no list content at all is significantly harder for an LLM to extract from than one that structures at least some information in list form.

Converting even one section of your prose into a bulleted list, especially for steps, features, or comparisons, can meaningfully improve your extractability score.
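
For example (the content here is invented), a sentence like "Our onboarding covers account setup, data import, and team invitations" converts directly:

    <ul>
      <li>Account setup</li>
      <li>Data import</li>
      <li>Team invitations</li>
    </ul>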

Definitional Patterns
Up to +9 pts

LLMs are disproportionately likely to extract and cite content that answers direct questions or defines concepts. This is because AI assistants are primarily used in a question-answering context, and they preferentially source from content that is already structured as an answer.

The analyzer detects six specific patterns in your page's text that signal definitional or instructional intent: "What is" and "How to" in English, plus four Greek patterns: "Τι είναι" ("What is"), "Πώς να" ("How to"), "Οδηγός" ("Guide"), and "Βήματα" ("Steps"). Each detected pattern earns +3 points, up to a maximum of +9.

"What is LLM visibility?" + "How to improve your score" → +6 pts
No definitional patterns detected → 0 pts
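
In practice these patterns usually live in headings followed by a direct answer, as in this hypothetical snippet:

    <h2>What is LLM visibility?</h2>
    <p>LLM visibility is a measure of how reliably AI assistants can extract and cite a page's content.</p>
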
Data Tables
+4 pts

HTML tables are highly structured content that LLMs can parse with near-perfect accuracy. Comparative data, specifications, pricing tables, and feature matrices are all significantly more extractable in table format than in prose, and LLMs will preferentially use table data when generating structured comparisons.

The analyzer checks for the presence of at least one <table> element. This is a modest bonus (+4 pts) because tables are relevant only to a subset of page types; not every page benefits from tabular data. But for product, comparison, or reference pages, a well-structured table is a strong extractability signal.
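
A hypothetical pricing comparison shows the pattern; any <table> with clear headers qualifies:

    <table>
      <thead>
        <tr><th>Plan</th><th>Price</th><th>Seats</th></tr>
      </thead>
      <tbody>
        <tr><td>Starter</td><td>$9/mo</td><td>1</td></tr>
        <tr><td>Team</td><td>$29/mo</td><td>10</td></tr>
      </tbody>
    </table>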

Breadcrumb Navigation
Up to +5 pts

Breadcrumbs serve a dual purpose for LLM extractability: they provide navigational context (where this page sits within the site hierarchy) and, when implemented with BreadcrumbList JSON-LD schema, they provide machine-readable path information that helps LLMs understand the topical context of the content.

The analyzer distinguishes between two levels of implementation. A BreadcrumbList JSON-LD schema earns the full +5 points; it's the most reliable signal. A breadcrumb navigation detected via CSS class or ID patterns (without schema) earns +3 points. No breadcrumb at all earns 0.

BreadcrumbList JSON-LD schema → +5 pts
~ HTML breadcrumb nav (no schema) → +3 pts
No breadcrumb detected → 0 pts
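
A sketch of the full-credit implementation, with example.com standing in for your domain and the path segments as placeholders:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://example.com/" },
        { "@type": "ListItem", "position": 2, "name": "Pillars", "item": "https://example.com/pillars" },
        { "@type": "ListItem", "position": 3, "name": "AI Extractability", "item": "https://example.com/pillars/ai-extractability" }
      ]
    }
    </script>
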
Date Signals
Up to +5 pts

Publication and modification dates are freshness signals that LLMs use to assess content recency. For factual or rapidly changing topics, an LLM may prefer more recently dated content over older content, even if the older content is structurally superior. Explicitly marking your content with dates removes ambiguity about when it was written.

The analyzer detects date signals across multiple implementation methods: itemprop="datePublished", property="article:published_time", <time datetime="...">, and JSON-LD "datePublished". Having both datePublished and dateModified earns the full +5 points. Having only one earns +3.
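
Any of the detected methods works. One sketch combining two of them (the dates themselves are placeholders):

    <meta property="article:published_time" content="2025-01-15T09:00:00Z">
    <meta property="article:modified_time" content="2025-06-02T14:30:00Z">

    <time itemprop="datePublished" datetime="2025-01-15">January 15, 2025</time>
    <time itemprop="dateModified" datetime="2025-06-02">Updated June 2, 2025</time>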

Internal Links
Up to +5 pts

Internal links signal to LLMs that a page is part of a larger content ecosystem rather than an isolated document. They also help AI crawlers navigate and index your site more effectively. Pages with strong internal link structures tend to receive more consistent attribution because the model can contextualize them within a broader topical authority.

The analyzer counts links with href values starting with / or # (relative internal links). Five or more internal links earn the full +5 points. Two to four earn +2. Fewer than two earn 0, and for content pages this is almost always fixable by adding contextual links to related content.

8 internal links found → +5 pts
~ 3 internal links found → +2 pts
1 internal link found → 0 pts
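
Since the analyzer counts href values starting with / or #, both of these hypothetical links would qualify:

    <p>See the <a href="/pillars/structural-integrity">Structural Integrity pillar</a> for markup basics,
    or jump to the <a href="#quick-wins">quick wins</a> below.</p>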

How the score is calculated

The AI Extractability pillar has a raw maximum of 54 points, normalized to a 0–100 scale. Penalties apply for actively poor implementations (long paragraphs, no lists). The schema checks alone can contribute up to 26 points, nearly half the raw maximum.
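
Assuming straight linear scaling (the exact normalization formula isn't spelled out here), a page earning 38 of the 54 raw points would land at 38 / 54 × 100 ≈ 70 on this pillar.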

Check                        | Max Points | Key Conditions
FAQPage schema               | +10        | JSON-LD FAQPage detected
HowTo schema                 | +8         | JSON-LD HowTo detected
Organization / Person schema | +5         | Entity schema detected
Multiple schema types        | +3         | More than one JSON-LD type
Paragraph length             | +8 / −5    | Ideal: 40–120 words avg. Penalty: 200+ words avg
Lists (ul / ol)              | +5 / −5    | 2+ lists: +5. Zero lists: −5
Definitional patterns        | +9         | +3 per pattern (What is, How to, Greek equivalents)
Data tables                  | +4         | At least one <table> element
Breadcrumb                   | +5         | BreadcrumbList schema: +5. HTML only: +3
Date signals                 | +5         | Both published + modified: +5. One only: +3
Internal links               | +5         | 5+ links: +5. 2–4: +2. Under 2: 0
Total (normalized to 100)    | 100        | Raw max: 54 pts before normalization

What the analyzer finds most often

No JSON-LD schema whatsoever
The majority of pages analyzed have zero structured data. This is the single largest missed opportunity in LLM optimization, and it's implementable in under an hour for most content types.
Wall-of-text paragraphs
Long-form content pages frequently have average paragraph lengths of 200+ words. This almost always results from drafting without a structure-first approach. Breaking paragraphs at logical points is a low-effort, high-impact fix.
No lists on content-heavy pages
Service pages and about pages frequently present information that would be naturally listable (features, benefits, process steps) as unbroken paragraphs. This consistently results in the −5 list penalty.
Missing date signals
Static pages and landing pages rarely have publication dates, even when the content has been recently updated. Adding a simple <time> element or article:published_time meta tag takes minutes and earns up to +5 points.
No breadcrumb implementation
Many sites have visual breadcrumb navigation but no corresponding schema markup. The HTML breadcrumb earns partial credit (+3), but adding a BreadcrumbList JSON-LD block alongside it is a straightforward upgrade.
Insufficient internal linking
Standalone pages and microsites frequently have fewer than 2 internal links. Even adding 3–5 contextually relevant links to related pages on your site is enough to move from 0 to +2 or +5 on this check.

Quick wins for AI Extractability

Unlike some pillars where improvements require content rewrites, most AI Extractability fixes are additive: you add schema, you add structure, you add links. The underlying content doesn't have to change.

01
Add a FAQPage or HowTo JSON-LD schema
If your page answers questions, implement a FAQPage schema. If it describes a process, use HowTo. Both are easy to generate with any schema generator and can add up to 10–18 points to this pillar alone.
High impact
02
Break long paragraphs into 40–120 word chunks
Go through your content and split any paragraph over 150 words at a natural logical break. This is the fastest way to improve your paragraph length score and it also improves human readability as a side effect.
Low effort
03
Convert at least two prose sections into lists
Identify content in your page that naturally enumerates items (features, steps, reasons, examples) and convert it to <ul> or <ol>. Two lists is the minimum to earn +5 points and avoid the −5 penalty.
Low effort
04
Add a BreadcrumbList schema to every page
Implement a BreadcrumbList JSON-LD block that maps the page's path from homepage to current page. This earns +5 on breadcrumb and also contributes to the overall schema count bonus.
Low effort
05
Add publication and modification dates
Add <time itemprop="datePublished" datetime="YYYY-MM-DD"> and <time itemprop="dateModified" datetime="YYYY-MM-DD"> to your page. For articles, also add article:published_time and article:modified_time Open Graph tags.
Low effort
06
Add "What is" or "How to" sections to content pages
If your page is about a topic, add a section that explicitly defines it or explains how something works. Phrases like "What is [topic]?" and "How to [action]" are both strong definitional signals and naturally attract LLM citation.
Medium effort

See how your page scores on AI Extractability

Run a free analysis and get a detailed breakdown of every check with specific recommendations for your page.

Run a free analysis ↗
