Astro FlexSearch: Build a Fast Offline Search Index for Your Blog
TL;DR: Prerender one JSON file per locale at build time. Fetch it lazily on
Cmd+K. Index it in memory with FlexSearch. Result: zero search-related JS or data on first paint, sub-100ms indexing for ~35 posts, no server. Part 2 covers deep-linking results to the exact matched word and the fight with Astro’s <ClientRouter /> along the way.
Most “add search to your Astro site” tutorials I’ve read do roughly the same thing: in the browser, fetch every post’s full body, build the index on the fly, search it. That’s fine when there are five posts. By thirty it starts to feel slow on first open. By three hundred you’re shipping a megabyte of MDX over the wire just to find out the user typed “arduino”.
There’s a better split, and it’s the one I’m running on didof.dev right now: do the heavy lifting at build time, ship a single prerendered JSON file per locale, and let FlexSearch chew through it in memory after the user opens the search palette.
This post walks through the actual implementation, including the Astro 5 gotcha that cost me an evening before I caught it.
What we’re building
A Cmd+K search palette that:
- Loads its index lazily on first open (zero cost on page load).
- Searches across post title, description, tags, keywords, categories, and full body text.
- Renders 16:9 thumbnails for blog results, generated at build time.
- Highlights category and tag matches as colored pills when the query hits them.
Here’s the flow at a glance: at build time, one JSON index file is emitted per locale; at runtime, the palette fetches that file on first open and indexes it in memory with FlexSearch.
Two halves, clean boundary. The build half runs once per deploy. The runtime half runs once per session, after the user actually wants to search. Both stay simple because neither knows about the other.
The endpoint
Astro will happily prerender a JSON API route if you give it getStaticPaths and don’t mark it as on-demand. That means we can write what looks like a server endpoint and end up with a static file in the deploy output.
// src/pages/api/search/[locale].json.ts
import { getCollection } from "astro:content";
import { getImage } from "astro:assets";
import { isPublished } from "@/lib/posts";
export function getStaticPaths() {
return [{ params: { locale: "en" } }, { params: { locale: "it" } }];
}
export async function GET({ params }: { params: { locale: string } }) {
const { locale } = params;
const posts = await getCollection("blog");
const publishedPosts = posts
.filter((post) => post.id.endsWith(`/${locale}`))
.filter(isPublished);
const blogItems = await Promise.all(
publishedPosts.map(async (post) => {
const slug = post.id.split("/").slice(0, -1).join("/");
const tags = (post.data.tags ?? []) as string[];
const keywords = (post.data.keywords ?? []) as string[];
const categories = (post.data.categories ?? []).map(
(c: { id: string } | string) => (typeof c === "string" ? c : c.id),
);
const body = extractSearchableText(post.body ?? "");
const description = post.data.description ?? "";
const searchText = [
post.data.title,
description,
tags.join(" "),
keywords.join(" "),
categories.join(" "),
body,
]
.filter(Boolean)
.join(" ");
let thumbnail = null;
if (post.data.cover) {
const img = await getImage({
src: post.data.cover,
width: 192,
height: 108,
format: "webp",
fit: "cover",
});
thumbnail = { src: img.src, width: 192, height: 108 };
}
return {
id: `blog-${slug}`,
type: "blog" as const,
title: post.data.title,
description,
tags,
categories,
url: `/${locale}/blog/${slug}`,
body,
searchText,
thumbnail,
};
}),
);
return new Response(JSON.stringify(blogItems), {
headers: { "Content-Type": "application/json" },
});
}
A few things worth pointing out, because each one is a small decision that adds up:
- getStaticPaths: by listing every locale, Astro generates /api/search/en.json and /api/search/it.json at build time. They’re plain static files. Cache them forever and let the CDN do its job.
- isPublished filter: a tiny helper that respects the publish_after and draft fields. Drafts have no business in a public search index. (A minimal sketch follows this list.)
- getImage: from astro:assets. Resolves the post’s cover image (a relative import in frontmatter) into a build-optimized webp at 192x108. We bake the URL into the JSON so the client renders it directly. No client-side image processing, no layout shift.
- Promise.all: getImage is async. Without Promise.all you serialize image generation across every post and the build slows to a crawl on a large blog.
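The isPublished helper is referenced but not shown. Here’s a minimal sketch of what it needs to do, assuming the frontmatter fields are called draft and publish_after as described above:
// src/lib/posts.ts — sketch; field names (draft, publish_after) assumed from the description above
import type { CollectionEntry } from "astro:content";
export function isPublished(post: CollectionEntry<"blog">): boolean {
// Drafts never go into the public index
if (post.data.draft) return false;
// publish_after gates posts until their scheduled date has passed
const publishAfter = post.data.publish_after;
if (publishAfter && new Date(publishAfter) > new Date()) return false;
return true;
}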
The Astro 5 gotcha that cost me an evening
Look at the filter again:
.filter((post) => post.id.endsWith(`/${locale}`))
In Astro 4, with the legacy content collections loader, post.id included the file extension: my-post/en.md or my-post/en.mdx. Most tutorials still tell you to filter with endsWith('/${locale}.md').
Astro 5’s glob() loader strips extensions. post.id is now my-post/en, full stop. If you carry the old .md filter forward, every single post gets filtered out and your search returns nothing. The endpoint still serves a perfectly valid JSON file. It’s just empty.
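Side by side, the difference looks trivial, which is exactly why it slips through (the ids here are illustrative):
// Astro 4, legacy collections: post.id === "my-post/en.mdx" (extension included)
posts.filter((post) => post.id.endsWith(`/${locale}.md`)); // the filter old tutorials suggest
// Astro 5, glob() loader: post.id === "my-post/en" (extension stripped)
posts.filter((post) => post.id.endsWith(`/${locale}`)); // the filter that actually matches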
The symptom: search for “arduino”, get zero results, even though there’s an obvious arduino post on the site. The fix is one character. The discovery took me longer than I’d like to admit.
If you hit this, build with astro build, open dist/api/search/en.json in a real text editor, and count the items. If it’s [], you know.
Stripping MDX and Markdown noise
post.body is the raw source: imports, JSX components, fenced code blocks, link syntax, emphasis markers, the lot. Indexing it raw means every search for “import” matches every post. We need clean prose.
function extractSearchableText(raw: string): string {
let text = raw;
// MDX import/export lines
text = text.replace(/^\s*(import|export)\s.+$/gm, " ");
// Fenced code blocks
text = text.replace(/```[\s\S]*?```/g, " ");
// Inline code
text = text.replace(/`[^`\n]*`/g, " ");
// JSX self-closing tags
text = text.replace(/<[A-Z][\w.]*[^>]*\/>/g, " ");
// JSX paired tags (best effort, non-greedy)
text = text.replace(/<([A-Z][\w.]*)[^>]*>[\s\S]*?<\/\1>/g, " ");
// HTML tags
text = text.replace(/<\/?[a-z][\w-]*[^>]*>/gi, " ");
// Markdown images: keep the alt text
text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, " $1 ");
// Markdown links: keep the link text
text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, " $1 ");
// Heading markers
text = text.replace(/^#{1,6}\s+/gm, "");
// Emphasis markers
text = text.replace(/[*_~]{1,3}/g, "");
// Blockquote markers
text = text.replace(/^>\s?/gm, "");
// List markers
text = text.replace(/^\s*([-*+]|\d+\.)\s+/gm, "");
return text.replace(/\s+/g, " ").trim();
}
Yes, it’s a pile of regexes. No, it’s not a real Markdown parser. It doesn’t need to be. The output is for a fuzzy text index, not for rendering. A few false positives in either direction are fine. What matters is that JSX components, code blocks, and import lines don’t pollute matches.
If you want to be more rigorous, swap this for remark with a stringify pipeline. For didof.dev’s volume, the regex approach is plenty.
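For the record, that more rigorous pipeline is roughly remark plus the strip-markdown plugin. A sketch (the function name is a placeholder), assuming you still handle the MDX-specific cleanup yourself, since strip-markdown only understands plain Markdown:
import { remark } from "remark";
import strip from "strip-markdown";
async function extractProse(raw: string): Promise<string> {
// strip-markdown removes formatting nodes and leaves the plain text behind
const file = await remark().use(strip).process(raw);
return String(file).replace(/\s+/g, " ").trim();
}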
One combined searchText field, not many
FlexSearch’s Document API lets you index multiple fields independently:
new FlexSearch.Document({
document: {
id: "id",
index: ["title", "description", "body", "tags"],
store: true,
},
});
You’d think this is the right move. It isn’t, in my experience. Per-field indexes mean the user has to type a query that hits the right field. “arduino” in the title scores differently from “arduino” in the body, and merging the field results is something you end up writing.
The simpler approach: build one combined searchText string per item that already includes title, description, tags, keywords, categories, and body. Index that one field. Score and rank are the same regardless of where the term appears.
const searchText = [
post.data.title,
description,
tags.join(" "),
keywords.join(" "),
categories.join(" "),
body,
]
.filter(Boolean)
.join(" ");Smaller index, simpler query, fewer surprises. The store still keeps title, description, tags, and categories as separate fields so the UI can render them properly.
The client island
The search palette is a React island, hydrated on demand. The relevant config:
import FlexSearch from "flexsearch";
const index = useMemo(
() =>
new FlexSearch.Document({
document: {
id: "id",
index: ["searchText"],
store: [
"title",
"description",
"tags",
"categories",
"url",
"type",
"thumbnail",
],
},
tokenize: "forward",
cache: true,
}),
[],
);
Three things going on:
- index: ['searchText']: only the combined field is tokenized and indexed. Everything else is just storage.
- store: [...]: the list of fields kept verbatim so search hits return enough metadata to render result rows. Note that id isn’t in the store. FlexSearch returns it separately on each hit, so when reconstructing items I do { id: hit.id, ...hit.doc }.
- tokenize: 'forward': prefix tokenizer. Typing “ardu” matches “arduino”, which is the behavior users expect from a live search-as-you-type palette.
Fetching and indexing happens on first open, not on page load:
useEffect(() => {
if (!isOpen || data) return;
setIsLoading(true);
fetch(`/api/search/${locale}.json`)
.then((r) => r.json())
.then((items: SearchItem[]) => {
setData(items);
items.forEach((item) => index.add(item));
setIsLoading(false);
})
.catch(() => setIsLoading(false));
}, [isOpen, data, locale, index]);
The result: zero search-related JS or data on first paint of any page. Nothing happens until the user actually presses Cmd+K. Indexing 35 posts in a Document index is well under 100ms on a normal laptop, so the loading state barely flashes.
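The Cmd+K wiring itself isn’t shown above; it’s the usual window keydown listener. A minimal sketch, assuming the island keeps isOpen in its own useState:
useEffect(() => {
  const onKeyDown = (e: KeyboardEvent) => {
    // Cmd+K on macOS, Ctrl+K elsewhere; Escape closes
    if ((e.metaKey || e.ctrlKey) && e.key.toLowerCase() === "k") {
      e.preventDefault();
      setIsOpen((open) => !open);
    }
    if (e.key === "Escape") setIsOpen(false);
  };
  window.addEventListener("keydown", onKeyDown);
  return () => window.removeEventListener("keydown", onKeyDown);
}, []);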
Then the actual search:
const raw = index.search(query, 12, { enrich: true });
const seen = new Map<string, SearchItem>();
(raw as any[]).forEach((fieldResult) => {
fieldResult.result.forEach((hit: any) => {
const id = String(hit.id);
seen.set(id, { id, ...(hit.doc as Omit<SearchItem, "id">) });
});
});
setResults(Array.from(seen.values()));
enrich: true makes FlexSearch return the stored document fields alongside the id. The Map dedupes in case the same item appears in multiple field results (still possible if you ever expand the index later).
UX sugar: pill-highlight category and tag matches
Plain text snippets work, but a query like “electronics” or “arduino” deserves a stronger visual cue when it matches a category or a tag exactly. So the result row checks for that and renders a colored pill:
function findMatches(values: string[] | undefined, query: string): string[] {
if (!values?.length || !query.trim()) return [];
const tokens = query.toLowerCase().split(/\s+/).filter(Boolean);
return values.filter((v) => {
const lower = v.toLowerCase();
return tokens.some((t) => lower.includes(t));
});
}
In the row component, categories get the primary brand color with a small dot, and tags fall back to a neutral pill. It’s a tiny detail, and exactly the kind of thing that makes the difference between a search box that feels “okay” and one that feels considered.
const matchedCategories = findMatches(item.categories, query);
const matchedTags =
matchedCategories.length === 0 ? findMatches(item.tags, query) : [];
If categories matched, skip showing tag pills. Two separate pill styles in the same row read as visual noise.
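The pill markup is unremarkable. A rough sketch of the row fragment, with class names as placeholders for whatever your styling layer provides:
{matchedCategories.map((cat) => (
  <span key={cat} className="pill pill-primary">
    <span className="pill-dot" aria-hidden="true" />
    {cat}
  </span>
))}
{matchedTags.map((tag) => (
  <span key={tag} className="pill pill-neutral">
    {tag}
  </span>
))}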
The body snippet trick
The other small touch: when a query matches text inside the post body, surface a snippet around the match.
const SNIPPET_RADIUS = 70;
function buildSnippet(body: string | undefined, query: string): string | null {
if (!body || !query) return null;
const q = query.trim().toLowerCase();
const lower = body.toLowerCase();
const tokens = q.split(/\s+/).filter(Boolean);
let pos = -1;
for (const t of tokens) {
const idx = lower.indexOf(t);
if (idx !== -1) {
pos = idx;
break;
}
}
if (pos === -1) return null;
const start = Math.max(0, pos - SNIPPET_RADIUS);
const end = Math.min(body.length, pos + SNIPPET_RADIUS);
const prefix = start > 0 ? "…" : "";
const suffix = end < body.length ? "…" : "";
return `${prefix}${body.slice(start, end).trim()}${suffix}`;
}
The result row then prefers the snippet over the description when the body matched the query, so the user sees why this post showed up. This is the difference between “search returned 12 results” and “search returned 12 results, and I can already see which one I want.”
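In the result row, that preference is a couple of lines, something like:
// Prefer the body snippet when the query hit the body; otherwise show the description
const snippet = buildSnippet(item.body, query);
const secondaryText = snippet ?? item.description;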
When this approach doesn’t fit
I’d skip the precomputed JSON if any of these are true:
- Truly massive content sets: if your blog has thousands of posts with full body text, the JSON file gets too big to ship as a single asset. At that point, look at Pagefind, which shards the index and fetches only what each query needs.
- Authenticated content: if some posts are gated, you can’t ship them in a static index. Either filter at build time (only public content gets indexed) or move to a server-side endpoint with auth.
- Stale-while-revalidate concerns: deploys are a clean refresh, but if you publish hourly and want indexing to keep up, you need an incremental rebuild story or a runtime endpoint.
For a typical Astro blog (tens to low hundreds of posts, all public, refreshed on deploy), precomputing the index and indexing it once in the browser gives you near-instant search with no infrastructure. Same family of tradeoffs as running SQLite in the browser: push the work to the client, lean on static delivery, skip the server.
Extending the FlexSearch index: workers, telemetry, recents
I haven’t bothered with any of these on didof.dev yet, but they’re the obvious next moves. Building the FlexSearch index in a Web Worker would push the ~100ms indexing cost off the main thread on first open. Persisting recent queries to localStorage and surfacing them in the empty state is a 30-line patch. Logging which results actually get clicked, and which don’t, gives you tuning signal almost immediately.
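The recent-queries patch really is that small. A sketch of the persistence half, with the storage key and cap as assumptions:
const RECENTS_KEY = "site-search:recent"; // hypothetical key name
const MAX_RECENTS = 5;
function loadRecents(): string[] {
  try {
    return JSON.parse(localStorage.getItem(RECENTS_KEY) ?? "[]");
  } catch {
    return [];
  }
}
function saveRecent(query: string): void {
  const q = query.trim();
  if (!q) return;
  // Most recent first, deduplicated, capped
  const next = [q, ...loadRecents().filter((r) => r !== q)].slice(0, MAX_RECENTS);
  localStorage.setItem(RECENTS_KEY, JSON.stringify(next));
}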
The base implementation is small enough (one endpoint, one React island, a few hundred lines total) that all of these stay one-day tasks rather than month-long migrations. Which is the whole point of preferring small, owned components over a hosted search service for a personal blog.
The two files that carry the weight are src/pages/api/search/[locale].json.ts (build-time index emitter) and src/components/shell/SiteSearch.tsx (the React island). Everything in this post is enough to rebuild both from scratch in an afternoon.