Embedding – A Real-World Use Case

02.03.2025

When dealing with extensive PDF files—like operating system documentation for Red Hat 7, which can span thousands of pages—you may want to add semantic search functionality. Unlike traditional full-text search, semantic search can return relevant text passages even when the search query itself doesn't explicitly appear in the document. Instead, it identifies text that best matches the intent behind your query.
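To make this concrete: the mechanism behind semantic search is to convert both the query and the candidate passages into embedding vectors and then rank passages by cosine similarity. Here is a minimal sketch, assuming the OpenAI Python client; the model name and example texts are only illustrative:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts):
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

passages = [
    "To change the system hostname, edit /etc/hostname and reboot.",
    "The kernel scheduler assigns CPU time slices to runnable processes.",
]
query = "How do I rename my machine?"

p = embed(passages)
q = embed([query])[0]

# Cosine similarity: the hostname passage ranks first even though the
# words "rename" and "machine" never appear in it.
scores = (p @ q) / (np.linalg.norm(p, axis=1) * np.linalg.norm(q))
print(passages[int(np.argmax(scores))])
```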

Below, we'll walk through the essential steps needed to implement semantic search in large documents using embeddings.

Chunking the Document

  • Splitting into Manageable Pieces

    • Embedding models (such as OpenAI's) can handle inputs of up to 8192 tokens, which is technically quite large. For more precise search results, however, it's recommended to use smaller chunks (e.g., 256–512 characters). Note that the model limit is counted in tokens, while the chunk sizes here are measured in characters.
    • This smaller chunk size avoids returning overly broad passages—like four full pages—which could dilute the relevance of your search results.
  • Overlap for Context

    • It's best practice to overlap your chunks slightly (often around 10% of the chunk size). For a 256-character chunk, this means consecutive chunks share about 26 characters, which preserves continuity between segments (see the sketch after this list).
  • Including Metadata

    • Attach extra metadata—such as page numbers—to each chunk for easy reference and context.
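A minimal sketch of that chunking logic in Python, using the 256-character chunk size and 26-character overlap from above (the function name and defaults are illustrative):

```python
def chunk_text(text: str, size: int = 256, overlap: int = 26) -> list[str]:
    """Split `text` into chunks of at most `size` characters, where each
    chunk repeats the last `overlap` characters of the previous one."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Because each chunk starts `size - overlap` characters after the previous one, the tail of one chunk reappears at the head of the next, so sentences cut at a chunk boundary are still searchable in one piece.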

After this step, you end up with a table-like structure:

page_number (INTEGER) | chunk (VARCHAR(256))
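As a sketch of how that table could be produced, assuming pypdf for page-wise text extraction and SQLite for storage (the PDF file name is a placeholder, and chunk_text is the helper from the sketch above):

```python
import sqlite3
from pypdf import PdfReader

conn = sqlite3.connect("chunks.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks (page_number INTEGER, chunk VARCHAR(256))"
)

reader = PdfReader("rhel7-docs.pdf")  # placeholder file name
for page_number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""
    # chunk_text is the helper from the previous sketch.
    for chunk in chunk_text(text):
        conn.execute(
            "INSERT INTO chunks (page_number, chunk) VALUES (?, ?)",
            (page_number, chunk),
        )
conn.commit()
conn.close()
```

Chunking each page separately keeps every chunk tied to a single page number, at the cost of not overlapping across page breaks.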

Source code: https://github.com/jaroslavcech/aiblog/blob/main/06EmbeddingTOC.py