How to Implement Semantic Search in Practice
In the real world, the text being searched is rarely stored directly as files within a file system. Companies typically utilize tools such as Confluence, SharePoint, or various online shopping platforms. Although some of these tools already boast "AI search" capabilities, the actual results frequently fall short of expectations due to multiple factors. Furthermore, semantic search in these platforms is usually restricted to certain file formats, which means if you're using anything less common, support is often unavailable.
So, how do you approach creating your own semantic search?
Step-by-step Guide to Building Semantic Search
Obtain Data via API

First, determine how you can access data through APIs from your chosen system.
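For illustration, the data-access step can be written against an injected page fetcher, so it is testable without a live system. The endpoint shape (`results` and `has_more` keys) and the function name are hypothetical, not a real Confluence or SharePoint API:

```python
from typing import Callable, Iterator

def fetch_all_documents(get_page: Callable[[int], dict]) -> Iterator[dict]:
    """Iterate over every document in a paginated API.

    `get_page` is any callable that takes a page number and returns a parsed
    JSON dict with the (hypothetical) keys "results" and "has_more". In
    production it would wrap an HTTP call against your system's real API.
    """
    page = 0
    while True:
        data = get_page(page)
        yield from data["results"]
        if not data.get("has_more"):
            break
        page += 1

# Usage with a stubbed client (two pages of fake documents):
fake_pages = [
    {"results": [{"id": 1}, {"id": 2}], "has_more": True},
    {"results": [{"id": 3}], "has_more": False},
]
docs = list(fetch_all_documents(lambda p: fake_pages[p]))
```

Separating pagination logic from the HTTP client like this also makes it easy to swap systems later without touching the rest of the pipeline.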
Identify Data Format

Check the format of your data, whether it's PDF, plain text, Microsoft formats, or something else.

Convert Data to Plain Text
Find ways to convert your data to plain text. Many programming languages, including Python, offer numerous libraries for this purpose.

Understand Data Structure
Analyze how your data is structured (pages, chapters, metadata, etc.). A 1,000-page document, for example, typically won't be stored as a single block of text but will be structured in some meaningful way.

Divide Data into Meaningful Segments
Break down your data into logical sections, such as chapters or other coherent segments.

Define Vectorization Strategy
Establish how you'll segment documents for vectorization. Keep in mind that embedding models (e.g., OpenAI's) accept inputs of up to approximately 8,192 tokens (around 13 pages of A4 text), which is usually far too coarse for effective semantic search. Conversely, overly small segments lose their semantic context, since semantic search targets meaning rather than exact word matches. Paragraphs typically provide a good balance.

Set Reasonable Text Overlapping
When dividing text, overlap is important: without it, related sentences can end up split across chunks, diminishing search quality. A good overlap is typically around 20-50%.

Standard Practice (OpenAI)
OpenAI standardly segments text into chunks of 800 tokens (~3,000 characters) with an overlap of 400 tokens.

Determine Optimal Segment Length
There's no universally optimal length; it depends entirely on your specific use case and how semantic search will be employed.

Choose Vectorization Tools
Select your vectorization tool carefully. Options range from open-source models to commercial products, each offering different levels of performance, speed, language support, and specialization.
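As an illustration of calling such a tool, here is a minimal wrapper written against the OpenAI-style interface (`client.embeddings.create(...)` returning items with an `.embedding` attribute). The function name is my own; the model name is one of OpenAI's embedding models, and any client object with the same shape works:

```python
def embed_batch(client, texts: list[str],
                model: str = "text-embedding-3-small") -> list[list[float]]:
    """Vectorize a batch of text chunks.

    `client` is anything exposing the OpenAI-style call
    `client.embeddings.create(model=..., input=...)`, whose response has a
    `.data` list of items each carrying an `.embedding` vector. In
    production, `client` would be an `openai.OpenAI()` instance.
    """
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```

Sending chunks in batches (the API accepts a list of inputs) is significantly faster than issuing one request per chunk.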
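The segmentation and overlap guidance above can be sketched as a simple chunker. `chunk_text` is a name of my own, and it approximates tokens by characters (roughly 4 characters per token), so 800 tokens with a 400-token (50%) overlap becomes 3,000 characters with a 1,500-character overlap; a production pipeline would count real tokens with a tokenizer such as tiktoken:

```python
def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 1500) -> list[str]:
    """Split `text` into overlapping chunks.

    Sizes are in characters as a stand-in for tokens (~800 tokens is roughly
    3,000 characters, and a 1,500-character overlap mirrors OpenAI's
    400-token overlap mentioned above).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghij" * 1000)  # a 10,000-character sample text
```

A real implementation would also prefer to cut at paragraph or sentence boundaries rather than at a fixed character offset, for the semantic-context reasons discussed above.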
Select Your Vector Database
Choose a database based on the volume of data processed. Vector databases typically grow larger than the original indexed data. PostgreSQL with the pg_vector extension works well for typical scenarios (tens of GB); larger datasets may benefit from dedicated vector databases such as Milvus.

Design Result Presentation
Plan how you will present search results to users. A typical search result might look like:

{ "id": 1234, "text": "text chunk from the file", "file": "somename.txt", "page": 11 }

Choose Your Vector Database Tooling
Determine how you'll interpret search results:

- Similarity calculation methods:
  - Cosine similarity
  - Euclidean distance
  - Dot product
- Tools for similarity searches (faiss, pg_vector, etc.)
- Decide whether results will be presented directly or fed into subsequent processes.
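The three similarity measures listed above have short definitions. In practice a library such as faiss or pg_vector computes them at scale, but a plain-Python sketch makes the differences concrete:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Dot product: larger means more similar (for comparable magnitudes)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
doc_a = [2.0, 0.0]  # same direction as the query
doc_b = [0.0, 3.0]  # orthogonal to the query
```

Cosine similarity is just the dot product of length-normalized vectors, so for embedding models that return unit-length vectors (OpenAI's do), dot product and cosine similarity produce identical rankings.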
Conclusion
Implementing semantic search is not a trivial task. It involves careful consideration of multiple parameters and choices. If you have further questions or need help implementing semantic search solutions, feel free to contact me.