Tool for Easy and Efficient Text Vectorization

19.03.2025

If you're involved in application development or data analytics, you've likely come across the idea of "vectorizing" or "embedding" text content. This process converts text into vector form, enabling computers to better understand the meaning behind words and sentences. It's essential for semantic search, recommendation systems, automatic summarization, and research tasks. 📚🔍

While GitHub has many examples demonstrating how to perform vectorization, they often lack flexibility, especially detailed options for configuring how documents are split into chunks, how overlaps between text blocks are set, how the resulting vectors are stored efficiently, and how to search through them. 🤔

That's why I've decided to create a tool to simplify these tasks and provide users with plenty of configuration options. 🛠️

Technologies and Tools 💻

To start, I've chosen OpenAI's API because of the quality of its vector representations and its robust documentation. Over time, I plan to expand the tool to other popular solutions to ensure maximum versatility.
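
To give a rough idea of what this step looks like, here is a minimal sketch of embedding a single chunk with the openai Python client. The model name text-embedding-3-small is just an example choice, not a final decision:

```python
# Minimal sketch: embedding one text chunk with the OpenAI API.
# Assumes the OPENAI_API_KEY environment variable is set; the model
# name below is an example choice, not a final decision.
from openai import OpenAI

client = OpenAI()

def embed_chunk(text: str) -> list[float]:
    """Return the embedding vector for a single text chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model
        input=text,
    )
    return response.data[0].embedding

vector = embed_chunk("An example sentence to vectorize.")
print(len(vector))  # dimensionality of the embedding, 1536 for this model
```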

For storing the resulting vectors, I'll use PostgreSQL with the pgvector extension. Based on community experience, this solution offers an excellent compromise between ease of use and performance. Plus, everyone uses PostgreSQL, right? 😉
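
As a quick illustration of the search side, this is roughly what a nearest-neighbour query looks like with pgvector's cosine-distance operator. The connection string and the document_chunks table are placeholders of mine (the table itself is sketched in the next section):

```python
# Minimal sketch: enabling pgvector and running a cosine-distance search.
# Connection string, table name, and column names are illustrative placeholders.
import psycopg

conn = psycopg.connect("dbname=embeddings user=postgres")

# Enable the extension once per database.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")

# A real query vector must have the same dimension as the stored embeddings;
# this short one is only for illustration.
query_vector = "[0.1,0.2,0.3]"

rows = conn.execute(
    """
    SELECT file_name, page_number, text_chunk
    FROM document_chunks
    ORDER BY embedding <=> %s::vector  -- <=> is pgvector's cosine-distance operator
    LIMIT 5
    """,
    (query_vector,),
).fetchall()
conn.commit()
```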

How Will We Store Vectors? 📦

Determining the exact position of relevant text in documents can be challenging because file formats and structures vary widely. To elegantly solve this, I've decided to automatically convert all input documents to PDF format, leveraging its clear pagination feature. Each text chunk will be associated with a page number, acting as a natural and universal "pointer." 📄🔖
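
As a sketch of this step: once a document has been converted to PDF, the text can be pulled out page by page so that every chunk can later carry its page number. I'm using pypdf here purely for illustration; the final choice of library is still open:

```python
# Minimal sketch: extracting text page by page from a converted PDF,
# so each chunk can later be tagged with its page number.
# pypdf is an assumption here; the final tool may use a different library.
from pypdf import PdfReader

def pages_with_text(pdf_path: str) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs, with pages numbered from 1."""
    reader = PdfReader(pdf_path)
    return [
        (number, page.extract_text() or "")
        for number, page in enumerate(reader.pages, start=1)
    ]

for number, text in pages_with_text("example.pdf"):
    print(number, len(text))
```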

The resulting database structure we'll use is:

  • File Name – clear identification of the document 🗂️

  • Page Number – quick orientation and reference within the document 📍

  • Text Chunk – specific content section 📜

  • Embedding Vector – numeric vector representing the meaning of the text chunk 🔢

This structure is versatile enough to suit most common usage scenarios.
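
In SQL terms, that structure could look roughly like the table below. The table and column names, and the 1536-dimension embedding column (matching OpenAI's text-embedding-3-small), are my illustrative assumptions:

```python
# Minimal sketch: the table described above, created via psycopg.
# Assumes the pgvector extension is already enabled in the database.
import psycopg

conn = psycopg.connect("dbname=embeddings user=postgres")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS document_chunks (
        id           bigserial PRIMARY KEY,
        file_name    text         NOT NULL,  -- clear identification of the document
        page_number  integer      NOT NULL,  -- reference within the document
        text_chunk   text         NOT NULL,  -- specific content section
        embedding    vector(1536) NOT NULL   -- numeric vector for the chunk
    )
    """
)
conn.commit()
```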

Overview of the Entire Process 🎯

  • Directory with Input Files 📁

  • PDF Creation and Pagination 📄

  • Text Extraction 📑

  • Splitting into Chunks of Chosen Length with Overlap ✂️ (see the sketch after this list)

  • Chunk Vectorization using Chosen Tool 🧮

  • Creating a New Database Record 🗃️

  • Process Logging 📝

  • Embedding Indexing 🔍
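
To make the chunking, record-creation, and indexing steps more concrete, here is a rough sketch that ties them together. It reuses the hypothetical embed_chunk and pages_with_text helpers from the earlier sketches, and the chunk size, overlap, and index type are arbitrary example values:

```python
# Rough sketch of the chunking, record-creation, and indexing steps.
# Assumes the embed_chunk() and pages_with_text() helpers from the earlier
# sketches and the placeholder document_chunks table; all settings below
# (chunk size, overlap, index type) are illustrative only.
import psycopg

def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks where consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

conn = psycopg.connect("dbname=embeddings user=postgres")

for page_number, page_text in pages_with_text("example.pdf"):
    for chunk in split_with_overlap(page_text):
        if not chunk.strip():
            continue  # skip empty chunks, e.g. from blank pages
        vector = embed_chunk(chunk)
        vector_literal = "[" + ",".join(str(x) for x in vector) + "]"
        conn.execute(
            """
            INSERT INTO document_chunks (file_name, page_number, text_chunk, embedding)
            VALUES (%s, %s, %s, %s::vector)
            """,
            ("example.pdf", page_number, chunk, vector_literal),
        )

# Index the embeddings for fast similarity search. HNSW with cosine distance
# is shown as one example; IVFFlat is another common pgvector option.
conn.execute(
    "CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx "
    "ON document_chunks USING hnsw (embedding vector_cosine_ops)"
)
conn.commit()
```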

You can see exactly how the entire process works in the attached diagram below. If you have additional ideas to improve the system, please let me know! 💡

The tool is currently undergoing heavy testing; I plan to release it within the next 2-3 days.

Looking forward to your feedback! 😊