Embeddings – The Core of LLMs
Whether you're working with a simple open-source AI model or the most advanced systems for processing and generating text, they all share one fundamental principle: text is converted into numerical vectors (embeddings), and meaning is compared by searching for the most similar vectors in a database. The result is then processed back into "human language."
How does it work? Let's demonstrate with a simple Python script:
```python
from openai import OpenAI
import re

def normalize_text(text):
    # Lowercase and collapse runs of whitespace into single spaces
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

client = OpenAI(api_key='sk-proj-....')  # replace with your own API key

def get_embedding2(text, model="text-embedding-3-small"):
    normalized_text = normalize_text(text)
    return client.embeddings.create(input=[normalized_text], model=model).data[0].embedding

emb = get_embedding2("Some text")
print(emb)
```
In this script, we first normalize the text by converting it to lowercase and collapsing extra whitespace. Then, using the OpenAI client with your API key, we create an embedding (a 1536-dimensional vector of floating-point numbers) for the normalized text with the specified model. The resulting vector captures the semantic features of the text and is ready to be used for similarity searches or further processing.
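To see what such a similarity search looks like in practice, here is a minimal sketch that builds on the get_embedding2 function above. It assumes the numpy package is installed, and the example sentences are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = get_embedding2("A cat sits on the mat")
emb_b = get_embedding2("A kitten is lying on the rug")
emb_c = get_embedding2("Quarterly revenue grew by 12 percent")

# Related sentences should score noticeably higher than unrelated ones
print(cosine_similarity(emb_a, emb_b))  # expected: relatively high
print(cosine_similarity(emb_a, emb_c))  # expected: much lower
```

Cosine similarity ranges from -1 to 1: the closer the score is to 1, the more semantically similar the two texts are. This pairwise comparison is exactly what a vector database performs at scale across millions of stored embeddings.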
This simple example highlights the underlying process that powers many large language models today. Enjoy exploring and experimenting with embeddings!