RAG Series (5): Embedding Models — The Core of Semantic Understanding

## Why Does Switching Embedding Models Make Such a Huge Difference?

In the first four articles, we built the RAG pipeline, tuned parameters, and mastered chunking strategies. But there's one question we haven't dug into: after your documents are chunked, how do they become vectors?

This process is called Embedding. It transforms human-readable text into machine-computable vectors. The choice of Embedding model directly determines:

- Whether "apple" and "iPhone" are recognized as related
- Whether "database connection pool exhausted" and "Too many connections" match
- Whether Chinese idioms, technical jargon, and abbreviations are properly understood

This article explains how Embedding works, compares mainstream models, and runs a head-to-head retrieval comparison between OpenAI and BGE on real Chinese documents.

## What Is Embedding?

### One-Sentence Explanation

Embedding is a function that takes a piece of text and outputs a fixed-length numerical vector (e.g., 1024 dimensions). Semantically similar texts produce vectors that are close together in that space.

### Why Can Vectors Represent Meaning?

Imagine placing all words in a multi-dimensional space:

- "King" and "Queen" are close together
- "Apple (fruit)" and "Banana" are close together
- "Apple (company)" and "Google" are close together
- "Apple (fruit)" and "Apple (company)" are far apart

Embedding models learn these semantic distances through pre-training on massive text corpora. When you ask "How do I restart my iPhone?", the model knows "iPhone" relates to Apple the company, not apple the fruit.

### Its Role in RAG

```
User Query     → Embedding Model → Query Vector                  ↘
                                                                   Vector Similarity → Top-K Retrieval
Document Chunk → Embedding Model → Document Vector (precomputed) ↗
```

Embedding is the semantic bridge of RAG. Without it, retrieval is limited to keyword matching (like Ctrl+F). With it, you get semantic matching that understands synonyms, paraphrases, and context.

## Mainstream Embedding Model Comparison

### Model Overview

| Model | Vendor | Dimensions | Language Strength | Deployment | Characteristics |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Multilingual | API | Cheap, fast, good for general use |
| text-embedding-3-large | OpenAI | 3072 | Multilingual | API | High accuracy, expensive, handles complex semantics |
| BAAI/bge-large-zh-v1.5 | BAAI | 1024 | Chinese | API/Local | Top Chinese performance, open-source, free |
| BAAI/bge-m3 | BAAI | 1024 | Multilingual | API/Local | 100+ languages, lightweight |
| embed-multilingual-v3.0 | Cohere | 1024 | Multilingual | API | Good for long texts |
| E5-mistral-7b-instruct | Microsoft | 4096 | Multilingual | Local | Instruction-based, strong but heavy |

### Key Metric: The MTEB Leaderboard

MTEB (Massive Text Embedding Benchmark) is the "college entrance exam" of Embedding models. It evaluates models on 50+ datasets across a range of tasks.

How to read the MTEB leaderboard:

1. Visit the MTEB Leaderboard
2. Focus on **Retrieval Average**, the metric most relevant to RAG
3. Check **Model Size**: larger models are slower but usually more accurate

Key findings from the leaderboard:

- English: OpenAI text-embedding-3-large dominates, but text-embedding-3-small offers exceptional value
- Chinese: the BGE series (especially bge-large-zh-v1.5) often outperforms OpenAI, and it's open-source and free
- Multilingual: bge-m3 and Cohere embed-multilingual-v3.0 stand out

💡 Rule of thumb: English → OpenAI, Chinese → BGE, Multilingual → bge-m3, Long text → Cohere.
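To make "close together in space" concrete, here is a minimal sketch that embeds three sentences and compares them pairwise with cosine similarity. It reuses the BGE-via-SiliconFlow setup shown later in this article; the sample sentences are illustrative, numpy is assumed to be installed, and the exact scores will vary by model.

```python
import os

import numpy as np
from langchain_openai import OpenAIEmbeddings

# Illustrative sketch: embed three sentences and compare them pairwise.
# Assumes a SiliconFlow API key in SILICONFLOW_API_KEY; any OpenAI-compatible
# Embedding endpoint works the same way.
embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("SILICONFLOW_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    va, vb = np.asarray(a), np.asarray(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))


texts = [
    "How do I restart my iPhone?",
    "Steps to reboot an Apple phone",
    "Recipes for apple pie",
]
vecs = embeddings.embed_documents(texts)

print(cosine(vecs[0], vecs[1]))  # expected higher: same meaning, different words
print(cosine(vecs[0], vecs[2]))  # expected lower: "apple" here is the fruit
```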
## Practical: OpenAI vs BGE Retrieval Showdown on Chinese Documents

### Experimental Design

We use the same Chinese technical document from Article 4 (the microservices architecture guide), generate embeddings with both OpenAI and BGE, and test retrieval quality on the same set of queries.

### Code: Switching Embedding Models with One Change

LangChain's `OpenAIEmbeddings` class works with any OpenAI-compatible Embedding API (including SiliconFlow, Zhipu, Ollama, etc.), so switching models only requires changing a few configuration lines:

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# --- Official OpenAI ---
openai_embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
)

# --- BGE (via SiliconFlow) ---
bge_embed = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key="sk-...",  # SiliconFlow API key
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,  # SiliconFlow batch size limit: 32
)

# --- Use in the RAG pipeline ---
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=bge_embed,  # Change only this line to switch models
)
```

### Evaluation Query Set

We designed five queries covering different difficulty levels:

| Query | Expected Content | Difficulty |
|---|---|---|
| Q1: "What are the principles of microservice decomposition?" | Section 1.1: DDD | Easy |
| Q2: "What's the difference between REST and gRPC?" | Section 2.1: REST vs gRPC | Easy |
| Q3: "How to solve distributed transactions?" | Section 3.2: Saga pattern | Medium |
| Q4: "How to roll back a failed order?" | Saga compensation operations | Hard (requires reasoning) |
| Q5: "How to monitor microservices?" | Section 4: Observability | Easy |

### Results Comparison

| Query | OpenAI text-embedding-3-small | BGE-large-zh-v1.5 | Analysis |
|---|---|---|---|
| Q1: Decomposition principles | ✅ #1 hit | ✅ #1 hit | Tie |
| Q2: REST vs gRPC | ✅ #1 hit | ✅ #1 hit | Tie |
| Q3: Distributed transactions | ✅ #1 hit | ✅ #1 hit | Tie |
| Q4: Order rollback | ⚠️ #3 hit | ✅ #1 hit | BGE wins: stronger semantic link between "rollback" and "compensation" |
| Q5: Monitoring | ✅ #1 hit | ✅ #1 hit | Tie |

Conclusions:

- For simple queries (direct keyword matches), both models perform similarly.
- For difficult queries (semantic reasoning required), BGE's Chinese advantage is clear, especially on synonyms and paraphrases.

### Cost Comparison

| Model | Price (per million tokens) | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | Extremely cheap |
| OpenAI text-embedding-3-large | $0.13 | Expensive but strong |
| BGE-large-zh-v1.5 (SiliconFlow) | ¥0.007 (~$0.001) | Cheapest |

If you have a GPU, BGE can also be deployed locally for free (details below).

## Local Deployment vs API Calls: How to Choose?

### API Calls: Pros and Cons

Pros:

- Zero ops, one line of code
- Model versions auto-update
- Pay-per-use, no idle costs

Cons:

- Data leaves your domain (a compliance risk for sensitive docs)
- Network latency and rate limits
- Costs accumulate with high-frequency usage

### Local Deployment: Pros and Cons

Pros:

- Data never leaves your premises; you keep full control over security
- No rate limits, ideal for high-frequency batch processing
- More economical over time (one-time GPU investment)

Cons:

- Requires a GPU (BGE-large needs 4 GB+ VRAM)
- Operational complexity (model downloads, version management, serving)
- Slow initial loading (model size: hundreds of MB to several GB)

### Decision Tree

```
Is your data sensitive?
├─ Yes → Local deployment (BGE or GTE)
└─ No  → Is call volume high?
         ├─ Yes → Local deployment (saves money long-term)
         └─ No  → API calls (simpler)

Primarily Chinese? → BGE (SiliconFlow / local)
Primarily English? → OpenAI text-embedding-3-small
```
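For completeness, here is a minimal sketch of how the side-by-side ranking comparison above could be scripted. It assumes the `chunks`, `openai_embed`, and `bge_embed` objects from the earlier snippet are in scope; the `expected` keywords are illustrative stand-ins for the sections in the query table, and the exact ranks will depend on your chunking settings.

```python
from langchain_community.vectorstores import Chroma

# Minimal comparison harness (sketch). Assumes `chunks`, `openai_embed`,
# and `bge_embed` from the snippets above; the `expected` substrings are
# illustrative markers for the sections each query should retrieve.
queries = {
    "What are the principles of microservice decomposition?": "DDD",
    "How to roll back a failed order?": "compensation",
}

for name, embed in [("openai", openai_embed), ("bge", bge_embed)]:
    store = Chroma.from_documents(
        chunks, embedding=embed, collection_name=f"cmp_{name}"
    )
    for query, expected in queries.items():
        docs = store.similarity_search(query, k=5)
        # Rank of the first retrieved chunk containing the marker (1-based)
        rank = next(
            (i + 1 for i, d in enumerate(docs) if expected in d.page_content),
            None,
        )
        print(f"{name:<7} {query[:30]:<32} rank={rank}")
```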
## Special Considerations for Chinese Embedding

### 1. Tokenization Differences

English Embedding models typically tokenize on whitespace, but Chinese has no spaces between words. A model that isn't optimized for Chinese might segment "南京市长江大桥" as 南京市长 / 江大桥 ("Nanjing's mayor / Jiang Daqiao") instead of 南京市 / 长江大桥 ("Nanjing City / Yangtze River Bridge").

**BGE's advantage:** trained specifically on Chinese corpora, with tokenization and semantic understanding optimized for Chinese.

### 2. Idioms and Colloquialisms

| Query | Expected Match | English Model | BGE |
|---|---|---|---|
| "杀鸡取卵" (kill the hen to get the eggs) | Short-sighted behavior | ❌ Often mismatches | ✅ Correct match |
| "亡羊补牢" (mend the fold after the sheep are lost) | Remedy after the fact | ❌ Often mismatches | ✅ Correct match |

### 3. Domain Terminology

Technical documents contain extensive jargon (e.g., "Saga pattern", "two-phase commit", "eventual consistency"). BGE, trained on data from Chinese technical communities, typically understands these terms better than general English models.

## Code Walkthrough: A Model-Switching Wrapper

To make model switching easy in your project, create a factory function:

```python
import os

from langchain_openai import OpenAIEmbeddings


def build_embeddings(provider: str = "bge"):
    """
    Factory function: returns the appropriate Embedding model based on config.

    provider: "openai" | "bge" | "local"
    """
    if provider == "openai":
        return OpenAIEmbeddings(
            model="text-embedding-3-small",
            api_key=os.getenv("OPENAI_API_KEY"),
        )
    elif provider == "bge":
        return OpenAIEmbeddings(
            model="BAAI/bge-large-zh-v1.5",
            api_key=os.getenv("SILICONFLOW_API_KEY"),
            base_url="https://api.siliconflow.cn/v1",
            chunk_size=32,
        )
    elif provider == "local":
        # Requires: pip install sentence-transformers
        from langchain_community.embeddings import HuggingFaceEmbeddings

        return HuggingFaceEmbeddings(
            model_name="BAAI/bge-large-zh-v1.5",
            model_kwargs={"device": "cuda"},  # or "cpu"
            encode_kwargs={"normalize_embeddings": True},
        )
    else:
        raise ValueError(f"Unknown provider: {provider}")


# Usage: one argument to switch
embeddings = build_embeddings("bge")  # Change this argument to switch models
```

## Local BGE Deployment (Optional)

If you have a GPU, local deployment is simple:

```bash
pip install sentence-transformers
```

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Test
result = embeddings.embed_query("Testing Chinese Embedding")
print(f"Vector dimensions: {len(result)}")  # 1024
```

The first run auto-downloads the model (~1.2 GB), then caches it locally.
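If you want to spot-check the idiom claims from the table above, you can reuse the local `embeddings` object from the previous snippet. This is a rough sketch: the candidate sentences are made up for illustration, and actual scores depend on the model version.

```python
import numpy as np

# Quick spot-check of idiom matching (sketch). Reuses the local `embeddings`
# object from the snippet above. Since normalize_embeddings=True, vectors are
# unit-length, so the dot product equals cosine similarity.
idiom = "杀鸡取卵"
candidates = [
    "为了短期利益牺牲长远发展",  # short-sighted: sacrifice the future for quick gains
    "一种家禽的养殖方法",        # unrelated: a method of raising poultry
]

q = np.asarray(embeddings.embed_query(idiom))
for text, vec in zip(candidates, embeddings.embed_documents(candidates)):
    print(f"{text}: {q @ np.asarray(vec):.3f}")
# A Chinese-optimized model should score the first candidate clearly higher.
```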
## Summary and Quick Reference

### Core Takeaways

1. Embedding is the semantic bridge of RAG: choosing the wrong model directly hurts retrieval accuracy
2. English → OpenAI, Chinese → BGE: validated by both MTEB rankings and real-world tests
3. Simple queries show little difference, complex semantic queries show large gaps: BGE excels at synonyms, idioms, and terminology
4. Switching models takes one line of code: LangChain's abstraction makes model swapping cost-free

### Embedding Model Quick Selection Guide

| Scenario | Recommended Model | Deployment | Reasoning |
|---|---|---|---|
| Chinese technical docs | BGE-large-zh-v1.5 | API/Local | Top Chinese performance |
| English general docs | text-embedding-3-small | API | Best value |
| English, high accuracy | text-embedding-3-large | API | Best quality, but expensive |
| Mixed multilingual | bge-m3 | API/Local | 100+ language support |
| Data must stay on-premise | BGE-large-zh-v1.5 | Local | 4 GB VRAM is sufficient |
| Long text (>8K tokens) | Cohere embed-multilingual-v3.0 | API | Optimized for long texts |

## References

- MTEB Leaderboard: authoritative Embedding model rankings
- BGE official GitHub: BGE-series models and documentation
- SiliconFlow Embedding API
- Cohere Embed documentation
