When you are creating a Retrieval-Augmented Generation (RAG) pipeline, the first step is to load the data and split it. In the post Document Loaders in LangChain With Examples we saw the different types of document loaders provided by LangChain. In this article we’ll look at the different text splitters LangChain provides to break loaded documents into smaller, manageable chunks.
Why do we need Text Splitters
The documents you load using document loaders may be very large, and it is impractical to send the content of the whole document to the LLM to get relevant answers. Text splitters in LangChain help in breaking large documents into smaller, manageable chunks that models can process efficiently without losing context. They help overcome context window limits, improve retrieval accuracy, and enable better indexing and semantic understanding. Here are some of the benefits of splitting documents.
- Context window limit- LLMs have a maximum token limit. Feeding an entire book or long document will exceed this limit. By splitting documents into smaller, semantically coherent chunks, you can select only the relevant chunks to send to the LLM instead of the entire document.
- Token Efficiency- If you send the entire document (without any splitting), the LLM has to process every token, even irrelevant ones. That inflates cost and slows response time. With splitting + retrieval, only the relevant chunks are injected into the prompt. This means fewer tokens are consumed, lowering the overall cost.
- Efficient Retrieval in RAG Pipelines- One of the steps in creating a RAG pipeline is to store the loaded documents in vector databases. Splitting documents into smaller chunks and storing those chunks (rather than the whole document as is) improves search granularity and ensures the right passage is retrieved from the vector DB.
- Maintaining Semantic Coherence- There are TextSplitter classes in LangChain that don’t just cut text arbitrarily, they try to preserve contextual meaning. For example, splitting by paragraphs or semantic boundaries avoids breaking sentences mid-thought.
Splitting at natural boundaries (sentences, paragraphs, sections) keeps ideas intact. That ultimately helps LLM to interpret the context correctly without guessing missing parts. This reduces hallucinations and increases factual accuracy.
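The core idea of chunking with overlap can be sketched in a few lines of plain Python. This is a simplified illustration only; LangChain's splitters additionally respect separators and preserve metadata:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Naive fixed-size chunking with overlap.

    Simplified illustration only; real splitters also break
    at natural boundaries such as paragraphs and sentences.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # Step forward, keeping `overlap` characters of shared context
        start += chunk_size - overlap
    return chunks

chunks = chunk_text("a" * 250, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```

The overlap means consecutive chunks share some text, so a sentence cut near a boundary is still fully present in at least one chunk.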
Text splitters in LangChain
LangChain offers a variety of text splitters, each suited to different kinds of content and use cases.
1. CharacterTextSplitter
One of the simplest text-splitting utilities in LangChain. It divides text using a specified character sequence (default: "\n\n" meaning paragraph), with chunk length measured by the number of characters.
Instead of cutting arbitrarily at the exact character count, the splitter looks for the nearest separator before the limit, so chunks end at natural boundaries (paragraphs, sentences, etc.) and meaning is preserved. The chunk size is the maximum number of characters allowed in each chunk. For example, if chunk_size=1000, each chunk will contain up to 1000 characters. The splitter tries to fill the chunk up to this limit, but breaks at the nearest separator to avoid cutting mid-paragraph or mid-sentence.
CharacterTextSplitter is best for documents with a consistent and predictable structure, such as logs or lists where a single separator (like a newline) clearly defines boundaries.
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
Parameters-
- separator: Used to identify split points. The default is "\n\n" (double newline), which aims to preserve paragraph integrity.
- chunk_size: The maximum number of characters allowed in a single chunk.
- chunk_overlap: The number of characters that consecutive chunks should share. This helps maintain semantic context across splits.
- length_function: A function used to calculate the length of the chunks, defaulting to the standard Python len().
Methods that you can use-
- .split_text- use when you just have raw strings (plain text); it returns a list of plain string chunks.
- .split_documents- use when your text is already wrapped inside LangChain Document objects. If you have used a document loader in LangChain to load documents, you will have them as Document objects. In that case, you use split_documents to break them into smaller Document chunks while preserving metadata.
LangChain CharacterTextSplitter Example
In the code below, a space (" ") is used as the separator instead of the default "\n\n".
from langchain_text_splitters import CharacterTextSplitter
# Sample text to split
text = """
Generative AI is a type of artificial intelligence that creates new, original content—such as text, images, video, audio, or code—by learning patterns from existing data. Unlike traditional AI that classifies or analyzes data, GenAI uses deep learning models to generate novel outputs that resemble the training data.
Key Aspects of Generative AI:
How it Works: These models (e.g., GANs, Transformers) are trained on massive datasets to understand underlying structures and probabilities. When prompted, they predict and generate new, human-like content.
"""
# Create a CharacterTextSplitter instance
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20, separator=" ")
# Split the text into chunks
chunks = text_splitter.split_text(text)
print(f"Total chunks created: {len(chunks)}\n")
# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
Output
Total chunks created: 7
Chunk 1: Generative AI is a type of artificial intelligence that creates new, original content—such as text,
Chunk 2: as text, images, video, audio, or code—by learning patterns from existing data. Unlike traditional
Chunk 3: Unlike traditional AI that classifies or analyzes data, GenAI uses deep learning models to generate
Chunk 4: models to generate novel outputs that resemble the training data. Key Aspects of Generative
Chunk 5: of Generative AI: How it Works: These models (e.g., GANs, Transformers) are trained on massive
Chunk 6: trained on massive datasets to understand underlying structures and probabilities. When prompted,
Chunk 7: When prompted, they predict and generate new, human-like content.
2. RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is the recommended default text splitter for generic text in LangChain. It splits documents by recursively working through a list of separators until the resulting chunks are within a specified size limit. The default list of separators is ["\n\n", "\n", " ", ""]
- "\n\n": double newline (paragraphs)
- "\n": single newline (lines)
- " ": space (words)
- "": empty string (individual characters)
How RecursiveCharacterTextSplitter Works
Instead of using a single separator, it uses a hierarchical list to preserve semantic context (paragraphs -> lines -> words -> characters):
- It first attempts to split the text by the first character in its list (default is double newline \n\n for paragraphs).
- Recursive Fallback: If any resulting chunk still exceeds the chunk_size, it moves to the next separator (e.g., single newline \n) and tries again only on that chunk.
- Continue in the hierarchy: It repeats this process through the list (e.g., spaces then finally individual characters "") until the size requirement is met.
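The recursive fallback described above can be sketched in plain Python. This is a simplified illustration; the real RecursiveCharacterTextSplitter also merges small pieces back together up to chunk_size and applies overlap:

```python
def recursive_split(text, separators, chunk_size):
    """Split text with the first separator; recurse with the
    remaining separators on any piece still over chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

sample = "para one\n\nline a\nline b\n\n" + "x" * 30
parts = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=10)
print(parts)
```

Notice how the short paragraph survives intact, the over-long paragraph falls back to line splits, and the unbroken 30-character run is hard-cut only as a last resort.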
LangChain RecursiveCharacterTextSplitter Example
In this example, first a PDF document is loaded using PyPDFLoader, then RecursiveCharacterTextSplitter is used to split it. The code assumes the PDF document is inside the resources folder, which resides in the project root.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
def get_file_path(file_name):
    # Current script directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    # Project root is one level above
    project_root = os.path.dirname(script_dir)
    #print(f"Project root directory: {project_root}")
    file_path = os.path.join(project_root, "resources", file_name)
    return file_path

def load_documents(file_name):
    file_path = get_file_path(file_name)
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    print(f"Number of Documents: {len(documents)}")
    return documents

def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"length of chunks {len(chunks)}")
    for i, chunk in enumerate(chunks[:3]):  # first 3 chunks
        # Chunk lengths
        print(f"Chunk {i+1} length: {len(chunk.page_content)}")
        # Chunk content
        #print(f"Chunk {i+1}:\n{chunk.page_content}...\n")
        # Chunk metadata
        #print(f"Chunk {i+1} metadata: {chunk.metadata}")

if __name__ == "__main__":
    documents = load_documents("Health Insurance Policy Clause.pdf")
    split_documents(documents)
Output
Number of Documents: 41
length of chunks 139
Chunk 1 length: 914
Chunk 2 length: 913
Chunk 3 length: 983
3. Code Text Splitter
Though LangChain provides language-specific code splitter classes like PythonCodeTextSplitter for Python, the recommended approach is to use the RecursiveCharacterTextSplitter.from_language() method. Supported languages are stored in the langchain_text_splitters.Language enum. You pass a value from the enum into RecursiveCharacterTextSplitter.from_language() to instantiate a splitter that is tailored for that specific language. Here’s an example for Python code:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import Language
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
print(python_docs)
Output
[Document(metadata={}, page_content='def hello_world():\n print("Hello, World!")'), Document(metadata={}, page_content='# Call the function\nhello_world()')]
Note that in the above example, the create_documents() method is used. This method does both tasks in one go: wrapping raw text in Document objects and splitting them.
4. TokenTextSplitter
The TokenTextSplitter class in LangChain is used to divide text into smaller chunks based on a specific number of tokens rather than characters.
Since LLMs have strict token-based context window limits, this class ensures chunks don’t exceed the model’s maximum token limit.
How TokenTextSplitter Works
Raw text to tokens
The splitter first converts your text into tokens using the specified tokenizer encoding (for example, the encodings used by OpenAI models such as GPT-3.5, GPT-4, or the embedding models).
Chunking by token count
You specify chunk_size and chunk_overlap in terms of tokens. The splitter groups tokens into chunks of the given size, with overlap applied at the token level.
Convert tokens back
Each chunk of tokens is decoded back into a string. The result is a list of text chunks that align with token boundaries. By tokenizing first, the splitter ensures each chunk is within the desired token budget.
from langchain_text_splitters import TokenTextSplitter
text = """
Generative AI is a type of artificial intelligence that creates new, original content—such as text, images, video, audio, or code—by learning patterns from existing data. Unlike traditional AI that classifies or analyzes data, GenAI uses deep learning models to generate novel outputs that resemble the training data.
Key Aspects of Generative AI:
How it Works: These models (e.g., GANs, Transformers) are trained on massive datasets to understand underlying structures and probabilities. When prompted, they predict and generate new, human-like content.
"""
#cl100k_base is a tokenizer encoding provided by OpenAI’s tiktoken library.
text_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=100,
    chunk_overlap=20
)
chunks = text_splitter.split_text(text)
print(f"total chunks {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
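The tokenize, group, and decode steps described above can be sketched in plain Python, with a simple whitespace "tokenizer" standing in for tiktoken (illustration only; real tokens are subword units, not whole words):

```python
def token_chunks(tokens, chunk_size, chunk_overlap):
    """Group a token list into fixed-size windows that share
    chunk_overlap tokens with the previous window."""
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Whitespace split stands in for a real tokenizer here
tokens = "one two three four five six seven eight".split()
groups = token_chunks(tokens, chunk_size=4, chunk_overlap=1)
for group in groups:
    print(" ".join(group))
```

Because grouping happens on the token list before decoding back to text, every chunk is guaranteed to fit the token budget regardless of how many characters it contains.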
Apart from these classes, LangChain has some specialized classes for splitting specific document types.
- Splitting JSON- RecursiveJsonSplitter splits JSON data while allowing control over chunk sizes.
- Splitting Markdown- MarkdownTextSplitter attempts to split the text along Markdown-formatted headings.
- Splitting HTML- LangChain provides three different text splitters that you can use to split HTML content effectively:
- HTMLHeaderTextSplitter- Splits HTML text based on header tags (e.g., <h1>, <h2>, <h3>, etc.), and adds metadata for each header relevant to any given chunk.
- HTMLSectionSplitter- Splits HTML into sections based on specified tags.
- HTMLSemanticPreservingSplitter- Splits HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components.
That's all for this topic Text Splitters in LangChain With Examples. If you have any doubt or any suggestions to make please drop a comment. Thanks!