In this article, we’ll explore LangChain Document Loaders and how they fit into the Retrieval-Augmented Generation (RAG) pipeline. LangChain provides specific modules for each of the four core RAG steps:
- Data ingestion (Load & split): Use document loaders (e.g., PyPDFLoader, WebBaseLoader) to import your data and text splitters to break it into smaller, manageable chunks.
- Indexing (Embed & store): LangChain provides wrappers for embedding models (like OpenAI or Hugging Face) and vector stores (like FAISS, Chroma, or Pinecone) to store your data as searchable vectors.
- Retrieval: The retriever component identifies and fetches the most relevant document chunks based on a user’s query.
- Generation: Chains or LCEL (LangChain Expression Language) combine the retrieved context with the user’s prompt and send it to an LLM to generate the final response.
LangChain document loaders
As you can see, the first step is to load the documents. Document loaders provide a standard interface for reading data from different sources and file formats. These sources can be Slack, Google Drive, Confluence, GitHub, etc. There are classes to load data from text files, PDFs, Word documents, CSV files, web pages, etc.
The documents are loaded in the form of Document objects that can then be used by other components like text splitters, embeddings, vector stores, LLMs, etc.
Note that Document is also a class in LangChain which stores the text content of the document (page_content) and associated metadata (file name, source, page number etc.).
All the document loader classes implement the BaseLoader interface. Each document loader may define its own parameters, but they share a common API:
- load() – Loads all documents at once.
- lazy_load() – Streams documents lazily; useful for large datasets.
Popular document loaders
LangChain offers over 200 integrations for different data types. We can categorize these loaders based on functionality. See the full list of document loaders here: https://docs.langchain.com/oss/python/integrations/document_loaders
- File based:
- TextLoader: Reads simple .txt files.
- PyPDFLoader: Extracts text from PDFs page by page.
- CSVLoader: Converts each row of a CSV into a separate document.
- Web based:
- WebBaseLoader: Uses BeautifulSoup to scrape and extract text from URLs.
- Unstructured: Uses the Unstructured library to load and parse web pages.
- Directory-Level:
- DirectoryLoader: Automatically detects and loads all files in a folder using appropriate sub-loaders.
- Social platforms:
- Twitter: Fetches the text from the tweets of a list of Twitter users, using the tweepy Python package.
- Reddit: Fetches the text from the posts of subreddits or Reddit users, using the praw Python package.
In this article we’ll see examples of some of the most frequently used document loaders in LangChain.
1. Text Loader
The TextLoader class is used to load text files. It is primarily for files with a .txt extension, but can also be used for Markdown files (.md) or code files (e.g., .py, .js, .html).
LangChain TextLoader example
Suppose I have a file "genai.txt" under the resources folder in the project root directory. To keep the code portable across platforms (using relative paths), the file path is constructed using os.path.
from langchain_community.document_loaders import TextLoader
import os
# Current script directory
script_dir = os.path.dirname(os.path.abspath(__file__))
# Project root is one level above
project_root = os.path.dirname(script_dir)
print(f"Project root directory: {project_root}")
loader = TextLoader(os.path.join(project_root, "resources", "genai.txt"), encoding="utf-8")
documents = loader.load()
print(f"Number of Documents: {len(documents)}")
print(f"Type of Documents: {type(documents)}")
# Print first 500 characters of the first document
print(f"Content of first Document: {documents[0].page_content[:500]}...")
print(f"Metadata of first Document: {documents[0].metadata}")
Output
Number of Documents: 1
Type of Documents: <class 'list'>
Content of first Document: Generative AI is a type of artificial intelligence that creates new, original content—such as text, images, video, audio, or code—by learning patterns from existing data. Unlike traditional AI that classifies or analyzes data, GenAI uses deep learning models to generate novel outputs that resemble the training data.
Key Aspects of Generative AI:
How it Works: These models (e.g., GANs, Transformers) are trained on massive datasets to understand underlying structures and probabilities. When ...
Metadata of first Document: {'source': 'D:\\Training content\\Python Training Content\\PythonML\\agent\\langchaindemos\\resources\\genai.txt'}
Points to note here:
- TextLoader is imported from langchain_community.document_loaders package.
- An object of the TextLoader class is created, passing it the path of the file. The encoding is also passed to ensure correct handling of special characters (accents, symbols, non-English scripts, emojis); explicitly setting encoding="utf-8" avoids decoding errors and garbled text.
- If your file is small, you’ll typically get only one Document, which is the case here. So documents[0].page_content will contain the full text of the file.
2. PDF Loader
LangChain provides many different PDF loader classes for loading PDF files. Some of the classes with their use cases are given below.
| Loader | Best Use Case | Description |
|---|---|---|
| PyPDFLoader | Simple PDFs with mostly text | Uses pypdf under the hood. Fast and lightweight, but can struggle with complex layouts, tables, or images. |
| PDFPlumberLoader | PDFs with structured layouts (tables, columns, forms) | Built on pdfplumber. Better at preserving layout and extracting tabular data. Slightly slower than PyPDF. |
| PyPDFDirectoryLoader | Batch loading multiple PDFs in a directory | Wraps PyPDFLoader for convenience. Ideal when you have a corpus of PDFs to ingest at once. |
| PyMuPDFLoader | Complex PDFs with mixed content (images, annotations, multi-column text) | Uses PyMuPDF. More powerful parsing, can handle embedded images and metadata. Good for research papers or scanned docs. |
| UnstructuredPDFLoader | Messy, scanned, or semi-structured PDFs | Uses the unstructured library. Best when PDFs are inconsistent, contain scanned text, or need aggressive cleaning. Often produces more reliable text for downstream NLP. |
Let’s start with a simple example using PyPDFLoader. It requires the pypdf package, so install it with pip install pypdf, or add pypdf to your project’s requirements.txt and run pip install -r requirements.txt.
from langchain_community.document_loaders import PyPDFLoader
import os
# Current script directory
script_dir = os.path.dirname(os.path.abspath(__file__))
# Project root is one level above
project_root = os.path.dirname(script_dir)
print(f"Project root directory: {project_root}")
loader = PyPDFLoader(os.path.join(project_root, "resources", "Health Insurance Policy Clause.pdf"))
documents = loader.load()
print(f"Number of Documents: {len(documents)}")
print(f"Type of Documents: {type(documents)}")
Output
Number of Documents: 41
Type of Documents: <class 'list'>
Note that PyPDFLoader creates one Document object per page of the PDF. In this example, a PDF of almost 1 MB with 41 pages was loaded, resulting in 41 Document objects (indices 0-40, one per page).
- Content: the text of each page is stored in documents[i].page_content.
- Metadata: each Document also carries metadata like the page number and source file path.
3. CSV loader
The CSVLoader class in LangChain is used to load CSV data, creating one Document per row.
LangChain CSVLoader example
from langchain_community.document_loaders import CSVLoader
import os
# Current script directory
script_dir = os.path.dirname(os.path.abspath(__file__))
# Project root is one level above
project_root = os.path.dirname(script_dir)
print(f"Project root directory: {project_root}")
file_path = os.path.join(project_root, "resources", "50_Startups.csv")
loader = CSVLoader(file_path)
documents = loader.load()
print(f"Number of Documents: {len(documents)}")
print(f"Type of Documents: {type(documents)}")
# One document per row in the CSV file
print(f"Content of first Document: {documents[0].page_content}")
print(f"Metadata of first Document: {documents[0].metadata}")
Output
Number of Documents: 50
Type of Documents: <class 'list'>
Content of first Document: R&D Spend: 165349.2
Administration: 136897.8
Marketing Spend: 471784.1
State: New York
Profit: 192261.83
Metadata of first Document: {'source': 'D:\\Training content\\Python Training Content\\PythonML\\agent\\langchaindemos\\resources\\50_Startups.csv', 'row': 0}
Since there are 50 records in the CSV file, 50 Document objects are created, one for each row.
4. Web page loader
To load web pages, LangChain provides the WebBaseLoader class, which loads all text from HTML web pages into the Document format.
In order to use WebBaseLoader, apart from the langchain-community Python package, you also need to install the beautifulsoup4 package.
To load a web page, pass it to WebBaseLoader.
loader = WebBaseLoader("https://www.example.com/")
If you want to load multiple web pages, you can pass in a list of URLs.
loader_multiple_pages = WebBaseLoader(
["https://www.example.com/", "https://google.com"]
)
LangChain WebBaseLoader example
from langchain_community.document_loaders import WebBaseLoader
# Example URL to load
url1 = "https://www.netjstech.com/2026/04/runablepassthrough-langchain-examples.html"
url2 = "https://www.netjstech.com/2026/04/runnableparallel-in-langchain-example.html"
loader = WebBaseLoader([url1, url2])
documents = loader.load()
print(f"Number of Documents: {len(documents)}")
Output
Number of Documents: 2
Let’s feed this data to an LLM to get our queries answered based on the text loaded from the web page. By loading the web page data, we can provide context to the LLM.
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.messages import SystemMessage
import os
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
load_dotenv()
def load_text_from_url(url: str):
    loader = WebBaseLoader(url)
    documents = loader.load()
    return documents

def generate_response(user_input: str) -> str:
    document = load_text_from_url("https://www.netjstech.com/2026/04/runablepassthrough-langchain-examples.html")
    system_message = SystemMessage(content="You are a helpful assistant that responds to user queries based on the provided context and nothing else.")
    human_message = HumanMessagePromptTemplate.from_template("Based on the given context: {context}, answer the question: {user_input}")
    prompt = ChatPromptTemplate.from_messages([system_message, human_message])
    model = ChatOllama(model="llama3.1")
    chain = prompt | model | StrOutputParser()
    response = chain.invoke({"user_input": user_input, "context": document[0].page_content})
    return response

if __name__ == "__main__":
    response = generate_response("What is the main topic of the article and what are the key points discussed?")
    print(response)
Output
The main topic of the article is "RunablePassthrough in LangChain With Examples". The article discusses the concept of `RunnablePassthrough` in LangChain, a library used for building chain-based models. Here are the key points discussed: 1. **What is RunablePassthrough**: It's a simple runnable that returns its input unchanged, useful when you want to preserve the original input alongside other computed values. 2. **Example use case**: Preserving the original question in a RAG (Retrieval-Augmented Generation) pipeline to be used later in the prompt construction. 3. **Code example**: Demonstrating how to use `RunnablePassthrough` in a chain-based model, specifically with Pinecone vector store and OpenAI embeddings. 4. **`.assign` method**: Explaining how to add extra static or computed fields to the passthrough output using `.assign`, such as adding metadata like timestamp to the prompt. Overall, the article provides an introduction to `RunnablePassthrough` in LangChain and demonstrates its usage with examples.
5. Directory loader
The DirectoryLoader class in LangChain is used to load all documents from a directory.
Key parameters that you can pass while creating object of the DirectoryLoader:
- path: The path to the directory to load from.
- glob: A pattern to filter which files to load (e.g., **/*.pdf for all PDFs including subfolders).
- loader_cls: The specific LangChain Loader to use for each file; defaults to UnstructuredFileLoader.
- use_multithreading: Set to True to speed up loading when dealing with many files.
- silent_errors: If True, the loader will skip files that fail to load instead of raising an exception.
For example, to load all Markdown files from the data directory using the TextLoader class:
loader = DirectoryLoader(
'./data',
glob="**/*.md",
loader_cls=TextLoader,
use_multithreading=True
)
LangChain DirectoryLoader example
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
import os
print(os.getcwd())
# Load all .pdf files from the specified directory
loader = DirectoryLoader("./langchaindemos/resources", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
# Check the number of documents loaded
print(f"Loaded {len(documents)} documents.")
That's all for this topic Document Loaders in LangChain With Examples. If you have any doubt or any suggestions to make please drop a comment. Thanks!