Chunking

Level 1: Fixed Size Chunking
This is the crudest and simplest method of segmenting text. It breaks the text into chunks of a specified number of characters, regardless of content or structure.
The LangChain and LlamaIndex frameworks offer the CharacterTextSplitter (https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter) and SentenceSplitter (https://docs.llamaindex.ai/en/stable/api/llama_index.node_parser.SentenceSplitter.html; defaults to splitting on sentences) classes for this chunking technique. A few parameters to remember (see the usage sketch after this list):
How the text is split: by a single character separator
How the chunk size is measured: by the number of characters
chunk_size: the maximum number of characters in each chunk
chunk_overlap: the number of characters that overlap between sequential chunks, keeping duplicate data across chunk boundaries
separator: the character(s) on which the text is split (defaults to "\n\n" in LangChain's CharacterTextSplitter)
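A minimal usage sketch with LangChain's CharacterTextSplitter; the sample text, chunk size, and overlap values below are arbitrary illustrations, not recommendations:

from langchain.text_splitter import CharacterTextSplitter

text = "RAG systems split documents into chunks before embedding them. " * 20

splitter = CharacterTextSplitter(
    separator=" ",      # character(s) to split on
    chunk_size=100,     # maximum characters per chunk
    chunk_overlap=20,   # characters duplicated across neighbouring chunks
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[0])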

Level 2: Recursive Chunking
While fixed-size chunking is easy to implement, it doesn't consider the structure of the text. Recursive chunking offers an alternative.
In this method, we divide the text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn't produce chunks of the desired size, the method recursively calls itself on the resulting chunks with a different separator until the desired chunk size is achieved.
The LangChain framework offers the RecursiveCharacterTextSplitter class (https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), which splits text using the default separators "\n\n", "\n", " ", "" (https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L842).
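A comparable sketch with RecursiveCharacterTextSplitter; the separators list below simply restates the defaults and could be omitted, and the sample text and sizes are illustrative:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Recursive chunking first tries to split on paragraphs.\n\n"
    "If a piece is still too large, it falls back to lines, then words, "
    "and finally individual characters.\n\n"
) * 5

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in this order until chunks fit
    chunk_size=150,
    chunk_overlap=20,
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[0])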

Level 3: Document Based Chunking
In this chunking method, we split a document based on its inherent structure. This approach respects the flow and structure of the content, but it may not be as effective for documents that lack a clear structure.
For instance, a legal document might be chunked by individual charges, with each charge treated as a chunk. This method maintains the document's structural integrity and ensures that no important legal context is lost. 
A chunk here typically corresponds to a structural unit such as a chapter or a section.
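One common instance of document-based chunking is splitting Markdown along its own headings. A minimal sketch using LangChain's MarkdownHeaderTextSplitter; the heading labels and the sample document are illustrative:

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "# Chapter 1\nIntro text.\n## Section 1.1\nDetails of the first section."

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "chapter"), ("##", "section")],
)
docs = splitter.split_text(markdown_text)  # one piece per section, with heading metadata
print(docs[0].metadata, docs[0].page_content)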

Level 4: Semantic Chunking
The three levels above deal with the content and structure of documents and require maintaining a constant chunk size. Semantic chunking instead aims to extract semantic meaning from embeddings and then assess the semantic relationship between chunks. The core idea is to keep semantically similar chunks together.
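A minimal sketch of the idea rather than any specific library API: embed is assumed to be any sentence-embedding function returning unit-normalised vectors, and adjacent sentences are grouped while their cosine similarity stays above a threshold.

import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75):
    # `embed` is a placeholder for any sentence-embedding model returning unit vectors.
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        similarity = float(np.dot(prev_vec, vec))  # cosine similarity for unit vectors
        if similarity >= threshold:
            current.append(sentence)          # still semantically similar: extend the chunk
        else:
            chunks.append(" ".join(current))  # semantic break: start a new chunk
            current = [sentence]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

Libraries such as LangChain (the experimental SemanticChunker) and LlamaIndex (SemanticSplitterNodeParser) ship variants of this approach.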

Level 5: Agentic Chunking
This chunking strategy explores using an LLM to determine how much and what text should be included in a chunk, based on the context.
To generate initial chunks, it uses the concept of propositions from the paper at https://arxiv.org/pdf/2312.06648.pdf, which extracts standalone statements from a raw piece of text. LangChain provides a propositional-retrieval template (https://templates.langchain.com/new?integration_name=propositional-retrieval) to implement this.
After the propositions are generated, they are fed to an LLM-based agent. The agent determines whether a proposition should be added to an existing chunk or whether a new chunk should be created.
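An illustrative sketch of that agent loop (not the LangChain template itself); llm_decide is a hypothetical callable that asks an LLM which existing chunk, if any, a proposition belongs to:

def agentic_chunking(propositions, llm_decide):
    chunks = []  # each chunk is a list of semantically related propositions
    for proposition in propositions:
        index = llm_decide(chunks, proposition)  # hypothetical LLM call: existing chunk index or None
        if index is None:
            chunks.append([proposition])         # the agent decided to open a new chunk
        else:
            chunks[index].append(proposition)    # the agent assigned it to an existing chunk
    return chunks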


Introducing Contextual Retrieval
The context conundrum in traditional RAG
In traditional RAG, documents are typically split into smaller chunks for efficient retrieval. While this approach works well for many applications, it can lead to problems when individual chunks lack sufficient context.

For example, imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"

A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.
The chunk does not explicitly mention ACME Corp; it only says "The company." Additional contextual information therefore needs to be attached to the chunk, and that context can be generated with an LLM using a prompt like the following:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
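A minimal sketch of applying this prompt per chunk, assuming the official anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model name, token limit, and the choice to prepend the generated context to the chunk before embedding are illustrative, not prescribed by the source:

import anthropic

CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(whole_document: str, chunk_content: str) -> str:
    # Ask the model for a short situating context for this chunk.
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative model choice
        max_tokens=200,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(
            whole_document=whole_document, chunk_content=chunk_content)}],
    )
    context = response.content[0].text
    # Prepend the generated context so the embedded chunk carries its own context.
    return context + "\n\n" + chunk_content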

 
