Text Splitting

See the full code for this tutorial here.

In this tutorial we are reviewing the 5 Levels Of Text Splitting. This is an unofficial list put together for fun and educational purposes.

Ever try to put a long piece of text into ChatGPT, only to be told it's too long? Or tried to give your application better long-term memory, but it's still not quite working?

One of the most effective strategies to improve the performance of your language model applications is to split your large data into smaller pieces. This is called splitting or chunking (we'll use these terms interchangeably). In multi-modal applications, splitting also applies to images.

We are going to cover a lot, but if you make it to the end, I guarantee you’ll have a solid grasp on chunking theory, strategies, and resources to learn more.

Levels Of Text Splitting

Level 1: Character Splitting - Simple static character chunks of data
Level 2: Recursive Character Text Splitting - Recursive chunking based on a list of separators
Level 3: Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic Splitting - Embedding walk based chunking
Level 5: Agentic Splitting - Experimental method of splitting text with an agent-like system. Worth considering if you believe that token cost will trend to $0.00
*Bonus Level:* Alternative Representation Chunking + Indexing - Derivative representations of your raw text that will aid in retrieval and indexing
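
To make Level 1 concrete before we dig in, here is a minimal sketch of static character chunking in plain Python. It is not LangChain's or Llama Index's implementation. The function name split_by_characters and the parameters chunk_size and chunk_overlap are hypothetical names picked for this illustration.

```python
def split_by_characters(text: str, chunk_size: int = 35, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows, ignoring content and structure.

    Illustrative only: the name and parameters are assumptions for this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so we never emit a trailing chunk that only repeats overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "This is the text I would like to chunk up. It is the example text for this exercise."
    for chunk in split_by_characters(sample, chunk_size=35, chunk_overlap=4):
        print(repr(chunk))
```

Notice that the cut points fall wherever the character count says they should, even in the middle of a word. That weakness is what the higher levels try to address.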

Notebook resources:

Video Overview - Walkthrough of this code with commentary
ChunkViz.com - Visual representation of chunk splitting methods
RAGAS - Retrieval evaluation framework

This tutorial was created with ❤️ by Greg Kamradt. MIT license, attribution is always welcome.

This tutorial will use code from LangChain (pip install langchain) & Llama Index (pip install llama-index)

Evaluations

It's important to test your chunking strategies in retrieval evals. It doesn't matter how you chunk if the performance of your application isn't great.

Eval Frameworks:

I'm not going to demo evals for each method because success is domain specific. The arbitrary eval that I pick may not be suitable for your data. If anyone is interested in collaborating on a rigorous evaluation of different chunking strategies, please reach out (contact@dataindependent.com).

If you only walk away from this tutorial with one thing, have it be The Chunking Commandment.

The Chunking Commandment: Your goal is not to chunk for chunking's sake. Your goal is to get your data into a format where it can be retrieved for value later.
