In today’s world of AI and large language models, one of the most common challenges developers face is handling text that exceeds a model’s context window. Ollama, while powerful for running local language models, is bound by the same limitation: the models it serves can only attend to a fixed number of tokens at a time. This comprehensive guide explores advanced chunking techniques for processing large documents with Ollama while maintaining coherence and context.
Understanding Chunking in the Context of Ollama
Chunking is the process of dividing large text into smaller, manageable segments that fit within a model’s token limit. Ollama, which provides access to models like Llama, Mistral, and others, has specific token limitations depending on the model you’re using. Effective chunking isn’t just about breaking text apart—it’s about doing so intelligently to preserve meaning across segments.
Why Advanced Chunking Matters for Ollama
When working with Ollama, proper chunking techniques become essential for several reasons:
- Context Window Constraints: Most models accessible through Ollama work with context windows in the 2K to 8K token range, limiting how much text they can process at once (a request-level example of setting the window follows this list).
- Memory Efficiency: Even if a model technically supports larger contexts, processing smaller chunks can reduce RAM usage, allowing Ollama to run smoothly on machines with limited resources.
- Coherence Across Chunks: Without proper chunking strategies, the model might lose the thread of thought between segments, resulting in disjointed or contradictory outputs.
- Processing Efficiency: Well-designed chunking allows for parallel processing and can significantly reduce the time needed to handle large documents.
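To make the first point concrete, the context window Ollama actually uses for a request can be set explicitly through the `num_ctx` option. The snippet below is a minimal sketch, assuming a local Ollama server on the default port with the `llama2` model already pulled; adjust the model name and window size for your setup.

```python
import requests

# Minimal sketch: ask Ollama to use a 4096-token context window for this request.
# Assumes a local Ollama server at the default address and that "llama2" is pulled.
payload = {
    "model": "llama2",
    "prompt": "Summarize the following text: ...",
    "stream": False,
    "options": {"num_ctx": 4096}  # context window size, in tokens
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
response.raise_for_status()
print(response.json()["response"])
```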
Advanced Chunking Strategies for Ollama
Let’s explore several sophisticated chunking approaches that go beyond basic text splitting.
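Before diving in, it helps to see the baseline these approaches improve on: naive fixed-size splitting simply slices the text every N words regardless of meaning. A minimal sketch, using word counts as a stand-in for tokens:

```python
def fixed_size_chunking(text, max_words=1000):
    """Naive baseline: split text into fixed-size word windows, ignoring meaning."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

The strategies that follow improve on this by respecting sentence boundaries, document structure, and overlap between chunks.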
1. Semantic Chunking
Rather than chunking based solely on character or token count, semantic chunking divides text based on meaning and context.
```python
import nltk
import numpy as np
import spacy
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# Download the sentence tokenizer data and load a SpaCy model with word vectors
nltk.download("punkt", quiet=True)
nlp = spacy.load("en_core_web_md")

def semantic_chunking(text, max_tokens=1000, overlap=100):
    """Split text into chunks, carrying semantically similar sentences forward as overlap."""
    # Break into sentences first
    sentences = sent_tokenize(text)

    # Get sentence embeddings for similarity comparisons
    sentence_embeddings = [nlp(sentence).vector for sentence in sentences]

    # Track token counts (approximate: whitespace word count)
    token_counts = [len(sentence.split()) for sentence in sentences]

    chunks = []
    current_chunk = []          # list of (sentence_index, sentence) pairs
    current_token_count = 0

    for i, sentence in enumerate(sentences):
        # If adding this sentence would exceed our limit, close the current chunk
        if current_token_count + token_counts[i] > max_tokens and current_chunk:
            chunks.append(" ".join(s for _, s in current_chunk))

            if overlap > 0:
                # Find the sentences in the closed chunk most similar to the upcoming sentence
                chunk_indices = [idx for idx, _ in current_chunk]
                chunk_embs = [sentence_embeddings[idx] for idx in chunk_indices]
                similarities = cosine_similarity([sentence_embeddings[i]], chunk_embs)[0]
                n_overlap = max(1, int(overlap / 10))  # heuristic: ~1 sentence per 10 overlap tokens
                keep = np.argsort(similarities)[-n_overlap:]

                # Carry the most similar sentences into the new chunk as overlap
                current_chunk = [(chunk_indices[k], sentences[chunk_indices[k]]) for k in sorted(keep)]
                current_token_count = sum(token_counts[idx] for idx, _ in current_chunk)
            else:
                current_chunk = []
                current_token_count = 0

        current_chunk.append((i, sentence))
        current_token_count += token_counts[i]

    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(" ".join(s for _, s in current_chunk))

    return chunks
```
This approach ensures that semantically related content stays together, providing Ollama with more coherent chunks to process.
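A quick way to sanity-check the chunker is to run it over a document and inspect the approximate size of each chunk. The file path below is just a placeholder:

```python
# Hypothetical usage: chunk a local text file and report approximate chunk sizes
with open("report.txt", "r", encoding="utf-8") as f:  # placeholder path
    text = f.read()

chunks = semantic_chunking(text, max_tokens=1000, overlap=100)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: ~{len(chunk.split())} words")
```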
2. Hierarchical Chunking
Hierarchical chunking creates a tree-like structure where larger documents are first divided into major sections, then subsections, and finally into token-sized chunks.
```python
import re

def hierarchical_chunking(document, max_tokens=1000):
    # First level: split on top-level Markdown headers ("# Title")
    sections = re.split(r'^# .+\n', document, flags=re.MULTILINE)

    # Second level: split each section on sub-headers ("## Subtitle")
    subsections = []
    for section in sections:
        if not section.strip():
            continue
        subsecs = re.split(r'^## .+\n', section, flags=re.MULTILINE)
        subsections.extend(s for s in subsecs if s.strip())

    # Final level: split subsections into token-sized chunks
    # (word count is used here as a rough proxy for tokens)
    final_chunks = []
    for subsection in subsections:
        words = subsection.split()
        for i in range(0, len(words), max_tokens):
            chunk = ' '.join(words[i:i + max_tokens])
            if chunk.strip():
                final_chunks.append(chunk)

    return final_chunks
```
This method is particularly useful for processing structured documents like academic papers or technical documentation with Ollama.
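One caveat with the sketch above is that `re.split` on the header pattern discards the header text itself. If you want each chunk to carry its section title, which is often useful context for Ollama, a small variant that keeps headers attached to their sections is sketched below; it assumes the same Markdown header conventions.

```python
import re

def split_keeping_headers(document, pattern=r'^#{1,2} .+$'):
    """Split a Markdown document at headers, keeping each header with its section."""
    # The capturing group keeps the matched headers in the result list
    pieces = re.split(f'({pattern})', document, flags=re.MULTILINE)
    sections, current = [], ""
    for piece in pieces:
        if re.match(pattern, piece, flags=re.MULTILINE):
            if current.strip():
                sections.append(current.strip())
            current = piece + "\n"  # start a new section with its header
        else:
            current += piece
    if current.strip():
        sections.append(current.strip())
    return sections
```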
3. Sliding Window Chunking with Context Retention
This advanced technique maintains continuity by creating overlapping windows of text:
```python
def sliding_window_chunking(text, window_size=800, stride=600, context_size=200):
    """
    Process text using a sliding window approach that maintains context.

    - window_size: the main processing window size (in words, as a token proxy)
    - stride: how far to move the window for each chunk (smaller than window_size creates overlap)
    - context_size: how much previous context to include with each chunk
    """
    words = text.split()
    chunks = []

    for i in range(0, len(words), stride):
        if i == 0:
            # The first chunk has no previous context
            chunk = " ".join(words[i:i + window_size])
        else:
            # Calculate how much previous context to include
            context_start = max(0, i - context_size)
            context_part = words[context_start:i]
            new_part = words[i:i + window_size - len(context_part)]

            # Mark where previous context ends and new content begins
            chunk = (
                "--- PREVIOUS CONTEXT ---\n" +
                " ".join(context_part) +
                "\n--- NEW CONTENT ---\n" +
                " ".join(new_part)
            )

        if chunk:
            chunks.append(chunk)

        # Stop once the window has covered the remaining words
        if i + window_size >= len(words):
            break

    return chunks
```
This approach is particularly effective for narrative text where continuity between chunks is critical for Ollama to maintain the flow of ideas.
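Keep in mind that the window, stride, and context sizes here are measured in words as a rough token proxy, so leave some headroom relative to the model’s real limit. A brief usage sketch with a placeholder input file:

```python
# Hypothetical usage: overlapping windows over a long narrative document
with open("novel.txt", "r", encoding="utf-8") as f:  # placeholder path
    long_text = f.read()

chunks = sliding_window_chunking(long_text, window_size=800, stride=600, context_size=200)
print(f"Produced {len(chunks)} overlapping chunks")
print(chunks[1][:200])  # later chunks begin with the "--- PREVIOUS CONTEXT ---" marker
```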
Implementing Advanced Chunking with Ollama
Now let’s see how we can apply these chunking strategies with Ollama’s API for practical use cases:
```python
import requests

def process_with_ollama(chunks, model="llama2", system_prompt=None):
    """
    Process a list of text chunks with Ollama.
    """
    responses = []

    # Base URL for the local Ollama API
    url = "http://localhost:11434/api/generate"

    for i, chunk in enumerate(chunks):
        # Create a metadata-rich prompt so the model knows where this chunk sits
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"

        # Prepare the request payload
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }

        # Add a system prompt if provided
        if system_prompt:
            payload["system"] = system_prompt

        # Make the API call
        try:
            response = requests.post(url, json=payload)
            response.raise_for_status()  # Check for HTTP errors

            # Extract and store the response
            result = response.json()
            responses.append(result["response"])
            print(f"Processed chunk {i+1}/{len(chunks)}")
        except Exception as e:
            print(f"Error processing chunk {i+1}: {str(e)}")
            responses.append(f"Error: {str(e)}")

    return responses
```
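Tying the pieces together, a typical call chunks the document first and then hands the chunks to Ollama with a task-specific system prompt. The summarization prompt below is only an example:

```python
# Hypothetical end-to-end usage: chunk a document, then summarize each chunk
chunks = semantic_chunking(document_text, max_tokens=1500, overlap=150)  # document_text: your loaded text
summaries = process_with_ollama(
    chunks,
    model="llama2",
    system_prompt="Summarize each chunk in 3-5 bullet points."
)
print("\n\n".join(summaries))
```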
Advanced Example: Document Analysis with Context Maintenance
Let’s create a more complex workflow that uses semantic chunking for document analysis while maintaining context between chunks:
```python
def analyze_document_with_ollama(document_path, model="llama2:13b"):
    """
    Analyze a large document by:
    1. Reading the document
    2. Creating semantic chunks
    3. Processing each chunk while maintaining context
    4. Synthesizing a coherent analysis
    """
    # Read the document
    with open(document_path, 'r', encoding='utf-8') as f:
        document = f.read()

    # Create semantic chunks
    print("Creating semantic chunks...")
    chunks = semantic_chunking(document, max_tokens=1800, overlap=200)
    print(f"Document divided into {len(chunks)} semantic chunks")

    # Process each chunk with Ollama
    system_prompt = """
    You are analyzing a document that has been divided into chunks.
    For each chunk:
    1. Identify key points, arguments, and evidence
    2. Note how these connect to previous chunks if applicable
    3. Maintain a coherent understanding of the document as it progresses
    """

    print("Processing chunks with Ollama...")
    chunk_analyses = process_with_ollama(chunks, model=model, system_prompt=system_prompt)

    # Create a final synthesis prompt
    synthesis_prompt = "Below are analyses of different sections of a document:\n\n"
    for i, analysis in enumerate(chunk_analyses):
        synthesis_prompt += f"SECTION {i+1} ANALYSIS:\n{analysis}\n\n"

    synthesis_prompt += """
    Based on these section analyses, provide a comprehensive synthesis of the entire document.
    Include:
    1. The main thesis or argument
    2. Key supporting points and evidence
    3. Any significant counterarguments or limitations
    4. Overall evaluation of the document's effectiveness
    """

    # Process the synthesis with Ollama
    print("Creating final synthesis...")
    synthesis_payload = {
        "model": model,
        "prompt": synthesis_prompt,
        "stream": False
    }
    response = requests.post("http://localhost:11434/api/generate", json=synthesis_payload)
    response.raise_for_status()
    synthesis = response.json()["response"]

    return {
        "num_chunks": len(chunks),
        "chunk_analyses": chunk_analyses,
        "synthesis": synthesis
    }
```
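Calling the workflow is then a one-liner; the file path and model tag below are placeholders for whatever you have pulled locally:

```python
# Hypothetical usage of the full analysis workflow
result = analyze_document_with_ollama("whitepaper.md", model="llama2:13b")
print(f"Chunks analyzed: {result['num_chunks']}")
print(result["synthesis"])
```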
Advanced Chunking Considerations for Ollama
Token Estimation with Different Models
Different Ollama models have varying tokenization methods. Here’s a simple utility to help estimate token counts across models:
```python
def estimate_tokens(text, model_type="llama2"):
    """
    Estimate token count for different Ollama models.
    """
    # Average ratios of tokens to characters for different model families
    # These are approximations and will vary
    token_ratios = {
        "llama2": 0.25,   # ~4 characters per token
        "mistral": 0.23,  # ~4.3 characters per token
        "mpt": 0.22,      # ~4.5 characters per token
        "falcon": 0.26    # ~3.8 characters per token
    }

    ratio = token_ratios.get(model_type.lower(), 0.25)  # Default to the llama2 ratio

    # Simple estimation based on character count
    return int(len(text) * ratio)
```
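Character-ratio estimates are fine for budgeting, but when you need exact counts you can run the model’s actual tokenizer. Below is a sketch using the Hugging Face `transformers` library; the tokenizer repository id is an assumption, and you should swap in the tokenizer that matches the model you actually run in Ollama.

```python
from transformers import AutoTokenizer

# Assumption: counting with a Llama-family tokenizer pulled from the Hugging Face Hub.
# Replace the repo id with the tokenizer matching your Ollama model.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def count_tokens_exact(text):
    """Exact token count according to the loaded tokenizer."""
    return len(tokenizer.encode(text, add_special_tokens=False))

print(count_tokens_exact("Chunking keeps local models within their context window."))
```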
Handling Code and Technical Content
Code and technical content require special chunking considerations:
```python
def chunk_code_document(document, max_tokens=1800):
    """
    Specialized chunking for technical documents with Markdown code blocks.
    """
    # Split the document on fenced code blocks, keeping the blocks themselves
    parts = re.split(r'(```[\w]*\n[\s\S]*?\n```)', document)

    chunks = []
    current_chunk = ""
    current_token_est = 0

    for part in parts:
        # If this is a code block, try to keep it intact
        is_code_block = part.startswith('```') and part.endswith('```')
        part_token_est = estimate_tokens(part)

        # If adding this part would exceed our limit, start a new chunk
        if current_token_est + part_token_est > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = ""
            current_token_est = 0

        # If it's a code block that alone exceeds the token limit, we need to split it
        if is_code_block and part_token_est > max_tokens:
            # Save any accumulated content first
            if current_chunk:
                chunks.append(current_chunk)
                current_chunk = ""
                current_token_est = 0

            # Split the code by lines, preserving the syntax-highlighting language
            lang_match = re.match(r'```([\w]*)\n', part)
            code_lang = lang_match.group(1) if lang_match else ""
            code_content = part[3 + len(code_lang):-3].strip()
            code_lines = code_content.split('\n')

            code_chunks = []
            current_code_chunk = f"```{code_lang}\n"
            current_code_tokens = estimate_tokens(current_code_chunk)

            for line in code_lines:
                line_tokens = estimate_tokens(line + '\n')
                if current_code_tokens + line_tokens > max_tokens - 100:  # leave room for the closing ```
                    current_code_chunk += "```"
                    code_chunks.append(current_code_chunk)
                    current_code_chunk = f"```{code_lang}\n{line}\n"
                    current_code_tokens = estimate_tokens(current_code_chunk)
                else:
                    current_code_chunk += line + '\n'
                    current_code_tokens += line_tokens

            # Add the last code chunk if it contains more than the opening fence
            if current_code_chunk != f"```{code_lang}\n":
                current_code_chunk += "```"
                code_chunks.append(current_code_chunk)

            chunks.extend(code_chunks)
        else:
            # Regular text or a small code block
            current_chunk += part
            current_token_est += part_token_est

    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```
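A quick check after chunking a technical document is to confirm that every chunk stays within budget and that fenced blocks survived intact. For example, with a placeholder filename:

```python
# Hypothetical check: chunk a README and verify estimated sizes and balanced fences
with open("README.md", "r", encoding="utf-8") as f:  # placeholder path
    doc = f.read()

for i, chunk in enumerate(chunk_code_document(doc), start=1):
    balanced = chunk.count("```") % 2 == 0
    print(f"Chunk {i}: ~{estimate_tokens(chunk)} tokens, fences balanced: {balanced}")
```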
Parallel Processing with Ollama
For large documents, you can process multiple chunks in parallel to save time:
```python
import concurrent.futures

def process_chunks_in_parallel(chunks, model="llama2", max_workers=4):
    """
    Process multiple chunks in parallel with Ollama.
    """
    def process_chunk(chunk_data):
        i, chunk = chunk_data
        url = "http://localhost:11434/api/generate"
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }
        try:
            response = requests.post(url, json=payload)
            response.raise_for_status()
            return response.json()["response"]
        except Exception as e:
            return f"Error processing chunk {i+1}: {str(e)}"

    results = [None] * len(chunks)

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all chunks for processing
        future_to_index = {
            executor.submit(process_chunk, (i, chunk)): i
            for i, chunk in enumerate(chunks)
        }

        # Collect results as they complete, preserving the original chunk order
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as e:
                results[index] = f"Error: {str(e)}"

    return results
```
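Usage mirrors the sequential helper. Note that whether requests actually run concurrently depends on how your Ollama server is configured (recent versions expose an `OLLAMA_NUM_PARALLEL` setting), so treat the worker count as something to tune rather than a guarantee:

```python
# Hypothetical usage: fan four chunks at a time out to the local Ollama server
analyses = process_chunks_in_parallel(chunks, model="llama2", max_workers=4)
for i, analysis in enumerate(analyses, start=1):
    print(f"--- Chunk {i} ---\n{analysis}\n")
```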
Best Practices for Ollama Chunking
Based on extensive testing with various Ollama models, here are some best practices:
- Retain Document Structure: When possible, align chunk boundaries with natural document divisions like paragraphs, sections, or sentences.
- Context Windows: Use a smaller effective window size than the model’s maximum to leave room for the model’s response (see the sketch after this list).
- Model-Specific Tuning:
  - Llama models generally perform better with slightly smaller chunks (1500-1800 tokens)
  - Mistral models can often handle larger coherent chunks (2000+ tokens)
  - Adjust based on your specific model
- Metadata Enhancement: Include metadata in each chunk that indicates its position and relationship to other chunks.
- Adaptive Chunking: Consider the content type—code, technical text, and narrative content may benefit from different chunking strategies.
- System Prompts: Use clear system prompts to tell Ollama how to handle chunked content.
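For the context-window point above, a simple budget calculation helps pick a chunk size. The reserve values here are assumptions to adjust for your own system prompts and expected response length:

```python
def effective_chunk_budget(num_ctx=4096, response_reserve=512, prompt_overhead=128):
    """
    Rough token budget for chunk content, leaving room for the system prompt,
    chunk metadata, and the model's response. All reserve values are assumptions.
    """
    return max(0, num_ctx - response_reserve - prompt_overhead)

# e.g. a 4K-context model leaves roughly 3456 tokens for chunk content
print(effective_chunk_budget(num_ctx=4096))
```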
Common Chunking Pitfalls with Ollama
When implementing chunking with Ollama, be aware of these common issues:
- Mid-Sentence Splitting: Avoid splitting sentences between chunks when possible, as this can disrupt the model’s understanding.
- Losing Key Context: Critical information mentioned early in a document might be missing from later chunks if not properly carried forward.
- Tokenizer Mismatches: Remember that character or word counts aren’t perfect proxies for token counts, which can lead to chunks that exceed token limits.
- Neglecting Document Structure: Splitting without respect to document structure (e.g., cutting across headers or code blocks) often produces poor results.
- Overloading Context Windows: Very dense information-rich chunks may overwhelm the model even if they’re within token limits.
Conclusion: Mastering Ollama with Advanced Chunking
Advanced chunking techniques are essential for getting the most out of Ollama, especially when working with larger documents or complex content. By implementing semantic, hierarchical, or sliding window chunking approaches, you can process content that far exceeds the model’s native context window while maintaining coherence and accuracy.
The techniques outlined in this guide will help you build more sophisticated applications with Ollama that can handle real-world document processing tasks efficiently. By understanding the nuances of different chunking strategies and how they interact with different Ollama models, you can create systems that make the most of local LLM capabilities without being constrained by context window limitations.
Remember that the ideal chunking strategy depends on your specific use case, content type, and chosen model. Experiment with the approaches outlined here and adapt them to your particular needs for optimal results.