Performance Guide

Overview

PySET is designed for maximum performance while maintaining 100% accuracy. In internal benchmarks it runs 2.0x to 3.9x faster than PySBD, with the largest gain on document-length text.

Benchmark Results

By Text Size

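| Text Size | Words | PySET Time | PySBD Time | Speedup |
|-----------|-------|------------|------------|---------|
| Sentences | ~5    | 0.05ms     | 0.10ms     | 2.0x    |
| Paragraph | ~104  | 0.60ms     | 1.37ms     | 2.3x    |
| Article   | ~484  | 2.41ms     | 5.25ms     | 2.2x    |
| Document  | ~1400 | 5.68ms     | 21.95ms    | 3.9x    |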

Words per Second

| Text Size | PySET w/s | PySBD w/s |
|-----------|-----------|-----------|
| Sentences | 101,016   | 47,809    |
| Paragraph | 172,136   | 75,911    |
| Article   | 200,693   | 92,152    |
| Document  | 247,779   | 64,091    |

Averaged across the four text sizes above, PySET processes roughly 180,000 words/second.
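The words-per-second figures can be reproduced from the timing table as words divided by time. The sketch below does that arithmetic with the rounded word counts and timings from the table, so its results land close to, but not exactly on, the published rates. The helper names are illustrative only.

```python
# Rows copied from the "By Text Size" table:
# (label, approximate words, PySET time in ms, PySBD time in ms).
ROWS = [
    ("Sentences", 5, 0.05, 0.10),
    ("Paragraph", 104, 0.60, 1.37),
    ("Article", 484, 2.41, 5.25),
    ("Document", 1400, 5.68, 21.95),
]


def words_per_second(words: int, millis: float) -> float:
    """Convert a word count and a duration in milliseconds to words/second."""
    return words / (millis / 1000.0)


def speedup(fast_ms: float, slow_ms: float) -> float:
    """How many times faster the first timing is than the second."""
    return slow_ms / fast_ms


for label, words, pyset_ms, pysbd_ms in ROWS:
    rate = words_per_second(words, pyset_ms)
    print(f"{label:<10} {rate:>10,.0f} w/s  {speedup(pyset_ms, pysbd_ms):.1f}x")
```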

Performance Optimizations

1. Context Caching

PySET caches results from Context methods to avoid repeated regex operations:

# Without caching: ~5 regex searches per position
# With caching: ~1 regex search total
# Speedup: ~5x
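As a rough illustration of the idea, the sketch below memoizes one context question per text position, so several rules asking the same question trigger a single regex search. `CachedContext`, its method, and the pattern are hypothetical stand-ins, not PySET's actual classes.

```python
import re


class CachedContext:
    """Hypothetical context object; not PySET's real implementation."""

    _NEXT_UPPER = re.compile(r"\s+[A-Z]")

    def __init__(self, text: str) -> None:
        self.text = text
        self._cache = {}      # (method name, position) -> cached answer
        self.regex_calls = 0  # counts real regex executions, for illustration

    def next_is_capitalized(self, pos: int) -> bool:
        """True if whitespace plus an uppercase letter follows index pos."""
        key = ("next_is_capitalized", pos)
        if key not in self._cache:
            self.regex_calls += 1
            match = self._NEXT_UPPER.match(self.text, pos + 1)
            self._cache[key] = match is not None
        return self._cache[key]


ctx = CachedContext("It works. Next sentence.")
period = ctx.text.index(".")
# Five rules asking the same question cost one regex search, not five.
answers = [ctx.next_is_capitalized(period) for _ in range(5)]
print(answers, ctx.regex_calls)
```

The cache lives on the per-text context object, so it is discarded with the text and never needs invalidation.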

2. Early Exit Strategy

The rules engine exits early once a rule is sufficiently confident:

# Exit if confidence >= 0.75 AND priority >= 85
# Exit if confidence >= 0.90 for BOUNDARY
# Speedup: ~3x
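The thresholds above could be applied along the following lines. This is a minimal sketch: `Decision`, `Verdict`, `is_decisive`, and `evaluate` are hypothetical names, and falling back to the most confident verdict when no rule is decisive is an assumption, not documented PySET behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Optional


class Decision(Enum):
    BOUNDARY = "boundary"
    NO_BOUNDARY = "no_boundary"


@dataclass(frozen=True)
class Verdict:
    decision: Decision
    confidence: float  # 0.0 to 1.0
    priority: int      # priority of the rule that produced the verdict


# A rule inspects (text, position) and returns a Verdict, or None to abstain.
Rule = Callable[[str, int], Optional[Verdict]]


def is_decisive(v: Verdict) -> bool:
    """Apply the two exit thresholds quoted above."""
    if v.confidence >= 0.75 and v.priority >= 85:
        return True
    return v.decision is Decision.BOUNDARY and v.confidence >= 0.90


def evaluate(rules: Iterable[Rule], text: str, pos: int) -> Optional[Verdict]:
    """Run rules in order and stop at the first decisive verdict."""
    best: Optional[Verdict] = None
    for rule in rules:
        verdict = rule(text, pos)
        if verdict is None:
            continue
        if is_decisive(verdict):
            return verdict  # early exit: the remaining rules never run
        if best is None or verdict.confidence > best.confidence:
            best = verdict
    return best  # assumed fallback: most confident non-decisive verdict
```

Ordering rules so that high-priority ones run first makes the early exit fire as soon as possible.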

3. Pre-compiled Patterns

All regex patterns are compiled once at initialization.
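A minimal sketch of the pattern, assuming a hypothetical `PatternSet` class and placeholder regexes (PySET's real patterns are not shown in this guide):

```python
import re


class PatternSet:
    """Hypothetical holder for regexes; the patterns are placeholders."""

    _SOURCES = {
        "abbreviation": r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.$",
        "terminator": r"[.!?]",
    }

    def __init__(self) -> None:
        # Compile every pattern once, when the object is constructed.
        self.compiled = {
            name: re.compile(source) for name, source in self._SOURCES.items()
        }

    def ends_with_abbreviation(self, fragment: str) -> bool:
        """True if the fragment ends in one of the listed abbreviations."""
        return self.compiled["abbreviation"].search(fragment) is not None


patterns = PatternSet()
print(patterns.ends_with_abbreviation("Ask Dr."))
print(patterns.ends_with_abbreviation("It is done."))
```

Python's `re` module also keeps a small internal cache of recently compiled patterns, but holding the compiled objects directly avoids the cache lookup on every call.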

4. Minimal Object Creation

PySET minimizes object allocation during splitting.

Memory Usage

| Text Size | Memory |
|-----------|--------|
| Sentences | ~1KB   |
| Paragraph | ~5KB   |
| Article   | ~25KB  |
| Document  | ~70KB  |

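Memory usage grows roughly linearly with input. Taking the word counts from the benchmark table, it comes to about 50 bytes per word for paragraph-sized and larger texts (for example, ~70KB for ~1400 words); very short inputs are dominated by fixed overhead.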
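To check these figures against your own inputs, peak allocation can be measured with the standard library's `tracemalloc`. The sketch below uses a naive regex splitter as a stand-in so it runs without PySET installed; `peak_memory` and `naive_split` are illustrative names, and `detector.split` would take the stand-in's place.

```python
import re
import tracemalloc
from typing import Callable, List, Tuple


def naive_split(text: str) -> List[str]:
    """Stand-in splitter for illustration; swap in detector.split."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def peak_memory(split: Callable[[str], List[str]], text: str) -> Tuple[int, float]:
    """Return (peak bytes allocated during split, peak bytes per input word)."""
    tracemalloc.start()
    try:
        split(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    words = max(len(text.split()), 1)
    return peak, peak / words


sample = "This is a sentence. Here is another one! And a third? " * 200
peak, per_word = peak_memory(naive_split, sample)
print(f"peak: {peak / 1024:.1f} KB, about {per_word:.0f} bytes/word")
```

`tracemalloc` only sees allocations made through Python's allocator, so treat the result as an estimate, not an exact process footprint.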

Best Practices

1. Reuse Detector Instance

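from pyset import TokenBoundaryDetector

# texts: any iterable of input strings

# Good: Build the detector once and reuse it for every text
detector = TokenBoundaryDetector()
for text in texts:
    sentences = detector.split(text)

# Bad: A new detector per text repeats the initialization work each time
for text in texts:
    detector = TokenBoundaryDetector()
    sentences = detector.split(text)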

2. Batch Processing

# Good: Process a large document in a single call
sentences = detector.split(large_document)

# Bad: Manually splitting the text into chunks first; this adds overhead
# and can cut a sentence in half at a chunk edge

3. Choose the Right Configuration

# Fastest: Minimal configuration
detector = TokenBoundaryDetector()

# More accurate but slower: Aggressive abbreviations
detector = TokenBoundaryDetector(aggressive_abbreviations=True)

Integration Performance

RAG/Document Chunking

import time

# text is the input string; chunk_document is your own chunking function
word_count = len(text.split())

start = time.perf_counter()
chunks = chunk_document(text)
elapsed = time.perf_counter() - start

print(f"Chunked {word_count} words in {elapsed:.3f}s")
print(f"Rate: {word_count / elapsed:.0f} words/second")

Typical rates: 50,000-100,000 words/second depending on text complexity.
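The timing snippet above calls a `chunk_document` function that this guide does not define. One possible shape, offered as an assumption and not as PySET's API, groups whole sentences into chunks under a word budget. The `split` parameter and the `naive_split` stand-in are hypothetical; in practice `detector.split` would be passed in.

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Stand-in sentence splitter; replace with detector.split in practice."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_document(
    text: str,
    max_words: int = 200,
    split: Callable[[str], List[str]] = naive_split,
) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in split(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_document("One two three. Four five. Six seven eight nine.", max_words=5))
```

Chunking on sentence boundaries keeps each chunk readable on its own, which is usually what retrieval pipelines want.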

NLP Pipelines

from pyset import TokenBoundaryDetector
import spacy

nlp = spacy.load("en_core_web_sm")
detector = TokenBoundaryDetector()

def process(text):
    sentences = detector.split(text)
    docs = nlp.pipe(sentences)
    return list(docs)

Scaling

PySET scales linearly with text size:

| Words   | Time  | Rate   |
|---------|-------|--------|
| 100     | 0.6ms | 165K/s |
| 1,000   | 5ms   | 200K/s |
| 10,000  | 50ms  | 200K/s |
| 100,000 | 500ms | 200K/s |
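To check scaling on your own machine, a small harness along these lines may help. It is a sketch under stated assumptions: `naive_split` is a stand-in so the code runs without PySET, and `make_text` and `measure_scaling` are illustrative names. Pass `detector.split` as the first argument to measure PySET itself.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple


def naive_split(text: str) -> List[str]:
    """Stand-in splitter so the sketch runs without PySET installed."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def make_text(n_words: int) -> str:
    """Build synthetic text with exactly n_words words, a period every 10th word."""
    words = [f"word{i}." if i % 10 == 9 else f"word{i}" for i in range(n_words)]
    return " ".join(words)


def measure_scaling(
    split: Callable[[str], List[str]],
    sizes: Sequence[int] = (100, 1_000, 10_000, 100_000),
    repeats: int = 5,
) -> List[Tuple[int, float, float]]:
    """Return (words, best seconds, words/second) for each input size."""
    rows = []
    for n in sizes:
        text = make_text(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            split(text)
            best = min(best, time.perf_counter() - start)
        rate = n / best if best > 0 else float("inf")
        rows.append((n, best, rate))
    return rows


for words, seconds, rate in measure_scaling(naive_split):
    print(f"{words:>7,} words  {seconds * 1000:8.2f} ms  {rate:>12,.0f} w/s")
```

A roughly constant rate across the larger sizes indicates linear scaling; taking the best of several repeats reduces noise from other processes.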

Comparison

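| Feature      | PySET   | PySBD  | Others |
|--------------|---------|--------|--------|
| Speed        | Fastest | Medium | Slow   |
| Memory       | Low     | Medium | Medium |
| Accuracy     | 100%    | Lower  | Varies |
| Dependencies | Zero    | 1      | 1-5    |
| Languages    | 50+     | 20+    | Varies |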

Troubleshooting Performance

Slow Processing

  1. Check debug mode - debug=True adds overhead
  2. Review rule count - exclude unnecessary rules
  3. Profile your code - ensure detector is reused
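For the profiling step, the standard library's `cProfile` is usually enough to see where time goes. The sketch below profiles a single call and returns the top entries by cumulative time. `profile_split` is an illustrative helper, and `naive_split` is a stand-in for `detector.split`.

```python
import cProfile
import io
import pstats
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Stand-in splitter for illustration; swap in detector.split."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def profile_split(split: Callable[[str], List[str]], text: str, top: int = 5) -> str:
    """Profile one split call and return a report of the top entries."""
    profiler = cProfile.Profile()
    profiler.runcall(split, text)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_split(naive_split, "One. Two! Three? " * 1000))
```

If constructor or pattern-compilation frames show up inside a per-text loop in the report, the detector is probably being rebuilt instead of reused.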

High Memory Usage

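  1. Process in chunks - for very large inputs, feed the detector one paragraph or section at a time instead of loading the entire document
  2. Clear cache - call detector.reset() if needed
  3. Check text size - memory grows with input size (see Memory Usage)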
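Processing in chunks can be sketched as a generator that reads a file line by line, groups lines into paragraphs, and splits each paragraph separately. This assumes sentences never span a blank line; `iter_paragraphs` and `split_streaming` are hypothetical helpers, not part of PySET.

```python
from typing import Callable, Iterable, Iterator, List


def iter_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Yield blank-line-separated paragraphs from an iterable of lines."""
    buffer: List[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)


def split_streaming(
    lines: Iterable[str], split: Callable[[str], List[str]]
) -> Iterator[str]:
    """Yield sentences one paragraph at a time, keeping peak memory small."""
    for paragraph in iter_paragraphs(lines):
        yield from split(paragraph)


# Demo with an in-memory list of lines and a trivial stand-in splitter.
demo_lines = ["First one. Second", "one.", "", "Third."]
demo_split = lambda p: p.replace(". ", ".|").split("|")
print(list(split_streaming(demo_lines, demo_split)))
```

With a real file, an open file handle can be passed as `lines` and `detector.split` as `split`, so only one paragraph is held in memory at a time.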

Future Optimizations

Planned optimizations:

  • Cython migration for hot paths
  • SIMD instructions for pattern matching
  • Parallel processing for multi-document batches