PySET is designed for maximum performance while maintaining 100% accuracy. Internal benchmarks show a 2.0x to 3.9x speedup over PySBD, with the largest gain on the longest input.

| Text Size | Words | PySET Time | PySBD Time | Speedup |
|---|---|---|---|---|
| Sentences | ~5 | 0.05ms | 0.10ms | 2.0x |
| Paragraph | ~104 | 0.60ms | 1.37ms | 2.3x |
| Article | ~484 | 2.41ms | 5.25ms | 2.2x |
| Document | ~1400 | 5.68ms | 21.95ms | 3.9x |

Throughput in words per second (w/s) for the same inputs:

| Text Size | PySET w/s | PySBD w/s |
|---|---|---|
| Sentences | 101,016 | 47,809 |
| Paragraph | 172,136 | 75,911 |
| Article | 200,693 | 92,152 |
| Document | 247,779 | 64,091 |

PySET processes 158,000+ words/second on average.
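
The throughput and speedup columns follow directly from the word counts and timings. The sketch below recomputes them from the benchmark table; the row values are copied from that table, and the helper names are illustrative only. Recomputed rates differ slightly from the published ones (for example 100,000 w/s instead of 101,016 for the shortest input), presumably because the listed times are rounded.

```python
# (label, words, PySET ms, PySBD ms) copied from the benchmark table.
ROWS = [
    ("Sentences", 5, 0.05, 0.10),
    ("Paragraph", 104, 0.60, 1.37),
    ("Article", 484, 2.41, 5.25),
    ("Document", 1400, 5.68, 21.95),
]


def words_per_second(words: int, millis: float) -> float:
    """Convert a word count and a time in milliseconds to words/second."""
    return words / (millis / 1000.0)


def speedup(pyset_ms: float, pysbd_ms: float) -> float:
    """How many times faster PySET is than PySBD for one row."""
    return pysbd_ms / pyset_ms


for label, words, pyset_ms, pysbd_ms in ROWS:
    rate = words_per_second(words, pyset_ms)
    print(f"{label:<10} {rate:>10,.0f} w/s  {speedup(pyset_ms, pysbd_ms):.1f}x")
```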

PySET caches results from Context methods to avoid repeated regex operations:

```python
# Without caching: ~5 regex searches per position
# With caching: ~1 regex search total
# Speedup: ~5x
```

The rules engine exits early when confident:

```python
# Exit if confidence >= 0.75 AND priority >= 85
# Exit if confidence >= 0.90 for BOUNDARY
# Speedup: ~3x
```

All regex patterns are compiled once at initialization.
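
To make the three techniques above concrete, here is a minimal sketch of a rule loop with a precompiled pattern, a memoized context check, and the two early-exit thresholds. It is not PySET's implementation: `Rule`, `Verdict`, `decide`, the example rules, and the reading of the thresholds as per-rule checks are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, List, Optional

# Compiled once at import time instead of on every call.
ABBREVIATION = re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.$")


@lru_cache(maxsize=None)
def ends_with_abbreviation(prefix: str) -> bool:
    """Memoized context check: repeated questions about the same prefix hit the cache."""
    return ABBREVIATION.search(prefix) is not None


@dataclass(frozen=True)
class Verdict:
    label: str  # "BOUNDARY" or "NO_BOUNDARY" (hypothetical labels)
    confidence: float


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    check: Callable[[str], Optional[Verdict]]


def decide(prefix: str, rules: List[Rule]) -> Optional[Verdict]:
    """Evaluate rules by descending priority and stop as soon as one is decisive."""
    best: Optional[Verdict] = None
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        verdict = rule.check(prefix)
        if verdict is None:
            continue
        if best is None or verdict.confidence > best.confidence:
            best = verdict
        # Early exits, using the thresholds quoted above.
        if verdict.confidence >= 0.75 and rule.priority >= 85:
            return verdict
        if verdict.label == "BOUNDARY" and verdict.confidence >= 0.90:
            return verdict
    return best


# Hypothetical example rules.
RULES = [
    Rule("abbreviation", 90,
         lambda p: Verdict("NO_BOUNDARY", 0.95) if ends_with_abbreviation(p) else None),
    Rule("period", 50,
         lambda p: Verdict("BOUNDARY", 0.60) if p.endswith(".") else None),
]
```

With these rules, `decide("See Dr.", RULES)` returns after the first rule, while `decide("It ended.", RULES)` falls through to the lower-priority period rule.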
PySET minimizes object allocation during splitting.
| Text Size | Memory |
|---|---|
| Sentences | ~1KB |
| Paragraph | ~5KB |
| Article | ~25KB |
| Document | ~70KB |
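
Memory usage is roughly 50x input text size. Measured against the word counts in the benchmark table (assuming the same inputs), the paragraph, article, and document rows come to about 50 bytes per word.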
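
Memory figures like these can be checked on your own inputs with the standard library's `tracemalloc`. The sketch below is one way to do that, not the method used to produce the table; `peak_memory` is a hypothetical helper and `naive_split` is a stand-in so the example runs without PySET installed. Note that `tracemalloc` only tracks allocations made through Python's allocator.

```python
import tracemalloc
from typing import Callable, List, Tuple


def peak_memory(split: Callable[[str], List[str]], text: str) -> Tuple[List[str], int]:
    """Run split(text) and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = split(text)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak


def naive_split(text: str) -> List[str]:
    # Stand-in for detector.split(text); NOT PySET's algorithm.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


sample = "A b. C d. " * 100
sentences, peak = peak_memory(naive_split, sample)
print(f"{len(sentences)} sentences, peak {peak / 1024:.1f} KB, "
      f"{peak / len(sample.encode('utf-8')):.1f}x input size")
```

To measure PySET itself, pass `detector.split` in place of `naive_split`.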

```python
from pyset import TokenBoundaryDetector

# Good: Reuse detector
detector = TokenBoundaryDetector()
for text in texts:
    sentences = detector.split(text)

# Bad: New detector each time
for text in texts:
    detector = TokenBoundaryDetector()
    sentences = detector.split(text)
```

```python
# Good: Process large documents at once
sentences = detector.split(large_document)

# Bad: Split into chunks manually
```

```python
# Fastest: Minimal configuration
detector = TokenBoundaryDetector()

# More accurate but slower: Aggressive abbreviations
detector = TokenBoundaryDetector(aggressive_abbreviations=True)
```

```python
import time

# Assumes chunk_document and text are already defined.
start = time.perf_counter()
chunks = chunk_document(text)
elapsed = time.perf_counter() - start

word_count = len(text.split())
print(f"Chunked {word_count} words in {elapsed:.3f}s")
print(f"Rate: {word_count / elapsed:.0f} words/second")
```

Typical rates: 50,000-100,000 words/second depending on text complexity.
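
A single `perf_counter` measurement can be noisy when a call takes well under a millisecond. A more stable variant, sketched here under the assumption that the function under test takes a string and returns a list, repeats the call with `timeit.repeat` and keeps the fastest run. `words_per_second` is a hypothetical helper and `naive_split` is a stand-in so the example runs on its own.

```python
import timeit
from typing import Callable, List


def words_per_second(func: Callable[[str], List[str]], text: str,
                     repeat: int = 5, number: int = 20) -> float:
    """Best-of-`repeat` throughput of func(text), in whitespace-separated words/second."""
    timings = timeit.repeat(lambda: func(text), repeat=repeat, number=number)
    seconds_per_call = min(timings) / number  # fastest run is least disturbed by noise
    return len(text.split()) / seconds_per_call


def naive_split(text: str) -> List[str]:
    # Stand-in for the function being measured; NOT PySET's algorithm.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


rate = words_per_second(naive_split, "A b. C d. " * 50)
print(f"Rate: {rate:,.0f} words/second")
```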

```python
from pyset import TokenBoundaryDetector
import spacy

nlp = spacy.load("en_core_web_sm")
detector = TokenBoundaryDetector()

def process(text):
    sentences = detector.split(text)
    docs = nlp.pipe(sentences)
    return list(docs)
```

PySET scales linearly with text size:

| Text Words | Time | Rate |
|---|---|---|
| 100 | 0.6ms | 165K/s |
| 1,000 | 5ms | 200K/s |
| 10,000 | 50ms | 200K/s |
| 100,000 | 500ms | 200K/s |
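
One way to check a linear-scaling claim is to estimate the scaling exponent between consecutive rows: if time grows as words^k, then k = log(t2/t1) / log(n2/n1), and k close to 1 means linear. The sketch below applies this to the scaling table; the values are copied from that table and the function name is illustrative.

```python
import math
from typing import List, Tuple

# (words, time in ms) copied from the scaling table.
SCALING: List[Tuple[int, float]] = [
    (100, 0.6),
    (1_000, 5.0),
    (10_000, 50.0),
    (100_000, 500.0),
]


def scaling_exponents(rows: List[Tuple[int, float]]) -> List[float]:
    """Exponent k in time ~ words**k for each pair of consecutive rows."""
    exponents = []
    for (n1, t1), (n2, t2) in zip(rows, rows[1:]):
        exponents.append(math.log(t2 / t1) / math.log(n2 / n1))
    return exponents


for (n1, _), (n2, _), k in zip(SCALING, SCALING[1:], scaling_exponents(SCALING)):
    print(f"{n1:>7,} -> {n2:>7,} words: k = {k:.2f}")
```

On these figures the exponent is about 0.92 for the first step and 1.00 afterwards, which is consistent with a small fixed per-call overhead that only matters for short inputs.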

Feature comparison:

| Feature | PySET | PySBD | Others |
|---|---|---|---|
| Speed | Fastest | Medium | Slow |
| Memory | Low | Medium | Medium |
| Accuracy | 100% | Lower | Varies |
| Dependencies | Zero | 1 | 1-5 |
| Languages | 50+ | 20+ | Varies |

If splitting is slow:

- Check debug mode - debug=True adds overhead
- Review rule count - exclude unnecessary rules
- Profile your code - ensure detector is reused

If memory usage is high:

- Process in chunks - don't load entire documents
- Clear cache - call detector.reset() if needed
- Check text size - memory ~50x text size
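
For the memory tips above, chunking works best on boundaries that sentences do not cross. The sketch below streams a file one paragraph at a time, assuming paragraphs are separated by blank lines and that no sentence spans a blank line. `iter_paragraphs` and `split_streaming` are hypothetical helpers, and `naive_split` is a stand-in so the example runs without PySET installed.

```python
import io
from typing import Callable, Iterable, Iterator, List


def iter_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Yield one paragraph at a time; only the current paragraph is held in memory."""
    buffer: List[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            # A blank line closes the current paragraph.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)


def split_streaming(lines: Iterable[str],
                    split: Callable[[str], List[str]]) -> Iterator[str]:
    """Apply `split` to each paragraph and yield sentences lazily."""
    for paragraph in iter_paragraphs(lines):
        yield from split(paragraph)


def naive_split(text: str) -> List[str]:
    # Stand-in for detector.split(text); NOT PySET's algorithm.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


document = io.StringIO("A b.\nC d.\n\n\nE f.\n")
for sentence in split_streaming(document, naive_split):
    print(sentence)
```

With a real file, pass the open file handle as `lines` and a reused detector's `split` method as `split`. This trades some throughput for a bounded memory footprint, so it is only worth doing when memory is the constraint.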
Planned optimizations:
- Cython migration for hot paths
- SIMD instructions for pattern matching
- Parallel processing for multi-document batches