Rules Reference

PySET uses 85 priority-weighted rules for sentence boundary detection.

Rule Categories

1. Standard Terminals (Rules 1-15)

Rule	Priority	Description
1	100	Period followed by space + uppercase
2	100	Exclamation mark
3	100	Question mark
4	95	Period + newline
5	90	Space after terminal
6	85	Multiple periods (ellipsis detection)
7	80	Terminal followed by quote
8	75	Exclamation/question + quote
...	...	...

2. Ellipsis (Rules 16-20)

Rule	Priority	Description
16	95	Three or more periods
17	90	Ellipsis with spaces
18	85	Ellipsis at end
19	80	... variant
20	75	··· variant

3. Quotation Marks (Rules 21-30)

Rule	Priority	Description
21	95	Period before closing quote
22	90	Closing quote + space
23	85	Smart quotes
24	80	Single quotes
...	...	...

4. Brackets (Rules 31-40)

Rule	Priority	Description
31	90	Close paren + period
32	85	Close bracket + period
33	80	Nested parens
...	...	...

5. Numbers (Rules 41-50)

Rule	Priority	Description
41	90	Ordinal numbers (1st, 2nd)
42	85	Roman numerals
43	80	Year in parentheses
44	75	Decades (1990s)
...	...	...

6. Context (Rules 51-70)

Rule	Priority	Description
51	90	Previous word check
52	85	Next word check
53	80	Abbreviation detection
54	75	Title detection
...	...	...

7. Special Cases (Rules 71-80)

Rule	Priority	Description
71	95	URLs
72	95	Email addresses
73	90	File paths
74	85	Decimals
75	80	Numbers with periods
76	75	Version numbers
77	70	Internet domains
78	65	Acronyms
79	60	Initialisms
80	55	Abbreviations

8. Advanced (Rules 81-85)

Rule	Priority	Description
81	90	Legal citation
82	85	Legal section/reference
83	80	Complex clause
84	75	Semicolon handling
85	70	Colon in list

Priority System

Rules are evaluated in priority order (highest first). Evaluation exits early as soon as a rule returns either:

confidence >= 0.75 AND priority >= 85
confidence >= 0.90 (VERY_LIKELY or BOUNDARY), at any priority

This settles obvious cases quickly while the remaining rules still handle edge cases.
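
The early-exit behaviour can be sketched without the library. This is a minimal illustration, not PySET's implementation: SketchRule and evaluate_rules are hypothetical names, and returning the highest score seen when no rule triggers an early exit is an assumption, since this page does not say how the remaining results are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SketchRule:
    """Hypothetical stand-in for a rule: a name, a priority, a scoring function."""
    name: str
    priority: int
    score: Callable[[str, int], float]


def evaluate_rules(rules: List[SketchRule], text: str, pos: int) -> Tuple[float, List[str]]:
    """Run rules highest priority first; return (confidence, names of rules run)."""
    evaluated: List[str] = []
    best = 0.0
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        confidence = rule.score(text, pos)
        evaluated.append(rule.name)
        best = max(best, confidence)
        # Early exit 1: a high-priority rule is at least 0.75 confident.
        if confidence >= 0.75 and rule.priority >= 85:
            return confidence, evaluated
        # Early exit 2: any rule is at least 0.90 confident.
        if confidence >= 0.90:
            return confidence, evaluated
    # Assumption: with no early exit, fall back to the highest score seen.
    return best, evaluated


rules = [
    SketchRule("semicolon", 75, lambda t, i: 0.5 if t[i] == ";" else 0.0),
    SketchRule("question_mark", 100, lambda t, i: 1.0 if t[i] == "?" else 0.0),
    SketchRule("period_newline", 95,
               lambda t, i: 1.0 if t[i] == "." and t[i + 1:i + 2] == "\n" else 0.0),
]

# The "?" at index 5 is settled by the first rule; the other two never run.
exit_score, exit_trace = evaluate_rules(rules, "Ready? Go.", 5)

# The ";" at index 1 never reaches an exit threshold, so every rule runs.
full_score, full_trace = evaluate_rules(rules, "a; b", 1)
```

In the first call only question_mark is evaluated, which is the saving the early exit is meant to provide. In the second call all three rules run and the result is the 0.5 from the semicolon rule.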

Adding Custom Rules

from pyset.rules import Rule, RuleContext, BOUNDARY, NOT_BOUNDARY

class MyCustomRule(Rule):
    priority = 88
    category = "Custom"
    
    def evaluate(self, context: RuleContext) -> float:
        # Access position context
        prev = context.prev_word()
        current = context.char()
        next_c = context.next_char()
        
        # Your logic
        if prev in {"hello", "hi"} and current == ".":
            return BOUNDARY  # 1.0
        
        return NOT_BOUNDARY  # 0.0
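

The rule's logic can be checked in isolation before registering it. The sketch below does not import pyset: FakeContext is a hypothetical test double that implements only the three methods the rule calls, GreetingRule repeats the logic of MyCustomRule without the Rule base class, and the two constants are local copies of the documented values. The whitespace-based definition of "previous word" is an assumption.

```python
BOUNDARY = 1.0      # local copy of the documented value
NOT_BOUNDARY = 0.0  # local copy of the documented value


class FakeContext:
    """Hypothetical test double exposing only the methods the rule uses."""

    def __init__(self, text: str, position: int):
        self._text = text
        self._position = position

    def char(self) -> str:
        return self._text[self._position]

    def next_char(self) -> str:
        nxt = self._position + 1
        return self._text[nxt] if nxt < len(self._text) else ""

    def prev_word(self) -> str:
        # Assumption: the previous word is the last whitespace-separated
        # token before the current position.
        words = self._text[:self._position].split()
        return words[-1] if words else ""


class GreetingRule:
    """Same logic as MyCustomRule, minus the pyset base class."""
    priority = 88
    category = "Custom"

    def evaluate(self, context) -> float:
        if context.prev_word() in {"hello", "hi"} and context.char() == ".":
            return BOUNDARY
        return NOT_BOUNDARY


rule = GreetingRule()
text = "well hello. Nice day."

after_greeting = rule.evaluate(FakeContext(text, text.index(".")))
after_day = rule.evaluate(FakeContext(text, text.rindex(".")))
capitalised = rule.evaluate(FakeContext("Hello.", 5))
```

Here after_greeting is BOUNDARY and after_day is NOT_BOUNDARY. The third case shows that the set lookup is case-sensitive, so "Hello." does not match; lowercase the word before the lookup if that is not what you want.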

Context Methods

# Character access
context.char()             # Current char
context.prev_char(n)       # Nth previous char
context.next_char(n)       # Nth next char

# Word access
context.prev_word(n)       # Nth previous word
context.next_word(n)       # Nth next word

# Position
context.position()         # Current index
context.text_length()      # Total length
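
As an illustration of how the character and position methods combine, here is a small check for a period that sits between two digits. SketchContext is a toy stand-in, not the real RuleContext: the default of n=1 and the empty string returned past either end of the text are assumptions, and looks_like_decimal_point is a hypothetical helper.

```python
class SketchContext:
    """Toy stand-in for RuleContext; the semantics here are assumptions."""

    def __init__(self, text: str, index: int):
        self._text = text
        self._index = index

    def position(self) -> int:
        return self._index

    def text_length(self) -> int:
        return len(self._text)

    def char(self) -> str:
        return self._text[self._index]

    def prev_char(self, n: int = 1) -> str:
        i = self._index - n
        return self._text[i] if i >= 0 else ""

    def next_char(self, n: int = 1) -> str:
        i = self._index + n
        return self._text[i] if i < len(self._text) else ""


def looks_like_decimal_point(context) -> bool:
    """True when the current period sits between two digits, as in 3.14."""
    return (
        context.char() == "."
        and context.prev_char(1).isdigit()
        and context.next_char(1).isdigit()
    )


price = "It costs 3.14 now."
decimal_ctx = SketchContext(price, price.index("."))  # the period in 3.14
final_ctx = SketchContext(price, len(price) - 1)      # the closing period

is_decimal = looks_like_decimal_point(decimal_ctx)
is_final_decimal = looks_like_decimal_point(final_ctx)
at_end = final_ctx.position() == final_ctx.text_length() - 1
```

The first period is flagged as a decimal point and the last one is not. Comparing position() with text_length() is one way to detect the end of the text without reading past it.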

Rule Result Values

Constant	Value	Meaning
BOUNDARY	1.0	Definitely a boundary
NOT_BOUNDARY	0.0	Definitely not a boundary
MAYBE	0.5	Need more context
LIKELY	0.75	Probably a boundary
VERY_LIKELY	0.90	Almost certainly
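
A rule does not have to return only the two extremes. The sketch below grades a period by what follows it, using local copies of the values in the table. Which situation maps to which constant is chosen purely for illustration and is not PySET's own grading; score_period is a hypothetical function.

```python
BOUNDARY = 1.0
VERY_LIKELY = 0.90
LIKELY = 0.75
MAYBE = 0.5
NOT_BOUNDARY = 0.0


def score_period(text: str, index: int) -> float:
    """Grade the period at text[index] by the characters after it."""
    following = text[index + 1:]
    if not following:
        return BOUNDARY        # end of text
    if following.startswith("\n"):
        return VERY_LIKELY     # period + newline
    if following[:1] == " " and following[1:2].isupper():
        return LIKELY          # period + space + uppercase
    if following[:1] == " ":
        return MAYBE           # period + space + something else
    return NOT_BOUNDARY        # e.g. the period inside "3.14"


scores = {
    "end": score_period("Done.", 4),
    "newline": score_period("One.\nTwo", 3),
    "uppercase": score_period("Hi. There", 2),
    "lowercase": score_period("e.g. this", 3),
    "decimal": score_period("3.14", 1),
}
```

With the early-exit thresholds described under Priority System, a LIKELY result ends evaluation only when the rule's priority is at least 85, while VERY_LIKELY and BOUNDARY end it at any priority.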

Troubleshooting

False Positives (Over-splitting)

If sentences are split too aggressively:

Add the abbreviation to your custom abbreviation set
Increase min_sentence_length
Enable aggressive_abbreviations mode

False Negatives (Under-splitting)

If sentences aren't split enough:

Check for missing abbreviations
Review Rule 84 (semicolon handling)
Try excluding rules that may be blocking the split

Debug Mode

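from pyset import TokenBoundaryDetector  # assumed import path; adjust to your install

text = "Dr. Smith arrived. He was late."  # sample input
detector = TokenBoundaryDetector(debug=True)
explanations = detector.explain(text)
# Shows all rules evaluated and their results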
