Rules Reference

PySET uses 85 priority-weighted rules for sentence boundary detection.

Rule Categories

1. Standard Terminals (Rules 1-15)

| Rule | Priority | Description |
|------|----------|-------------|
| 1 | 100 | Period followed by space + uppercase |
| 2 | 100 | Exclamation mark |
| 3 | 100 | Question mark |
| 4 | 95 | Period + newline |
| 5 | 90 | Space after terminal |
| 6 | 85 | Multiple periods (ellipsis detection) |
| 7 | 80 | Terminal followed by quote |
| 8 | 75 | Exclamation/question + quote |
| ... | ... | ... |

2. Ellipsis (Rules 16-20)

| Rule | Priority | Description |
|------|----------|-------------|
| 16 | 95 | Three or more periods |
| 17 | 90 | Ellipsis with spaces |
| 18 | 85 | Ellipsis at end |
| 19 | 80 | ... variant |
| 20 | 75 | ··· variant |

3. Quotation Marks (Rules 21-30)

| Rule | Priority | Description |
|------|----------|-------------|
| 21 | 95 | Period before closing quote |
| 22 | 90 | Closing quote + space |
| 23 | 85 | Smart quotes |
| 24 | 80 | Single quotes |
| ... | ... | ... |

4. Brackets (Rules 31-40)

| Rule | Priority | Description |
|------|----------|-------------|
| 31 | 90 | Close paren + period |
| 32 | 85 | Close bracket + period |
| 33 | 80 | Nested parens |
| ... | ... | ... |

5. Numbers (Rules 41-50)

| Rule | Priority | Description |
|------|----------|-------------|
| 41 | 90 | Ordinal numbers (1st, 2nd) |
| 42 | 85 | Roman numerals |
| 43 | 80 | Year in parentheses |
| 44 | 75 | Decades (1990s) |
| ... | ... | ... |

6. Context (Rules 51-70)

| Rule | Priority | Description |
|------|----------|-------------|
| 51 | 90 | Previous word check |
| 52 | 85 | Next word check |
| 53 | 80 | Abbreviation detection |
| 54 | 75 | Title detection |
| ... | ... | ... |

7. Special Cases (Rules 71-80)

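| Rule | Priority | Description |
|------|----------|-------------|
| 71 | 95 | URLs |
| 72 | 95 | Email addresses |
| 73 | 90 | File paths |
| 74 | 85 | Decimals |
| 75 | 80 | Numbers with periods |
| 76 | 75 | Version numbers |
| 77 | 70 | Internet domains |
| 78 | 65 | Acronyms |
| 79 | 60 | Initialisms |
| 80 | 55 | Abbreviations |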
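
To make the idea behind the decimal and version-number cases concrete, here is a minimal sketch of the kind of check such a rule could perform: a period with a digit directly on both sides is treated as part of a number rather than a sentence terminal. The function name `period_is_inside_number` is hypothetical and this is not PySET's implementation, only an illustration of the technique.

```python
def period_is_inside_number(text: str, pos: int) -> bool:
    """Return True when text[pos] is a period with a digit on both sides.

    Illustrative helper only; the name and behavior are assumptions,
    not part of PySET.
    """
    if text[pos] != ".":
        return False
    before = text[pos - 1] if pos > 0 else ""
    after = text[pos + 1] if pos + 1 < len(text) else ""
    # "".isdigit() is False, so periods at either end of the text fall through.
    return before.isdigit() and after.isdigit()


sample = "Version 2.1 costs 3.50. Upgrade now."

# Report every period and whether it sits inside a number.
for index, char in enumerate(sample):
    if char == ".":
        print(index, period_is_inside_number(sample, index))
```

In the sample, the periods in `2.1` and `3.50` are flagged as numeric, while the period that ends the first sentence (followed by a space) and the final period are not.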

8. Advanced (Rules 81-85)

| Rule | Priority | Description |
|------|----------|-------------|
| 81 | 90 | Legal citation |
| 82 | 85 | Legal sec/ref |
| 83 | 80 | Complex clause |
| 84 | 75 | Semicolon handling |
| 85 | 70 | Colon in list |

Priority System

Rules are evaluated in priority order (highest first). Evaluation exits early as soon as a rule returns either:

  • confidence >= 0.75 from a rule with priority >= 85
  • confidence >= 0.90 (VERY_LIKELY or higher) from any rule

This resolves obvious boundaries quickly while leaving the remaining rules to handle edge cases.
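
The evaluation order and the two early-exit conditions above can be sketched as a short loop. This is a simplified model, not PySET's code: `SketchRule`, `evaluate_rules`, and the three sample rules are hypothetical, and combining results by taking the highest confidence seen so far is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SketchRule:
    """Hypothetical stand-in for a rule: a name, a priority, and a scorer."""
    name: str
    priority: int
    score: Callable[[str, int], float]  # (text, position) -> confidence in [0, 1]


def evaluate_rules(rules: List[SketchRule], text: str, pos: int) -> Tuple[float, List[str]]:
    """Run rules highest-priority first and stop once a result is decisive."""
    best = 0.0
    evaluated: List[str] = []
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        confidence = rule.score(text, pos)
        evaluated.append(rule.name)
        best = max(best, confidence)  # assumption: keep the highest confidence
        # Early exit 1: confident answer from a high-priority rule.
        if confidence >= 0.75 and rule.priority >= 85:
            break
        # Early exit 2: very high confidence from any rule.
        if confidence >= 0.90:
            break
    return best, evaluated


def period_then_upper(text: str, pos: int) -> float:
    """Score 1.0 for a period followed (after whitespace) by an uppercase letter."""
    following = text[pos + 1:].lstrip()
    return 1.0 if text[pos] == "." and following[:1].isupper() else 0.0


rules = [
    SketchRule("colon_in_list", 70, lambda text, pos: 0.0),
    SketchRule("period_space_upper", 100, period_then_upper),
    SketchRule("semicolon", 75, lambda text, pos: 0.5),
]

confidence, evaluated = evaluate_rules(rules, "It works. Next one.", 8)
print(confidence, evaluated)
```

At position 8 (the first period) the priority-100 rule returns 1.0, so the loop stops after a single rule. At a position where no rule is decisive, all three rules run in descending priority order.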

Adding Custom Rules

from pyset.rules import Rule, RuleContext, BOUNDARY, NOT_BOUNDARY

class MyCustomRule(Rule):
    priority = 88
    category = "Custom"
    
    def evaluate(self, context: RuleContext) -> float:
        # Access position context
        prev = context.prev_word()
        current = context.char()
        next_c = context.next_char()
        
        # Your logic
        if prev in {"hello", "hi"} and current == ".":
            return BOUNDARY  # 1.0
        
        return NOT_BOUNDARY  # 0.0

Context Methods

# Character access
context.char()          # Current char
context.prev_char(n)    # Nth previous char
context.next_char(n)    # Nth next char

# Word access
context.prev_word(n)    # Nth previous word
context.next_word(n)    # Nth next word

# Position
context.position()      # Current index
context.text_length()   # Total length
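
When unit-testing a custom rule's logic in isolation, a small stand-in object with the same method names can be handy. The class below is a hypothetical sketch, not PySET's `RuleContext`: it assumes `n` is 1-based and defaults to 1, that out-of-range lookups return an empty string, and that words are runs of `\w` characters. The real semantics may differ.

```python
import re


class SketchContext:
    """Hypothetical stand-in mirroring the method names listed above."""

    def __init__(self, text: str, pos: int):
        self._text = text
        self._pos = pos

    def char(self) -> str:
        return self._text[self._pos]

    def prev_char(self, n: int = 1) -> str:
        index = self._pos - n
        return self._text[index] if index >= 0 else ""

    def next_char(self, n: int = 1) -> str:
        index = self._pos + n
        return self._text[index] if index < len(self._text) else ""

    def prev_word(self, n: int = 1) -> str:
        # Words that end before the current position, nearest one last.
        words = re.findall(r"\w+", self._text[:self._pos])
        return words[-n] if 1 <= n <= len(words) else ""

    def next_word(self, n: int = 1) -> str:
        # Words that start after the current position, nearest one first.
        words = re.findall(r"\w+", self._text[self._pos + 1:])
        return words[n - 1] if 1 <= n <= len(words) else ""

    def position(self) -> int:
        return self._pos

    def text_length(self) -> int:
        return len(self._text)


ctx = SketchContext("Say hello. Then leave.", 9)
print(ctx.char(), ctx.prev_word(), ctx.next_word(), ctx.position())
```

A rule's decision logic can then be exercised against `SketchContext` instances built from short strings, without running the full detector.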

Rule Result Values

| Constant | Value | Meaning |
|----------|-------|---------|
| BOUNDARY | 1.0 | Definitely a boundary |
| NOT_BOUNDARY | 0.0 | Definitely not a boundary |
| MAYBE | 0.5 | Need more context |
| LIKELY | 0.75 | Probably a boundary |
| VERY_LIKELY | 0.90 | Almost certainly a boundary |

Troubleshooting

False Positives (Over-splitting)

If sentences are split too aggressively:

  1. Add the offending abbreviation to the custom abbreviation set
  2. Increase min_sentence_length
  3. Enable aggressive_abbreviations mode

False Negatives (Under-splitting)

If sentences aren't split enough:

  1. Check the abbreviation set for entries that suppress valid boundaries
  2. Review Rule 84 (semicolon handling)
  3. Try excluding rules that may be blocking the split

Debug Mode

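from pyset import TokenBoundaryDetector  # import path is an assumption; adjust to your install

text = "Dr. Smith arrived. He sat down."  # any input you want to inspect

detector = TokenBoundaryDetector(debug=True)
explanations = detector.explain(text)
# Shows all rules evaluated and their results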