def rbsr_split(text, max_size=1000, level=0):
    # Level 0: Section (## Header)
    # Level 1: Paragraph (\n\n)
    # Level 2: Sentence (.)
    # Level 3: Word ( )
    if len(tokenizer.encode(text)) <= max_size:
        return [text]
    delimiters = [
        ('\n## ', 'section'),    # High level
        ('\n\n', 'paragraph'),   # Medium level
        ('. ', 'sentence'),      # Low level
        (' ', 'word'),           # Minimum level
    ]
Use pdfplumber or unstructured.io to extract bounding boxes. RBS-R cares about Y-coordinates. If two text blocks share (approximately) the same Y-coordinate, they are on the same line. If the vertical gap between consecutive lines is large, it marks a new paragraph.
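The Y-coordinate rule above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `group_words` and the thresholds `line_tol` and `para_gap` (in PDF points) are assumptions, and the input is a list of dicts with `text`, `x0`, `top`, and `bottom` keys, modeled on what pdfplumber's `page.extract_words()` returns.

```python
def group_words(words, line_tol=2.0, para_gap=10.0):
    """Group word boxes into lines, then lines into paragraphs, by Y position.

    line_tol and para_gap are illustrative thresholds, not tuned values.
    """
    # Visit boxes top-to-bottom so that words of one line are adjacent.
    ordered = sorted(words, key=lambda w: (w["top"], w["x0"]))

    lines = []
    for w in ordered:
        # Same line: the top edge is within line_tol of the line's first word.
        if lines and abs(w["top"] - lines[-1]["top"]) <= line_tol:
            lines[-1]["words"].append(w)
            lines[-1]["bottom"] = max(lines[-1]["bottom"], w["bottom"])
        else:
            lines.append({"top": w["top"], "bottom": w["bottom"], "words": [w]})

    paragraphs = []
    prev_bottom = None
    for line in lines:
        # Restore left-to-right reading order inside the line.
        text = " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x0"]))
        # A small vertical gap continues the paragraph; a large one starts a new one.
        if prev_bottom is not None and line["top"] - prev_bottom <= para_gap:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev_bottom = line["bottom"]

    return [" ".join(p) for p in paragraphs]
```

With a real PDF, the `words` argument would come from something like `pdfplumber.open(path).pages[0].extract_words()`. The thresholds usually need adjusting per document, since line spacing depends on font size.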
    # Use the current level's delimiter
    delim = delimiters[level][0]
    splits = text.split(delim)

    chunks = []
    current_chunk = ""

    # ...

    # Flush whatever is left in the buffer
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

The magic of RBS-R for PDFs isn't just the splitting; it's the inheritance.
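The splitter fragments above do not show the loop that packs pieces into chunks. The following is one way to assemble them into a runnable sketch, under stated assumptions: `count_tokens` is a hypothetical whitespace-based stand-in for `len(tokenizer.encode(text))`, and the packing loop and the recursion into `level + 1` are a common greedy pattern assumed here, not confirmed by the source.

```python
DELIMITERS = [
    ("\n## ", "section"),
    ("\n\n", "paragraph"),
    (". ", "sentence"),
    (" ", "word"),
]


def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer.encode(text)); swap in a real tokenizer.
    return len(text.split())


def rbsr_split(text, max_size=1000, level=0):
    if count_tokens(text) <= max_size:
        return [text]
    if level >= len(DELIMITERS):
        # No finer delimiter is left, so return the oversized piece unchanged.
        return [text]

    delim = DELIMITERS[level][0]
    chunks = []
    current_chunk = ""

    for piece in text.split(delim):
        # Assumed greedy packing: grow the buffer while it still fits.
        candidate = current_chunk + delim + piece if current_chunk else piece
        if count_tokens(candidate) <= max_size:
            current_chunk = candidate
            continue

        # The buffer would overflow, so emit it and start over with this piece.
        if current_chunk:
            chunks.append(current_chunk)
            current_chunk = ""

        if count_tokens(piece) <= max_size:
            current_chunk = piece
        else:
            # The piece alone is too large: split it at the next finer level.
            chunks.extend(rbsr_split(piece, max_size, level + 1))

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```

Note that `str.split` drops the delimiter at every chunk boundary, so section markers (`## `) and sentence-ending periods are lost wherever a cut happens. If that matters, the delimiter can be re-attached to the start of each piece before packing.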