Show HN: Chonky – a neural approach for text semantic chunking https://ift.tt/i6fwmgq

TL;DR: I've made a transformer model and a wrapper library that segment text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences). I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated paragraphs back into the original paragraphs. Basically, it's a token classification task. Fine-tuning took a day and a half on 2x 1080 Ti GPUs.

The library could be used as a text splitter module in a RAG system or for splitting transcripts, for example. The usage pattern I see is the following: strip all the markup tags to produce plain text, then feed this text into the model.

The problem is that, although in theory this should improve overall RAG pipeline performance, I didn't manage to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://ift.tt/eRnEM8X
The transformer model: https://ift.tt/qoLbQrE...

https://ift.tt/eRnEM8X April 11, 2025 at 05:18AM
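To make the "token classification" framing in the post more concrete, here is a minimal sketch of how paragraph splitting can be cast as per-token labeling. It is not the Chonky implementation. The function names `make_example` and `split_by_labels` are hypothetical, whitespace splitting stands in for the DistilBERT tokenizer, and the labeling scheme (1 on the last token of each paragraph, 0 elsewhere) is an assumption about how such training data could be built.

```python
from typing import List, Tuple


def make_example(paragraphs: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate paragraphs into one token stream with boundary labels.

    Assumed scheme: label 1 marks the last token of a paragraph,
    label 0 marks every other token.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for paragraph in paragraphs:
        # Whitespace split is a stand-in for a real subword tokenizer.
        # Lowercasing mirrors the lowercased output mentioned in the post.
        words = paragraph.lower().split()
        if not words:
            continue  # skip empty paragraphs
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def split_by_labels(tokens: List[str], labels: List[int]) -> List[str]:
    """Rebuild chunks by cutting after every token labeled 1."""
    chunks: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no predicted boundary
        chunks.append(" ".join(current))
    return chunks


paragraphs = ["The cat sat on the mat.", "It was a sunny day."]
tokens, labels = make_example(paragraphs)
chunks = split_by_labels(tokens, labels)
print(chunks)
```

In this sketch the labels come straight from the known paragraph boundaries. At inference time, a trained model would predict them instead, and `split_by_labels` would turn those predictions back into chunks.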
