Show HN: Kreuzberg v3.0 – Modern Python Document Extraction https://ift.tt/WoNwA3c

I'm excited to announce Kreuzberg v3.0, which was released yesterday. Kreuzberg is an MIT-licensed Python library that extracts text from a wide range of documents (PDFs, images, office files, etc.) without depending on external APIs. It differs from other libraries and commercial offerings in this space by being designed to (1) be lightweight, (2) be CPU-oriented, (3) be simple to use, and (4) treat async support as a first-class citizen.

The v3.0 release completely reworks the architecture for extensibility. Kreuzberg now supports:

- Multiple OCR backends (Tesseract, PaddleOCR, EasyOCR), with OCR itself being completely optional.
- Custom extractors and overriding of built-in extractors.
- Post-processing and validation hooks.
- Extensive PDF metadata extraction.
- Optional semantic chunking.

The brand-new documentation site is at https://ift.tt/b9TwZy1 and the project roadmap is at https://ift.tt/7ylEPIh

The repo is at https://ift.tt/rQqzCSe and, if you find it valuable, please star it, since this motivates me!

March 24, 2025 at 03:24AM
