Project Description
docext is an on-premises document intelligence toolkit for OCR-free extraction of unstructured data, markdown conversion, and benchmarking. It uses vision-language models to convert PDFs and images to markdown, extract information from documents, and evaluate model performance.
Project Information
Created on 3/25/2025
Updated on 7/2/2025
Categories
image-processing
machine-learning-framework
text-processing
Tags
ready-to-use
data-processing
algorithm-model
open-source-community
model-deployment
Topics
rag
document-information-extraction
nlp
llm-ocr
document-data-extraction
onpremise
machine-learning
extraction
onprem-vision
ocr-benchmark
unstructured-data
ocr-onpremise
document
document-analysis
onprem
vlms
llms
table-extraction
ocr
onprem-ocr