Titan AI

olmocr

⭐ Stars: 15,492
🍴 Forks: 1,178
Programming Language: Python

Project Description

Toolkit for linearizing PDFs for LLM datasets/training

Project Title

olmocr — Toolkit for Linearizing PDFs and Image-Based Documents for LLM Datasets/Training

Overview

olmocr is a Python toolkit that converts PDFs and other image-based documents into clean, readable plain text. It handles complex formatting, equations, tables, handwriting, and multi-column layouts while preserving a natural reading order. Conversion runs on a 7B-parameter vision-language model (VLM) and costs less than $200 USD per million pages.

Key Features

  • Converts PDF, PNG, and JPEG documents into clean Markdown
  • Supports equations, tables, handwriting, and complex formatting
  • Automatically removes headers and footers
  • Maintains natural reading order on pages with multi-column layouts and figures
  • Efficient conversion at less than $200 USD per million pages (based on a 7B parameter VLM, requiring a GPU)
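
To put the cost figure in per-job terms, here is a minimal sketch that turns the "less than $200 USD per million pages" ceiling into an upper-bound estimate for a given page count. The function name and the assumption that cost scales linearly with page count are illustrative and not part of olmocr. Actual cost depends on the GPU and its pricing.

```python
def max_conversion_cost_usd(pages: int, usd_per_million_pages: float = 200.0) -> float:
    """Upper-bound cost estimate, assuming cost scales linearly with page count.

    The default rate is the ceiling quoted above; it is a bound, not a measured price.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * usd_per_million_pages / 1_000_000


if __name__ == "__main__":
    # Print the ceiling for a few hypothetical corpus sizes.
    for pages in (1_000, 50_000, 1_000_000):
        print(f"{pages:>9,} pages: under ${max_conversion_cost_usd(pages):,.2f}")
```

At this rate the ceiling works out to $0.0002 per page, so page count matters far less to the budget than whether a suitable GPU is available.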

Use Cases

  • Researchers and data scientists building large text datasets from PDFs to train language models
  • Document digitization projects requiring high accuracy and clean text output
  • Educational institutions digitizing textbooks and course materials

Advantages

  • High performance and accuracy in converting various document formats
  • Supports a wide range of document types and complex layouts
  • Cost-effective solution for large-scale document conversion
  • Regular updates and model improvements based on community feedback

Limitations / Considerations

  • Requires a GPU for efficient processing due to the 7B parameter VLM
  • May struggle with extremely low-quality scans or highly stylized documents

Similar / Related Projects

  • Tesseract OCR: A more traditional OCR tool that can handle a variety of languages and document types, but may not match olmocr's performance with complex layouts and formatting.
  • pdfplumber: A Python library that extracts text and tables from PDFs. It is useful for simpler extraction from machine-generated PDFs but does not perform OCR, so it cannot read scanned or image-only pages.
  • Apache PDFBox: A Java library for working with PDF documents. It offers a range of functionality but has no specific focus on OCR or document linearization.

Basic Information


📊 Project Information

  • Project Name: olmocr
  • GitHub URL: https://github.com/allenai/olmocr
  • Programming Language: Python
  • ⭐ Stars: 14,094
  • 🍴 Forks: 1,048
  • 📅 Created: 2024-09-17
  • 🔄 Last Updated: 2025-09-17

This article is automatically generated by AI based on GitHub project information and README content analysis


Project Information

Created on 9/17/2024
Updated on 10/31/2025