Project Title

surya — OCR Toolkit for 90+ Languages with Layout Analysis and Table Recognition

Overview

Surya is a document OCR toolkit that supports 90+ languages and offers line-level text detection, layout analysis, reading order detection, and table recognition. The project reports benchmark results that are competitive with cloud OCR services, and it is designed to handle a wide range of document types.

Key Features

OCR in over 90 languages, with benchmark results that are competitive with cloud services
Line-level text detection for any language
Layout analysis that detects tables, images, headers, and other elements
Reading order detection for structured document processing
Table recognition to identify rows and columns in documents
LaTeX OCR for scientific and academic documents
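
To make "line-level detection" and "reading order" more concrete, here is a minimal sketch of what ordering detected lines can mean. It does not use Surya's API and is not how Surya works internally: Surya relies on trained models, whereas this is a naive geometric heuristic. The LineBox class, the naive_reading_order function, the row_tolerance parameter, and the sample coordinates are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LineBox:
    """Hypothetical container for one detected text line (not a Surya type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_reading_order(lines, row_tolerance=10.0):
    """Sort lines top-to-bottom, then left-to-right within a row.

    A line joins the current row when its top edge is within
    `row_tolerance` pixels of the top edge of the row's first line.
    """
    remaining = sorted(lines, key=lambda b: b.y0)
    ordered = []
    row = []
    for box in remaining:
        if row and abs(box.y0 - row[0].y0) > row_tolerance:
            # The row is complete: emit it left-to-right and start a new one.
            ordered.extend(sorted(row, key=lambda b: b.x0))
            row = []
        row.append(box)
    ordered.extend(sorted(row, key=lambda b: b.x0))
    return ordered


# Made-up boxes in arbitrary detection order; not real OCR output.
page = [
    LineBox("Total: 42.00", 300, 402, 480, 420),
    LineBox("Invoice #1001", 40, 50, 260, 70),
    LineBox("Item", 40, 400, 120, 420),
    LineBox("Date: 2025-01-01", 300, 52, 480, 72),
]
ordered_text = [b.text for b in naive_reading_order(page)]
print(ordered_text)
```

A heuristic like this breaks down on multi-column pages, sidebars, and rotated text, which is the kind of case where a learned reading order model is likely to be more useful.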

Use Cases

Document digitization for libraries and archives, converting physical documents into searchable digital formats
Automated data extraction from invoices, contracts, and other business documents
Scientific research, where LaTeX OCR can help in extracting and analyzing data from research papers
Multilingual support for global companies needing to process documents in various languages

Advantages

Supports a wide range of languages, making it suitable for international applications
Open-source, allowing for community contributions and customization
Competitive performance against cloud-based OCR services
Provides detailed layout analysis and reading order detection, which can improve the accuracy of data extraction

Limitations / Considerations

The project's license is currently unknown, which may affect its use in commercial applications
While it supports many languages, the performance may vary depending on the language and document quality
The requirement for Python knowledge may be a barrier for some users
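
Because accuracy can vary with language and document quality, one common mitigation is to route low-confidence lines to manual review. The sketch below assumes each recognized line is available as a dictionary with "text" and "confidence" keys. That structure, the split_by_confidence function, the 0.8 threshold, and the sample values are assumptions made for illustration, not Surya's documented output format.

```python
def split_by_confidence(lines, threshold=0.8):
    """Partition OCR lines into (accepted, needs_review) by confidence."""
    accepted = [ln for ln in lines if ln["confidence"] >= threshold]
    needs_review = [ln for ln in lines if ln["confidence"] < threshold]
    return accepted, needs_review


# Made-up sample values for illustration only; not real OCR output.
sample_lines = [
    {"text": "Invoice #1001", "confidence": 0.98},
    {"text": "T0tal: 4Z.00", "confidence": 0.55},
    {"text": "Date: 2025-01-01", "confidence": 0.91},
]
accepted, needs_review = split_by_confidence(sample_lines)
print(len(accepted), "accepted;", len(needs_review), "flagged for review")
```

The right threshold depends on the language, the scan quality, and how costly an uncorrected error is, so it would need tuning on representative documents.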

Alternatives

Tesseract OCR: A more widely known OCR engine that supports a variety of languages but may not match Surya's performance in some of them.
Apache PDFBox: A Java library for working with PDF documents. It includes text extraction features but may not offer the same level of language support as Surya.
OCRopus: An OCR engine that focuses on accuracy but has more limited language support than Surya.

Basic Information

GitHub: https://github.com/datalab-to/surya
Stars: 18,563
License: Unknown
Last Commit: 2025-09-16

📊 Project Information

Project Name: surya
Programming Language: Python
🍴 Forks: 1,256
📅 Created: 2024-01-10
🔄 Last Updated: 2025-09-16

This article is automatically generated by AI based on GitHub project information and README content analysis
