Project Description

OCR, layout analysis, reading order, table recognition in 90+ languages

Project Title

surya — OCR Toolkit for 90+ Languages with Layout Analysis and Table Recognition

Overview

Surya is a document OCR toolkit that supports 90+ languages and provides line-level text detection, layout analysis, reading order detection, and table recognition. The project benchmarks its OCR against cloud services and reports competitive results, and it is designed to handle a wide range of document types.

Key Features

  • OCR in over 90 languages, with benchmark results competitive with cloud services
  • Line-level text detection for any language
  • Layout analysis that detects tables, images, headers, and other elements
  • Reading order detection for structured document processing
  • Table recognition to identify rows and columns in documents
  • LaTeX OCR for scientific and academic documents
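
To make the detection and reading-order features more concrete, here is a minimal Python sketch of how line-level bounding boxes could be sorted into a reading order. It does not call Surya and is not Surya's method, which uses a learned model. The `TextLine` type and the `naive_reading_order` function are hypothetical names introduced for illustration, and the top-to-bottom, left-to-right heuristic is an assumption that only holds for simple single-column pages.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """A detected line of text (hypothetical type, not part of Surya)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels, origin at the top-left corner


def _center_y(line):
    return (line.bbox[1] + line.bbox[3]) / 2


def naive_reading_order(lines, row_tolerance=0.5):
    """Order lines top to bottom, then left to right within each visual row.

    Two lines share a row when their vertical centers differ by at most
    row_tolerance times the height of the line being placed.
    """
    # Start from a rough top-to-bottom pass.
    by_top = sorted(lines, key=lambda l: (l.bbox[1], l.bbox[0]))

    rows = []
    for line in by_top:
        height = line.bbox[3] - line.bbox[1]
        if rows and abs(_center_y(line) - _center_y(rows[-1][0])) <= row_tolerance * height:
            rows[-1].append(line)  # same visual row as the previous line
        else:
            rows.append([line])    # start a new row

    # Within each row, read left to right.
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda l: l.bbox[0]))
    return ordered


if __name__ == "__main__":
    page = [
        TextLine("world", (120, 10, 200, 30)),
        TextLine("Hello", (10, 12, 100, 32)),
        TextLine("Second line", (10, 50, 200, 70)),
    ]
    for line in naive_reading_order(page):
        print(line.text)
```

A heuristic like this breaks down on multi-column layouts, sidebars, and tables, which is the gap a dedicated reading-order model is meant to close.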

Use Cases

  • Document digitization for libraries and archives, converting physical documents into searchable digital formats
  • Automated data extraction from invoices, contracts, and other business documents
  • Scientific research, where LaTeX OCR can help extract equations and data from research papers
  • Multilingual document processing for global companies that handle documents in many languages

Advantages

  • Supports a wide range of languages, making it suitable for international applications
  • Open-source, allowing for community contributions and customization
  • Competitive performance against cloud-based OCR services
  • Provides detailed layout analysis and reading order detection, which can improve the accuracy of data extraction

Limitations / Considerations

  • The license was not identified when this summary was generated; check the repository's LICENSE file and README before commercial use, since the terms for code and for model weights can differ
  • While it supports many languages, the performance may vary depending on the language and document quality
  • The requirement for Python knowledge may be a barrier for some users
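
Because accuracy can vary with language and document quality, it may help to measure it on a small sample of your own documents before committing to any OCR tool. The sketch below computes character error rate (CER), a standard OCR metric, as the edit distance between a hand-checked reference transcription and the OCR output, divided by the reference length. It is independent of Surya; the function names are hypothetical, and the OCR output is assumed to be available as a plain string.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; lower is better."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "Invoice total: 1,250.00"
    ocr_output = "Invoice tota1: 1,250.00"  # made-up OCR output with one wrong character
    print(f"CER = {character_error_rate(reference, ocr_output):.3f}")
```

Running the same check per language or per scan quality tier gives a rough picture of where results hold up and where they degrade. Note that CER can exceed 1.0 when the OCR output is much longer than the reference.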

Similar / Related Projects

  • Tesseract OCR: A more widely known open-source OCR engine that supports many languages but may not match Surya's accuracy in some of them.
  • Apache PDFBox: A Java library for working with PDF documents. It extracts text that is already embedded in a PDF and does not perform OCR on scanned images, so it addresses a different problem than Surya.
  • OCRopus: An OCR engine that focuses on accuracy but supports fewer languages than Surya.

Basic Information


📊 Project Information

  • Project Name: surya
  • GitHub URL: https://github.com/datalab-to/surya
  • Programming Language: Python
  • ⭐ Stars: 18,563
  • 🍴 Forks: 1,256
  • 📅 Created: 2024-01-10
  • 🔄 Last Updated: 2025-09-16

This article is automatically generated by AI based on GitHub project information and README content analysis

Project Information

Created on 1/10/2024
Updated on 10/31/2025