Detailed information

Software name

swissgeol-ocr

Short description

End-to-end OCR pipeline (from raw scanned PDF to searchable PDF) based on the AWS Textract cloud service

Documentation

An end-to-end OCR pipeline (from raw scanned PDF file to searchable PDF file) based on the AWS Textract cloud service (https://aws.amazon.com/de/textract/). The pipeline was developed by the Swiss Federal Office of Topography swisstopo, which uses it to digitise geological documents for internal use and for publication on the swissgeol.ch platform. In particular, the OCR pipeline has been integrated into the web applications assets.swissgeol.ch and boreholes.swissgeol.ch. The pipeline can be run as a Python script (processing either local files or objects in an S3 bucket) or deployed as an API (processing objects in an S3 bucket). Its overall functionality is similar to that of the OCRmyPDF software, but it uses AWS Textract as the underlying OCR model instead of Tesseract. Users with strong requirements regarding data protection, data sovereignty or model transparency might prefer an open-source OCR model such as Tesseract. On the other hand, a commercial API such as AWS Textract offers scalability and high OCR quality at a relatively low price per page. swisstopo's motivation for using AWS Textract and developing this OCR pipeline is documented in more detail on GitHub.
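To make the "scanned PDF to searchable PDF" step more concrete, the following is a minimal sketch of one part of such a pipeline: converting the line-level results of a Textract-style response into PDF page coordinates, so that an invisible text layer could be placed over the scan. It is not taken from the swissgeol-ocr code base. The function name `lines_to_pdf_coords`, the `TextLine` class and the sample values are hypothetical, and the sketch assumes the response shape documented for Textract's text detection (a `Blocks` list whose `LINE` entries carry a `Geometry.BoundingBox` expressed as ratios of the page size, with the origin at the top left).

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """One recognised line with its box in PDF points (origin bottom-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def lines_to_pdf_coords(response: dict, page_width: float, page_height: float) -> list:
    """Convert LINE blocks of a Textract-style response to PDF coordinates.

    Assumption: bounding boxes are ratios of the page size measured from the
    top-left corner, whereas PDF coordinates start at the bottom-left corner,
    so the vertical axis has to be flipped.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks in this sketch
        box = block["Geometry"]["BoundingBox"]
        x0 = box["Left"] * page_width
        x1 = x0 + box["Width"] * page_width
        y1 = page_height - box["Top"] * page_height  # top edge of the line
        y0 = y1 - box["Height"] * page_height        # bottom edge of the line
        lines.append(TextLine(block["Text"], x0, y0, x1, y1))
    return lines


# Hand-written sample in the assumed response shape (not real OCR output).
# A real response could come from, for example,
# boto3.client("textract").detect_document_text(Document={"Bytes": image_bytes}).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {
            "BlockType": "LINE",
            "Text": "Bohrprofil",
            "Geometry": {
                "BoundingBox": {"Left": 0.125, "Top": 0.25, "Width": 0.5, "Height": 0.0625}
            },
        },
    ]
}

if __name__ == "__main__":
    for line in lines_to_pdf_coords(sample_response, page_width=600.0, page_height=800.0):
        print(line)
```

Writing the resulting boxes into the PDF as an invisible text layer is a separate step that depends on the PDF library in use and is left out here.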

Software version

Licence

AGPL-3.0-only

publiccode.yml version

0.5.0