Detailed information
Software name
swissgeol-ocr
Repository URL
Organization
Short description
End-to-end OCR pipeline (from raw scanned PDF to searchable PDF) based on the AWS Textract cloud service
Documentation
An end-to-end OCR pipeline (from raw scanned PDF file to searchable PDF file) based on the AWS Textract cloud service (https://aws.amazon.com/de/textract/).
This pipeline was developed by the Swiss Federal Office of Topography swisstopo. At swisstopo, it is used to digitise geological documents, both for internal use and for publication on the swissgeol.ch platform. In particular, the OCR pipeline has been integrated into the web applications assets.swissgeol.ch and boreholes.swissgeol.ch.
The pipeline can be run as a Python script (processing either local files or objects in an S3 bucket) or deployed as an API (processing objects in an S3 bucket).
The overall pipeline functionality is similar to the OCRmyPDF software, but with AWS Textract as the underlying OCR model instead of Tesseract. Users who have strong requirements regarding data protection, data sovereignty or model transparency might prefer an open-source OCR model such as Tesseract. On the other hand, a commercial API such as AWS Textract brings advantages such as scalability and high OCR quality at a relatively small price per page. Swisstopo's motivation for using AWS Textract and developing this OCR pipeline is documented in more detail on GitHub.
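As an illustration of one step that any Textract-based searchable-PDF pipeline needs, the sketch below converts a Textract bounding box into PDF page coordinates, so that recognised text can be placed as an invisible layer over the scanned image. Textract reports each bounding box as `Left`, `Top`, `Width` and `Height` ratios of the page size, measured from the top-left corner, whereas PDF user space is measured in points from the bottom-left corner. The names `PdfRect` and `textract_box_to_pdf_rect`, the sample word block and the page size are hypothetical and are not taken from the swissgeol-ocr code base. The sketch also assumes an unrotated page whose origin is at (0, 0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfRect:
    """Rectangle in PDF user space (points, origin at the bottom-left corner)."""
    x0: float
    y0: float
    x1: float
    y1: float


def textract_box_to_pdf_rect(box: dict, page_width: float, page_height: float) -> PdfRect:
    """Convert a Textract BoundingBox (page ratios, top-left origin) to PDF points.

    Assumption: the page is not rotated and its lower-left corner is (0, 0).
    """
    left = box["Left"] * page_width
    width = box["Width"] * page_width
    top = box["Top"] * page_height
    height = box["Height"] * page_height
    # Flip the vertical axis: Textract measures downwards, PDF measures upwards.
    return PdfRect(
        x0=left,
        y0=page_height - top - height,
        x1=left + width,
        y1=page_height - top,
    )


# Hypothetical WORD block, shaped like an entry in a Textract response's "Blocks" list.
word_block = {
    "BlockType": "WORD",
    "Text": "Molasse",
    "Geometry": {
        "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
    },
}

# Hypothetical page size in points.
rect = textract_box_to_pdf_rect(word_block["Geometry"]["BoundingBox"], 600.0, 800.0)
print(rect)
```

A real pipeline would additionally have to account for page rotation, crop boxes and font scaling when writing the text layer; those aspects are left out here.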
Software version
License
AGPL-3.0-only
Publiccode.yml version
0.5.0