From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful.
Date: Fri, 12 Aug 2022 01:07:52 -0400
Message-ID: <20220812050752.3980-2-maxim.cournoyer@gmail.com>
In-Reply-To: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[native-search-paths]: New field.
[description]: Mention how to add support for more languages.
---
gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
(substitute* "configure.ac"
(("AC_SUBST\\(\\[XML_CATALOG_FILES])")
""))))
+ (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+ (lambda _
+ ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+ ;; specific search-path than '/share' can be specified. The
+ ;; build system uses CPPFLAGS for itself, so we can't simply set
+ ;; a make flag.
+ (substitute* "Makefile.am"
+ (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+ "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
(add-after 'build 'build-training
(lambda* (#:key parallel-build? #:allow-other-keys)
(define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
(invoke "make" "-j" n "training")))
(add-after 'install 'install-training
(lambda _
- (invoke "make" "training-install"))))))
+ (invoke "make" "training-install")))
+ (add-after 'install 'install-minimal-tessdata
+ ;; tesseract-ocr cannot be used without its trained models data;
+ ;; install the English language as a minimal base which can be
+ ;; extended via TESSDATA_PREFIX.
+ (lambda* (#:key native-inputs inputs #:allow-other-keys)
+ (define eng.traineddata
+ "/share/tesseract-ocr/tessdata/eng.traineddata")
+ (install-file (search-input-file (or native-inputs inputs)
+ eng.traineddata)
+ (dirname (string-append #$output
+ eng.traineddata))))))))
(native-inputs
(list asciidoc
autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
libtool
libxml2 ;for XML_CATALOG_FILES
libxslt
- pkg-config))
+ pkg-config
+ tesseract-ocr-tessdata-fast))
(inputs
(list cairo
icu4c
leptonica
pango
python-wrapper))
+ (native-search-paths (list (search-path-specification
+ (variable "TESSDATA_PREFIX")
+ (files (list "share/tesseract-ocr/tessdata"))
+ (separator #f)))) ;single value
(home-page "https://github.com/tesseract-ocr/tesseract")
(synopsis "Optical character recognition engine")
(description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
high accuracy. It supports many languages, output text formatting, hOCR
positional information and page layout analysis. Several image formats are
supported through the Leptonica library. It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional. Support for the English language is included by
+default. To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
(license license:asl2.0)))
(define-public gimagereader
--
2.36.1
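
As an illustration of the description change in this patch, a profile that pulls in the additional language models could be declared with a manifest along the following lines. This is only a sketch: it assumes both patches of this series are applied so that the two package names resolve, and the file name manifest.scm is a hypothetical choice.

```scheme
;; manifest.scm -- hypothetical file name, for illustration only.
;; Assumes this patch series is applied, so that both packages exist.
(specifications->manifest
 (list "tesseract-ocr"                  ;engine plus the bundled English model
       "tesseract-ocr-tessdata-fast"))  ;additional trained language models
```

With such a manifest, something like `guix shell -m manifest.scm -- tesseract --list-langs` should show whether the extra languages are visible. The expectation is that the native-search-paths entry sets TESSDATA_PREFIX to the profile's share/tesseract-ocr/tessdata directory, but this has not been verified here.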
Thread overview: 6+ messages
2022-08-12 5:05 [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR *** Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
2022-08-12 5:07 ` Maxim Cournoyer [this message]
2022-08-12 11:27 ` Simon South
2022-08-12 12:52 ` Maxim Cournoyer
[not found] ` <87bksp61wn.fsf@simonsouth.net>
2022-08-12 20:08 ` bug#57151: " Maxim Cournoyer