[bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR ***

* [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR ***
@ 2022-08-12  5:05 Maxim Cournoyer
  2022-08-12  5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
  0 siblings, 1 reply; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12  5:05 UTC
  To: 57151; +Cc: Maxim Cournoyer

Hello Guix,

This makes our tesseract-ocr package usable.  Here's a small experiment
comparing GNU Ocrad vs Tesseract on a LightDM login screendump from QEMU:

--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �

real    0m9.616s
user    0m9.397s
sys     0m0.157s

$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133

real    0m0.389s
user    0m0.602s
sys     0m0.053s
komputilo QR @ Thu, 21:32 ©

Log In
--8<---------------cut here---------------end--------------->8---

Maxim Cournoyer (2):
  gnu: Add tesseract-ocr-tessdata-fast.
  gnu: tesseract-ocr: Make the default install minimally useful.

 gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

-- 
2.36.1






* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
  2022-08-12  5:05 [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR *** Maxim Cournoyer
@ 2022-08-12  5:07 ` Maxim Cournoyer
  2022-08-12  5:07   ` [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful Maxim Cournoyer
  2022-08-12 11:27   ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
  0 siblings, 2 replies; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12  5:07 UTC
  To: 57151; +Cc: Maxim Cournoyer

* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
 gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")
-- 
2.36.1






* [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful.
  2022-08-12  5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
@ 2022-08-12  5:07   ` Maxim Cournoyer
  2022-08-12 11:27   ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
  1 sibling, 0 replies; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12  5:07 UTC
  To: 57151; +Cc: Maxim Cournoyer

* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[native-search-paths]: New field.
[description]: Mention how to add support for more languages.
---
 gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
+          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+            (lambda _
+              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+              ;; specific search-path than '/share' can be specified.  The
+              ;; build system uses CPPFLAGS for itself, so we can't simply set
+              ;; a make flag.
+              (substitute* "Makefile.am"
+                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
               (invoke "make" "-j" n "training")))
           (add-after 'install 'install-training
             (lambda _
-              (invoke "make" "training-install"))))))
+              (invoke "make" "training-install")))
+          (add-after 'install 'install-minimal-tessdata
+            ;; tesseract-ocr cannot be used without its trained models data;
+            ;; install the English language as a minimal base which can be
+            ;; extended via TESSDATA_PREFIX.
+            (lambda* (#:key native-inputs inputs #:allow-other-keys)
+              (define eng.traineddata
+                "/share/tesseract-ocr/tessdata/eng.traineddata")
+              (install-file (search-input-file (or native-inputs inputs)
+                                               eng.traineddata)
+                            (dirname (string-append #$output
+                                                    eng.traineddata))))))))
     (native-inputs
      (list asciidoc
            autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
            libtool
            libxml2                      ;for XML_CATALOG_FILES
            libxslt
-           pkg-config))
+           pkg-config
+           tesseract-ocr-tessdata-fast))
     (inputs
      (list cairo
            icu4c
            leptonica
            pango
            python-wrapper))
+    (native-search-paths (list (search-path-specification
+                                (variable "TESSDATA_PREFIX")
+                                (files (list "share/tesseract-ocr/tessdata"))
+                                (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
     (description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
 high accuracy.  It supports many languages, output text formatting, hOCR
 positional information and page layout analysis.  Several image formats are
 supported through the Leptonica library.  It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional.  Support for the English language is included by
+default.  To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
     (license license:asl2.0)))
 
 (define-public gimagereader
-- 
2.36.1






* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
  2022-08-12  5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
  2022-08-12  5:07   ` [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful Maxim Cournoyer
@ 2022-08-12 11:27   ` Simon South
  2022-08-12 12:52     ` Maxim Cournoyer
  1 sibling, 1 reply; 6+ messages in thread
From: Simon South @ 2022-08-12 11:27 UTC
  To: Maxim Cournoyer; +Cc: 57151

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.

Maxim,

Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance?  The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.

This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.

(Thanks for working on this; it's been on my to-do list for a while as
well.)

-- 
Simon South
simon@simonsouth.net


* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
  2022-08-12 11:27   ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
@ 2022-08-12 12:52     ` Maxim Cournoyer
       [not found]       ` <87bksp61wn.fsf@simonsouth.net>
  0 siblings, 1 reply; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 12:52 UTC
  To: Simon South; +Cc: 57151

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance?  The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea!  I think we could have both; Debian also has a
'tesseract-ocr-all' package for all the languages/scripts.  Which means
the individual variants could be added in at a later time by those
interested, eh :-).

A procedure returning a language-specific package variant would make
sense for that.
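
As a rough illustration only, such a procedure might look like the
sketch below.  It is not part of this patch series: the procedure name
and the "-LANG" package naming are hypothetical, and it assumes that each
language's model sits at the top level of the tessdata_fast checkout as
"LANG.traineddata" (as eng.traineddata does in patch 2/2).

```scheme
(use-modules (guix packages)
             (guix gexp)
             (gnu packages ocr))

;; Hypothetical helper; the name is an assumption made for this sketch.
(define (tesseract-ocr-tessdata-fast-for-language lang)
  "Return a variant of TESSERACT-OCR-TESSDATA-FAST that installs only the
trained data for LANG, a Tesseract language code such as \"eng\"."
  (package
    (inherit tesseract-ocr-tessdata-fast)
    (name (string-append "tesseract-ocr-tessdata-fast-" lang))
    (arguments
     ;; Replace the inherited arguments entirely: only one regular file is
     ;; copied, so the 'delete-broken-links phase is assumed unnecessary.
     (list #:install-plan
           ;; The trailing slash makes the target a directory, the same
           ;; one that TESSDATA_PREFIX points to.
           #~'((#$(string-append lang ".traineddata")
                "share/tesseract-ocr/tessdata/"))))
    (synopsis
     (string-append "Fast integer version of the trained LSTM model for '"
                    lang "'"))))

;; Example use; in gnu/packages/ocr.scm this would be a define-public.
(define tesseract-ocr-tessdata-fast-eng
  (tesseract-ocr-tessdata-fast-for-language "eng"))
```

With something along these lines, tesseract-ocr could take the "eng"
variant as an input instead of picking eng.traineddata out of the full
data set, and script models would need their own handling.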

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#57151: [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
       [not found]       ` <87bksp61wn.fsf@simonsouth.net>
@ 2022-08-12 20:08         ` Maxim Cournoyer
  0 siblings, 0 replies; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 20:08 UTC
  To: Simon South, 57151-done

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> Which means the individual variants could be added in at a later time
>> by those interested, eh :-).
>
> Subtext noted.
>
> One last thing, in case you weren't already aware: Issue 47536 was
> opened a while ago regarding the missing tessdata package, so you may
> want to link it to your own issue 57151 and/or close it once your
> changes are committed:
>
> https://issues.guix.gnu.org/47536

Thanks for pointing that out to me.  Pushed as ff0600c5ef.  I'll now close
the issue linked above.

Thanks!

Closing.

Maxim



