* [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR ***
@ 2022-08-12 5:05 Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
0 siblings, 1 reply; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 5:05 UTC
To: 57151; +Cc: Maxim Cournoyer
Hello Guix,
This makes our tesseract-ocr package usable. Here's a small experiment
comparing GNU Ocrad vs Tesseract on a LightDM login screendump from QEMU:
--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �
real 0m9.616s
user 0m9.397s
sys 0m0.157s
$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133
real 0m0.389s
user 0m0.602s
sys 0m0.053s
komputilo QR @ Thu, 21:32 ©
Log In
--8<---------------cut here---------------end--------------->8---
Maxim Cournoyer (2):
gnu: Add tesseract-ocr-tessdata-fast.
gnu: tesseract-ocr: Make the default install minimally useful.
gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 57 insertions(+), 3 deletions(-)
--
2.36.1
* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
2022-08-12 5:05 [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR *** Maxim Cournoyer
@ 2022-08-12 5:07 ` Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful Maxim Cournoyer
2022-08-12 11:27 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
0 siblings, 2 replies; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 5:07 UTC
To: 57151; +Cc: Maxim Cournoyer
* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
#:use-module (guix gexp)
#:use-module (guix git-download)
#:use-module (guix build-system cmake)
+ #:use-module (guix build-system copy)
#:use-module (guix build-system gnu)
#:use-module (guix build-system python)
#:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
it produces text in 8-bit or UTF-8 formats.")
(license license:gpl3+)))
+(define-public tesseract-ocr-tessdata-fast
+ (package
+ (name "tesseract-ocr-tessdata-fast")
+ (version "4.1.0")
+ (source (origin
+ (method git-fetch)
+ (uri (git-reference
+ (url "https://github.com/tesseract-ocr/tessdata_fast")
+ (commit version)))
+ (file-name (git-file-name name version))
+ (sha256
+ (base32
+ "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+ (build-system copy-build-system)
+ (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+ #:phases #~(modify-phases %standard-phases
+ (add-after 'unpack 'delete-broken-links
+ (lambda _
+ (delete-file "configs")
+ (delete-file "pdf.ttf"))))))
+ (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+ (synopsis "Fast integer versions of trained LSTM models")
+ (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+ (license license:asl2.0)))
+
(define-public tesseract-ocr
(package
(name "tesseract-ocr")
--
2.36.1
* [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful.
2022-08-12 5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
@ 2022-08-12 5:07 ` Maxim Cournoyer
2022-08-12 11:27 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
1 sibling, 0 replies; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 5:07 UTC
To: 57151; +Cc: Maxim Cournoyer
* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[search-paths]: New field.
[description]: Mention how to add support for more languages.
---
gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
(substitute* "configure.ac"
(("AC_SUBST\\(\\[XML_CATALOG_FILES])")
""))))
+ (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+ (lambda _
+ ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+ ;; specific search-path than '/share' can be specified. The
+ ;; build system uses CPPFLAGS for itself, so we can't simply set
+ ;; a make flag.
+ (substitute* "Makefile.am"
+ (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+ "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
(add-after 'build 'build-training
(lambda* (#:key parallel-build? #:allow-other-keys)
(define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
(invoke "make" "-j" n "training")))
(add-after 'install 'install-training
(lambda _
- (invoke "make" "training-install"))))))
+ (invoke "make" "training-install")))
+ (add-after 'install 'install-minimal-tessdata
+ ;; tesseract-ocr cannot be used without its trained models data;
+ ;; install the English language as a minimal base which can be
+ ;; extended via TESSDATA_PREFIX.
+ (lambda* (#:key native-inputs inputs #:allow-other-keys)
+ (define eng.traineddata
+ "/share/tesseract-ocr/tessdata/eng.traineddata")
+ (install-file (search-input-file (or native-inputs inputs)
+ eng.traineddata)
+ (dirname (string-append #$output
+ eng.traineddata))))))))
(native-inputs
(list asciidoc
autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
libtool
libxml2 ;for XML_CATALOG_FILES
libxslt
- pkg-config))
+ pkg-config
+ tesseract-ocr-tessdata-fast))
(inputs
(list cairo
icu4c
leptonica
pango
python-wrapper))
+ (native-search-paths (list (search-path-specification
+ (variable "TESSDATA_PREFIX")
+ (files (list "share/tesseract-ocr/tessdata"))
+ (separator #f)))) ;single value
(home-page "https://github.com/tesseract-ocr/tesseract")
(synopsis "Optical character recognition engine")
(description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
high accuracy. It supports many languages, output text formatting, hOCR
positional information and page layout analysis. Several image formats are
supported through the Leptonica library. It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional. Support for the English language is included by
+default. To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
(license license:asl2.0)))
(define-public gimagereader
--
2.36.1
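
As a rough sketch of how the two patches are meant to fit together, a
manifest along the following lines should yield a profile in which the
additional languages are found. It assumes both patches are applied, so
that tesseract-ocr declares the TESSDATA_PREFIX search path and
tesseract-ocr-tessdata-fast installs its models under
share/tesseract-ocr/tessdata. The file name manifest.scm is only an
illustration.

--8<---------------cut here---------------start------------->8---
;; manifest.scm (hypothetical file name)
;;
;; Both packages end up in the same profile.  Because tesseract-ocr
;; declares a search path for "share/tesseract-ocr/tessdata", the profile
;; is expected to export TESSDATA_PREFIX pointing at the merged
;; directory, which then contains every model from the "fast" data set.
(specifications->manifest
 (list "tesseract-ocr"
       "tesseract-ocr-tessdata-fast"))
--8<---------------cut here---------------end--------------->8---

Something like 'guix shell -m manifest.scm -- tesseract -l fra page.png out'
would then be expected to pick up fra.traineddata through
TESSDATA_PREFIX; the language code and the image name are placeholders.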
* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
2022-08-12 5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful Maxim Cournoyer
@ 2022-08-12 11:27 ` Simon South
2022-08-12 12:52 ` Maxim Cournoyer
1 sibling, 1 reply; 6+ messages in thread
From: Simon South @ 2022-08-12 11:27 UTC
To: Maxim Cournoyer; +Cc: 57151
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
Maxim,
Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance? The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.
This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.
(Thanks for working on this; it's been on my to-do list for a while as
well.)
--
Simon South
simon@simonsouth.net
* [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
2022-08-12 11:27 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
@ 2022-08-12 12:52 ` Maxim Cournoyer
[not found] ` <87bksp61wn.fsf@simonsouth.net>
0 siblings, 1 reply; 6+ messages in thread
From: Maxim Cournoyer @ 2022-08-12 12:52 UTC
To: Simon South; +Cc: 57151
Hi Simon,
Simon South <simon@simonsouth.net> writes:
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance? The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.
That's a good idea! I think we could have both, just as Debian also has
a 'tesseract-ocr-all' package for all the languages/scripts. That means
the individual variants could be added later by those interested,
eh :-).
A procedure returning a language-specific package variant would make
sense for that.
Thanks,
Maxim
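
A minimal sketch of what such a procedure could look like follows. It
is not part of the patch series. It assumes that patch 1/2 is applied
(so that tesseract-ocr-tessdata-fast exists in (gnu packages ocr)) and
that each language model sits at the top level of the repository as
LANG.traineddata. The procedure name
tesseract-ocr-tessdata-fast-for-language is hypothetical.

--8<---------------cut here---------------start------------->8---
(use-modules (guix packages)
             (guix gexp)
             (guix utils)
             (gnu packages ocr))

;; Hypothetical helper: derive a single-language package from the full
;; tesseract-ocr-tessdata-fast package.
(define (tesseract-ocr-tessdata-fast-for-language lang)
  "Return a variant of tesseract-ocr-tessdata-fast that installs only the
model for LANG, a language code such as \"eng\"."
  (package
    (inherit tesseract-ocr-tessdata-fast)
    (name (string-append "tesseract-ocr-tessdata-fast-" lang))
    (arguments
     ;; Keep the inherited phases; only swap the install plan so that a
     ;; single file is copied instead of the whole checkout.  The
     ;; trailing slash makes the target a directory.
     (substitute-keyword-arguments
         (package-arguments tesseract-ocr-tessdata-fast)
       ((#:install-plan original-plan)
        #~(list (list #$(string-append lang ".traineddata")
                      "share/tesseract-ocr/tessdata/")))))
    (synopsis
     (string-append "Fast integer version of the trained LSTM model for '"
                    lang "'"))))
--8<---------------cut here---------------end--------------->8---

With something like this, (tesseract-ocr-tessdata-fast-for-language "eng")
could stand in for the "tesseract-ocr-tessdata-fast-eng" package
mentioned above, while the full package keeps playing the role of
Debian's 'tesseract-ocr-all'.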
Thread overview: 6+ messages
2022-08-12 5:05 [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR *** Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Maxim Cournoyer
2022-08-12 5:07 ` [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful Maxim Cournoyer
2022-08-12 11:27 ` [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast Simon South
2022-08-12 12:52 ` Maxim Cournoyer
[not found] ` <87bksp61wn.fsf@simonsouth.net>
2022-08-12 20:08 ` bug#57151: " Maxim Cournoyer