* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
@ 2023-02-27 20:55 jlicht
  2023-02-27 22:43 ` Simon South
  0 siblings, 1 reply; 10+ messages in thread
From: jlicht @ 2023-02-27 20:55 UTC
  To: 61851; +Cc: Jelle Licht

From: Jelle Licht <jlicht@fsfe.org>

* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast)[source]: Add recursive?
flag. Adjust hash accordingly.
[arguments]<#:phases>: Remove unneeded workaround.
---

 gnu/packages/ocr.scm | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index c1cd4f061e..e07d40bda4 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -82,18 +82,14 @@ (define-public tesseract-ocr-tessdata-fast
               (method git-fetch)
               (uri (git-reference
                     (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (recursive? #t) ; for tessconfigs
                     (commit version)))
               (file-name (git-file-name name version))
               (sha256
                (base32
-                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+                "1hqdsy3zdy5b9l641fvhnawkw6wpb8nkvjql78q8g47js8109mhm"))))
     (build-system copy-build-system)
-    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
-                     #:phases #~(modify-phases %standard-phases
-                                  (add-after 'unpack 'delete-broken-links
-                                    (lambda _
-                                      (delete-file "configs")
-                                      (delete-file "pdf.ttf"))))))
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))))
     (home-page "https://github.com/tesseract-ocr/tessdata_fast")
     (synopsis "Fast integer versions of trained LSTM models")
     (description "This repository contains fast integer versions of trained
-- 
2.39.1






* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-27 20:55 [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files jlicht
@ 2023-02-27 22:43 ` Simon South
  2023-02-27 22:48   ` [bug#61851] [PATCH] gnu: tesseract-ocr: Use standard TESSDATA_PREFIX Simon South
  2023-02-28  0:31   ` [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files Jelle Licht
  0 siblings, 2 replies; 10+ messages in thread
From: Simon South @ 2023-02-27 22:43 UTC
  To: jlicht; +Cc: 61851

Jelle,

Respectfully, and speaking only as an interested observer, I think this
may not be the right fix.

Guix's Tesseract is indeed missing its config files, causing (among
other things) the examples in the online documentation[0] to not work,
e.g.:

  ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  read_params_file: Can't open hocr
  The (quick) [brown] {fox} jumps!
  Over the $43,456.78 <lazy> #90 dog
  (...)

But the root issue appears to be a misconfiguration of the
TESSDATA_PREFIX search path in the tesseract-ocr package, which causes
Tesseract's own config files to be installed in a folder other than the
one it's configured to search.

Fixing this places Tesseract's config files and the trained-data files
together beneath share/tessdata, allowing Tesseract to work as
expected:

  ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  (...)

This approach has the advantage of keeping the
tesseract-ocr-tessdata-fast package "pure" and focused only on
trained-data files, which will be important for the patch I'm working on
that will split it into multiple packages, one for each language and
script, to allow greater flexibility.

I'll respond to this email with a draft (!) patch to tesseract-ocr that
should achieve the same result as yours, making the config files
available for use.  Does this also fix the problem for you?  If so,
would you consider submitting this change instead?

-- 
Simon South
simon@simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html





* [bug#61851] [PATCH] gnu: tesseract-ocr: Use standard TESSDATA_PREFIX.
  2023-02-27 22:43 ` Simon South
@ 2023-02-27 22:48   ` Simon South
  2023-02-28  0:31   ` [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files Jelle Licht
  1 sibling, 0 replies; 10+ messages in thread
From: Simon South @ 2023-02-27 22:48 UTC
  To: jlicht; +Cc: 61851

---
 gnu/packages/ocr.scm | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index c1cd4f061e..fc069b83e3 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -88,7 +88,7 @@ (define-public tesseract-ocr-tessdata-fast
                (base32
                 "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
     (build-system copy-build-system)
-    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+    (arguments (list #:install-plan #~'(("." "share/tessdata"))
                      #:phases #~(modify-phases %standard-phases
                                   (add-after 'unpack 'delete-broken-links
                                     (lambda _
@@ -131,15 +131,6 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
-          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
-            (lambda _
-              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
-              ;; specific search-path than '/share' can be specified.  The
-              ;; build system uses CPPFLAGS for itself, so we can't simply set
-              ;; a make flag.
-              (substitute* "Makefile.am"
-                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
-                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -155,7 +146,7 @@ (define n (if parallel-build? (number->string
             ;; extended via TESSDATA_PREFIX.
             (lambda* (#:key native-inputs inputs #:allow-other-keys)
               (define eng.traineddata
-                "/share/tesseract-ocr/tessdata/eng.traineddata")
+                "/share/tessdata/eng.traineddata")
               (install-file (search-input-file (or native-inputs inputs)
                                                eng.traineddata)
                             (dirname (string-append #$output
@@ -183,7 +174,7 @@ (define eng.traineddata
      (list leptonica))
     (native-search-paths (list (search-path-specification
                                 (variable "TESSDATA_PREFIX")
-                                (files (list "share/tesseract-ocr/tessdata"))
+                                (files (list "share/tessdata"))
                                 (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
-- 
2.39.1






* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-27 22:43 ` Simon South
  2023-02-27 22:48   ` [bug#61851] [PATCH] gnu: tesseract-ocr: Use standard TESSDATA_PREFIX Simon South
@ 2023-02-28  0:31   ` Jelle Licht
  2023-02-28 15:00     ` Simon South
  1 sibling, 1 reply; 10+ messages in thread
From: Jelle Licht @ 2023-02-28  0:31 UTC
  To: Simon South; +Cc: 61851, Maxim Cournoyer

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Jelle,
>
> Respectfully, and speaking only as an interested observer, I think this
> may not be the right fix.

Cunningham's law strikes again :) [1].

>
> Guix's Tesseract is indeed missing its config files, causing (among
> other things) the examples in the online documentation[0] to not work,
> e.g.:
>
>   ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
>   read_params_file: Can't open hocr
>   The (quick) [brown] {fox} jumps!
>   Over the $43,456.78 <lazy> #90 dog
>   (...)
>
> But the root issue appears to be a misconfiguration of the
> TESSDATA_PREFIX search path in the tesseract-ocr package, which causes
> Tesseract's own config files to be installed in a folder other than the
> one it's configured to search.
>
> Fixing this places Tesseract's config files and the trained-data files
> together beneath share/tessdata, allowing Tesseract to work as
> expected:
>
>   ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
>   <?xml version="1.0" encoding="UTF-8"?>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>   (...)

I will believe you without any doubt, but there's this spooky comment
left in the tesseract-ocr 'adjust-TESSDATA_PREFIX-macro phase:

--8<---------------cut here---------------start------------->8---
  ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
  ;; specific search-path than '/share' can be specified.  The
  ;; build system uses CPPFLAGS for itself, so we can't simply set
  ;; a make flag.
--8<---------------cut here---------------end--------------->8---

This makes me believe the current situation was a deliberate choice, but
I personally don't understand what the original problem was/is.

> This approach has the advantage of keeping the
> tesseract-ocr-tessdata-fast package "pure" and focused only on
> trained-data files, which will be important for the patch I'm working on
> that will split it into multiple packages, one for each language and
> script, to allow greater flexibility.
>
> I'll respond to this email with a draft (!) patch to tesseract-ocr that
> should achieve the same result as yours, making the config files
> available for use.  Does this also fix the problem for you?  If so,
> would you consider submitting this change instead?

It seems to work for my stuff!  I'm bringing in Maxim to weigh in on
this, as they are the (un?)lucky expert according to my git-fu.

Thanks for paying attention!
- Jelle

[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law





* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-28  0:31   ` [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files Jelle Licht
@ 2023-02-28 15:00     ` Simon South
  2023-02-28 15:35       ` Maxim Cournoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Simon South @ 2023-02-28 15:00 UTC
  To: Jelle Licht; +Cc: 61851, Maxim Cournoyer

Jelle Licht <jlicht@fsfe.org> writes:
> Cunningham's law strikes again :)

Ha, interesting.  That one's new to me.

> This makes me believe the current situation was a deliberate choice...

Yes, it was, and I realize now I didn't provide much in the way of
rationale in my previous email.  So here's the background information
for anyone interested:

Tesseract normally expects to find its data files in /usr/share/tessdata
and subfolders thereof.  We'd like to use Guix's native-search-paths
functionality to pull together data from (for instance) multiple
language-specific data packages, and Tesseract conveniently honours a
TESSDATA_PREFIX environment variable that specifies its data folder's
location, so it seems we are all set.

What should TESSDATA_PREFIX be set to?  Tesseract's documentation[0]
says

  TESSDATA_PREFIX environment variable should be set to the parent
  directory of “tessdata” directory.

So "share" then, presumably, to have the data files located at
"share/tessdata".  The man page[1] seems to confirm this:

  To use a non-standard language pack named foo.traineddata, set the
  TESSDATA_PREFIX environment variable so the file can be found at
  TESSDATA_PREFIX/tessdata/foo.traineddata...

This creates a problem, though, since defining a native-search-path of
just "share" will pull in files from virtually every single Guix
package.  The solution then is to introduce an intermediate folder,
"tesseract-ocr", that sidesteps this problem, and to configure Tesseract
appropriately at build time so it installs its data files to
"share/tesseract-ocr/tessdata" instead.  This is why the existing code
was written the way it was and what the comment you pointed out is
referring to.

However there's a problem with this, too: Patching Makefile.am the way
the code does results in only some of Tesseract's data files being
placed in "share/tesseract-ocr/tessdata"; you can see in the package
output there is still a "share/tessdata" folder that contains
Tesseract's config files.  Since these aren't also placed beneath
"share/tesseract-ocr/tessdata" Tesseract can't find them at runtime.

The solution to this seems to be to remove this phase and instead use
the "--datadir" configure flag to specify the desired data-folder path.
Doing this results in all of Tesseract's data files being installed
beneath "share/tesseract-ocr/tessdata" and the resulting package works
as you'd expect.

However the problem with this is... none of it is necessary in the first
place!  It turns out Tesseract's documentation is simply WRONG and the
program actually expects TESSDATA_PREFIX to contain the complete path to
the "tessdata" data folder, not the path of the folder directly above
it.  So Tesseract can be built as-is, the native-search-path can be
safely defined as "share/tessdata", and everything just works.

This is what the patch I passed on yesterday does.
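
As a rough illustration of the difference described above, the following
shell sketch contrasts the lookup the documentation describes with the
one Tesseract appears to perform.  The two function names and the
/example/profile path are hypothetical and exist only for this example;
nothing here invokes Tesseract itself.

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical helpers, not part of Tesseract.

# Lookup as the documentation describes it: TESSDATA_PREFIX names the
# parent of the "tessdata" directory.
documented_path() {
    printf '%s/tessdata/%s.traineddata\n' "$1" "$2"
}

# Lookup as observed in this thread: TESSDATA_PREFIX names the "tessdata"
# directory itself.
observed_path() {
    printf '%s/%s.traineddata\n' "$1" "$2"
}

prefix=/example/profile/share/tessdata   # hypothetical profile path
documented_path "$prefix" eng
observed_path "$prefix" eng
```

Under the documented reading, a search path of "share/tessdata" would
send Tesseract one "tessdata" level too deep; under the observed reading
it resolves straight to the trained-data file.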

-- 
Simon South
simon@simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image

[1] https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc





* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-28 15:00     ` Simon South
@ 2023-02-28 15:35       ` Maxim Cournoyer
  2023-02-28 16:40         ` Simon South
  0 siblings, 1 reply; 10+ messages in thread
From: Maxim Cournoyer @ 2023-02-28 15:35 UTC
  To: Simon South; +Cc: Jelle Licht, 61851

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Jelle Licht <jlicht@fsfe.org> writes:
>> Cunningham's law strikes again :)
>
> Ha, interesting.  That one's new to me.
>
>> This makes me believe the current situation was a deliberate choice...
>
> Yes, it was, and I realize now I didn't provide much in the way of
> rationale in my previous email.  So here's the background information
> for anyone interested:
>
> Tesseract normally expects to find its data files in /usr/share/tessdata
> and subfolders thereof.  We'd like to use Guix's native-search-paths
> functionality to pull together data from (for instance) multiple
> language-specific data packages, and Tesseract conveniently honours a
> TESSDATA_PREFIX environment variable that specifies its data folder's
> location, so it seems we are all set.
>
> What should TESSDATA_PREFIX be set to?  Tesseract's documentation[0]
> says
>
>   TESSDATA_PREFIX environment variable should be set to the parent
>   directory of “tessdata” directory.
>
> So "share" then, presumably, to have the data files located at
> "share/tessdata".  The man page[1] seems to confirm this:
>
>   To use a non-standard language pack named foo.traineddata, set the
>   TESSDATA_PREFIX environment variable so the file can be found at
>   TESSDATA_PREFIX/tessdata/foo.traineddata...
>
> This creates a problem, though, since defining a native-search-path of
> just "share" will pull in files from virtually every single Guix
> package.  The solution then is to introduce an intermediate folder,
> "tesseract-ocr", that sidesteps this problem, and to configure Tesseract
> appropriately at build time so it installs its data files to
> "share/tesseract-ocr/tessdata" instead.  This is why the existing code
> was written the way it was and what the comment you pointed out is
> referring to.
>
> However there's a problem with this, too: Patching Makefile.am the way
> the code does results in only some of Tesseract's data files being
> placed in "share/tesseract-ocr/tessdata"; you can see in the package
> output there is still a "share/tessdata" folder that contains
> Tesseract's config files.  Since these aren't also placed beneath
> "share/tesseract-ocr/tessdata" Tesseract can't find them at runtime.
>
> The solution to this seems to be to remove this phase and instead use
> the "--datadir" configure flag to specify the desired data-folder path.
> Doing this results in all of Tesseract's data files being installed
> beneath "share/tesseract-ocr/tessdata" and the resulting package works
> as you'd expect.
>
> However the problem with this is... none of it is necessary in the first
> place!  It turns out Tesseract's documentation is simply WRONG and the
> program actually expects TESSDATA_PREFIX to contain the complete path to
> the "tessdata" data folder, not the path of the folder directly above
> it.  So Tesseract can be built as-is, the native-search-path can be
> safely defined as "share/tessdata", and everything just works.
>
> This is what the patch I passed on yesterday does.

Thanks for explaining, that makes sense!  Would you be so kind as to
open an issue with upstream about the misleading doc?  That'd complete
it and avoid any confusion in the future.

-- 
Thanks,
Maxim





* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-28 15:35       ` Maxim Cournoyer
@ 2023-02-28 16:40         ` Simon South
  2023-02-28 21:41           ` Maxim Cournoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Simon South @ 2023-02-28 16:40 UTC
  To: Maxim Cournoyer; +Cc: Jelle Licht, 61851

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
> Would you be so kind as to open an issue with upstream about the
> misleading doc?

I would've submitted a patch already were the project not using GitHub.
I don't have a GitHub account and don't intend to get one.

Would anyone else be willing to open an issue on this?

-- 
Simon South
simon@simonsouth.net





* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-28 16:40         ` Simon South
@ 2023-02-28 21:41           ` Maxim Cournoyer
  2023-03-16 20:38             ` Jelle Licht
  0 siblings, 1 reply; 10+ messages in thread
From: Maxim Cournoyer @ 2023-02-28 21:41 UTC
  To: Simon South; +Cc: Jelle Licht, 61851

Hello,

Simon South <simon@simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>> Would you be so kind as to open an issue with upstream about the
>> misleading doc?
>
> I would've submitted a patch already were the project not using GitHub.
> I don't have a GitHub account and don't intend to get one.
>
> Would anyone else be willing to open an issue on this?

No problem; see: https://github.com/tesseract-ocr/tesseract/issues/4025.

-- 
Thanks,
Maxim





* [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-02-28 21:41           ` Maxim Cournoyer
@ 2023-03-16 20:38             ` Jelle Licht
  2023-03-21  3:13               ` bug#61851: " Maxim Cournoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Jelle Licht @ 2023-03-16 20:38 UTC
  To: Maxim Cournoyer, Simon South; +Cc: 61851


Hey folks,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> Hello,
>
> Simon South <simon@simonsouth.net> writes:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>>> Would you be so kind as to open an issue with upstream about the
>>> misleading doc?
>>
>> I would've submitted a patch already were the project not using GitHub.
>> I don't have a GitHub account and don't intend to get one.
>>
>> Would anyone else be willing to open an issue on this?
>
> No problem; see: https://github.com/tesseract-ocr/tesseract/issues/4025.

So it seems the issue was confirmed.  In addition, there seem to be some
inconsistencies between build systems with regard to how the data dir is
interpreted by tesseract:

https://github.com/tesseract-ocr/tesseract/issues/4026

I think it makes sense for us to apply [a version of] Simon's patch.  QA
also seems to show green lights, ignoring the unrelated recent
openmpi-related failures.

WDYT?
 - Jelle





* bug#61851: [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
  2023-03-16 20:38             ` Jelle Licht
@ 2023-03-21  3:13               ` Maxim Cournoyer
  0 siblings, 0 replies; 10+ messages in thread
From: Maxim Cournoyer @ 2023-03-21  3:13 UTC
  To: Jelle Licht; +Cc: 61851-done, Simon South

Hello,

Jelle Licht <jlicht@fsfe.org> writes:

> Hey folks,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello,
>>
>> Simon South <simon@simonsouth.net> writes:
>>
>>> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>>>> Would you be so kind as to open an issue with upstream about the
>>>> misleading doc?
>>>
>>> I would've submitted a patch already were the project not using GitHub.
>>> I don't have a GitHub account and don't intend to get one.
>>>
>>> Would anyone else be willing to open an issue on this?
>>
>> No problem; see: https://github.com/tesseract-ocr/tesseract/issues/4025.
>
> So it seems the issue was confirmed.  In addition, there seem to be some
> inconsistencies between build systems with regard to how the data dir is
> interpreted by tesseract:
>
> https://github.com/tesseract-ocr/tesseract/issues/4026
>
> I think it makes sense for us to apply [a version of] Simon's patch.  QA
> also seems to show green lights, ignoring the unrelated recent
> openmpi-related failures.
>
> WDYT?

I've now applied it, after writing a proper change log commit message,
and running the xvnc and lightdm system tests to get some confidence
(they make use of tesseract-ocr).

Thank you for looking into it!

-- 
Thanks,
Maxim




