From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Simon South <simon@simonsouth.net>
Cc: Jelle Licht <jlicht@fsfe.org>, 61851@debbugs.gnu.org
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Tue, 28 Feb 2023 10:35:31 -0500
Message-ID: <87mt4xiz0c.fsf@gmail.com>
In-Reply-To: <87h6v53kdn.fsf@simonsouth.net> (Simon South's message of "Tue, 28 Feb 2023 10:00:36 -0500")

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Jelle Licht <jlicht@fsfe.org> writes:
>> Cunningham's law strikes again :)
>
> Ha, interesting.  That one's new to me.
>
>> This makes me believe the current situation was a deliberate choice...
>
> Yes, it was, and I realize now I didn't provide much in the way of
> rationale in my previous email.  So here's the background information
> for anyone interested:
>
> Tesseract normally expects to find its data files in /usr/share/tessdata
> and subfolders thereof.  We'd like to use Guix's native-search-paths
> functionality to pull together data from (for instance) multiple
> language-specific data packages, and Tesseract conveniently honours a
> TESSDATA_PREFIX environment variable that specifies its data folder's
> location, so it seems we are all set.
>
> What should TESSDATA_PREFIX be set to?  Tesseract's documentation[0]
> says
>
>   TESSDATA_PREFIX environment variable should be set to the parent
>   directory of “tessdata” directory.
>
> So "share" then, presumably, to have the data files located at
> "share/tessdata".  The man page[1] seems to confirm this:
>
>   To use a non-standard language pack named foo.traineddata, set the
>   TESSDATA_PREFIX environment variable so the file can be found at
>   TESSDATA_PREFIX/tessdata/foo.traineddata...
>
> This creates a problem, though, since defining a native-search-path of
> just "share" will pull in files from virtually every single Guix
> package.  The solution then is to introduce an intermediate folder,
> "tesseract-ocr", that sidesteps this problem, and to configure Tesseract
> appropriately at build time so it installs its data files to
> "share/tesseract-ocr/tessdata" instead.  This is why the existing code
> was written the way it was and what the comment you pointed out is
> referring to.
>
> However, there's a problem with this, too: patching Makefile.am the way
> the code does results in only some of Tesseract's data files being
> placed in "share/tesseract-ocr/tessdata"; you can see in the package
> output there is still a "share/tessdata" folder that contains
> Tesseract's config files.  Since these aren't also placed beneath
> "share/tesseract-ocr/tessdata", Tesseract can't find them at runtime.
>
> The solution to this seems to be to remove this phase and instead use
> the "--datadir" configure flag to specify the desired data-folder path.
> Doing this results in all of Tesseract's data files being installed
> beneath "share/tesseract-ocr/tessdata" and the resulting package works
> as you'd expect.
>
> However the problem with this is... none of it is necessary in the first
> place!  It turns out Tesseract's documentation is simply WRONG and the
> program actually expects TESSDATA_PREFIX to contain the complete path to
> the "tessdata" data folder, not the path of the folder directly above
> it.  So Tesseract can be built as-is, the native-search-path can be
> safely defined as "share/tessdata", and everything just works.
>
> This is what the patch I passed on yesterday does.

Thanks for explaining, that makes sense!  Would you be so kind as to
open an issue upstream about the misleading documentation?  That would
round out this fix and avoid any confusion in the future.
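
For reference, here is a minimal sketch of the search path Simon
describes, with TESSDATA_PREFIX pointing at the "tessdata" folder itself
rather than at its parent.  The variable name %tessdata-search-path is
made up for illustration, and the single-value separator is an
assumption on my part; the actual patch may differ.

```scheme
;; Sketch only: assumes the (guix search-paths) module is on the load path.
(use-modules (guix search-paths))

;; Hypothetical name; in a package definition this specification would
;; go in the native-search-paths field instead.
(define %tessdata-search-path
  (search-path-specification
   (variable "TESSDATA_PREFIX")
   ;; The "tessdata" directory itself, not its parent "share".
   (files (list "share/tessdata"))
   ;; Assumption: a single directory rather than a colon-separated list.
   (separator #f)))
```

With something like this in place, a profile containing several language
data packages would expose their combined "share/tessdata" folder
through one variable, with no intermediate "tesseract-ocr" folder needed.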

-- 
Thanks,
Maxim




