unofficial mirror of guix-patches@gnu.org 
From: Jelle Licht <jlicht@fsfe.org>
To: Simon South <simon@simonsouth.net>
Cc: 61851@debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Tue, 28 Feb 2023 01:31:40 +0100
Message-ID: <87bkle4olv.fsf@fsfe.org>
In-Reply-To: <878rgik9uo.fsf@simonsouth.net>

Hi Simon,

Simon South <simon@simonsouth.net> writes:

> Jelle,
>
> Respectfully, and speaking only as an interested observer, I think this
> may not be the right fix.

Cunningham's law strikes again :) [1].

>
> Guix's Tesseract is indeed missing its config files, causing (among
> other things) the examples in the online documentation[0] to not work,
> e.g.:
>
>   ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
>   read_params_file: Can't open hocr
>   The (quick) [brown] {fox} jumps!
>   Over the $43,456.78 <lazy> #90 dog
>   (...)
>
> But the root issue appears to be a misconfiguration of the
> TESSDATA_PREFIX search path in the tessdata-ocr package, which causes
> Tesseract's own config files to be installed in a folder other than the
> one it's configured to search.
>
> Fixing this places Tesseract's config files and the trained-data files
> together beneath /usr/share/tessdata, allowing Tesseract to work as
> expected:
>
>   ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
>   <?xml version="1.0" encoding="UTF-8"?>
>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>   (...)

I will believe you without any doubt, but there's this spooky comment
left in the tesseract-ocr 'adjust-TESSDATA_PREFIX-macro phase:

--8<---------------cut here---------------start------------->8---
  ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
  ;; specific search-path than '/share' can be specified.  The
  ;; build system uses CPPFLAGS for itself, so we can't simply set
  ;; a make flag.
--8<---------------cut here---------------end--------------->8---

This makes me believe the current situation was a deliberate choice, but
I personally don't understand what the original problem was/is.
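
For anyone following along, here is a rough way to check whether a given
config file sits beneath the directory Tesseract searches.  This is only
a sketch: the helper name has_tess_config is made up for illustration,
and it assumes TESSDATA_PREFIX points at the tessdata directory itself,
with config files kept in its "configs" and "tessconfigs" subdirectories.

```shell
# Hypothetical helper, not part of Tesseract or Guix.
# Usage: has_tess_config TESSDATA_DIR CONFIG_NAME
# Prints the path of the first matching config file and returns 0,
# or prints nothing and returns 1 if the file is not found.
has_tess_config() {
  prefix=$1
  name=$2
  # Assumed layout: config files live in these two subdirectories.
  for sub in configs tessconfigs; do
    if [ -f "$prefix/$sub/$name" ]; then
      printf '%s\n' "$prefix/$sub/$name"
      return 0
    fi
  done
  return 1
}
```

Something like

  has_tess_config "$TESSDATA_PREFIX" hocr || echo "hocr config not found"

should then tell you whether the "Can't open hocr" message is to be
expected in a given environment.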

> This approach has the advantage of keeping the
> tesseract-ocr-tessdata-fast package "pure" and focused only on
> trained-data files, which will be important for the patch I'm working on
> that will split it into multiple packages, one for each language and
> script, to allow greater flexibility.
>
> I'll respond to this email with a draft (!) patch to tesseract-ocr that
> should achieve the same result as yours, making the config files
> available for use.  Does this also fix the problem for you?  If so,
> would you consider submitting this change instead?

It seems to work for my stuff! I'm bringing Maxim in to weigh in on this, as
they are the (un?)lucky expert according to my git-fu.

Thanks for paying attention!
- Jelle

[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law