unofficial mirror of guix-patches@gnu.org 
From: Simon South <simon@simonsouth.net>
To: jlicht@fsfe.org
Cc: 61851@debbugs.gnu.org
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Mon, 27 Feb 2023 17:43:43 -0500
Message-ID: <878rgik9uo.fsf@simonsouth.net>
In-Reply-To: <fed85bc978d9469832e5aaad737a8816d5f49fa7.1677531307.git.jlicht@fsfe.org> (jlicht@fsfe.org's message of "Mon, 27 Feb 2023 21:55:16 +0100")

Jelle,

Respectfully, and speaking only as an interested observer, I think this
may not be the right fix.

Guix's Tesseract is indeed missing its config files, which causes (among
other things) the examples in the online documentation[0] to fail,
e.g.:

  ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  read_params_file: Can't open hocr
  The (quick) [brown] {fox} jumps!
  Over the $43,456.78 <lazy> #90 dog
  (...)

But the root issue appears to be a misconfiguration of the
TESSDATA_PREFIX search path in the tesseract-ocr package, which causes
Tesseract's own config files to be installed in a folder other than the
one it's configured to search.

Fixing this places Tesseract's config files and the trained-data files
together beneath /usr/share/tessdata, allowing Tesseract to work as
expected:

  ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  (...)
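
To tell which of the two situations above applies on a given system, one
can check whether the "hocr" config file actually sits beneath the
directory Tesseract searches.  The following is only a minimal sketch:
find_tess_config is a hypothetical helper (not part of Tesseract or
Guix), and it assumes that TESSDATA_PREFIX names the tessdata directory
itself and that config files live in its "configs" or "tessconfigs"
subdirectory.

```shell
# Hypothetical helper: print the path of a Tesseract config file (e.g.
# "hocr") if it exists beneath a tessdata directory; fail otherwise.
# Assumption: configs live in "configs/" or "tessconfigs/" under that
# directory, which defaults to $TESSDATA_PREFIX when no second argument
# is given.
find_tess_config() {
  name=$1
  dir=${2:-${TESSDATA_PREFIX:-}}
  for sub in configs tessconfigs; do
    if [ -f "$dir/$sub/$name" ]; then
      printf '%s\n' "$dir/$sub/$name"
      return 0
    fi
  done
  return 1
}
```

For example, running

  find_tess_config hocr || echo "hocr config not found"

in the affected environment would be expected to print the "not found"
message for as long as the config files are installed somewhere other
than the searched directory.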

This approach has the advantage of keeping the
tesseract-ocr-tessdata-fast package "pure" and focused only on
trained-data files, which will be important for the patch I'm working on
that will split it into multiple packages, one for each language and
script, to allow greater flexibility.

I'll respond to this email with a draft (!) patch to tesseract-ocr that
should achieve the same result as yours, making the config files
available for use.  Does this also fix the problem for you?  If so,
would you consider submitting this change instead?

-- 
Simon South
simon@simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
