Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
From: Simon South
To: Jelle Licht
Cc: 61851@debbugs.gnu.org, Maxim Cournoyer
Date: Tue, 28 Feb 2023 10:00:36 -0500
Message-ID: <87h6v53kdn.fsf@simonsouth.net>
In-Reply-To: <87bkle4olv.fsf@fsfe.org> (Jelle Licht's message of "Tue, 28 Feb 2023 01:31:40 +0100")
References: <878rgik9uo.fsf@simonsouth.net> <87bkle4olv.fsf@fsfe.org>

Jelle Licht writes:

> Cunningham's law strikes again :)

Ha, interesting. That one's new to me.

> This makes me believe the current situation was a deliberate choice...

Yes, it was, and I realize now I didn't provide much in the way of
rationale in my previous email. So here's the background information
for anyone interested:

Tesseract normally expects to find its data files in
/usr/share/tessdata and subfolders thereof. We'd like to use Guix's
native-search-paths functionality to pull together data from (for
instance) multiple language-specific data packages, and Tesseract
conveniently honours a TESSDATA_PREFIX environment variable that
specifies its data folder's location, so it seems we are all set.

What should TESSDATA_PREFIX be set to? Tesseract's documentation[0]
says

  TESSDATA_PREFIX environment variable should be set to the parent
  directory of “tessdata” directory.

So "share" then, presumably, to have the data files located at
"share/tessdata". The man page[1] seems to confirm this:

  To use a non-standard language pack named foo.traineddata, set the
  TESSDATA_PREFIX environment variable so the file can be found at
  TESSDATA_PREFIX/tessdata/foo.traineddata...

This creates a problem, though, since defining a native-search-path of
just "share" will pull in files from virtually every single Guix
package.
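To make the problem concrete, here is a minimal Python sketch of the
documented lookup rule and of why a search path of "share" is too
broad. It is an illustration only, not Guix or Tesseract code: the
function names and the three example packages with their file lists
are made up, and the matching rule (a package matches when it contains
the named directory) is a simplification.

```python
from pathlib import PurePosixPath

# Hypothetical package contents, invented purely for illustration.
PACKAGES = {
    "tesseract-ocr-tessdata-fast": ["share/tessdata/eng.traineddata"],
    "hello": ["bin/hello", "share/man/man1/hello.1"],
    "emacs": ["bin/emacs", "share/emacs/site-lisp/subdirs.el"],
}


def documented_traineddata_path(tessdata_prefix, lang):
    """Path implied by the documentation quoted above:
    TESSDATA_PREFIX/tessdata/<lang>.traineddata."""
    return PurePosixPath(tessdata_prefix) / "tessdata" / f"{lang}.traineddata"


def search_path_matches(packages, subdir):
    """Names of the packages that contain the directory `subdir`,
    i.e. the ones a search path naming `subdir` would collect."""
    prefix = subdir.rstrip("/") + "/"
    return [name for name, files in packages.items()
            if any(path.startswith(prefix) for path in files)]


# Under the documented rule the variable has to name "share" itself...
print(documented_traineddata_path("/profile/share", "foo"))
# ...and "share" exists in every one of the example packages.
print(search_path_matches(PACKAGES, "share"))
# A narrower directory would match only the data package.
print(search_path_matches(PACKAGES, "share/tessdata"))
```

With these made-up inputs, "share" matches all three packages, while
the narrower directory matches only the data package.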

The solution then is to introduce an intermediate folder,
"tesseract-ocr", that sidesteps this problem, and to configure
Tesseract appropriately at build time so it installs its data files to
"share/tesseract-ocr/tessdata" instead. This is why the existing code
was written the way it was and what the comment you pointed out is
referring to.

However there's a problem with this, too: Patching Makefile.am the way
the code does results in only some of Tesseract's data files being
placed in "share/tesseract-ocr/tessdata"; you can see in the package
output there is still a "share/tessdata" folder that contains
Tesseract's config files. Since these aren't also placed beneath
"share/tesseract-ocr/tessdata" Tesseract can't find them at runtime.

The solution to this seems to be to remove this phase and instead use
the "--datadir" configure flag to specify the desired data-folder
path. Doing this results in all of Tesseract's data files being
installed beneath "share/tesseract-ocr/tessdata" and the resulting
package works as you'd expect.

However the problem with this is... none of it is necessary in the
first place! It turns out Tesseract's documentation is simply WRONG
and the program actually expects TESSDATA_PREFIX to contain the
complete path to the "tessdata" data folder, not the path of the folder
directly above it.

So Tesseract can be built as-is, the native-search-path can be safely
defined as "share/tessdata", and everything just works. This is what
the patch I passed on yesterday does.

-- 
Simon South
simon@simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image
[1] https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc