From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexis
Newsgroups: gmane.emacs.devel
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 17:07:25 +1000
Message-ID: <87y19bywr6.fsf@gmail.com>
References: <87bk671b7l.fsf@gmail.com> <86msprfbul.fsf@gnu.org>
In-Reply-To: <86msprfbul.fsf@gnu.org>
	(Eli Zaretskii's message of "Thu, 18 Apr 2024 09:01:38 +0300")
Mime-Version: 1.0
Content-Type: text/plain; format=flowed
User-Agent: mu4e 1.12.4; emacs 29.3
To: Eli Zaretskii
Cc: emacs-devel@gnu.org

Eli Zaretskii writes:

> Crystal ball says the package assumes UTF-8 encoding of the text
> from the sub-process, which is generally not what happens on
> Windows. Or maybe the package assumes that UTF-8 text from a
> sub-process will necessarily be decoded as UTF-8, which again
> can fail if the default coding-systems are not UTF-8 (which
> happens on Windows). The upshot is that the Lisp code expects
> some number of characters, but gets a different number of
> characters instead.
>
> But this is all basically stabbing in the dark, since I have no
> idea what that package does and what the program whose output it
> reads does.

Hi Eli,

Thanks for your prompt reply.

Sorry for my email not being more descriptive and self-contained.
i linked to the GitHub issue:

https://github.com/flexibeast/ebuku/issues/32

as there is already an extended discussion there about this issue,
which itself links to a previous issue and discussion:

https://github.com/flexibeast/ebuku/issues/31

in which the user first reported an "Invalid string for collation"
issue.
That issue was addressed, after some discussion, by setting
LC_ALL to the same value that the user had set LANG, i.e.
"zh_CN.UTF-8". That left us with issue 32, which is the one i'm
asking about here.

Some better background about the software involved:

`buku` provides a command-line interface to an SQLite-based
database of Web bookmarks, allowing one to save, delete and search
for bookmarks, with each bookmark able to have a comment and tags
associated with it.

`Ebuku` is a package that provides an Emacs-based UI for buku. It
allows the user to add bookmarks, edit them, remove them, search
them etc. without actually leaving Emacs. It does so by running
`call-process` to call `buku` with the appropriate options,
receiving the resulting output in a buffer, then processing the
data in that buffer in order to present the user with the relevant
results.

ebuku.el has a function:

    (defun ebuku--call-buku (args)
      "Internal function for calling `buku' with list ARGS."
      (unless ebuku-buku-path
        (error "Couldn't find buku: check 'ebuku-buku-path'"))
      (apply #'call-process
             `(,ebuku-buku-path nil t nil
               "--np" "--nc"
               "--db" ,ebuku-database-path
               ,@args)))

which gets called in several places - e.g.

https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534

- to put the contents inside a temp buffer, which is then 'parsed'
for the information to be presented to the user.

In a comment from a couple of days ago, and after having noted in
a comment on issue 31:

https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703

that they'd set LANG on their system to "zh_CN.UTF-8", the user
wrote
(https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):

> I set the value with (set-language-environment "UTF-8"). I
> remember I set up this value bacause I don't want my files
> containing Chinese to be encoded by GBK encoding.

Then, in
https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373,
i wrote:

> if i remember correctly, the default encoding used by Windows is
> UTF-16, not UTF-8. So i'm wondering if that's somehow being used
> to transfer data from the buku process to the Emacs process,
> regardless of the value of LANG and LC_ALL, and regardless of
> the encoding of the buku database itself?

to which the user responded:

> I think the Powershell will use UTF-16 to encode instead of
> UTF-8.

Is that correct? Is that the case despite the user having
specified "zh_CN.UTF-8"? But if that's the case, why does removing
the CRAB emoji from text being operated on by string-match /
match-string make the issue disappear? Is it perhaps something to
do with the code point for the CRAB emoji being outside the BMP?

> Suggest that you ask the user who reported that to show the
> actual output of the sub-process (e.g., by running the same
> command outside of Emacs and redirecting output to a file), and
> if the output looks correct, examine the Lisp code which
> processes that output, with an eye on how the text is decoded.
> For example, if the text from the sub-process is supposed to be
> UTF-8 encoded, your Lisp code should bind coding-system-for-read
> to 'utf-8', to make sure it is decoded correctly.

Thanks, i can certainly do that, modulo the issue of whether the
LANG and LC_ALL variables have any effect on data transferred
between the `buku` sub-process and Emacs. But what should i do to
handle the more general case of an arbitrary encoding? Do i need
to have a defcustom, with 'reasonable defaults', that the user can
set if necessary, which i use as the value to pass to
coding-system-for-read?

> Btw: using UTF-8 by default on MS-Windows is not a very good
> idea, even with Windows 11 where one can enable UTF-8 support
> (did they do it, btw?). Windows still doesn't support UTF-8
> well, even after the improvements in Windows 11, so the above
> settings might very well cause trouble. Suggest to ask the user
> to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8
> stuff is set up outside Emacs, to try without it.

As i interpret their comments in the above discussions so far,
yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as
described above, had definitely set `set-language-environment` to
"UTF-8"). i'll certainly take your suggestions back to the user.

Thanks again,


Alexis.
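
P.S. For concreteness, here is a minimal sketch of how the
`coding-system-for-read` suggestion might be combined with a
defcustom. It is an illustration under stated assumptions, not
ebuku's actual code: the names `my-buku`, `my-buku-coding-system`,
`my-call-buku` and `my-decoded-length` are hypothetical, and it
assumes buku writes UTF-8, which has not been confirmed for the
user's Windows setup. The last form only shows how one sequence of
bytes can decode to different numbers of characters; latin-1 is a
stand-in for "some non-UTF-8 default", not the coding system
actually in use on the user's machine.

```elisp
;;; -*- lexical-binding: t; -*-
;; Hypothetical sketch: none of the `my-' names exist in ebuku.el.

(defgroup my-buku nil
  "Illustrative customization group for this sketch."
  :group 'processes)

(defcustom my-buku-coding-system 'utf-8
  "Coding system used to decode output from the buku process.
Assumption: buku writes UTF-8.  Users can change this if not."
  :type 'coding-system
  :group 'my-buku)

(defun my-call-buku (program database args)
  "Call PROGRAM on DATABASE with list ARGS.
Output is inserted into the current buffer, decoded with
`my-buku-coding-system' rather than the platform default."
  ;; `coding-system-for-read' is a special variable, so this
  ;; binding is dynamic and applies to the `call-process' call.
  (let ((coding-system-for-read my-buku-coding-system))
    (apply #'call-process program nil t nil
           "--np" "--nc" "--db" database args)))

(defun my-decoded-length (bytes coding-system)
  "Return the number of characters in BYTES decoded as CODING-SYSTEM."
  (length (decode-coding-string bytes coding-system)))

;; CRAB is U+1F980, which is four bytes in UTF-8.  Decoding those
;; same bytes with a different coding system changes the count.
(let ((crab-bytes (encode-coding-string "\U0001F980" 'utf-8)))
  (list (my-decoded-length crab-bytes 'utf-8)      ; 1 character
        (my-decoded-length crab-bytes 'latin-1)))  ; 4 characters
```

With something like this, a call such as
(my-call-buku ebuku-buku-path ebuku-database-path args) inside
`with-temp-buffer` would decode the output the same way regardless
of the language environment, and a user whose buku emits another
encoding could change the option instead of editing the code.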