From: Gregory Heytings via "Emacs development discussions."
To: Eli Zaretskii
Cc: Dmitry Gutov, stephen.berman@gmx.net, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 19:49:14 +0000
In-Reply-To: <83eekckndb.fsf@gnu.org>

>>> Then I think injecting LC_ALL=C into the environment when running Grep
>>> in this case makes the results more useful? And we can then avoid
>>> using -a?
>>
>> I'm not so sure.
>> LC_ALL=C seems more problematic than -a:
>>
>> $ grep ф test.txt
>> фыва
>> $ grep -a ф test.txt
>> фыва
>> $ LC_ALL=C grep ф test.txt
>> (nothing)
>
> I guess this regression in Grep happened when they "internationalized"
> the DFA code, sigh...
>

FWIW, I "bisected" this with various versions of grep, and this regression
happened in 2014, between versions 2.20 and 2.21:

echo -ne "premi\xE8re\n" > latin1.txt
echo -ne "premi\xC3\xA8re\n" > utf8.txt
echo -ne "premi\xE8re\npremi\xC3\xA8re\n" > both.txt

With 2.20 with rxvt (which is clever enough to display UTF-8 and Latin-1 at
the same time):

$ grep prem *.txt
both.txt:première
both.txt:première
latin1.txt:première
utf8.txt:première

With 2.20 with M-x shell (the \350 is a single character):

both.txt:premi\350re
both.txt:première
latin1.txt:premi\350re
utf8.txt:première

With 2.21, with rxvt or M-x shell:

$ grep prem *.txt
Binary file both.txt matches
Binary file latin1.txt matches
utf8.txt:première
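
A minimal sketch for rerunning the comparison above, under a few assumptions that are not in the original message: printf with octal escapes replaces "echo -ne" (for portability across shells), the files go into a scratch directory from mktemp, and the variable names are purely illustrative. The counts taken with -a under LC_ALL=C are meant as a baseline that should not vary between grep versions; the final plain grep is the version-dependent case discussed in this thread.

```shell
#!/bin/sh
# Sketch: recreate the three test files in a scratch directory.
# printf with octal escapes is an assumption made here for portability;
# \350 is the byte 0xE8, and \303\250 is the pair 0xC3 0xA8.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'premi\350re\n' > latin1.txt
printf 'premi\303\250re\n' > utf8.txt
printf 'premi\350re\npremi\303\250re\n' > both.txt

# Baseline: with -a (treat input as text) in the C locale, every line
# containing "prem" should be counted.
latin1_count=$(LC_ALL=C grep -a -c prem latin1.txt)
utf8_count=$(LC_ALL=C grep -a -c prem utf8.txt)
both_count=$(LC_ALL=C grep -a -c prem both.txt)
echo "latin1=$latin1_count utf8=$utf8_count both=$both_count"

# The case from this thread: plain grep in the current locale. Whether
# this prints the matching lines or "Binary file ... matches" depends on
# the grep version and the locale, so no particular output is claimed.
# cat -v renders non-ASCII bytes in a visible ASCII form.
grep prem ./*.txt | cat -v
```

Running the last command once per installed grep version (and once with a UTF-8 locale, once with LC_ALL=C) should make the difference between 2.20 and 2.21 visible without relying on the terminal's rendering.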