From: Dmitry Gutov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sat, 28 Nov 2020 23:04:10 +0200
Message-ID: <247a8edb-7b70-ad32-1ba1-43b5458a82b0@yandex.ru>
References: <87blfhjr4q.fsf@gmx.net> <83k0u5mjvf.fsf@gnu.org> <877dq5jp51.fsf@gmx.net> <83im9pmh0v.fsf@gnu.org> <106736d6-1732-3f24-15c5-af7bcfd688c6@yandex.ru> <83blfhmdho.fsf@gnu.org>
To: Eli Zaretskii
Cc: stephen.berman@gmx.net, emacs-devel@gnu.org

On 28.11.2020 22:29, Eli Zaretskii wrote:

>> Ah, so this way the user explicitly searches for a regexp encoded as
>> latin-1?
>
> More accurately, this is how to search in files encoded in Latin-1.
> (The regexp also gets encoded in latin-1, but the important part is
> the files' encoding.)

Right. So when there are files in different encodings, the result will
be not great, as expected.

>>> Adding -a probably cannot do any harm, but its support should be
>>> detected, since I don't think it's portable enough (it isn't in the
>>> latest Posix spec, at least).
>>
>> Are you sure about that? Are we sure it won't make searching binary
>> files slower, for example?
>
> It will be slower, but more useful: by default Grep just says "Binary
> file foo matches".

Do we want to search the "binary" files at all? Right now we simply
filter such matches out (see the definition of xref-matches-in-files),
and I have seen no complaints.
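
To illustrate the filtering idea in general terms, here is a minimal sketch in Python. It is not the code of xref-matches-in-files; the function name parse_grep_output, the HIT_RE pattern, and the sample output are all assumptions made up for this example. It assumes Grep was run with -n so that real hits look like "file:line:text", and it drops every line that does not have that shape, which includes "Binary file foo matches".

```python
import re

# Hypothetical pattern for "file:line:text" hits, as printed by grep -n.
# Not taken from Emacs; chosen only to illustrate the idea.
HIT_RE = re.compile(r"^(?P<file>.+?):(?P<line>[0-9]+):(?P<text>.*)$")


def parse_grep_output(output: str):
    """Return (file, line, text) tuples, silently skipping non-hit lines."""
    hits = []
    # Split on "\n" only, so stray control characters inside a line
    # are not treated as line separators.
    for raw in output.split("\n"):
        m = HIT_RE.match(raw)
        if m is None:
            # E.g. "Binary file foo matches" or an empty trailing line.
            continue
        hits.append((m["file"], int(m["line"]), m["text"]))
    return hits


# Made-up sample output, for illustration only.
sample = "a.txt:3:caf\u00e9\nBinary file b.bin matches\nc.txt:10:foo:bar\n"
hits = parse_grep_output(sample)
```

With the sample above, only the two "file:line:text" lines survive; the "Binary file" notice is dropped without any special-casing.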

>> Also, the manual has this warning:
>>
>>     Warning: The -a option might output binary garbage, which can have
>>     nasty side effects if the output is a terminal and if the terminal
>>     driver interprets some of it as commands.
>>
>> ...which might conceivably mess up our parsing of Grep output sometimes?
>
> This is not relevant, since we read that output, there's no terminal
> device driver to interpret it and get messed up.

Our interpreter is our regexp with which we parse. But I suppose as long
as Grep doesn't insert unexpected newlines, the parser will be fine.

> I actually don't think I understand why we need -a in this case, since
> Grep looks for null bytes to decide this is a binary file, and encoded
> non-ASCII characters don't have null bytes (except if they are in
> UTF-16).

Good question.

>> P.S. Or we can forgo all that and ask the users who want to search for
>> non-ASCII strings to install ripgrep.
>
> We should support Grep regardless, since not everyone will have
> ripgrep. And in any case, "C-x RET c" will be needed with it as well,
> no?

I'd have to test it explicitly to say for sure, but:

    ripgrep supports searching files in text encodings other than UTF-8,
    such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support
    for automatically detecting UTF-16 is provided. Other text encodings
    must be specifically specified with the -E/--encoding flag.)

https://blog.burntsushi.net/ripgrep/#pitch

So if the file encoding is UTF-8, UTF-16, or latin-1 (AND the current
system locale matches that encoding), the search should work fine across
such files in different encodings, and without 'C-x RET c'.

Which doesn't cover all situations, of course, but it's about as much as
can be expected. And more than Grep can.
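
On the null-byte question quoted above, a small sketch in Python can check the byte-level facts. The sample string and the helper names has_nul and decodes_as_utf8 are assumptions invented for this illustration. It shows that Latin-1 encoded text contains no NUL bytes while UTF-16 does, and that the same Latin-1 bytes are not valid UTF-8. If I read the GNU Grep manual correctly, bytes that are improperly encoded for the current locale also count as non-text, which could explain why -a matters in a UTF-8 locale; treat that as something to verify, not as settled.

```python
def has_nul(data: bytes) -> bool:
    """True if the byte string contains at least one NUL byte."""
    return b"\x00" in data


def decodes_as_utf8(data: bytes) -> bool:
    """True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative sample with one non-ASCII character (an assumption,
# not a string from the original bug report).
sample = "caf\u00e9"

latin1 = sample.encode("latin-1")    # one byte per character
utf16 = sample.encode("utf-16-le")   # two bytes per character here
utf8 = sample.encode("utf-8")

print("latin-1 has NUL:", has_nul(latin1))
print("utf-16  has NUL:", has_nul(utf16))
print("latin-1 valid as UTF-8:", decodes_as_utf8(latin1))
print("utf-8   valid as UTF-8:", decodes_as_utf8(utf8))
```

So the NUL-byte heuristic alone would not flag a Latin-1 file, but an encoding-validity check under a UTF-8 locale would.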