From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dmitry Gutov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Mon, 30 Nov 2020 03:08:40 +0200
Message-ID: <59a60557-8cfc-fcdc-f0f5-e3e476c56aa1@yandex.ru>
References: <87blfhjr4q.fsf@gmx.net> <83k0u5mjvf.fsf@gnu.org>
 <877dq5jp51.fsf@gmx.net> <83im9pmh0v.fsf@gnu.org>
 <106736d6-1732-3f24-15c5-af7bcfd688c6@yandex.ru> <83blfhmdho.fsf@gnu.org>
 <247a8edb-7b70-ad32-1ba1-43b5458a82b0@yandex.ru>
 <87tut8zfmk.fsf@mail.linkov.net>
In-Reply-To: <87tut8zfmk.fsf@mail.linkov.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
To: Juri Linkov
Cc: Eli Zaretskii, stephen.berman@gmx.net, emacs-devel@gnu.org
List-Id: "Emacs development discussions."
Xref: news.gmane.io gmane.emacs.devel:260038

On 29.11.2020 21:37, Juri Linkov wrote:

>> Do we want to search the "binary" files at all? Right now we simply filter
>> such matches out (see the definition of xref-matches-in-files), and I have
>> seen no complaints.
>
> There are two cases: a really binary file, and a legit ascii file
> with an occasional ^@ char. And grep can't distinguish one from another.
> There is an option --binary-files=binary, but unfortunately it doesn't help,
> it still outputs "Binary file matches".

Makes sense.

> So xref parser needs to be smart enough to detect whether the matched line
> contains binary garbage when '-a' is used, or it's purely ascii.

I guess we can do that, but then some people might be a bit unhappy
about not being able to search inside such files? It could be useful on
occasion, too (TBC below *).

> Moreover, I think we should apply the same heuristics to the grep output
> in grep.el and add '-a' to the grep command by default.

I guess we should. Or do the LC_ALL thing. I'm still unclear on the
difference in effect between the two.

> Then grep.el
> should prettify the lines with real binary garbage e.g. by hiding groups of
> bytes between 0 and 32, or adding a 'display' property with ellipsis.

Why not. xref could also do something like that.

>> Our interpreter is our regexp with which we parse. But I suppose as long as
>> Grep doesn't insert unexpected newlines, the parser will be fine.
>
> For grep output a bigger problem is that grep on binary data
> might output too long lines before the terminating newline.

(*) We already have this kind of problem with "normal" files which
contain minified assets (JS or CSS). The file contents are usually
normal ASCII, but it's just one line which can reach several MBs in length.

The usual way to deal with that is with project-ignores and
grep-find-ignored-files. That works for both cases.

>>> I actually don't think I understand why we need -a in this case, since
>>> Grep looks for null bytes to decide this is a binary file, and encoded
>>> non-ASCII characters don't have null bytes (except if they are in
>>> UTF-16).
>>
>> Good question.
>
> The grep manual says that binary data are either output bytes that
> are improperly encoded for the current locale, or null input bytes.

So... if we add LC_ALL=C but not '-a' we will allow the "improperly
encoded" case but not the "null input bytes" one?
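
To make the heuristic discussed above a bit more concrete, here is a rough sketch of what "detect binary garbage in a matched line, then prettify it" could look like. It is written in Python for illustration only and is not the xref or grep.el code. The names has_binary_garbage and prettify, the "..." replacement and the max_len cutoff are all hypothetical, and it assumes "garbage" means control bytes below 32 other than tab, so Latin-1 text such as 0xE9 is not flagged.

```python
import re

# Assumption: "binary garbage" means runs of control bytes 0x00-0x1F,
# with tab (0x09) excluded because it is common in ordinary text.
CONTROL_RUN = re.compile(rb"[\x00-\x08\x0a-\x1f]+")


def has_binary_garbage(line: bytes) -> bool:
    """Return True if a matched line contains control bytes.

    The trailing line terminator that grep itself appends is ignored.
    """
    return CONTROL_RUN.search(line.rstrip(b"\r\n")) is not None


def prettify(line: bytes, max_len: int = 200) -> bytes:
    """Collapse each run of control bytes to "..." and cap the length.

    max_len is a hypothetical guard against very long lines, such as
    binary data or minified assets with no newline for megabytes.
    """
    body = CONTROL_RUN.sub(b"...", line.rstrip(b"\r\n"))
    if len(body) > max_len:
        body = body[:max_len] + b"..."
    return body


if __name__ == "__main__":
    for sample in (b"plain ascii\n", b"caf\xe9 in latin-1\n", b"foo\x00\x01bar\n"):
        print(has_binary_garbage(sample), prettify(sample))
```

In Emacs the collapsing step would more likely be a 'display' property over the matched run than a textual replacement, as suggested in the quoted text.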