From: Dmitry Gutov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 18:07:38 +0200
To: Eli Zaretskii
Cc: stephen.berman@gmx.net, emacs-devel@gnu.org

On 29.11.2020 17:06, Eli Zaretskii wrote:

>> Do we want to search the "binary" files at all?
>
> We don't.  I still hope to understand why -a was needed in this case.
> Stephen?

Looks like it actually depends on the encoding of the _output_. So if it
can print some lines well but not others it can even print a line from a
file and then later say it's a binary:

$ grep "prem" latin1.txt
premie?re is slightly different
Binary file latin1.txt matches

Adding -a or prepending 'LC_ALL=C' changes that:

$ LC_ALL=C grep "prem" latin1.txt
premi�re is first
premie?re is slightly different

So... looks like Grep searches through all files anyway. Just modifies
its output in cases where it looks iffy.

>>> We should support Grep regardless, since not everyone will have
>>> ripgrep.  And in any case, "C-x RET c" will be needed with it as well,
>>> no?
>>
>> I'd have to test it explicitly to say for sure, but:
>>
>> ripgrep supports searching files in text encodings other than UTF-8,
>> such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more.
>> (Some support for automatically detecting UTF-16 is provided. Other text
>> encodings must be specifically specified with the -E/--encoding flag.)
>>
>> https://blog.burntsushi.net/ripgrep/#pitch
>
> What is not clear to me is whether the _output_ is always in some
> fixed encoding, like UTF-8.  That doesn't seem to be stated in the
> docs there.

Judging by a small experiment, rg's output is in the same encoding as
input, for each file. Which can be a nuisance when looking at the search
results, but that's probably all.

In any case, if one takes the pre-processing route, the end encoding
will be UTF-8.
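
To make the pre-processing idea a bit more concrete, here is a rough Python sketch (not Emacs or ripgrep code). It assumes the source encoding is known up front (latin-1 here), decodes the bytes, matches on the decoded text, and emits every hit as UTF-8. The names is_valid_utf8 and search_preprocessed, and the sample data, are made up for illustration. The first helper also shows why a UTF-8 locale might plausibly flag such a line as "iffy": the latin-1 byte for "è" is not a valid UTF-8 sequence on its own.

```python
import re
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def search_preprocessed(data: bytes, pattern: str,
                        source_encoding: str = "latin-1") -> List[bytes]:
    """Decode with the assumed source encoding, match line by line,
    and return the matching lines re-encoded as UTF-8."""
    text = data.decode(source_encoding)
    return [line.encode("utf-8")
            for line in text.splitlines()
            if re.search(pattern, line)]


# Hypothetical stand-in for the contents of a latin-1 file.
sample = "premi\u00e8re is first\nno match here\n".encode("latin-1")

if __name__ == "__main__":
    print(is_valid_utf8(sample))           # raw latin-1 bytes
    for hit in search_preprocessed(sample, "prem"):
        print(hit.decode("utf-8"))         # hits are uniformly UTF-8
```

The point of doing it this way is that whatever consumes the results only ever has to deal with one encoding, at the cost of having to know (or guess) the source encoding per file.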