From: Dmitry Gutov
Newsgroups: gmane.emacs.devel
Subject: Re: dired-do-find-regexp failure with latin-1 encoding
Date: Sun, 29 Nov 2020 18:27:24 +0200
Message-ID: <1142c209-27d4-292c-f087-e0ccb480d893@yandex.ru>
In-Reply-To: <83360smbq8.fsf@gnu.org>
To: Eli Zaretskii
Cc: stephen.berman@gmx.net, emacs-devel@gnu.org

On 29.11.2020 17:19, Eli Zaretskii wrote:
>> From: Dmitry Gutov
>> Cc: stephen.berman@gmx.net, emacs-devel@gnu.org
>> Date: Sun, 29 Nov 2020 02:49:25 +0200
>>
>> On 28.11.2020 23:04, Dmitry Gutov wrote:
>>> or latin-1 (AND the current system locale matches that encoding), the
>>> search should work fine across such files in different encodings, and
>>> without 'C-x RET c'
>>
>> Correction: only utf-8 and utf-16 detection is automatic. latin-1 needs
>> explicit arguments '-E latin-1' passed to rg.
>>
>> The official recommended workaround is to use a --pre flag which is
>> similar to what Stephen did originally by inserting 'iconv ...' in the
>> shell command string: https://github.com/BurntSushi/ripgrep/issues/746
>
> How can --pre help? It still cannot easily support different
> encodings in the same command, right?

It can help by calling iconv with different arguments depending on the
contents of each file.
That is valuable, I think, because we normally don't pipe file contents
to grep (or, potentially, rg); instead we pass multiple file names to it
using xargs. It wouldn't be easy, but a script that performs the
conversion based on each file's contents could work.

>> I suppose if we really wanted, we could insert some custom program that
>> chooses what to 'iconv' with, but that would be slower, of course. But
>> it could work with Grep, too.
>
> It would be brittle, unless that program actually reads the entire
> file (which will be slow).

How does Emacs do it? Does it read until the end of the file? If not, we
could try to reuse some of its logic.

Otherwise, yes, our options are either slow or brittle. Judging by the
discussion referenced above, that might be why ripgrep's author decided
to offload this responsibility.

In any case, --pre will already be significantly slower than the current
behavior (it spawns a process for each searched file), so we can probably
afford the "slow" approach here, since we won't enable it by default
anyway.
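
To make the "slow but not brittle" option concrete, here is a rough sketch of what such a per-file conversion script might look like. It is only an illustration: the script name (rg-pre-recode.py), the function names, and the detection heuristic (BOM check, then a strict UTF-8 decode of the whole file, then a latin-1 fallback) are all assumptions of mine, not what Emacs or ripgrep do internally. It relies on rg invoking the --pre command with the file path as its argument and searching whatever the command writes to stdout.

```python
#!/usr/bin/env python3
"""Hypothetical preprocessor for `rg --pre`: re-encode one file to UTF-8."""
import codecs
import sys


def to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8, guessing the source encoding."""
    # A byte order mark is the only cheap signal, so check it first.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and picks the byte order.
        return data.decode("utf-16", errors="replace").encode("utf-8")
    try:
        # Strict decoding scans the entire input: slow, but not brittle.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Assumption: everything else is latin-1, which always decodes.
        return data.decode("latin-1").encode("utf-8")


def main(path: str) -> None:
    # Read the whole file, convert it, and hand the result to rg on stdout.
    with open(path, "rb") as f:
        sys.stdout.buffer.write(to_utf8(f.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

With the script made executable, it would be used as something like 'rg --pre ./rg-pre-recode.py PATTERN'. The latin-1 fallback is the weak point: any non-UTF-8 8-bit encoding would be silently treated as latin-1, so a real version would need a better guess (or reuse of Emacs's detection logic, if that turns out to be practical).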