From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dmitry Gutov
Newsgroups: gmane.emacs.bugs
Subject: bug#31796: 27.1; dired-do-find-regexp-and-replace fails to find multiline regexps
Date: Thu, 17 Dec 2020 02:40:09 +0200
References: <10120030-8b8d-b702-add4-8f099f934ed5@chalmers.se>
 <831rgivl7l.fsf@gnu.org> <83lfequ30g.fsf@gnu.org> <83a6v6tss9.fsf@gnu.org>
 <08c0bbce-051e-7a49-106a-d6d0629b2224@yandex.ru>
 <87blffns95.fsf@mail.linkov.net>
 <8c124412-3bb3-fd92-4c3b-da4b3a8bdcac@yandex.ru>
 <87blfec4l3.fsf@mail.linkov.net>
 <00d1c8ef-5601-6445-199e-1590ddfae9e9@yandex.ru>
 <87eek2902v.fsf@mail.linkov.net> <873605mstj.fsf@mail.linkov.net>
In-Reply-To: <873605mstj.fsf@mail.linkov.net>
To: Juri Linkov
Cc: abela@chalmers.se, 31796@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------58A17A2BD72C517B3BDA843D"
Content-Language: en-US
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.io gmane.emacs.bugs:196234

This is a multi-part message in MIME format.
--------------58A17A2BD72C517B3BDA843D
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

On 16.12.2020 22:32, Juri Linkov wrote:
>>> Another backup plan is to use ripgrep. Its multiline handling with -U
>>> also allows to search words ignoring any whitespace, even newlines.
>>> This is like isearch-lax-whitespace using search-whitespace-regexp
>>> when it contains a newline, e.g. "[ \t\r\n]+".
>>
>> Right. It has a problem of its own, though: it still outputs a file name
>> per line, even when a match is spread across several lines (unlike
>> pcregrep). So we're left guessing where a given multiline match ends.
>>
>> Also, 'sort' doesn't seem to be able to treat both : and \0 as separators
>> at the same time.
>>
>> Here's a rough patch, for illustration.
>
> Thanks, now finally it's possible to search text ignoring whitespace
> between words, for example:
>
>   Find regexp: file[
> ]+names
>
> finds everything correctly, even though current implementation maybe
> not the most elegant.
>
>> It's kind of working, but I'm not loving it.
>
> What do you think about using the option `rg --json`?
> Emacs has the fast JSON parsing library now, so using
> JSON output would be more reliable.

Very interesting. It returns better data: each multiline match is wholly
in one entry instead of being spread across lines. The matches are even
annotated with match string/length/absolute position.

We should really investigate it, but perhaps a bit later. That includes
our capability to parse it quickly when there are a lot of matches
(>1000), and how said byte offsets interact with different file
encodings.

Also, its output is not one JSON document but a series of them
(including ones with just search statistics, which we'll want to skip),
but some re-search-forward followed by (json-parse-buffer) should do the
trick.

In the meantime, here's a smaller patch using the traditional output format.
I figure since there is a file name on each line anyway, --null doesn't
help much. So it can be simplified a little (see attached).

Unfortunately, xref-replace-in-matches is broken for such multiline
matches. And, of course, the patch merges together matches on adjacent
lines, whether they are one match or several (that hasn't changed from
the previous patch). So more investigation is needed.

--------------58A17A2BD72C517B3BDA843D
Content-Type: text/x-patch; charset=UTF-8;
 name="ripgrep-multiline.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="ripgrep-multiline.diff"

diff --git a/lisp/progmodes/xref.el b/lisp/progmodes/xref.el
index 6e99e9d8ac..7c0c54e6eb 100644
--- a/lisp/progmodes/xref.el
+++ b/lisp/progmodes/xref.el
@@ -1390,6 +1390,7 @@ xref-matches-in-files
          ;; The 'auto' default would be fine too, but ripgrep can't handle
          ;; the options we pass in that case.
          (grep-highlight-matches nil)
+         (multiline (string-match-p "\n" regexp))
          (command (grep-expand-template (cdr
                                          (or
                                           (assoc
@@ -1397,7 +1398,14 @@ xref-matches-in-files
                                            xref-search-program-alist)
                                           (user-error "Unknown search program `%s'"
                                                       xref-search-program)))
-                                        (xref--regexp-to-extended regexp))))
+                                        (xref--regexp-to-extended regexp)
+                                        nil
+                                        nil
+                                        nil
+                                        (when multiline '("-U")))))
+    (if (and multiline (not (eq xref-search-program 'ripgrep)))
+        (user-error "Sorry, multiline searches are not supported with `%s'"
+                    xref-search-program))
     (when remote-id
       (require 'tramp)
       (setq files (mapcar
@@ -1425,6 +1433,27 @@ xref-matches-in-files
                    (not (looking-at "Binary file .* matches")))
           (user-error "Search failed with status %d: %s" status
                       (buffer-substring (point-min) (line-end-position))))
+        (if multiline
+            (let (match line last-line file)
+              (while (re-search-forward grep-re nil t)
+                (if (and match
+                         (equal file (match-string 1))
+                         (= (string-to-number (match-string 2))
+                            (1+ last-line)))
+                    (progn
+                      (setq last-line (string-to-number (match-string 2))
+                            match (concat match
+                                          "\n"
+                                          (buffer-substring
+                                           (match-end 0)
+                                           (line-end-position)))))
+                  (when match
+                    (push (list line file match (1+ (- last-line line))) hits))
+                  (setq match (buffer-substring (match-end 0) (line-end-position))
+                        file (match-string 1)
+                        line (string-to-number (match-string 2))
+                        last-line line)))
+              (push (list line file match (1+ (- last-line line))) hits)))
         (while (re-search-forward grep-re nil t)
           (push (list (string-to-number (match-string line-group))
                       (match-string file-group)
@@ -1536,7 +1565,7 @@ xref--convert-hits
         (kill-buffer tmp-buffer))))
 
 (defun xref--collect-matches (hit regexp tmp-buffer)
-  (pcase-let* ((`(,line ,file ,text) hit)
+  (pcase-let* ((`(,line ,file ,text ,lines-num) hit)
                (remote-id (file-remote-p default-directory))
                (file (and file (concat remote-id file)))
                (buf (xref--find-file-buffer file))
@@ -1548,7 +1577,7 @@ xref--collect-matches
             (forward-line (1- line))
             (xref--collect-matches-1 regexp file line
                                      (line-beginning-position)
-                                     (line-end-position)
+                                     (line-end-position (or lines-num 1))
                                      syntax-needed)))
       ;; Using the temporary buffer is both a performance and a buffer
       ;; management optimization.

--------------58A17A2BD72C517B3BDA843D--
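
For reference, the `rg --json` idea discussed in the message (some re-search-forward followed by json-parse-buffer, skipping the non-match records) could look roughly like the sketch below. It is only an illustration, not part of the attached patch. The function name `my-rg-json-collect-hits` is hypothetical, and the field names (`type`, `data`, `path.text`, `lines.text`, `line_number`) are an assumption based on ripgrep's documented JSON output, where each record sits on its own line. It also assumes an Emacs with native JSON support (27 or later). Records whose path or text is not valid UTF-8 (ripgrep then uses a `bytes` field instead of `text`) are simply skipped here.

```elisp
;;; -*- lexical-binding: t -*-
(require 'cl-lib)

(defun my-rg-json-collect-hits ()
  "Parse `rg --json' output in the current buffer.
Return a list of (LINE FILE TEXT LINES-NUM) entries in output order,
the same shape as the hits built by the multiline branch of the patch.
Hypothetical helper, for illustration only."
  (let (hits)
    (goto-char (point-min))
    ;; The output is a series of JSON documents, one per line.
    (while (re-search-forward "^{" nil t)
      (goto-char (match-beginning 0))
      ;; `json-parse-buffer' parses one value and leaves point after it.
      (let* ((record (json-parse-buffer :object-type 'alist))
             (data (alist-get 'data record)))
        ;; Skip "begin", "end", "context" and "summary" records.
        (when (equal (alist-get 'type record) "match")
          (let ((file (alist-get 'text (alist-get 'path data)))
                (text (alist-get 'text (alist-get 'lines data)))
                (line (alist-get 'line_number data)))
            (when (and file text)
              ;; Drop the trailing newline, then count the lines spanned.
              (let* ((body (if (string-suffix-p "\n" text)
                               (substring text 0 -1)
                             text))
                     (lines-num (1+ (cl-count ?\n body))))
                (push (list line file body lines-num) hits)))))))
    (nreverse hits)))

;; Usage on a small hand-written sample of the assumed format:
(with-temp-buffer
  (insert "{\"type\":\"begin\",\"data\":{\"path\":{\"text\":\"a.txt\"}}}\n"
          "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"a.txt\"},"
          "\"lines\":{\"text\":\"file\\nnames\\n\"},\"line_number\":3,"
          "\"absolute_offset\":10,\"submatches\":[]}}\n"
          "{\"type\":\"end\",\"data\":{\"path\":{\"text\":\"a.txt\"}}}\n")
  (my-rg-json-collect-hits))
```

On that sample the call returns ((3 "a.txt" "file\nnames" 2)): one entry per multiline match, so there is no need to guess where a match ends by merging adjacent lines. How well this performs with thousands of matches, and how the byte offsets map to buffer positions under different encodings, is left open here, as in the message.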