unofficial mirror of emacs-devel@gnu.org 
* Matches for multiline regexps
@ 2005-06-16  1:40 Luc Teirlinck
  2005-06-16  2:09 ` Luc Teirlinck
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-16  1:40 UTC (permalink / raw)


There are some inconsistencies between `occur' and friends on the one
hand and `how-many', `flush-lines' and `keep-lines' on the other hand,
insofar as matches for multiline regexps are concerned.

In a buffer containing five lines all containing "11":

11
11
11
11
11

`M-x occur RET 11 C-q C-j 11 RET' finds four matches (lines 1 through 4)
which seems logical to me.

`M-x how-many RET 11 C-q C-j 11 RET' (with point at bob) finds two matches.

The difference is that how-many, after finding the match on line one,
skips over the match, and starts searching for the next match at the
end of that match, hence not finding the match at the beginning of
line 2 (which is partially covered by the first match).
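
The two counts can be reproduced outside Emacs.  What follows is a
minimal sketch in Python, not Emacs code: the function names are made
up for illustration, and it only models the two ways of counting (resume
after the previous match versus try every starting position).  It makes
no claim about how `occur' or `how-many' are actually implemented.

```python
import re

# The five-line example buffer and the regexp typed as `11 C-q C-j 11'.
buffer_text = "11\n11\n11\n11\n11\n"
pattern = "11\n11"


def count_nonoverlapping(regexp, text):
    """Count matches, resuming each search at the end of the previous match."""
    return sum(1 for _ in re.finditer(regexp, text))


def count_overlapping(regexp, text):
    """Count every position at which a match can start."""
    compiled = re.compile(regexp)
    return sum(1 for pos in range(len(text)) if compiled.match(text, pos))


print(count_nonoverlapping(pattern, buffer_text))  # 2
print(count_overlapping(pattern, buffer_text))     # 4
```

The first function skips the match that starts on line 2 because its
first line was already consumed by the match that starts on line 1.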

`flush-lines' and `keep-lines' follow the `how-many' "philosophy".

Should I just document the difference, or is this a bug in `how-many'
and friends that needs to be fixed?

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-16  1:40 Matches for multiline regexps Luc Teirlinck
@ 2005-06-16  2:09 ` Luc Teirlinck
  2005-06-16  2:24 ` Luc Teirlinck
  2005-06-16 16:24 ` Richard Stallman
  2 siblings, 0 replies; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-16  2:09 UTC (permalink / raw)


Or does occur have problems with multiline regexps?

In my previous example buffer:

11
11
11
11
11

`M-x occur RET 11 C-q C-j 11 RET' produces the following *Occur*
buffer:

4 matches for "11
11" in buffer: bu
      2:11
      4:11
      6:11
      8:11

I could not find out from the docs what those numbers in front of the
11's are supposed to mean.  They are clearly not line numbers.

Starting from a buffer with the following five lines:

11
22
33
11
22

`M-x occur RET 11 C-q C-j 22 RET' produces the following *Occur*
buffer:

2 matches for "11
22" in buffer: bu
      2:11
      6:11

This at the very least seems inconsistent with the `occur' docstring:

  Show all lines in the current buffer containing a match for regexp.

  If a match spreads across multiple lines, all those lines are shown.

So why do I not see any 22 lines in the *Occur* buffer?

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-16  1:40 Matches for multiline regexps Luc Teirlinck
  2005-06-16  2:09 ` Luc Teirlinck
@ 2005-06-16  2:24 ` Luc Teirlinck
  2005-06-16 16:24 ` Richard Stallman
  2 siblings, 0 replies; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-16  2:24 UTC (permalink / raw)


`query-replace' follows the `how-many' algorithm.  If you do not
replace the match on line one, it does not offer to replace the match
on line two in the same example buffer:

11
11
11
11
11

So it does seem that occur is the one with exceptional behavior.

Things should be unambiguous for regexps whose matches do not overlap.
However, in the _second_ example I gave where occur seemed at the
very least to contradict its own docstring, there were no overlapping
matches.

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-16  1:40 Matches for multiline regexps Luc Teirlinck
  2005-06-16  2:09 ` Luc Teirlinck
  2005-06-16  2:24 ` Luc Teirlinck
@ 2005-06-16 16:24 ` Richard Stallman
  2005-06-17  3:26   ` Luc Teirlinck
  2005-06-17  3:30   ` Luc Teirlinck
  2 siblings, 2 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-16 16:24 UTC (permalink / raw)
  Cc: emacs-devel

    `M-x occur RET 11 C-q C-j 11 RET' finds four matches (lines 1 through 4)
    which seems logical to me.

    `M-x how-many RET 11 C-q C-j 11 RET' (with point at bob) finds two matches.

One function looks for lines that match, while the other finds
matches, so it is natural that they may differ.

It is not a bug, and I don't see why we need to change the
documentation either.


* Re: Matches for multiline regexps
  2005-06-16 16:24 ` Richard Stallman
@ 2005-06-17  3:26   ` Luc Teirlinck
  2005-06-17 14:58     ` Richard Stallman
  2005-06-17  3:30   ` Luc Teirlinck
  1 sibling, 1 reply; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-17  3:26 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman wrote:

       `M-x occur RET 11 C-q C-j 11 RET' finds four matches (lines 1 through 4)
       which seems logical to me.

       `M-x how-many RET 11 C-q C-j 11 RET' (with point at bob) finds two matches.

   One function looks for lines that match, while the other finds
   matches, so it is natural that they may differ.

   It is not a bug, and I don't see why we need to change the
   documentation either.

But are the _other_ problems I reported no bugs either?
From my previous message:

   Or does occur have problems with multiline regexps?

   In my previous example buffer:

   11
   11
   11
   11
   11

   `M-x occur RET 11 C-q C-j 11 RET' produces the following *Occur*
   buffer:

   4 matches for "11
   11" in buffer: bu
	 2:11
	 4:11
	 6:11
	 8:11

   I could not find out from the docs what those numbers in front of the
   11's are supposed to mean.  They are clearly not line numbers.

Additional remark: from simpler examples, it appears that they are
_intended_ to be line numbers.  If so, this is a bug.

   Starting from a buffer with the following five lines:

   11
   22
   33
   11
   22

   `M-x occur RET 11 C-q C-j 22 RET' produces the following *Occur*
   buffer:

   2 matches for "11
   22" in buffer: bu
	 2:11
	 6:11

   This at the very least seems inconsistent with the `occur' docstring:

     Show all lines in the current buffer containing a match for regexp.

     If a match spreads across multiple lines, all those lines are shown.

   So why do I not see any 22 lines in the *Occur* buffer?

Note that in the second example, there are not even overlapping matches.


* Re: Matches for multiline regexps
  2005-06-16 16:24 ` Richard Stallman
  2005-06-17  3:26   ` Luc Teirlinck
@ 2005-06-17  3:30   ` Luc Teirlinck
  2005-06-17 14:58     ` Richard Stallman
  1 sibling, 1 reply; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-17  3:30 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman wrote:

       `M-x occur RET 11 C-q C-j 11 RET' finds four matches (lines 1 through 4)
       which seems logical to me.

       `M-x how-many RET 11 C-q C-j 11 RET' (with point at bob) finds two matches.

   One function looks for lines that match, while the other finds
   matches, so it is natural that they may differ.

But `keep-lines' and `flush-lines' also look for lines that match.
Just to be sure: is the following behavior intended:

Start with a buffer containing three lines with "11" in them and a
final newline:

11
11
11

Result of `M-x keep-lines RET 11 C-q C-j 11 RET' with point at bob:

11
11

(The `occur' interpretation would keep all three lines.) 

Result of `M-x flush-lines RET 11 C-q C-j 11 RET' with point at bob:

11

(The `occur' interpretation would delete all three lines.)

Maybe it is OK (just asking to be sure).
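
The results above are what one gets from a simple model in which
matches are found without overlap and every line touched by a match
counts as a matching line.  The Python sketch below is only that model,
with hypothetical helper names; it is not the code in replace.el.

```python
import re


def lines_touched(regexp, text):
    """Return 0-based indices of lines containing any part of a match.

    Matches are found left to right without overlap, each search
    resuming at the end of the previous match.
    """
    touched = set()
    for m in re.finditer(regexp, text):
        first = text.count("\n", 0, m.start())
        # The last character of the match decides the last line touched.
        last = text.count("\n", 0, max(m.start(), m.end() - 1))
        touched.update(range(first, last + 1))
    return touched


def keep_lines(regexp, text):
    """Keep only the lines touched by a match."""
    touched = lines_touched(regexp, text)
    lines = text.splitlines(keepends=True)
    return "".join(l for i, l in enumerate(lines) if i in touched)


def flush_lines(regexp, text):
    """Delete the lines touched by a match."""
    touched = lines_touched(regexp, text)
    lines = text.splitlines(keepends=True)
    return "".join(l for i, l in enumerate(lines) if i not in touched)


three_lines = "11\n11\n11\n"
print(repr(keep_lines("11\n11", three_lines)))   # '11\n11\n'
print(repr(flush_lines("11\n11", three_lines)))  # '11\n'
```

Only one match is found (lines 1 and 2), so line 3 is dropped by the
first function and is the only survivor of the second.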

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-17  3:26   ` Luc Teirlinck
@ 2005-06-17 14:58     ` Richard Stallman
  2005-06-18  2:48       ` Luc Teirlinck
  2005-06-18  3:17       ` Luc Teirlinck
  0 siblings, 2 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-17 14:58 UTC (permalink / raw)
  Cc: emacs-devel

       4 matches for "11
       11" in buffer: bu
	     2:11
	     4:11
	     6:11
	     8:11

       I could not find out from the docs what those numbers in front of the
       11's are supposed to mean.  They are clearly not line numbers.

    Additional remark: from simpler examples, it appears that they are
    _intended_ to be line numbers.  If so, this is a bug.

Yes, it seems to be a bug in counting the line numbers.  Could you fix
that too?

Of course, it could simply count lines from the beginning to each
matching line, each time; but that would be terribly inefficient in a
large buffer.
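
One way to avoid recounting from the beginning each time is to keep a
running line number and count only the newlines between one match start
and the next.  The following Python sketch illustrates that idea under
the assumption that matches are visited in buffer order; the function
name is hypothetical and this is not the occur code.

```python
import re


def match_line_numbers(regexp, text):
    """Return the 1-based line number on which each match starts.

    Newlines are counted only in the stretch of text between the
    previous match start and the current one, so the whole text is
    scanned for newlines at most once.
    """
    numbers = []
    line = 1
    counted_up_to = 0
    for m in re.finditer(regexp, text):
        line += text.count("\n", counted_up_to, m.start())
        counted_up_to = m.start()
        numbers.append(line)
    return numbers


print(match_line_numbers("11\n22", "11\n22\n33\n11\n22\n"))  # [1, 4]
```

For the second example buffer this gives lines 1 and 4, where the
*Occur* buffer quoted above shows 2 and 6.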


* Re: Matches for multiline regexps
  2005-06-17  3:30   ` Luc Teirlinck
@ 2005-06-17 14:58     ` Richard Stallman
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-17 14:58 UTC (permalink / raw)
  Cc: emacs-devel

    Start with a buffer containing three lines with "11" in them and a
    final newline:

    11
    11
    11

    Result of `M-x keep-lines RET 11 C-q C-j 11 RET' with point at bob:

    11
    11

The basic ideas of keep-lines and flush-lines don't obviously extend
to multi-line regexps.  Figuring out what they ought to mean in such a
case would be rather difficult--and why take the trouble?


* Re: Matches for multiline regexps
  2005-06-17 14:58     ` Richard Stallman
@ 2005-06-18  2:48       ` Luc Teirlinck
  2005-06-19  3:50         ` Richard Stallman
  2005-06-18  3:17       ` Luc Teirlinck
  1 sibling, 1 reply; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-18  2:48 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman wrote:

       Additional remark: from simpler examples, it appears that they are
       _intended_ to be line numbers.  If so, this is a bug.

   Yes, it seems to be a bug in counting the line numbers.  Could you fix
   that too?

I will take a look at it, but first a decision has to be made on how
we treat overlapping matches.  (I am talking about matches that
themselves overlap.  I have no problem handling a match that starts on
the same line on which a previous match ended, but later on the line,
so that the matches themselves do not overlap, only one of their lines.)

The current occur implementation for multiline regexps has _several_
problems.  Apart from getting the line numbers wrong, the matches do
not get correctly displayed: only their first line is shown.  The
current implementation _tries_ to "correctly" (in one of the two
possible interpretations of what is "correct") find all matches in
case there are overlapping matches.  But it does not come close to
succeeding in that.  Worse, it has to pay for its attempt to do so by
failing to find all matches in more natural cases where there are no
overlapping matches and only one possible interpretation of "correct".
The present occur implementation differs radically in philosophy from
all other word or regexp search functions in Emacs and is backward
incompatible with Emacs 21.

I propose to have occur treat overlapping matches the same as the
other Emacs search functions do, which is also the way occur behaved
before Emacs 22.  That is, given a buffer with the following five lines:

11
11
11
11
11

`M-x occur RET 11 C-q C-j 11 RET' will find two matches, one on line 1
and one on line 3.  Those are the only matches that
`C-M-s 11 C-q C-j 11 RET C-s C-s C-s...' at beginning of buffer is
going to find.  It is what occur does in Emacs 21.  Implementing this
correctly seems relatively easy and does not require paying a price in
efficiency.  If this interpretation is good enough for C-M-s, then why
not for occur?

Trying to fix occur to handle the other interpretation of "correct"
(matches at lines 1, 2, 3 and 4) is possible but more difficult.  (The
current occur version can do that correctly in this example, but fails
for many other examples.)  Even a completely correct implementation
would still present problems.  It could make the handling of more
natural regexps less efficient, it clashes with all other search
functions in its philosophy, and it would not be clear how to display
all multiline matches in a way that is clear and avoids excessive
redundancy, because there could be a _lot_ of overlapping lines
between matches.  With my proposal only _consecutive_ entries in the
*Occur* buffer could overlap and the overlap would be at most one
line.  With a correct implementation of the other interpretation,
there is no limit in amount of overlap.

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-17 14:58     ` Richard Stallman
  2005-06-18  2:48       ` Luc Teirlinck
@ 2005-06-18  3:17       ` Luc Teirlinck
  1 sibling, 0 replies; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-18  3:17 UTC (permalink / raw)


Of course, in my previous message, with "overlapping matches", I meant
overlapping _multiline_ matches.

The current occur implementation treats overlapping _one-line_ matches
according to the same philosophy as the one I propose for multiline
matches, which is one further argument for my proposal.  The _current_
occur implementation shows the line only once and the only matches
that are highlighted are the ones that match according to the C-M-s
philosophy.

For example, with a buffer containing only "111", with the current
occur, after `M-x occur RET 11 RET', the highlighting only finds a
match at position 1, not at position 2.

To be consistent with the way it currently treats multiline
overlapping matches, occur should show the line enough times to
highlight all overlapping matches without overlap in their highlighting.

I am starting to wonder whether the change from the Emacs 21 behavior
in the multiline case was really intentional.

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-18  2:48       ` Luc Teirlinck
@ 2005-06-19  3:50         ` Richard Stallman
  2005-06-19 14:14           ` Juri Linkov
  2005-06-20  1:57           ` Luc Teirlinck
  0 siblings, 2 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-19  3:50 UTC (permalink / raw)
  Cc: emacs-devel

    I will take a look at it, but first a decision has to be made on how
    we treat overlapping matches.

When not displaying context, it should display each line that contains
any part of one or more matches.  It should not display any line more
than once.
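
A sketch of that rule in Python, for illustration only.  It assumes
that a match may start at any position (so overlapping matches are all
considered) and uses a hypothetical function name; it is not a proposal
for the actual occur code, and it rescans the text too often to be
efficient on a large buffer.

```python
import re


def lines_to_display(regexp, text):
    """Return (line number, line text) for each line touched by a match.

    Every starting position is tried, so overlapping matches count.
    Each line is reported once, in buffer order.
    """
    compiled = re.compile(regexp)
    lines = text.splitlines()
    shown = set()
    for pos in range(len(text)):
        m = compiled.match(text, pos)
        if m and m.end() > m.start():
            first = text.count("\n", 0, m.start())
            last = text.count("\n", 0, m.end() - 1)
            shown.update(range(first, last + 1))
    return [(i + 1, lines[i]) for i in sorted(shown)]


print(lines_to_display("11\n22", "11\n22\n33\n11\n22\n"))
# [(1, '11'), (2, '22'), (4, '11'), (5, '22')]
```

With the buffer of five "11" lines and the regexp `11 C-q C-j 11', the
same function reports lines 1 through 5, each once.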

When context lines are specified, it is less clear.  One idea is to
display each group of lines that contains a match, plus context around
it.  When there are multiple matches within one line, that line should
only appear once.  However, if a line is partly matched by more than
one multiline match, it is not clear what the right thing is, so I'd say
don't spend much time on it.

    I propose to have occur treat overlapping matches the same as the
    other Emacs search functions do, which is also the way occur behaved
    before Emacs 22.

That would be a step backwards.  Please do not make that change.

    `M-x occur RET 11 C-q C-j 11 RET' will find two matches, one on line 1
    and one on line 3.  Those are the only matches that
    `C-M-s 11 C-q C-j 11 RET C-s C-s C-s...' at beginning of buffer is
    going to find.

These commands do different jobs; the right thing for one is not the right
thing for the other.  Consistency is not the right goal here.

    To be consistent with the way it currently treats multiline
    overlapping matches, occur should show the line enough times to
    highlight all overlapping matches without overlap in their highlighting.

    For example, with a buffer containing only "111", with the current
    occur, after `M-x occur RET 11 RET', the highlighting only finds a
    match at position 1, not at position 2.

Ideally it should show the line once, but highlight all matches in
that line.

    I am starting to wonder whether the change from the Emacs 21 behavior
    in the multiline case was really intentional.

The code appears to be designed for uniline matches.  However, the
Emacs 21 behavior you described is not desirable behavior.  It was
merely how things happened to be.  Returning to that should not be
the goal.


* Re: Matches for multiline regexps
  2005-06-19  3:50         ` Richard Stallman
@ 2005-06-19 14:14           ` Juri Linkov
  2005-06-20  3:50             ` Richard Stallman
  2005-06-20  1:57           ` Luc Teirlinck
  1 sibling, 1 reply; 17+ messages in thread
From: Juri Linkov @ 2005-06-19 14:14 UTC (permalink / raw)
  Cc: teirllm, emacs-devel

> When not displaying context, it should display each line that contains
> any part of one or more matches.  It should not display any line more
> than once.
>
> When context lines are specified, it is less clear.  One idea is to
> display each group of lines that contains a match, plus context around
> it.  When there are multiple matches within one line, that line should
> only appear once.

The current duplicating of context lines of consecutive matched lines
is too inconvenient.  It really should work like grep or diff, and
join consecutive lines with their context lines into one block.
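
To make the idea concrete, here is a small Python sketch of the kind of
merging grep and diff do: each matching line is widened by its context
and ranges that overlap or touch are joined into one block.  The
function and its parameters are hypothetical, chosen only for this
illustration.

```python
def merge_context_blocks(match_lines, context, total_lines):
    """Return (first, last) 1-based line ranges, one per output block.

    Each matching line is widened by `context' lines on both sides,
    clipped to the buffer, and ranges that overlap or are adjacent are
    merged so that no line is shown twice.
    """
    blocks = []
    for line in sorted(match_lines):
        first = max(1, line - context)
        last = min(total_lines, line + context)
        if blocks and first <= blocks[-1][1] + 1:
            # Overlaps or touches the previous block: extend it.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], last))
        else:
            blocks.append((first, last))
    return blocks


print(merge_context_blocks([3, 4, 10], context=1, total_lines=12))
# [(2, 5), (9, 11)]
```

Here the matches on lines 3 and 4 share one block (lines 2 through 5)
instead of showing lines 3 and 4 twice.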

-- 
Juri Linkov
http://www.jurta.org/emacs/


* Re: Matches for multiline regexps
  2005-06-19  3:50         ` Richard Stallman
  2005-06-19 14:14           ` Juri Linkov
@ 2005-06-20  1:57           ` Luc Teirlinck
  2005-06-20 17:51             ` Richard Stallman
  1 sibling, 1 reply; 17+ messages in thread
From: Luc Teirlinck @ 2005-06-20  1:57 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman wrote:

   The code appears to be designed for uniline matches.  However, the
   Emacs 21 behavior you described is not desirable behavior.  It was
   merely how things happened to be.  Returning to that should not be
   the goal.

I took a closer look at the current occur code.  It definitely does
not make the slightest attempt at handling regexps that can match more
than one line.  The fact that the regexp does not match a newline
seems to be a basic assumption made in the current occur code.  The
fact that it seemed to try (but not succeed) to handle regexps that
could match newlines was merely an optical illusion.  The current
occur code does not make any attempt to handle overlapping matches
either.  The fact that it seemed to try was again an optical illusion.

The current occur code completely fails in several respects for
regexps that can match newlines.  This has been the situation for more
than three years now in CVS.  The fact that no bug reports were ever
filed shows that nobody using Emacs CVS has attempted to use occur for
a regexp that can match newlines in more than three years.  The only
reason I tried was out of curiosity, to check how it would handle
overlapping matches.

Making occur handle multiline matches (which it could do in 21.3, but
it apparently lost that ability when occur got completely rewritten
more than three years ago), as well as overlapping matches (which is
tricky and which occur never could) and change the way context lines
are handled seems to require a complete rewrite of the current occur
machinery into something much more complex.  That seems to be a major
task and I currently do not have the time to take it on.  Experience
of the last three plus years seems to show that nobody appears to be
interested in such functionality anyway.

Maybe somebody more familiar with, and more heavily interested in,
occur than I am might be willing to take on the task.  (I merely
stumbled on the problem by rereading man/search.texi, which was
extensively changed since I last read it.)

I will take care of flush-lines and keep-lines, however.  I am nearly
ready with it and only need to double check some things.

Sincerely,

Luc.


* Re: Matches for multiline regexps
  2005-06-19 14:14           ` Juri Linkov
@ 2005-06-20  3:50             ` Richard Stallman
  2005-06-20  4:47               ` Juri Linkov
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Stallman @ 2005-06-20  3:50 UTC (permalink / raw)
  Cc: teirllm, emacs-devel

    The current duplicating of context lines of consecutive matched lines
    is too inconvenient.

Inconvenient for whom?  The user, or the maintainers of occur?

      It really should work like grep or diff, and
    join consecutive lines with their context lines into one block.

I am not necessarily against it.  On the other hand, I don't like the
idea of such a major rewrite of this code now.  Although it would be
part of fixing bugs in the handling of multiline regexps with context,
I would rather fix that bug in a simple way with little rewriting.


* Re: Matches for multiline regexps
  2005-06-20  3:50             ` Richard Stallman
@ 2005-06-20  4:47               ` Juri Linkov
  2005-06-21  2:00                 ` Richard Stallman
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Linkov @ 2005-06-20  4:47 UTC (permalink / raw)
  Cc: teirllm, emacs-devel

>     The current duplicating of context lines of consecutive matched lines
>     is too inconvenient.
>
> Inconvenient for whom?  The user, or the maintainers of occur?

For the user.  grep and diff don't duplicate context lines,
for good reasons.

>       It really should work like grep or diff, and
>     join consecutive lines with their context lines into one block.
>
> I am not necessarily against it.  On the other hand, I don't like the
> idea of such a major rewrite of this code now.  Although it would be
> part of fixing bugs in the handling of multiline regexps with context,
> I would rather fix that bug in a simple way with little rewriting.

I didn't suggest a major rewrite now, and I don't even consider this a bug.

(BTW, there are separate packages extending occur, such as moccur.el
which could be integrated into Emacs later).

-- 
Juri Linkov
http://www.jurta.org/emacs/


* Re: Matches for multiline regexps
  2005-06-20  1:57           ` Luc Teirlinck
@ 2005-06-20 17:51             ` Richard Stallman
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-20 17:51 UTC (permalink / raw)
  Cc: emacs-devel

Could you add an item to TODO and a comment in the code
saying that occur isn't handling multiline regexps?


* Re: Matches for multiline regexps
  2005-06-20  4:47               ` Juri Linkov
@ 2005-06-21  2:00                 ` Richard Stallman
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Stallman @ 2005-06-21  2:00 UTC (permalink / raw)
  Cc: teirllm, emacs-devel

    >     The current duplicating of context lines of consecutive matched lines
    >     is too inconvenient.
    >
    > Inconvenient for whom?  The user, or the maintainers of occur?

    For the user.  grep and diff don't duplicate context lines,
    for good reasons.

I can see both advantages and disadvantages, for the user.  The
advantage I see in listing each match separately is that you can see
how many matches there are, and where each one starts and ends.
For some regexps that might not always be obvious.

However, I am not sure this advantage is important enough to base the
decision on.

    (BTW, there are separate packages extending occur, such as moccur.el
    which could be integrated into Emacs later).

Could you tell me about moccur.el, so I can think about it
for the future?


