unofficial mirror of help-gnu-emacs@gnu.org
* Regexp: match any character including newline
@ 2013-10-16 14:42 Yuri Khan
  2013-10-16 15:31 ` Kai Großjohann
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Yuri Khan @ 2013-10-16 14:42 UTC
  To: help-gnu-emacs@gnu.org

Hello All,

I’m doing regexp replacements on a hard-wrapped XHTML-alike. Here’s an
original fragment:

===
<tr><td><pre><code>X(n, t)
X a(n, t)</code></pre></td><td></td>
    <td><requires><p><code>T</code> shall be
        <concept>Copy&shy;Insert&shy;able</concept> into
        <code>X</code>.</p></requires>
        <p>post: <code>distance(begin(), end()) == n</code></p>
        <p>Constructs a sequence container with <code>n</code> copies
        of <code>t</code></p></td></tr>
===

Here’s what I need to turn it into:

===
<expression><pre><code>X(n, t)
X a(n, t)</code></pre></expression>
<return_type></return_type>
<assertion_note><requires><p><code>T</code> shall be
        <concept>Copy&shy;Insert&shy;able</concept> into
        <code>X</code>.</p></requires>
        <p>post: <code>distance(begin(), end()) == n</code></p>
        <p>Constructs a sequence container with <code>n</code> copies
        of <code>t</code></p></assertion_note>
===

To this end, I want to do a regexp replace of:

===
<tr><td>\(.*?\)</td><td>\(.*?\)</td>
    <td>\(.*?\)</td></tr>
===

with

===
<expression>\1</expression>
<return_type>\2</return_type>
<assertion_note>\3</assertion_note>
===

except that “.” needs to match any character including newline.

I know the obvious solution: instead of “.”, use the following monstrosity:

===
\(?:.\|
\)
===

However, I find that very cumbersome to type, especially since I have
to press C-q C-j in between.

Is there a way to make “.” match newline too, or is there an easier
way to match any character including newline? (I don’t want to limit
myself to [:ascii:] as there are also Unicode-specific dashes.)

For now, I’ve devised a workaround of using [^@] where @ is a
character that does not occur in the text. Maybe [^^] since it’s
easier to type and looks cute :)




* Re: Regexp: match any character including newline
  2013-10-16 14:42 Regexp: match any character including newline Yuri Khan
@ 2013-10-16 15:31 ` Kai Großjohann
  2013-10-16 15:56   ` Yuri Khan
  2013-10-16 16:53 ` Drew Adams
  2013-10-17  2:25 ` Eric Abrahamsen
  2 siblings, 1 reply; 8+ messages in thread
From: Kai Großjohann @ 2013-10-16 15:31 UTC
  To: Yuri Khan; +Cc: help-gnu-emacs@gnu.org

Yuri Khan wrote:
> 
> To this end, I want to do a regexp replace of:
> 
> ===
> <tr><td>\(.*?\)</td><td>\(.*?\)</td>
>     <td>\(.*?\)</td></tr>
> ===
> 
> with
> 
> ===
> <expression>\1</expression>
> <return_type>\2</return_type>
> <assertion_note>\3</assertion_note>
> ===

You can use keyboard macros, but you will need a mode that understands
XML.  Let's say you use nxml (it's part of Emacs, I think) and the
content is in a file foo.xml, so that nxml mode is turned on.  Suppose
point is before the <tr>.  C-M-f moves it past the <tr> tag, to just
before the <td>.  C-M-n then moves it after the closing </td>, even if
the content of <td>...</td> contains tags!

So you can record a keyboard macro that does the following steps:

- Move after the <tr>
- Insert "<expression>"
- Move to after the </td> with C-M-n
- Insert "</expression>" (using C-c /, say)
- Insert a newline
- Insert "<return_type>"
- Move to after the </td> with C-M-n
- C-c / to insert "</return_type>"
- Insert "<assertion_note>", move after the </td> with C-M-n, then C-c /
- Use C-M-f to move past the closing </tr>

After all of this, you've got:

<tr><expression><td>foo</td></expression>
<return_type><td>bar</td></return_type>
<assertion_note><td>baz</td></assertion_note></tr>

Now you can do this:  Set the mark with C-space and move backward over
the whole thing with C-M-p.  Now the whole <tr>...</tr> is marked, and
you can use query-replace to replace <tr>, <td>, </td> and </tr> with
nothing in the highlighted region.  (You may need to experiment to see
whether the region is deactivated after a query-replace.  If it is,
C-x C-x might be your friend.)

See?  No regex anywhere.  Way cool!  Instead, you're exploiting the
navigation that you get from Emacs modes.

Kai




* Re: Regexp: match any character including newline
  2013-10-16 15:31 ` Kai Großjohann
@ 2013-10-16 15:56   ` Yuri Khan
  0 siblings, 0 replies; 8+ messages in thread
From: Yuri Khan @ 2013-10-16 15:56 UTC
  To: Kai Großjohann; +Cc: help-gnu-emacs@gnu.org

On Wed, Oct 16, 2013 at 10:31 PM, Kai Großjohann
<kai.grossjohann@gmx.net> wrote:

> You can use keyboard macros, but you will need a mode that understands
> XML.  Let's say you install nxml (it's part of Emacs I think).  Let's
> say the content is in a file foo.xml, so that nxml mode is turned on.
> Consider that point is before the <tr>.  Now you can use C-M-f to move
> it before the <td>.  Now you can use C-M-n to move it after the closing
> </td>.  Even if the content of <td>...</td> contains tags!

Good alternate approach. If only macros were as fast and responsive as
regexp replace in my configuration…

In my case, nesting is not a concern (as HTML tables almost never nest
except for layout, and even then it’s evil), so regexps are an
adequate tool.

> See?  No regex anywhere.  Way cool!  Instead, you're exploiting the
> navigation that you get from Emacs modes.

This is way cool indeed, and I am in fact using nxml-mode and its
navigation commands.

However, this line of thought makes me wish for a match/replace
language as concise as regexps and at least as powerful as XSLT :]




* Re: Regexp: match any character including newline
       [not found] <mailman.4131.1381934579.10748.help-gnu-emacs@gnu.org>
@ 2013-10-16 15:58 ` Rustom Mody
  2013-10-16 16:16   ` Yuri Khan
       [not found]   ` <mailman.4141.1381940186.10748.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 8+ messages in thread
From: Rustom Mody @ 2013-10-16 15:58 UTC
  To: help-gnu-emacs

On Wednesday, October 16, 2013 8:12:54 PM UTC+5:30, Yuri Khan wrote:
> Hello All,
> 
> 
> I’m doing regexp replacements on a hard-wrapped XHTML-alike. Here’s an
> original fragment:

Regexp handling of XML is commonly a source of grief.
It is usually better to use a dedicated tool such as
http://www.crummy.com/software/BeautifulSoup/
or (more XML-ish than HTML-ish)
http://lxml.de/

These are Python solutions. I'm sure there are equivalent ones in
whatever other scripting language you prefer.
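
For comparison with the original question, here is a minimal sketch of
the same three-cell replacement in Python. It uses only the standard
`re` module, not BeautifulSoup or lxml, and Python regexp syntax rather
than Emacs syntax. The `re.DOTALL` flag is what makes "." match a
newline there. The names SOURCE, ROW, REPLACEMENT and convert_rows are
made up for this illustration, and the `\s*` between cells is an
assumption about how the rows are wrapped.

```python
import re

# Fragment from the original post (hard-wrapped, so cells span lines).
SOURCE = """<tr><td><pre><code>X(n, t)
X a(n, t)</code></pre></td><td></td>
    <td><requires><p><code>T</code> shall be
        <concept>Copy&shy;Insert&shy;able</concept> into
        <code>X</code>.</p></requires>
        <p>post: <code>distance(begin(), end()) == n</code></p>
        <p>Constructs a sequence container with <code>n</code> copies
        of <code>t</code></p></td></tr>"""

# re.DOTALL makes "." match newlines as well. The \s* between cells
# absorbs the line break and indentation (an assumption about the
# wrapping; the original pattern hard-codes a newline and four spaces).
ROW = re.compile(
    r"<tr><td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td></tr>",
    re.DOTALL,
)

# \1, \2, \3 refer to the three captured cells, as in the Emacs version.
REPLACEMENT = (
    "<expression>\\1</expression>\n"
    "<return_type>\\2</return_type>\n"
    "<assertion_note>\\3</assertion_note>"
)


def convert_rows(text: str) -> str:
    """Rewrite every non-nested three-cell <tr> found in text."""
    return ROW.sub(REPLACEMENT, text)


converted = convert_rows(SOURCE)
print(converted)
```

Like the Emacs pattern, this relies on the cells not containing nested
tables.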



* Re: Regexp: match any character including newline
  2013-10-16 15:58 ` Rustom Mody
@ 2013-10-16 16:16   ` Yuri Khan
       [not found]   ` <mailman.4141.1381940186.10748.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 8+ messages in thread
From: Yuri Khan @ 2013-10-16 16:16 UTC
  To: Rustom Mody; +Cc: help-gnu-emacs@gnu.org

On Wed, Oct 16, 2013 at 10:58 PM, Rustom Mody <rustompmody@gmail.com> wrote:

> Regexp handling of xml is commonly a source of grief.

Oh, please don’t get me wrong. I know all about the Chomsky hierarchy,
the pumping lemmas, and Tony the Pony[1]. Regexps only cause grief
when they collide with nesting.

[1]: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
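
The nesting caveat can be seen in a small sketch. This is Python's `re`
module rather than Emacs regexps, and the two sample strings are
invented for the illustration. A non-greedy cell pattern captures a
flat, multi-line cell correctly, but on a nested table it stops at the
first inner </td>.

```python
import re

# Non-greedy cell matcher; re.DOTALL lets "." cross line breaks.
CELL = re.compile(r"<td>(.*?)</td>", re.DOTALL)

# Hypothetical inputs: one flat cell, one cell containing a nested table.
flat = "<td>a\nb</td>"
nested = "<td><table><tr><td>inner</td></tr></table> outer</td>"

flat_match = CELL.search(flat).group(1)
nested_match = CELL.search(nested).group(1)

print(repr(flat_match))    # the whole cell content, newline included
print(repr(nested_match))  # cut short at the inner </td>
```

In the nested case the capture ends inside the inner table, so the rest
of the outer cell is lost. Without nesting, as here, the problem does
not arise.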




* Re: Regexp: match any character including newline
       [not found]   ` <mailman.4141.1381940186.10748.help-gnu-emacs@gnu.org>
@ 2013-10-16 16:48     ` Rustom Mody
  0 siblings, 0 replies; 8+ messages in thread
From: Rustom Mody @ 2013-10-16 16:48 UTC
  To: help-gnu-emacs

On Wednesday, October 16, 2013 9:46:18 PM UTC+5:30, Yuri Khan wrote:
> On Wed, Oct 16, 2013 at 10:58 PM, Rustom Mody  wrote:
> 
> > Regexp handling of xml is commonly a source of grief.
> 
> Oh, please don’t get me wrong. I know all about the Chomsky hierarchy,
> the pumping lemmas, and Tony the Pony[1]. Regexps only cause grief
> when they collide with nesting.

heh! Enjoy the pony-ride!



* RE: Regexp: match any character including newline
  2013-10-16 14:42 Regexp: match any character including newline Yuri Khan
  2013-10-16 15:31 ` Kai Großjohann
@ 2013-10-16 16:53 ` Drew Adams
  2013-10-17  2:25 ` Eric Abrahamsen
  2 siblings, 0 replies; 8+ messages in thread
From: Drew Adams @ 2013-10-16 16:53 UTC
  To: Yuri Khan, help-gnu-emacs

> “.” needs to match any character including newline.
> I know the obvious solution: instead of “.”, use the following
> monstrosity:
>
> \(?:.\|
> \)
> 
> However, I find that very cumbersome to type, especially since I
> have to press C-q C-j in between.
> 
> Is there a way to make “.” match newline too, or is there an easier
> way to match any character including newline?

1. I and others have requested this for vanilla Emacs a few times,
as a user toggle.  E.g.:

* http://lists.gnu.org/archive/html/emacs-devel/2006-03/msg00162.html
* http://lists.gnu.org/archive/html/emacs-devel/2006-03/msg00476.html
* http://lists.gnu.org/archive/html/emacs-devel/2006-11/msg01559.html
* http://lists.gnu.org/archive/html/emacs-devel/2006-12/msg00115.html

2. In Icicles at least, you can use `C-M-.' to toggle what `.'
represents in the minibuffer (i.e., for most interactive use).  When
`.' matches also a newline, it appears as `.' in the minibuffer, but
the actual regexp used under the covers is "\(.\|[
]\)".  (When this is the case, it is also highlighted, so you can tell.)

IOW, when newline is also being matched by `.', this propertized string
is inserted in the minibuffer when you type `.':

#("\\(.\\|[
]\\)" 0 10 (face highlight display "."))

Not the ideal solution (hence the requests cited), but handy enough.




* Re: Regexp: match any character including newline
  2013-10-16 14:42 Regexp: match any character including newline Yuri Khan
  2013-10-16 15:31 ` Kai Großjohann
  2013-10-16 16:53 ` Drew Adams
@ 2013-10-17  2:25 ` Eric Abrahamsen
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2013-10-17  2:25 UTC
  To: help-gnu-emacs

Yuri Khan <yuri.v.khan@gmail.com> writes:

> Hello All,
>
> I’m doing regexp replacements on a hard-wrapped XHTML-alike. Here’s an
> original fragment:
>
> ===
> <tr><td><pre><code>X(n, t)
> X a(n, t)</code></pre></td><td></td>
>     <td><requires><p><code>T</code> shall be
>         <concept>Copy&shy;Insert&shy;able</concept> into
>         <code>X</code>.</p></requires>
>         <p>post: <code>distance(begin(), end()) == n</code></p>
>         <p>Constructs a sequence container with <code>n</code> copies
>         of <code>t</code></p></td></tr>
> ===

Another option (though I'm not claiming you'll actually want to do this)
is to use xml.el (which I believe comes with Emacs) to parse that XML
into a tree, and then mess with the tree. Parsing the above gets me:

((tr nil (td nil (pre nil (code nil "X(n, t) X a(n, t)"))) (td nil) " "
(td nil (requires nil (p nil (code nil "T") " shall be " (concept nil
"Copy?Insert?able") " into " (code nil "X") ".")) " " (p nil "post: "
(code nil "distance(begin(), end()) == n")) " " (p nil "Constructs a
sequence container with " (code nil "n") " copies of " (code nil
"t")))))

`xml-entity-alist' would have to be tweaked so that entities such as
&shy; come through properly.

Like I said, you probably wouldn't want this, but it's an interesting
option...
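
The same parse-then-edit idea can be sketched outside Emacs. The
following uses Python's standard xml.etree.ElementTree, not xml.el, so
it only illustrates the approach. The names NEW_TAGS and convert_row
are made up, and replacing &shy; with the numeric reference &#173;
before parsing is an assumed workaround for the entity problem (it
plays the role of the `xml-entity-alist' tweak).

```python
import xml.etree.ElementTree as ET

# Fragment from the original post.
SOURCE = """<tr><td><pre><code>X(n, t)
X a(n, t)</code></pre></td><td></td>
    <td><requires><p><code>T</code> shall be
        <concept>Copy&shy;Insert&shy;able</concept> into
        <code>X</code>.</p></requires>
        <p>post: <code>distance(begin(), end()) == n</code></p>
        <p>Constructs a sequence container with <code>n</code> copies
        of <code>t</code></p></td></tr>"""

# New element names for the first, second and third cell of a row.
NEW_TAGS = ("expression", "return_type", "assertion_note")


def convert_row(fragment: str) -> str:
    """Parse one <tr> and return its three cells under the new names."""
    # Assumption: &shy; is the only named entity in the input. Without a
    # DTD the parser rejects it, so use the numeric reference instead.
    row = ET.fromstring(fragment.replace("&shy;", "&#173;"))
    cells = row.findall("td")
    if row.tag != "tr" or len(cells) != len(NEW_TAGS):
        raise ValueError("expected a <tr> with exactly three <td> cells")
    pieces = []
    for cell, tag in zip(cells, NEW_TAGS):
        cell.tag = tag    # rename the element in place
        cell.tail = None  # drop the whitespace that followed the old </td>
        pieces.append(
            ET.tostring(cell, encoding="unicode", short_empty_elements=False)
        )
    return "\n".join(pieces)


result = convert_row(SOURCE)
print(result)
```

Soft hyphens come out as literal U+00AD characters rather than as
&shy;, so a final substitution would be needed if the entity spelling
matters.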

E



