unofficial mirror of help-gnu-emacs@gnu.org
* why emacs lisp's regex has 2-steps escapes?
@ 2008-07-09 10:30 Xah
  2008-07-10  8:17 ` Kevin Rodgers
  2008-07-10  9:39 ` Alan Mackenzie
  0 siblings, 2 replies; 3+ messages in thread
From: Xah @ 2008-07-09 10:30 UTC (permalink / raw)
  To: help-gnu-emacs

Emacs regex has an odd peculiarity in that it needs a lot of backslashes.
More specifically, a string first needs to be properly escaped, and then
it is passed to the regex engine.

For example, suppose you have this text “Sin[x] + Sin[y]” and you need
to capture the x or y.

In Emacs I need to use
“\\(\\[[a-z]\\]\\)”
for the actual regex
“\(\[[a-z]\]\)”.

Here's a somewhat typical but long regex for matching an HTML image tag:

(search-forward-regexp
 (concat "<img +src=\"\\([^\"]+\\)\" +alt=\"\\([^\"]+\\)?\""
         " +width=\"\\([0-9]+\\)\" +height=\"\\([0-9]+\\)\" ?>")
 nil t)

The toothpick syndrome gets crazy, making already difficult regex syntax
impossible to read and hard to code.

My question is, why does elisp's regex have this 2-step process? Is this
a design decision, or did it just happen that way historically?

Second question: can't elisp provide something like a “regex-string” wrapper
function that automatically takes care of the quoting? I can't see how
this might be difficult.

Thanks.

  Xah
∑ http://xahlee.org/

* Re: why emacs lisp's regex has 2-steps escapes?
From: Kevin Rodgers @ 2008-07-10  8:17 UTC (permalink / raw)
  To: help-gnu-emacs

Xah wrote:
> emacs regex has a odd pecularity in that it needs a lot backslashes.
> More specifically, a string first needs to be properly escaped, then
> this passed to the regex engine.
> 
> For example, suppose you have this text “Sin[x] + Sin[y]” and you need
> to capture the x or y.
> 
> In emacs i need to use
> “\\(\\[[a-z]\\]\\)”

If all you want to capture is the x or y (without the square brackets):

	"\\[\\([a-z]\\)\\]"

> for the actual regex
> “\(\[[a-z]\]\)”.

The enclosing double quotes are misleading in this context.  I would
simply write (again, capturing the letter but not the brackets):

	\[\([a-z]\)\]
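
To make the two forms concrete, here is a minimal sketch (the names `demo-bracket-regexp` and `demo-first-bracket-letter` are made up for illustration and are not from the thread). It assumes nothing beyond the standard `string-match` and `match-string`:

```elisp
;; The Lisp reader turns each "\\" in the literal into one backslash,
;; so the regex engine receives \[\([a-z]\)\] (13 characters).
(defvar demo-bracket-regexp "\\[\\([a-z]\\)\\]"
  "Hypothetical name: regexp capturing one letter inside square brackets.")

(defun demo-first-bracket-letter (text)
  "Return the first bracketed lowercase letter in TEXT, or nil."
  (when (string-match demo-bracket-regexp text)
    (match-string 1 text)))

(length demo-bracket-regexp)                   ; => 13
(demo-first-bracket-letter "Sin[x] + Sin[y]")  ; => "x"
```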

Could you show the corresponding syntax in Perl or Java, as both a
conceptual (unquoted) regular expression and as a string literal (for
comparison)?

> Here's somewhat typical but long regex for matching a html image tag
> 
> (search-forward-regexp "<img +src=\"\\([^\"]+\\)\" +alt=\"\\([^\"]+\\)?
> \" +width=\"\\([0-9]+\\)\" +height=\"\\([0-9]+\\)\" ?>" nil t)
> 
> The toothpick syndrom gets crazy making already difficult regex syntax
> impossible to read and hard to code.

One of the reasons Emacs regular expressions are hard to read in this
way is that parentheses are defined as normal characters that need to be
escaped when they are to be interpreted as grouping delimiters, whereas
other languages interpret parentheses the opposite way (as metacharacters
that need to be escaped to be matched literally).
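
A small illustration of that convention, using only `string-match` and friends (the sample text "f(a)" is an arbitrary choice):

```elisp
;; Unescaped parentheses match themselves.
(string-match "(a)" "f(a)")            ; => 1, the position of the literal "("

;; Backslash-escaped parentheses form a capture group instead.
(let ((text "f(a)"))
  (when (string-match "\\(a\\)" text)
    (list (match-beginning 0) (match-string 1 text)))) ; => (2 "a")
```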

> My question is, why is elisp's regex has this 2-steps process? Is this
> some design decision or just happened that way historically?

It is due to the distinction between a string and the syntax for
representing it in a program, and the interpretation of the characters
in a string itself (vs. its surface representation) as a regular
expression.

This is just like writing a shell command (using double quotes around
the regular expression) that calls the grep program (which never "sees"
the quotes).
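
The two layers can be checked directly in `M-:` or *scratch*. This is only a sketch of the idea; the sample strings are arbitrary:

```elisp
;; Layer 1: the Lisp reader.  Two backslashes in the source become
;; one backslash character in the string object.
(length "\\[")                      ; => 2: a backslash and a bracket
(string= "\\[" (string ?\\ ?\[))    ; => t

;; Layer 2: the regex engine.  It sees \[ and matches a literal "[".
(string-match "\\[" "a[b")          ; => 1

;; Matching one literal backslash takes four in the source: the reader
;; halves them to \\, which the regex engine reads as a single "\".
(string-match "\\\\" "a\\b")        ; => 1
```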

> Second question: can't elisp create some like “regex-string” wrapper
> function that automatically takes care of the quoting? I can't see how
> this migth be difficult?

All you need to do is specify a regular expression syntax and a string
literal syntax that don't define meanings for the same character (here:
backslash).
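
One existing approach along these lines, not mentioned in the thread, is the `rx` macro that ships with Emacs: it describes the pattern as Lisp forms, so the regex backslashes never appear in the source. A minimal sketch (the variable name `demo-bracket-rx` is hypothetical), together with `regexp-quote` for the simpler case of matching a string literally:

```elisp
(require 'rx)

;; `rx' builds the regexp string from Lisp forms; literal strings such
;; as "[" are quoted automatically.
(defvar demo-bracket-rx
  (rx "[" (group (any "a-z")) "]")
  "Hypothetical name: one lowercase letter in square brackets, captured.")

(let ((text "Sin[x] + Sin[y]"))
  (when (string-match demo-bracket-rx text)
    (match-string 1 text)))                           ; => "x"

;; `regexp-quote' escapes a string so that it matches literally.
(string-match (regexp-quote "Sin[x]") "1 + Sin[x]")   ; => 4
```

This does not remove the string-literal layer (a double quote inside the pattern still needs `\"`), but it does take the regex-level escaping out of the programmer's hands.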

-- 
Kevin Rodgers
Denver, Colorado, USA






* Re: why emacs lisp's regex has 2-steps escapes?
From: Alan Mackenzie @ 2008-07-10  9:39 UTC (permalink / raw)
  To: Xah; +Cc: help-gnu-emacs

On Wed, Jul 09, 2008 at 03:30:27AM -0700, Xah wrote:
> emacs regex has a odd pecularity in that it needs a lot backslashes.
> More specifically, a string first needs to be properly escaped, then
> this passed to the regex engine.

Yes.  The greatest number of consecutive backslashes I've seen (in a
non-joke context) is 10.

> For example, suppose you have this text “Sin[x] + Sin[y]” and you need
> to capture the x or y.

Ironically, Xah, you are doing the same sort of thing in your post,
using crazy quote characters (if that is indeed what they are), 0x5397c
and 0x5397d (according to C-u C-x =).  Over my SSH link to my SSP, your
quotes look something like "â~@~]", and are most difficult to read
without a pair of sunspecs which filters out the UTF.

Could you, perhaps, use the standard ASCII quotes 0x22 and 0x27 here,
please?

> In emacs i need to use
> “\\(\\[[a-z]\\]\\)”
> for the actual regex
> “\(\[[a-z]\]\)”.

> Here's somewhat typical but long regex for matching a html image tag

> (search-forward-regexp "<img +src=\"\\([^\"]+\\)\" +alt=\"\\([^\"]+\\)?
> \" +width=\"\\([0-9]+\\)\" +height=\"\\([0-9]+\\)\" ?>" nil t)

> The toothpick syndrom gets crazy making already difficult regex syntax
> impossible to read and hard to code.

> My question is, why is elisp's regex has this 2-steps process? Is this
> some design decision or just happened that way historically?

> Second question: can't elisp create some like “regex-string” wrapper
> function that automatically takes care of the quoting? I can't see how
> this might be difficult?

Well, I've hacked up a function to display regexps in *scratch*,
concentrating in particular on deeply nested \( .... \| .... \)
constructs.  It doesn't work so well when the regexp's length exceeds
the window width, but it could be enhanced:

#########################################################################
(defun translate-rnt (regexp)
  "REGEXP is a string.  Translate any \\t \\n \\r and \\f characters
to weird non-ASCII printable characters: \\t to Î (206, \\xCE), \\n
to ñ (241, \\xF1), \\r to ® (174, \\xAE) and \\f to £ (163, \\xA3).
The original string is modified."
  (let (pos ch)		; bind `ch' locally rather than setting a global
    ;; Replace each whitespace control character in place until none remain.
    (while (setq pos (string-match "[\t\n\r\f]" regexp))
      (setq ch (aref regexp pos))
      (aset regexp pos
	    (cond ((eq ch ?\t) ?Î)
		  ((eq ch ?\n) ?ñ)
		  ((eq ch ?\r) ?®)
		  (t           ?£))))
    regexp))

(defun pp-regexp (regexp)
  "Pretty print a regexp.  This means, contents of \\\\\(s are lowered a line."
  (or (stringp regexp) (error "parameter is not a string."))
  (let ((depth 0)
	(re (copy-sequence regexp))
	(start 0)     ; earliest position still without an acm-depth property.
	(pos 0)	      ; current analysis position.
	(max-depth 0) ; How many lines do we need to print?
	(min-depth 0) ; Pick up "negative depth" errors.
	pr-line	      ; output line being constructed
	line-no	; line number of pr-line, varies between min-depth and max-depth.
	ch	      ; first character of the matched ( | or ) construct
	)
    (translate-rnt re)
    ;; apply acm-depth properties to the whole string.
    (while (< start (length re))
      (setq pos (string-match "\\\\\\((\\(\\?:\\)?\\||\\|)\\)"
				  re start))
      (put-text-property start (or pos (length re)) 'acm-depth depth re)
      (when pos
	(setq ch (aref (match-string 1 re) 0))
	(cond
	 ((eq ch ?\()
	  (put-text-property pos (match-end 1) 'acm-depth depth re)
	  (setq depth (1+ depth))
	  (if (> depth max-depth) (setq max-depth depth)))

	 ((eq ch ?\|)
	  (put-text-property pos (match-end 1) 'acm-depth (1- depth) re)
	  (if (< (1- depth) min-depth) (setq min-depth (1- depth))))

	 (t				; (eq ch ?\))
	  (setq depth (1- depth))
	  (if (< depth min-depth) (setq min-depth depth))
	  (put-text-property pos (match-end 1) 'acm-depth depth re))))
      (setq start (if pos (match-end 1) (length re))))

    ;; print out the strings
    (setq line-no min-depth)
    (while (<= line-no max-depth)
      (with-current-buffer "*scratch*"
	(goto-char (point-max)) (insert ?\n)
	(setq pr-line "")
	(setq start 0)
	(while (< start (length re))
	  (setq pos (next-single-property-change start 'acm-depth re (length re)))
	  (setq depth (get-text-property start 'acm-depth re))
	  (setq pr-line
		(concat pr-line
			(if (= depth line-no)
			    (substring re start pos)
			  (make-string (- pos start) ?\ ))))
	  (setq start pos))
	(insert pr-line)
	(setq line-no (1+ line-no))))))
#########################################################################      

> Thanks.

>   Xah

-- 
Alan Mackenzie (Nuremberg, Germany).




