From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: gmane.emacs.bugs
Subject: bug#192: regexp does not work as documented
Date: Sun, 11 May 2008 21:28:18 -0400
Message-ID: <jwvr6c8pbd6.fsf-monnier+emacsbugreports__26426.5529149727$1210557191$gmane$org@gnu.org>
References: <87k5i8ukq8.fsf@stupidchicken.com>
	<200805061335.11379.bruno@clisp.org> <48204B3D.6000500@gmx.at>
	<yg5skwqc6ho.fsf@nested.ece.lsu.edu>
	<yg5bq3ddij2.fsf@nested.ece.lsu.edu> <4826A303.3030002@gmx.at>
	<87abiwoqzd.fsf@stupidchicken.com>
	<yg57ie0df8u.fsf@nested.ece.lsu.edu>
	<jwv4p94r8vp.fsf-monnier+emacsbugreports@gnu.org>
	<yg5tzh4bqtw.fsf@nested.ece.lsu.edu>
Reply-To: Stefan Monnier <monnier@iro.umontreal.ca>,
	192@emacsbugs.donarmstrong.com
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Trace: ger.gmane.org 1210557147 17002 80.91.229.12 (12 May 2008 01:52:27 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 12 May 2008 01:52:27 +0000 (UTC)
Cc: Chong Yidong <cyd@stupidchicken.com>, 192@emacsbugs.donarmstrong.com,
	Bruno Haible <bruno@clisp.org>, emacs-devel@gnu.org
To: David Koppelman <koppel@ece.lsu.edu>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Mon May 12 03:53:02 2008
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1JvNDZ-0001e7-Vp
	for geb-bug-gnu-emacs@m.gmane.org; Mon, 12 May 2008 03:52:57 +0200
Original-Received: from localhost ([127.0.0.1]:33682 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1JvNCp-0001oT-Jz
	for geb-bug-gnu-emacs@m.gmane.org; Sun, 11 May 2008 21:51:59 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1JvNCb-0001kU-Qa
	for bug-gnu-emacs@gnu.org; Sun, 11 May 2008 21:51:45 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1JvNCa-0001k3-IE
	for bug-gnu-emacs@gnu.org; Sun, 11 May 2008 21:51:44 -0400
Original-Received: from [199.232.76.173] (port=41643 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1JvNCa-0001ju-9T
	for bug-gnu-emacs@gnu.org; Sun, 11 May 2008 21:51:44 -0400
Original-Received: from rzlab.ucr.edu ([138.23.92.77]:50907)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60)
	(envelope-from <debbugs@rzlab.ucr.edu>) id 1JvNCZ-0007oZ-5g
	for bug-gnu-emacs@gnu.org; Sun, 11 May 2008 21:51:44 -0400
Original-Received: from rzlab.ucr.edu (rzlab.ucr.edu [127.0.0.1])
	by rzlab.ucr.edu (8.13.8/8.13.8/Debian-3) with ESMTP id m4C1pdpL014232; 
	Sun, 11 May 2008 18:51:41 -0700
Original-Received: (from debbugs@localhost)
	by rzlab.ucr.edu (8.13.8/8.13.8/Submit) id m4C1Z3BY011761;
	Sun, 11 May 2008 18:35:03 -0700
X-Loop: don@donarmstrong.com
Resent-From: Stefan Monnier <monnier@iro.umontreal.ca>
Resent-To: bug-submit-list@donarmstrong.com
Resent-CC: Emacs Bugs <bug-gnu-emacs@gnu.org>
Resent-Date: Mon, 12 May 2008 01:35:03 +0000
Resent-Message-ID: <handler.192.B192.121055571011085@emacsbugs.donarmstrong.com>
Resent-Sender: don@donarmstrong.com
X-Emacs-PR-Message: report 192
X-Emacs-PR-Package: emacs
X-Emacs-PR-Keywords: 
Original-Received: via spool by 192-submit@emacsbugs.donarmstrong.com
	id=B192.121055571011085
	(code B ref 192); Mon, 12 May 2008 01:35:03 +0000
Original-Received: (at 192) by emacsbugs.donarmstrong.com; 12 May 2008 01:28:30 +0000
Original-Received: from ironport2-out.teksavvy.com (ironport2-out.pppoe.ca
	[206.248.154.182])
	by rzlab.ucr.edu (8.13.8/8.13.8/Debian-3) with ESMTP id m4C1SPHc011079
	for <192@emacsbugs.donarmstrong.com>; Sun, 11 May 2008 18:28:27 -0700
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AmoCAAs6J0jO+JgrdGdsb2JhbACBU5A6ASeXCw
X-IronPort-AV: E=Sophos;i="4.27,470,1204520400"; d="el'?scan'208";a="20308168"
Original-Received: from smtp.pppoe.ca (HELO smtp.teksavvy.com) ([65.39.196.238])
	by ironport2-out.teksavvy.com with ESMTP; 11 May 2008 21:28:19 -0400
Original-Received: from pastel.home ([206.248.152.43])
	by smtp.teksavvy.com (Internet Mail Server v1.0) with ESMTP id SGO89018;
	Sun, 11 May 2008 21:28:18 -0400
Original-Received: by pastel.home (Postfix, from userid 20848)
	id 702CD7F83; Sun, 11 May 2008 21:28:18 -0400 (EDT)
In-Reply-To: <yg5tzh4bqtw.fsf@nested.ece.lsu.edu> (David Koppelman's message
	of "Sun, 11 May 2008 14:09:31 -0500")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)
X-detected-kernel: by monty-python.gnu.org: Linux 2.6 (newer, 3)
Resent-Date: Sun, 11 May 2008 21:51:44 -0400
X-BeenThere: bug-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Bug reports for GNU Emacs,
	the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.bugs:17969
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/17969>

--=-=-=

> A better solution would be to have font-lock use multi-line extended
> regions selectively. Perhaps a hint in the current keyword syntax
> (say, explicitly applying the font-lock-multiline property), or a
> separate method for providing multi-line keywords to font-lock.

I don't understand the difference between the above and the application
of font-lock-multiline properties which you seem to have tried and rejected.

I don't necessarily disagree with your rejection of font-lock-multiline:
it can have disastrous effect indeed if the multiline region becomes large.

> Such keywords would get the multi-line extended regions, the other
> just the whole-line extensions (or whatever the hooks do).
> Is this something the font-lock maintainers would consider?

I guess I simply do not understand what you propose.  Any improvement in
the multiline handling is welcome, but beware: this is not an easy area.

>> (while (re-search-forward <RE> nil t)
>>   (font-lock-fontify-region (match-beginning 0) (match-end 0)))
> I wouldn't do that without suppressing other keywords.

FWIW, I do pretty much exactly the above loop in smerge-mode and
I haven't heard complaints yet.

>> If someone wants that, I have a parser that takes a regexp and turns it
>> into something like `rx' syntax.  It uses my lex.el library (which
>> takes an `rx'-like input syntax).

> That sounds useful, either E-mail it to me or let me know
> where to find it.

Find the current version attached.  Consider it as 99.9% untested
code, tho.  Also you need to eval it before you can byte-compile it.
And I strongly recommend you byte-compile it to reduce the
specpdl usage.


        Stefan


--=-=-=
Content-Type: application/emacs-lisp
Content-Disposition: attachment; filename=lex.el
Content-Transfer-Encoding: quoted-printable

;;; lex.el --- Lexical analyser construction

;; Copyright (C) 2008  Stefan Monnier

;; Author: Stefan Monnier <monnier@iro.umontreal.ca>
;; Keywords:=20

;; This program is free software; you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; This program is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with this program.  If not, see <http://www.gnu.org/licenses/>.

;;; Commentary:

;; Format or regexps, is the same as used for `rx' and `sregex'.
;; Additions:
;; - (ere RE) specify regexps using the ERE syntax.
;; - (inter REs...) (aka `&') make a regexp that only matches
;;   if all its branches match.  E.g. (inter (ere ".*a.*") (ere ".*b.*"))
;;   match any string that contain both an "a" and a "b", in any order.
;; - (case-fold REs...) and (case-sensitive REs...) make a regexp that
;;   is case sensitive or not, regardless of case-fold-search.

;; Input format of lexers:
;;
;; ALIST of the form ((RE . VAL) ...)

;; Format of compiled DFA lexers:
;;
;; nil                     ; The trivial lexer that fails
;; (CHAR . LEXER)
;; (table . CHAR-TABLE)
;; (stop VAL . LEXER)      ; Match the empty string at point or LEXER.
;; (check (PREDICATE . ARG) SUCCESS-LEXER . FAILURE-LEXER)

;; Intermediate NFA nodes may additionally look like:
;; (or LEXERs...)
;; (orelse LEXERs...)
;; (and LEXERs...)
;; (join CONT . EXIT)

;;; Todo:

;; - submatches
;; - backrefs
;; - search rather than just match
;; - extensions:
;;   - repeated submatches
;;   - negation
;;   - lookbehind and lookahead
;;   - match(&search?) backward
;;   - agrep

;;; Notes

;; Search
;; ------

;; To turn a match into a search, the basic idea is to use ".*RE" to get
;; a search-DFA as opposed to the match-DFA generated from "RE".

;; Search in Plan9's regexp library is done as follows: match ".*RE" until
;; reaching the first match and then continue with only "RE".  The first
;; ".*RE" match corresponds to a search success for the leftmost shortest
;; match.  If we want the longest match, we need to continue.  But if we
;; continue with ".*RE" then we have no idea when to stop, so we should only
;; continue with "RE".
;; Downside: we may still match things after the "leftmost longest" match,
;; but hopefully will stop soon after.  I.e. we may look at chars past the
;; end of the leftmost longest match, but hopefully not too many.

;; Alternatives:
;; - Like emacs/src/regexp.c, we can just start a match at every buffer
;;   position.  Advantage: no need for submatch info in order to find
;;   (match-beginning 0), no need for a separate search-DFA.
;;   Downsize: O(N^2) rather than O(N).  But it's no worse than what we live
;;   with for decades in src/regexp.c.
;;
;; - After the shortest-search, stop the search and do a longest-match
;;   starting at position (match-beginning 0).  The good thing is that we
;;   will not look at any char further than needed.  Also we don't need to
;;   figure out how to switch from ".*RE" to "RE" in the middle of the sear=
ch.
;;   The downside is that we end up looking twice at the chars common to the
;;   shortest and longest matches.  Also this doesn't work: the shortest
;;   match may not be the leftmost match, so we can't just start the match
;;   at (match-beginning 0).
;;
;; - After the shortest-search, use the submatch data to figure out the NFA
;;   states (corresponding to the current search-DFA state) which are only
;;   reachable from later starting positions than (match-beginning 0),
;;   remove them and figure out from that the match-DFA state to which
;;   to switch.  Problem is: there might not be any such state in the
;;   match-DFA.
;;
;; - Generate a specialized search&match-DFA which encodes the job done by
;;   Plan9's regexp library.  I.e. do a specialized merge on
;;   (or LEXER (anything . LEXER)) where whenever we get a `stop' we don't
;;   merge any more.  After matching such a lexer, we still have to figure
;;   which of the matches we had is the leftmost longest match, of course.
;;   Actually, it's not that easy: the tail of a `stop' in the match-DFA can
;;   only match things whose (match-beginning 0) may be the same as the one
;;   of the `stop', whereas we also want to accept longer matches that start
;;   before (match-beginning 0).  So we want to keep merging on the tail of
;;   `stop' nodes, but only "partially" (whatever that means).

;; Lookahead
;; ---------

;; I suspect that the (?=3D<RE>) lookahead can be encoded using something l=
ike
;; `andalso'.  Of course, it can also trivially be encoded as a predicate,
;; but then we get an O(N^2) complexity.

;;; Code:

(eval-when-compile (require 'cl))

(defun copy-char-table (ct1)
  (let* ((subtype (char-table-subtype ct1))
         (ct2 (make-char-table subtype)))
    (map-char-table (lambda (c v) (set-char-table-range ct2 c v)) ct1)
    (dotimes (i (or (get subtype 'char-table-extra-slots) 0))
      (set-char-table-extra-slot ct2 i (char-table-extra-slot ct1 i)))
    ct2))

(defun lex--char-table->alist (ct)
  (let ((res ()))
    (map-char-table (lambda (k v)
                      (push (cons (if (consp k)
                                      ;; If k is a cons cell, we have to
                                      ;; copy it because map-char-table
                                      ;; reuses it.
                                      (cons (car k) (cdr k))
                                    ;; Otherwise, create a trivial cons-cell
                                    ;; so we have fewer cases to handle.
                                    (cons k k))
                                  v)
                            res))
                    ct)
    res))

(defun lex--merge-into (op al1 al2 ct)
  (assert (memq op '(and or orelse)))
  ;; We assume that map-char-table calls its function with increasing
  ;; `key' arguments.
  (while (and al1 al2)
    (let ((k1 (caar al1)) (k2 (caar al2)))
      (cond
       ;; Perfect overlap.
       ((equal k1 k2)
        (set-char-table-range ct k1
                              (lex--merge op (cdr (pop al1)) (cdr (pop al2)=
))))
       ;; k1 strictly greater than k2.
       ((and (consp k1) (consp k2) (> (car k1) (cdr k2)))
        (let ((v (cdr (pop al1))))
          (if (not (eq op 'and)) (set-char-table-range ct k1 v))))
       ;; k2 strictly greater than k1.
       ((and (consp k1) (consp k2) (> (car k2) (cdr k1)))
        (let ((v (cdr (pop al2))))
          (if (not (eq op 'and)) (set-char-table-range ct k2 v))))
       ;; There's partial overlap.
       ((and (consp k1) (consp k2) (> (cdr k1) (cdr k2)))
        (if (not (eq op 'and))
            (set-char-table-range ct (cons (1+ (cdr k2)) (cdr k1)) (cdar al=
1)))
        (setcdr k1 (cdr k2)))
       ((and (consp k1) (consp k2) (< (cdr k1) (cdr k2)))
        (if (not (eq op 'and))
            (set-char-table-range ct (cons (1+ (cdr k1)) (cdr k2)) (cdar al=
2)))
        (setcdr k2 (cdr k1)))
       ;; Now the tails are equal.
       ((and (consp k1) (consp k2) (> (car k1) (car k2)))
        (set-char-table-range ct k1 (lex--merge op (cdr (pop al1)) (cdar al=
2)))
        (setcdr k2 (1- (car k1))))
       ((and (consp k1) (consp k2) (< (car k1) (car k2)))
        (set-char-table-range ct k2 (lex--merge op (cdar al1) (cdr (pop al2=
))))
        (setcdr k1 (1- (car k2))))
       ((assert nil)))))
  (if (not (eq op 'and))
      (dolist (x (or al1 al2))
        (set-char-table-range ct (car x) (cdr x)))))

(defvar lex--states)
(defvar lex--memoize)

(defun lex--set-eq (l1 l2)
  (let ((len (length l2)))
    (setq l2 (copy-sequence l2))
    (while (consp l1)
      (assert (=3D len (length l2)))
      (unless (> len
                 (setq len (length (setq l2 (delq (pop l1) l2)))))
        (setq l1 t)))
    (not l1)))

(define-hash-table-test 'lex--set-eq 'lex--set-eq
  (lambda (l)
    (let ((hash 0))
      (while l
        (let ((x (pop l)))
          (if (memq x l) (progn (debug) nil)
            (setq hash (+ hash (sxhash x))))))
      hash)))
=20=20=20=20=20=20

(defun lex--flatten-state (state)
  (assert (memq (car state) '(and or orelse)))
  (let ((op (car state))
        (todo (cdr state))
        (done (list state))
        (res nil))
    (while todo
      (setq state (pop todo))
      (cond
       ((null state) (if (eq op 'and) (setq res nil todo nil)))
       ((memq state done) nil)
       ((eq (car-safe state) op) (setq todo (append (cdr state) todo)))
       (t (unless (memq state res) (push state res)))))
    (cons op (nreverse res))))

(defun lex--merge-2 (op lex1 lex2)
  (assert (memq op '(and or orelse)))
  ;; The order between lex1 and lex2 matters: preference is given to lex1.
  (cond
   ((eq lex1 lex2) (debug) lex1)  ;CHECK: ruled out by `lex--flatten-state'?
   ;; ((equal lex1 lex2) lex1)             ;Stack overflow :-(

   ;; Handle the 2 possible nil cases.
   ;; CHECK: ruled out by `lex--flatten-state'?
   ((null lex1) (debug) (if (eq op 'and) nil lex2))
   ((null lex2) (debug) (if (eq op 'and) nil lex1))

   ;; Do the predicate cases before the `stop' because the stop should
   ;; always come after the checks.
   ;; TODO: add optimizations for pairs of `checks' which are redundant,
   ;; or mutually exclusive, ... although we can also do it in lex-optimize.
   ((and (eq (car lex1) 'check) (eq (car lex2) 'check)
         (equal (nth 1 lex1) (nth 1 lex2))) ; Same predicate.
    (list* 'check (nth 1 lex1)
           (lex--merge op (nth 2 lex1) (nth 2 lex2))
           (lex--merge op (nthcdr 3 lex1) (nthcdr 3 lex2))))
   ((eq (car lex1) 'check)
    (list* 'check (nth 1 lex1)
           (lex--merge op (nth 2 lex1) lex2)
           (lex--merge op (nthcdr 3 lex1) lex2)))
   ((eq (car lex2) 'check)
    (list* 'check (nth 1 lex2)
           (lex--merge op lex1 (nth 2 lex2))
           (lex--merge op lex1 (nthcdr 3 lex2))))

   ;; Joins have the form (join CONT . EXIT) where EXIT is a lexer
   ;; corresponding to the rest of the regexp after the `and' sub-regexp.
   ;; All the joins corresponding to the same `and' have the same EXIT.
   ;; CONT is a lexer that contains another join inside, it corresponds to
   ;; the decision to not yet leave the `and'.
   ((and (eq (car lex1) 'join) (eq (car lex2) 'join))
    (assert (eq (cddr lex1) (cddr lex2))) ;Check they're the same join.
    `(join ,(lex--merge op (cadr lex1) (cadr lex2)) ,@(cddr lex1)))
   ;; If one the two lex's is a join but the other not, the other must
   ;; contain a corresponding join somewhere inside.
   ((eq (car lex1) 'join)
    (let ((next (lex--merge op (nth 1 lex1) lex2)))
      ;; lex1 is a valid exit point but lex2 isn't.
      (if (eq op 'and)
          next
        ;; FIXME: lex1 is implicitly an `or' (or `orelse'?) between (cadr
        ;; lex1) and (cddr lex1).  Here we construct an `or' (or `orelse'?)
        ;; between `next' and (cddr lex1).  I.e. we lose the `op' and we do
        ;; not preserve the ordering between lex2 and (cddr lex1).
        `(join ,next ,@(cddr lex1)))))
   ((eq (car lex2) 'join)
    (let ((next (lex--merge op lex1 (nth 1 lex2))))
      (if (eq op 'and) next `(join ,next ,@(cddr lex2)))))

   ;; The three `stop' cases.
   ((and (eq (car lex1) 'stop) (eq (car lex2) 'stop))
    ;; Here is where we give precedence to `lex1'.
    (if (eq op 'orelse) lex1
      (list* 'stop (cadr lex1) (lex--merge op (cddr lex1) (cddr lex2)))))
   ((eq (car lex1) 'stop)
    (let ((next (lex--merge op (cddr lex1) lex2)))
      (ecase op
        (or     (list* 'stop (cadr lex1) next))
        (orelse lex1)
        ;; CHECK: We should have hit a `join' before reaching a `stop'.
        (and    (debug) next))))
   ((eq (car lex2) 'stop)
    (let ((next (lex--merge op lex1 (cddr lex2))))
      ;; For `orelse', we want here to delay the `stop' until the point
      ;; where we know that lex1 doesn't match.  Sadly, I don't know how to
      ;; do it.
      (ecase op
        ;; CHECK: We should have hit a `join' before reaching a `stop'.
        (and         (debug) next)
        ;; FIXME: One thing we can do is to mark the value attached to the
        ;; `stop' so as to indicate that an earlier match may finish later.
        ;; This way, if the match is not `earlystop', we know it's one of
        ;; the leftmost ones, and maybe the search loop can avoid some work
        ;; when determining which is the leftmost longest match.
        (orelse      (list* 'stop `(earlystop ,(cadr lex2)) next))
        ((or orelse) (list* 'stop (cadr lex2) next)))))

   ;; The most general case.
   ((and (eq (car lex1) 'table) (eq (car lex2) 'table))
    (let ((al1 (lex--char-table->alist (cdr lex1)))
          (al2 (lex--char-table->alist (cdr lex2)))
          (ct (make-char-table 'lexer)))
      (lex--merge-into op al1 al2 ct)
      (cons 'table ct)))

   ((and (characterp (car lex1)) (characterp (car lex2))
         (eq (car lex1) (car lex2)))
    (cons (car lex1) (lex--merge op (cdr lex1) (cdr lex2))))
   ((and (characterp (car lex1)) (characterp (car lex2)))
    (unless (eq op 'and)
      (let ((ct (make-char-table 'lexer)))
        (aset ct (car lex1) (cdr lex1))
        (aset ct (car lex2) (cdr lex2))
        (cons 'table ct))))
   ((and (characterp (car lex1)) (eq (car lex2) 'table))
    (let ((next (lex--merge op (cdr lex1) (aref (cdr lex2) (car lex1)))))
      (if (eq op 'and)
          (if next (cons (car lex1) next))
        (let ((ct (copy-sequence (cdr lex2))))
          (aset ct (car lex1) next)
          (cons 'table ct)))))
   ((and (eq (car lex1) 'table) (characterp (car lex2)))
    (let ((next (lex--merge op (aref (cdr lex1) (car lex2)) (cdr lex2))))
      (if (eq op 'and)
          (if next (cons (car lex2) next))
        (let ((ct (copy-sequence (cdr lex1))))
          (aset ct (car lex2) next)
          (cons 'table ct)))))

   ((or (memq (car lex1) '(or orelse and state))
        (memq (car lex2) '(or orelse and state)))
    ;; `state' nodes are nodes whose content is not known yet, so we
    ;; have to delay the merge via the memoization table.
    ;; `or' and `and' nodes should only happen when the other `op' is being
    ;; performed, in which case we can't do the merge either before lex1
    ;; and lex2 have both been merged.
    (lex--merge op lex1 lex2))
   ((assert nil))))

(defun lex--merge-now (&rest state)
  (assert (memq (car state) '(and or orelse)))
  ;; Re-flatten, in case one of the sub-states was changed.
  (setq state (lex--flatten-state state))
  (if (<=3D (length state) 2)
      ;; FIXME: for `and' nodes, this is not enough: we need to look for and
      ;; remove the corresponding `join' node(s).
      (cadr state)
    (let ((op (pop state))
          (res (pop state)))
      (dolist (lex state)
        (setq res (lex--merge-2 op res lex)))
      (if (and (eq op 'and) (eq (car res) 'join))
          ;; Eliminate the join once it was all merged.
          ;; FIXME: This arbitrarily chooses `or' instead of `orelse',
          ;; and it arbitrarily gives CONT precedence to EXIT.  That doesn't
          ;; sound right.
          (lex--merge 'or (cadr res) (cddr res))
        res))))

(defun lex--merge (&rest state)
  (assert (memq (car state) '(and or orelse)))
  (setq state (lex--flatten-state state))
  (if (<=3D (length state) 2)
      (cadr state)
    (or (gethash state lex--memoize)
        (progn
          ;; (debug)
          (assert (memq (car state) '(and or orelse)))
          (push state lex--states)
          ;; The `state' node will be later on modified via setcar/setcdr,
          ;; se be careful to use a copy of it for the key.
          (puthash (cons (car state) (cdr state)) state lex--memoize)
          state))))

(defun lex--compile-category (category)
  (if (and (integerp category) (< category 128))
      category
    (if (symbolp category)
        (if (=3D 1 (length (symbol-name category)))
            (aref (symbol-name category) 0)
          (require 'rx)
          (cdr (assq category rx-categories))))))

(defun lex--compile-syntax (&rest syntaxes)
  (mapcar (lambda (x)
            (if (and (integerp x) (< x 32)) x
              (if (symbolp x)
                  (setq x (if (=3D 1 (length (symbol-name x)))
                              (symbol-name x)
                            (require 'rx)
                            (cdr (assq x rx-syntax)))))
              (if (characterp x) (setq x (string x)))
              (car (string-to-syntax x))))
          syntaxes))

(defconst lex--char-classes
  `((alnum alpha digit)
    (alpha word (?a . ?z) (?A . ?Z))
    (blank ?\s ?\t)
    (cntrl (?\0 . ?\C-_))
    (digit (?0 . ?9))
    ;; Include all multibyte chars, plus all the bytes except 128-159.
    (graph (?! . ?~) multibyte (#x3fffa0 . #x3fffff))
    ;; src/regexp.c handles case-folding inconsistently: lower and upper
    ;; match both lower- and uppercase ascii chars, but lower also matches
    ;; uppercase non-ascii chars whereas upper does not match lowercase
    ;; nonascii chars.  Here I simply ignore case-fold for [:lower:] and
    ;; [:upper:] because it's simpler and doesn't seem worse.
    (lower (check (lex--match-lower)))
    (upper (check (lex--match-upper)))
    (print graph ?\s)
    (punct (check (not (lex--match-syntax . ,(lex--compile-syntax "w"))))
           (?! . ?/) (?: . ?@) (?\[ . ?`) (?\{ . ?~))
    (space (check (lex--match-syntax . ,(lex--compile-syntax " "))))
    (xdigit digit (?a . ?f) (?A . ?F))
    (ascii (?\0 . ?\177))
    (nonascii (?\200 . #x3fffff))
    (unibyte ascii (#x3fff00 . #x3fffff))
    (multibyte (#x100 . #x3ffeff))
    (word (check (lex--match-syntax . ,(lex--compile-syntax "w"))))
    ;; `rx' alternative names.
    (numeric digit)
    (num digit)
    (control cntrl)
    (hex-digit xdigit)
    (hex xdigit)
    (graphic graph)
    (printing print)
    (alphanumeric alnum)
    (letter alpha)
    (alphabetic alpha)
    (lower-case lower)
    (upper-case upper)
    (punctuation punct)
    (whitespace space)
    (white space))
  "Definition of char classes.
Each element has the form (CLASS . DEFINITION) where definition
is a list of elements that can be either CHAR or (CHAR . CHAR),
or CLASS (another char class) or (check (PREDICATE . ARG))
or (check (not (PREDICATE . ARG))).")

(defvar lex--char-equiv-table nil
  "Equiv-case table to use to compile case-insensitive regexps.")

(defun lex--char-equiv (char)
  (when lex--char-equiv-table
    (let ((chars ())
          (tmp char))
      (while (and (setq tmp (aref lex--char-equiv-table tmp))
                  (not (eq tmp char)))
        (push tmp chars))
      (if chars (cons char chars)))))
=20=20=20=20

(defun lex--nfa (re state)
  (assert state)                   ;If `state' is nil we can't match anyway.
  (cond
   ((characterp re)
    (let ((chars (lex--char-equiv re)))
      (if (null chars)
          (cons re state)
        (let ((ct (make-char-table 'lexer)))
          (dolist (char chars) (aset ct char state))
          (cons 'table ct)))))
   ((stringp re)
    (if (null lex--char-equiv-table)
        ;; (Very) minor optimization.
        (nconc (mapcar 'identity re) state)
      (lex--nfa `(seq ,@(mapcar 'identity re)) state)))
   (t
    (ecase (or (car-safe re) re)
      ((: seq sequence
        ;; Hack!
        group)
       (dolist (elem (reverse (cdr re)))
         (setq state (lex--nfa elem state)))
       state)
      ((char in not-char)
       (let ((chars (cdr re))
             (checks nil)
             (fail nil)
             (char nil) ;The char seen, or nil if none, or t if more than o=
ne.
             (ct (make-char-table 'lexer)))
         (when (or (eq 'not (car chars)) (eq 'not-char (car re)))
           (setq chars (cdr chars))
           (set-char-table-range ct t state)
           (setq fail state)
           (setq state nil))
         (while chars
           (let ((range (pop chars)))
             (cond
              ((stringp range)
               (setq chars (append (cdr (lex--parse-charset range)) chars)))
              ((symbolp range)
               (setq range (or (cdr (assq range lex--char-classes))
                               (error "Uknown char class `%s'" range)))
               (setq chars (append range chars)))
              ((and (consp range) (eq 'check (car range)))
               (push (cadr range) checks))
              (t
               (setq char (if (or char (not (characterp range))
                                  (and lex--char-equiv-table
                                       (lex--char-equiv range)))
                              t range))
               ;; Set the range, first, regardless of case-folding.  This is
               ;; important because case-tables like to be set with few
               ;; large ranges rather than many small ones, as is done in
               ;; the case-fold loop.
               (set-char-table-range ct range state)
               (when (and lex--char-equiv-table
                          ;; Avoid looping over all characters.
                          (not (equal range '(#x100 . #x3ffeff))))
                 ;; Add all the case-equiv chars.
                 (let ((i (if (consp range) (car range) range))
                       (max (if (consp range) (cdr range) range))
                       char)
                   (while (<=3D i max)
                     (setq char i)
                     (while (and (setq char (aref lex--char-equiv-table cha=
r))
                                 (not (eq char i)))
                       (aset ct char state))
                     (setq i (1+ i)))))))))

         (let ((res (if (or (eq char t) fail)
                        (cons 'table ct)
                      (if char (cons char state)))))
           (if (and (not fail) checks)
               (setq state (lex--nfa 'anything state)))
           (dolist (check checks)
             (setq res
                   (if fail
                       ;; We do an `and' of the negation of the check and r=
es.
                       (if (eq (car-safe check) 'not)
                           (list 'check (cadr check) res)
                         (list* 'check check nil res))
                     ;; An `or' of the check and res.
                     (if (eq (car-safe check) 'not)
                         (list 'check (cadr check) res state)
                       (list* 'check check state res)))))
           res)))

      ((union or | orelse)
       (apply 'lex--merge (if (eq (car re) 'orelse) 'orelse 'or)
              (mapcar (lambda (re) (lex--nfa re state)) (cdr re))))

      ((inter intersection &)
       ;; Just using `and' is not enough because we have to enforce that the
       ;; sub-regexps (rather than the whole regexp) match the same string.
       ;; So we need to mark the juncture point.
       (let* ((join `(join nil ,@state)))
         (apply 'lex--merge 'and
                (mapcar (lambda (re) (lex--nfa re join)) (cdr re)))))

      ((0+ zero-or-more * *\?)
       (let ((newstate (list 'state)))
         (let ((lexer (lex--nfa (cons 'seq (cdr re)) newstate)))
           (setcdr newstate (if (memq (car re) '(*\?))
                                (list state lexer)
                              (list lexer state))))
         (setcar newstate (if (memq (car re) '(*\?)) 'orelse 'or))
         (assert (memq (car newstate) '(and or orelse)))
         (push newstate lex--states)
         newstate))
=20=20=20=20=20=20
      ((string-end eos eot buffer-end eob) `(check (lex--match-eobp) ,state=
))
      ((string-start bos bot buffer-start bob)
       `(check (lex--match-bobp) ,state))
      ((line-end eol) `(check (lex--match-eolp) ,state))
      ((line-start bol) `(check (lex--match-bolp) ,state))
      ((word-start bow) `(check (lex--match-bowp) ,state))
      ((word-end eow) `(check (lex--match-eowp) ,state))
      ((symbol-start) `(check (lex--match-bosp) ,state))
      ((symbol-end) `(check (lex--match-eosp) ,state))
      ((not-word-boundary) `(check (lex--match-not-word-boundary) ,state))
      ((word-boundary) `(check (lex--match-not-word-boundary) nil . ,state))
      ((syntax) `(check (lex--match-syntax
                          . ,(apply 'lex--compile-syntax (cdr re)))
                        ,(lex--nfa 'anything state)))
      ((not-syntax) `(check (lex--match-syntax
                             . ,(apply 'lex--compile-syntax (cdr re)))
                            nil . ,(lex--nfa 'anything state)))
      ((category) `(check (lex--match-category
                           . ,(lex--compile-category (cadr re)))
                          ,(lex--nfa 'anything state)))
      ((not-category) `(check (lex--match-category
                               . ,(lex--compile-category (cadr re)))
                              nil . ,(lex--nfa 'anything state)))
=20=20=20=20=20=20
      ;; `rx' accepts char-classes directly as regexps.  Let's reluctantly
      ;; do the same.
      ((digit numeric num control cntrl hex-digit hex xdigit blank graphic
        graph printing print alphanumeric alnum letter alphabetic alpha asc=
ii
        nonascii lower lower-case upper upper-case punctuation punct space
        whitespace white)
       (lex--nfa `(char ,re) state))

      ((case-sensitive)
       (let ((lex--char-equiv-table nil))
         (lex--nfa `(seq ,@(cdr re)) state)))

      ((case-fold)
       (let ((lex--char-equiv-table (get-eqvcase-table (current-case-table)=
)))
         (lex--nfa `(seq ,@(cdr re)) state)))

      ((point
        ;; Sub groups!
        submatch group backref
        ;; Greediness control
        minimal-match maximal-match)
       (error "`%s' Not implemented" (or (car-safe re) re)))

      ((not-newline nonl dot) (lex--nfa '(char not ?\n) state))
      (anything (lex--nfa '(char not) state))
      ((word wordchar) (lex--nfa '(syntax w) state))
      ((not-wordchar) (lex--nfa '(not-syntax w) state))

      (any
       ;; `rx' uses it for (char ...) sets, and sregex uses it for `dot'.
       (lex--nfa (if (consp re) (cons 'char (cdr re)) '(char not ?\n)) stat=
e))

      (not
       ;; The `not' as used in `rx' should be deprecated: we may want to
       ;; provide real support for a fully general `not', but its semantics
       ;; will most likely be different.  E.g. we may want to let
       ;; (not (char ...)) match the empty string.
       (setq re (cadr re))
       (ecase (or (car-safe re) re)
         (word-boundary
          (message "`not' deprecated: use not-word-boundary")
          (lex--nfa 'not-word-boundary state))
         ((any in char)
          (message "`not' deprecated: use (%s not ...)" (or (car-safe re) r=
e))
          (lex--nfa (list* (car re) 'not (cdr re)) state))
         ((category syntax)
          (message "`not' deprecated: use not-%s" (car re))
          (lex--nfa (cons (intern (format "not-%s" (car re))) (cdr re)) sta=
te))))

      (and
       ;; `rx' defined `and' as `sequence', but we may want to define it
       ;; as intersection instead.
       (error "`and' is deprecated, use `seq', `:', or `sequence' instead"))
=20=20=20=20=20=20=20=20=20
      ((1+ one-or-more + +\?)
       (lex--nfa `(seq (seq ,@(cdr re))
                       (,(if (memq (car re) '(+\?)) '*\? '0+) ,@(cdr re)))
                 state))
      ((opt zero-or-one optional \?)
       (lex--nfa `(or (seq ,@(cdr re)) "") state))
      ((\?\?)
       (lex--nfa `(orelse "" (seq ,@(cdr re))) state))
      ((repeat ** =3D)
       (let ((min (nth 1 re))
             (max (nth 2 re))
             (res (nthcdr 3 re)))
         (unless res
           (setq res (list max)) (setq max min))
         (lex--nfa `(seq ,@(append (make-list (or min 0)
                                              (if (eq (length res) 1)
                                                  (car res)
                                                (cons 'seq res)))
                                   (if (null max)
                                       `((0+ ,@res))
                                     (make-list (- max (or min 0))
                                                `(opt ,@res)))))
                   state)))
      ((>=3D) (lex--nfa `(repeat ,(nth 1 re) nil ,@(nthcdr 2 re)) state))

      ((bre re ere)
       (lex--nfa (lex-parse-re (nth 1 re) (car re)) state))))))


(defun lex--dfa-wrapper (f)
  (let* ((lex--states ())
         (lex--memoize (make-hash-table :test 'lex--set-eq))
         (res (funcall f))
         (postponed ())
         (states-dfa (make-hash-table :test 'eq)))

    (while lex--states
      (dolist (state (prog1 lex--states (setq lex--states nil)))
        (let ((merged (apply 'lex--merge-now state)))
          (if (memq (car merged) '(and or orelse))
              ;; The merge could not be performed for some reason:
              ;; let's re-schedule it.
              (push state postponed)
            (puthash state merged states-dfa))))

      (unless lex--states
        ;; If states-dfa is empty it means we haven't made any progress,
        ;; so we're stuck in an infinite loop.  Hopefully this cannot happe=
n?
        (assert (not (zerop (hash-table-count states-dfa))))
        (maphash (lambda (k v)
                   (unless v
                     ;; With `intersection', lex--merge may end up returning
                     ;; nil if the intersection is empty, so `v' can be
                     ;; nil here.  In since `k' is necessarily a cons cell,
                     ;; we can't turn it into nil, so we turn it into
                     ;; a more costly lexer that also fails for all inputs.
                     (setq v '(?a)))
                   (setcar k (car v))
                   (setcdr k (cdr v)))
                 states-dfa)
        (clrhash states-dfa)
        (setq lex--states postponed)
        (setq postponed nil)))

    res))

(defun lex-compile (alist)
  (lex--dfa-wrapper
   (lambda ()
     (let ((lex--char-equiv-table
            (if case-fold-search (get-eqvcase-table (current-case-table)))))
       (apply 'lex--merge 'or
              (mapcar (lambda (x) (lex--nfa (car x) (list 'stop (cdr x))))
                      alist))))))

(defun lex-search-dfa (match-dfa)
  ;; This constructs a search-DFA whose last match should be the leftmost
  ;; longest match.
  (lex--dfa-wrapper
   (lambda ()
     (lex--nfa '(*\? (char not)) match-dfa))))

(defun lex--terminate-if (new old)
  (cond
   ((eq new t) t)
   ((eq old t) t)
   (t (while new (let ((x (pop new))) (if (memq x old) (push x old)))) old)=
))

(defun lex--optimize-1 (lexer)
  (let ((terminate nil))
    (cons
     (case (car lexer)
       (table
        (let ((ct (cdr lexer))
              (char nil))
          ;; Optimize each entry.
          (map-char-table
           (lambda (range v)
             (let ((cell (lex--optimize v)))
               (setq terminate (lex--terminate-if (cdr cell) terminate))
               (set-char-table-range ct range (car cell))))
           ct)
          ;; Optimize the internal representation of the table.
          (optimize-char-table (cdr lexer))
          ;; Eliminate the table if possible.
          (map-char-table
           (lambda (range v)
             (setq char
                   (if (and (characterp range) (null char))
                       range t)))
           ct)
          (case char
            ((nil) nil)
            ((t) lexer)
            (t (setcar lexer 'char) (setcdr lexer (aref ct char)) lexer))))
       (stop
        (let ((cell (lex--optimize (cddr lexer))))
          (setq terminate t)
          (setf (cddr lexer) (car cell)))
        lexer)
       (check
        (let* ((test (nth 1 lexer))
               (cellf (lex--optimize (nthcdr 3 lexer)))
               (fail (setf (nthcdr 3 lexer) (car cellf)))
               (cells (lex--optimize (nth 2 lexer)))
               (succ (setf (nth 2 lexer) (car cells))))
          (setq terminate (lex--terminate-if (cdr cellf) terminate))
          (setq terminate (lex--terminate-if (cdr cells) terminate))
          (when (eq 'check (car succ))
            (cond
             ((equal test (nth 1 succ)) ;Same successful test.
              (debug)
              (setf (nth 2 lexer) (setq succ (nth 2 succ))))
             ;; TODO: we can add rules such as bobp -> eolp,
             ;; bosp -> bowp, (syntax X) -> (syntax Y X), ...
             ))
          (when (eq 'check (car fail))
            (cond
             ((equal test (nth 1 fail)) ;Same failing test.
              (debug)
              (setf (nthcdr 3 lexer) (setq fail (nthcdr 3 succ))))
             ;; TODO: we can add rules such as !eolp -> !bobp,
             ;; !bowp -> !bosp, !(syntax Y X) -> !(syntax X), ...
             ))
          (if (or succ fail) lexer)))
       (t
        (assert (characterp (car lexer)))
        (let ((cell (lex--optimize (cdr lexer))))
          (setq terminate (lex--terminate-if (cdr cell) terminate))
          (if (setf (cdr lexer) (car cell))
              lexer))))
     (if (consp terminate)
         (delq lexer terminate)
       terminate))))

(defun lex--optimize (lexer)
  (when lexer
    ;; The lex--memoize cache maps lexer states to (LEXER . TERMINATE) where
    ;; TERMINATE is either t to say that LEXER can terminate or a list of
    ;; lexers which means that LEXER terminates only if one of the lexers in
    ;; the list terminates.
    (let ((cache (gethash lexer lex--memoize)))
      (if cache
          ;; Optimize (char C) to nil.
          (if (and (characterp (caar cache)) (null (cdar cache))) nil cache)
        ;; Store a value indicating that we're in the process of computing =
it,
        ;; so when we encounter a loop, we don't recurse indefinitely.
        ;; Not knowing any better, we start by stating the tautology that
        ;; `lexer' terminates if and only if `lexer' terminates.
        (let ((cell (cons lexer (list lexer))))
          (puthash lexer cell lex--memoize)
          (let ((res (lex--optimize-1 lexer)))
            (if (and (car res) (cdr res))
                res
              (setcar lexer ?a)
              (setcdr lexer nil)
              (puthash lexer '(nil) lex--memoize)
              nil)))))))

(defun lex-optimize (lexer)
  (let ((lex--memoize (make-hash-table :test 'eq)))
    (prog1 (car (lex--optimize lexer))
      (message "Visited %d states" (hash-table-count lex--memoize)))))

(defmacro lex-case (object posvar &rest cases)
  (declare (indent 2))
  (let* ((i -1)
         (alist (mapcar (lambda (case) (cons (car case) (incf i))) cases))
         (lex (lex-compile alist))
         (tmpsym (make-symbol "tmp")))
    (setq i -1)
    `(let ((,tmpsym (lex-match-string ',lex ,object ,posvar)))
       (case (cdr ,tmpsym)
         ,@(mapcar (lambda (case)
                     `(,(incf i) (set-match-data
                                  (list ,posvar (setq ,posvar (car ,tmpsym)=
)))
                       ,@(cdr case)))
                   cases)))))

;;; Matching engine

(defun lex--match-bobp (arg pos &optional string)
  (=3D pos (if string 0 (point-min))))

(defun lex--match-eobp (arg pos &optional string)
  (=3D pos (if string (length string) (point-max))))

(defun lex--match-bolp (arg pos &optional string)
  (if string (or (=3D pos 0) (eq ?\n (aref string (1- pos))))
    (memq (char-before pos) '(nil ?\n))))

(defun lex--match-eolp (arg pos &optional string)
  (if string (or (=3D pos (length string)) (eq ?\n (aref string pos)))
    (memq (char-after pos) '(nil ?\n))))

(defun lex--match-bowp (arg pos &optional string)
  (and (not (if string (and (> pos 0)
                            (eq ?w (char-syntax (aref string (1- pos)))))
              (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos))))=
)))
       (if string (and (< pos (length string))
                       (eq ?w (char-syntax (aref string pos))))
         (eq 2 (car (syntax-after pos))))))

(defun lex--match-eowp (arg pos &optional string)
  (and (if string (and (> pos 0)
                       (eq ?w (char-syntax (aref string (1- pos)))))
         (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos))))))
       (not (if string (and (< pos (length string))
                            (eq ?w (char-syntax (aref string pos))))
              (eq 2 (car (syntax-after pos)))))))

(defun lex--match-bosp (arg pos &optional string)
  (and (not (if string
                (and (> pos 0)
                     (memq (char-syntax (aref string (1- pos))) '(?w ?_)))
              (and (> pos (point-min))
                   (memq (car (syntax-after (1- pos))) '(2 3)))))
       (if string (and (< pos (length string))
                       (memq (char-syntax (aref string pos)) '(?w ?_)))
         (memq (car (syntax-after pos)) '(2 3)))))

(defun lex--match-eosp (arg pos &optional string)
  (and (if string (and (> pos 0)
                       (memq (char-syntax (aref string (1- pos))) '(?w ?_)))
         (and (> pos (point-min)) (memq (car (syntax-after (1- pos))) '(2 3=
))))
       (not (if string (and (< pos (length string))
                            (memq (char-syntax (aref string pos)) '(?w ?_)))
              (memq (car (syntax-after pos)) '(2 3))))))

(defun lex--match-not-word-boundary (arg pos &optional string)
  (eq (if string (and (> pos 0)
                      (eq ?w (char-syntax (aref string (1- pos)))))
        (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos))))))
      (if string (and (< pos (length string))
                      (eq ?w (char-syntax (aref string pos))))
        (eq 2 (car (syntax-after pos))))))

(defun lex--match-upper (arg pos &optional string)
  (when (< pos (if string (length string) (point-max)))
    (let ((char (if string (aref string pos) (char-after pos))))
      (not (eq (downcase char) char)))))

(defun lex--match-lower (arg pos &optional string)
  (when (< pos (if string (length string) (point-max)))
    (let ((char (if string (aref string pos) (char-after pos))))
      (not (eq (upcase char) char)))))


(defun lex--match-category (category pos &optional string)
  (when (< pos (if string (length string) (point-max)))
    (aref (char-category-set (if string (aref string pos)
                               (char-after pos)))
          category)))

(defun lex--match-syntax (syntaxes pos &optional string)
  (when (< pos (if string (length string) (point-max)))
    (memq (car (if string (aref (syntax-table) (aref string pos))
                 (syntax-after pos)))
          syntaxes)))


(defun lex-match-string (lex string &optional start)
  ;; FIXME: Move this to C.
  (unless start (setq start 0))
  (let ((match nil))
    (while
        (progn
          (while (eq (car lex) 'check)
            (setq lex (if (funcall (car (nth 1 lex)) (cdr (nth 1 lex))
                                   start string)
                         (nth 2 lex) (nthcdr 3 lex))))
          (when (eq (car lex) 'stop)
            (setq match (cons start (cadr lex)))
            (message "Found match: %s" match)
            (setq lex (cddr lex)))
          (assert (not (eq (car lex) 'stop)))
          (and lex (< start (length string))))
      (let ((c (aref string start)))
        (setq start (1+ start))
        (setq lex (cond
                    ((eq (car lex) 'table) (aref (cdr lex) c))
                    ((integerp (car lex)) (if (eq c (car lex)) (cdr lex))))=
)))
    (message "Final search pos considered: %s" start)
    match))
=20=20=20=20=20=20=20=20
;;; Regexp parsers.

(defun lex--tokenizer (lex string)
  (let ((tokens ())
        (i 0)
        tmp)
    (while (and (< i (length string))
                (setq tmp (lex-match-string lex string i)))
      (push (cons (cdr tmp) (substring string i (setq i (car tmp)))) tokens=
))
    (nreverse tokens)))

(defun lex--parse-charset (string)
  (let ((i 0)
        (ranges ()))
    (when (eq (aref string i) ?^)
      (push 'not ranges)
      (setq i (1+ i)))
    (let ((op nil)
          (case-fold-search nil))
      (while (not (eq op 'stop))
        (lex-case string i
          ((seq "[:" (0+ (char (?a . ?z) (?A . ?Z))) ":]")
           (push (intern (substring string (+ 2 (match-beginning 0))
                                    (- (match-end 0) 2)))
                 ranges))
          ((seq anything "-" anything)
           (push (cons (aref string (match-beginning 0))
                       (aref string (1- (match-end 0))))
                 ranges))
          (anything (push (aref string (1- (match-end 0))) ranges))
          (eob (setq op 'stop))))
=20=20=20=20=20=20
      `(char ,@(nreverse ranges)))))

(defconst lex--parse-re-lexspec
  '(((or "*" "+" "?" "*?" "+?" "??") . suffix)
    ((seq "[" (opt "^") (opt "]")
          (0+ (or (seq (char not ?\]) "-" (char not ?\]))
                  (seq "[:" (1+ (char (?a . ?z) (?A . ?Z))) ":]")
                  (char not ?\]))) "]") . charset)
    ((seq "\\c" anything) . category)
    ((seq "\\C" anything) . not-category)
    ((seq "\\s" anything) . syntax)
    ((seq "\\S" anything) . not-syntax)
    ((seq "\\" (char (?1 . ?9))) . backref)
    ("\\'" . eob)
    ("\\`" . bob)
    ("." . dot)
    ("^" . bol)
    ("$" . eol)
    ("." . dot)
    ("\\<" . bow)
    ("\\>" . eow)
    ("\\_<" . symbol-start)
    ("\\_>" . symbol-end)
    ("\\w" . wordchar)
    ("\\W" . not-wordchar)
    ("\\b" . word-boundary)
    ("\\B" . not-word-boundary)
    ("\\=3D" . point)
    ((or (seq ?\\ anything) anything) . char)))
=20=20

(defconst lex--parse-ere-lexer
  (let ((case-fold-search nil))
    (lex-compile
     (append '(("(?:" . shy-group)
               ("|"  . or)
               ((seq "{" (0+ (char (?0 . ?9)))
                     (opt (seq "," (0+ (char (?0 . ?9))))) "}") . repeat)
               ((or ")" eob) . stop)
               ("(" . group))
             lex--parse-re-lexspec))))

(defconst lex--parse-bre-lexer
  (let ((case-fold-search nil))
    (lex-compile
     (append '(("\\(?:" . shy-group)
               ("\\|"  . or)
               ((seq "\\{" (0+ (char (?0 . ?9)))
                     (opt (seq "," (0+ (char (?0 . ?9))))) "\\}") . repeat)
               ((or "\\)" eob) . stop)
               ("\\(" . group))
             lex--parse-re-lexspec))))

(defun lex--parse-re (string i lexer)
  (let ((stack ())
        (op nil)
        (res nil)
        tmp)
    (while (and (not (eq op 'stop))
                (setq tmp (lex-match-string lexer string i)))
      (ecase (cdr tmp)
        (shy-group
         (setq tmp (lex--parse-re string (car tmp) lexer))
         (unless (eq (aref string (1- (car tmp))) ?\))
           (error "Unclosed shy-group"))
         (push (cdr tmp) res))
        (group
         (setq tmp (lex--parse-re string (car tmp) lexer))
         (unless (eq (aref string (1- (car tmp))) ?\))
           (error "Unclosed shy-group"))
         (push (list 'group (cdr tmp)) res))
        (suffix
         (if (null res) (error "Non-prefixed suffix operator")
           (setq res (cons (list (cdr (assoc (substring string i (car tmp))
                                            '(("*" . 0+)
                                              ("+" . 1+)
                                              ("?" . opt)
                                              ("*?" . *\?)
                                              ("+?" . +\?)
                                              ("??" . \?\?))))
                                 (car res))
                           (cdr res)))))
        (or (push `(or (seq ,@(nreverse res))) stack)
            (setq res nil))
        (charset
         (push (lex--parse-charset (substring string (1+ i) (1- (car tmp))))
               res))
        (repeat
         ;; Here we would like to have sub-matches :-(
         (let* ((min (string-to-number
                      (substring string (+ i (if (eq (aref string i) ?\\) 2=
 1))
                                 (car tmp))))
                (max (let ((comma (string-match "," string i)))
                       (if (not (and comma (< comma (car tmp))))
                           min
                         (if (=3D comma (- (car tmp) 2))
                             nil
                           (string-to-number (substring string (1+ comma)))=
)))))
           (if (null res) (error "Non-prefixed repeat operator")
             (setq res (cons `(repeat ,min ,max ,(car res)) (cdr res))))))
        (stop (setq op 'stop))
        ((syntax category not-syntax not-category)
         (push (list (cdr tmp) (aref string (1- (car tmp)))) res))
        (backref
         (push (list (cdr tmp) (- (aref string (1- (car tmp))) ?0)) res))
        (char
         (push (aref string (1- (car tmp))) res))
        (t (push (cdr tmp) res)))
      (setq i (car tmp)))
    (let ((re `(seq ,@(nreverse res))))
      (while stack (setq re (nconc (pop stack) (list re))))
      (cons i re))))

(defun lex-parse-re (string &optional lexer)
  (setq lexer (cond ((eq lexer 'ere) lex--parse-ere-lexer)
                    ((memq lexer '(bre re nil)) lex--parse-bre-lexer)
                    (t lexer)))
  (let ((res (lex--parse-re string 0 lexer)))
    (if (< (car res) (length string))
        (error "Regexp parsing failed around %d: ...%s..."
               (substring string (1- i) (1+ i)))
      (cdr res))))


;; (defun lex--parse-re (string i)
;;   (let ((stack ())
;;         (op nil)
;;         (res nil))
;;     (while (and (not (eq op 'stop)))
;;       (lex-case string i
;;         ("(?:"                          ;shy-group.
;;          (let ((tmp (lex--parse-re string i)))
;;            (setq i (car tmp))
;;            (unless (eq (aref string (1- i)) ?\)) (error "Unclosed shy-gr=
oup"))
;;            (push (cdr tmp) res)))
;;         ((or "*?" "+?" "??")
;;          (error "Greediness control unsupported `%s'" (match-string 0 st=
ring)))
;;         ((or "*" "+" "?")
;;          (if (null res) (error "Non-prefixed suffix operator")
;;            (setq res (cons (list (cdr (assq (aref string (1- i))
;;                                             '((?* . 0+)
;;                                               (?+ . 1+)
;;                                               (?? . opt))))
;;                                  (car res))
;;                            (cdr res)))))
;;         ("|" (push `(or (seq ,@(nreverse res))) stack)
;;          (setq res nil))
;;         ((seq "[" (opt "^") (opt "]")
;;                    (0+ (or (seq (char not ?\]) "-" (char not ?\]))
;;                            (seq "[:" (1+ (char (?a . ?z) (?A . ?Z))) ":]=
")
;;                            (char not ?\]))) "]")
;;          (push (lex--parse-charset
;;                 (substring string (1+ (match-beginning 0))
;;                            (1- (match-end 0))))
;;                res))
;;         ((seq "{" (0+ (char (?0 . ?9)))
;;                    (opt (seq "," (0+ (char (?0 . ?9))))) "}")
;;          ;; Here we would like to have sub-matches :-(
;;          (let* ((min (string-to-number (substring string
;;                                                   (1+ (match-beginning 0=
))
;;                                                   (match-end 0))))
;;                 (max (let ((comma (string-match "," string (match-beginn=
ing 0))))
;;                        (if (not (and comma (< comma (match-end 0))))
;;                            min
;;                          (if (=3D comma (- (match-end 0) 2))
;;                              nil
;;                            (string-to-number (substring string (1+ comma=
))))))))
;;            (if (null res) (error "Non-prefixed repeat operator")
;;              (setq res (cons `(repeat ,min ,max ,(car res)) (cdr res))))=
))
;;         ((or ")" eob) (setq op 'stop))
;;         ("\\'" (push 'eob res))
;;         ("\\`" (push 'bob res))
;;         ("^" (push 'bol res))
;;         ("$" (push 'eol res))
;;         ("." (push 'dot res))

;;         ((or "(" "\\<" "\\>" "\\_<" "\\_>" "\\c" "\\s" "\\C" "\\S" "\\w"=
 "\\W"
;;              "\\b" "\\B" "\\=3D" (seq "\\" (char (?1 . ?9))))
;;          (error "Unsupported construct `%s'" (match-string 0 string)))

;;         ((or (seq ?\\ anything) anything)
;;          (push (aref string (1- (match-end 0))) res))
;;         ("" (error "This should not be reachable"))))
;;     (let ((re `(seq ,@(nreverse res))))
;;       (while stack (setq re (nconc (pop stack) (list re))))
;;       (cons i re))))


(provide 'lex)
;;; lex.el ends here

--=-=-=--