From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: Rationale for split-string?
Date: Tue, 20 May 2003 10:55:20 +0900
Organization: The XEmacs Project
Message-ID: <87el2uz8jb.fsf@tleepslib.sk.tsukuba.ac.jp>
References: <87brz57at2.fsf@tleepslib.sk.tsukuba.ac.jp> <200304171744.h3HHiJCx009215@rum.cs.yale.edu> <87adem27ey.fsf@tleepslib.sk.tsukuba.ac.jp> <87ist8yv4n.fsf@tleepslib.sk.tsukuba.ac.jp> <87vfx5vor0.fsf@tleepslib.sk.tsukuba.ac.jp>
Original-To: rms@gnu.org, emacs-devel@gnu.org
In-Reply-To: <87vfx5vor0.fsf@tleepslib.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Wed, 23 Apr 2003 13:09:23 +0900")

>>>>> "sjt" == Stephen J Turnbull writes:

    sjt> OK. That is satisfactory for XEmacs, and we'll implement
    sjt> that.

    sjt> Unless you say you prefer to do it yourself, I will also
    sjt> submit a patch against GNU Emacs CVS head, and audit the Lisp
    sjt> code in CVS head to make sure there are no surprises from
    sjt> callers with non-default SEPARATORS.

Enclosed are patches for lisp/subr.el and lispref/strings.texi to
implement the API for split-string discussed earlier.

Also enclosed is the result of an audit of uses of split-string in
Emacs CVS (as of about three weeks ago). I didn't notice any cases
where the changed specification made existing code out-and-out
incorrect, so there are no further patches suggested. However, I
think a lot of the uses with an explicit SEPARATORS are semantically
dubious without using the OMIT-NULLS flag (and most were semantically
dubious before the change to split-string, because it's at least
theoretically possible for a null string to arise in the interior of
the list).
Most other uses of split-string are dubious in that either they depend
heavily on undocumented implementation details of other utilities (eg,
that the fields in /etc/mtab are separated by exactly one space) or
are not very robust to bogus input. People who understand the modules
in question might want to take a closer look.

A few I couldn't tell at all without doing a much deeper analysis of
the code than I have time for right now:

./lisp/calendar/todo-mode.el:869: needs checking
./lisp/eshell/em-pred.el:601: needs checking
./lisp/mh-e/mh-utils.el:1606: needs checking
./lisp/textmodes/reftex.el:934: needs checking
./lisp/textmodes/reftex.el:2161: needs checking

If you set default-directory to the root of the Emacs hierarchy, the
following function is useful to jump to the reference. nb. a few of
the references have changed since I started the audit.

(defun sjt/parse-grep-n2 ()
  "Parse `grep -n -#' output for filename and line number."
  (interactive)
  (beginning-of-line)
  (when (re-search-forward "^\\(\\S-+\\):\\([0-9]+\\):")
    (cons (match-string 1) (string-to-number (match-string 2)))))

(defun sjt/parse-grep-n-and-go ()
  "Jump to place specified by `grep -n' output."
  (interactive)
  (let* ((pair (sjt/parse-grep-n2))
         (file (car pair))
         (line (cdr pair)))
    (find-file file)
    (goto-line line)))

lisp/ChangeLog

2003-05-16  Stephen J. Turnbull

        * subr.el (split-string): Implement specification that splitting
        on explicit separators retains null fields. Add new argument
        OMIT-NULLS. Special-case (split-string "a string").

lispref/ChangeLog

2003-05-16  Stephen J. Turnbull

        * strings.texi (Creating Strings): Update split-string
        specification and examples.
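As a quick illustration of the semantics described in the ChangeLog entries above, here is a minimal sketch. It assumes a `split-string' that accepts the OMIT-NULLS argument as specified in this message. The comma-separated input string is a made-up example and does not come from the audit.

```elisp
;; Sketch only: assumes `split-string' takes (STRING &optional SEPARATORS
;; OMIT-NULLS) as specified in this message.  The "a,,b," input is a
;; hypothetical CSV-like line chosen for illustration.

;; Both arguments defaulted: null strings are always omitted, so leading
;; and trailing whitespace is effectively trimmed.
(split-string " two words ")
;; => ("two" "words")

;; Explicit SEPARATORS: null fields are retained, including the one
;; after the trailing comma.
(split-string "a,,b," ",")
;; => ("a" "" "b" "")

;; Explicit SEPARATORS with OMIT-NULLS set to t: null fields are dropped.
(split-string "a,,b," "," t)
;; => ("a" "b")
```

Callers marked "want OMIT-NULLS t" in the audit are presumably in the third situation: they pass an explicit separator but do not expect empty fields.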
Index: lisp/subr.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/subr.el,v
retrieving revision 1.350
diff -u -r1.350 subr.el
--- lisp/subr.el 24 Apr 2003 23:14:12 -0000 1.350
+++ lisp/subr.el 16 May 2003 10:03:58 -0000
@@ -1792,19 +1792,45 @@
         (buffer-substring-no-properties (match-beginning num)
                                         (match-end num)))))
 
-(defun split-string (string &optional separators)
-  "Splits STRING into substrings where there are matches for SEPARATORS.
-Each match for SEPARATORS is a splitting point.
-The substrings between the splitting points are made into a list
+(defconst split-string-default-separators "[ \f\t\n\r\v]+"
+  "The default value of separators for `split-string'.
+
+A regexp matching strings of whitespace. May be locale-dependent
+\(as yet unimplemented). Should not match non-breaking spaces.
+
+Warning: binding this to a different value and using it as default is
+likely to have undesired semantics.")
+
+;; The specification says that if both SEPARATORS and OMIT-NULLS are
+;; defaulted, OMIT-NULLS should be treated as t. Simplifying the logical
+;; expression leads to the equivalent implementation that if SEPARATORS
+;; is defaulted, OMIT-NULLS is treated as t.
+(defun split-string (string &optional separators omit-nulls)
+  "Splits STRING into substrings bounded by matches for SEPARATORS.
+
+The beginning and end of STRING, and each match for SEPARATORS, are
+splitting points. The substrings matching SEPARATORS are removed, and
+the substrings between the splitting points are collected as a list,
 which is returned.
-If SEPARATORS is absent, it defaults to \"[ \\f\\t\\n\\r\\v]+\".
 
-If there is match for SEPARATORS at the beginning of STRING, we do not
-include a null substring for that. Likewise, if there is a match
-at the end of STRING, we don't include a null substring for that.
+If SEPARATORS is non-nil, it should be a regular expression matching text
+which separates, but is not part of, the substrings. If nil it defaults to
+`split-string-default-separators', normally \"[ \\f\\t\\n\\r\\v]+\", and
+OMIT-NULLS is forced to t.
+
+If OMIT-NULLS is t, zero-length substrings are omitted from the list \(so
+that for the default value of SEPARATORS leading and trailing whitespace
+are effectively trimmed). If nil, all zero-length substrings are retained,
+which correctly parses CSV format, for example.
+
+Note that the effect of `(split-string STRING)' is the same as
+`(split-string STRING split-string-default-separators t)'. In the rare
+case that you wish to retain zero-length substrings when splitting on
+whitespace, use `(split-string STRING split-string-default-separators)'.
 
 Modifies the match data; use `save-match-data' if necessary."
-  (let ((rexp (or separators "[ \f\t\n\r\v]+"))
+  (let ((keep-nulls (not (if separators omit-nulls t)))
+        (rexp (or separators split-string-default-separators))
         (start 0)
         notfirst
         (list nil))
@@ -1813,16 +1839,14 @@
                                        (= start (match-beginning 0))
                                        (< start (length string)))
                                   (1+ start) start))
-                (< (match-beginning 0) (length string)))
+                (< start (length string)))
       (setq notfirst t)
-      (or (eq (match-beginning 0) 0)
-          (and (eq (match-beginning 0) (match-end 0))
-               (eq (match-beginning 0) start))
+      (if (or keep-nulls (< start (match-beginning 0)))
           (setq list
                 (cons (substring string start (match-beginning 0))
                       list)))
       (setq start (match-end 0)))
-    (or (eq start (length string))
+    (if (or keep-nulls (< start (length string)))
         (setq list
               (cons (substring string start)
                     list)))
Index: lispref/strings.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/lispref/strings.texi,v
retrieving revision 1.23
diff -u -r1.23 strings.texi
--- lispref/strings.texi 4 Feb 2003 14:47:54 -0000 1.23
+++ lispref/strings.texi 16 May 2003 10:03:59 -0000
@@ -259,30 +259,46 @@
 Lists}.
 @end defun
 
-@defun split-string string separators
+@defun split-string string separators omit-nulls
 This function splits @var{string} into substrings at matches for the regular
 expression @var{separators}. Each match for @var{separators} defines a
 splitting point; the substrings between the splitting points are made
-into a list, which is the value returned by @code{split-string}.
+into a list, which is the value returned by @code{split-string}. If
+@var{omit-nulls} is @code{t}, null strings will be removed from the
+result list. Otherwise, null strings are left in the result.
 If @var{separators} is @code{nil} (or omitted),
-the default is @code{"[ \f\t\n\r\v]+"}.
+the default is the value of @code{split-string-default-separators}.
 
-For example,
+@defvar split-string-default-separators
+The default value of @var{separators} for @code{split-string}, initially
+@samp{"[ \f\t\n\r\v]+"}.
+
+As a special case, when @var{separators} is @code{nil} (or omitted),
+null strings are always omitted from the result. Thus:
 
 @example
-(split-string "Soup is good food" "o")
-@result{} ("S" "up is g" "" "d f" "" "d")
-(split-string "Soup is good food" "o+")
-@result{} ("S" "up is g" "d f" "d")
+(split-string " two words ")
+@result{} ("two" "words")
+@end example
+
+The result is not @samp{("" "two" "words" "")}, which would rarely be
+useful. If you need such a result, use an explicit value for
+@var{separators}:
+
+@example
+(split-string " two words " split-string-default-separators)
+@result{} ("" "two" "words" "")
 @end example
 
-When there is a match adjacent to the beginning or end of the string,
-this does not cause a null string to appear at the beginning or end
-of the list:
+More examples:
 
 @example
-(split-string "out to moo" "o+")
-@result{} ("ut t" " m")
+(split-string "Soup is good food" "o")
+@result{} ("S" "up is g" "" "d f" "" "d")
+(split-string "Soup is good food" "o" t)
+@result{} ("S" "up is g" "d f" "d")
+(split-string "Soup is good food" "o+")
+@result{} ("S" "up is g" "d f" "d")
 @end example
 
 Empty matches do count, when not adjacent to another match:

bash-2.05b$ find . -name '*.el' | xargs fgrep -2 -n split-string /dev/null

./lisp/apropos.el:267: want OMIT-NULLS t
./lisp/calendar/todo-mode.el:869: needs checking
./lisp/cvs-status.el:286: new semantics preferred; no error checking
./lisp/diff-mode.el:1047: OK, double default
./lisp/ediff-diff.el:1143: OK
./lisp/emacs-lisp/authors.el:460: double default, OK
./lisp/emacs-lisp/crm.el:419: new semantics preferred; no error checking
./lisp/emacs-lisp/crm.el:605: new semantics preferred; no error checking
./lisp/emacs-lisp/lisp-mnt.el:412: want OMIT-NULLS t
./lisp/emacs-lisp/unsafep.el:111: mentioned in comment, not used
./lisp/eshell/em-cmpl.el:403: new semantics preferred; no error checking
./lisp/eshell/em-ls.el:257: OK, double default
./lisp/eshell/em-pred.el:601: needs checking
./lisp/eshell/esh-util.el:228: want OMIT-NULLS t
./lisp/eshell/esh-util.el:449: new semantics preferred; no error checking
./lisp/eshell/esh-var.el:568: new semantics preferred; no error checking
./lisp/files.el:4254: double default, OK
./lisp/filesets.el:1202: new semantics preferred; no error checking
./lisp/gdb-ui.el:1001: new semantics preferred; no error checking
./lisp/gnus/gnus-art.el:4645: new semantics preferred; no error checking
./lisp/gnus/gnus-group.el:3798: OK
./lisp/gnus/gnus.el:2679: OK
./lisp/gnus/gnus.el:2681: OK
./lisp/gnus/mailcap.el:367: OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:502: want OMIT-NULLS t
./lisp/gnus/mailcap.el:648: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:702: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:870: OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:940: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/message.el:4701: want OMIT-NULLS t
./lisp/gnus/mm-decode.el:55: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:57: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:264: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:363: OK, double default
./lisp/gnus/mml.el:307: new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mml.el:337: ditto
./lisp/gnus/nnslashdot.el:364: OK, double default
./lisp/gnus/nnslashdot.el:488: OK, could use OMIT-NULLS t instead
./lisp/gnus/nnultimate.el:176: OK, could use OMIT-NULLS t instead
./lisp/gnus/pop3.el:249: want OMIT-NULLS t
./lisp/gnus/pop3.el:346: want OMIT-NULLS t
./lisp/gnus/pop3.el:347: want OMIT-NULLS t
./lisp/gnus/pop3.el:409: want OMIT-NULLS t
./lisp/gnus/rfc2231.el:131: new semantics preferred; no error checking (splitting encoded word into locale info)
./lisp/gud.el:1817: OK
./lisp/gud.el:1847: OK
./lisp/gud.el:2288: OK, double default
./lisp/gud.el:2813: OK
./lisp/hexl.el:635: double default, OK
./lisp/hexl.el:652: double default, OK
./lisp/ido.el:2502: want OMIT-NULLS t
./lisp/ido.el:2868: want OMIT-NULLS t
./lisp/info.el:387: want OMIT-NULLS t
./lisp/info.el:390: want OMIT-NULLS t
./lisp/mail/rfc2368.el:137: OK
./lisp/mail/rfc2368.el:144: new semantics preferred; no error checking
./lisp/mail/smtpmail.el:602: want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:156: want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:289: OK
./lisp/mh-e/mh-alias.el:469: OK
./lisp/mh-e/mh-comp.el:374: OK, double default
./lisp/mh-e/mh-e.el:2164: OK, double default
./lisp/mh-e/mh-index.el:475: OK, double default
./lisp/mh-e/mh-seq.el:966: OK, double default
./lisp/mh-e/mh-utils.el:1606: needs checking
./lisp/net/eudc-export.el:126: OK
./lisp/net/eudc.el:161: Emacs 21 compatible
./lisp/net/eudc.el:419: want OMIT-NULLS t
./lisp/net/eudc.el:442: check this
./lisp/net/eudc.el:833: want OMIT-NULLS t
./lisp/net/eudcb-ldap.el:90: OK
./lisp/net/ldap.el:415: new semantics preferred; no error checking
./lisp/net/ldap.el:420: OK
./lisp/net/tramp.el:5658: check this
./lisp/net/tramp.el:6257: tramp-split-string is not quite emacs compatible
./lisp/pcmpl-cvs.el:175: new semantics preferred; no error checking
./lisp/pcmpl-gnu.el:127: OK, double default
./lisp/pcmpl-linux.el:46: double default, OK
./lisp/pcmpl-linux.el:88: want OMIT-NULLS t
./lisp/pcmpl-linux.el:101: want OMIT-NULLS t
./lisp/pcmpl-rpm.el:39: OK, double default
./lisp/pcmpl-rpm.el:46: OK, double default
./lisp/pcmpl-unix.el:89: new semantics preferred; no error checking
./lisp/pcvs-util.el:227: want OMIT-NULLS t
./lisp/pcvs-util.el:228: want OMIT-NULLS t
./lisp/progmodes/ada-prj.el:590: want OMIT-NULLS t
./lisp/progmodes/ada-xref.el:207: new semantics preferred; no error checking
./lisp/progmodes/fortran.el:267: want OMIT-NULLS t
./lisp/progmodes/idlw-shell.el:1734: could use new split-string with OMIT-NULLS t
./lisp/progmodes/idlwave.el:3702: prior XEmacs-compatible, could use new split-string
./lisp/progmodes/inf-lisp.el:285: double default, OK
./lisp/progmodes/vhdl-mode.el:13030: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13171: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13698: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13701: new semantics preferred; no error checking
./lisp/textmodes/bibtex.el:2665: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:192: Gone?
./lisp/textmodes/reftex-cite.el:373: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:383: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:445: OK
./lisp/textmodes/reftex-cite.el:863: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:961: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1552: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1685: want OMIT-NULLS t
./lisp/textmodes/reftex-index.el:1734: OK, double default
./lisp/textmodes/reftex-index.el:1748: OK, double default
./lisp/textmodes/reftex-index.el:1755: OK, double default
./lisp/textmodes/reftex-index.el:1762: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1818: new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:343: new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:482: OK, mapconcat used
./lisp/textmodes/reftex-parse.el:990: new semantics preferred; no error checking
./lisp/textmodes/reftex.el:934: needs checking
./lisp/textmodes/reftex.el:1455: OK, double default
./lisp/textmodes/reftex.el:1488: OK, double default
./lisp/textmodes/reftex.el:1556: OK, could use OMIT-NULLS t instead
./lisp/textmodes/reftex.el:2161: needs checking (uses explicit re or explicit ws)
./lisp/vc-cvs.el:789: new semantics preferred; requires rewrite to use
./lisp/xml.el:432: OK
./lisp/xml.el:436: OK

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.