unofficial mirror of emacs-devel@gnu.org 
From: "Stephen J. Turnbull" <stephen@xemacs.org>
Subject: Re: Rationale for split-string?
Date: Tue, 20 May 2003 10:55:20 +0900
Message-ID: <87el2uz8jb.fsf@tleepslib.sk.tsukuba.ac.jp>
In-Reply-To: <87vfx5vor0.fsf@tleepslib.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Wed, 23 Apr 2003 13:09:23 +0900")

>>>>> "sjt" == Stephen J Turnbull <stephen@xemacs.org> writes:

    sjt> OK.  That is satisfactory for XEmacs, and we'll implement
    sjt> that.

    sjt> Unless you say you prefer to do it yourself, I will also
    sjt> submit a patch against GNU Emacs CVS head, and audit the Lisp
    sjt> code in CVS head to make sure there are no surprises from
    sjt> callers with non-default SEPARATORS.

Enclosed are patches for lisp/subr.el and lispref/strings.texi to
implement the API for split-string discussed earlier.

Also enclosed is the result of an audit of uses of split-string in
Emacs CVS (as of about three weeks ago).  I didn't notice any cases
where the changed specification made existing code out-and-out
incorrect, so there are no further patches suggested.  However, I
think a lot of the uses with an explicit SEPARATORS are semantically
dubious without using the OMIT-NULLS flag (and most were semantically
dubious before the change to split-string, because it's at least
theoretically possible for a null string to arise in the interior of
the list).  Most other uses of split-string are dubious in that either
they depend heavily on undocumented implementation details of other
utilities (e.g., that the fields in /etc/mtab are separated by exactly
one space) or are not very robust to bogus input.  People who
understand the modules in question might want to take a closer look.
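
To make the null-string point concrete, here is a minimal sketch of the
behavior under the specification implemented by the patch below.  The
sample strings are invented for illustration and are not taken from any
of the audited files.

```elisp
;; Explicit SEPARATORS: null fields are retained, including interior ones.
(split-string "a,,b," ",")
;; => ("a" "" "b" "")

;; The same call with OMIT-NULLS set to t drops the null fields.
(split-string "a,,b," "," t)
;; => ("a" "b")

;; A made-up mtab-style line containing a run of two spaces: splitting
;; on exactly one space produces a null string in the interior.
(split-string "/dev/hda1  / ext3" " ")
;; => ("/dev/hda1" "" "/" "ext3")
```

A caller that indexes into the third result by position would pick up
"/" rather than the filesystem type, which is the kind of fragility
meant above.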

A few I couldn't tell at all without doing a much deeper analysis of
the code than I have time for right now:

./lisp/calendar/todo-mode.el:869:  needs checking
./lisp/eshell/em-pred.el:601:  needs checking
./lisp/mh-e/mh-utils.el:1606:  needs checking
./lisp/textmodes/reftex.el:934:  needs checking
./lisp/textmodes/reftex.el:2161:  needs checking

If you set default-directory to the root of the Emacs hierarchy, the
following functions are useful for jumping to a reference.  N.B.: a few
of the references have changed since I started the audit.

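(defun sjt/parse-grep-n2 ()
  "Parse `grep -n -#' output for filename and line number.
Return (FILE . LINE), or nil if no reference is found."
  (interactive)
  (beginning-of-line)
  ;; Pass NOERROR so a failed search returns nil instead of signaling;
  ;; otherwise the `when' can never see a nil result.
  (when (re-search-forward "^\\(\\S-+\\):\\([0-9]+\\):" nil t)
    (cons (match-string 1) (string-to-number (match-string 2)))))

(defun sjt/parse-grep-n-and-go ()
  "Jump to place specified by `grep -n' output."
  (interactive)
  (let* ((pair (sjt/parse-grep-n2))
         (file (car pair))
         (line (cdr pair)))
    ;; Guard against a nil PAIR so `find-file' is never called with nil.
    (unless pair
      (error "No grep -n reference found"))
    (find-file file)
    (goto-line line)))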
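
As a rough check of what that regexp accepts, the sketch below applies
the same pattern to strings instead of buffer text.  The helper name
`demo-grep-line-reference' and the sample output are hypothetical,
invented for illustration only.

```elisp
;; Made-up sample of `fgrep -2 -n' output: context lines use "-" after
;; the file name and line number, match lines use ":".
(defvar demo-sample-grep-output
  (concat "./lisp/foo.el-10-  (let ((s (buffer-string)))\n"
          "./lisp/foo.el:11:    (split-string s \":\")\n"
          "./lisp/foo.el-12-    nil)\n"))

(defun demo-grep-line-reference (line)
  "Return (FILE . LINE-NUMBER) if LINE is a `grep -n' match line, else nil."
  (save-match-data
    (when (string-match "^\\(\\S-+\\):\\([0-9]+\\):" line)
      (cons (match-string 1 line)
            (string-to-number (match-string 2 line))))))

(mapcar #'demo-grep-line-reference
        (split-string demo-sample-grep-output "\n" t))
;; => (nil ("./lisp/foo.el" . 11) nil)
```

Only the match line yields a reference; the context lines produced by
the `-2' flag are skipped.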


lisp/ChangeLog
2003-05-16  Stephen J. Turnbull  <stephen@xemacs.org>

	* subr.el (split-string): Implement specification that splitting
	on explicit separators retains null fields.  Add new argument
	OMIT-NULLS.  Special-case (split-string "a string").

lispref/ChangeLog
2003-05-16  Stephen J. Turnbull  <stephen@xemacs.org>

	* strings.texi (Creating Strings): Update split-string
	specification and examples.

Index: lisp/subr.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/subr.el,v
retrieving revision 1.350
diff -u -r1.350 subr.el
--- lisp/subr.el	24 Apr 2003 23:14:12 -0000	1.350
+++ lisp/subr.el	16 May 2003 10:03:58 -0000
@@ -1792,19 +1792,45 @@
 	(buffer-substring-no-properties (match-beginning num)
 					(match-end num)))))
 
-(defun split-string (string &optional separators)
-  "Splits STRING into substrings where there are matches for SEPARATORS.
-Each match for SEPARATORS is a splitting point.
-The substrings between the splitting points are made into a list
+(defconst split-string-default-separators "[ \f\t\n\r\v]+"
+  "The default value of separators for `split-string'.
+
+A regexp matching strings of whitespace.  May be locale-dependent
+\(as yet unimplemented).  Should not match non-breaking spaces.
+
+Warning: binding this to a different value and using it as default is
+likely to have undesired semantics.")
+
+;; The specification says that if both SEPARATORS and OMIT-NULLS are
+;; defaulted, OMIT-NULLS should be treated as t.  Simplifying the logical
+;; expression leads to the equivalent implementation that if SEPARATORS
+;; is defaulted, OMIT-NULLS is treated as t.
+(defun split-string (string &optional separators omit-nulls)
+  "Splits STRING into substrings bounded by matches for SEPARATORS.
+
+The beginning and end of STRING, and each match for SEPARATORS, are
+splitting points.  The substrings matching SEPARATORS are removed, and
+the substrings between the splitting points are collected as a list,
 which is returned.
-If SEPARATORS is absent, it defaults to \"[ \\f\\t\\n\\r\\v]+\".
 
-If there is match for SEPARATORS at the beginning of STRING, we do not
-include a null substring for that.  Likewise, if there is a match
-at the end of STRING, we don't include a null substring for that.
+If SEPARATORS is non-nil, it should be a regular expression matching text
+which separates, but is not part of, the substrings.  If nil it defaults to
+`split-string-default-separators', normally \"[ \\f\\t\\n\\r\\v]+\", and
+OMIT-NULLS is forced to t.
+
+If OMIT-NULLS is t, zero-length substrings are omitted from the list \(so
+that for the default value of SEPARATORS leading and trailing whitespace
+are effectively trimmed).  If nil, all zero-length substrings are retained,
+which correctly parses CSV format, for example.
+
+Note that the effect of `(split-string STRING)' is the same as
+`(split-string STRING split-string-default-separators t)'.  In the rare
+case that you wish to retain zero-length substrings when splitting on
+whitespace, use `(split-string STRING split-string-default-separators)'.
 
 Modifies the match data; use `save-match-data' if necessary."
-  (let ((rexp (or separators "[ \f\t\n\r\v]+"))
+  (let ((keep-nulls (not (if separators omit-nulls t)))
+	(rexp (or separators split-string-default-separators))
 	(start 0)
 	notfirst
 	(list nil))
@@ -1813,16 +1839,14 @@
 				       (= start (match-beginning 0))
 				       (< start (length string)))
 				  (1+ start) start))
-		(< (match-beginning 0) (length string)))
+		(< start (length string)))
       (setq notfirst t)
-      (or (eq (match-beginning 0) 0)
-	  (and (eq (match-beginning 0) (match-end 0))
-	       (eq (match-beginning 0) start))
+      (if (or keep-nulls (< start (match-beginning 0)))
 	  (setq list
 		(cons (substring string start (match-beginning 0))
 		      list)))
       (setq start (match-end 0)))
-    (or (eq start (length string))
+    (if (or keep-nulls (< start (length string)))
 	(setq list
 	      (cons (substring string start)
 		    list)))


Index: lispref/strings.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/lispref/strings.texi,v
retrieving revision 1.23
diff -u -r1.23 strings.texi
--- lispref/strings.texi	4 Feb 2003 14:47:54 -0000	1.23
+++ lispref/strings.texi	16 May 2003 10:03:59 -0000
@@ -259,30 +259,46 @@
 Lists}.
 @end defun
 
-@defun split-string string separators
+@defun split-string string separators omit-nulls
 This function splits @var{string} into substrings at matches for the regular
 expression @var{separators}.  Each match for @var{separators} defines a
 splitting point; the substrings between the splitting points are made
-into a list, which is the value returned by @code{split-string}.
+into a list, which is the value returned by @code{split-string}.  If
+@var{omit-nulls} is @code{t}, null strings will be removed from the
+result list.  Otherwise, null strings are left in the result.
 If @var{separators} is @code{nil} (or omitted),
-the default is @code{"[ \f\t\n\r\v]+"}.
+the default is the value of @code{split-string-default-separators}.
 
-For example,
+@defvar split-string-default-separators
+The default value of @var{separators} for @code{split-string}, initially
+@samp{"[ \f\t\n\r\v]+"}.
+
+As a special case, when @var{separators} is @code{nil} (or omitted),
+null strings are always omitted from the result.  Thus:
 
 @example
-(split-string "Soup is good food" "o")
-@result{} ("S" "up is g" "" "d f" "" "d")
-(split-string "Soup is good food" "o+")
-@result{} ("S" "up is g" "d f" "d")
+(split-string "  two words ")
+@result{} ("two" "words")
+@end example
+
+The result is not @samp{("" "two" "words" "")}, which would rarely be
+useful.  If you need such a result, use an explicit value for
+@var{separators}:
+
+@example
+(split-string "  two words " split-string-default-separators)
+@result{} ("" "two" "words" "")
 @end example
 
-When there is a match adjacent to the beginning or end of the string,
-this does not cause a null string to appear at the beginning or end
-of the list:
+More examples:
 
 @example
-(split-string "out to moo" "o+")
-@result{} ("ut t" " m")
+(split-string "Soup is good food" "o")
+@result{} ("S" "up is g" "" "d f" "" "d")
+(split-string "Soup is good food" "o" t)
+@result{} ("S" "up is g" "d f" "d")
+(split-string "Soup is good food" "o+")
+@result{} ("S" "up is g" "d f" "d")
 @end example
 
 Empty matches do count, when not adjacent to another match:

bash-2.05b$ find . -name '*.el' | xargs fgrep -2 -n split-string /dev/null
./lisp/apropos.el:267:  want OMIT-NULLS t
./lisp/calendar/todo-mode.el:869:  needs checking
./lisp/cvs-status.el:286:  new semantics preferred; no error checking
./lisp/diff-mode.el:1047:  OK, double default
./lisp/ediff-diff.el:1143:  OK
./lisp/emacs-lisp/authors.el:460:  double default, OK
./lisp/emacs-lisp/crm.el:419:  new semantics preferred; no error checking
./lisp/emacs-lisp/crm.el:605:  new semantics preferred; no error checking
./lisp/emacs-lisp/lisp-mnt.el:412:  want OMIT-NULLS t
./lisp/emacs-lisp/unsafep.el:111:  mentioned in comment, not used
./lisp/eshell/em-cmpl.el:403:  new semantics preferred; no error checking
./lisp/eshell/em-ls.el:257:  OK, double default
./lisp/eshell/em-pred.el:601:  needs checking
./lisp/eshell/esh-util.el:228:  want OMIT-NULLS t
./lisp/eshell/esh-util.el:449:  new semantics preferred; no error checking
./lisp/eshell/esh-var.el:568:  new semantics preferred; no error checking
./lisp/files.el:4254:  double default, OK
./lisp/filesets.el:1202:  new semantics preferred; no error checking
./lisp/gdb-ui.el:1001:  new semantics preferred; no error checking
./lisp/gnus/gnus-art.el:4645:  new semantics preferred; no error checking
./lisp/gnus/gnus-group.el:3798:  OK
./lisp/gnus/gnus.el:2679:  OK
./lisp/gnus/gnus.el:2681:  OK
./lisp/gnus/mailcap.el:367:  OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:502:  want OMIT-NULLS t
./lisp/gnus/mailcap.el:648:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:702:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:870:  OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:940:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/message.el:4701:  want OMIT-NULLS t
./lisp/gnus/mm-decode.el:55:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:57:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:264:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:363:  OK, double default
./lisp/gnus/mml.el:307:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mml.el:337:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/nnslashdot.el:364:  OK, double default
./lisp/gnus/nnslashdot.el:488:  OK, could use OMIT-NULLS t instead
./lisp/gnus/nnultimate.el:176:  OK, could use OMIT-NULLS t instead
./lisp/gnus/pop3.el:249:  want OMIT-NULLS t
./lisp/gnus/pop3.el:346:  want OMIT-NULLS t
./lisp/gnus/pop3.el:347:  want OMIT-NULLS t
./lisp/gnus/pop3.el:409:  want OMIT-NULLS t
./lisp/gnus/rfc2231.el:131:  new semantics preferred; no error checking (splitting encoded word into locale info)
./lisp/gud.el:1817:  OK
./lisp/gud.el:1847:  OK
./lisp/gud.el:2288:  OK, double default
./lisp/gud.el:2813:  OK
./lisp/hexl.el:635:  double default, OK
./lisp/hexl.el:652:  double default, OK
./lisp/ido.el:2502:  want OMIT-NULLS t
./lisp/ido.el:2868:  want OMIT-NULLS t
./lisp/info.el:387:  want OMIT-NULLS t
./lisp/info.el:390:  want OMIT-NULLS t
./lisp/mail/rfc2368.el:137:  OK
./lisp/mail/rfc2368.el:144:  new semantics preferred; no error checking
./lisp/mail/smtpmail.el:602:  want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:156:  want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:289:  OK
./lisp/mh-e/mh-alias.el:469:  OK
./lisp/mh-e/mh-comp.el:374:  OK, double default
./lisp/mh-e/mh-e.el:2164:  OK, double default
./lisp/mh-e/mh-index.el:475:  OK, double default
./lisp/mh-e/mh-seq.el:966:  OK, double default
./lisp/mh-e/mh-utils.el:1606:  needs checking
./lisp/net/eudc-export.el:126:  OK
./lisp/net/eudc.el:161:  Emacs 21 compatible
./lisp/net/eudc.el:419:  want OMIT-NULLS t
./lisp/net/eudc.el:442:  check this
./lisp/net/eudc.el:833:  want OMIT-NULLS t
./lisp/net/eudcb-ldap.el:90:  OK
./lisp/net/ldap.el:415:  new semantics preferred; no error checking
./lisp/net/ldap.el:420:  OK
./lisp/net/tramp.el:5658:  check this
./lisp/net/tramp.el:6257:  tramp-split-string is not quite emacs compatible
./lisp/pcmpl-cvs.el:175:  new semantics preferred; no error checking
./lisp/pcmpl-gnu.el:127:  OK, double default
./lisp/pcmpl-linux.el:46:  double default, OK
./lisp/pcmpl-linux.el:88:  want OMIT-NULLS t
./lisp/pcmpl-linux.el:101:  want OMIT-NULLS t
./lisp/pcmpl-rpm.el:39:  OK, double default
./lisp/pcmpl-rpm.el:46:  OK, double default
./lisp/pcmpl-unix.el:89:  new semantics preferred; no error checking
./lisp/pcvs-util.el:227:  want OMIT-NULLS t
./lisp/pcvs-util.el:228:  want OMIT-NULLS t
./lisp/progmodes/ada-prj.el:590:  want OMIT-NULLS t
./lisp/progmodes/ada-xref.el:207:  new semantics preferred; no error checking
./lisp/progmodes/fortran.el:267:  want OMIT-NULLS t
./lisp/progmodes/idlw-shell.el:1734:  could use new split-string with OMIT-NULLS t
./lisp/progmodes/idlwave.el:3702:  prior XEmacs-compatible, could use new split-string
./lisp/progmodes/inf-lisp.el:285:  double default, OK
./lisp/progmodes/vhdl-mode.el:13030:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13171:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13698:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13701:  new semantics preferred; no error checking
./lisp/textmodes/bibtex.el:2665:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:192:  Gone?
./lisp/textmodes/reftex-cite.el:373:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:383:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:445:  OK
./lisp/textmodes/reftex-cite.el:863:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:961:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1552:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1685:  want OMIT-NULLS t
./lisp/textmodes/reftex-index.el:1734:  OK, double default
./lisp/textmodes/reftex-index.el:1748:  OK, double default
./lisp/textmodes/reftex-index.el:1755:  OK, double default
./lisp/textmodes/reftex-index.el:1762:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1818:  new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:343:  new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:482:  OK, mapconcat used
./lisp/textmodes/reftex-parse.el:990:  new semantics preferred; no error checking
./lisp/textmodes/reftex.el:934:  needs checking
./lisp/textmodes/reftex.el:1455:  OK, double default
./lisp/textmodes/reftex.el:1488:  OK, double default
./lisp/textmodes/reftex.el:1556:  OK, could use OMIT-NULLS t instead
./lisp/textmodes/reftex.el:2161:  needs checking (uses explicit re or explicit ws)
./lisp/vc-cvs.el:789:  new semantics preferred; requires rewrite to use
./lisp/xml.el:432:  OK
./lisp/xml.el:436:  OK



-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
