Re: [PATCH] In string-split, add support for character sets and predicates.

From: Mark H Weaver <mhw@netris.org>
To: Daniel Hartwig <mandyke@gmail.com>
Cc: guile-devel@gnu.org
Subject: Re: [PATCH] In string-split, add support for character sets and predicates.
Date: Mon, 08 Oct 2012 11:40:44 -0400
Message-ID: <87sj9pm0oz.fsf@tines.lan>
In-Reply-To: <CAN3veRcLxasvVsq2RJFFQEgqzEb-zQiHvXWvafcA-7z=0Etzog@mail.gmail.com> (Daniel Hartwig's message of "Mon, 8 Oct 2012 19:23:09 +0800")

Hi Daniel,

Thanks for the patch!  I have a few comments.

Daniel Hartwig <mandyke@gmail.com> writes:
> From 0aeed16baa70eca143fec05e864f98d95d7267e8 Mon Sep 17 00:00:00 2001
> From: Daniel Hartwig <mandyke@gmail.com>
> Date: Mon, 8 Oct 2012 18:35:00 +0800
> Subject: [PATCH] In string-split, add support for character sets and
>  predicates.
>
> * libguile/srfi-13.c (string-split): Add support for splitting on
>   character sets and predicates, like string-index and others.  Keep the
>   original (fast) path when splitting by character and refactor using
>   string-index-right for other types; the later involves handling SCM
>   values so there is less chance to optimize anyway.

As Ludovic frequently reminds us (and I agree), rationales should be in
the source code comments, not in the commit message.

> * test-suite/tests/strings.test (string-split): Add tests covering
>   the new argument types.
> ---
>  libguile/srfi-13.c            |   53 ++++++++++++++++++++++++++++++----
>  libguile/srfi-13.h            |    2 +-
>  test-suite/tests/strings.test |   62 ++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 108 insertions(+), 9 deletions(-)
>
> diff --git a/libguile/srfi-13.c b/libguile/srfi-13.c
> index 2834553..1874754 100644
> --- a/libguile/srfi-13.c
> +++ b/libguile/srfi-13.c
> @@ -2993,11 +2993,22 @@ SCM_DEFINE (scm_string_tokenize, "string-tokenize", 1, 3, 0,
>  #undef FUNC_NAME
>  
>  SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
> -	    (SCM str, SCM chr),
> +	    (SCM str, SCM char_pred),
>  	    "Split the string @var{str} into a list of the substrings delimited\n"
> -	    "by appearances of the character @var{chr}.  Note that an empty substring\n"
> -	    "between separator characters will result in an empty string in the\n"
> -	    "result list.\n"
> +	    "by appearances characters which\n"

"by appearances of characters that"
                ^^            ^^^^

(the difference between 'which' and 'that' is described in
<http://www.kentlaw.edu/academics/lrw/grinker/LwtaThat_Versus_Which.htm>)

> +            "\n"
> +            "@itemize @bullet\n"
> +            "@item\n"
> +            "equals @var{char_pred}, if it is a character,\n"

Should be "equal", not "equals", because the subject "characters" is
plural.

> +            "\n"
> +            "@item\n"
> +            "satisfies the predicate @var{char_pred}, if it is a procedure,\n"

Should be "satisfy" for the same reason.

> +            "\n"
> +            "@item\n"
> +            "is in the set @var{char_pred}, if it is a character set.\n"

"are in the set".

> +            "@end itemize\n\n"
> +            "Note that an empty substring between separator characters\n"
> +            "will result in an empty string in the result list.\n"
>  	    "\n"
>  	    "@lisp\n"
>  	    "(string-split \"root:x:0:0:root:/root:/bin/bash\" #\\:)\n"
> @@ -3014,13 +3025,39 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>  	    "@end lisp")
>  #define FUNC_NAME s_scm_string_split
>  {
> +  SCM sidx, slast_idx;
>    long idx, last_idx;
>    int narrow;
>    SCM res = SCM_EOL;
>  
>    SCM_VALIDATE_STRING (1, str);
> -  SCM_VALIDATE_CHAR (2, chr);
>    
> +  if (SCM_CHARP (char_pred))
> +    {
> +      goto split_char;

I'd prefer to avoid the use of 'goto' here and use nested 'if's instead.
(I admit that 'goto's occasionally make code simpler and more readable,
but not in this case, IMO.)

Can you please put the code between 'split_char' and 'done' within this
'if', and all the code from here to 'split_char' within the 'else'?
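
For illustration only, here is a standalone sketch of that shape in plain C
rather than libguile.  The names `struct delim', `is_blank' and
`count_fields' are hypothetical and exist only for this example; the point
is that the fast character path and the general predicate path can sit in
the two arms of one 'if' without any labels.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SCM delimiter argument: either a single
   character or a predicate.  Not part of libguile.  */
enum delim_kind { DELIM_CHAR, DELIM_PRED };

struct delim
{
  enum delim_kind kind;
  char ch;              /* used when kind == DELIM_CHAR */
  int (*pred) (int);    /* used when kind == DELIM_PRED */
};

/* Example predicate (an assumption for this sketch): space or tab.  */
static int
is_blank (int c)
{
  return c == ' ' || c == '\t';
}

/* Count the fields that splitting STR (of length LEN) on D would yield.
   Both paths live inside one if/else; there are no labels or gotos.  */
static size_t
count_fields (const char *str, size_t len, const struct delim *d)
{
  size_t fields = 1;
  size_t i;

  if (d->kind == DELIM_CHAR)
    {
      /* Fast path: direct character comparison.  */
      for (i = 0; i < len; i++)
        if (str[i] == d->ch)
          fields++;
    }
  else
    {
      /* General path: call the predicate on each character.  */
      for (i = 0; i < len; i++)
        if (d->pred ((unsigned char) str[i]))
          fields++;
    }

  return fields;
}
```
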

> +    }
> +  else if (!SCM_CHARSETP (char_pred))
> +    {
> +      SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)),
> +                  char_pred, SCM_ARG2, FUNC_NAME);
> +    }
> +
> +  sidx = scm_string_length (str);
> +  slast_idx = SCM_BOOL_F;
> +  while (scm_is_true (sidx))
> +    {
> +      slast_idx = sidx;
> +      sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
> +      if (scm_is_true (sidx))
> +        {
> +          SCM substr = scm_substring (str, scm_oneplus (sidx), slast_idx);
> +          res = scm_cons (substr, res);
> +        }
> +    }

It is needlessly inefficient to test 'scm_is_true (sidx)' twice per
iteration of this loop.  The very first test is also wasted, since sidx
has just been set to the string length.  Here's one way to avoid the
redundant tests:

  slast_idx = scm_string_length (str);
  for (;;)
    {
      sidx = scm_string_index_right (str, char_pred, SCM_INUM0, slast_idx);
      if (scm_is_false (sidx))
        break;
      res = scm_cons (scm_substring (str, scm_oneplus (sidx), slast_idx), res);
      slast_idx = sidx;
    }
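
The same right-to-left loop can be tried outside Guile.  The following is a
minimal sketch in plain C, not the libguile code: `struct field',
`push_field', `index_right', `is_blank' and `split_pred' are hypothetical
stand-ins for a Scheme list, scm_cons plus scm_substring,
scm_string_index_right, a character predicate, and the new code path.
Scanning from the right and prepending each field leaves the result in
left-to-right order without a final reverse.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical singly linked list of fields; stands in for a Scheme list.  */
struct field
{
  char *text;
  struct field *next;
};

/* Example predicate (an assumption for this sketch): space or tab.  */
static int
is_blank (int c)
{
  return c == ' ' || c == '\t';
}

/* Prepend a copy of STR[START, END) to LIST, much like scm_cons.  */
static struct field *
push_field (struct field *list, const char *str, size_t start, size_t end)
{
  struct field *f = malloc (sizeof *f);
  if (f == NULL)
    abort ();
  f->text = malloc (end - start + 1);
  if (f->text == NULL)
    abort ();
  memcpy (f->text, str + start, end - start);
  f->text[end - start] = '\0';
  f->next = list;
  return f;
}

/* Return the index of the last character in STR[0, END) that satisfies
   PRED, or -1 if there is none.  */
static long
index_right (const char *str, long end, int (*pred) (int))
{
  while (end > 0)
    {
      end--;
      if (pred ((unsigned char) str[end]))
        return end;
    }
  return -1;
}

/* Split STR on characters satisfying PRED, with one test per iteration.  */
static struct field *
split_pred (const char *str, int (*pred) (int))
{
  struct field *res = NULL;
  long last_idx = (long) strlen (str);

  for (;;)
    {
      long idx = index_right (str, last_idx, pred);
      if (idx < 0)
        break;
      res = push_field (res, str, (size_t) idx + 1, (size_t) last_idx);
      last_idx = idx;
    }

  /* Whatever lies left of the leftmost delimiter is the first field.  */
  return push_field (res, str, 0, (size_t) last_idx);
}
```

An empty input yields a single empty field, and adjacent delimiters yield
empty fields between them, matching the behaviour described in the
docstring.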

> +  res = scm_cons (scm_substring (str, SCM_INUM0, slast_idx), res);
> +  goto done;
> +
> + split_char:
>    /* This is explicit wide/narrow logic (instead of using
>       scm_i_string_ref) is a speed optimization.  */
>    idx = scm_i_string_length (str);
> @@ -3031,7 +3068,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>        while (idx >= 0)
>          {
>            last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(chr))
> +          while (idx > 0 && buf[idx-1] != (char) SCM_CHAR(char_pred))
>              idx--;
>            if (idx >= 0)
>              {
> @@ -3046,7 +3083,7 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>        while (idx >= 0)
>          {
>            last_idx = idx;
> -          while (idx > 0 && buf[idx-1] != SCM_CHAR(chr))
> +          while (idx > 0 && buf[idx-1] != SCM_CHAR(char_pred))
>              idx--;
>            if (idx >= 0)
>              {
> @@ -3055,6 +3092,8 @@ SCM_DEFINE (scm_string_split, "string-split", 2, 0, 0,
>              }
>          }
>      }
> +
> + done:
>    scm_remember_upto_here_1 (str);
>    return res;
>  }

[...]

Everything else looks good.  Thanks for including a full set of tests!

     Mark