From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: regexp-split for Guile
Date: Sat, 20 Oct 2012 09:27:42 -0400
Message-ID: <87txtpjmsx.fsf@tines.lan>
References: <20120917140133.GA6315@yarrow> <87lig830ox.fsf@zigzag.favinet>
	<20120917200603.GB6315@yarrow> <20120918195915.GE6315@yarrow>
	<CAN3veRevPFfY3D+VK+pP2vJpW+6O=cyf3jpu8qYM0s_QTMUN0A@mail.gmail.com>
	<87y5jbfj60.fsf@tines.lan> <20121020040126.GA25831@yarrow>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1350739700 25374 80.91.229.3 (20 Oct 2012 13:28:20 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 20 Oct 2012 13:28:20 +0000 (UTC)
To: guile-devel@gnu.org
In-Reply-To: <20121020040126.GA25831@yarrow> (Chris K. Jester-Young's message
	of "Sat, 20 Oct 2012 00:01:26 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)
Xref: news.gmane.org gmane.lisp.guile.devel:15008
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/15008>

Hi Chris,

"Chris K. Jester-Young" <cky944@gmail.com> writes:
> On Fri, Oct 12, 2012 at 05:57:11PM -0400, Mark H Weaver wrote:
>> Beyond matters of taste, I don't like this because it makes bugs less
>> likely to be caught.  Suppose 'limit' is a computed value, normally
>> expected to be positive.  Code that follows may implicitly assume that
>> the returned list has no more than 'limit' elements.  Now suppose that
>> due to a bug or exceptional circumstance, the computed 'limit' ends up
>> being less than 1.  Now 'regexp-split' switches to a qualitatively
>> different mode of behavior.
>
> I am sympathetic to this. It would definitely be good for the limit to
> mean only that, and not have two other meanings attached to it.
>
> So, in this spirit, below is my proposal for something that I hope would
> fit within the character of your feedback, while not making the common
> use cases needlessly verbose: we should favour the common use cases by
> making them easy to use.
>
> Before I begin, remember that in Perl's split, the default limit is 0,
> which is to strip off all the blank trailing fields. This is the common
> use case when using whitespace as a delimiter, where you simply want to
> ignore all the end-of-line whitespace. Making the calling code manually
> call drop-right-while is counter-productive for this common use case.
>
> Here is my proposal:
>
>     (regexp-split pat str #:key limit (trim? (not limit)))
>
> With no optional arguments specified (so, #:limit is #f and #:trim? is
> #t), it behaves like limit == 0 in Perl. i.e., return all fields, minus
> blank trailing ones.
>
> With a #:limit specified (which must be a positive integer), return
> that number of fields at most (subsequent ones are not split out, and
> are returned as part of the last field, with all delimiters intact).
>
> With #:trim? given a false value, return all fields, including blank
> trailing ones. This is false by default iff #:limit is specified.
>
> Rationale: The common use case is the most succinct version. The next
> most common use case has a relatively short formulation (#:trim?).
> Also, the default for #:trim? is based on common use cases depending on
> whether #:limit is specified. (Trim-with-limit is not supported in Perl,
> but it seemed to take more work to ban it here than just let it be.)

I generally like your new proposal, but after mulling it over some more,
I think that trimming should be off by default, regardless of how limit
is set.  The thing is, it seems to me that the only time #:trim? #t
makes sense is when you're splitting based on whitespace.  In most other
cases, trimming is not a sensible default.
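
To illustrate with a delimiter other than whitespace, here is a small
sketch.  The record is made up for this example (it is not from the
thread), and the right-trim is spelled out by hand with SRFI-1's
'drop-while' rather than with any proposed 'regexp-split' option:

```scheme
(use-modules (srfi srfi-1))   ; for drop-while

;; Hypothetical CSV-like record: the two trailing empty fields are data.
(define record "alice,42,,")
(define fields (string-split record #\,))

;; What an implicit right-trim would hand back instead.
(define trimmed
  (reverse (drop-while string-null? (reverse fields))))

(display (length fields))  (newline)   ; prints 4
(display (length trimmed)) (newline)   ; prints 2
```

With trimming on by default, the caller would silently lose the column
count of the record.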

As a programmer, I don't want basic tools like 'regexp-split' adding a
post-processing pass on the results without me explicitly asking for it.
Furthermore, if I add (or remove) the #:limit argument, I'd be
unpleasantly surprised to see any other changes in behavior.

While it's sometimes reasonable for _user_ interfaces to guess what the
user wanted in order to allow shorter commands, programming interfaces
should not do so, IMO.  This kind of cleverness is expected in Perl
circles, but not in the Scheme world.

Also, if we're going to add a built-in trimmer to 'regexp-split', I'd
like to see a "trim both ends" mode as well.  When splitting by
whitespace, I suspect #:trim 'both is wanted as often as #:trim 'right.

So how about something like this?

    (regexp-split pat str #:key limit trim)

where (member trim '(#f both right left))

For example:

    (regexp-split "[/\\]" "foo/bar\\baz/")
      => ("foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  ")
      => ("" "foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  " #:trim 'right)
      => ("" "foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5)
      => ("" "foo" "bar" "baz" "")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5 #:trim 'right)
      => ("" "foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 5 #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 3 #:trim 'both)
      => ("foo" "bar" "baz")
    (regexp-split " +" "  foo  bar  baz  " #:limit 2 #:trim 'both)
      => ("foo" "bar")
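
The #:trim part of this could be sketched roughly as follows.  This is
only a minimal illustration, not a proposed implementation: the names
'split-fields', 'trim-fields' and 'regexp-split-sketch' are made up
here, #:limit is left out to keep it small, and the pattern is assumed
never to match the empty string.

```scheme
(use-modules (ice-9 regex)    ; list-matches, match:start, match:end
             (srfi srfi-1))   ; drop-while

;; Split STR at every match of PAT, keeping empty fields.
;; Assumes PAT never matches the empty string.
(define (split-fields pat str)
  (let loop ((matches (list-matches pat str))
             (start 0)
             (fields '()))
    (if (null? matches)
        (reverse (cons (substring str start) fields))
        (let ((m (car matches)))
          (loop (cdr matches)
                (match:end m)
                (cons (substring str start (match:start m)) fields))))))

;; Drop empty strings from the requested end(s) of FIELDS.
(define (trim-fields fields trim)
  (define (drop-leading lst) (drop-while string-null? lst))
  (define (drop-trailing lst) (reverse (drop-leading (reverse lst))))
  (case trim
    ((#f)    fields)
    ((left)  (drop-leading fields))
    ((right) (drop-trailing fields))
    ((both)  (drop-trailing (drop-leading fields)))
    (else    (error "invalid #:trim value:" trim))))

;; Hypothetical stand-in for the proposed procedure, without #:limit.
(define* (regexp-split-sketch pat str #:key (trim #f))
  (trim-fields (split-fields pat str) trim))
```

Under those assumptions,
(regexp-split-sketch " +" "  foo  bar  baz  " #:trim 'both) evaluates to
("foo" "bar" "baz"), and omitting #:trim leaves the empty fields at both
ends in place.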

What do you think?

Thanks for working on this!

     Mark