unofficial mirror of emacs-devel@gnu.org 
From: Daniel Brooks <db48x@db48x.net>
To: Matt Armstrong <matt@rfc20.org>
Cc: Alan Mackenzie <acm@muc.de>, Naoya Yamashita <conao3@gmail.com>,
	emacs-devel@gnu.org
Subject: Re: [PATCH] Interpret #r"..." as a raw string
Date: Tue, 02 Mar 2021 01:56:43 -0800
Message-ID: <87blc1khes.fsf@db48x.net>
In-Reply-To: <m2zgzmkse2.fsf@matts-mbp-2016.lan> (Matt Armstrong's message of "Mon, 01 Mar 2021 21:59:33 -0800")

Matt Armstrong <matt@rfc20.org> writes:

> Alan Mackenzie <acm@muc.de> writes:
>
> C++ has probably the most flexible "gold standard" raw string literals.

With respect, I think that Raku “wins” this
fight. https://docs.raku.org/language/quoting is really worth reading;
it's a work of art. You can think of the quote operator as a function
that takes 13 named boolean arguments plus a choice of opening and
closing delimiters.

> As Alan I think rightly points out, this makes the language and all
> tools that process the language more complex.  This is a high cost, so
> the feature should deliver some real value.

Certainly true. As the ordinary Lisp string syntax already allows
multi-line strings, and interpolation is handled by the format function,
the primary benefit is to turn off escaping. We could also offer a
choice of opening and closing delimiters, though the proposed code
didn't implement that.

I think the benefit will be worth it. If we offered a little more choice
of delimiters, then we could gain more benefit when the string must also
contain double quotes. This need not have a large complexity cost.

> For those that don't know, C++'s raw string literals can be as simple as
> this for the string "raw-content":
>
>    R"(raw-content)"
>
> But if the content itself contains the character sequence )" then the
> programmer can specify any delimiter they want:
>
>    R"DELIMITER(raw-content)"more-raw-content)DELIMITER"
>
> But as you can see above, it isn't always clearer to write a raw string
> literal.

I would say that there are four ways to choose the delimiters.

The simplest way is to accept just one specific delimiter, often
with no way to include that character in the string. For example,
Scala's syntax is raw"foo", with no form of escaping that would
allow a double quote inside the string. C#'s syntax is @"foo", but you
can include a double-quote by repeating it, so @"foo""bar" is the string
“foo"bar”. Most languages are in this category, and this is how the
proposed code works.

Then there is the sed→perl→raku way, where the parser accepts a wide
variety of characters as the opening delimiter, and uses it to compute
which closing delimiter to look for. Raku allows any character not
allowed in identifiers, which is most characters not in the L or N
Unicode categories. Sed and Perl just allow punctuation characters.

There is the Rust way, where the parser looks for a double-quote
preceded by zero or more #'s. The closing delimiter is a double-quote
followed by the same number of #'s.
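To make the Rust rule concrete, here is a minimal sketch in Go (the same language as the tables in the PS). The function name scanHashRaw is made up for illustration, and it assumes the prefix (r in Rust, or #r in the proposed elisp syntax) has already been consumed by the caller:

```go
package main

import (
	"fmt"
	"strings"
)

// scanHashRaw reads a Rust-style raw string from src. src must begin with
// zero or more '#' characters followed by a double quote; the prefix that
// introduces the literal is assumed to have been consumed already.
// It returns the string contents, the number of bytes consumed, and
// whether a complete literal was found.
func scanHashRaw(src string) (content string, consumed int, ok bool) {
	// Count the leading hashes.
	hashes := 0
	for hashes < len(src) && src[hashes] == '#' {
		hashes++
	}
	if hashes >= len(src) || src[hashes] != '"' {
		return "", 0, false
	}
	start := hashes + 1
	// The closer is a double quote followed by the same number of hashes.
	closer := `"` + strings.Repeat("#", hashes)
	end := strings.Index(src[start:], closer)
	if end < 0 {
		return "", 0, false
	}
	return src[start : start+end], start + end + len(closer), true
}

func main() {
	content, n, ok := scanHashRaw(`##"foo "bar"# baz"## rest`)
	fmt.Println(content, n, ok)
}
```

The only state the scanner needs is the hash count, which is why this scheme adds so little complexity.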

And finally the C++11 way, where it looks for a double-quote followed by
zero to sixteen source characters (with a few minor exceptions) followed
by an opening parenthesis. The closing delimiter is a closing
parenthesis followed by the same zero to sixteen characters in the same
order as in the opening delimiter followed by a double-quote character.
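For comparison, a rough Go sketch of the C++11 rule. The name scanCppRaw is hypothetical, the R prefix is assumed to be consumed already, and the set of characters rejected in the delimiter only approximates the "few minor exceptions" mentioned above:

```go
package main

import (
	"fmt"
	"strings"
)

// scanCppRaw reads a C++11-style raw string from src, which must begin at
// the opening double quote (the R prefix is assumed to be consumed already).
// It returns the contents, the number of bytes consumed, and success.
func scanCppRaw(src string) (content string, consumed int, ok bool) {
	if !strings.HasPrefix(src, `"`) {
		return "", 0, false
	}
	open := strings.IndexByte(src, '(')
	if open < 0 {
		return "", 0, false
	}
	delim := src[1:open]
	// Assumption: reject delimiters longer than 16 bytes or containing a
	// space, ')', backslash, or one of a few control characters. This is
	// only an approximation of the real C++ rules.
	if len(delim) > 16 || strings.ContainsAny(delim, " )\\\t\v\f\n") {
		return "", 0, false
	}
	// The closer is ')' + the same delimiter + '"'.
	closer := ")" + delim + `"`
	body := src[open+1:]
	end := strings.Index(body, closer)
	if end < 0 {
		return "", 0, false
	}
	return body[:end], open + 1 + end + len(closer), true
}

func main() {
	content, n, ok := scanCppRaw(`"DELIMITER(raw-content)"more-raw-content)DELIMITER"`)
	fmt.Println(content, n, ok)
}
```

Compared with the Rust scheme, the scanner has to remember an arbitrary delimiter string rather than a count.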

Of these, I think Raku's way is the most fun because it allows the
widest choice of characters (q🕶awesome!🕶, for example). I'd be fine with
the current proposal, but if others think that it is important to allow
double-quotes inside the raw string, then I think Rust's syntax is the
next logical step. #r##"foo"## would fit in well with the rest of elisp;
it won't look as out of place as the others, and it's only a small
increment in complexity.

Or maybe we want to invent something completely new. As Emacs buffers
may include images which are treated as if they were characters of
unusual size, perhaps we could use gifs. A string bracketed by a GIF of
a dude putting on sunglasses would really show those other languages up.

As it's nicer when delimiters are paired, we could allow the closing GIF
to be horizontally mirrored so that both dudes are either looking
inwards at the string or outwards at the rest of the world.

db48x

PS: if anyone wants to go the Perl/Raku way, I happen to have built a
list of the paired punctuation characters recently:

var _PiPf = map[rune]rune{
	'«': '»', '‘': '’', '“': '”', '‹': '›', '⸂': '⸃', '⸄': '⸅', '⸉': '⸊',
	'⸌': '⸍', '⸜': '⸝', '⸠': '⸡',
}

var _PsPf = map[rune]rune{
	'‚': '’', '„': '”',
}

var _PsPe = map[rune]rune{
	'(': ')', '[': ']', '{': '}', '༺': '༻', '༼': '༽', '᚛': '᚜', '⁅': '⁆',
	'⁽': '⁾', '₍': '₎', '❨': '❩', '❪': '❫', '❬': '❭', '❮': '❯', '❰': '❱',
	'❲': '❳', '❴': '❵', '⟅': '⟆', '⟦': '⟧', '⟨': '⟩', '⟪': '⟫', '⦃': '⦄',
	'⦅': '⦆', '⦇': '⦈', '⦉': '⦊', '⦋': '⦌', '⦑': '⦒', '⦓': '⦔', '⦕': '⦖',
	'⦗': '⦘', '⧘': '⧙', '⧚': '⧛', '⧼': '⧽', '〈': '〉', '《': '》',
	'「': '」', '『': '』', '【': '】', '〔': '〕', '〖': '〗', '〘': '〙',
	'〚': '〛', '〝': '〞', '︗': '︘', '︵': '︶', '︷': '︸', '︹': '︺',
	'︻': '︼', '︽': '︾', '︿': '﹀', '﹁': '﹂', '﹃': '﹄', '﹇': '﹈',
	'﹙': '﹚', '﹛': '﹜', '﹝': '﹞', '（': '）', '［': '］', '｛': '｝',
	'｟': '｠', '｢': '｣', '⸨': '⸩',
}

var _SmSm = map[rune]rune{
	'<': '>',
}

This is obviously written in Go. My source code is at
https://github.com/db48x/goparsify/blob/master/literals.go#L298-L322.

Feel free to use these tables however you like; I consider them to be a
mere listing of facts and as such they're not copyrightable.

The basic algorithm that Perl uses is that the delimiter may be any
punctuation character, and if the opening delimiter is a key in any of
these tables then the closing delimiter is expected to be the
corresponding value; otherwise the closing delimiter is expected to be
identical to the opening delimiter.
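As a sketch of that lookup in Go: the pairs map below is a small hypothetical excerpt of the tables above, merged into one, and closingDelimiter is a made-up name. Treating the Unicode P* and S* categories as "punctuation" is an assumption on my part (it is needed for '<', which is in Sm); Perl's exact rule is not reproduced here.

```go
package main

import (
	"fmt"
	"unicode"
)

// pairs is a small illustrative excerpt of the paired-delimiter tables.
var pairs = map[rune]rune{
	'(': ')', '[': ']', '{': '}', '<': '>', '«': '»', '「': '」',
}

// closingDelimiter returns the closing delimiter expected for the given
// opening delimiter. ok is false when open is not acceptable as a
// delimiter at all (assumption: only punctuation and symbols qualify).
func closingDelimiter(open rune) (closer rune, ok bool) {
	if !unicode.IsPunct(open) && !unicode.IsSymbol(open) {
		return 0, false
	}
	// Paired openers close with their partner.
	if c, found := pairs[open]; found {
		return c, true
	}
	// Everything else closes with itself.
	return open, true
}

func main() {
	for _, open := range []rune{'(', '/', '«', 'a'} {
		closer, ok := closingDelimiter(open)
		fmt.Printf("%q -> %q (ok=%v)\n", open, closer, ok)
	}
}
```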

Raku is similar, except that it allows any Unicode character that isn't
designated as belonging to identifiers rather than just punctuation.

For speed you'll obviously prefer to do a single lookup into one hash
table, but for organizational purposes it's nicer to have them grouped
by Unicode category. This will help you update them when new characters
are added in the future.
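One way to get both, sketched in Go: keep the per-category tables in the source and flatten them once at startup. The tables below are short hypothetical excerpts and mergePairs is a made-up helper; the conflict check is my own addition, meant to catch the case where the same opener ends up in two tables with different closers.

```go
package main

import "fmt"

// Hypothetical excerpts of the per-category tables; the real ones are longer.
var (
	psPe = map[rune]rune{'(': ')', '[': ']', '{': '}'}
	piPf = map[rune]rune{'«': '»', '‹': '›'}
	smSm = map[rune]rune{'<': '>'}
)

// mergePairs flattens several opener-to-closer tables into a single map.
// It reports an error if two tables disagree about the closer for the
// same opener.
func mergePairs(tables ...map[rune]rune) (map[rune]rune, error) {
	merged := make(map[rune]rune)
	for _, table := range tables {
		for opener, closer := range table {
			if prev, seen := merged[opener]; seen && prev != closer {
				return nil, fmt.Errorf("conflicting closers for %q: %q and %q",
					opener, prev, closer)
			}
			merged[opener] = closer
		}
	}
	return merged, nil
}

func main() {
	all, err := mergePairs(psPe, piPf, smSm)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(all), string(all['«']))
}
```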



