unofficial mirror of guile-user@gnu.org 
From: David Kastrup <dak@gnu.org>
To: Marko Rauhamaa <marko@pacujo.net>
Cc: guile-user@gnu.org
Subject: Re: guile can't find a chinese named file
Date: Wed, 15 Feb 2017 12:18:21 +0100
Message-ID: <8760kb66te.fsf@fencepost.gnu.org>
In-Reply-To: <87fujfwwvj.fsf@elektro.pacujo.net> (Marko Rauhamaa's message of "Wed, 15 Feb 2017 12:50:56 +0200")

Marko Rauhamaa <marko@pacujo.net> writes:

> David Kastrup <dak@gnu.org>:
>
>> If you tell Emacs that some external entity is in UTF-8, it will
>> represent all valid UTF-8 sequences as properly decoded characters,
>> and it has special codes for all bytes not part of valid UTF-8.
>>
>> As a result, it works with valid UTF-8 perfectly as expected but will
>> reproduce arbitrary byte streams thrown at it perfectly when decoding
>> as UTF-8 and then reencoding into UTF-8 again.
>>
>> Guile is lacking this byte stream reproducibility when
>> decoding/reencoding. That makes it a whole lot less robust for dealing
>> with externally provided material.
>
> Python3 supports this by abusing the surrogate code points. I don't
> recommend following Python's lead.

Emacs uses overlong byte sequences for 0x00 to 0x7f to represent bytes
with values 0x80 to 0xff that are not part of valid UTF-8 sequences.
Those cannot occur in valid UTF-8, but they behave nicely internally
with regard to detecting character boundaries in string/character
handling.  Basically, those are the patterns 0xc0 0x80 ... 0xc0 0xbf and
0xc1 0x80 ... 0xc1 0xbf, representing 0x80 ... 0xbf and 0xc0 ... 0xff
respectively when the latter are not part of proper (and consequently
uniquely encoded) UTF-8.

Which means that random byte sequences get blown up by less than 50%
internally: about half of all random bytes fall into 0x80...0xff and
each isolated one doubles in size (less than 50% because some bytes
0x80...0xff end up in combinations constituting valid UTF-8 sequences
and thus will pass transparently).
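
To make the byte patterns above concrete, here is a minimal Python
sketch of such a scheme.  It is not Emacs's actual code: the function
names to_internal and from_internal are made up for illustration, and
the validity check simply leans on Python's strict UTF-8 decoder.
Well-formed UTF-8 passes through unchanged, every other byte in
0x80...0xff becomes a two-byte 0xc0/0xc1 sequence, and the original
bytes can always be recovered.

```python
def _utf8_length(lead: int) -> int:
    """Length of a well-formed UTF-8 sequence starting with `lead`, or 0."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte, or a byte that is never a valid lead


def _is_valid_utf8(chunk: bytes) -> bool:
    try:
        chunk.decode("utf-8")  # strict: rejects overlong forms and surrogates
        return True
    except UnicodeDecodeError:
        return False


def to_internal(data: bytes) -> bytes:
    """Keep valid UTF-8 as is; escape every other high byte as 0xc0/0xc1 + tail."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(b)
            i += 1
            continue
        n = _utf8_length(b)
        chunk = data[i:i + n]
        if n and len(chunk) == n and _is_valid_utf8(chunk):
            out += chunk                  # proper UTF-8 passes transparently
            i += n
        else:
            # 0x80..0xbf -> 0xc0 0x80..0xbf, 0xc0..0xff -> 0xc1 0x80..0xbf
            out += bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])
            i += 1
    return bytes(out)


def from_internal(internal: bytes) -> bytes:
    """Undo to_internal: 0xc0/0xc1 never occur in valid UTF-8, so they mark escapes."""
    out = bytearray()
    i = 0
    while i < len(internal):
        b = internal[i]
        if b in (0xC0, 0xC1) and i + 1 < len(internal):
            out.append(0x80 | ((b & 1) << 6) | (internal[i + 1] & 0x3F))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


mixed = "é".encode("utf-8") + b"\xff\x80"   # valid UTF-8 followed by two stray bytes
internal = to_internal(mixed)
assert internal == b"\xc3\xa9\xc1\xbf\xc0\x80"
assert from_internal(internal) == mixed      # byte stream is reproduced exactly
```

In this sketch the worst case is input consisting only of isolated
high bytes, where every byte doubles; ASCII and valid UTF-8 cost
nothing extra.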

> Instead, when decoding a byte string into Unicode, the application
> should be returned a list:
>
>    ( chars bytes chars bytes ... chars )
>
> or some similar mechanism.

This would seriously inflate random byte sequences and require string
handling to special-case the counters.  The Emacs way is comparatively
modest, and the internal representation meets most of the UTF-8
invariants important for fast string processing.  Perhaps the most
astonishing thing is that this reencoding results in sensible sort
orders: "Isolated bytes 0x80...0xff" sort right after 0x00...0x7f.
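
The sort-order claim can be checked with a few lines of Python.  This
is only a sketch under the assumption that strings are compared
bytewise in the internal representation described above; the helper
name escape_raw_byte is hypothetical and not taken from Emacs.

```python
def escape_raw_byte(b: int) -> bytes:
    """Internal two-byte form of an isolated byte 0x80..0xff (sketch)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xff are escaped")
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])


highest_ascii = b"\x7f"
escaped = [escape_raw_byte(b) for b in range(0x80, 0x100)]
lowest_multibyte = "\u0080".encode("utf-8")   # 0xc2 0x80, first non-ASCII character

# Escaped bytes keep their relative order under bytewise comparison ...
assert escaped == sorted(escaped)
# ... sort after every ASCII byte ...
assert highest_ascii < escaped[0]
# ... and before every properly encoded non-ASCII character.
assert escaped[-1] < lowest_multibyte
```

The reason is that the lead bytes 0xc0 and 0xc1 sit between the ASCII
range and 0xc2, the smallest lead byte of any valid multi-byte UTF-8
sequence.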

-- 
David Kastrup



