From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Philipp Stephani
Newsgroups: gmane.emacs.bugs
Subject: bug#53236: 26.1; encode-coding-string does not encode the string as expected
Date: Thu, 13 Jan 2022 21:23:33 +0100
Message-ID:
References: <8735lra07e.fsf@metalevel.at>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
logging-data="14923"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: 53236@debbugs.gnu.org
To: Markus Triska
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Thu Jan 13 21:24:15 2022
Return-path:
Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org
X-Loop: help-debbugs@gnu.org
Resent-From: Philipp Stephani
Original-Sender: "Debbugs-submit"
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Thu, 13 Jan 2022 20:24:02 +0000
Resent-Message-ID:
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 53236
X-GNU-PR-Package: emacs
In-Reply-To: <8735lra07e.fsf@metalevel.at>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
the Swiss army knife of text editors"
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: "bug-gnu-emacs"
Xref: news.gmane.io gmane.emacs.bugs:224133
Archived-At:
On Thu, 13 Jan 2022 at 21:14, Markus Triska wrote:
>
> Dear all,
>
> please consider the UTF-8 encoding of the Unicode codepoint 0x80, which
> is formed by two bytes. In hexadecimal notation, they are: 0xC2 0x80.
>
> We can use decode-coding-string to verify that this byte sequence is
> decoded to 0x80 when specifying utf-8, which works exactly as expected:
>
> (decode-coding-string "\xC2\x80" 'utf-8)
>
> This yields "\200", which is the same as "\x80", as verified via:
>
> (string= "\200" "\x80") --> t
There are two possible interpretations of "\200":
1. The unibyte string containing the byte #x80
2. The multibyte string containing the Unicode character U+0080
The string literal "\200" gives you the former, while
(decode-coding-string "\xC2\x80" 'utf-8) gives you the latter. In
fact,
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\200") ⇒ nil
but
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\u0080") ⇒ t
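If it helps to see the two representations side by side, here is a minimal sketch using multibyte-string-p, string-bytes and aref. The helper name demo-string-repr is made up for this illustration and is not part of Emacs.

```elisp
;; Illustration only: `demo-string-repr' is a hypothetical helper.
(defun demo-string-repr (s)
  "Return (MULTIBYTE-P BYTE-COUNT FIRST-ELEMENT) for the string S."
  (list (multibyte-string-p s)   ; t for multibyte, nil for unibyte
        (string-bytes s)         ; size of the internal representation
        (aref s 0)))             ; first element as an integer

;; Unibyte literal: one raw byte #x80.
(demo-string-repr "\200")                                   ; (nil 1 128)
;; Decoded string: the character U+0080, stored as two bytes.
(demo-string-repr (decode-coding-string "\xC2\x80" 'utf-8)) ; (t 2 128)
;; Same as the multibyte literal.
(demo-string-repr "\u0080")                                 ; (t 2 128)
```

Both strings print as "\200" and both have 128 as their first element, so the difference only shows up in the first two fields.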
>
> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>
> (encode-coding-string "\200" 'utf-8) --> "\200"
Here "\200" gives you the unibyte string that contains the byte #x80.
That can't be encoded as UTF-8 (since UTF-8 encodes Unicode scalar
values, not raw bytes), so it's left alone.
However,
(encode-coding-string "\u0080" 'utf-8) ⇒ "\302\200"
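As a rough sketch of the options, assuming the byte #x80 in your unibyte string is really meant as the character U+0080: one way to get there is to decode it as Latin-1 first and then encode the result. The helper name demo-utf-8-bytes is invented for this example.

```elisp
;; Illustration only: `demo-utf-8-bytes' is a hypothetical helper.
(defun demo-utf-8-bytes (s)
  "Return the bytes of the UTF-8 encoding of the string S as a list."
  (append (encode-coding-string s 'utf-8) nil))

;; Multibyte literal containing the character U+0080.
(demo-utf-8-bytes "\u0080")                                 ; (194 128)
;; Unibyte byte #x80 interpreted as Latin-1, i.e. as U+0080, then encoded.
(demo-utf-8-bytes (decode-coding-string "\200" 'latin-1))   ; (194 128)
;; Unibyte raw byte #x80: passed through unchanged, as in the report.
(demo-utf-8-bytes "\200")                                   ; (128)
```

Whether Latin-1 is the right interpretation depends on where the bytes came from; that part is an assumption of the sketch.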
There's some background in the section "Text Representations" (chapter
"Non-ASCII Characters") of the ELisp manual.
HTH