From mboxrd@z Thu Jan 1 00:00:00 1970
From: ynyaaa@gmail.com
To: Eli Zaretskii
Cc: 37580@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date: Sun, 06 Oct 2019 02:18:08 +0900
Message-ID: <86lftz157z.fsf@gmail.com>
References: <86ftkbscry.fsf@gmail.com> <83h84r89i4.fsf@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (windows-nt)
In-Reply-To: <83h84r89i4.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 02 Oct 2019 18:14:43 +0300")

Eli Zaretskii writes:

> I don't think this is a bug.  Changing the multibyte-ness of a buffer
> really does change the contents.  You should only do that where it
> makes sense.

Sometimes I find broken UTF-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate is
encoded as if it were an ordinary BMP character.
The utf-8 coding system does not decode such sequences.
Changing the multibyte-ness converts them to surrogate characters,
and an encode-decode round trip through utf-16be then yields the
intended characters.

Suppose the character is #x10000; the corresponding surrogate pair is
(#xD800 #xDC00).

The mis-encoded sequence is:

(encode-coding-string "\xD800\xDC00" 'utf-8)
=> "\355\240\200\355\260\200"

It is not decoded with utf-8.

(decode-coding-string
 (encode-coding-string "\xD800\xDC00" 'utf-8)
 'utf-8)
=> "\355\240\200\355\260\200"

When the multibyte-ness is changed, the sequence is converted into
surrogate characters.

(with-temp-buffer
  (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
  (set-buffer-multibyte nil)
  (set-buffer-multibyte t)
  (buffer-string))
=> "\xD800\xDC00"

The surrogate pair can then be converted into the original character.

(decode-coding-string
 (encode-coding-string "\xD800\xDC00" 'utf-16be)
 'utf-16be)
=> "\x10000"
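
The numbers in the example above can also be checked with plain
integer arithmetic, without going through any coding system.  This is
only a minimal sketch: the helper names my-surrogates-to-char and
my-utf-8-3-bytes are hypothetical and are not part of Emacs.  The
second helper applies the ordinary UTF-8 three-byte pattern with no
surrogate check, on the assumption that this is what the broken
encoders do.

```elisp
;; Hypothetical helpers, not part of Emacs; integer arithmetic only.

(defun my-surrogates-to-char (high low)
  "Combine the surrogate pair HIGH and LOW into one code point.
HIGH must be in #xD800..#xDBFF and LOW in #xDC00..#xDFFF."
  (unless (and (<= #xD800 high #xDBFF) (<= #xDC00 low #xDFFF))
    (error "Not a surrogate pair: #x%X #x%X" high low))
  (+ #x10000
     (ash (- high #xD800) 10)   ; upper 10 bits come from HIGH
     (- low #xDC00)))           ; lower 10 bits come from LOW

(defun my-utf-8-3-bytes (code)
  "Return the bytes of the UTF-8 three-byte pattern for CODE.
CODE must be in #x800..#xFFFF.  Surrogates are NOT rejected here,
which is the assumed behavior of the broken encoders."
  (list (logior #xE0 (ash code -12))
        (logior #x80 (logand (ash code -6) #x3F))
        (logior #x80 (logand code #x3F))))

(my-surrogates-to-char #xD800 #xDC00)   ; => 65536, i.e. #x10000
(my-utf-8-3-bytes #xD800)               ; => (237 160 128), i.e. \355 \240 \200
(my-utf-8-3-bytes #xDC00)               ; => (237 176 128), i.e. \355 \260 \200
```

The six bytes match the "\355\240\200\355\260\200" sequence shown
earlier, and the pair (#xD800 #xDC00) maps back to #x10000.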