From: Nathan Trapuzzano
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 13:01:17 -0400
Message-ID: <87a9c9aqhu.fsf@nbtrap.com>
References: <831txozsqa.fsf@gnu.org> <83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp> <8361myyac6.fsf@gnu.org> <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp> <83eh1mfd09.fsf@gnu.org> <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp>
To: "Stephen J. Turnbull"
Cc: Eli Zaretskii, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
In-Reply-To: <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Sat, 29 Mar 2014 18:23:17 +0900")

"Stephen J. Turnbull" writes:

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Sorry if someone brought this up already, but one reason raw bytes
shouldn't be represented as Latin-1 characters is that their "raw
bytes"-ness would be lost when writing them back to disk if the stream
also contained characters outside the Latin-1 range.

For example, say we decode a stream of raw bytes as UTF-8, but the
stream contains some sequences that are not valid UTF-8.  IIUC, Emacs
will interpret those as "raw bytes", so that when it encodes the string
to write it back, they are written back verbatim.  If they had instead
been interpreted as Latin-1 characters, they would be written back as
the UTF-8 encodings of those characters (e.g. the single byte 0xFF would
come back as the two bytes 0xC3 0xBF).  Hence you have the odd situation
where you can decode and then encode and end up with a different string.

Someone brought up Python in another post.  Python (version 3 at least)
does the same thing when, e.g., interpreting filenames.  If you pass a
string (_not_ bytes) to os.listdir, but some names in the directory
can't be decoded as UTF-8, it will return strings (_not_ bytes) in which
the undecodable bytes are represented by Python-specific "characters"
(lone surrogate code points, U+DC80 through U+DCFF, produced by the
"surrogateescape" error handler) standing for "raw bytes", i.e. entities
to be written back to disk as the same raw sequences that were read
from it.
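
To make the round-trip argument concrete, here is a minimal Python 3
sketch.  It is only an illustration, not how Emacs does it.  The
"latin1-fallback" error handler is a hypothetical name registered just
for this example to mimic "treat undecodable bytes as Latin-1
characters"; "surrogateescape" is Python's built-in handler that keeps
them as raw-byte placeholders.

```python
import codecs


def _latin1_fallback(exc):
    """Hypothetical handler: map each undecodable byte to its Latin-1 character."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin-1"), exc.end
    raise exc


# The handler name is an assumption made up for this sketch.
codecs.register_error("latin1-fallback", _latin1_fallback)

# Valid UTF-8 ("café"), then two bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe end"

# Approach 1: undecodable bytes become ordinary Latin-1 characters.
as_latin1 = raw.decode("utf-8", errors="latin1-fallback")
lossy = as_latin1.encode("utf-8")   # 0xFF is re-encoded as 0xC3 0xBF

# Approach 2: undecodable bytes become lone surrogates (U+DC80..U+DCFF).
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(lossy == raw)      # the Latin-1 round trip changes the data
print(restored == raw)   # the surrogateescape round trip preserves it
```

With the Latin-1 fallback, the decoded string has no way to tell a "real"
U+00FF from a stray 0xFF byte, so encoding can't restore the original.
The raw-byte placeholders keep that distinction.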