From: Andy Wingo
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Unicode and Guile
Date: Thu, 6 Nov 2003 20:16:35 +0200
Message-ID: <20031106181635.GA9546@lark>
References: <20031021171534.GA13246@lark> <200310260003.RAA10375@morrowfield.regexps.com> <20031031132525.GB715@lark> <200311032031.MAA19389@morrowfield.regexps.com>
In-Reply-To: <200311032031.MAA19389@morrowfield.regexps.com>
Original-To: guile-devel@gnu.org
List-Id: Developers list for Guile, the GNU extensibility library

Well, I just wasted an afternoon reading the Unicode 4.0 spec. I'll
never have that time back again ;)

On Mon, 03 Nov 2003, Tom Lord wrote:

> Part of the problem is that Unicode specifications are very careful to
> _not_ define "character" (except ambiguously).
>
> In different contexts related to my question, it might mean a unicode
> code point, a code value, or something more complicated such as a
> grapheme (which may be represented as a string of unicode code
> points).

There is the term "abstract characters", which maps to code points. But
yes, I see that there are some problems with the concept. What do you
mean by "code value"?

We have the following passage in chapter two of the spec, talking about
one of the ten design principles, "Characters, Not Glyphs":

    The Unicode Standard draws a distinction between characters and
    glyphs. Characters are the abstract representations of the smallest
    components of written language that have semantic value. They
    represent primarily, but not exclusively, the letters, punctuation,
    and other signs that constitute natural language text and technical
    notation. Characters are represented by code points that reside
    only in a memory representation, as strings in memory, or on disk.
    The Unicode Standard deals only with character codes.

Character codes, or code points encoded via UTF-8, UTF-16, or UTF-32,
are characters, according to this passage.

Granted, end users sometimes expect certain character combinations to
be treated as characters, such as `ch' in traditional Spanish, or a
lower-case a combined with a grave accent in French. That level of
processing is much higher than that of simple strings, and more related
(it seems to me) to the rendering of glyphs on the screen.

I've actually had a change of thought about encodings; while UTF-8 or
UTF-16 are good for disk and network transfer, the extensive
character-based API of Scheme (read-char from ports, for instance)
lends itself better to a fixed-width representation of individual
characters. So the natural native format for strings in memory might
well be UTF-32. But that's another issue...

> It's a nasty problem to try to unify unicode types with scheme types.

Indeed :-/

> Suppose:
>
> * CHAR? is a code value in some encoding (say, UTF-8 or UTF-16)

I think I'm leaning towards 4-byte values here so that you can never
read-char a partial character.

> * CHAR? is a unicode code point -- a 21 bit value.
>
> This approach has the same problems with string efficiency or
> complexity

Complexity isn't so much of an issue. Efficiency is, however;
applications with large amounts of string data might want to choose a
different encoding for their storage.

> * CHAR? is a "grapheme" -- the user's idea of a character.
>
> Ray Dillenger is currently exploring this (see recent c.l.s.)

Will check this out. It does sound painful, though :/

> >> There's a need for a new type, `text', which acts like the text
> >> contents of an emacs buffer
>
> > Maybe. This issue is, in my opinion, orthogonal to simple strings.
>
> But perhaps its worth mentioning in this context because it suggests a
> very straightforward approach for Guile:
>
> CHAR? is 8 bits. STRING? is a sequence of 8-bit chars. And
> everything unicode is orthogonal to that.
> While there may be support for manipulating unicode strings
> represented as STRING? and unicode characters represented as CHAR?,
> fundamentally, CHAR? and STRING? are kept butt-simple and the unicode
> support is something new.

Hm. Let's consider some use cases. Let's say an app wants to ask the
user her name; she might want to write it in her native Arabic. Or
perhaps her address, or anything "local". If the app then wants to
communicate this information to her Chinese friend (who also knows
Arabic), the need for Unicode is fundamental. We can probably agree
there.

The question becomes, is the user's name logically a simple string (can
we read it in with standard procedures), or must we use this
text-buffer, complete with marks, multiple backends, et al? It seems
more natural, to me, for this to be provided via simple strings,
although I could be wrong here.

I was looking at what Python did
(http://www.python.org/peps/pep-100.html), and they did make a
distinction. They have a separate unicode string representation which,
like strings, is a subclass of SequenceObject. So they kept a separate
representation while remaining relatively transparent to the
programmer. Pretty slick, it seems.

C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16 as
their native format, although I've never coded in the first two.
Networking was probably the most important consideration there.

Perhaps the way forward would be to leave what we've been calling
"simple strings" alone, and somehow (perhaps with GOOPS, but I haven't
thought too much about this) pull the Python trick of having a unicode
string that can be used everywhere simple strings can.

Thoughts on that idea?

Regards,

wingo.
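
The candidate meanings of CHAR? discussed in this thread (code value,
code point, grapheme) can be counted directly. What follows is only a
rough sketch in Python 3, used here because its str type happens to be
a sequence of code points; it is not the Python design described in
PEP 100, and it is not Guile code. The helper names code_units,
summary, and is_complete are made up for illustration.

```python
import unicodedata


def code_units(text, encoding, unit_size):
    """Count the code units needed to store text in the given encoding."""
    return len(text.encode(encoding)) // unit_size


def summary(text):
    """Report the size of text under each candidate notion of 'character'."""
    return {
        "code_points": len(text),  # Python 3 str is indexed by code point
        "utf8_units": code_units(text, "utf-8", 1),
        "utf16_units": code_units(text, "utf-16-le", 2),
        "utf32_units": code_units(text, "utf-32-le", 4),
    }


def is_complete(data, encoding):
    """Return True if data decodes cleanly, False if it is cut mid-character."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The same user-perceived letter written two ways, plus a code point
# outside the Basic Multilingual Plane.
composed = "\u00e0"      # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF

for label, text in [("composed", composed),
                    ("decomposed", decomposed),
                    ("clef", clef)]:
    print(label, summary(text))

# Reading one byte of a two-byte UTF-8 sequence yields a partial
# character; reading one 4-byte UTF-32 unit never does.
print(is_complete(composed.encode("utf-8")[:1], "utf-8"))
print(is_complete(clef.encode("utf-32-le")[:4], "utf-32-le"))

# Normalization maps the two spellings of the accented letter together.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Only the UTF-32 column always agrees with the code point count, which
is the property that makes a fixed 4-byte CHAR? attractive for
read-char. Neither count matches the user's idea of a character for the
decomposed spelling, which is the grapheme problem again.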