From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel,gmane.emacs.gnus.general
Subject: Re: MML charset tag regression
Date: Tue, 29 Apr 2003 16:12:12 +0900
Organization: The XEmacs Project
Message-ID: <874r4h4w1f.fsf@tleepslib.sk.tsukuba.ac.jp>
References: <8465p3kgpl.fsf@lucy.is.informatik.uni-duisburg.de>
	<84bryuogke.fsf@lucy.is.informatik.uni-duisburg.de>
	<200304281158.UAA10974@etlken.m17n.org>
Original-To: Kenichi Handa
Cc: emacs-devel@gnu.org
Original-cc: ding@gnus.org
In-Reply-To: (Simon Josefsson's message of "Tue, 29 Apr 2003 01:05:12 +0200")
User-Agent: Gnus/5.090016 (Oort Gnus v0.16) XEmacs/21.5 (cabbage)
List-Id: Emacs development discussions.

>>>>> "Simon" == Simon Josefsson writes:

    Simon> Emacs behaves different from xterm, gnome-terminal, gedit,
    Simon> etc though.

The X protocol is designed so that clients with different needs/wants
can negotiate the best available transfer.

    Simon> Is this a bug in that client?

Yes.  We lose.

    Simon> Or maybe emacs can detect that the TEXT request failed?  Is
    Simon> "?????" some magic string emacs can test for?

No.  Heuristic, yes.  Standard or wide-spread practice, no.
Unfortunately, a failed request should return a failure indication,
and no data, not some bogus data.  Apparently these clients fail to do
that correctly.

The big problem with TEXT is that it gives the requestor no way to
negotiate content.  TEXT is simply whatever the selection owner
chooses to spew; you'd better be able to handle it.

Emacs should avoid asking for TEXT.  The algorithm should be

0.  Ask for TARGETS.  A proper client will be able to tell you what it
    supports.  (We may be able to cache this information, and avoid X
    protocol round-trips.)

    In steps 1-4 below, qualify with "unless known to be unavailable."

1.  Ask for UTF8_STRING or COMPOUND_TEXT first.
    Default to UTF8_STRING, but there should be a user option to start
    with COMPOUND_TEXT (the Unihan disambiguation problem).

2.  Ask for the other universal encoding.

3.  Ask for STRING (ISO 8859/1, if that is not known to be
    unacceptable).

4.  Ask for Heaven's intercession, and TEXT.

(Now I see why UTF8_STRING is a good thing; even though the _sender_
can use COMPOUND_TEXT to send UTF-8 reliably, requesting COMPOUND_TEXT
doesn't restrict the sender to UTF-8.)

    Simon> Unless there is some well-agreed on non-controversial
    Simon> recommendation on how internationalized X11 cut'n'paste
    Simon> should work, all attempts to get a complete system working
    Simon> seems futile.

I don't see why the above should be controversial, except that there's
the Unihan political issue, and some Asian language users would want
the factory default to be COMPOUND_TEXT in Han-using locales.

To deal with broken clients, it might be best to have the above
algorithm implemented as a Lisp list containing targets in order of
desirability.  Then if a client is known to send junk when
COMPOUND_TEXT is requested, you can simply skip that request.

This might also allow the selection request function to be flexibly
used.  (Eg, if the selection contains an image, you could prepend
(PIXMAP POSTSCRIPT) to the list of text targets, where presumably the
text targets would get the ALT string from HTML or a tooltip from a
toolbar button, etc.  To get a file name, you could prepend (FILE)
(the problem with the text targets is that they might be interpreted
as "send me the file contents").  And so on.)

By having a cache of windows we've gotten stuff from, we could
(1) avoid round-trips to get the TARGETS list, and (2) keep a record
of TARGETs that give undesired results, etc.

    Simon> Galeon uses GTK2 and obviously it doesn't produce a good
    Simon> COMPOUND_TEXT.

Depends on what you mean by "good."
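
An aside on steps 0-4 and the window cache described above: the
following is a rough sketch of how they might fit together.  It is
written in Python purely for illustration (not Emacs Lisp), and it
talks to no real X or Emacs API.  The names SelectionCache,
fetch_selection, candidate_targets, HAN_ORDER and the `request'
callback are all hypothetical; `request' is assumed to return None
when the owner refuses a target.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

# Preference order from steps 1-4; TEXT is the last resort.
DEFAULT_ORDER = ["UTF8_STRING", "COMPOUND_TEXT", "STRING", "TEXT"]
# Hypothetical alternative for users who prefer COMPOUND_TEXT first.
HAN_ORDER = ["COMPOUND_TEXT", "UTF8_STRING", "STRING", "TEXT"]

# Assumed callback: request(window, target) -> data, or None on failure.
Request = Callable[[int, str], Any]


class SelectionCache:
    """Per owner window: advertised TARGETS and targets known to misbehave."""

    def __init__(self) -> None:
        self.advertised: Dict[int, Set[str]] = {}
        self.bad: Dict[int, Set[str]] = {}

    def mark_bad(self, window: int, target: str) -> None:
        self.bad.setdefault(window, set()).add(target)


def candidate_targets(window: int, order: List[str],
                      cache: SelectionCache) -> List[str]:
    """Filter `order` by "unless known to be unavailable"."""
    advertised = cache.advertised.get(window)  # None: TARGETS never answered
    bad = cache.bad.get(window, set())
    return [t for t in order
            if t not in bad and (advertised is None or t in advertised)]


def fetch_selection(window: int, request: Request,
                    order: List[str] = DEFAULT_ORDER,
                    cache: Optional[SelectionCache] = None
                    ) -> Optional[Tuple[str, Any]]:
    """Try targets in order of desirability; return (target, data) or None."""
    cache = cache if cache is not None else SelectionCache()
    if window not in cache.advertised:
        targets = request(window, "TARGETS")  # step 0, skipped when cached
        if targets is not None:
            cache.advertised[window] = set(targets)
    for target in candidate_targets(window, order, cache):
        data = request(window, target)
        if data is not None:
            return target, data
        cache.mark_bad(window, target)  # remember the failure for next time
    return None
```

Passing order=HAN_ORDER would model the user option to start with
COMPOUND_TEXT.  Detecting junk replies such as "?????" stays a
heuristic that the caller would have to supply; the sketch only
records outright failures.
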
This method guarantees that a font capable of displaying the text is
available in the standard X distribution (ISTR that ISO 8859/5 fonts
appeared well after Japanese fonts in X, and I doubt that X
distributes KOI8 fonts at all, although they're easily available).

    >> The new encoding method using "Non-Standard Character Set
    >> Encodings" of COMPOUND_TEXT makes the cyrillic case much more
    >> complicated.  In some case (perhaps only in KOI8 locale), X
    >> clients recently start to encode cyrillic characters in
    >> "ESC % / 0 ...".  They don't consider the situation that the
    >> requester is running in a different locale.  :-(

I don't understand the problem; as long as the extended segment is
properly formed, you know it's KOI8.  How is this different from TEXT?
The extended segment is much better than the alternative I've seen,
which is sending non-Latin-1 text as STRING!

    Simon> Do you mean the client sends data in a locale-specific
    Simon> charset via COMPOUND_TEXT?  Ouch.

COMPOUND_TEXT _is_ basically locale-specific.  It's a modal ISO 2022
encoding.  The only semantic difference between the usual escape
sequence and the extended segment used for UTF-8 and KOI8 is that
extended segments can be used for not-yet-standardized encodings that
don't have an ISO-registered final byte.  The method is actually
better than that for the standard encodings since it includes a length
parameter.

    >> Perhaps, we should make Emacs to request UTF8_STRING at first
    >> if the locale is UTF8, and if that request fails, request
    >> COMPOUND_TEXT.

    Simon> This sounds like a good idea to me.

Locales are just plain broken for this purpose.  As Handa-san points
out, you have no idea what locale the partner is running in.  Our own
locale is the best heuristic for Emacs if the partner is unwilling to
talk about it, but really we need clients that implement a proper
negotiation protocol.  I'm regularly running clients in three separate
locales simultaneously on the same host (POSIX, ja_JP.eucJP,
en_US.utf8).
I imagine many Europeans are in a similar situation.  (And I haven't
even started to talk about my development/testing environment!)

I think that we should start by being "selfish", ie, think about what
form of text Emacs is best prepared to use, and request that.  I would
say _always_ request UTF8_STRING unless we have reason to believe the
sender can't do it (eg, previously failed) or our user would prefer
COMPOUND_TEXT (eg, that fraction of Han users).  (I'm thinking in
terms of emacs-unicode, obviously.)

Also, a related topic, I think that we should think carefully about
canonicalizing variant codes (such as "full-width" Latin or Cyrillic
characters).  For example, I'm pretty careful about the aesthetics of
half-width and full-width characters in my Japanese mail, but my
colleagues no longer are (in fact, I once received a mail in which the
4 digits of the year were in three different encodings!  JIS X 0201,
JIS X 0208, and ASCII).  When I investigated this curiosity, what I
found is that on most Windows and Mac systems the full- and half-width
variants are visually hard to distinguish, and the JIS Roman and ASCII
characters are the identical glyph with different indices in the Cmap
going to the same CID.

Of course such canonicalization needs to be user-controllable, but I
doubt most users will even notice if we default to canonicalization.

-- 
Institute of Policy and Planning Sciences  http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba          Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
          Ask not how you can "do" free software business;
         ask what your business can "do for" free software.