From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kenichi Handa
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: Emacs puts binary junk into the clipboard, marking it as text
Date: Wed, 20 Sep 2006 11:20:43 +0900
References: <1158280855.14121.69.camel@chrislap.madeupdomain.com>
	<450A514E.6020205@swipnet.se> <450BE084.10905@swipnet.se>
	<450C3380.2050008@swipnet.se>
Original-To: Stefan Monnier
Cc: christopher.ian.moore@gmail.com, emacs-pretest-bug@gnu.org,
	ihs_4664@yahoo.com, richard.stallman@gnu.org, emacs-devel@gnu.org
In-reply-to: (message from Stefan Monnier on Tue, 19 Sep 2006 12:15:58 -0400)
Content-Type: text/plain; charset=US-ASCII

Stefan Monnier writes:

> I think we can't know what should be done, so we should strive for
> simplicity and try to avoid losing information.  I.e. just return the
> unibyte string as-is.

Even if it doesn't conform to ICCCM?  I'll attach the relevant part of
that document.

"Jan D." writes:

> W.r.t the standards, Emacs has two choices, return a valid UTF8-string
> or don't return anything at all.  I'm beginning to think the second
> option is the best.

This will be useful for checking UTF-8 validity.

(define-ccl-program ccl-check-utf-8
  '(0
    ((r0 = 1)
     (loop
      (read-if (r1 < #x80) (repeat)
        ((r0 = 0)
         (if (r1 < #xC2) (end))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xE0) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xF0) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 < #xF8) ((r0 = 1) (repeat)))
         (read r2)
         (if ((r2 & #xC0) != #x80) (end))
         (if (r1 == #xF8) ((r0 = 1) (repeat)))
         (end))))))
  "Check if the input unibyte string is a valid UTF-8 sequence or not.
If it is valid, set the register `r0' to 1, else set it to 0.")

(defun string-utf-8-p (string)
  "Return non-nil iff STRING is a unibyte string of valid UTF-8 sequence."
  (if (or (not (stringp string))
          (multibyte-string-p string))
      (error "Not a unibyte string: %s" string))
  (let ((status (make-vector 9 0)))
    (ccl-execute-on-string ccl-check-utf-8 status string)
    (= (aref status 0) 1)))

---
Kenichi Handa
handa@m17n.org

        Inter-Client Communication Conventions Manual
                    Version 2.0.xf86.1

[...]

2.7.  Use of Selection Properties

The names of the properties used in selection data transfer are chosen
by the requestor.  The use of None property fields in ConvertSelection
requests (which request the selection owner to choose a name) is not
permitted by these conventions.

The selection owner always chooses the type of the property in the
selection data transfer.  Some types have special semantics assigned
by convention, and these are reviewed in the following sections.

In all cases, a request for conversion to a target should return
either a property of one of the types listed in the previous table for
that target or a property of type INCR and then a property of one of
the listed types.

Certain selection properties may contain resource IDs.  The selection
owner should ensure that the resource is not destroyed and that its
contents are not changed until after the selection transfer is complete.
Requestors that rely on the existence or on the proper contents of a
resource must operate on the resource (for example, by copying the
contents of a pixmap) before deleting the selection property.

The selection owner will return a list of zero or more items of the
type indicated by the property type.  In general, the number of items
in the list will correspond to the number of disjoint parts of the
selection.  Some targets (for example, side-effect targets) will be of
length zero irrespective of the number of disjoint selection parts.
In the case of fixed-size items, the requestor may determine the
number of items by the property size.  Selection property types are
listed in the table below.  For variable-length items such as text,
the separators are also listed.

     -------------------------------------
     Type Atom       Format   Separator
     -------------------------------------
     APPLE_PICT      8        Self-sizing
     ATOM            32       Fixed-size
     ATOM_PAIR       32       Fixed-size
     BITMAP          32       Fixed-size
     C_STRING        8        Zero
     COLORMAP        32       Fixed-size
     COMPOUND_TEXT   8        Zero
     DRAWABLE        32       Fixed-size
     INCR            32       Fixed-size
     INTEGER         32       Fixed-size
     PIXEL           32       Fixed-size
     PIXMAP          32       Fixed-size
     SPAN            32       Fixed-size
     STRING          8        Zero
     UTF8_STRING     8        Zero
     WINDOW          32       Fixed-size
     -------------------------------------

It is expected that this table will grow over time.

2.7.1.  TEXT Properties

In general, the encoding for the characters in a text string property
is specified by its type.  It is highly desirable for there to be a
simple, invertible mapping between string property types and any
character set names embedded within font names in any font naming
standard adopted by the Consortium.

The atom TEXT is a polymorphic target.  Requesting conversion into
TEXT will convert into whatever encoding is convenient for the owner.
The encoding chosen will be indicated by the type of the property
returned.  TEXT is not defined as a type; it will never be the
returned type from a selection conversion request.

If the requestor wants the owner to return the contents of the
selection in a specific encoding, it should request conversion into
the name of that encoding.

In the table in section 2.6.2, the word TEXT (in the Type column) is
used to indicate one of the registered encoding names.  The type would
not actually be TEXT; it would be STRING or some other ATOM naming the
encoding chosen by the owner.

STRING as a type or a target specifies the ISO Latin-1 character set
plus the control characters TAB (hex 09) and NEWLINE (hex 0A).  The
spacing interpretation of TAB is context dependent.  Other ASCII
control characters are explicitly not included in STRING at the
present time.

COMPOUND_TEXT as a type or a target specifies the Compound Text
interchange format; see the Compound Text Encoding.

UTF8_STRING as a type or a target specifies an UTF-8 encoded string,
with NEWLINE (U+000A, hex 0A) as end-of-line marker.

There are some text objects where the source or intended user, as the
case may be, does not have a specific character set for the text, but
instead merely requires a zero-terminated sequence of bytes with no
other restriction; no element of the selection mechanism may assume
that any byte value is forbidden or that any two differing sequences
are equivalent.  For these objects, the type C_STRING should be used.

     Rationale

     An example of the need for C_STRING is to transmit the names of
     files; many operating systems do not interpret filenames as
     having a character set.  For example, the same character string
     uses a different sequence of bytes in ASCII and EBCDIC, and so
     most operating systems see these as different filenames and
     offer no way to treat them as the same.  Thus no character-set
     based property type is suitable.

Type STRING, COMPOUND_TEXT, UTF8_STRING, and C_STRING properties will
consist of a list of elements separated by null characters; other
encodings will need to specify an appropriate list format.
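
For readers less familiar with CCL, the byte-level check in the program
above can be sketched in plain Emacs Lisp, together with a wrapper for
the "valid UTF-8 or nothing" option quoted earlier.  This is only an
illustration under stated assumptions: the `my-' names are hypothetical
and not part of Emacs, and, like the CCL program as read here, the
sketch tests only lead-byte ranges and continuation bytes, so overlong
forms are not rejected.  The null separators between list elements are
below #x80 and therefore pass as ordinary single bytes.

```elisp
;; Hypothetical helpers for illustration; the `my-' names are assumptions.

(defun my-utf-8-trail-count (byte)
  "Return how many continuation bytes should follow lead BYTE.
Return nil if BYTE cannot start a sequence.  The ranges mirror the
comparisons made on r1 in the CCL program."
  (cond ((< byte #x80) 0)
        ((< byte #xC2) nil)
        ((< byte #xE0) 1)
        ((< byte #xF0) 2)
        ((< byte #xF8) 3)
        ((= byte #xF8) 4)
        (t nil)))

(defun my-utf-8-bytes-p (string)
  "Return non-nil if unibyte STRING is structurally valid UTF-8."
  (let ((i 0)
        (len (length string))
        (valid t))
    (while (and valid (< i len))
      (let ((trail (my-utf-8-trail-count (aref string i))))
        (setq i (1+ i))
        (if (or (null trail) (> (+ i trail) len))
            ;; Bad lead byte, or the sequence is cut off by end of input.
            (setq valid nil)
          ;; Each continuation byte must look like 10xxxxxx.
          (dotimes (_ trail)
            (unless (= (logand (aref string i) #xC0) #x80)
              (setq valid nil))
            (setq i (1+ i))))))
    valid))

(defun my-selection-utf-8-or-nil (string)
  "Return unibyte STRING if it passes the check above, else nil."
  (and (not (multibyte-string-p string))
       (my-utf-8-bytes-p string)
       string))

;; Usage: a two-byte sequence is returned unchanged; a lead byte
;; followed by a non-continuation byte is refused.
(my-selection-utf-8-or-nil (unibyte-string #xC3 #xA9))
(my-selection-utf-8-or-nil (unibyte-string #xC3 #x28)) ; => nil
```

Returning nil rather than signaling an error leaves the decision about
what to do with a refused string to the caller.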