unofficial mirror of emacs-devel@gnu.org 
* Re: Emacs puts binary junk into the clipboard, marking it as text
       [not found] <1158280855.14121.69.camel@chrislap.madeupdomain.com>
@ 2006-09-15  7:07 ` Jan Djärv
  2006-09-15 16:30   ` Kevin Rodgers
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Djärv @ 2006-09-15  7:07 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel



Chris Moore wrote:
> Please describe exactly what actions triggered the bug
> and the precise symptoms of the bug:
> 
> I run the Xfce 4 desktop environment, along with the
> xfce4-clipman-plugin applet which collects clipboard entries and
> allows me to choose between them from a menu.
> 
> I have x-select-enable-clipboard set to t in Emacs, so whenever I
> 'kill' regions of the buffer, they get sent to the clipboard.
> 
> Occasionally the clipman applet will start consuming all available
> CPU.  This happens when I copy certain binary characters.  Seems the
> clipman gets stuck in a loop trying to convert an illegal UTF8
> string.
> 
> A very simple case which reproduces the bug:
> 
>> I made a 1-byte file containing just character 0300 (octal),
>> copied that using Emacs, and clipman started printing its error
>> message over and over again.
> 
> I reported this bug firstly to the Xfce BTS:
> 
>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
> 
> but they told me it was a gtk bug, so I raised the same bug in the
> GNOME tracker:
> 
>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
> 
> and they tell me it's an Emacs bug, saying:
> 
>> Well, if emacs puts binary junk into a text property it is not gtk's
>> fault.
>> Look at gtk_selection_data_get_text(): We only try to convert
>> something to utf8 if the sender claims that it is text...
> 
> So I'm raising it here too!

Isn't 0300 a valid unicode character?  Anyway, when Emacs gets a selection 
request for the clipboard with type UTF8_STRING, it eventually ends up in 
xselect-convert-to-string.  This function does:

	   ((eq type 'UTF8_STRING)
	    (setq str (encode-coding-string str 'utf-8)))

As far as I can tell, it does not check whether str is all text; it seems to
return non-text unconverted.  Should we check str first?  And if it does contain
non-text, what should Emacs send back as type?  STRING, TEXT?

	Jan D.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-15  7:07 ` Emacs puts binary junk into the clipboard, marking it as text Jan Djärv
@ 2006-09-15 16:30   ` Kevin Rodgers
  2006-09-16 11:31     ` Jan D.
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Rodgers @ 2006-09-15 16:30 UTC (permalink / raw)
  Cc: emacs-devel

Jan Djärv wrote:
> 
> 
> Chris Moore wrote:
>> Please describe exactly what actions triggered the bug
>> and the precise symptoms of the bug:
>>
>> I run the Xfce 4 desktop environment, along with the
>> xfce4-clipman-plugin applet which collects clipboard entries and
>> allows me to choose between them from a menu.
>>
>> I have x-select-enable-clipboard set to t in Emacs, so whenever I
>> 'kill' regions of the buffer, they get sent to the clipboard.
>>
>> Occasionally the clipman applet will start consuming all available
>> CPU.  This happens when I copy certain binary characters.  Seems the
>> clipman gets stuck in a loop trying to convert an illegal UTF8
>> string.
>>
>> A very simple case which reproduces the bug:
>>
>>> I made a 1-byte file containing just character 0300 (octal),
>>> copied that using Emacs, and clipman started printing its error
>>> message over and over again.
>>
>> I reported this bug firstly to the Xfce BTS:
>>
>>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
>>
>> but they told me it was a gtk bug, so I raised the same bug in the
>> GNOME tracker:
>>
>>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
>>
>> and they tell me it's an Emacs bug, saying:
>>
>>> Well, if emacs puts binary junk into a text property it is not gtk's
>>> fault.
>>> Look at gtk_selection_data_get_text(): We only try to convert
>>> something to utf8 if the sender claims that it is text...
>>
>> So I'm raising it here too!
> 
> Isn't 0300 a valid unicode character?

Yes, but it is not encoded as a single byte in UTF-8, it would be 2
bytes: o303 o200 (xC3 x80).
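
The byte counts can be checked directly.  What follows is a minimal
sketch, assuming a current Emacs with Unicode-based internals (which
postdates this thread); `my-utf-8-bytes' is a hypothetical helper name,
not an existing function:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-utf-8-bytes (string)
  "Return the UTF-8 encoding of STRING as a list of byte values."
  ;; `encode-coding-string' returns a unibyte string; `append' with a
  ;; final nil turns that string into a list of its bytes.
  (append (encode-coding-string string 'utf-8) nil))

(my-utf-8-bytes "\u00C0")   ; => (195 128), i.e. o303 o200 (xC3 x80)
(my-utf-8-bytes "A")        ; => (65), ASCII stays a single byte
```

A file holding only the single byte o300 is therefore not the UTF-8
form of U+00C0; on its own that byte is not a complete UTF-8 sequence.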

> Anyway, when Emacs gets a 
> selection request for the clipboard with type UTF8_STRING, it eventually 
> ends up in xselect-convert-to-string.  This function does:
> 
>        ((eq type 'UTF8_STRING)
>         (setq str (encode-coding-string str 'utf-8)))
> 
> As far as I can tell, it does not check if str is all text, it seems to 
> return  non-text unconverted.  Should we check str first?  And if it 
> does contain non-text, what should Emacs send back as type?  STRING, TEXT?

Doesn't that all depend on buffer-file-coding-system and
selection-coding-system being set correctly?

-- 
Kevin


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-15 16:30   ` Kevin Rodgers
@ 2006-09-16 11:31     ` Jan D.
  2006-09-16 17:25       ` Jan D.
  0 siblings, 1 reply; 13+ messages in thread
From: Jan D. @ 2006-09-16 11:31 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel

Kevin Rodgers wrote:
> Jan Djärv wrote:
>>
>>
>> Chris Moore wrote:
>>> A very simple case which reproduces the bug:
>>>
>>>> I made a 1-byte file containing just character 0300 (octal),
>>>> copied that using Emacs, and clipman started printing its error
>>>> message over and over again.
>>>
>>> I reported this bug firstly to the Xfce BTS:
>>>
>>>   http://bugzilla.xfce.org/show_bug.cgi?id=1945
>>>
>>> but they told me it was a gtk bug, so I raised the same bug in the
>>> GNOME tracker:
>>>
>>>   http://bugzilla.gnome.org/show_bug.cgi?id=349856
>>>
>>> and they tell me it's an Emacs bug, saying:
>>>
>>>> Well, if emacs puts binary junk into a text property it is not gtk's
>>>> fault.
>>>> Look at gtk_selection_data_get_text(): We only try to convert
>>>> something to utf8 if the sender claims that it is text...
>>>
>>> So I'm raising it here too!
>>
>> Isn't 0300 a valid unicode character?
>
> Yes, but it is not encoded as a single byte in UTF-8, it would be 2
> bytes: o303 o200 (xC3 x80).
>

But that is as it should be: UTF8_STRING says the data is in UTF-8, so Emacs
sends o303 o200.  gtk_selection_data_get_text does not complain about that.

Anyway, xfce should not loop like that, gtk_selection_data_get_text does 
not loop, it just prints one error message and returns.

>> Anyway, when Emacs gets a selection request for the clipboard with 
>> type UTF8_STRING, it eventually ends up in 
>> xselect-convert-to-string.  This function does:
>>
>>        ((eq type 'UTF8_STRING)
>>         (setq str (encode-coding-string str 'utf-8)))
>>
>> As far as I can tell, it does not check if str is all text, it seems 
>> to return  non-text unconverted.  Should we check str first?  And if 
>> it does contain non-text, what should Emacs send back as type?  
>> STRING, TEXT?
>
> Doesn't that all depend on buffer-file-coding-system and
> selection-coding-system being set correctly?
>

Yes, but I kind of assumed that was the case.

Anyway, I will fix this somehow, we should not be sending non-UTF8 as a 
UTF8_STRING.

    Jan D.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-16 11:31     ` Jan D.
@ 2006-09-16 17:25       ` Jan D.
  2006-09-19  5:05         ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Jan D. @ 2006-09-16 17:25 UTC (permalink / raw)
  Cc: emacs-pretest-bug, Kevin Rodgers, richard.stallman, emacs-devel

Jan D. wrote:
> Kevin Rodgers wrote:
>> Jan Djärv wrote:
>>>
>>>
>>> Chris Moore wrote:
>>>> A very simple case which reproduces the bug:
>>>>
>>>>> I made a 1-byte file containing just character 0300 (octal),
>>>>> copied that using Emacs, and clipman started printing its error
>>>>> message over and over again.
>>>>
>>>>
> Anyway, I will fix this somehow, we should not be sending non-UTF8 as 
> a UTF8_STRING.

I've checked in a fix that changes UTF8_STRING to STRING if the data 
doesn't look like UTF8.  However, this might give errors too.  The only 
way to be sure to copy raw binary data correctly is by adding a new type 
(like application-specific/octet-stream).   But if we do that, nobody 
will be able to get data from Emacs, as such a type is not standard and 
unsupported.  Copy-paste with raw binary data is just something most 
apps don't do.

Please try this, it is hopefully better.

    Jan D.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-16 17:25       ` Jan D.
@ 2006-09-19  5:05         ` Kenichi Handa
  2006-09-19  6:15           ` Jan Djärv
  2006-09-19 10:54           ` Stefan Monnier
  0 siblings, 2 replies; 13+ messages in thread
From: Kenichi Handa @ 2006-09-19  5:05 UTC (permalink / raw)
  Cc: emacs-pretest-bug, ihs_4664, christopher.ian.moore, emacs-devel,
	richard.stallman

In article <450C3380.2050008@swipnet.se>, "Jan D." <jan.h.d@swipnet.se> writes:

> I've checked in a fix that changes UTF8_STRING to STRING if the data 
> doesn't look like UTF8.  However, this might give errors too.  The only 
> way to be sure to copy raw binary data correctly is by adding a new type 
> (like application-specific/octet-stream).   But if we do that, nobody 
> will be able to get data from Emacs, as such a type is not standard and 
> unsupported.  Copy-paste with raw binary data is just something most 
> apps don't do.

AFAIK, only when TEXT is requested, a selection owner can
choose the returning type from STRING, COMPOUND_TEXT, or
UTF8_STRING.  When UTF8_STRING is requested, we should
return it or return nothing.

And, if Emacs owns a unibyte string, perhaps the right thing
is to make it multibyte according to the current
lang. env. (by string-make-multibyte) at first, then encode
it by utf-8.

---
Kenichi Handa
handa@m17n.org


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19  5:05         ` Kenichi Handa
@ 2006-09-19  6:15           ` Jan Djärv
  2006-09-19  7:14             ` Kenichi Handa
  2006-09-19 10:54           ` Stefan Monnier
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Djärv @ 2006-09-19  6:15 UTC (permalink / raw)
  Cc: richard.stallman, emacs-pretest-bug, ihs_4664,
	christopher.ian.moore, emacs-devel



Kenichi Handa wrote:
> In article <450C3380.2050008@swipnet.se>, "Jan D." <jan.h.d@swipnet.se> writes:
> 
>> I've checked in a fix that changes UTF8_STRING to STRING if the data 
>> doesn't look like UTF8.  However, this might give errors too.  The only 
>> way to be sure to copy raw binary data correctly is by adding a new type 
>> (like application-specific/octet-stream).   But if we do that, nobody 
>> will be able to get data from Emacs, as such a type is not standard and 
>> unsupported.  Copy-paste with raw binary data is just something most 
>> apps don't do.
> 
> AFAIK, only when TEXT is requested, a selection owner can
> choose the returning type from STRING, COMPOUND_TEXT, or
> UTF8_STRING.  When UTF8_STRING is requested, we should
> return it or return nothing.
> 
> And, if Emacs owns a unibyte string, perhaps the right thing
> is to make it multibyte according to the current
> lang. env. (by string-make-multibyte) at first, then encode
> it by utf-8.

What would that do to illegal UTF-8 sequences in the original unibyte string?
I.e., will this procedure always produce valid UTF-8 data?

	Jan D.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19  6:15           ` Jan Djärv
@ 2006-09-19  7:14             ` Kenichi Handa
  0 siblings, 0 replies; 13+ messages in thread
From: Kenichi Handa @ 2006-09-19  7:14 UTC (permalink / raw)
  Cc: emacs-pretest-bug, ihs_4664, christopher.ian.moore, emacs-devel,
	richard.stallman

In article <450F8AF7.5010702@swipnet.se>, Jan Djärv <jan.h.d@swipnet.se> writes:

> > AFAIK, only when TEXT is requested, a selection owner can
> > choose the returning type from STRING, COMPOUND_TEXT, or
> > UTF8_STRING.  When UTF8_STRING is requested, we should
> > return it or return nothing.
> > 
> > And, if Emacs owns a unibyte string, perhaps the right thing
> > is to make it multibyte according to the current
> > lang. env. (by string-make-multibyte) at first, then encode
> > it by utf-8.

> What would that do to illegal UTF-8 sequences in the original unibyte string? 

The original unibyte string won't be in UTF-8 format.  But,
string-make-multibyte will convert it to a correct multibyte
string, thus encoding that multibyte string by UTF-8 will
produce a correct UTF-8 string ... usually.

>   I.e. will this procedure always produce valid UTF-8 data?

No.  If a byte in the original unibyte string is not a valid
code point of the primary charset of the current lang. env.,
string-make-multibyte will produce a multibyte string that
contains eight-bit-control or eight-bit-graphic characters.
Then, encoding it by UTF-8 will result in an incorrect UTF-8
sequence.  So, for safety, we must delete such eight-bit
characters or replace them with U+FFFD (REPLACEMENT
CHARACTER) before encoding by UTF-8.

Or, in such a case, don't return anything (which means Emacs
doesn't hold a requested data).
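
The "replace them with U+FFFD" option could look roughly like this.  It
is only a sketch under stated assumptions: it targets a current Emacs
(23 or later), where undecodable bytes become raw-byte characters in
the range #x3FFF80..#x3FFFFF rather than eight-bit-control or
eight-bit-graphic, and it decodes the bytes as UTF-8 instead of going
through string-make-multibyte and the language environment.
`my-utf-8-sanitize' is a hypothetical name:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-utf-8-sanitize (bytes)
  "Return unibyte BYTES re-encoded as UTF-8, with bad bytes as U+FFFD.
BYTES is decoded as UTF-8; any byte that cannot be decoded is
replaced with U+FFFD (REPLACEMENT CHARACTER) before re-encoding."
  (let ((decoded (decode-coding-string bytes 'utf-8)))
    (encode-coding-string
     ;; `concat' accepts a list of characters and builds a string.
     (concat (mapcar (lambda (ch)
                       ;; Raw-byte characters lie above the Unicode
                       ;; range, so anything past #x10FFFF is a byte
                       ;; that did not decode.
                       (if (> ch #x10FFFF) #xFFFD ch))
                     decoded))
     'utf-8)))

;; The lone byte o300 becomes the three-byte UTF-8 form of U+FFFD.
(append (my-utf-8-sanitize "a\300b") nil)   ; => (97 239 191 189 98)
```

This keeps the request answerable at the cost of losing the original
byte, which is the trade-off against returning nothing.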

---
Kenichi Handa
handa@m17n.org


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19  5:05         ` Kenichi Handa
  2006-09-19  6:15           ` Jan Djärv
@ 2006-09-19 10:54           ` Stefan Monnier
  2006-09-19 11:14             ` Kenichi Handa
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2006-09-19 10:54 UTC (permalink / raw)
  Cc: Jan D., emacs-pretest-bug, ihs_4664, emacs-devel,
	richard.stallman, christopher.ian.moore

>> I've checked in a fix that changes UTF8_STRING to STRING if the data
>> doesn't look like UTF8.  However, this might give errors too.  The only
>> way to be sure to copy raw binary data correctly is by adding a new type
>> (like application-specific/octet-stream).   But if we do that, nobody
>> will be able to get data from Emacs, as such a type is not standard and
>> unsupported.  Copy-paste with raw binary data is just something most
>> apps don't do.

> AFAIK, only when TEXT is requested, a selection owner can
> choose the returning type from STRING, COMPOUND_TEXT, or
> UTF8_STRING.  When UTF8_STRING is requested, we should
> return it or return nothing.

Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used
to keep track of valid unicode chars that have no corresponding character in
emacs-mule.  So the presence of eight-bit-* chars does not imply that the
utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.

> And, if Emacs owns a unibyte string, perhaps the right thing
> is to make it multibyte according to the current
> lang. env. (by string-make-multibyte) at first, then encode
> it by utf-8.

That sounds terribly fragile/buggy.


        Stefan


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19 10:54           ` Stefan Monnier
@ 2006-09-19 11:14             ` Kenichi Handa
  2006-09-19 16:15               ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2006-09-19 11:14 UTC (permalink / raw)
  Cc: christopher.ian.moore, emacs-pretest-bug, ihs_4664,
	richard.stallman, emacs-devel

In article <jwv3baojfj7.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > AFAIK, only when TEXT is requested, a selection owner can
> > choose the returning type from STRING, COMPOUND_TEXT, or
> > UTF8_STRING.  When UTF8_STRING is requested, we should
> > return it or return nothing.

> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used
> to keep track of valid unicode chars that have no corresponding character in
> emacs-mule.  So the presence of eight-bit-* chars does not imply that the
> utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.

Yes, but such eight-bit-* chars can be detected by checking
`untranslated-utf-8' property.

> > And, if Emacs owns a unibyte string, perhaps the right thing
> > is to make it multibyte according to the current
> > lang. env. (by string-make-multibyte) at first, then encode
> > it by utf-8.

> That sounds terribly fragile/buggy.

Then, what do you think Emacs should do in such a case?

---
Kenichi Handa
handa@m17n.org


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19 11:14             ` Kenichi Handa
@ 2006-09-19 16:15               ` Stefan Monnier
  2006-09-19 19:32                 ` Jan D.
  2006-09-20  2:20                 ` Kenichi Handa
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Monnier @ 2006-09-19 16:15 UTC (permalink / raw)
  Cc: christopher.ian.moore, emacs-pretest-bug, ihs_4664,
	richard.stallman, emacs-devel

>> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used
>> to keep track of valid unicode chars that have no corresponding character in
>> emacs-mule.  So the presence of eight-bit-* chars does not imply that the
>> utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.

> Yes, but such eight-bit-* chars can be detected by checking
> `untranslated-utf-8' property.

Sure, but the current code doesn't do that.

>> > And, if Emacs owns a unibyte string, perhaps the right thing
>> > is to make it multibyte according to the current
>> > lang. env. (by string-make-multibyte) at first, then encode
>> > it by utf-8.

>> That sounds terribly fragile/buggy.

> Then, what do you think Emacs should do in such a case?

I think we can't know what should be done, so we should strive for
simplicity and try to avoid losing information.  I.e. just return the
unibyte string as-is.


        Stefan


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19 16:15               ` Stefan Monnier
@ 2006-09-19 19:32                 ` Jan D.
  2006-09-20  2:20                 ` Kenichi Handa
  1 sibling, 0 replies; 13+ messages in thread
From: Jan D. @ 2006-09-19 19:32 UTC (permalink / raw)
  Cc: Kenichi Handa, emacs-pretest-bug, ihs_4664, emacs-devel,
	richard.stallman, christopher.ian.moore

Stefan Monnier wrote:
>>> Also IIRC a perfectly valid utf-8 buffer may contain eight-bit-* chars, used
>>> to keep track of valid unicode chars that have no corresponding character in
>>> emacs-mule.  So the presence of eight-bit-* chars does not imply that the
>>> utf-8 encoded form of the text will contain an invalid utf-8 byte sequence.
>>>       
>
>   
>> Yes, but such eight-bit-* chars can be detected by checking
>> `untranslated-utf-8' property.
>>     
>
> Sure, but the current code doesn't do that.
>
>   
>>>> And, if Emacs owns a unibyte string, perhaps the right thing
>>>> is to make it multibyte according to the current
>>>> lang. env. (by string-make-multibyte) at first, then encode
>>>> it by utf-8.
>>>>         
>
>   
>>> That sounds terribly fragile/buggy.
>>>       
>
>   
>> Then, what do you think Emacs should do in such a case?
>>     
>
> I think we can't know what should be done, so we should strive for
> simplicity and try to avoid losing information.  I.e. just return the
> unibyte string as-is.
>   

That was the problem the original report was about.  Gtk+ applications
print big warnings.  And there is no agreed-upon selection type that
represents just bytes.

W.r.t the standards, Emacs has two choices, return a valid UTF8-string
or don't return anything at all.  I'm beginning to think the second
option is the best.

    Jan D.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-19 16:15               ` Stefan Monnier
  2006-09-19 19:32                 ` Jan D.
@ 2006-09-20  2:20                 ` Kenichi Handa
  2006-10-19  7:19                   ` Jan Djärv
  1 sibling, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2006-09-20  2:20 UTC (permalink / raw)
  Cc: christopher.ian.moore, emacs-pretest-bug, ihs_4664,
	richard.stallman, emacs-devel

In article <jwvk63z7s2s.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I think we can't know what should be done, so we should strive for
> simplicity and try to avoid losing information.  I.e. just return the
> unibyte string as-is.

Even if it doesn't conform to ICCCM?  I'll attach the
relevant part of that document.

"Jan D." <jan.h.d@swipnet.se> writes:

> W.r.t the standards, Emacs has two choices, return a valid UTF8-string
> or don't return anything at all.  I'm beginning to think the second
> option is the best.

This will be useful for checking UTF-8 validity.

(define-ccl-program ccl-check-utf-8
  '(0
    ((r0 = 1)
     (loop
      (read-if (r1 < #x80) (repeat)
	((r0 = 0)
	 (if (r1 < #xC2) (end))
	 (read r2)
	 (if ((r2 & #xC0) != #x80) (end))
	 (if (r1 < #xE0) ((r0 = 1) (repeat)))
	 (read r2)
	 (if ((r2 & #xC0) != #x80) (end))
	 (if (r1 < #xF0) ((r0 = 1) (repeat)))
	 (read r2)
	 (if ((r2 & #xC0) != #x80) (end))
	 (if (r1 < #xF8) ((r0 = 1) (repeat)))
	 (read r2)
	 (if ((r2 & #xC0) != #x80) (end))
	 (if (r1 == #xF8) ((r0 = 1) (repeat)))
	 (end))))))
  "Check if the input unibyte string is a valid UTF-8 sequence or not.
If it is valid, set the register `r0' to 1, else set it to 0.")

(defun string-utf-8-p (string)
  "Return non-nil iff STRING is a unibyte string of valid UTF-8 sequence."
  (if (or (not (stringp string))
	  (multibyte-string-p string))
      (error "Not a unibyte string: %s" string))
  (let ((status (make-vector 9 0)))
    (ccl-execute-on-string ccl-check-utf-8 status string)
    (= (aref status 0) 1)))
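
For comparison, the same kind of check can be sketched in plain Lisp.
This is an illustration only, with a hypothetical name, and it is
deliberately stricter than the CCL program: it follows RFC 3629, so it
rejects 5-byte sequences, overlong forms, surrogates, and code points
above U+10FFFF, which the CCL version appears to let through.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-utf-8-valid-p (bytes)
  "Return non-nil if the unibyte string BYTES is well-formed UTF-8.
Well-formedness is judged by the rules of RFC 3629."
  (let ((i 0) (len (length bytes)) (ok t))
    (while (and ok (< i len))
      (let* ((b (aref bytes i))
             ;; Number of continuation bytes implied by the lead byte;
             ;; nil for bytes that can never start a sequence.
             (n (cond ((< b #x80) 0)
                      ((<= #xC2 b #xDF) 1)
                      ((<= #xE0 b #xEF) 2)
                      ((<= #xF0 b #xF4) 3)))
             ;; Allowed range of the first continuation byte; the
             ;; narrowed ranges exclude overlong forms, surrogates,
             ;; and values above U+10FFFF.
             (lo (cond ((= b #xE0) #xA0) ((= b #xF0) #x90) (t #x80)))
             (hi (cond ((= b #xED) #x9F) ((= b #xF4) #x8F) (t #xBF))))
        (if (or (null n) (> (+ i n) (1- len)))
            ;; Bad lead byte, or the sequence is cut off by end of input.
            (setq ok nil)
          (dotimes (k n)
            (let ((c (aref bytes (+ i 1 k))))
              (unless (if (= k 0) (<= lo c hi) (<= #x80 c #xBF))
                (setq ok nil))))
          (setq i (+ i n 1)))))
    ok))

(my-utf-8-valid-p "\303\200")   ; => t   (U+00C0)
(my-utf-8-valid-p "\300")       ; => nil (the byte from the bug report)
```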


---
Kenichi Handa
handa@m17n.org


       Inter-Client Communication Conventions Manual

		     Version 2.0.xf86.1
[...]
2.7.  Use of Selection Properties

The names of the properties used in selection data transfer
are chosen by the requestor.  The use of None property
fields in ConvertSelection requests (which request the
selection owner to choose a name) is not permitted by these
conventions.

The selection owner always chooses the type of the property
in the selection data transfer.  Some types have special
semantics assigned by convention, and these are reviewed in
the following sections.

In all cases, a request for conversion to a target should
return either a property of one of the types listed in the
previous table for that target or a property of type INCR
and then a property of one of the listed types.

Certain selection properties may contain resource IDs.  The
selection owner should ensure that the resource is not
destroyed and that its contents are not changed until after
the selection transfer is complete.  Requestors that rely on
the existence or on the proper contents of a resource must
operate on the resource (for example, by copying the contents
of a pixmap) before deleting the selection property.

The selection owner will return a list of zero or more items
of the type indicated by the property type.  In general, the
number of items in the list will correspond to the number of
disjoint parts of the selection.  Some targets (for example,
side-effect targets) will be of length zero irrespective of
the number of disjoint selection parts.  In the case of
fixed-size items, the requestor may determine the number of
items by the property size.  Selection property types are
listed in the table below.  For variable-length items such
as text, the separators are also listed.

-------------------------------------
Type Atom	Format	 Separator
-------------------------------------
APPLE_PICT	  8	 Self-sizing
ATOM		  32	 Fixed-size
ATOM_PAIR	  32	 Fixed-size
BITMAP		  32	 Fixed-size
C_STRING	  8	 Zero
COLORMAP	  32	 Fixed-size
COMPOUND_TEXT	  8	 Zero
DRAWABLE	  32	 Fixed-size
INCR		  32	 Fixed-size
INTEGER 	  32	 Fixed-size
PIXEL		  32	 Fixed-size
PIXMAP		  32	 Fixed-size
SPAN		  32	 Fixed-size
STRING		  8	 Zero
UTF8_STRING	  8	 Zero
WINDOW		  32	 Fixed-size
-------------------------------------


It is expected that this table will grow over time.

2.7.1.	TEXT Properties

In general, the encoding for the characters in a text string
property is specified by its type.  It is highly desirable
for there to be a simple, invertible mapping between string
property types and any character set names embedded within
font names in any font naming standard adopted by the
Consortium.

The atom TEXT is a polymorphic target.  Requesting conversion
into TEXT will convert into whatever encoding is convenient
for the owner.  The encoding chosen will be indicated
by the type of the property returned.  TEXT is not defined
as a type; it will never be the returned type from a
selection conversion request.

If the requestor wants the owner to return the contents of
the selection in a specific encoding, it should request
conversion into the name of that encoding.

In the table in section 2.6.2, the word TEXT (in the Type
column) is used to indicate one of the registered encoding
names.  The type would not actually be TEXT; it would be
STRING or some other ATOM naming the encoding chosen by the
owner.

STRING as a type or a target specifies the ISO Latin-1
character set plus the control characters TAB (hex 09) and
NEWLINE (hex 0A).  The spacing interpretation of TAB is context
dependent.  Other ASCII control characters are explicitly
not included in STRING at the present time.

COMPOUND_TEXT as a type or a target specifies the Compound
Text interchange format; see the Compound Text Encoding.

UTF8_STRING as a type or a target specifies an UTF-8 encoded
string, with NEWLINE (U+000A, hex 0A) as end-of-line marker.

There are some text objects where the source or intended
user, as the case may be, does not have a specific character
set for the text, but instead merely requires a
zero-terminated sequence of bytes with no other restriction; no
element of the selection mechanism may assume that any byte
value is forbidden or that any two differing sequences are
equivalent.  For these objects, the type C_STRING should be
used.

			 Rationale

     An example of the need for C_STRING is to transmit
     the names of files; many operating systems do not
     interpret filenames as having a character set. For
     example, the same character string uses a different
     sequence of bytes in ASCII and EBCDIC, and so
     most operating systems see these as different
     filenames and offer no way to treat them as the
     same. Thus no character-set based property type is
     suitable.


Type STRING, COMPOUND_TEXT, UTF8_STRING, and C_STRING
properties will consist of a list of elements separated by null
characters; other encodings will need to specify an
appropriate list format.


* Re: Emacs puts binary junk into the clipboard, marking it as text
  2006-09-20  2:20                 ` Kenichi Handa
@ 2006-10-19  7:19                   ` Jan Djärv
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Djärv @ 2006-10-19  7:19 UTC (permalink / raw)
  Cc: emacs-pretest-bug, ihs_4664, emacs-devel, richard.stallman,
	christopher.ian.moore



Kenichi Handa wrote:

> This will be useful for checking UTF-8 validity.
> 
> (define-ccl-program ccl-check-utf-8
>   '(0
>     ((r0 = 1)
>      (loop
>       (read-if (r1 < #x80) (repeat)
> 	((r0 = 0)
> 	 (if (r1 < #xC2) (end))
> 	 (read r2)
> 	 (if ((r2 & #xC0) != #x80) (end))
> 	 (if (r1 < #xE0) ((r0 = 1) (repeat)))
> 	 (read r2)
> 	 (if ((r2 & #xC0) != #x80) (end))
> 	 (if (r1 < #xF0) ((r0 = 1) (repeat)))
> 	 (read r2)
> 	 (if ((r2 & #xC0) != #x80) (end))
> 	 (if (r1 < #xF8) ((r0 = 1) (repeat)))
> 	 (read r2)
> 	 (if ((r2 & #xC0) != #x80) (end))
> 	 (if (r1 == #xF8) ((r0 = 1) (repeat)))
> 	 (end))))))
>   "Check if the input unibyte string is a valid UTF-8 sequence or not.
> If it is valid, set the register `r0' to 1, else set it to 0.")
> 
> (defun string-utf-8-p (string)
>   "Return non-nil iff STRING is a unibyte string of valid UTF-8 sequence."
>   (if (or (not (stringp string))
> 	  (multibyte-string-p string))
>       (error "Not a unibyte string: %s" string))
>   (let ((status (make-vector 9 0)))
>     (ccl-execute-on-string ccl-check-utf-8 status string)
>     (= (aref status 0) 1)))
> 
> 


Thanks.  I used them to check for UTF-8.  We now decline selection requests 
for UTF8_STRING if the data is not in UTF-8.
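
The shape of that behaviour can be sketched as follows.  This is not
the code that was checked in; it is a minimal illustration with a
hypothetical function name, assuming a current Emacs (23 or later) in
which bytes that fail to decode as UTF-8 show up as raw-byte characters
above #x10FFFF.  It relies on the documented convention that a
converter in `selection-converter-alist' returns nil when the
conversion cannot be done.

```elisp
;; Hypothetical converter, for illustration only.
(defun my-convert-to-utf8-string (_selection _type value)
  "Return VALUE encoded as UTF-8, or nil to decline the request.
A unibyte VALUE is assumed to hold UTF-8 bytes already."
  (when (stringp value)
    (let ((text (if (multibyte-string-p value)
                    value
                  (decode-coding-string value 'utf-8)))
          (clean t))
      ;; Any character above the Unicode range is a raw byte, which
      ;; cannot be represented in a valid UTF8_STRING.
      (mapc (lambda (ch) (when (> ch #x10FFFF) (setq clean nil))) text)
      (and clean (encode-coding-string text 'utf-8)))))

(my-convert-to-utf8-string 'CLIPBOARD 'UTF8_STRING "\300")   ; => nil
```

Returning nil leaves the requestor free to ask for another target,
rather than handing it bytes that are mislabelled as text.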

	Jan D.
