unofficial mirror of emacs-devel@gnu.org 
* auto-recognizing utf-16le ?
@ 2009-06-15 11:40 Miles Bader
  2009-06-15 21:45 ` Andreas Schwab
  0 siblings, 1 reply; 6+ messages in thread
From: Miles Bader @ 2009-06-15 11:40 UTC
  To: emacs-devel

Someone on #emacs noticed that emacs doesn't seem to auto-recognize
files encoded using utf-16le.  Visiting a file which uses such an
encoding results in the buffer having coding-system "no-conversion
(alias: binary)", and lots of ^@ (NUL) characters in the buffer.
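
To illustrate where the ^@ characters come from, here is a minimal
sketch in C, assuming plain ASCII text and no BOM.  The function names
are hypothetical and are not taken from Emacs.  Each ASCII character
becomes two bytes in UTF-16LE, and the second byte is always NUL, which
is what a binary (no-conversion) view of the file displays as ^@.

```c
#include <assert.h>
#include <stddef.h>

/* Encode an ASCII string as UTF-16LE without a BOM.
   Writes 2 * len bytes to OUT and returns that count.
   Hypothetical helper, for illustration only.  */
size_t
ascii_to_utf16le (const char *text, size_t len, unsigned char *out)
{
  assert (text != NULL && out != NULL);
  for (size_t i = 0; i < len; i++)
    {
      out[2 * i] = (unsigned char) text[i];  /* low byte comes first */
      out[2 * i + 1] = 0;                    /* high byte is NUL for ASCII */
    }
  return 2 * len;
}

/* Count the NUL bytes in BUF; undecoded, each one is shown as ^@.  */
size_t
count_nul_bytes (const unsigned char *buf, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    if (buf[i] == 0)
      n++;
  return n;
}
```

For ASCII-only input, half of the encoded bytes are NUL, so every
second character in the undecoded buffer is ^@.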

Forcing the encoding with "C-x C-m r utf-16le RET" results in the
correct thing happening.

[He was on Windows, where this coding system is common, so it's kind of
annoying for him.]

I noticed that the same happens on Debian.

I thought maybe he could just do:

   (prefer-coding-system 'utf-16le-dos)

but it seems to have no effect.

To reproduce:

   1. Save this message's attachment to a file "/tmp/oink"

   2. Start emacs with:  HOME=/tmp emacs -Q

   3. Visit the file you saved:  C-x C-f /tmp/oink RET

   4. ** Notice that the buffer contains ^@ (NUL) characters, and that
      the buffer coding-system is "no-conversion (binary)"

   5. Re-visit the file, forcing the coding-system:

         C-x C-m r utf-16le RET yes RET

   6. ** Notice that the file contents are now correct

   7. Kill the current buffer:  C-x k RET

   8. Evaluate:  M-: (prefer-coding-system 'utf-16le) RET

   9. Visit the file again:  C-x C-f /tmp/oink RET

  10. ** Notice that prefer-coding-system didn't seem to have any effect


Thanks,

-Miles


[-- Attachment #2: test file encoded using utf-16le --]
[-- Type: application/octet-stream, Size: 30 bytes --]

-- 
Justice, n. A commodity which in a more or less adulterated condition the
State sells to the citizen as a reward for his allegiance, taxes and personal
service.


* Re: auto-recognizing utf-16le ?
  2009-06-15 11:40 auto-recognizing utf-16le ? Miles Bader
@ 2009-06-15 21:45 ` Andreas Schwab
  2009-06-16  0:20   ` Miles Bader
  2009-06-16  2:04   ` Kenichi Handa
  0 siblings, 2 replies; 6+ messages in thread
From: Andreas Schwab @ 2009-06-15 21:45 UTC
  To: Miles Bader; +Cc: emacs-devel

Miles Bader <miles.bader@necel.com> writes:

> Someone on #emacs noticed that emacs doesn't seem to auto-recognize
> files encoded using utf-16le.

UTF-16 detection never tries to auto detect files without a signature.
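
For reference, the "signature" here is the byte order mark (U+FEFF) at
the start of the file.  A minimal sketch in C of such a check follows;
the enum and function names are hypothetical, and this is not the code
in coding.c.

```c
#include <assert.h>
#include <stddef.h>

enum utf16_sig { UTF16_NO_SIG, UTF16_SIG_LE, UTF16_SIG_BE };

/* Report which UTF-16 signature (BOM), if any, starts BUF.
   Hypothetical helper, for illustration only.  */
enum utf16_sig
detect_utf16_signature (const unsigned char *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  if (len >= 2)
    {
      if (buf[0] == 0xFF && buf[1] == 0xFE)
        return UTF16_SIG_LE;   /* U+FEFF stored low byte first */
      if (buf[0] == 0xFE && buf[1] == 0xFF)
        return UTF16_SIG_BE;   /* U+FEFF stored high byte first */
    }
  return UTF16_NO_SIG;
}
```

A file with no such prefix returns UTF16_NO_SIG, so a check of this
kind alone cannot recognize it.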

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





* Re: auto-recognizing utf-16le ?
  2009-06-15 21:45 ` Andreas Schwab
@ 2009-06-16  0:20   ` Miles Bader
  2009-06-16  2:04   ` Kenichi Handa
  1 sibling, 0 replies; 6+ messages in thread
From: Miles Bader @ 2009-06-16  0:20 UTC
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:
>> Someone on #emacs noticed that emacs doesn't seem to auto-recognize
>> files encoded using utf-16le.
>
> UTF-16 detection never tries to auto detect files without a signature.

So are UTF-16 files without a signature an anomaly?

-Miles

-- 
Philosophy, n. A route of many roads leading from nowhere to nothing.






* Re: auto-recognizing utf-16le ?
  2009-06-15 21:45 ` Andreas Schwab
  2009-06-16  0:20   ` Miles Bader
@ 2009-06-16  2:04   ` Kenichi Handa
  2009-06-16 15:01     ` Andreas Schwab
  1 sibling, 1 reply; 6+ messages in thread
From: Kenichi Handa @ 2009-06-16  2:04 UTC
  To: Andreas Schwab; +Cc: emacs-devel, miles

In article <m2r5xlw4qp.fsf@igel.home>, Andreas Schwab <schwab@linux-m68k.org> writes:

> Miles Bader <miles.bader@necel.com> writes:
> > Someone on #emacs noticed that emacs doesn't seem to auto-recognize
> > files encoded using utf-16le.

> UTF-16 detection never tries to auto detect files without a signature.

No.  detect_coding_utf_16 tries to check whether the file is
UTF-16 by checking the dispersion of the bytes at even
positions and of the bytes at odd positions.  But there were
two bugs in the code.  One was already fixed by this change:

2009-06-15  Andreas Schwab  <schwab@linux-m68k.org>

	* coding.c (detect_coding_utf_16): Fix typo counting odd bytes.

And, I've just installed a fix of another bug.

So, with the latest code, if you set
inhibit-null-byte-detection to t, and prefer utf-16be and/or
utf-16le, Emacs will detect UTF-16 files without a BOM in most
cases.
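
The even/odd byte check described above can be sketched roughly as
follows.  This is a sketch under stated assumptions, not the code in
coding.c: it assumes "dispersion" means the number of distinct byte
values seen at even and at odd positions, and the limit of 128, the
candidate flags and the function name are all hypothetical.  The idea
is that in UTF-16 text the high-order bytes take few distinct values,
so a position class that shows many distinct values is unlikely to hold
the high-order bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit: a position class counts as widely dispersed
   once it has shown this many distinct byte values.  */
#define DISPERSION_LIMIT 128

/* Hypothetical candidate flags, not the Emacs category masks.  */
enum { CAND_UTF16_BE = 1, CAND_UTF16_LE = 2 };

/* Return the set of BOM-less UTF-16 candidates that BUF does not rule out.
   Big endian keeps high-order bytes at even offsets, little endian at
   odd offsets.  A trailing unpaired byte is ignored.  */
unsigned
utf16_nosig_candidates (const unsigned char *buf, size_t len)
{
  unsigned char seen_even[256] = {0}, seen_odd[256] = {0};
  unsigned even_distinct = 0, odd_distinct = 0;
  unsigned candidates = CAND_UTF16_BE | CAND_UTF16_LE;

  assert (buf != NULL || len == 0);
  if (len < 2)
    return 0;

  for (size_t i = 0; i + 1 < len && candidates != 0; i += 2)
    {
      if (!seen_even[buf[i]])
        {
          seen_even[buf[i]] = 1;
          if (++even_distinct >= DISPERSION_LIMIT)
            candidates &= ~(unsigned) CAND_UTF16_BE;  /* even bytes too varied */
        }
      if (!seen_odd[buf[i + 1]])
        {
          seen_odd[buf[i + 1]] = 1;
          if (++odd_distinct >= DISPERSION_LIMIT)
            candidates &= ~(unsigned) CAND_UTF16_LE;  /* odd bytes too varied */
        }
    }
  return candidates;
}
```

Under these assumptions a short ASCII-only file never reaches the limit
in either position class, so both candidates survive and the outcome
depends on which coding system is preferred.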

---
Kenichi Handa
handa@m17n.org





* Re: auto-recognizing utf-16le ?
  2009-06-16  2:04   ` Kenichi Handa
@ 2009-06-16 15:01     ` Andreas Schwab
  2009-06-17  0:43       ` Kenichi Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Schwab @ 2009-06-16 15:01 UTC
  To: Kenichi Handa; +Cc: emacs-devel, miles

Kenichi Handa <handa@m17n.org> writes:

> And, I've just installed a fix of another bug.

I think instead of

      while (detect_info->rejected != CATEGORY_MASK_UTF_16)

you probably want to check this:

      while ((detect_info->rejected & CATEGORY_MASK_UTF_16) != CATEGORY_MASK_UTF_16)

since there may be bits for other categories already set in rejected.
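
A small self-contained sketch in C of the difference between the two
conditions.  The mask values below are hypothetical stand-ins (the real
definitions are in coding.c), and the two helper functions exist only
to name the two loop conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical category bits, for illustration only.  */
#define MASK_UTF_16_BE  (1u << 0)
#define MASK_UTF_16_LE  (1u << 1)
#define MASK_UTF_16     (MASK_UTF_16_BE | MASK_UTF_16_LE)
#define MASK_OTHER      (1u << 2)   /* some unrelated category */

static_assert ((MASK_OTHER & MASK_UTF_16) == 0,
               "the unrelated bit must not overlap the UTF-16 bits");

/* Exact comparison: true whenever REJECTED differs from the mask in
   any bit, including bits that belong to other categories.  */
bool
keep_scanning_exact (unsigned rejected)
{
  return rejected != MASK_UTF_16;
}

/* Masked comparison: looks only at the UTF-16 bits of REJECTED.  */
bool
keep_scanning_masked (unsigned rejected)
{
  return (rejected & MASK_UTF_16) != MASK_UTF_16;
}
```

With rejected set to MASK_UTF_16 | MASK_OTHER, the exact comparison
still says to keep scanning although every UTF-16 category has been
rejected, while the masked comparison stops.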

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





* Re: auto-recognizing utf-16le ?
  2009-06-16 15:01     ` Andreas Schwab
@ 2009-06-17  0:43       ` Kenichi Handa
  0 siblings, 0 replies; 6+ messages in thread
From: Kenichi Handa @ 2009-06-17  0:43 UTC
  To: Andreas Schwab; +Cc: emacs-devel, miles

In article <m24oug1av0.fsf@linux-m68k.org>, Andreas Schwab <schwab@linux-m68k.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
> > And, I've just installed a fix of another bug.

> I think instead of

>       while (detect_info->rejected != CATEGORY_MASK_UTF_16)

> you probably want to check this:

>       while ((detect_info->rejected & CATEGORY_MASK_UTF_16) != CATEGORY_MASK_UTF_16)

> since there may be bits for other categories already set in rejected.

Ah!  You are right, thank you.  I installed that fix.

---
Kenichi Handa
handa@m17n.org



