non-ascii chars in octal in sub-shell windows


From: Joseph Brenner @ 2010-01-15  3:22 UTC (permalink / raw)
  To: help-gnu-emacs


When running a program that outputs utf-8 characters such as u-umlaut,
in a terminal window I'll see the actual character, but in an emacs
sub-shell I'm seeing the octal form (which looks like: \374).

Currently, in my emacs init files, I have these two lines:

  (set-language-environment 'utf-8)
  (prefer-coding-system 'utf-8)

What else might I need to get characters in a *shell* buffer to display
correctly?

I've tried a few other things, like set-terminal-coding-system
and set-locale-environment, to no avail.



The system LANG envar is set to: "en_US.UTF-8".

describe-coding-system reports:

Coding system for saving this buffer:
  Not set locally, use the default.
Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8 (alias: mule-utf-8)

Coding system for inter-client cut and paste:
  nil
Coding systems for process I/O:
  encoding input to the process: U -- utf-8-unix (alias: mule-utf-8-unix)

  decoding output from the process: U -- utf-8-unix (alias: mule-utf-8-unix)

Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2
  6. emacs-mule
  7. raw-text
  8. iso-2022-jp (alias: junet)
  9. in-is13194-devanagari (alias: devanagari)
  10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  11. utf-8-auto
  12. utf-8-with-signature
  13. utf-16
  14. utf-16be-with-signature (alias: utf-16-be)
  15. utf-16le-with-signature (alias: utf-16-le)
  16. utf-16be
  17. utf-16le
  18. japanese-shift-jis (alias: shift_jis sjis)
  19. chinese-big5 (alias: big5 cn-big5 cp950)
  20. undecided

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION	TARGET PATTERN		CODING SYSTEM(s)
  ---------	--------------		----------------
  File I/O      "\\.dz\\'"              (no-conversion . no-conversion)
                "\\.xz\\(~\\|\\.~[0-9]+~\\)?\\'"
                                        (no-conversion . no-conversion)
                "\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"
                                        (no-conversion . no-conversion)
                "\\.\\(?:tgz\\|svgz\\|sifz\\)\\(~\\|\\.~[0-9]+~\\)?\\'"
                                        (no-conversion . no-conversion)
                "\\.tbz2?\\'"           (no-conversion . no-conversion)
                "\\.bz2\\(~\\|\\.~[0-9]+~\\)?\\'"
                                        (no-conversion . no-conversion)
                "\\.Z\\(~\\|\\.~[0-9]+~\\)?\\'"
                                        (no-conversion . no-conversion)
                "\\.elc\\'"             utf-8-emacs
                "\\.utf\\(-8\\)?\\'"    utf-8
                "\\.xml\\'"             xml-find-file-coding-system
                "\\(\\`\\|/\\)loaddefs.el\\'"
                                        (raw-text . raw-text-unix)
                "\\.tar\\'"             (no-conversion . no-conversion)
                "\\.po[tx]?\\'\\|\\.po\\."
                                        po-find-file-coding-system
                "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
                                        latexenc-find-file-coding-system
                ""                      (undecided)
  Process I/O	nothing specified
  Network I/O	nothing specified


* Re: non-ascii chars in octal in sub-shell windows
From: Eli Zaretskii @ 2010-01-15  8:08 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Joseph Brenner <doom@kzsu.stanford.edu>
> Date: Thu, 14 Jan 2010 19:22:20 -0800
> 
> 
> When running a program that outputs utf-8 characters such as u-umlaut,
> in a terminal window I'll see the actual character, but in an emacs
> sub-shell I'm seeing the octal form (which looks like: \374).

\374 (252 decimal, FC hex) cannot appear in any valid UTF-8 sequence,
AFAIK.  Are you absolutely sure that program produces UTF-8 encoded
text?
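
The byte arithmetic above can be checked directly. The following is a minimal Python sketch, not part of the original exchange; the helper name `decodes_as` is hypothetical. It converts the octal escape to a byte, tries to decode it as UTF-8 and as Latin-1, and brute-forces every Unicode scalar value to see whether byte FC ever shows up in its UTF-8 encoding.

```python
# The octal escape \374 shown by Emacs stands for one raw byte.
raw = bytes([0o374])
print(raw[0], hex(raw[0]))  # decimal and hex value of that byte


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is valid in `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_ok = decodes_as(raw, "utf-8")
latin1_text = raw.decode("latin-1")

# Brute-force check: encode every Unicode scalar value as UTF-8
# and look for the byte 0xFC anywhere in the result.
fc_in_any_utf8 = any(
    0xFC in chr(cp).encode("utf-8")
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF  # surrogates cannot be encoded as UTF-8
)
print(utf8_ok, ascii(latin1_text), fc_in_any_utf8)
```

If the lone byte fails to decode as UTF-8 but decodes as Latin-1 to U+00FC, the program producing it is most likely emitting Latin-1 rather than UTF-8.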





* Re: non-ascii chars in octal in sub-shell windows
From: Joseph Brenner @ 2010-01-15  8:41 UTC (permalink / raw)
  To: help-gnu-emacs


Eli Zaretskii <eliz@gnu.org> writes:
>> Joseph Brenner <doom@kzsu.stanford.edu> wrote:

>> When running a program that outputs utf-8 characters such as u-umlaut,
>> in a terminal window I'll see the actual character, but in an emacs
>> sub-shell I'm seeing the octal form (which looks like: \374).
>
> \374 (252 decimal, FC hex) cannot appear in any valid UTF-8 sequence,
> AFAIK.  Are you absolutely sure that program produces UTF-8 encoded
> text?

Oops.  Thanks, you've called it right: the problem was not on
the emacs side in this case, but on the script side.  (Funny
that my terminal windows aren't respecting my locale's encoding,
but that's something I can live with.)




* Re: non-ascii chars in octal in sub-shell windows
From: Peter Dyballa @ 2010-01-15  8:50 UTC (permalink / raw)
  To: Joseph Brenner; +Cc: help-gnu-emacs


On 15.01.2010 at 04:22, Joseph Brenner wrote:

> When running a program that outputs utf-8 characters such as u-umlaut,
> in a terminal window I'll see the actual character, but in an emacs
> sub-shell I'm seeing the octal form (which looks like: \374).

No, you're not running such a programme! The LATIN SMALL LETTER U WITH
DIAERESIS, ü, is encoded in UTF-8 as C3 BC. In UTF-16 it is 00FC –
exactly two bytes! Obviously your programme just outputs some ISO
Latin dialect or such (in ISO 8859-1, ü is the single byte FC)...
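
As a rough illustration of the byte sequences mentioned here, this Python sketch (added for this write-up, not from the original thread) encodes ü in three encodings and prints each result in hex and in the backslash-octal form that Emacs uses for undecodable bytes. The choice of `utf-16-be` (no byte-order mark) and `latin-1` as the "ISO Latin" candidate are assumptions made for the example.

```python
u_umlaut = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# Encode the same character three ways; utf-16-be avoids the BOM
# that the plain "utf-16" codec would prepend.
encodings = {
    name: u_umlaut.encode(name)
    for name in ("utf-8", "utf-16-be", "latin-1")
}

for name, data in encodings.items():
    # Render each byte the way Emacs shows raw bytes: backslash + 3 octal digits.
    octal = "".join(f"\\{b:03o}" for b in data)
    print(f"{name:10} hex={data.hex().upper()} octal={octal}")
```

Only the single-byte Latin-1 form comes out as a lone \374, which matches what was seen in the *shell* buffer.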

--
With peaceful greetings

   Pete

Blowing up banks means letting the sun in.






* Re: non-ascii chars in octal in sub-shell windows
From: Joseph Brenner @ 2010-01-15 22:08 UTC (permalink / raw)
  To: help-gnu-emacs

Peter Dyballa <Peter_Dyballa@Web.DE> writes:
> Joseph Brenner:

>> When running a program that outputs utf-8 characters such as u-umlaut,
>> in a terminal window I'll see the actual character, but in an emacs
>> sub-shell I'm seeing the octal form (which looks like: \374).
>
> No, you're not running such a programme! The LATIN SMALL LETTER U
> WITH DIAERESIS, ü, is encoded in UTF-8 as C3BC. In UTF-16 it is
> 00FC –  exactly two bytes! Obviously your programme just outputs
> some ISO  Latin dialect or such...

Correct.

If anyone's interested in the details of the screw-up, here's some
off-topic chattering about perl programming:

A typical perl test script is based on the Test::More module,
which provides features to do checks such as:

  is_deeply( $some_structure, $expected_structure,
             "Testing whether structure is as expected.");

This routine outputs different messages to STDOUT and/or STDERR
depending on whether the check passes or fails.

I was seeing octal junk in those output messages, even after adding
some commands to the *.t script like so:

  binmode STDOUT, ':encoding(utf8)';
  binmode STDERR, ':encoding(utf8)';

Normally, that would be all it would take to convince perl it needs to
output UTF-8.  In the case of Test::More routines, though, this approach
fails, because Test::More creates new output handles of its own.

Unbeknownst to me, the documentation for Test::More has been
recommending doing something more like this:

  my $builder = Test::More->builder;
  binmode $builder->output,         ":encoding(utf8)";
  binmode $builder->failure_output, ":encoding(utf8)";

Note that merely adding these pragmas has no effect on the output
encoding ("use utf8" only declares that the script's own source is UTF-8):
  use utf8;
  use locale;
