unofficial mirror of help-gnu-emacs@gnu.org
* character encoding confusion
@ 2010-07-07 16:27 patrol
  2010-07-07 22:37 ` Pascal J. Bourguignon
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: patrol @ 2010-07-07 16:27 UTC (permalink / raw)
  To: help-gnu-emacs

I created a program in C that requires the degree symbol. The mode
line indicates that Emacs is using the Latin-1 character encoding.
According to Latin-1 encoding tables, the degree symbol is encoded as
decimal 176, so that's what I used in my code. But when the character
printed, it wasn't the degree symbol; it was a "shaded box" looking
thing. Then I looked at an ASCII table here
(http://www.asciitable.com/), and it says that 176 is indeed the shaded box
that was printed in my program, and the degree character was decimal
248. So I used 248 in my code, and I got the degree symbol I wanted.

But all this leaves me with a question: if Emacs was supposedly
encoding the file in Latin-1, why doesn't the code for the degree
symbol match up with the Latin-1 table? Why does it instead match up
with some non-standard "extended" ASCII that I just happened to come
across?

Can anyone shed light on this?
Thanks


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: character encoding confusion
  2010-07-07 16:27 character encoding confusion patrol
@ 2010-07-07 22:37 ` Pascal J. Bourguignon
  2010-07-08  0:19   ` patrol
  2010-07-08  1:38 ` John Bokma
  2010-07-08  7:32 ` Tim X
  2 siblings, 1 reply; 9+ messages in thread
From: Pascal J. Bourguignon @ 2010-07-07 22:37 UTC (permalink / raw)
  To: help-gnu-emacs

patrol <patrol_boat@hotmail.com> writes:

> I created a program in C that requires the degree symbol. The mode
> line indicates that Emacs is using the Latin-1 character encoding.
> According to Latin-1 encoding tables, the degree symbol is encoded as
> decimal 176, so that's what I used in my code. But when the character
> printed, it wasn't the degree symbol; it was a "shaded box" looking
> thing. Then I looked at an ASCII table here (http://
> www.asciitable.com/), and it says that 176 is indeed the shaded box

176 is not an ASCII code.  ASCII contains only codes from 0 to 127.


> that was printed in my program, and the degree character was decimal
> 248. So I used 248 in my code, and I got the degree symbol I wanted.
>
> But all this leaves me with the question that if Emacs was supposedly
> encoding the file in Latin-1, why doesn't the code for the degree
> symbol match up with the Latin-1 table? Why does it instead match up
> with some non-standard "extended" ASCII that I just happened to come
> across.
>
> Can anyone shed light on this?

Remember that C only deals with integers.  There is no character type in C.  

So, what happens when you call: printf("%c",176); ?

Have a look at setlocale, LC_ALL, etc, and libiconv.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/



* Re: character encoding confusion
  2010-07-07 22:37 ` Pascal J. Bourguignon
@ 2010-07-08  0:19   ` patrol
  2010-07-08  1:15     ` Barry Margolin
  2010-07-08 15:40     ` Pascal J. Bourguignon
  0 siblings, 2 replies; 9+ messages in thread
From: patrol @ 2010-07-08  0:19 UTC (permalink / raw)
  To: help-gnu-emacs

On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
wrote:
>
> Remember that C only deals with integers.  There is no character type in C.

I thought there was a char data type. Well, I'm not exactly sure of
the relevance of that...

> So, what happens when you call: printf("%c",176); ?

Well as I said in my post, I get a shaded square. But printf("%c",
248) yields the degree sign. But like I said, under Latin-1 and UTF-8,
176 is the degree sign, not 248.

> Have a look at setlocale, LC_ALL, etc, and libiconv.

I don't have any experience with this, but I did printf("%d", LC_ALL),
which returned 0. Don't know what that means, but I'm not sure why
locale settings should matter. Aren't Latin-1 and UTF-8 universal
encodings? If a file is encoded in Latin-1, wouldn't the degree sign
map to 176 regardless of locale?



* Re: character encoding confusion
  2010-07-08  0:19   ` patrol
@ 2010-07-08  1:15     ` Barry Margolin
  2010-07-08 15:40     ` Pascal J. Bourguignon
  1 sibling, 0 replies; 9+ messages in thread
From: Barry Margolin @ 2010-07-08  1:15 UTC (permalink / raw)
  To: help-gnu-emacs

In article 
<d8471b2d-27e0-4a7e-bc94-8372d2df3f01@r27g2000yqb.googlegroups.com>,
 patrol <patrol_boat@hotmail.com> wrote:

> On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
> wrote:
> >
> > Remember that C only deals with integers.  There is no character type in C.
> 
> I thought there was a char data type. Well, I'm not exactly sure of
> the relevance of that...

char is C's name for small integers.

-- 
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***



* Re: character encoding confusion
  2010-07-07 16:27 character encoding confusion patrol
  2010-07-07 22:37 ` Pascal J. Bourguignon
@ 2010-07-08  1:38 ` John Bokma
  2010-07-08 13:24   ` patrol
  2010-07-08  7:32 ` Tim X
  2 siblings, 1 reply; 9+ messages in thread
From: John Bokma @ 2010-07-08  1:38 UTC (permalink / raw)
  To: help-gnu-emacs

patrol <patrol_boat@hotmail.com> writes:

> I created a program in C that requires the degree symbol. The mode
> line indicates that Emacs is using the Latin-1 character encoding.
> According to Latin-1 encoding tables, the degree symbol is encoded as
> decimal 176, so that's what I used in my code. But when the character
> printed, it wasn't the degree symbol; it was a "shaded box" looking
> thing. Then I looked at an ASCII table here (http://
> www.asciitable.com/), and it says that 176 is indeed the shaded box
> that was printed in my program, and the degree character was decimal
> 248. So I used 248 in my code, and I got the degree symbol I wanted.
>
> But all this leaves me with the question that if Emacs was supposedly
> encoding the file in Latin-1, why doesn't the code for the degree
> symbol match up with the Latin-1 table? Why does it instead match up
> with some non-standard "extended" ASCII that I just happened to come
> across.
>
> Can anyone shed light on this?

As Pascal wrote, ASCII is 7-bit (0..127 decimal), and there are various
extensions ("code pages"). It sounds to me like you're running your
program in a DOS box. If so, the current code page of the DOS box,
most likely 437 [1], shows a shaded box for 176.

[1] http://en.wikipedia.org/wiki/Codepage_437

-- 
John Bokma                                                               j3b

Hacking & Hiking in Mexico -  http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development



* Re: character encoding confusion
  2010-07-07 16:27 character encoding confusion patrol
  2010-07-07 22:37 ` Pascal J. Bourguignon
  2010-07-08  1:38 ` John Bokma
@ 2010-07-08  7:32 ` Tim X
  2010-07-08 13:30   ` patrol
  2 siblings, 1 reply; 9+ messages in thread
From: Tim X @ 2010-07-08  7:32 UTC (permalink / raw)
  To: help-gnu-emacs

patrol <patrol_boat@hotmail.com> writes:

> I created a program in C that requires the degree symbol. The mode
> line indicates that Emacs is using the Latin-1 character encoding.
> According to Latin-1 encoding tables, the degree symbol is encoded as
> decimal 176, so that's what I used in my code. But when the character
> printed, it wasn't the degree symbol; it was a "shaded box" looking
> thing. Then I looked at an ASCII table here (http://
> www.asciitable.com/), and it says that 176 is indeed the shaded box
> that was printed in my program, and the degree character was decimal
> 248. So I used 248 in my code, and I got the degree symbol I wanted.
>
> But all this leaves me with the question that if Emacs was supposedly
> encoding the file in Latin-1, why doesn't the code for the degree
> symbol match up with the Latin-1 table? Why does it instead match up
> with some non-standard "extended" ASCII that I just happened to come
> across.
>

There has been a lot of change in Emacs encoding support, and there
are a number of possibilities.

1. What version of emacs?

2. Can you clarify what you mean by "created a program in C that
requires the degree symbol"? Do you mean it needs that character in the
source code, as standard input, or as a value in an input file? Are you
running the program inside Emacs, or are you generating data files for
the program to consume, etc.?

3. Depending on how this is all interacting, your OS locale settings,
and whether you're running in GUI mode under X, within an xterm, or on
the console, can all be relevant.

Tim


-- 
tcross (at) rapttech dot com dot au



* Re: character encoding confusion
  2010-07-08  1:38 ` John Bokma
@ 2010-07-08 13:24   ` patrol
  0 siblings, 0 replies; 9+ messages in thread
From: patrol @ 2010-07-08 13:24 UTC (permalink / raw)
  To: help-gnu-emacs

On Jul 7, 9:38 pm, John Bokma <j...@castleamber.com> wrote:

> Like Pascal wrote ASCII is 7 bit (0..127 decimal) and there are various
> extensions ("code pages"). It sounds to me you're running your program
> in a DOS box? If so the current code page, most likely 437 [1], of the DOS
> box shows a shaded box for 176.
>
> [1]http://en.wikipedia.org/wiki/Codepage_437

Thanks. That was helpful. I'm running Windows 7, and I found out that
it's using code page 850, which, like 437, maps degree to 248 and the
light shaded box to 176. I think I'm beginning to understand what's
going on. Emacs is indeed encoding the file in Latin-1, but when it
comes to the statement printf("%c", 248), it doesn't matter how the
file itself is encoded; what matters is how the computer interprets
this statement. And how it interprets this statement depends on the
codepage that the operating system is using. Thanks!

> John Bokma                                                               j3b
>
> Hacking & Hiking in Mexico -  http://johnbokma.com/
> http://castleamber.com/ - Perl & Python Development

Sounds like fun!




* Re: character encoding confusion
  2010-07-08  7:32 ` Tim X
@ 2010-07-08 13:30   ` patrol
  0 siblings, 0 replies; 9+ messages in thread
From: patrol @ 2010-07-08 13:30 UTC (permalink / raw)
  To: help-gnu-emacs

On Jul 8, 3:32 am, Tim X <t...@nospam.dev.null> wrote:

> There has been a lot of change in Emacs encoding support, and there
> are a number of possibilities.
>
> 1. What version of emacs?

The latest stable release, 23.2

> 2. Can you clarify what you mean by you created a program in C that
> requires the degree symbol. Do you mean it needs that character in the
> source code, as standard input or as a value in an input file? Are you
> running the program inside emacs or are you generating datafiles for the
> program to consume etc.

It's just a temperature table that I'm printing, so I need to print
"degree symbol"F. My code for that was printf("%cF", 176), which gave
me "shaded box"F.

> 3. Depending on how this is all interacting, your OS locale settings,
> whether your running in GUI mode under X or within an xterm or the
> console can all be relevant.

Yes, I think I understand the problem better now (see response to
John) and I agree that the OS locale settings are relevant. Thank you.




* Re: character encoding confusion
  2010-07-08  0:19   ` patrol
  2010-07-08  1:15     ` Barry Margolin
@ 2010-07-08 15:40     ` Pascal J. Bourguignon
  1 sibling, 0 replies; 9+ messages in thread
From: Pascal J. Bourguignon @ 2010-07-08 15:40 UTC (permalink / raw)
  To: help-gnu-emacs

patrol <patrol_boat@hotmail.com> writes:

> On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
> wrote:
>>
>> Remember that C only deals with integers.  There is no character type in C.
>
> I thought there was a char data type. Well, I'm not exactly sure of
> the relevance of that...

Yes, but char is defined as being some integer between CHAR_MIN and
CHAR_MAX (see <limits.h>).

In C, there is no character type.

There are several ways to write literal integers, such as 0101, 65,
'A', or 0x41, and there are several ways to write vector literals,
such as {65,66,67,0} or "ABC", but there is no character type.  And
therefore no string type.

Until you write:

    typedef struct {
       unsigned code;
    }   character;

    typedef struct {
       int allocated;
       int length;
       character* contents;
    }   string;

and the associated functions.



>> So, what happens when you call: printf("%c",176); ?
>
> Well as I said in my post, I get a shaded square. 

When you call  printf("%c",176);, one byte of value 176 is sent to the
output stream (file or terminal).  That's all, as far as C is concerned.



> But printf("%c", 248) yields the degree sign. But like I said, under
> Latin-1 and UTF-8, 176 is the degree sign, not 248.

So what?

(Alternatively, you may try to find in what coding system 248 is a
degree sign:

CL-USER> (do-external-symbols (cs :charset) 
           (ignore-errors 
             (let ((str (ext:convert-string-from-bytes #(248) cs)))
               (when (string-equal  (char-name (character str)) "DEGREE_SIGN")
                 (print (list cs str))))))

(CHARSET:CP866 "°") 
(CHARSET:CP860 "°") 
(CHARSET:CP861 "°") 
(CHARSET:CP862 "°") 
(CHARSET:CP863 "°") 
(CHARSET:CP863-IBM "°") 
(CHARSET:CP869 "°") 
(CHARSET:CP861-IBM "°") 
(CHARSET:CP862-IBM "°") 
(CHARSET:CP437 "°") 
(CHARSET:CP852-IBM "°") 
(CHARSET:CP857 "°") 
(CHARSET:CP850 "°") 
(CHARSET:CP869-IBM "°") 
(CHARSET:CP852 "°") 
(CHARSET:CP775 "°") 
(CHARSET:CP860-IBM "°") 
(CHARSET:CP865-IBM "°") 
(CHARSET:CP737 "°") 
(CHARSET:CP437-IBM "°") 
(CHARSET:CP865 "°") 
NIL

But this only tells us that your terminal is configured to convert the
bytes it receives using some Microsoft-specific coding system.
)



>> Have a look at setlocale, LC_ALL, etc, and libiconv.
>
> I don't have any experience with this, but I did printf("%d", LC_ALL),
> which returned 0. Don't know what that means, but I'm not sure why
> locale settings should matter. Aren't Latin-1 and UTF-8 universal
> encodings? If a file is encoded in Latin-1, wouldn't the degree sign
> map to 176 regardless of locale?

LC_ALL is an environment variable in a POSIX system that informs
libraries and programs what language and character encoding the
current user and terminal expect.  There is a set of associated
variables.

setlocale(3) is a library function that lets you indicate the language
and character encoding that should be used for the current user and
terminal.

Type:
    man 3 setlocale
and read also all the manual pages listed in the SEE ALSO section.

Type
    man 3 iconv


For example, in my ~/.bash_env, I have:

    LC_CTYPE=en_US.UTF-8
    export LC_CTYPE

This defines an environment variable named LC_CTYPE whose value is
en_US.UTF-8.  This value indicates that I want messages in US
English, encoded in UTF-8.  So programs may call getenv(3) to get
the value of these environment variables and pass it to setlocale to
inform the libraries what encoding to use, and can themselves use
iconv(3) to convert their own strings from their original encoding to
the encoding required by the terminal.



So, it seems you're writing your program on a Microsoft system; I was
oblivious of that fact, and I don't know anything about programming
Microsoft systems.  When I have to use a Microsoft system
(temporarily), I download http://www.cygwin.com/setup.exe and use
Cygwin, which gives me a POSIX environment; and if I have to develop a
program, I install Linux instead.  Perhaps the POSIX API I mention
here doesn't apply to Microsoft programs.

In any case, coding systems may vary depending on the output device.
If your program writes to a terminal or console, you have to deal with
the coding systems configured in the terminal.  If it displays text
thru a GUI, you have to deal with the coding system expected by the
GUI toolkit you're using.  If it writes a file, you have to deal with
the encoding that must be generated in that file.

C compilers just take bytes and store bytes (the source character
encoding isn't specified by the language, and compilers are usually
just C programs themselves), so if you encode your sources in
ISO-8859-1, you will have ISO-8859-1 literal bytes in your program.
If you need to output to a device that expects another encoding, then
your program will have to find out what encoding is expected, and it
will have to convert the strings.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/



