From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: pjb@informatimago.com (Pascal J. Bourguignon)
Newsgroups: gmane.emacs.help
Subject: Re: character encoding confusion
Date: Thu, 08 Jul 2010 17:40:54 +0200
Organization: Informatimago
Message-ID: <878w5mf0c9.fsf@kuiper.lan.informatimago.com>
References: <e2e5fcc1-80c7-4887-a150-9eae6e78398c@d16g2000yqb.googlegroups.com>
	<87pqyygbpm.fsf@kuiper.lan.informatimago.com>
	<d8471b2d-27e0-4a7e-bc94-8372d2df3f01@r27g2000yqb.googlegroups.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: dough.gmane.org 1291850388 6652 80.91.229.12 (8 Dec 2010 23:19:48 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Wed, 8 Dec 2010 23:19:48 +0000 (UTC)
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Thu Dec 09 00:19:40 2010
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1PQTIL-00082O-DJ
	for geh-help-gnu-emacs@m.gmane.org; Thu, 09 Dec 2010 00:19:37 +0100
Original-Received: from localhost ([127.0.0.1]:50596 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PQTID-0007at-C4
	for geh-help-gnu-emacs@m.gmane.org; Wed, 08 Dec 2010 18:19:25 -0500
Original-Path: usenet.stanford.edu!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
Original-Newsgroups: gnu.emacs.help
Original-Lines: 153
Original-X-Trace: individual.net yLpIb507Ev0qyPXzjoXEaQNyH8tTZpySEjRM+uySEXur3ahfES
Cancel-Lock: sha1:ZjJjNmJjNGI4MjcxOTNiZDMwNTU3ZWE0N2ZlN2Q3MTMwNGFjMTcxMA==
	sha1:+ybRsdgvHFXIXum6vfe0PZNgxu0=
Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwAQMAAABtzGvEAAAABlBMVEUAAAD///+l2Z/dAAAA
	oElEQVR4nK3OsRHCMAwF0O8YQufUNIQRGIAja9CxSA55AxZgFO4coMgYrEDDQZWPIlNAjwq9
	033pbOBPtbXuB6PKNBn5gZkhGa86Z4x2wE67O+06WxGD/HCOGR0deY3f9Ijwwt7rNGNf6Oac
	l/GuZTF1wFGKiYYHKSFAkjIo1b6sCYS1sVmFhhhahKQssRjRT90ITWUk6vvK3RsPGs+M1RuR
	mV+hO/VvFAAAAABJRU5ErkJggg==
X-Accept-Language: fr, es, en
X-Disabled: X-No-Archive: no
User-Agent: Gnus/5.101 (Gnus v5.10.10) Emacs/23.2 (gnu/linux)
Original-Xref: usenet.stanford.edu gnu.emacs.help:179626
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.help:76205
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/76205>

patrol <patrol_boat@hotmail.com> writes:

> On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
> wrote:
>>
>> Remember that C only deals with integer.  There is no character type in C.
>
> I thought there was a char data type. Well, not exactly sure the
> relevence of that...

Yes, but char is defined as being some integer between CHAR_MIN and
CHAR_MAX (see <limits.h>).

In C, there is no character type.

There are several ways to write literal integers,  such as 0101, 65,
'A', or 0x41, and there are several ways to write vector literals,
such as: {65,66,67,0} or "ABC", but there is no character.  And
therefore no string.

Until you write:

    typedef struct {
       unsigned code;
    }   character;

    typedef struct {
       int allocated;
       int length;
       character* contents;
    }   string;

and the associated functions.
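
A small check of the claim about literals, as a sketch only.  It
assumes an ASCII-based execution character set, and the function names
are made up for the illustration:

```c
#include <limits.h>
#include <string.h>

/* Returns 1 when the four spellings denote the same int value
   (true on ASCII-based systems, which is the assumption here). */
int literals_agree(void)
{
    return 0101 == 65 && 65 == 'A' && 'A' == 0x41;
}

/* Returns 1 when "ABC" and {65,66,67,0} hold exactly the same bytes. */
int vectors_agree(void)
{
    const char a[] = {65, 66, 67, 0};
    const char b[] = "ABC";
    return sizeof a == sizeof b && memcmp(a, b, sizeof a) == 0;
}

/* Returns 1 when v lies in the integer range that char can hold. */
int fits_in_char(int v)
{
    return CHAR_MIN <= v && v <= CHAR_MAX;
}
```

Nothing here knows about characters: the compiler only ever compares
and stores small integers.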


>> So, what happens when you call: printf("%c",176); ?
>
> Well as I said in my post, I get a shaded square. 

When you call  printf("%c",176);, one byte of value 176 is sent to the
output stream (file or terminal).  That's all, as far as C is concerned.
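
To see this without involving a terminal at all, here is a sketch that
formats into a memory buffer with snprintf instead of printing, and
returns the byte that "%c" produced.  The function name is hypothetical:

```c
#include <stdio.h>

/* Formats `value` with "%c" into a buffer and returns the single byte
   produced as an unsigned value, or -1 if the output was not exactly
   one byte. */
int byte_sent_by_percent_c(int value)
{
    char buf[2];
    int n = snprintf(buf, sizeof buf, "%c", value);
    return n == 1 ? (int)(unsigned char)buf[0] : -1;
}
```

The byte is the same whatever the locale or terminal; what glyph it
becomes is decided later, by whatever reads the stream.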


> But printf("%c", 248) yields the degree sign. But like I said, under
> Latin-1 and UTF-8, 176 is the degree sign, not 248.

So what?

(Alternatively, you may try to find in what coding system 248 is a
degree sign:

CL-USER> (do-external-symbols (cs :charset) 
           (ignore-errors 
             (let ((str (ext:convert-string-from-bytes #(248) cs)))
               (when (string-equal  (char-name (character str)) "DEGREE_SIGN")
                 (print (list cs str))))))

(CHARSET:CP866 "°") 
(CHARSET:CP860 "°") 
(CHARSET:CP861 "°") 
(CHARSET:CP862 "°") 
(CHARSET:CP863 "°") 
(CHARSET:CP863-IBM "°") 
(CHARSET:CP869 "°") 
(CHARSET:CP861-IBM "°") 
(CHARSET:CP862-IBM "°") 
(CHARSET:CP437 "°") 
(CHARSET:CP852-IBM "°") 
(CHARSET:CP857 "°") 
(CHARSET:CP850 "°") 
(CHARSET:CP869-IBM "°") 
(CHARSET:CP852 "°") 
(CHARSET:CP775 "°") 
(CHARSET:CP860-IBM "°") 
(CHARSET:CP865-IBM "°") 
(CHARSET:CP737 "°") 
(CHARSET:CP437-IBM "°") 
(CHARSET:CP865 "°") 
NIL

But this only tells us that your terminal is configured to convert the
bytes it receives using some Microsoft-specific coding system.
)
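
Both of your observations (176 shown as a shaded square, 248 as a
degree sign) fit a console using something like CP437.  A sketch with
only these two bytes hard-coded for CP437; the function name and the
tiny table are mine, for illustration, and a real program would use a
conversion library rather than such a table:

```c
#include <string.h>

/* Returns the Unicode code point that `byte` stands for under the
   named encoding, or -1 when this tiny illustrative table does not
   cover the case. */
long code_point(const char *encoding, unsigned char byte)
{
    if (strcmp(encoding, "ISO-8859-1") == 0)
        return byte;            /* Latin-1: byte N is code point N */
    if (strcmp(encoding, "CP437") == 0) {
        if (byte == 176)
            return 0x2591;      /* LIGHT SHADE, the "shaded square" */
        if (byte == 248)
            return 0x00B0;      /* DEGREE SIGN */
    }
    return -1;
}
```

So the same byte 176 is DEGREE SIGN (U+00B0) when read as ISO-8859-1
and LIGHT SHADE (U+2591) when read as CP437.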


>> Have a look at setlocale, LC_ALL, etc, and libiconv.
>
> I don't have any experience with this, but I did printf("%d", LC_ALL),
> which returned 0. Don't know what that means, but I'm not sure why
> locale settings should matter. Aren't Latin-1 and UTF-8 universal
> encodings? If a file is encoded in Latin-1, wouldn't the degree sign
> map to 176 regardless of locale?

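LC_ALL is two things.  In C, it is a macro from <locale.h> naming a
locale category to pass to setlocale(3); its numeric value is
implementation-defined, so the 0 you printed means nothing by itself.
In a POSIX system, it is also an environment variable that informs
libraries and programs what language and character encoding the
current user and terminal expect.  There is a set of associated
variables (LANG, LC_CTYPE, LC_MESSAGES, etc.).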

setlocale(3) is a library function that lets you indicate which
language and character encoding should be used for the current user
and terminal.

Type:
    man 3 setlocale
and read also all the manual pages listed in the SEE ALSO section.

Type
    man 3 iconv


For example, in my ~/.bash_env, I have:

    LC_CTYPE=en_US.UTF-8
    export LC_CTYPE

This defines an environment variable named LC_CTYPE whose value is
en_US.UTF-8.  This value indicates that I want characters classified
according to USA English conventions, and encoded in UTF-8 (the
language of the messages is selected by LC_MESSAGES or LANG).  So a
program may call getenv(3) to get the value of these environment
variables and pass it to setlocale (or just call setlocale(LC_ALL, ""),
which reads them itself) to inform the libraries what encoding to use,
and it can itself use iconv(3) to convert its own strings from their
original encoding to the encoding required by the terminal.
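
A minimal sketch of the first half of that, assuming a POSIX system
(nl_langinfo and CODESET come from <langinfo.h>, which a Microsoft
compiler may not provide).  The function name is hypothetical:

```c
#include <langinfo.h>
#include <locale.h>

/* Adopts the locale described by the environment (LANG, LC_*) and
   returns the name of the character encoding it implies.  If the
   environment names a locale that is not installed, setlocale fails
   and the default "C" locale stays in effect. */
const char *expected_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

With the LC_CTYPE setting above, and provided that locale is installed,
I would expect this to return "UTF-8"; the string it returns is what
you would then hand to iconv_open(3) as the target encoding.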


So, it seems you're writing your program on a Microsoft system.  I was
oblivious of that fact, and I don't know anything about programming
Microsoft systems.   When I have to use a Microsoft system
(temporarily), I download http://www.cygwin.com/setup.exe, and use
cygwin, which gives me a POSIX environment, and if I have to develop a
program, I install Linux instead.  Perhaps the POSIX API I mention
here doesn't apply to Microsoft programs.  

In any case, coding systems may vary depending on the output device.
If your program writes to a terminal or console, you have to deal with
the coding systems configured in the terminal.  If it displays text
through a GUI, you have to deal with the coding system expected by the
GUI toolkit you're using.  If it writes a file, you have to deal with
the encoding that must be generated in that file.

The C compilers just take bytes and store bytes (even if the language
didn't specify it, compilers are usually just C programs themselves, so
bytes are all they see), so if you encode your sources in ISO-8859-1,
you will have
ISO-8859-1 literal bytes in your program.  If you need to output to a
device that expects another encoding, then your program will have to
find what encoding is expected, and it will have to convert the
strings.
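
As an illustration of the conversion step, here is a hand-written
sketch for one fixed pair of encodings, ISO-8859-1 to UTF-8.  It is a
stand-in for what iconv(3) does in general, not a replacement for it,
and the function name is made up:

```c
#include <stddef.h>

/* Converts n bytes of ISO-8859-1 text to UTF-8.
   `out` must have room for 2*n bytes.  Returns the number of bytes
   written.  No terminating 0 is added. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];                   /* ASCII: copied as is */
        } else {
            out[j++] = 0xC0 | (in[i] >> 6);     /* lead byte: top 2 bits */
            out[j++] = 0x80 | (in[i] & 0x3F);   /* trail byte: low 6 bits */
        }
    }
    return j;
}
```

The degree sign, byte 176 (0xB0) in ISO-8859-1, comes out as the two
bytes 0xC2 0xB0, which is what a UTF-8 terminal expects.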


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/