From: pjb@informatimago.com (Pascal J. Bourguignon)
Newsgroups: gmane.emacs.help
Subject: Re: character encoding confusion
Date: Thu, 08 Jul 2010 17:40:54 +0200
Organization: Informatimago
Message-ID: <878w5mf0c9.fsf@kuiper.lan.informatimago.com>
References: <87pqyygbpm.fsf@kuiper.lan.informatimago.com>
To: help-gnu-emacs@gnu.org

patrol writes:

> On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
> wrote:
>>
>> Remember that C only deals with integer. There is no character type in C.
>
> I thought there was a char data type. Well, not exactly sure the
> relevance of that...

Yes, but char is defined as an integer type whose values lie between
CHAR_MIN and CHAR_MAX (see <limits.h>). In C, there is no character
type. There are several ways to write literal integers, such as 0101,
65, 'A', or 0x41 (all the same value when the character set is ASCII),
and there are several ways to write array literals, such as
{65,66,67,0} or "ABC", but there is no character. And therefore no
string. Until you write:

    typedef struct { unsigned code; } character;
    typedef struct { int allocated; int length; character* contents; } string;

and the associated functions.

>> So, what happens when you call: printf("%c",176); ?
>
> Well as I said in my post, I get a shaded square.

When you call printf("%c",176);, one byte of value 176 is sent to the
output stream (file or terminal). That's all, as far as C is concerned.

> But printf("%c", 248) yields the degree sign. But like I said, under
> Latin-1 and UTF-8, 176 is the degree sign, not 248.

So what? Which glyph gets drawn for a given byte depends on the coding
system used by whatever receives the byte (here, your terminal), not
on C.

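
To make the two points above concrete, here is a minimal sketch in C.
It assumes an ASCII-compatible execution character set (otherwise 'A'
need not be 65), and the function names same_integer,
char_constant_is_int and emitted_byte are hypothetical, chosen only for
illustration. It checks that the four spellings denote the same int,
and it captures the single byte that the %c conversion produces for a
given value:

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the four spellings from the text
   denote the same integer.  Assumes an ASCII-compatible execution
   character set. */
int same_integer(void)
{
    return 0101 == 65 && 65 == 'A' && 'A' == 0x41;
}

/* In C (unlike C++), a character constant such as 'A' has type int. */
int char_constant_is_int(void)
{
    return sizeof 'A' == sizeof(int);
}

/* Hypothetical helper: formats value with %c into a buffer instead of
   a stream, and returns the one byte produced, as an unsigned number. */
unsigned emitted_byte(int value)
{
    char buf[2] = {0, 0};
    snprintf(buf, sizeof buf, "%c", value);
    return (unsigned char)buf[0];
}
```

emitted_byte(176) gives 176 and emitted_byte(248) gives 248: C passes
the byte through unchanged, and the glyph is chosen later by whatever
decodes it.
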
(You may also try to find out in which coding systems 248 is a degree
sign:

    CL-USER> (do-external-symbols (cs :charset)
               (ignore-errors
                 (let ((str (ext:convert-string-from-bytes #(248) cs)))
                   (when (string-equal (char-name (character str)) "DEGREE_SIGN")
                     (print (list cs str))))))

    (CHARSET:CP866 "°")
    (CHARSET:CP860 "°")
    (CHARSET:CP861 "°")
    (CHARSET:CP862 "°")
    (CHARSET:CP863 "°")
    (CHARSET:CP863-IBM "°")
    (CHARSET:CP869 "°")
    (CHARSET:CP861-IBM "°")
    (CHARSET:CP862-IBM "°")
    (CHARSET:CP437 "°")
    (CHARSET:CP852-IBM "°")
    (CHARSET:CP857 "°")
    (CHARSET:CP850 "°")
    (CHARSET:CP869-IBM "°")
    (CHARSET:CP852 "°")
    (CHARSET:CP775 "°")
    (CHARSET:CP860-IBM "°")
    (CHARSET:CP865-IBM "°")
    (CHARSET:CP737 "°")
    (CHARSET:CP437-IBM "°")
    (CHARSET:CP865 "°")
    NIL

But this only tells us that your terminal is configured to decode the
bytes it receives using one of these IBM/Microsoft DOS code pages.)

>> Have a look at setlocale, LC_ALL, etc, and libiconv.
>
> I don't have any experience with this, but I did printf("%d", LC_ALL),
> which returned 0. Don't know what that means, but I'm not sure why
> locale settings should matter. Aren't Latin-1 and UTF-8 universal
> encodings? If a file is encoded in Latin-1, wouldn't the degree sign
> map to 176 regardless of locale?

LC_ALL is an environment variable on a POSIX system that tells
libraries and programs which language and character encoding the
current user and terminal expect. There is a set of associated
variables (LC_CTYPE, LC_MESSAGES, and so on). (In C, LC_ALL is also a
macro from <locale.h> that names a locale category for setlocale;
printing it only shows that implementation-defined constant.)

setlocale(3) is a library function that lets you indicate which
language and character encoding should be used for the current user
and terminal. Type:

    man 3 setlocale

and read also all the manual pages listed in the SEE ALSO section.
Type:

    man 3 iconv

For example, in my ~/.bash_env, I have:

    LC_CTYPE=en_US.UTF-8
    export LC_CTYPE

This defines an environment variable named LC_CTYPE whose value is
en_US.UTF-8. This value indicates that I want characters classified
according to US English conventions and encoded in UTF-8 (the language
of messages is selected separately, by LC_MESSAGES).

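
As a rough illustration of setlocale(3) and iconv(3), here is a minimal
sketch for a POSIX system. The helper names current_codeset and
convert_bytes are hypothetical, and the encoding names given to
iconv_open are assumed to be known to your iconv implementation (the
set of supported names varies from one system to another):

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale described by the environment
   (LC_ALL, LC_CTYPE, LANG, ...) and return the name of its character
   encoding, for example "UTF-8".  If the requested locale is not
   available, setlocale returns NULL and the previous locale stays in
   effect. */
const char *current_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: convert n bytes at in from encoding from to
   encoding to, writing at most outsize bytes at out.  Returns the
   number of bytes written, or (size_t)-1 on failure.  Stateful target
   encodings would need an extra flushing call; UTF-8 does not. */
size_t convert_bytes(const char *from, const char *to,
                     const char *in, size_t n,
                     char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: to, from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;              /* iconv takes char **, not const */
    char *outp = out;
    size_t inleft = n;
    size_t outleft = outsize;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

Converting the single byte 176 from "ISO-8859-1" to "UTF-8" with
convert_bytes yields the two bytes 0xC2 0xB0, which is how UTF-8
encodes the degree sign. Comparing the result of current_codeset with
the encoding of your own strings tells you whether such a conversion is
needed at all.
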
So a program may call getenv(3) to get the value of these environment
variables and pass it to setlocale(3) to inform the libraries which
encoding to use (passing the empty string "" as the locale name makes
the library read the environment itself), and it can use iconv(3) to
convert its own strings from their original encoding to the encoding
required by the terminal.

So, it seems you're writing your program on a Microsoft system; I was
oblivious to that fact, and I don't know anything about programming
Microsoft systems. When I have to use a Microsoft system (temporarily),
I download http://www.cygwin.com/setup.exe and use cygwin, which gives
me a POSIX environment, and if I have to develop a program, I install
Linux instead. Perhaps the POSIX API I mention here doesn't apply to
Microsoft programs.

In any case, coding systems may vary depending on the output device. If
your program writes to a terminal or console, you have to deal with the
coding system configured in the terminal. If it displays text through a
GUI, you have to deal with the coding system expected by the GUI
toolkit you're using. If it writes a file, you have to deal with the
encoding that must be generated in that file.

In practice, C compilers just take bytes and store bytes (even where
the language doesn't require it, they're usually just C programs
themselves!), so if you encode your sources in ISO-8859-1, you will
have ISO-8859-1 literal bytes in your program. If you need to output to
a device that expects another encoding, then your program will have to
find out which encoding is expected and convert the strings.

-- 
__Pascal Bourguignon__
http://www.informatimago.com/