From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: those funny non-ASCII characters
Date: Sat, 2 Jun 2012 04:54:34 -0700 (PDT)
Message-ID: <878c6c73-4646-42fa-b5c5-5535803457f1@ri8g2000pbc.googlegroups.com>
To: help-gnu-emacs@gnu.org

On Jun 1, 8:17 pm, rusi wrote:
> On Jun 2, 2:06 am, Xah Lee wrote:
> >
> > Xah wrote
> >
> > > 〈Unicode BOM Byte Order Mark Hack〉 http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html
> >
> > > http://www.unicode.org/faq/utf_bom.html#bom1
> >
> > On Jun 1, 9:26 am, rusi wrote:
> >
> > > See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> > > (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> > > but may
> > > be encountered in contexts where UTF-8 data is converted from other
> > > encoding forms..."
> >
> > > More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
> > > "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> > > Where UTF-8 is used transparently in 8-bit environments, the use of a
> > > BOM will interfere with any protocol or file format that expects
> > > specific ASCII characters at the beginning, such as the use of "#!" of
> > > at the beginning of Unix shell scripts. "
> >
> > didn't i mention these 2 points exactly in the link i gave??
>
> Yeah your own link says this: (as you know I often use and quote your
> unicode pages :-) )
>
> - In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
> hack.
> - Many Window software add BOM to utf-8 files, e.g. Notepad.
>
> But you also say
>
> > If your lang spec says unicode, you have to support BOM mark
>
> So I am not clear whats ur stand...
>
> Let me make my own position clear:
> The de jure unicode standard is set by the unicode consortium (or
> whatever its called)
> The de facto standard is set by microsoft and java
> The two conflict

BOM mark is part of the unicode standard. If a tech declares full
support for unicode, then support for the BOM mark is necessary.

BOM mark is a hack, but so is the unix shebang mark.

Taking the BOM mark as a given, there wouldn't be any problem if utf-8
hadn't been invented. utf-8 was invented by unix fanatic Rob Pike (together
with Ken Thompson), largely to help the unix world move forward to unicode.
As it is, the BOM mark conflicts with the spirit of utf-8 (because utf-8 is
meant to be ASCII compatible as is, yet the BOM mark byte sequence, EF BB BF,
isn't in ASCII.)

i read the link Thien-Thin Nguyen posted 〔http://www.utf8everywhere.org/〕.
At first i found it very informative, but in the end i wasn't convinced
by its opinion that we should all adopt utf-8 instead of utf-16.

I think if one switches attitude, and takes utf-8 to be the hack that
introduced all these problems, then many of their arguments for utf-8
don't stand.

side note... about that site, it's Windows oriented. As such, they
didn't explain many of the terms and Windows tech they use, e.g. i have
little idea what they mean by narrowchar or widechar, nor do i know the
many Windows libraries they mention.

also, the site is decidedly western-mind oriented. They forgot that in
china, the encoding used is GB 18030, which has the same char set as
unicode but a different encoding, and is also compatible with ascii. No
utf-8 nor utf-anything whatsoever. Chinese web traffic is like half of
the world's or something.

the site wishes utf-16 to go away. Windows, Mac, the NTFS and HFS+ file
systems are all utf-16, plus Java, C#, etc. Though, the web (html, xml,
css) is all utf-8. Neither is likely to go away. If Java and C# and
NTFS disappeared from the face of this earth, then maybe.
lol. :D

 Xah
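
For anyone who wants to poke at the BOM vs. shebang conflict and the ASCII-compatibility point, here is a minimal Python 3 sketch. It is only an illustration under assumptions: it relies on the standard library codecs `utf-8-sig` (which prepends the BOM on encoding) and `gb18030`, and the helper names `starts_with_shebang` and `is_ascii_compatible` are made up for this example, not taken from any tool mentioned in the thread.

```python
import codecs

# The UTF-8 BOM is U+FEFF encoded as three bytes; none of them is ASCII.
bom = codecs.BOM_UTF8

script = "#!/bin/sh\necho hi\n"
plain = script.encode("utf-8")         # UTF-8 without a BOM
with_bom = script.encode("utf-8-sig")  # UTF-8 with the BOM prepended


def starts_with_shebang(data: bytes) -> bool:
    """Hypothetical helper: do the first two bytes spell '#!'?"""
    return data[:2] == b"#!"


def is_ascii_compatible(encoding: str, text: str = "#!/bin/sh") -> bool:
    """Hypothetical helper: does pure-ASCII text keep its ASCII bytes?"""
    return text.encode(encoding) == text.encode("ascii")


print("BOM bytes:", bom.hex())
print("shebang found without BOM:", starts_with_shebang(plain))
print("shebang found with BOM:", starts_with_shebang(with_bom))
for name in ("utf-8", "gb18030", "utf-16-le"):
    print(name, "ASCII compatible:", is_ascii_compatible(name))
```

The `utf-16-le` case is included only as a contrast: it encodes each ASCII character as two bytes, so it fails the same check that utf-8 and GB 18030 pass.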