From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: those funny non-ASCII characters
Date: Thu, 24 May 2012 17:56:59 -0700 (PDT)
To: help-gnu-emacs@gnu.org

On May 24, 4:49 pm, "Buchs, Kevin" wrote:
> I often paste content from web pages into an emacs org-mode buffer and I
> get the odd quote characters or dashes that are not ASCII. I created a
> lisp function to remove the unicode ones that are just 8 bits. Lately I
> am seeing that there are characters that are not being caught. They show
> up in emacs as the expected character. When I kill/yank them into lisp
> code, they are not being found. When I save the buffer, I am asked for
> coding and chose raw text. When the file is opened again, these
> characters are showing up as some sort of special symbol (dashed circle
> with flag off the top) followed by doubles/triples of \2xx. For example,
> the dash character I just stored was this sequence: circle-flag \200
> \231. Using Gnu/Linux od to dump them I get hex strings such as: 340 245
> 206 340 244 206 210 200 and for the dash mentioned above 342 200 231.
>
> I am very naive in regard to coding, so please excuse my ignorance. I
> would guess these are 16-bit (Unicode16) characters. Can someone
> enlighten me as to how I can determine what these characters are (after
> pasted into a buffer) and how I can code a function to replace them with
> ASCII equivalents? The only thing I could think of was hexl mode, but
> that didn't turn out well. Thanks.

Better to embrace Unicode than fight it.

Which encoding you get when you paste is rather complex. I guess it
depends on the sources you copy from, as each web page can be in a
different charset and encoding, and I'm not sure whether your OS does
some translation in the pasteboard.
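
One way to find out what the pasted characters are is to decode the dumped bytes directly. The sketch below is in Python rather than Emacs Lisp, and it assumes the values printed by od are octal bytes of UTF-8 text (the \200 \231 shown by Emacs suggests so, but that is a guess). The function names are made up for illustration.

```python
import unicodedata


def decode_octal_dump(dump):
    """Convert space-separated octal byte values (as od prints them)
    into text, assuming the bytes are UTF-8."""
    raw = bytes(int(token, 8) for token in dump.split())
    return raw.decode("utf-8")


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


# The three bytes quoted for the "dash" in the question.
print(describe(decode_octal_dump("342 200 231")))
# -> [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```

Under that assumption the sequence 342 200 231 is one character taking three bytes in UTF-8, not a 16-bit value, and it names a curly apostrophe rather than a dash.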

Maybe this will help.

〈Emacs File/Character Encoding/Decoding FAQ〉
http://xahlee.org/emacs/emacs_encoding_decoding_faq.html

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

To replace non-ASCII, you can use the regex [[:nonascii:]]+

〈Char Classes - GNU Emacs Lisp Reference Manual〉
http://xahlee.org/emacs_manual/elisp/Char-Classes.html

〈Emacs Lisp: Convert Unicode String to ASCII (Zap Gremlins)〉
http://xahlee.org/emacs/emacs_zap_gremlins.html

 Xah
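
The replace-with-ASCII idea can be sketched outside Emacs as well. This is a rough Python analogue, not the code from the linked pages: Python's re module has no [[:nonascii:]] class, so [^\x00-\x7f]+ stands in for it, and both the mapping table and the name to_ascii are assumptions chosen for illustration.

```python
import re

# Hypothetical table of common typographic characters and the ASCII
# text to use instead; extend it for whatever your sources contain.
ASCII_EQUIVALENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
}

_TABLE = str.maketrans(ASCII_EQUIVALENTS)
_NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def to_ascii(text, placeholder="?"):
    """Map known characters to ASCII, then replace each remaining
    run of non-ASCII characters with a single placeholder."""
    return _NON_ASCII.sub(placeholder, text.translate(_TABLE))


print(to_ascii("\u201cdon\u2019t\u201d \u2026"))  # -> "don't" ...
```

Mapping the known characters first keeps quotes and dashes readable; only what the table does not cover falls through to the catch-all regex.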