From: "B. T. Raven"
Newsgroups: gmane.emacs.help
Subject: Re: Encoding help
Date: Tue, 02 Jun 2009 11:25:41 -0500
To: help-gnu-emacs@gnu.org

Eli Zaretskii wrote:
>> Date: Mon, 01 Jun 2009 11:51:13 -0500
>> From: "B. T. Raven"
>> Newsgroups: gnu.emacs.help
>>
>> I have a file created by saving a pdf as text and I want to convert the
>> whole thing to utf-8 encoding. If I force the encoding for save in Emacs
>> 23.0 to utf-8 I get the following in a *Warning* buffer:
>>
>> These default coding systems were tried to encode text
>> in the buffer `span.txt':
>>   (utf-8-dos (122 . 4194285) (165 . 4194257) (204 . 4194285)
>>   (253 . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285)
>>   (418 . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
>>   (iso-latin-1-dos (122 . 4194285) (165 . 4194257) (204 . 4194285)
>>   (253 . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285)
>>   (418 . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
>> However, each of them encountered characters it couldn't encode:
>>
>> [Below are many dozens of \xxx octal escape sequences]
>>
>> utf-8-dos cannot encode these: ...
>> iso-latin-1-dos cannot encode these: ...
>>
>> The original pdf shows many standard diacritics for Romance languages
>> along with a few vowels with macrons.
>
> It sounds like the original text file is already in UTF-8. Does it
> help to visit it with "C-x RET c utf-8 RET C-x C-f" instead of just
> "C-x C-f"?
>
> If that doesn't help (i.e. if you don't see diacritics instead of
> octal escapes), then can you find out how the file is encoded?
>
> Going to one of the octal escapes and typing "C-u C-x =" might also
> give important hints, so please post the result here.
>
>> If my only option is to Search and Replace these escape sequences
>> with Unicode characters, how can I get a list of all these bad
>> characters (they all show in red in Emacs 23 anyway).
>
> You can try using the functions unencodable-char-position and
> find-coding-systems-region to find these characters.

Thanks for the heads-up on these functions, Eli. I did try the
C-x RET c utf-8 ploy, but that just repeats my default settings. With
C-x RET c iso-8859-1 I see most characters legibly, but there are still
a few escape sequences sprinkled around. The most common are the pretty
quotes that LaTeX substitutes for ASCII single or double quotes. What
were vowels with macrons in the PDF are now bare vowels, so they must
have been compiled into the PDF as uncomposed sequences (not monolithic
precomposed glyphs).

>> Has any of you written routines to replace things like these using a
>> list of dotted pairs or something similar?
>
> Given the wealth of encodings supported by Emacs, such replacements
> should not be necessary. Instead, try to find out how the file is
> encoded, and visit it by instructing Emacs to use that encoding, with
> "C-x RET c".
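
The numbers in the *Warning* buffer can be read directly, which may help answer the "how is the file encoded?" question. Emacs 23 and later store an undecodable raw byte as a character code in the range 0x3FFF80..0x3FFFFF, that is, 0x3FFF00 plus the byte value. The Python sketch below is an illustration outside Emacs, not something proposed in the thread. It subtracts that base from the codes quoted above and shows what each byte would be in a candidate 8-bit encoding. The helper names (`raw_byte`, `describe`, `transcode_to_utf8`) are hypothetical, and cp1252 as the source encoding is an assumption based on the Windows environment and the stray "pretty quotes"; the thread does not confirm it.

```python
# Pairs copied from the *Warning* buffer quoted above: (position, character code).
WARNING_PAIRS = [
    (122, 4194285), (165, 4194257), (204, 4194285), (253, 4194257),
    (292, 4194285), (372, 4194289), (410, 4194285), (418, 4194285),
    (653, 4194217), (689, 4194285), (731, 4194285),
]

# Emacs 23+ represents a raw byte 0x80..0xFF as the code 0x3FFF00 + byte.
RAW_BYTE_BASE = 0x3FFF00


def raw_byte(code):
    """Return the raw byte (0x80..0xFF) behind an Emacs raw-byte character code."""
    if not 0x3FFF80 <= code <= 0x3FFFFF:
        raise ValueError(f"{code} is not an Emacs raw-byte character code")
    return code - RAW_BYTE_BASE


def describe(pairs, encoding="cp1252"):
    """List (position, byte, character) for each pair.

    The encoding is a guess to be checked against the PDF, not a known fact.
    Bytes the codec leaves undefined come back as U+FFFD.
    """
    rows = []
    for position, code in pairs:
        byte = raw_byte(code)
        char = bytes([byte]).decode(encoding, errors="replace")
        rows.append((position, byte, char))
    return rows


def transcode_to_utf8(data, source_encoding="cp1252"):
    """Re-encode bytes from the assumed source encoding to UTF-8.

    Raises UnicodeDecodeError if a byte is undefined in the source encoding,
    which is itself a hint that the guess is wrong.
    """
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    for position, byte, char in describe(WARNING_PAIRS):
        print(f"position {position}: byte 0x{byte:02X} -> {char!r}")
```

The four distinct codes in the warning correspond to bytes 0xED, 0xD1, 0xF1 and 0xA9. Each is a single high byte rather than a valid UTF-8 sequence, which fits an 8-bit encoding better than UTF-8. If the guess of cp1252 is right, it would also account for the leftover escapes under iso-8859-1: cp1252 keeps its curly quotes at 0x91..0x94, which ISO 8859-1 treats as control codes. In Emacs the equivalent check would be to revisit the file with "C-x RET c windows-1252 RET C-x C-f". Macrons that were dropped when the PDF was saved as text cannot be recovered by any choice of coding system.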