* Encoding help
@ 2009-06-01 16:51 B. T. Raven
  2009-06-01 23:05 ` Eli Zaretskii
       [not found] ` <mailman.8314.1243897564.31690.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 6+ messages in thread
From: B. T. Raven @ 2009-06-01 16:51 UTC
  To: help-gnu-emacs

I have a file created by saving a PDF as text and I want to convert the 
whole thing to UTF-8 encoding. If I force the save encoding in Emacs 
23.0 to utf-8, I get the following in a *Warning* buffer:

These default coding systems were tried to encode text
in the buffer `span.txt':
   (utf-8-dos (122 . 4194285) (165 . 4194257) (204 . 4194285) (253
   . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
   . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
   (iso-latin-1-dos (122 . 4194285) (165 . 4194257) (204 . 4194285)
   (253 . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
   . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
However, each of them encountered characters it couldn't encode:

[Below are many dozens of \xxx octal escape sequences]

   utf-8-dos cannot encode these:                     ...
   iso-latin-1-dos cannot encode these:                     ...

The original PDF shows many standard diacritics for Romance languages 
along with a few vowels with macrons. There is no option in Adobe Reader 
for saving as encoded text. If my only option is to search and replace 
these escape sequences with Unicode characters, how can I get a list of 
all these bad characters (they all show in red in Emacs 23 anyway)? Has 
any of you written routines to replace things like these using a list of 
dotted pairs or something similar?


Thanks,

Ed



* Re: Encoding help
  2009-06-01 16:51 Encoding help B. T. Raven
@ 2009-06-01 23:05 ` Eli Zaretskii
       [not found] ` <mailman.8314.1243897564.31690.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2009-06-01 23:05 UTC
  To: help-gnu-emacs

> Date: Mon, 01 Jun 2009 11:51:13 -0500
> From: "B. T. Raven" <nihil@nihilo.net>
> Newsgroups: gnu.emacs.help
> 
> I have a file created by saving a pdf as text and I want to convert the 
> whole thing to utf-8 encoding. If I force the encoding for save in Emacs 
> 23.0 to utf-8 I get the following in a *Warning* buffer:
> 
> These default coding systems were tried to encode text
> in the buffer `span.txt':
>    (utf-8-dos (122 . 4194285) (165 . 4194257) (204 . 4194285) (253
>    . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
>    . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
>    (iso-latin-1-dos (122 . 4194285) (165 . 4194257) (204 . 4194285)
>    (253 . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
>    . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
> However, each of them encountered characters it couldn't encode:
> 
> [Below are many dozens of \xxx octal escape sequences]
> 
>    utf-8-dos cannot encode these:                     ...
>    iso-latin-1-dos cannot encode these:                     ...
> 
> The original pdf shows many standard diacritics for Romance languages 
> along with a few vowels with macrons.

It sounds like the original text file is already in UTF-8.  Does it
help to visit it with "C-x RET c utf-8 RET C-x C-f" instead of just
"C-x C-f"?

If that doesn't help (i.e. if you don't see diacritics instead of
octal escapes), then can you find out how the file is encoded?

Going to one of the octal escapes and typing "C-u C-x =" might also
give important hints, so please post the result here.
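
The numbers in the warning quoted above are themselves a hint. Emacs 23
stores a byte it could not decode as a character code of 0x3FFF00 plus
the byte value, so the codes can be mapped back to the bytes in the
file. The following is only a rough sketch in Python (not Emacs Lisp);
reading the bytes as ISO-8859-1 is an assumption about the file, and
the helper names are made up for illustration.

```python
# Distinct character codes copied from the *Warning* buffer quoted above.
WARNING_CODES = [4194285, 4194257, 4194289, 4194217]

# Emacs 23 and later represent an undecodable raw byte 0x80..0xFF as the
# character code 0x3FFF00 + byte.
RAW_BYTE_BASE = 0x3FFF00


def raw_byte(code):
    """Return the raw byte behind an Emacs eight-bit character code."""
    byte = code - RAW_BYTE_BASE
    if not 0x80 <= byte <= 0xFF:
        raise ValueError(f"{code} is not an Emacs raw-byte character")
    return byte


def guess_latin1(code):
    """Decode the raw byte as ISO-8859-1 (an assumption about the file)."""
    return bytes([raw_byte(code)]).decode("latin-1")


for code in WARNING_CODES:
    print(f"{code} -> byte 0x{raw_byte(code):02X} -> {guess_latin1(code)!r}")
```

The four codes correspond to the bytes 0xED, 0xD1, 0xF1 and 0xA9, which
read as í, Ñ, ñ and © in ISO-8859-1. That would be consistent with an
8-bit Latin encoding rather than UTF-8, but it is only a guess until
the file's real encoding is known.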

> If my only option is to Search and Replace these escape sequences
> with Unicode characters, how can I get a list of all these bad
> characters (they all show in red in Emacs 23 anyway).

You can try using the functions unencodable-char-position and
find-coding-systems-region to find these characters.

> Has any of you written routines to replace things like these using a
> list of dotted pairs or something similar?

Given the wealth of encodings supported by Emacs, such replacements
should not be necessary.  Instead, try to find out how the file is
encoded, and visit it by instructing Emacs to use that encoding, with
"C-x RET c".
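
If the encoding is not known, one way to narrow it down outside Emacs
is to test which candidate encodings decode the file's bytes without
error. This is a minimal sketch in Python; the candidate list, the
function name and the sample bytes are all assumptions chosen for
illustration.

```python
# Hypothetical shortlist of encodings to try; adjust for the file at hand.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-15"]


def decodable_as(data, candidates=CANDIDATES):
    """Return the candidate encodings that decode data without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Made-up sample: "señor" plus a curly-quoted word, stored as cp1252 bytes.
sample = "se\u00f1or \u201chola\u201d".encode("cp1252")
print(decodable_as(sample))
```

For a real file, pass the result of reading it in binary mode. A clean
decode is only a hint: single-byte encodings such as iso-8859-15 accept
every byte value, so the text still has to be checked by eye. A failure
for utf-8, on the other hand, does rule it out.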





* Re: Encoding help
       [not found] ` <mailman.8314.1243897564.31690.help-gnu-emacs@gnu.org>
@ 2009-06-02 16:25   ` B. T. Raven
  2009-06-02 22:58     ` Eli Zaretskii
       [not found]     ` <mailman.8392.1243983524.31690.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 6+ messages in thread
From: B. T. Raven @ 2009-06-02 16:25 UTC
  To: help-gnu-emacs

Eli Zaretskii wrote:
>> Date: Mon, 01 Jun 2009 11:51:13 -0500
>> From: "B. T. Raven" <nihil@nihilo.net>
>> Newsgroups: gnu.emacs.help
>>
>> I have a file created by saving a pdf as text and I want to convert the 
>> whole thing to utf-8 encoding. If I force the encoding for save in Emacs 
>> 23.0 to utf-8 I get the following in a *Warning* buffer:
>>
>> These default coding systems were tried to encode text
>> in the buffer `span.txt':
>>    (utf-8-dos (122 . 4194285) (165 . 4194257) (204 . 4194285) (253
>>    . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
>>    . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
>>    (iso-latin-1-dos (122 . 4194285) (165 . 4194257) (204 . 4194285)
>>    (253 . 4194257) (292 . 4194285) (372 . 4194289) (410 . 4194285) (418
>>    . 4194285) (653 . 4194217) (689 . 4194285) (731 . 4194285))
>> However, each of them encountered characters it couldn't encode:
>>
>> [Below are many dozens of \xxx octal escape sequences]
>>
>>    utf-8-dos cannot encode these:                     ...
>>    iso-latin-1-dos cannot encode these:                     ...
>>
>> The original pdf shows many standard diacritics for Romance languages 
>> along with a few vowels with macrons.
> 
> It sounds like the original text file is already in UTF-8.  Does it
> help to visit it with "C-x RET c utf-8 RET C-x C-f" instead of just
> "C-x C-f"?
> 
> If that doesn't help (i.e. if you don't see diacritics instead of
> octal escapes), then can you find out how the file is encoded?
> 
> Going to one of the octal escapes and typing "C-u C-x =" might also
> give important hints, so please post the result here.
> 
>> If my only option is to Search and Replace these escape sequences
>> with Unicode characters, how can I get a list of all these bad
>> characters (they all show in red in Emacs 23 anyway).
> 
> You can try using the functions unencodable-char-position and
> find-coding-systems-region to find these characters.

Thanks for the heads up on these functions, Eli. I did use the C-x RET c 
utf-8 ploy but that just repeats my default settings. I see most 
characters legibly with C-x RET c iso-8859-1 but there are still a few 
escape sequences sprinkled around. The most common are those pretty 
quotes that LaTeX substitutes for ASCII single or double quotes. What 
were vowels with macrons in the PDF are bare vowels, so they must have 
been compiled into the PDF as uncomposed characters (not monolithic 
composed glyphs).
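
One possible explanation for the leftover quote escapes, offered only
as a guess: if the file is really windows-1252 rather than iso-8859-1,
the curly quotes sit at bytes 0x91 to 0x94, which iso-8859-1 treats as
control characters. The Python sketch below shows the difference; that
these are the bytes actually present in the file is an assumption.

```python
# Bytes 0x91-0x94 are the curly single and double quotes in windows-1252.
quote_bytes = bytes([0x91, 0x92, 0x93, 0x94])

as_latin1 = quote_bytes.decode("latin-1")
as_cp1252 = quote_bytes.decode("cp1252")

# In ISO-8859-1 the same bytes are C1 control characters (U+0091..U+0094),
# which an editor will typically display as escapes.
print([f"U+{ord(ch):04X}" for ch in as_latin1])

# In windows-1252 they are U+2018, U+2019, U+201C and U+201D.
print([f"U+{ord(ch):04X}" for ch in as_cp1252])
```

If that is what is happening, visiting the file with C-x RET c
windows-1252 instead of iso-8859-1 might show the quotes properly.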

> 
>> Has any of you written routines to replace things like these using a
>> list of dotted pairs or something similar?
> 
> Given the wealth of encodings supported by Emacs, such replacements
> should not be necessary.  Instead, try to find out how the file is
> encoded, and visit it by instructing Emacs to use that encoding, with
> "C-x RET c".
> 
> 



* Re: Encoding help
  2009-06-02 16:25   ` B. T. Raven
@ 2009-06-02 22:58     ` Eli Zaretskii
       [not found]     ` <mailman.8392.1243983524.31690.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2009-06-02 22:58 UTC
  To: help-gnu-emacs

> Date: Tue, 02 Jun 2009 11:25:41 -0500
> From: "B. T. Raven" <nihil@nihilo.net>
> Newsgroups: gnu.emacs.help
> 
> Thanks for the heads up on these functions, Eli. I did use the C-x ret c 
> utf-8 ploy but that just repeats my default settings. I see most 
> characters legibly  with C-x ret c iso-8859-1 but there are still a few 
> escape sequences sprinkled around. The most common are those pretty 
> quotes that Latex substitutes for ascii single or double quote. What 
> were vowels with macrons in the pdf are bare vowels so they must have 
> been compiled into the pdf as uncomposed (not monolithic composed glyphs).

It might be a good idea to submit a bug report with "M-x
report-emacs-bug RET", then.  Please state there which program(s) and
command-line options you used to produce the text file from the PDF,
and someone will look into this and see whether Emacs could do any
better.





* Re: Encoding help
       [not found]     ` <mailman.8392.1243983524.31690.help-gnu-emacs@gnu.org>
@ 2009-06-03 17:35       ` B. T. Raven
  2009-06-03 17:58         ` Peter Dyballa
  0 siblings, 1 reply; 6+ messages in thread
From: B. T. Raven @ 2009-06-03 17:35 UTC
  To: help-gnu-emacs

Eli Zaretskii wrote:
>> Date: Tue, 02 Jun 2009 11:25:41 -0500
>> From: "B. T. Raven" <nihil@nihilo.net>
>> Newsgroups: gnu.emacs.help
>>
>> Thanks for the heads up on these functions, Eli. I did use the C-x ret c 
>> utf-8 ploy but that just repeats my default settings. I see most 
>> characters legibly  with C-x ret c iso-8859-1 but there are still a few 
>> escape sequences sprinkled around. The most common are those pretty 
>> quotes that Latex substitutes for ascii single or double quote. What 
>> were vowels with macrons in the pdf are bare vowels so they must have 
>> been compiled into the pdf as uncomposed (not monolithic composed glyphs).
> 
> It might be a good idea to submit a bug report with "M-x
> report-emacs-bug RET", then.  Please tell there what program(s) and
> command line options you used to produce the text file from PDF, and
> someone will look into this and see whether Emacs could do any
> better.
> 
> 

Okay. I'll do that if I'm convinced that Emacs is at fault. Now I'm not 
so sure. Adobe Reader 8.0 offers only (Accessible) .txt as an option for 
Save as text. In the meantime I made a similar PDF with AUCTeX and the 
.txt file produced by Adobe Reader is even more fragmented than the 
first one. I guess this is not surprising after the original .tex file 
goes through \usepackage[utf8x]{inputenc} and \usepackage{babel}.




* Re: Encoding help
  2009-06-03 17:35       ` B. T. Raven
@ 2009-06-03 17:58         ` Peter Dyballa
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Dyballa @ 2009-06-03 17:58 UTC
  To: B. T. Raven; +Cc: help-gnu-emacs


Am 03.06.2009 um 19:35 schrieb B. T. Raven:

> In the meantime I made a similar PDF with AUCTeX and the .txt file  
> produced by Adobe Reader is even more fragmented than the first  
> one. I guess this is not surprising after the original .tex file  
> goes through \usepackage[utf8x]{inputenc} and \usepackage{babel}.


As long as you're using pdfTeX you can be sure that the PDF file has  
composed characters (input encoding plays no role, because it's just  
an *input* encoding). With a CMAP (character mapping, see 'texdoc -s  
cmap') and an 8 (or 7) bit font encoding (T1, T2A, T2B, T2C, T5, OT1,  
OT1tt, OT6, LGR, LAE, LFE) the composed characters can be mapped to  
the ready to use (pre-composed) Unicode characters.
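
As a side note on composed versus pre-composed characters: if the
extracted text does contain a base vowel followed by a combining
macron (U+0304), Unicode normalization can fold the pair into the
single pre-composed character. The Python sketch below illustrates
this with made-up input. It cannot help if the macron was dropped
during extraction, which is what the earlier report of bare vowels
suggests.

```python
import unicodedata

# Made-up input: a, e and o, each followed by COMBINING MACRON (U+0304).
decomposed = "a\u0304 e\u0304 o\u0304"

# NFC composes each base + combining pair where a pre-composed form exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([f"U+{ord(ch):04X}" for ch in composed])
```

After normalization the three vowels are U+0101, U+0113 and U+014D,
and the string shrinks from 8 code points to 5.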

The use of XeTeX might be another option (its xdvipdfmx output  
driver inserts CMAPs into the PDF file). Or another PDF viewer, one  
that automatically reloads the updated PDF output file.

--
Greetings

   Pete

A common mistake that people make when trying to design something  
completely foolproof is to underestimate the ingenuity of complete  
fools.






