* I'm is really I'm
@ 2010-07-07  1:21 Lennart Borgman
  2010-07-07  1:38 ` Harald Hanche-Olsen
  2010-07-07  2:34 ` Karl Fogel
  0 siblings, 2 replies; 8+ messages in thread
From: Lennart Borgman @ 2010-07-07  1:21 UTC
  To: Emacs-Devel devel

I was just copying some text from a pdf file to store in org-mode in
Emacs. Some of the characters are not readable with my font. For
example when pasting something that looked like

  I'm

I got in Emacs

  I‟m

where the middle char is

        character: ‟ (8223, #o20037, #x201f)
preferred charset: unicode-bmp
                   (Unicode Basic Multilingual Plane (U+0000..U+FFFF))
       code point: 0x201F
           syntax: . 	which means: punctuation
         category: .:Base
      buffer code: #xE2 #x80 #x9F
        file code: #xE2 #x80 #x9F
                   (encoded by coding system utf-8-unix)
          display: no font available

Obviously this character is normally ' (char 39).

Do we have any tool for replacing such characters in Emacs? Or is
there a better way?
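
Before replacing anything it helps to know exactly which non-ASCII characters a paste brought in. The report above has the shape of what `describe-char' (C-u C-x =) prints for a single character; the sketch below does a rougher version of that for a whole region. The function name `my-non-ascii-chars-in-region' is hypothetical and not part of Emacs, and the sketch assumes an Emacs recent enough to provide `get-char-code-property'.

```elisp
;; Hypothetical helper, not part of Emacs: list each distinct non-ASCII
;; character between START and END together with its Unicode name.
(defun my-non-ascii-chars-in-region (start end)
  "Return an alist of (CHAR . NAME) for non-ASCII chars between START and END.
Each distinct character appears once, in order of first occurrence."
  (let (seen)
    (save-excursion
      (goto-char start)
      ;; [[:nonascii:]] matches any single character outside ASCII.
      (while (re-search-forward "[[:nonascii:]]" end t)
        (let ((ch (char-before)))
          (unless (assq ch seen)
            (push (cons ch (get-char-code-property ch 'name)) seen)))))
    (nreverse seen)))
```

Evaluating `(my-non-ascii-chars-in-region (region-beginning) (region-end))' over the pasted text gives the list of characters that a replacement table would need to cover.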




* Re: I'm is really I'm
  2010-07-07  1:21 I'm is really I'm Lennart Borgman
@ 2010-07-07  1:38 ` Harald Hanche-Olsen
  2010-07-07  1:57   ` Lennart Borgman
  2010-07-07  2:34 ` Karl Fogel
  1 sibling, 1 reply; 8+ messages in thread
From: Harald Hanche-Olsen @ 2010-07-07  1:38 UTC
  To: lennart.borgman; +Cc: emacs-devel

+ Lennart Borgman <lennart.borgman@gmail.com>:

> I was just copying some text from a pdf file to store in org-mode in
> Emacs. Some of the characters are not readable with my font. For
> example when pasting something that looked like
> 
>   I'm
> 
> I got in Emacs
> 
>   I‟m
> 
> where the middle char is

DOUBLE HIGH-REVERSED-9 QUOTATION MARK
which is really sort of wrong.

I suspect that the PDF file has used a non-unicode font for that
character. And once the PDF file creator resorts to such then ...

> Do we have any tool for replacing such characters in Emacs? Or is
> there a better way?

... all bets for an automatic recovery are off, except for some AI
technique. Otherwise, I am very much afraid that good old search and
replace is the best you can do. Of course, if you have a lot of these
files and they all suffer the same symptoms, you might want to build a
translation table of sorts. Is your question really about how you can
build and apply such tables to text?

- Harald
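
A translation table of the kind suggested above could be sketched roughly as follows. Everything named `my-...' is hypothetical, and the table entries are only illustrative. Note that the default table maps U+201F to a double quote, which is its Unicode meaning; for a PDF like the one in this thread, where that code point stands in for an apostrophe, a per-document table would have to override it.

```elisp
;; Hypothetical table, not part of Emacs.  Keys are characters, values
;; are the ASCII strings to substitute.  Entries are illustrative only.
(defvar my-typographic-to-ascii
  '((#x2018 . "'")  (#x2019 . "'")  (#x201B . "'")
    (#x201C . "\"") (#x201D . "\"") (#x201F . "\"")
    (#x2013 . "-")  (#x2014 . "--") (#x2026 . "...")
    (#x00A0 . " "))
  "Alist mapping typographic characters to ASCII replacements.")

(defun my-asciify-region (start end &optional table)
  "Replace characters between START and END according to TABLE.
TABLE is an alist of (CHAR . STRING) and defaults to
`my-typographic-to-ascii'.  Characters not in TABLE are left alone."
  (interactive "*r")
  (let ((table (or table my-typographic-to-ascii))
        ;; A marker keeps tracking the end of the region as the text
        ;; grows or shrinks during replacement.
        (end-marker (copy-marker end)))
    (save-excursion
      (goto-char start)
      (while (re-search-forward "[[:nonascii:]]" end-marker t)
        (let ((replacement (cdr (assq (char-before) table))))
          (when replacement
            (replace-match replacement t t)))))
    (set-marker end-marker nil)))
```

For the PDF discussed here one might call it with a document-specific table, for example `(my-asciify-region (point-min) (point-max) '((#x201F . "'")))'.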




* Re: I'm is really I'm
  2010-07-07  1:38 ` Harald Hanche-Olsen
@ 2010-07-07  1:57   ` Lennart Borgman
  2010-07-07  4:19     ` Jason Rumney
  0 siblings, 1 reply; 8+ messages in thread
From: Lennart Borgman @ 2010-07-07  1:57 UTC
  To: Harald Hanche-Olsen; +Cc: emacs-devel

On Wed, Jul 7, 2010 at 3:38 AM, Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
> + Lennart Borgman <lennart.borgman@gmail.com>:
>
>> I was just copying some text from a pdf file to store in org-mode in
>> Emacs. Some of the characters are not readable with my font. For
>> example when pasting something that looked like
>>
>>   I'm
>>
>> I got in Emacs
>>
>>   I‟m
>>
>> where the middle char is
>
> DOUBLE HIGH-REVERSED-9 QUOTATION MARK
> which is really sort of wrong.
>
> I suspect that the PDF file has used a non-unicode font for that
> character. And once the PDF file creator resorts to such then ...


The PDF file says "PDF Producer: Microsoft Office Word 2007".


>> Do we have any tool for replacing such characters in Emacs? Or is
>> there a better way?
>
> ... all bets for an automatic recovery are off, except for some AI
> technique. Otherwise, I am very much afraid that good old search and
> replace is the best you can do. Of course, if you have a lot of these
> files and they all suffer the same symptoms, you might want to build a
> translation table of sorts. Is your question really about how you can
> build and apply such tables to text?


I hoped there were some easy cases where some characters commonly used
for typographic reasons could be replaced by more "wellknown"
characters.

Otherwise a very simple "AI" technique could perhaps be to just build
a table of things with ("I‟m" "I'm").




* Re: I'm is really I'm
  2010-07-07  1:21 I'm is really I'm Lennart Borgman
  2010-07-07  1:38 ` Harald Hanche-Olsen
@ 2010-07-07  2:34 ` Karl Fogel
  2010-07-07 13:55   ` Davis Herring
  1 sibling, 1 reply; 8+ messages in thread
From: Karl Fogel @ 2010-07-07  2:34 UTC
  To: Lennart Borgman; +Cc: Emacs-Devel devel

Lennart Borgman <lennart.borgman@gmail.com> writes:
>Obviously this character is normally ' (char 39).
>
>Do we have any tool for replacing such characters in Emacs? Or is
>there a better way?

I get this problem all the time when pasting from web pages, PDFs, and
other sources of formatted text.

So I've been trying to write either a "filtered paste" or just a
function to clean up a region after pasting it.  But I'm rusty on
character representations in Emacs these days, and am having trouble
coming up with a way to represent (in Elisp source code) the characters
that most often need replacing.

Anyone who wants to play Captain Obvious on the code below, go for it.
It would be nice to give Emacs a standard solution to this common
problem.

  (defun clean-region (start end)
    "Clean up a region of text that comes from a non-plaintext source.
  Formatted sources, such as web pages and PDF documents, often contain
  characters that could be reasonably represented in plain ASCII but are
  not.  For example the characters referenced by &rdquo; and &ldquo; in
  HTML are not the same as ASCII 34 (double quote).  It is sometimes
  desirable to simply convert the formatted text to ASCII."
    (interactive "*r")
    ;; TODO: this is not working yet.  Maybe make chars, not strings,
    ;; and this might work?  Not sure.
    (let ((open-double-quote  (make-string 3 0))
          (close-double-quote (make-string 3 0))
          (funderscore        ? )
          (apostrophe         (make-string 3 0)))
      ;; I don't know any other way to make these strings besides
      ;; just setting each character by hand... but even that doesn't
      ;; seem to result in a working `replace-string' in the end.
      (aset open-double-quote 0 ?â)
      (aset open-double-quote 1 128)
      (aset open-double-quote 2 156)
      (aset close-double-quote 0 ?â)
      (aset close-double-quote 1 128)
      (aset close-double-quote 2 157)
      (aset apostrophe 0 ?â)
      (aset apostrophe 1 128)
      (aset apostrophe 2 153)
      (save-excursion
        (goto-char start)
        (replace-string apostrophe "'"  nil start end))))




* Re: I'm is really I'm
  2010-07-07  1:57   ` Lennart Borgman
@ 2010-07-07  4:19     ` Jason Rumney
  2010-07-07  6:44       ` Reiner Steib
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Rumney @ 2010-07-07  4:19 UTC
  To: emacs-devel

On 07/07/2010 09:57, Lennart Borgman wrote:
> I hoped there were some easy cases where some characters commonly used
> for typographic reasons could be replaced by more "wellknown"
> characters.
>    

There are some filters in Gnus to handle this type of problem, but the 
problem you saw is different. PDF allows fonts to be embedded in the 
document, and when this happens the mapping from character encoding to 
glyph gets optimised so there is no common standard.  I've seen this 
before when attempting to copy and paste from a Japanese PDF document - 
there was no way of getting useful information out short of using OCR on 
a bitmap of the PDF reader's display.





* Re: I'm is really I'm
  2010-07-07  4:19     ` Jason Rumney
@ 2010-07-07  6:44       ` Reiner Steib
  0 siblings, 0 replies; 8+ messages in thread
From: Reiner Steib @ 2010-07-07  6:44 UTC
  To: Jason Rumney; +Cc: emacs-devel

On Wed, Jul 07 2010, Jason Rumney wrote:

> On 07/07/2010 09:57, Lennart Borgman wrote:
>> I hoped there were some easy cases where some characters commonly used
>> for typographic reasons could be replaced by more "wellknown"
>> characters.
>
> There are some filters in Gnus 

(See "dumbquotes" in gnus-art.el)

> to handle this type of problem, but the problem you saw is
> different.

For the common case, I'd start with `iso-translate-conventions' from
`iso-cvt.el'.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/
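
A rough sketch of how `iso-translate-conventions' might be applied to the common case. It assumes the function takes (FROM TO TRANS-TAB), with TRANS-TAB a list of (REGEXP REPLACEMENT) entries, which is how the tables inside iso-cvt.el appear to be laid out; check `C-h f iso-translate-conventions' in your Emacs before relying on this. The names `my-quote-trans-tab' and `my-translate-quotes' are hypothetical.

```elisp
(require 'iso-cvt)

;; Hypothetical table in the (REGEXP REPLACEMENT) shape assumed above.
;; None of these patterns contain regexp-special characters.
(defvar my-quote-trans-tab
  '(("\u2018" "'")      ; LEFT SINGLE QUOTATION MARK
    ("\u2019" "'")      ; RIGHT SINGLE QUOTATION MARK
    ("\u201C" "\"")     ; LEFT DOUBLE QUOTATION MARK
    ("\u201D" "\""))    ; RIGHT DOUBLE QUOTATION MARK
  "Translation table mapping curly quotes to ASCII quotes.")

(defun my-translate-quotes (start end)
  "Replace curly quotes between START and END with ASCII quotes."
  (interactive "*r")
  (iso-translate-conventions start end my-quote-trans-tab))
```

This only helps when the pasted text contains the intended Unicode characters; it does nothing for a PDF whose embedded font maps glyphs to unrelated code points.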




* Re: I'm is really I'm
  2010-07-07  2:34 ` Karl Fogel
@ 2010-07-07 13:55   ` Davis Herring
  2010-07-07 15:22     ` Karl Fogel
  0 siblings, 1 reply; 8+ messages in thread
From: Davis Herring @ 2010-07-07 13:55 UTC
  To: Karl Fogel; +Cc: Lennart Borgman, Emacs-Devel devel

> Anyone who wants to play Captain Obvious on the code below, go for it.
> It would be nice to give Emacs a standard solution to this common
> problem.
[snip]
>       (aset open-double-quote 0 ?â)
>       (aset open-double-quote 1 128)
>       (aset open-double-quote 2 156)

It looks like you're trying to write the UTF-8 for your character.  Don't
do that; write the actual character (or the hex-escape for it; see
(elisp)General Escape Syntax), since that's what's in the buffer.

Apologies, of course, if I'm misunderstanding.

Davis

-- 
This product is sold by volume, not by mass.  If it appears too dense or
too sparse, it is because mass-energy conversion has occurred during
shipping.
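
A minimal sketch of the advice above: write the characters themselves, using escape syntax, instead of assembling their UTF-8 bytes. The byte triples in the quoted code (#xE2 #x80 followed by #x9C, #x9D and #x99) are the UTF-8 encodings of U+201C, U+201D and U+2019, so those are the characters used here. The name `my-clean-region' is hypothetical, and the sketch uses `search-forward' with `replace-match', which the `replace-string' documentation recommends for use in Lisp programs.

```elisp
;; Hypothetical rewrite, not the original author's final code.
(defun my-clean-region (start end)
  "Replace curly quotes and apostrophes between START and END with ASCII."
  (interactive "*r")
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the replacements inside the region even as
      ;; the buffer contents change.
      (narrow-to-region start end)
      (dolist (pair '(("\u201C" . "\"")    ; LEFT DOUBLE QUOTATION MARK
                      ("\u201D" . "\"")    ; RIGHT DOUBLE QUOTATION MARK
                      ("\u2019" . "'")))   ; RIGHT SINGLE QUOTATION MARK
        (goto-char (point-min))
        (while (search-forward (car pair) nil t)
          (replace-match (cdr pair) t t))))))
```

The same characters can be written as character literals, for example `?\u2019', when a character rather than a string is needed.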




* Re: I'm is really I'm
  2010-07-07 13:55   ` Davis Herring
@ 2010-07-07 15:22     ` Karl Fogel
  0 siblings, 0 replies; 8+ messages in thread
From: Karl Fogel @ 2010-07-07 15:22 UTC
  To: herring; +Cc: Lennart Borgman, Emacs-Devel devel

"Davis Herring" <herring@lanl.gov> writes:
>>       (aset open-double-quote 0 ?â)
>>       (aset open-double-quote 1 128)
>>       (aset open-double-quote 2 156)
>
>It looks like you're trying to write the UTF-8 for your character.  Don't
>do that; write the actual character (or the hex-escape for it; see
>(elisp)General Escape Syntax), since that's what's in the buffer.
>
>Apologies, of course, if I'm misunderstanding.

Thank you!  That's the hint I needed.  I'll also check out
`iso-translate-conventions' from `iso-cvt.el' as Reiner Steib mentioned.

-Karl


