Re: Smart Quotes Exporting


From: Mark Shoulson <mark@kli.org>
To: emacs-orgmode@gnu.org
Subject: Re: Smart Quotes Exporting
Date: Fri, 15 Jun 2012 16:20:43 +0000 (UTC)
Message-ID: <loom.20120615T171057-967@post.gmane.org>
In-Reply-To: 874nqgeke6.fsf@gmail.com

Nicolas Goaziou <n.goaziou <at> gmail.com> writes:

> 
> Hello,
> 
> Mark Shoulson <mark <at> kli.org> writes:
> 
> >> ASCII exporter also handle UTF-8. So it's good to have there too.
> >
> > Really?  I would have thought ASCII meant ASCII, as in 7-bit clean
> > text.
> 
> org-e-ascii.el (as old org-ascii.el) handles ASCII, Latin1 and UTF-8
> encodings.

I noticed that after writing my response.  The name just threw me a little.  
Yes, that exporter needs to handle it too.

> > It looked to me like your solution would essentially boil down to "do
> > string handling when there's a string, otherwise recur down and find
> > the strings," which essentially means apply it to all the
> > strings... and there were already functions out there applying things
> > to strings, so this can just ride along with them.  Here, let's look
> > at your suggestion and see if we can find what I missed:
> >
....
> > So, if it's a string, use the regexps (if they can be smart enough to look
> > at beginning and end of the string, which they can--though I haven't been
> > using the :post-blank property so presumably something is amiss), and if
> > it isn't a string, recur down until you get to a string... Ah, but only if
> > it's in org-element-recursive-objects.
> 
> You're missing an important part: the regexps cannot be smart enough for
> quotes at the beginning or the end of the string. There, you must look
> outside the string. Hence:

Well, wait; regexps can make some pretty darn good guesses at the beginnings 
or ends of strings.  Quotations don't normally end in spaces (in the 
conventions used with ""; French typography is different, but if you're using 
spaces around your quotes you have worse problems (line-breaks) to worry 
about).  So if a string ends in space(s) followed by a quote, it's very likely 
that quote is an open-quote for some stuff that comes after.  Conversely, if a 
string starts with a quote followed by some spaces, it's very likely a close-
quote to what went on before.

This isn't quite it; beginning-of-string followed by quote, then punctuation 
and then spaces is also a close-quote, etc... There is a lot of fine-tuning.  
But even what I currently have was able to handle your 

Caesar said, "/Alea Jacta est./"

example.  Yes, there are edge-cases which this won't catch, and it remains to 
be seen how pervasive and annoying those are.  It may be that repeated 
tweaking of regexps will handle enough of the ordinary cases.  It may be that 
after a few rounds of regexp-hacking someone will finally decide that regexp-
hacking just won't handle enough of the important cases.  But I think even as 
it stands now we'd probably handle 80-90% of the normal situations, which 
really is as much as we reasonably can hope for.
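To make the heuristic above concrete, here is a rough sketch of the edge-of-string guesses.  It is written in Python rather than Emacs Lisp, it is not the code from the patch, and the names (`OPEN_AT_END`, `CLOSE_AT_START`, `guess_edge_quotes`) are made up for illustration.  It only encodes the three rules stated above and leaves everything else undecided:

```python
import re

# Whitespace followed by a quote at the very end of the string:
# probably an open-quote for whatever comes after.
OPEN_AT_END = re.compile(r'\s"\Z')

# A quote at the very start, optionally followed by punctuation,
# then whitespace: probably a close-quote for what came before.
CLOSE_AT_START = re.compile(r'\A"[.,;:!?]*\s')


def guess_edge_quotes(fragment):
    """Return (start_guess, end_guess); each is 'open', 'close' or None.

    None means the string alone does not give enough context to guess.
    """
    start = "close" if CLOSE_AT_START.search(fragment) else None
    end = "open" if OPEN_AT_END.search(fragment) else None
    return start, end


if __name__ == "__main__":
    for fragment in ['Caesar said, "', '" he replied.', '", and left.', '"']:
        print(repr(fragment), guess_edge_quotes(fragment))
```

A string consisting of nothing but a quote stays undecided under these rules; that is the kind of edge case the fine-tuning mentioned above would have to deal with.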

Could I trouble someone to apply my patch, try it out, and see just how 
good or bad the results are?  It seems to work okay for the cases I've been 
trying, but maybe my dataset isn't robust enough.  Let's give it a test and 
see how many actual cases in common usage it gets wrong, and then how many of 
those can be fixed by tuning the regexps.

> 
> > ]      1. If it has a quote as its first or last position, check for
> > ]         objects before or after the string to guess its status. An
> > ]         object never starts with a white space, but you may have to
> > ]         check :post-blank property in order to know if previous object
> > ]         had white spaces at its end.
> 
> But you can only do that from the element containing the string, not
> from the string itself.

The case where a quote both sits at the edge of a string (i.e. at the border 
of some element, formatting, etc) *and* has no whitespace next to it (allowing 
for intervening punctuation) does not seem to be a normal occurrence to me.  
If I'm wrong, how common *is* it?
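For comparison, the look-outside-the-string approach from the quoted suggestion might look roughly like this.  Again this is a Python sketch and not org-element code: the dict layout, the `post_blank` key (standing in for the :post-blank property) and both function names are assumptions made only for the example:

```python
def preceded_by_blank(siblings, i):
    """True if whatever comes before siblings[i] ends in whitespace."""
    if i == 0:
        return True  # nothing before it: treat as start of text
    prev = siblings[i - 1]
    if isinstance(prev, str):
        return prev[-1:].isspace()
    # Objects never carry their trailing blanks in their contents;
    # the count is assumed to live in a separate field.
    return prev.get("post_blank", 0) > 0


def guess_leading_quote(siblings, i):
    """Guess 'open' or 'close' for a quote starting the string siblings[i]."""
    s = siblings[i]
    if not (isinstance(s, str) and s.startswith('"')):
        return None
    return "open" if preceded_by_blank(siblings, i) else "close"


if __name__ == "__main__":
    # Caesar said, "/Alea Jacta est./"
    caesar = [
        'Caesar said, "',
        {"type": "italic", "post_blank": 0, "contents": ["Alea Jacta est."]},
        '"',
    ]
    print(guess_leading_quote(caesar, 2))
```

This decides the final quote from the object before it, which is exactly the information a function that only sees the string does not have.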

> 
> > So the issue with the current state is that it would wind up applying to
> > too much? (it would hit code and verbatim elements, for example, and that
> > would be wrong.)
> 
> No, you are not applying it too much (verbatim elements don't contain
> plain-text objects) but your function hasn't got access to enough
> information to be useful.

The on-screen version, of course, will have to be smarter and check for 
the "face" formatting to make sure it doesn't happen in comments or verbatims; 
I am pretty sure it does not do that yet.

> > wait, called on the top-level parsed tree object, recursively doing
> > its thing before(?) the transcoders of the individual objects get to
> > it.
> 
> That's called a parse tree filter. That should be a possibility
> indeed. The function would be applied on the parse tree and would
> replace strings within elements containing plain text (that is
> paragraph, verse-block and table-row types). parse tree filters are
> applied very early in the export process.
> 
> Another option would be to integrate it into
> `org-element-normalize-contents', but I think the previous way is
> better.

Maybe.  I know it sounds like I'm fixated on the plain-text solution, but I'm 
not convinced the envisioned problems are more than theoretical, or that they 
will cause an unacceptable amount of error (keeping in mind that some error 
*is* acceptable and unavoidable).
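As I understand the parse tree filter idea quoted above, it amounts to something like the following.  This is a toy Python model, not the exporter's actual API: the nested-dict tree, the `value` key on the verbatim node, and the name `map_strings` are all hypothetical.  The only thing taken from the discussion is the list of types that contain plain text (paragraph, verse-block, table-row):

```python
PLAIN_TEXT_PARENTS = {"paragraph", "verse-block", "table-row"}


def map_strings(node, fn, inside=False):
    """Return a copy of the tree with fn applied to eligible strings.

    A string is eligible only if some ancestor has one of the
    PLAIN_TEXT_PARENTS types.
    """
    if isinstance(node, str):
        return fn(node) if inside else node
    inside = inside or node["type"] in PLAIN_TEXT_PARENTS
    children = [map_strings(c, fn, inside) for c in node.get("contents", [])]
    return {**node, "contents": children}


if __name__ == "__main__":
    tree = {
        "type": "section",
        "contents": [{
            "type": "paragraph",
            "contents": [
                'say "hi" ',
                {"type": "bold", "contents": ['"now"']},
                # Verbatim text is modelled as a property, not as a child
                # string, so the walk never reaches it.
                {"type": "verbatim", "value": '"raw"', "contents": []},
            ],
        }],
    }
    print(map_strings(tree, lambda s: s.replace('"', "*")))
```

Here the replacement function is a stand-in; a real filter would call the quote-guessing logic instead of a blanket replace.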

> > The on-screen one would still use the plain-string computation, as you
> > said, since the full parse isn't available.
> 
> Yes.
> 
> > It would also need to be tweaked not to act on verbatim/comment text,
> > etc.
> 
> Yes. You may want to use `org-element-at-point' and `org-element-type'
> to tell if you're somewhere smart quotes are allowed (in table,
> table-row, paragraph, verse-block elements).

Probably.  I think I saw some other package make these decisions by peeking at 
the formatting and seeing if it is set in comment-face or something, but 
checking the element at point is presumably more sensible.

~mark
