unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: p.stephani2@gmail.com
Cc: 23086@debbugs.gnu.org
Subject: bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters
Date: Mon, 17 Jul 2017 18:09:46 +0300
Message-ID: <83o9sjcd6t.fsf@gnu.org>
In-Reply-To: <831t725w4k.fsf@gnu.org> (message from Eli Zaretskii on Tue, 22 Mar 2016 18:13:15 +0200)

> Date: Tue, 22 Mar 2016 18:13:15 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 23086@debbugs.gnu.org
> 
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 22 Mar 2016 11:42:46 +0100
> > 
> > Type some characters
> > C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
> > Type some more characters
> > M-q
> > 
> > Expected behavior: Emacs treats these characters as line and paragraph
> > separators: they are displayed as line breaks, M-q doesn't remove them,
> > and forward-paragraph etc. treat the paragraph separator as paragraph
> > end.
> > 
> > Actual behavior: These characters are displayed as one-pixel horizontal
> > whitespace and otherwise ignored.
> > 
> > Also discussed in
> > https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> > https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> > support for these characters, but I think proper treatment of Unicode
> > separators should be part of Emacs.
> 
> It is not clear to me what exactly is the requested feature.  Can you
> propose a detailed list of requirements?
> 
> I'm asking because these characters come in Unicode with non-trivial
> baggage that is a far cry from just breaking the line; see
> 
>   http://unicode.org/reports/tr14/
>   http://unicode.org/reports/tr29/
> 
> There are also implications on the bidirectional display (it is
> sensitive to where the line and the paragraph begin and end).
> 
> If we want to support these two characters, we should think about
> which parts of the relevant functionality we want to see in Emacs,
> because users will expect that.  In addition, there are other
> white-space characters defined by Unicode, and it would make sense to
> treat them all alike.  I'm not sure it makes sense to support just the
> line-breaking and paragraph-separator parts of only these two
> characters.
> 
> Then there are Emacs-specific issues, for example:
> 
>  . do we treat u+2028 and u+2029 as literal characters, or as a form
>    of EOL encoding?
>  . if the former, how do we distinguish them from newlines on display?
>  . should Isearch find these when looking for "\n"? how about regexp
>    search for "$"?
> 
> There are probably more implications; these are just the ones that
> popped into my mind in 5 sec.  IOW, I think Someone™ should think this
> over and present a detailed proposal.

So I've dusted off this year-old bug report and decided to improve
Emacs in this area.  Here's what I propose:

 . u+2028 and u+2029 (and perhaps also u+0085) will be treated as a
   form of EOL encoding, which means they will not appear on display,
   and will cause the next character to be displayed on the next
   screen line
 . M-q will remove u+2028, as it removes newlines, and put newlines
   at all EOLs as part of filling
 . M-q will NOT remove u+2029, unless the user wants to refill several
   paragraphs as a single paragraph, and there happens to be a u+2029
   between some of the paragraphs
 . forward-paragraph etc. will treat u+2029 as paragraph end
 . bidi reordering will treat u+2029 as paragraph end
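
To make the display and filling rules above concrete, here is a rough
sketch in Python (not Emacs Lisp, and not how Emacs would implement
any of this).  The names screen_lines and fill_text are made up for
the illustration, and the sketch assumes that u+2029 is the only
paragraph boundary; blank lines, fill prefixes, and bidi are ignored.

```python
import re
import textwrap

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# Characters the proposal would treat as a form of EOL (u+0085 is the
# "perhaps" case from the list above).
EOL = re.compile("[\n\u0085\u2028\u2029]")


def screen_lines(text):
    """Split text into screen lines; the EOL characters are not shown."""
    return EOL.split(text)


def fill_text(text, width=70):
    """Refill each paragraph separately (hypothetical M-q stand-in).

    u+2028 is dropped like a newline and replaced by ordinary line
    breaks where filling needs them; u+2029 is kept as the paragraph
    boundary.
    """
    filled = []
    for para in text.split(PS):
        # Treat u+2028 and newline alike: both become plain spaces.
        words = para.replace(LS, " ").split()
        filled.append(textwrap.fill(" ".join(words), width=width))
    return PS.join(filled)
```

With these definitions, fill_text("foo\u2028bar\nbaz\u2029second para")
joins the first three words into one line and leaves the u+2029 in
place between the two paragraphs.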

There are some compromises in these decisions, but they make the job
much easier and less intrusive, and I think they will advance the
level of our Unicode support quite a bit.

Comments?

I think we should also make $ match these two characters, in addition
to the newline, but that could be more difficult.  Would someone who
knows their way around regex.c want to work on this part?
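
For illustration only, the desired behavior of $ can be emulated with
a lookahead in Python's re module.  This says nothing about how
regex.c would have to change; EOL_ASSERT and line_end_positions are
hypothetical names used just for this sketch.

```python
import re

# Hypothetical replacement for "$": succeeds before a newline, before
# u+2028 or u+2029, or at the very end of the text.
EOL_ASSERT = "(?=[\n\u2028\u2029]|\\Z)"


def line_end_positions(text):
    """Return every position where the extended end-of-line matches."""
    return [m.start() for m in re.finditer(EOL_ASSERT, text)]


def newline_only_positions(text):
    """Return the positions where a newline-only "$" matches."""
    return [m.start() for m in re.finditer("$", text, re.MULTILINE)]
```

On a string such as "ab\u2028cd\u2029ef\ngh", the newline-only version
reports a line end before the "\n" and at the end of the text, while
the extended version also reports one before each of the two Unicode
separators.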





