From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters
Date: Mon, 17 Jul 2017 18:09:46 +0300
Message-ID: <83o9sjcd6t.fsf@gnu.org>
References: <831t725w4k.fsf@gnu.org>
In-reply-to: <831t725w4k.fsf@gnu.org> (message from Eli Zaretskii on Tue, 22 Mar 2016 18:13:15 +0200)
To: p.stephani2@gmail.com
Cc: 23086@debbugs.gnu.org

> Date: Tue, 22 Mar 2016 18:13:15 +0200
> From: Eli Zaretskii
> Cc: 23086@debbugs.gnu.org
>
> > From: Philipp Stephani
> > Date: Tue, 22 Mar 2016 11:42:46 +0100
> >
> > Type some characters
> > C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
> > Type some more characters
> > M-q
> >
> > Expected behavior: Emacs treats these characters as line and paragraph
> > separators: they are displayed as line breaks, M-q doesn't remove them,
> > and forward-paragraph etc. treat the paragraph separator as paragraph
> > end.
> >
> > Actual behavior: These characters are displayed as one-pixel horizontal
> > whitespace and otherwise ignored.
> >
> > Also discussed in
> > https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> > https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> > support for these characters, but I think proper treatment of Unicode
> > separators should be part of Emacs.
>
> It is not clear to me what exactly is the requested feature. Can you
> propose a detailed list of requirements?
>
> I'm asking because these characters come in Unicode with a non-trivial
> baggage, that is a far cry from just breaking the line; see
>
>   http://unicode.org/reports/tr14/
>   http://unicode.org/reports/tr29/
>
> There are also implications on the bidirectional display (it is
> sensitive to where the line and the paragraph begin and end).
>
> If we want to support these two characters, we should think about
> which parts of the relevant functionality we want to see in Emacs,
> because users will expect that.
> In addition, there are other
> white-space characters defined by Unicode, and it would make sense to
> treat them all alike. I'm not sure it makes sense to support just the
> line-breaking and paragraph-separator parts of only these two
> characters.
>
> Then there are Emacs-specific issues, for example:
>
>   . do we treat u+2028 and u+2029 as literal characters, or as a form
>     of EOL encoding?
>   . if the former, how do we distinguish them from newlines on display?
>   . should Isearch find these when looking for "\n"? how about regexp
>     search for "$"?
>
> There are probably more implications, these are just the ones that
> popped in my mind in 5 sec. IOW, I think Someone™ should think this
> over and present a detailed proposal.

So I've dusted off this year-old bug report and decided to improve
Emacs in this area. Here's what I propose:

  . u+2028 and u+2029 (and also perhaps u+0085) will be treated as a
    form of EOL encoding, which means they will not appear on display,
    and will cause the next character to be displayed on the next
    screen line
  . M-q will remove u+2028, as it removes newlines, and put newlines
    at all EOLs as part of filling
  . M-q will NOT remove u+2029, unless the user wants to refill
    several paragraphs as a single paragraph, and there happens to be
    a u+2029 between some of the paragraphs
  . forward-paragraph etc. will treat u+2029 as paragraph end
  . bidi reordering will treat u+2029 as paragraph end

There are some compromises in these decisions, but they make the job
much easier and less intrusive, and I think they will advance the
level of our Unicode support quite a bit.

Comments?

I think we should also make $ match these two characters, in addition
to the newline, but that could be more difficult. Would someone who
knows their way around regex.c want to work on this part?
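
As a rough illustration of the filling rules proposed above, here is a
small model in Python. It is only a sketch of the intended behavior,
not Emacs code: the name fill_text and its width parameter are
hypothetical, and Python's textwrap stands in for the real filling
machinery. It assumes u+2028 acts like a newline inside a paragraph and
u+2029 always ends a paragraph.

```python
import textwrap

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def fill_text(text, width=70):
    """Hypothetical model of the proposed M-q behavior.

    u+2028 is removed, just as newlines are, and each paragraph is
    re-broken with plain newlines.  u+2029 is kept, because it marks a
    paragraph end rather than a line end.
    """
    filled = []
    for paragraph in text.split(PS):
        # Inside a paragraph, u+2028 and "\n" are both mere line breaks,
        # so collapse them (and any other whitespace) into single spaces.
        words = paragraph.replace(LS, " ").split()
        filled.append(textwrap.fill(" ".join(words), width=width))
    # Paragraph separators survive filling.
    return PS.join(filled)


if __name__ == "__main__":
    sample = "first line" + LS + "second line" + PS + "next\nparagraph"
    print(repr(fill_text(sample)))
```

In this model the separator between the two paragraphs is preserved
while the line separator and the newline are both replaced by spaces.
For comparison, Python's own str.splitlines already treats u+2028,
u+2029 and u+0085 as line boundaries, which is close to the "form of
EOL encoding" view taken in the proposal.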