From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: EOL: unix/dos/mac
Date: Wed, 27 Mar 2013 03:12:11 +0900
Message-ID: <87wqsuav44.fsf@uwakimon.sk.tsukuba.ac.jp>
To: Eli Zaretskii
Cc: per.starback@gmail.com, monnier@iro.umontreal.ca, emacs-devel@gnu.org
In-Reply-To: <83obe6z4vq.fsf@gnu.org>

Eli Zaretskii writes:

> > From: "Stephen J. Turnbull"
> >
> > Currently NLFs *are* displayed, if they don't match the default for
> > the buffer.
>
> No, they are displayed because nothing other than a single LF is
> treated like NLF by the Emacs internals.

Emacs doesn't get to define NLF; it's a Unicode concept. You'll get in trouble if you get confused about that. Those *are* NLFs, and in the "CR in *-unix buffer" form they *are* displayed as "^M"s, while in the "bare LF in *-dos buffer" form they *do* appear as stair-stepping lines. That does bother some users, including some who understand why it happens.

> > Because you have to fix pretty much everything
>
> I'm probably missing something important, because things I think will
> need fixing are nowhere near "pretty much everything". How about
> posting a long enough list of things to fix to convince me that
> "pretty much everything" is close to the truth?

"Everything" is of course an exaggeration. At a minimum, you need to change delete and motion commands to handle the fact that EOL doesn't have a constant width in characters.
Should users be able to move *into* a CRLF in a -unix buffer? How about a -dos buffer? Should forward-char-command move into or *over* a CRLF? Does it matter what the EOL convention is for that buffer? What are we going to do for the occasional user who wants the less usual behavior for some reason?

You need to decide what (insert "\015") means in a -dos buffer, and you can be pretty sure that some users will be confused whichever you choose. Ditto (insert "\012") in a -mac buffer. You may very well want those to mean something different from the commands that self-insert either or both of those characters.

Until now, skip-chars-forward and regexps would find EOL if the string defining the target contained "\n". Is that going to continue to be true? How do you propose to find a bare LF -- are we going to make users use octal or hex escapes, or do we define new string syntax?

> > Code will be massively uglified with tests for variable-length
> > sequences instead of single characters
>
> The code is already replete with that, ever since Emacs started using
> a multi-byte representation for characters in buffers. We have a set
> of macros to fetch and examine multi-byte sequences, for that reason.
> I see nothing hard or "ugly" here, sorry.

Ah, but this is a completely different story. Those are C macros, and not visible to Lisp programs, which know that a line break is represented by a single character, U+000A. That's no longer true for NLF, which by definition is composed of one or more *characters*, not code units. It's *Lisp* code that has to deal with this.

> > Any code handling old-style hidden lines (with CR marking
> > "invisible" lines) will have to be changed.
>
> First, we want to deprecate and remove this feature anyway (there's
> already an implemented alternative). And second, we already handle
> this today so that we don't display ^M there; the same method can be
> used for the other NLFs.

Sorry, that breaks immediately.
That ^M is now an NLF, and you either treat it that way and not as an invisibility marker, or the meaning of the buffer changes when you switch that mode on and off in a very delicate way. I'm pretty sure it will corrupt the buffer unless you mark preexisting ^Ms as NLFs or convert them to something else. Which is what I'm proposing, of course.

So you can fall back on deprecation. Has the feature actually been scheduled for deprecation and eventual removal? If not, you're looking at 5-10 years before it gets removed.

> If the problem _is_ significant, we might as well solve it The
> Right Way, instead of applying more and more band-aid. Conversion
> of NLFs to a single LF is a kludge,

Not to mention a close approximation to the right way to handle them according to the Unicode standard under many circumstances. (The truly correct way to handle them is to substitute LINE SEPARATOR, as I mentioned earlier.)

> You cannot do such conversion efficiently if you need to discover
> the EOL format for every line.

Of course you can. You don't need to "discover" the EOL format; you know that an EOL is anything that matches "\r\n\|\r\|\n\|\205" as you move forward through the buffer. It's only a tiny bit more expensive than current conversion for -dos or -mac, and those are hardly prohibitive, especially when compared to I/O itself.

> What it adds doesn't seem so frightening to me, certainly less so
> than, say, adding bidi support ;-)

Agreed, but irrelevant. bidi is a new feature necessary to support some languages currently used by millions of people, and the hairiness is mandated by UAX #9 -- an alternative implementation is not going to make conformance much easier. What we're talking about here are alternative implementations of a much smaller feature, NLF, and which one is going to be more efficient and more natural for Emacs.

> The internal representation is still exposed, so nothing's changed in
> that department.

I know, and taking advantage of that exposure still falls in the class of "Kids, these stunts are performed by trained professionals. Don't try this at home!" Can you deny that?

> > I think you're hearing monsters in the closet.
>
> And I think _you_ are hearing them.

Well, yes, I am. But I've worked with implementations of coding systems in both XEmacs and Python, and I know that what I'm talking about will work and be efficient, and buffers and strings will continue to conform to the Emacs model. I know that what you're talking about will break some invariants for character motion and editing at line end, and that worries me.

Proof? You're right, I have none. By the same token, you don't either. What worries me is that while I can prove (or perhaps disprove) my point with a small set of unit tests and benchmarks, you will have to hand that version of Emacs to real users for a year or three to find out if anybody really cares that the model broke.

> Or maybe you will show me such a large list of things that will
> become broken by keeping NLFs that I will change my mind.

I can't; I gave you my list already, and I grant that it's not all that long and several of the potential problems can't be confirmed at this point. But if you decide to keep NLFs in the buffer rather than conforming to the tried and true Emacs/Mule model of converting them to a one-character representation, I predict you will find plenty of breakage over the years, just as the \201 bug regressed multiple times over something like a decade.

It's true that keeping NLFs in the buffer will bring Emacs's internal representation into closer conformance with the Unicode Standard, but both the benefits and the costs of that are unclear to me. Sure, it makes it conceptually straightforward to support Unicode handling of NLF in regexps, but you can already do that by simply avoiding EOL conversion when you need highly accurate Unicode conformance.
On the other hand, when you are treating NLFs as NLFs, you will be breaking the 40-year-old Emacs model of a linebreak marked by a single character. I don't know what trouble that will cause, but there's no easy workaround for it that preserves those NLFs.
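
To make the conversion argument concrete (an EOL is anything matching "\r\n\|\r\|\n\|\205"), here is a minimal sketch in Python rather than Emacs Lisp or C. The names NLF, normalize_nlf, and line_break_widths are hypothetical and exist only for this example; this is not how Emacs or XEmacs actually implements EOL conversion. It shows the single-pass replacement, and for comparison the width of each line break when the NLFs are left in the text.

```python
import re

# Hypothetical illustration, not Emacs code.
# The NLFs named in the message: CRLF, bare CR, bare LF, and NEL
# (U+0085, written \205 in octal). CRLF comes first in the alternation
# so that it matches as one unit instead of as CR followed by LF.
NLF = re.compile("\r\n|\r|\n|\x85")


def normalize_nlf(text):
    """Replace every NLF with a single LF in one left-to-right pass."""
    return NLF.sub("\n", text)


def line_break_widths(text):
    """Return the width, in characters, of each NLF in order of appearance."""
    return [len(match.group()) for match in NLF.finditer(text)]


mixed = "one\r\ntwo\rthree\nfour\x85five"

print(repr(normalize_nlf(mixed)))                 # 'one\ntwo\nthree\nfour\nfive'
print(line_break_widths(mixed))                   # [2, 1, 1, 1]
print(line_break_widths(normalize_nlf(mixed)))    # [1, 1, 1, 1]
```

No per-line "discovery" of the EOL format is involved: the scan simply takes whichever alternative matches at each position. After normalization every line break is exactly one character wide, which is the invariant that motion and deletion code has relied on; in the raw text the widths are mixed.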