* Buffer names with R2L characters @ 2011-06-20 16:21 Eli Zaretskii 2011-06-20 18:00 ` Stefan Monnier ` (3 more replies) 0 siblings, 4 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-20 16:21 UTC (permalink / raw) To: emacs-devel I bumped into this annoyance while working on bidi reordering of strings. As some of you know, the mode line is constructed from C and Lisp strings, and bidi.c can now reorder them (for now, only its new and improved version in my local branch). Once I had this half-working, the first thing I tested was visiting files whose names include R2L characters. It works, but there's one problem: the "<N>" tails we attach to buffer names to make them unique. The problem is that the '<' and '>' characters are "other neutral", or "ON", in the UAX#9 parlance, and so their directionality depends on the surrounding characters. As result, a buffer name typed as ABCDEF<2> is displayed in the mode line like this: 2>FEDCBA> I verified this with the Unicode Reference Implementation, and there's no bug in bidi.c: this is the correct reordering according to the Unicode Bidirectional Algorithm. I can fix this in most prominent use cases -- the mode line, the buffer menu, and even in the prompt produced by read-buffer -- by appending a suitable character to the end of the string (after the numeric tail) and making it invisible with text properties. But this sounds kludgey, and of course sooner or later the "2>FEDCBA>" thingy will show somewhere, e.g. if someone coughs up their own mode-line-format and use buffer-name directly, or whatever. However, I don't see a better way out of this, and leaving it as it is would be too ugly, IMO. If someone has better ideas, I'm all ears. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii @ 2011-06-20 18:00 ` Stefan Monnier 2011-06-20 20:52 ` Eli Zaretskii 2011-06-20 20:13 ` James Cloos ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: Stefan Monnier @ 2011-06-20 18:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > Once I had this half-working, the first thing I tested was visiting > files whose names include R2L characters. It works, but there's one > problem: the "<N>" tails we attach to buffer names to make them > unique. The problem is that the '<' and '>' characters are "other > neutral", or "ON", in the UAX#9 parlance, and so their directionality > depends on the surrounding characters. As result, a buffer name typed > as ABCDEF<2> is displayed in the mode line like this: > 2> FEDCBA> Sounds like the same kind of issue as the one brought up a while ago about bidi+XML (or any other mark up). Another important case is when you use uniquify (in which case the above will happen less frequently since file buffers add the directory info to disambiguate the name, but may be replace by similar problems if the separator between the file and directory part (typically /, or \, or |) doesn't have the "right" bidi behavior). > I verified this with the Unicode Reference Implementation, and there's > no bug in bidi.c: this is the correct reordering according to the > Unicode Bidirectional Algorithm. > appending a suitable character to the end of the string (after the > numeric tail) and making it invisible with text properties. But this Hmm, I would have expected that adding something at the end of the string would not work ("too late"), whereas adding it between "ABCDEF" and "<2>" would have felt very natural to me. I guess it just shows how little I know of bidi, Stefan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 18:00 ` Stefan Monnier @ 2011-06-20 20:52 ` Eli Zaretskii 0 siblings, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-20 20:52 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Mon, 20 Jun 2011 14:00:34 -0400 > > > 2>FEDCBA> > > Sounds like the same kind of issue as the one brought up a while ago > about bidi+XML (or any other mark up). It's caused by the bidirectional properties of the '<' and '>' characters, but other than that, this has nothing to do with XML. > Another important case is when you use uniquify (in which case the > above will happen less frequently since file buffers add the directory > info to disambiguate the name, but may be replace by similar problems > if the separator between the file and directory part (typically /, or > \, or |) doesn't have the "right" bidi behavior). You get something like foo/bar/FEDCBA, so there's no problem here, I think. > > I verified this with the Unicode Reference Implementation, and there's > > no bug in bidi.c: this is the correct reordering according to the > > Unicode Bidirectional Algorithm. > > > appending a suitable character to the end of the string (after the > > numeric tail) and making it invisible with text properties. But this > > Hmm, I would have expected that adding something at the end of the string > would not work ("too late"), whereas adding it between "ABCDEF" and > "<2>" would have felt very natural to me. The final "resolved level" of a weak character depends on characters on its both sides, not just on one side. So there's no "too late". When '>' is the last character, the algorithm uses a default value for the absent character after it, and the default depends on the current paragraph direction, which must be L2R both in the mode line and in buffer-menu. Thus, '>' gets the (default) L2R direction in this case. 
Appending a zero leaves '>' surrounded by two digits, so it gets the R2L direction (because the digits are embedded in a run of R2L characters) and is mirrored into '<'. ^ permalink raw reply [flat|nested] 21+ messages in thread
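The character classes this explanation relies on can be checked outside Emacs. What follows is only a small sketch in Python (not Emacs Lisp, and not code from bidi.c); it uses the standard `unicodedata` module, and the helper name `bidi_classes` as well as the sample name (three Hebrew letters followed by "<2>") are made up for illustration. It prints the UAX#9 class of each character and whether the character is mirrored when it ends up in an R2L run.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidi class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical buffer name: alef, bet, gimel followed by the "<2>" tail.
name = "\u05d0\u05d1\u05d2<2>"

for ch, cls in bidi_classes(name):
    mirrored = bool(unicodedata.mirrored(ch))
    print(f"U+{ord(ch):04X}  class={cls:<3} mirrored={mirrored}")
```

The Hebrew letters come out as R, the digit as EN (a weak type) and both '<' and '>' as ON with the mirrored property set, which is why their final direction is decided by their neighbours rather than by the characters themselves.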
* Re: Buffer names with R2L characters 2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii 2011-06-20 18:00 ` Stefan Monnier @ 2011-06-20 20:13 ` James Cloos 2011-06-20 21:08 ` Eli Zaretskii 2011-06-20 21:06 ` Kalle Olavi Niemitalo 2011-06-21 16:52 ` Ehud Karni 3 siblings, 1 reply; 21+ messages in thread From: James Cloos @ 2011-06-20 20:13 UTC (permalink / raw) To: emacs-devel; +Cc: Eli Zaretskii The bidi algorithm is just not designed for markup (and the <digits> tag /is/ markup). Ideally there would be a 0-width break before the <digits> or a way to mark the <digits> blob as non-neutral. Whether the result should display as <12>FEDCBA, <21>FEDCBA or FEDCBA<12>, though, I have no idea. (That is, I don't know which of those users would prefer. I presume that an ordering break would result in the third.) I doubt that it can be properly fixed, though, w/o also fixing Unicode's algorithm to better handle markup interspersed in the main text. In general, each blob of markup should be handled as its own document, and the result of that should be treated as a single (perhaps neutral) character from the point of view of the enclosing text. That would fix things for this issue, sgml/xml, TeX, source code of every type (some of the markup there is implicit, but still logically extant) et cetera. For that to work the engine obviously needs to know what markup looks like, which requires additional meta information about each document. The buffer mode helps, but may not be sufficient? Which is probably why Unicode, with their plain-text emphasis, ignored it. -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 20:13 ` James Cloos @ 2011-06-20 21:08 ` Eli Zaretskii 2011-06-21 4:26 ` Stephen J. Turnbull 2011-06-21 4:33 ` Miles Bader 0 siblings, 2 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-20 21:08 UTC (permalink / raw) To: James Cloos; +Cc: emacs-devel > From: James Cloos <cloos@jhcloos.com> > Cc: Eli Zaretskii <eliz@gnu.org> > Date: Mon, 20 Jun 2011 16:13:15 -0400 > > The bidi algorithm is just not designed for markup (and the <digits> tag > /is/ markup). No, it isn't. Not every use of '<' and '>' is markup. They can also be used in context such as "n > N" etc. They are just characters; XML did not appropriate them just because it uses them for markup. > Ideally there would be a 0-width break before the <digits> It's possible (we could use LRM), but it isn't ideal, because text terminals will have trouble displaying it. That's why I think using a character covered by invisible text property is better. > or a way to mark the <digits> blob as non-neutral. This feature is still far away. I thought about it, and concluded that implementing it is not trivial, or at least there were a couple of problems for which I couldn't think of a good solution yet. I don't think we should wait for such a feature and in the meantime display "12>FEDCBA>" as buffer name. I'd like to have a solution, even if an interim one, for Emacs 24.1. > Whether the result should display as <12>FEDCBA, <21>FEDCBA or FEDCBA<12>, > though, I have no idea. We can decide whatever we want (but not the one with "21": the order of digits in a number should be left to right). But any of these needs "help" to get them look like that, because the UBA simply cannot treat these cases gracefully. > I presume that a ordering break would result in the third. What's an "ordering break"? > For that to work the engine obviously needs to know what markup looks like, > which requires additional meta information about each document. 
The buffer > mode helps, but may not be sufficient? The Emacs reordering engine doesn't know anything about major modes, text properties, overlays, and other meta information. The only exceptions are (1) `display' properties and (2) buffer narrowing limits. This design is intentional, because otherwise the same code could not be used for reordering text independently of redisplay, e.g. for producing visual-order encodings, like if you wanted to print R2L text on a relatively dumb printer or send it to a process that doesn't grok bidi. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 21:08 ` Eli Zaretskii @ 2011-06-21 4:26 ` Stephen J. Turnbull 2011-06-21 6:28 ` Eli Zaretskii 2011-06-21 4:33 ` Miles Bader 1 sibling, 1 reply; 21+ messages in thread From: Stephen J. Turnbull @ 2011-06-21 4:26 UTC (permalink / raw) To: Eli Zaretskii; +Cc: James Cloos, emacs-devel Eli Zaretskii writes: > > From: James Cloos <cloos@jhcloos.com> > > Cc: Eli Zaretskii <eliz@gnu.org> > > Date: Mon, 20 Jun 2011 16:13:15 -0400 > > > > The bidi algorithm is just not designed for markup (and the <digits> tag > > /is/ markup). > > No, it isn't. Not every use of '<' and '>' is markup. Please rethink here, Eli. In the sense Jim is talking about, at the conceptual level this use case is indeed markup, ie, metadata encoded in plain text. There may be better ways of solving the display problem than taking that literally. > They can also be used in context such as "n > N" etc. They are > just characters; XML did not appropriate them just because it uses > them for markup. Of course this is true, but it misses his point, which is that in a markup context, there is text data (in SGML, CDATA or PCDATA) and there is metadata. From the point of view of working with these higher-level protocols, it may desirable that each run of text data be treated separately in an editor. There's nothing in the Unicode standard that says that Emacs couldn't have a `non-bidi' syntax class, which when applied as a property to text in a buffer would break that buffer into two or more bidi "streams" to which the algorithm would be applied independently. However, this is an implementation detail. If life is easier if you consider every textual object (buffer or string) as a single stream to which the bidi algorithm should be applied, there's nothing wrong with that, either. > This feature is still far away. 
I thought about it, and concluded > that implementing it is not trivial, or at least there were a couple > of problems for which I couldn't think of a good solution yet. I'm happy to accept your judgment here, of course. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 4:26 ` Stephen J. Turnbull @ 2011-06-21 6:28 ` Eli Zaretskii 2011-06-21 8:44 ` Stephen J. Turnbull 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 6:28 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: cloos, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: James Cloos <cloos@jhcloos.com>, > emacs-devel@gnu.org > Date: Tue, 21 Jun 2011 13:26:50 +0900 > > Eli Zaretskii writes: > > > From: James Cloos <cloos@jhcloos.com> > > > Cc: Eli Zaretskii <eliz@gnu.org> > > > Date: Mon, 20 Jun 2011 16:13:15 -0400 > > > > > > The bidi algorithm is just not designed for markup (and the <digits> tag > > > /is/ markup). > > > > No, it isn't. Not every use of '<' and '>' is markup. > > Please rethink here, Eli. In the sense Jim is talking about, at the > conceptual level this use case is indeed markup, ie, metadata encoded > in plain text. There may be better ways of solving the display > problem than taking that literally. Well, the fact that I started this discussion is a sign that I'm willing to rethink ;-) However, I'm not sure what you are suggesting to rethink, specifically. What I said is that in foo<1>, the "<1>" part is not a markup, it's just part of a string that is a buffer name. If you disagree, then please explain why this use of <..> should be considered markup. Just the fact that it uses <..> is not enough, IMO. > > They can also be used in context such as "n > N" etc. They are > > just characters; XML did not appropriate them just because it uses > > them for markup. > > Of course this is true, but it misses his point, which is that in a > markup context, there is text data (in SGML, CDATA or PCDATA) and > there is metadata. From the point of view of working with these > higher-level protocols, it may desirable that each run of text data be > treated separately in an editor. And it will be, when Emacs is taught to reorder non-plain text. 
I never said nor thought that we should reorder markup text as if it were plain text -- that'd be terribly wrong. Similarly, we should reorder only comments and strings while displaying program source files. Both these use cases call for selectively reordering an otherwise strictly L2R text. Emacs should be able to do that, and in doing so it will need to rely heavily on the buffer's major mode. But the necessary infrastructure is not yet in Emacs, and it won't be there in time for Emacs 24.1. Moreover, I'm not sure it would be a good idea to use such an infrastructure for FOO<2> buffer names even if it were available. If you agree with me that <2> is not markup, then using methods designed for markup languages would be as kludgey as appending an invisible character: they both use a trick not intended for this use case. There's no major mode here to help us DTRT with a string that is part of the mode line. > There's nothing in the Unicode standard that says that Emacs couldn't > have a `non-bidi' syntax class, which when applied as a property to > text in a buffer would break that buffer into two or more bidi > "streams" to which the algorithm would be applied independently. Right. I didn't mean to say otherwise. > However, this is an implementation detail. If life is easier if you > consider every textual object (buffer or string) as a single stream to > which the bidi algorithm should be applied, there's nothing wrong with > that, either. Well, that won't solve the problem of displaying markup languages and program sources, so it cannot be that simple ;-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 6:28 ` Eli Zaretskii @ 2011-06-21 8:44 ` Stephen J. Turnbull 2011-06-21 14:28 ` Eli Zaretskii 0 siblings, 1 reply; 21+ messages in thread From: Stephen J. Turnbull @ 2011-06-21 8:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: cloos, emacs-devel Eli Zaretskii writes: > However, I'm not sure what you are suggesting to rethink, > specifically. What I said is that in foo<1>, the "<1>" part is not a > markup, it's just part of a string that is a buffer name. Well, it's not markup in the sense of HTML (ie, display markup), but it is markup in the sense of XML (semantic markup). It's true that the appended string is arbitrary, and that the relationship to the "desired" buffer name is quite arbitrary (you could use alphabetic characters instead of numerals, for example). But it is used to disambiguate what to the user would otherwise be identical names, and the specific form clearly indicates that this is "metadata". Buffers on main.c and main.c~ clearly represent files with different names, while main.c and main.c<2> by convention represent files with the same name. It is this convention, not the use of "<>", that makes the uniquifer "<1>" into markup. [Each run of plain text in a markup buffer should be treated as a separate "stream" of text for bidi display purposes, or something like that.] > And it will be, when Emacs is taught to reorder non-plain text. OK. > I never said nor thought that we should reorder markup text as if > it were plain text -- that'd be terribly wrong. > But the necessary infrastructure is not yet in Emacs, and it won't be > there in time for Emacs 24.1. OK, that's what I wanted to say myself, but it wasn't clear to me that was your reasoning. > There's no major mode here to help us DTRT with a string that is part > of the mode line. Sure there is, "mode line mode". ;-) Mode lines have a syntax, and so do buffer-names (when uniquifying). 
The fact that there's no major mode we use in buffers that's like that is not really relevant. Of course if you paste a buffer-name into a buffer in some major mode, you may run into problems. But that's always the case when changing syntax models on the same object. > > However, this is an implementation detail. If life is easier if you > > consider every textual object (buffer or string) as a single stream to > > which the bidi algorithm should be applied, there's nothing wrong with > > that, either. > > Well, that won't solve the problem of displaying markup languages and > program sources, so it cannot be that simple ;-) That's exactly the kind of thing where (at least for the next year or so) I'm just gonna have to trust you. :-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 8:44 ` Stephen J. Turnbull @ 2011-06-21 14:28 ` Eli Zaretskii 0 siblings, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 14:28 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: cloos, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: cloos@jhcloos.com, > emacs-devel@gnu.org > Date: Tue, 21 Jun 2011 17:44:05 +0900 > > > There's no major mode here to help us DTRT with a string that is part > > of the mode line. > > Sure there is, "mode line mode". ;-) Mode lines have a syntax, and so > do buffer-names (when uniquifying). I think the difference is significant. Mode line in Emacs is fully programmable, and can be programmed to display anything in any order and form. There are much less rules here than in any major mode, because the latter is constrained by externally imposed rules of the "language" supported by the mode. By contrast, I can program my mode line to break any and all "syntax" that users of the default mode-line-format are used to. Even uniquifying buffer names can be done in several different flavors, out of the box. Anyway, I'm perfectly happy to leave the display of such names as the UBA would have them, and mark this as a temporarily missing feature. At least the MS-Windows file manager displays such names the same (cannot test on GNU/Linux where I'm typing this), so we have nothing to be ashamed of. Btw, the numbered backup files suffer from the same problem, their buffer names are displayed as 1~.RABOOF~ instead of ~1~.RABOOF. So it's not just the duplicate file names that will trigger this. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 21:08 ` Eli Zaretskii 2011-06-21 4:26 ` Stephen J. Turnbull @ 2011-06-21 4:33 ` Miles Bader 2011-06-21 6:30 ` Eli Zaretskii 2011-06-21 7:26 ` David Kastrup 1 sibling, 2 replies; 21+ messages in thread From: Miles Bader @ 2011-06-21 4:33 UTC (permalink / raw) To: Eli Zaretskii; +Cc: James Cloos, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > I don't think we should wait for such a feature and in the meantime > display "12>FEDCBA>" as buffer name. I'd like to have a solution, > even if an interim one, for Emacs 24.1. Do you think this case will be common enough in practice to be worth worrying about in the short term? In my experience redundant buffer/file names are not very common for _arbitrary_ files. Rather, they tend to occur with very specific names where the name is not chosen by the user, but rather is due to some external standard -- e.g. "Makefile", "*shell*", "README". But I think such "standard" names are also more likely to be written just using ASCII, and so won't tickle this problem. Do you observe otherwise in your environment? [hmm, what's Hebrew for "README"... :] -Miles -- We are all lying in the gutter, but some of us are looking at the stars. -Oscar Wilde ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 4:33 ` Miles Bader @ 2011-06-21 6:30 ` Eli Zaretskii 2011-06-21 7:26 ` David Kastrup 1 sibling, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 6:30 UTC (permalink / raw) To: Miles Bader; +Cc: cloos, emacs-devel > From: Miles Bader <miles@gnu.org> > Cc: James Cloos <cloos@jhcloos.com>, emacs-devel@gnu.org > Date: Tue, 21 Jun 2011 13:33:13 +0900 > > Do you think this case will be common enough in practice to be worth > worrying about in the short term? I have no idea. But if it's not common, then the kludge I suggested won't bother anyone, either, right? > Do you observe otherwise in your environment? [hmm, what's Hebrew for > "README"... :] I try very hard to stick to English ;-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 4:33 ` Miles Bader 2011-06-21 6:30 ` Eli Zaretskii @ 2011-06-21 7:26 ` David Kastrup 1 sibling, 0 replies; 21+ messages in thread From: David Kastrup @ 2011-06-21 7:26 UTC (permalink / raw) To: emacs-devel Miles Bader <miles@gnu.org> writes: > Eli Zaretskii <eliz@gnu.org> writes: >> I don't think we should wait for such a feature and in the meantime >> display "12>FEDCBA>" as buffer name. I'd like to have a solution, >> even if an interim one, for Emacs 24.1. > > Do you think this case will be common enough in practice to be worth > worrying about in the short term? > > In my experience redundant buffer/file names are not very common for > _arbitrary_ files. Rather, they tend to occur with very specific > names where the name is not chosen by the user, but rather is due to > some external standard -- e.g. "Makefile", "*shell*", "README". I have them quite often when comparing/merging versions of files. -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii 2011-06-20 18:00 ` Stefan Monnier 2011-06-20 20:13 ` James Cloos @ 2011-06-20 21:06 ` Kalle Olavi Niemitalo 2011-06-21 2:51 ` Eli Zaretskii 2011-06-21 16:52 ` Ehud Karni 3 siblings, 1 reply; 21+ messages in thread From: Kalle Olavi Niemitalo @ 2011-06-20 21:06 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > But this sounds kludgey, and of course sooner or later the > "2>FEDCBA>" thingy will show somewhere, e.g. if someone coughs > up their own mode-line-format and use buffer-name directly, or > whatever. Perhaps you could make buffer-name return a string with text properties that let it display in the preferable way. I don't know whether the bidi code supports such properties. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 21:06 ` Kalle Olavi Niemitalo @ 2011-06-21 2:51 ` Eli Zaretskii 0 siblings, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 2:51 UTC (permalink / raw) To: Kalle Olavi Niemitalo; +Cc: emacs-devel > From: Kalle Olavi Niemitalo <kon@iki.fi> > Date: Tue, 21 Jun 2011 00:06:29 +0300 > > Eli Zaretskii <eliz@gnu.org> writes: > > > But this sounds kludgey, and of course sooner or later the > > "2>FEDCBA>" thingy will show somewhere, e.g. if someone coughs > > up their own mode-line-format and use buffer-name directly, or > > whatever. > > Perhaps you could make buffer-name return a string with text > properties that let it display in the preferable way. I don't > know whether the bidi code supports such properties. That's what James was talking about. As I replied, there are no such text properties in Emacs yet, and implementing them will not be trivial, so it won't be in Emacs 24.1. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii ` (2 preceding siblings ...) 2011-06-20 21:06 ` Kalle Olavi Niemitalo @ 2011-06-21 16:52 ` Ehud Karni 2011-06-21 17:24 ` Eli Zaretskii 3 siblings, 1 reply; 21+ messages in thread From: Ehud Karni @ 2011-06-21 16:52 UTC (permalink / raw) To: eliz; +Cc: stephen, miles, monnier, cloos, emacs-devel On Mon, 20 Jun 2011 19:21:00 +03:00 Eli Zaretskii wrote: > A buffer name typed as ABCDEF<2> is displayed in the mode line > like this: > > 2>FEDCBA> > > [snip] > > I can fix this .... by appending a suitable character to the end of > the string (after the numeric tail) and making it invisible with text > properties. The "right" way to fix this is to have the buffer name between 2 zero width LRM characters. On Mon, 20 Jun 2011 14:01:00 -04:00 Stefan Monnier wrote: SM> Another important case is when you use uniquify ... file buffers SM> add the directory info to disambiguate the name ... On Mon, 20 Jun 2011 23:52:44 +03:00 Eli Zaretskii answered: EZ> You get something like foo/bar/FEDCBA, so there's no problem here, EZ> I think. I think Eli is wrong here. An example will help, a file with the (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear as: "def ONML|KIHG qrst" which is clearly wrong. My way to solve it is as above, i.e. add zero width LRM on both sides of the separator (/ or |) in addition to the enclosing LRMs. The problem is even greater in `dired' with files that have ALL Hebrew names. If you have a Hebrew locale, the date has Hebrew in it (the month name) then it has some digits and ":" (all neutrals and weak L2R ) and then the file name. The bidi algorithm actually exchange the month and file name. File names are trouble - here is a paragraph from UAX#9: However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. 
To deal with these cases, a minimal set of directional formatting codes is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can ========= always be correctly ordered for display. On Tue, 21 Jun 2011 13:33:13 +0900 Miles Bader wrote: > Do you think this case will be common enough in practice to be worth > worrying about in the short term? From work for an Israeli company, I can assure you it is common. For example there are 3 different directories named "SHIRIM" (songs), and at least 5 different directories with the Hebrew name of one of the client on public disk. Ehud. -- Ehud Karni Tel: +972-3-7966-561 /"\ Mivtach - Simon Fax: +972-3-7976-561 \ / ASCII Ribbon Campaign Insurance agencies (USA) voice mail and X Against HTML Mail http://www.mvs.co.il FAX: 1-815-5509341 / \ GnuPG: 98EA398D <http://www.keyserver.net/> Better Safe Than Sorry ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 16:52 ` Ehud Karni @ 2011-06-21 17:24 ` Eli Zaretskii 2011-06-21 17:59 ` Ehud Karni 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 17:24 UTC (permalink / raw) To: ehud; +Cc: stephen, miles, monnier, cloos, emacs-devel > Date: Tue, 21 Jun 2011 19:52:18 +0300 > From: "Ehud Karni" <ehud@unix.mvs.co.il> > Cc: emacs-devel@gnu.org, monnier@iro.umontreal.ca, cloos@jhcloos.com, > stephen@xemacs.org, miles@gnu.org > > > 2>FEDCBA> > > > > [snip] > > > > I can fix this .... by appending a suitable character to the end of > > the string (after the numeric tail) and making it invisible with text > > properties. > > The "right" way to fix this is to have the buffer name between 2 zero > width LRM characters. First, why do we need the LRM at the beginning? The mode line is formatted with L2R "paragraph direction", so the leading LRM is unneeded (though won't do any harm). The "*Buffer List*" buffer is forced to use L2R paragraph direction as well, so the leading LRM is not needed there as well. And second, what do you mean by "zero width"? The current facilities let me change the LRM display only globally, so I cannot make these LRM characters zero-width only in the mode line -- they will be displayed as such in all the buffers and strings. Moreover, I'm not sure TTYs support zero-width. Instead, I propose to make the LRM invisible. This is supported on all display types. > I think Eli is wrong here. An example will help, a file with the > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear > as: "def ONML|KIHG qrst" which is clearly wrong. > > My way to solve it is as above, i.e. add zero width LRM on both sides > of the separator (/ or |) in addition to the enclosing LRMs. I think this is beginning to become gross. > The problem is even greater in `dired' with files that have ALL Hebrew > names. 
If you have a Hebrew locale, the date has Hebrew in it (the > month name) then it has some digits and ":" (all neutrals and weak > L2R ) and then the file name. The bidi algorithm actually exchange > the month and file name. Yes, Dired (and other similar modes) will "need work" (TM) to give a plausible display with bidi. Patches are welcome. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-21 17:24 ` Eli Zaretskii @ 2011-06-21 17:59 ` Ehud Karni 2011-06-21 18:10 ` Eli Zaretskii 2011-06-22 22:27 ` Stefan Monnier 0 siblings, 2 replies; 21+ messages in thread From: Ehud Karni @ 2011-06-21 17:59 UTC (permalink / raw) To: eliz; +Cc: stephen, miles, monnier, cloos, emacs-devel On Tue, 21 Jun 2011 20:24:04 +0300, Eli Zaretskii <eliz@gnu.org> wrote: > > First, why do we need the LRM at the beginning? The mode line is > formatted with L2R "paragraph direction", so the leading LRM is > unneeded (though won't do any harm). The "*Buffer List*" buffer is > forced to use L2R paragraph direction as well, so the leading LRM is > not needed there as well. The 1st LRM may be unneeded but I will add it anyway for the general implementation - any substring that contains R2L characters and is intended to be used in an L2R paragraph will be enclosed on both sides by LRM, so it can be inserted without introducing new problems. I think that we need new functions, something like R2L-quote and L2R-quote, that will produce strings that will not cause problems when used in R2L (or L2R) reading direction. > And second, what do you mean by "zero width"? The current facilities > let me change the LRM display only globally, so I cannot make these > LRM characters zero-width only in the mode line -- they will be > displayed as such in all the buffers and strings. Moreover, I'm not > sure TTYs support zero-width. > > Instead, I propose to make the LRM invisible. This is supported on > all display types. Maybe we need 2 LRMs (and 2 RLMs), the normal "real" one, which is part of the user text, and a "virtual" one, which is always invisible, ignored by search, and never saved. This will solve many problems, but will create others. Maybe use the "virtual" LRM/RLM only on non-saved text (like the mode-line, dired buffer and so on). > > I think Eli is wrong here. 
An example will help, a file with the > > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear > > as: "def ONML|KIHG qrst" which is clearly wrong. > > > > My way to solve it is as above, i.e. add zero width LRM on both sides > > of the separator (/ or |) in addition to the enclosing LRMs. > > I think this is beginning to become gross. But it is a general solution that is easily implemented. > > The problem is even greater in `dired' with files that have ALL Hebrew > > names. If you have a Hebrew locale, the date has Hebrew in it (the > > month name) then it has some digits and ":" (all neutrals and weak > > L2R) and then the file name. The bidi algorithm actually exchanges > > the month and file name. > > Yes, Dired (and other similar modes) will "need work" (TM) to give a > plausible display with bidi. Patches are welcome. -- Ehud Karni Tel: +972-3-7966-561 Mivtach - Simon Fax: +972-3-7976-561 Insurance agencies (USA) voice mail and FAX: 1-815-5509341 http://www.mvs.co.il GnuPG: 98EA398D <http://www.keyserver.net/> ASCII Ribbon Campaign Against HTML Mail Better Safe Than Sorry ^ permalink raw reply [flat|nested] 21+ messages in thread
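The L2R-quote idea proposed above can be sketched outside Emacs. The following is a minimal Python illustration, not Emacs code: the names `has_r2l` and `l2r_quote` and the default separator set are assumptions made for this example. It only inserts the marks as described (LRM on both ends, plus LRM on both sides of each separator); how the result is displayed still depends on the rendering engine.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK: strong L2R, normally no visible glyph


def has_r2l(text: str) -> bool:
    """Return True if any character has a strong R2L bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def l2r_quote(text: str, separators: str = "/|") -> str:
    """Hypothetical helper: guard `text` for insertion into an L2R paragraph.

    Strings without R2L characters are returned unchanged. Otherwise the
    string is enclosed in LRM on both sides, and every separator character
    also gets an LRM on each side.
    """
    if not has_r2l(text):
        return text
    guarded = "".join(LRM + ch + LRM if ch in separators else ch for ch in text)
    return LRM + guarded + LRM


if __name__ == "__main__":
    # A uniquified name with Hebrew letters on both sides of the separator.
    name = "def \u05d0\u05d1\u05d2\u05d3|\u05d4\u05d5\u05d6\u05d7 qrst"
    quoted = l2r_quote(name)
    # Print code points, since the marks themselves are not visible.
    print(" ".join(f"U+{ord(ch):04X}" for ch in quoted))
```

Keeping the insertion in one function matches the point of the proposal: callers do not need to know which marks are used, so the strategy can be changed later in a single place.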
* Re: Buffer names with R2L characters 2011-06-21 17:59 ` Ehud Karni @ 2011-06-21 18:10 ` Eli Zaretskii 2011-06-22 22:27 ` Stefan Monnier 1 sibling, 0 replies; 21+ messages in thread From: Eli Zaretskii @ 2011-06-21 18:10 UTC (permalink / raw) To: ehud; +Cc: stephen, miles, monnier, cloos, emacs-devel > Date: Tue, 21 Jun 2011 20:59:51 +0300 > From: "Ehud Karni" <ehud@unix.mvs.co.il> > Cc: emacs-devel@gnu.org, monnier@iro.umontreal.ca, cloos@jhcloos.com, > stephen@xemacs.org, miles@gnu.org > > I think that we need new functions, something like R2L-quote and > L2R-quote, that will produce strings that will not cause problems when > used in R2L (or L2R) reading direction. Patches are welcome. > > Instead, I propose to make the LRM invisible. This is supported on > > all display types. > > Maybe we need 2 LRMs (and 2 RLMs), the normal "real" one, which is part > of the user text, and a "virtual" one, which is always invisible, ignored > by search, and never saved. The "never saved" part might need some new infrastructure, because the only one we have -- overlays -- does not affect reordering. Another idea would be to cover the real string with an overlay with a display property that is a string computed from the covered text, and put the directional formatting codes in that display string. > Maybe use the "virtual" LRM/RLM only on non-saved text (like the > mode line, Dired buffers and so on). Dired buffers can be saved, e.g. by write-region etc. Note that these decorations will have to be used in various additional places, like the prompt displayed by read-buffer ("(default FOOBAR)"), for example. > > > I think Eli is wrong here. An example will help, a file with the > > > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear > > > as: "def ONML|KIHG qrst" which is clearly wrong. > > > > > > My way to solve it is as above, i.e. add zero width LRM on both sides > > > of the separator (/ or |) in addition to the enclosing LRMs. 
> > > > I think this is beginning to become gross. > > But it is a general solution that is easily implemented. I think it's gross, but I won't object to patches to that effect. ^ permalink raw reply [flat|nested] 21+ messages in thread
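To make the "ignored by search, and never saved" requirement concrete, here is a small Python sketch, again an illustration and not Emacs code. The helper names `strip_direction_marks` and `same_name` are hypothetical. It also shows the limitation raised above: once the marks are plain characters in the string, nothing distinguishes a "virtual" mark added for display from a "real" one typed by the user, so stripping removes both.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Translation table that deletes both directional marks.
_DELETE_MARKS = str.maketrans("", "", LRM + RLM)


def strip_direction_marks(text: str) -> str:
    """Remove every LRM/RLM, e.g. before saving text or comparing names."""
    return text.translate(_DELETE_MARKS)


def same_name(displayed: str, stored: str) -> bool:
    """Compare a decorated (displayed) name with the stored one, ignoring marks."""
    return strip_direction_marks(displayed) == strip_direction_marks(stored)


if __name__ == "__main__":
    stored = "\u05d0\u05d1\u05d2<2>"
    displayed = LRM + "\u05d0\u05d1\u05d2" + LRM + "<2>" + LRM
    print(same_name(displayed, stored))
```

A design that needs to keep user-typed marks would have to track the added ones separately from the text, which is the new infrastructure the message refers to.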
* Re: Buffer names with R2L characters 2011-06-21 17:59 ` Ehud Karni 2011-06-21 18:10 ` Eli Zaretskii @ 2011-06-22 22:27 ` Stefan Monnier 2011-06-23 9:16 ` Eli Zaretskii 1 sibling, 1 reply; 21+ messages in thread From: Stefan Monnier @ 2011-06-22 22:27 UTC (permalink / raw) To: ehud; +Cc: eliz, miles, stephen, cloos, emacs-devel > I think that we need new functions, something like R2L-quote and > L2R-quote, that will produce strings that will not cause problems when > used in R2L (or L2R) reading direction. That might be a good idea. At least it would let us encapsulate the solution to the problem, so we can change it later on. >> And second, what do you mean by "zero width"? The current facilities >> let me change the LRM display only globally, so I cannot make these >> LRM characters zero-width only in the mode line -- they will be >> displayed as such in all the buffers and strings. Moreover, I'm not >> sure TTYs support zero-width. >> Instead, I propose to make the LRM invisible. This is supported on >> all display types. > Maybe we need 2 LRMs (and 2 RLMs), the normal "real" one, which is part > of the user text, and a "virtual" one, which is always invisible, ignored > by search, and never saved. This will solve many problems, but will create > others. Maybe use the "virtual" LRM/RLM only on non-saved text (like the > mode line, Dired buffers and so on). Maybe another way to attack the problem is to say that the < and the > in that string are not neutral but "weak L2R" or something like that. Maybe this would also work for XML markup. We could specify such a thing via some char-table overriding the default bidi properties of specified chars. We would either need to be able to set this as a text-property over the "<N>", or to have one for the mode-line. >> > I think Eli is wrong here. An example will help, a file with the >> > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear >> > as: "def ONML|KIHG qrst" which is clearly wrong. 
>> > My way to solve it is as above, i.e. add zero width LRM on both sides >> > of the separator (/ or |) in addition to the enclosing LRMs. >> I think this is beginning to become gross. > But it is a general solution that is easily implemented. Indeed, for the buffer names it seems perfectly acceptable since we generate them ourselves and they don't go very far. I'm not sure why Eli doesn't like this solution. Stefan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Buffer names with R2L characters 2011-06-22 22:27 ` Stefan Monnier @ 2011-06-23 9:16 ` Eli Zaretskii 2011-06-25 13:25 ` Stefan Monnier 0 siblings, 1 reply; 21+ messages in thread From: Eli Zaretskii @ 2011-06-23 9:16 UTC (permalink / raw) To: Stefan Monnier; +Cc: ehud, miles, stephen, cloos, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: eliz@gnu.org, emacs-devel@gnu.org, cloos@jhcloos.com, stephen@xemacs.org, miles@gnu.org > Date: Wed, 22 Jun 2011 18:27:14 -0400 > > Maybe another way to attack the problem is to say that the < and the > > in that string are not neutral but "weak L2R" or something like that. There's no "weak L2R" bidi type or category in UAX#9. Weak types include numbers (i.e. digits) and "number separators" (plus and minus). Changing the type of '<' and '>' to number separator will not gain us anything, because these separators are treated the same as neutrals, except when they are between two numbers. Changing the type to numbers could probably solve the problem, but runs the risk of getting us in more trouble, since the treatment of numbers makes sense only for numbers. > Maybe this would also work for XML markup. It won't. In fact, it could make things worse. To see it, take the first example in this article: http://www.sw.it.aoyama.ac.jp/2005/pub/IUC28-bidi/IUC28.html the one that uses Arabic, copy/paste it into *scratch* in Emacs 24 with bidi-display-reordering turned on, and replace every '<' and '>' there with either '-' (a number separator) or a digit. The result is still unreadable gibberish, and in the case of digits it's even less readable. > We could specify such a thing via some char-table overriding the > default bidi properties of specified chars. We would either need to be > able to set this as a text-property over the "<N>", or to have one for > the mode-line. First, there's no need to invent another char-table. 
The bidi types used by bidi.c are already specified in a char-table, so all you'd need to do is to modify it (probably a copy of it). Assuming we indeed want to modify the properties of '<' and '>', that is -- which I think is not a good idea. (Btw, these two characters are not the only ones that cause trouble in display of buffer names. '~' is another one, and in fact all the punctuation characters behave in the same way. Are we going to modify the properties of all of them?) And second, using text properties for overriding bidi properties is not a good idea at all, because bidi.c works below the level that pays attention to text properties. Making it aware of text properties will slow it down considerably, or require a complete redesign of how the bidi display works in general, i.e. give up the total separation between the reordering and the rest of the display engine. I don't think we want that on behalf of this relatively minor issue. Bottom line, using the directional control characters is the best way of adapting the visual appearance to user expectations when displaying plain text. XML and other non-plain text buffers are a different kind of problem. There, we would like to display correctly not just text around '>', but also comments and strings. The problems there are with all the punctuation characters near the end of the comments and strings (they display at the wrong end of the last sentence) and with L2R text embedded in the otherwise R2L text. IOW, we would like to have a way to display such comments and strings as if they were in an R2L paragraph. I don't yet know what would be a good solution to that. In fact, I don't think we have an exhaustive list of situations where the default reordering causes trouble and must be augmented by something else. > >> > I think Eli is wrong here. An example will help, a file with the > >> > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear > >> > as: "def ONML|KIHG qrst" which is clearly wrong. 
> >> > My way to solve it is as above, i.e. add zero width LRM on both sides > >> > of the separator (/ or |) in addition to the enclosing LRMs. > >> I think this is beginning to become gross. > > But it is a general solution that is easily implemented. > > Indeed, for the buffer names it seems perfectly acceptable since we > generate them ourselves and they don't go very far. I'm not sure why > Eli doesn't like this solution. I don't like the proliferation of directional marks that this will bring. I hoped that we will need these directional control characters only very rarely. These have problems on TTYs, and even in GUI sessions they are visible by default (as thin spaces), so they will disrupt the visual appearance and cursor motion. We will need to have them everywhere, e.g. in the prompt displayed by read-buffer and in other places, if we want buffer names to look the same in all contexts. But since this is the best available solution, I'm willing to try; maybe I'm wrong and the results will not be that bad after all. ^ permalink raw reply [flat|nested] 21+ messages in thread
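The bidi types discussed in this message can be inspected with Python's standard `unicodedata` module, which exposes the Unicode Character Database. This is only an illustration of the classes involved; it reads the stock Unicode data, not the char-table that bidi.c uses, and the helper name `bidi_class` and the sample list are assumptions made for the example. In UAX#9 terms, ON is "other neutral", ES is "European separator" (the plus and minus "number separators"), CS is "common separator", EN is "European number", and R/AL are the strong R2L classes.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the UAX#9 bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# Characters mentioned in the thread, plus one Hebrew and one Arabic letter.
SAMPLES = "<>~|/-+2\u200e\u200f\u05d0\u0627"

if __name__ == "__main__":
    for ch in SAMPLES:
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"class={bidi_class(ch)} mirrored={unicodedata.mirrored(ch)}"
        )
```

The `mirrored` column is worth a look for '<' and '>': both are mirrored characters, so a '<' that ends up in an R2L run is drawn as '>', which is consistent with the "2>FEDCBA>" display reported at the start of the thread.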
* Re: Buffer names with R2L characters 2011-06-23 9:16 ` Eli Zaretskii @ 2011-06-25 13:25 ` Stefan Monnier 0 siblings, 0 replies; 21+ messages in thread From: Stefan Monnier @ 2011-06-25 13:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: ehud, miles, stephen, cloos, emacs-devel > the one that uses Arabic, copy/paste it into *scratch* in Emacs 24 > with bidi-display-reordering turned on, and replace every '<' and '>' > there with either '-' (a number separator) or a digit. The result is > still unreadable gibberish, and in the case of digits it's even less > readable. Oh, yes, yuck. Thanks for the explanation. > And second, using text properties for overriding bidi properties is > not a good idea at all, because bidi.c works below the level that pays > attention to text properties. I know. > Bottom line, using the directional control characters is the best way > of adapting the visual appearance to user expectations when displaying > plain text. OK. >> Indeed, for the buffer names it seems perfectly acceptable since we >> generate them ourselves and they don't go very far. I'm not sure why >> Eli doesn't like this solution. > I don't like the proliferation of directional marks that this will > bring. I hoped that we will need these directional control characters > only very rarely. These have problems on TTYs, and even in GUI > sessions they are visible by default (as thin spaces), so they will > disrupt the visual appearance and cursor motion. We will need to have > them everywhere, e.g. in the prompt displayed by read-buffer and in > other places, if we want buffer names to look the same in all > contexts. But since this is the best available solution, I'm willing > to try; maybe I'm wrong and the results will not be that bad after > all. We should make them display as nothing at all (but obey the display-table, of course, so they can be made visible when needed). Stefan ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, newest: 2011-06-25 13:25 UTC Thread overview: 21+ messages 2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii 2011-06-20 18:00 ` Stefan Monnier 2011-06-20 20:52 ` Eli Zaretskii 2011-06-20 20:13 ` James Cloos 2011-06-20 21:08 ` Eli Zaretskii 2011-06-21 4:26 ` Stephen J. Turnbull 2011-06-21 6:28 ` Eli Zaretskii 2011-06-21 8:44 ` Stephen J. Turnbull 2011-06-21 14:28 ` Eli Zaretskii 2011-06-21 4:33 ` Miles Bader 2011-06-21 6:30 ` Eli Zaretskii 2011-06-21 7:26 ` David Kastrup 2011-06-20 21:06 ` Kalle Olavi Niemitalo 2011-06-21 2:51 ` Eli Zaretskii 2011-06-21 16:52 ` Ehud Karni 2011-06-21 17:24 ` Eli Zaretskii 2011-06-21 17:59 ` Ehud Karni 2011-06-21 18:10 ` Eli Zaretskii 2011-06-22 22:27 ` Stefan Monnier 2011-06-23 9:16 ` Eli Zaretskii 2011-06-25 13:25 ` Stefan Monnier