* Buffer names with R2L characters
@ 2011-06-20 16:21 Eli Zaretskii
2011-06-20 18:00 ` Stefan Monnier
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-20 16:21 UTC (permalink / raw)
To: emacs-devel
I bumped into this annoyance while working on bidi reordering of
strings. As some of you know, the mode line is constructed from C and
Lisp strings, and bidi.c can now reorder them (for now, only its
new and improved version in my local branch).
Once I had this half-working, the first thing I tested was visiting
files whose names include R2L characters. It works, but there's one
problem: the "<N>" tails we attach to buffer names to make them
unique. The problem is that the '<' and '>' characters are "other
neutral", or "ON", in the UAX#9 parlance, and so their directionality
depends on the surrounding characters. As a result, a buffer name typed
as ABCDEF<2> is displayed in the mode line like this:
2>FEDCBA>
I verified this with the Unicode Reference Implementation, and there's
no bug in bidi.c: this is the correct reordering according to the
Unicode Bidirectional Algorithm.
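The bidi classes mentioned above can be checked with a short script. This is a minimal sketch in Python (not Emacs code) using the standard `unicodedata` module; the helper name `describe` is made up for illustration, and Hebrew alef stands in for the letters written as ABCDEF in this message.

```python
import unicodedata

def describe(ch):
    """Return (bidi class, mirrored flag) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

# '<' and '>' are "ON" (other neutral) and are mirrored in R2L runs;
# the digit is "EN" (European number); Hebrew alef is a strong "R".
for ch in ["<", ">", "2", "\u05d0"]:
    print(repr(ch), describe(ch))
```

Because the angle brackets carry no direction of their own, what surrounds them decides where they end up on display.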
I can fix this in most prominent use cases -- the mode line, the
buffer menu, and even in the prompt produced by read-buffer -- by
appending a suitable character to the end of the string (after the
numeric tail) and making it invisible with text properties. But this
sounds kludgey, and of course sooner or later the "2>FEDCBA>" thingy
will show somewhere, e.g. if someone coughs up their own
mode-line-format and uses buffer-name directly, or whatever. However,
I don't see a better way out of this, and leaving it as it is would be
too ugly, IMO. If someone has better ideas, I'm all ears.
* Re: Buffer names with R2L characters
2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii
@ 2011-06-20 18:00 ` Stefan Monnier
2011-06-20 20:52 ` Eli Zaretskii
2011-06-20 20:13 ` James Cloos
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2011-06-20 18:00 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel
> Once I had this half-working, the first thing I tested was visiting
> files whose names include R2L characters. It works, but there's one
> problem: the "<N>" tails we attach to buffer names to make them
> unique. The problem is that the '<' and '>' characters are "other
> neutral", or "ON", in the UAX#9 parlance, and so their directionality
> depends on the surrounding characters. As a result, a buffer name typed
> as ABCDEF<2> is displayed in the mode line like this:
> 2>FEDCBA>
Sounds like the same kind of issue as the one brought up a while ago
about bidi+XML (or any other mark up).
Another important case is when you use uniquify (in which case the
above will happen less frequently since file buffers add the directory
info to disambiguate the name, but may be replaced by similar problems
if the separator between the file and directory part (typically /, or
\, or |) doesn't have the "right" bidi behavior).
> I verified this with the Unicode Reference Implementation, and there's
> no bug in bidi.c: this is the correct reordering according to the
> Unicode Bidirectional Algorithm.
> appending a suitable character to the end of the string (after the
> numeric tail) and making it invisible with text properties. But this
Hmm, I would have expected that adding something at the end of the string
would not work ("too late"), whereas adding it between "ABCDEF" and
"<2>" would have felt very natural to me.
I guess it just shows how little I know of bidi,
Stefan
* Re: Buffer names with R2L characters
2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii
2011-06-20 18:00 ` Stefan Monnier
@ 2011-06-20 20:13 ` James Cloos
2011-06-20 21:08 ` Eli Zaretskii
2011-06-20 21:06 ` Kalle Olavi Niemitalo
2011-06-21 16:52 ` Ehud Karni
3 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2011-06-20 20:13 UTC (permalink / raw)
To: emacs-devel; +Cc: Eli Zaretskii
The bidi algorithm is just not designed for markup (and the <digits> tag
/is/ markup). Ideally there would be a 0-width break before the <digits>
or a way to mark the <digits> blob as non-neutral.
Whether the result should display as <12>FEDCBA, <21>FEDCBA or FEDCBA<12>,
though, I have no idea. (That is, I don't know which of those users would
prefer. I presume that an ordering break would result in the third.)
I doubt that it can be properly fixed, though, w/o also fixing unicode's
algorithm to better handle markup interspersed in the main text.
In general, each blob of markup should be handled as its own document,
and the result of that should be treated as a single (perhaps neutral)
character from the point of view of the enclosing text.
That would fix things for this issue, sgml/xml, TeX, source code of
every type (some of the markup there is implicit, but still logically
extant) et cetera.
For that to work the engine obviously needs to know what markup looks like,
which requires additional meta information about each document. The buffer
mode helps, but may not be sufficient?
Which is probably why unicode, with their plain-text emphasis, ignored it.
-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: Buffer names with R2L characters
2011-06-20 18:00 ` Stefan Monnier
@ 2011-06-20 20:52 ` Eli Zaretskii
0 siblings, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-20 20:52 UTC (permalink / raw)
To: Stefan Monnier; +Cc: emacs-devel
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Mon, 20 Jun 2011 14:00:34 -0400
>
> > 2>FEDCBA>
>
> Sounds like the same kind of issue as the one brought up a while ago
> about bidi+XML (or any other mark up).
It's caused by the bidirectional properties of the '<' and '>'
characters, but other than that, this has nothing to do with XML.
> Another important case is when you use uniquify (in which case the
> above will happen less frequently since file buffers add the directory
> info to disambiguate the name, but may be replaced by similar problems
> if the separator between the file and directory part (typically /, or
> \, or |) doesn't have the "right" bidi behavior).
You get something like foo/bar/FEDCBA, so there's no problem here, I
think.
> > I verified this with the Unicode Reference Implementation, and there's
> > no bug in bidi.c: this is the correct reordering according to the
> > Unicode Bidirectional Algorithm.
>
> > appending a suitable character to the end of the string (after the
> > numeric tail) and making it invisible with text properties. But this
>
> Hmm, I would have expected that adding something at the end of the string
> would not work ("too late"), whereas adding it between "ABCDEF" and
> "<2>" would have felt very natural to me.
The final "resolved level" of a neutral character depends on the characters
on both of its sides, not just on one side. So there's no "too late".
When '>' is the last character, the algorithm uses a default value for
the absent character after it, and the default depends on the current
paragraph direction, which must be L2R both in the mode line and in
buffer-menu. Thus, '>' gets the (default) L2R direction in this case.
Appending a zero makes '>' be surrounded by two digits, so it gets the
R2L direction (because the digits are embedded in a run of R2L
characters) and is mirrored into '<'.
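The reasoning above can be sketched in code. The following Python is a deliberately simplified model of the UAX#9 neutral-resolution step (rules W7, N1 and N2 only), not what bidi.c does: it ignores embeddings, bracket pairs, and most weak-type rules, and it treats the string ends as having the paragraph direction. The function names are hypothetical, and Hebrew letters stand in for ABCDEF.

```python
import unicodedata

def influence(bidi_class):
    """Direction a character lends to adjacent neutrals, or None if it is
    itself treated as neutral in this simplified model."""
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "AL", "EN", "AN"):
        return "R"  # numbers count as R when resolving neutrals (N1)
    return None

def resolve(text, paragraph_dir="L"):
    """Return one direction ("L" or "R") per character of TEXT.

    For neutrals this is the resolved direction; for other characters it
    is the direction they lend to their neighbors.
    """
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # W7 (simplified): a European number takes direction L when the
    # nearest preceding strong character (or the paragraph start) is L.
    last_strong = paragraph_dir
    for i, cls in enumerate(classes):
        if cls in ("L", "R", "AL"):
            last_strong = "L" if cls == "L" else "R"
        elif cls == "EN" and last_strong == "L":
            classes[i] = "L"
    marks = [influence(cls) for cls in classes]
    resolved = []
    for i, mark in enumerate(marks):
        if mark is not None:
            resolved.append(mark)
            continue
        # Nearest non-neutral on each side; the paragraph direction
        # stands in when the string ends first.
        before = next((m for m in reversed(marks[:i]) if m), paragraph_dir)
        after = next((m for m in marks[i + 1:] if m), paragraph_dir)
        # N1: follow both sides if they agree; N2: else paragraph direction.
        resolved.append(before if before == after else paragraph_dir)
    return resolved

name = "\u05d0\u05d1\u05d2<2>"      # R2L letters followed by "<2>"
print(resolve(name)[-1])            # direction of the final '>'
print(resolve(name + "0")[-2])      # the same '>' with a digit appended
```

In this model the trailing '>' falls back to the paragraph direction when nothing follows it, and joins the R2L run once a digit is appended, which is the effect described above.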
* Re: Buffer names with R2L characters
2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii
2011-06-20 18:00 ` Stefan Monnier
2011-06-20 20:13 ` James Cloos
@ 2011-06-20 21:06 ` Kalle Olavi Niemitalo
2011-06-21 2:51 ` Eli Zaretskii
2011-06-21 16:52 ` Ehud Karni
3 siblings, 1 reply; 21+ messages in thread
From: Kalle Olavi Niemitalo @ 2011-06-20 21:06 UTC (permalink / raw)
To: emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
> But this sounds kludgey, and of course sooner or later the
> "2>FEDCBA>" thingy will show somewhere, e.g. if someone coughs
> up their own mode-line-format and uses buffer-name directly, or
> whatever.
Perhaps you could make buffer-name return a string with text
properties that let it display in the preferable way. I don't
know whether the bidi code supports such properties.
* Re: Buffer names with R2L characters
2011-06-20 20:13 ` James Cloos
@ 2011-06-20 21:08 ` Eli Zaretskii
2011-06-21 4:26 ` Stephen J. Turnbull
2011-06-21 4:33 ` Miles Bader
0 siblings, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-20 21:08 UTC (permalink / raw)
To: James Cloos; +Cc: emacs-devel
> From: James Cloos <cloos@jhcloos.com>
> Cc: Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 20 Jun 2011 16:13:15 -0400
>
> The bidi algorithm is just not designed for markup (and the <digits> tag
> /is/ markup).
No, it isn't. Not every use of '<' and '>' is markup. They can also
be used in contexts such as "n > N" etc. They are just characters; XML
did not appropriate them just because it uses them for markup.
> Ideally there would be a 0-width break before the <digits>
It's possible (we could use LRM), but it isn't ideal, because text
terminals will have trouble displaying it. That's why I think using a
character covered by invisible text property is better.
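For concreteness, placing an LRM before the numeric tail might look like the following Python sketch. The helper name and the regular expression for the "<N>" tail are assumptions made for illustration; this is not how Emacs builds buffer names, and it says nothing about how a text terminal would render the extra character.

```python
import re
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, a strong L2R character

def break_before_tail(name):
    """Insert an LRM before a trailing "<N>" uniquifying tail, if any."""
    return re.sub(r"(<\d+>)$", lambda m: LRM + m.group(1), name)

print(unicodedata.bidirectional(LRM))                      # bidi class of LRM
print(ascii(break_before_tail("\u05d0\u05d1\u05d2<2>")))   # escaped result
```

Names without a numeric tail are returned unchanged.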
> or a way to mark the <digits> blob as non-neutral.
This feature is still far away. I thought about it, and concluded
that implementing it is not trivial, or at least there were a couple
of problems for which I couldn't think of a good solution yet.
I don't think we should wait for such a feature and in the meantime
display "12>FEDCBA>" as buffer name. I'd like to have a solution,
even if an interim one, for Emacs 24.1.
> Whether the result should display as <12>FEDCBA, <21>FEDCBA or FEDCBA<12>,
> though, I have no idea.
We can decide whatever we want (but not the one with "21": the order
of digits in a number should be left to right). But any of these
needs "help" to get them to look like that, because the UBA simply cannot
treat these cases gracefully.
> I presume that an ordering break would result in the third.
What's an "ordering break"?
> For that to work the engine obviously needs to know what markup looks like,
> which requires additional meta information about each document. The buffer
> mode helps, but may not be sufficient?
The Emacs reordering engine doesn't know anything about major modes,
text properties, overlays, and other meta information. The only
exceptions are (1) `display' properties and (2) buffer narrowing
limits. This design is intentional, because otherwise the same code
could not be used for reordering text independently of redisplay,
e.g. for producing visual-order encodings, like if you wanted to print
R2L text on a relatively dumb printer or send it to a process that
doesn't grok bidi.
* Re: Buffer names with R2L characters
2011-06-20 21:06 ` Kalle Olavi Niemitalo
@ 2011-06-21 2:51 ` Eli Zaretskii
0 siblings, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 2:51 UTC (permalink / raw)
To: Kalle Olavi Niemitalo; +Cc: emacs-devel
> From: Kalle Olavi Niemitalo <kon@iki.fi>
> Date: Tue, 21 Jun 2011 00:06:29 +0300
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > But this sounds kludgey, and of course sooner or later the
> > "2>FEDCBA>" thingy will show somewhere, e.g. if someone coughs
> > up their own mode-line-format and uses buffer-name directly, or
> > whatever.
>
> Perhaps you could make buffer-name return a string with text
> properties that let it display in the preferable way. I don't
> know whether the bidi code supports such properties.
That's what James was talking about. As I replied, there are no such
text properties in Emacs yet, and implementing them will not be
trivial, so it won't be in Emacs 24.1.
* Re: Buffer names with R2L characters
2011-06-20 21:08 ` Eli Zaretskii
@ 2011-06-21 4:26 ` Stephen J. Turnbull
2011-06-21 6:28 ` Eli Zaretskii
2011-06-21 4:33 ` Miles Bader
1 sibling, 1 reply; 21+ messages in thread
From: Stephen J. Turnbull @ 2011-06-21 4:26 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: James Cloos, emacs-devel
Eli Zaretskii writes:
> > From: James Cloos <cloos@jhcloos.com>
> > Cc: Eli Zaretskii <eliz@gnu.org>
> > Date: Mon, 20 Jun 2011 16:13:15 -0400
> >
> > The bidi algorithm is just not designed for markup (and the <digits> tag
> > /is/ markup).
>
> No, it isn't. Not every use of '<' and '>' is markup.
Please rethink here, Eli. In the sense Jim is talking about, at the
conceptual level this use case is indeed markup, ie, metadata encoded
in plain text. There may be better ways of solving the display
problem than taking that literally.
> They can also be used in context such as "n > N" etc. They are
> just characters; XML did not appropriate them just because it uses
> them for markup.
Of course this is true, but it misses his point, which is that in a
markup context, there is text data (in SGML, CDATA or PCDATA) and
there is metadata. From the point of view of working with these
higher-level protocols, it may be desirable that each run of text data be
treated separately in an editor.
There's nothing in the Unicode standard that says that Emacs couldn't
have a `non-bidi' syntax class, which when applied as a property to
text in a buffer would break that buffer into two or more bidi
"streams" to which the algorithm would be applied independently.
However, this is an implementation detail. If life is easier if you
consider every textual object (buffer or string) as a single stream to
which the bidi algorithm should be applied, there's nothing wrong with
that, either.
> This feature is still far away. I thought about it, and concluded
> that implementing it is not trivial, or at least there were a couple
> of problems for which I couldn't think of a good solution yet.
I'm happy to accept your judgment here, of course.
* Re: Buffer names with R2L characters
2011-06-20 21:08 ` Eli Zaretskii
2011-06-21 4:26 ` Stephen J. Turnbull
@ 2011-06-21 4:33 ` Miles Bader
2011-06-21 6:30 ` Eli Zaretskii
2011-06-21 7:26 ` David Kastrup
1 sibling, 2 replies; 21+ messages in thread
From: Miles Bader @ 2011-06-21 4:33 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: James Cloos, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
> I don't think we should wait for such a feature and in the meantime
> display "12>FEDCBA>" as buffer name. I'd like to have a solution,
> even if an interim one, for Emacs 24.1.
Do you think this case will be common enough in practice to be worth
worrying about in the short term?
In my experience redundant buffer/file names are not very common for
_arbitrary_ files. Rather, they tend to occur with very specific names
where the name is not chosen by the user, but rather is due to some
external standard -- e.g. "Makefile", "*shell*", "README". But I think
such "standard" names are also more likely to be written just using
ASCII, and so won't tickle this problem.
Do you observe otherwise in your environment? [hmm, what's Hebrew for
"README"... :]
-Miles
--
We are all lying in the gutter, but some of us are looking at the stars.
-Oscar Wilde
* Re: Buffer names with R2L characters
2011-06-21 4:26 ` Stephen J. Turnbull
@ 2011-06-21 6:28 ` Eli Zaretskii
2011-06-21 8:44 ` Stephen J. Turnbull
0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 6:28 UTC (permalink / raw)
To: Stephen J. Turnbull; +Cc: cloos, emacs-devel
> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: James Cloos <cloos@jhcloos.com>,
> emacs-devel@gnu.org
> Date: Tue, 21 Jun 2011 13:26:50 +0900
>
> Eli Zaretskii writes:
> > > From: James Cloos <cloos@jhcloos.com>
> > > Cc: Eli Zaretskii <eliz@gnu.org>
> > > Date: Mon, 20 Jun 2011 16:13:15 -0400
> > >
> > > The bidi algorithm is just not designed for markup (and the <digits> tag
> > > /is/ markup).
> >
> > No, it isn't. Not every use of '<' and '>' is markup.
>
> Please rethink here, Eli. In the sense Jim is talking about, at the
> conceptual level this use case is indeed markup, ie, metadata encoded
> in plain text. There may be better ways of solving the display
> problem than taking that literally.
Well, the fact that I started this discussion is a sign that I'm
willing to rethink ;-)
However, I'm not sure what you are suggesting to rethink,
specifically. What I said is that in foo<1>, the "<1>" part is not a
markup, it's just part of a string that is a buffer name. If you
disagree, then please explain why this use of <..> should be
considered markup. Just the fact that it uses <..> is not enough,
IMO.
> > They can also be used in context such as "n > N" etc. They are
> > just characters; XML did not appropriate them just because it uses
> > them for markup.
>
> Of course this is true, but it misses his point, which is that in a
> markup context, there is text data (in SGML, CDATA or PCDATA) and
> there is metadata. From the point of view of working with these
> higher-level protocols, it may be desirable that each run of text data be
> treated separately in an editor.
And it will be, when Emacs is taught to reorder non-plain text. I
never said nor thought that we should reorder markup text as if it
were plain text -- that'd be terribly wrong. Similarly, we should
reorder only comments and strings while displaying program source
files. Both these use cases call for selectively reordering an
otherwise strictly L2R text. Emacs should be able to do that, and in
doing so it will need to rely heavily on the buffer's major mode.
But the necessary infrastructure is not yet in Emacs, and it won't be
there in time for Emacs 24.1.
Moreover, I'm not sure it would be a good idea to use such an
infrastructure for FOO<2> buffer names even if it were available. If
you agree with me that <2> is not markup, then using methods designed
for markup languages would be as kludgey as appending an invisible
character: they both use a trick not intended for this use case.
There's no major mode here to help us DTRT with a string that is part
of the mode line.
> There's nothing in the Unicode standard that says that Emacs couldn't
> have a `non-bidi' syntax class, which when applied as a property to
> text in a buffer would break that buffer into two or more bidi
> "streams" to which the algorithm would be applied independently.
Right. I didn't mean to say otherwise.
> However, this is an implementation detail. If life is easier if you
> consider every textual object (buffer or string) as a single stream to
> which the bidi algorithm should be applied, there's nothing wrong with
> that, either.
Well, that won't solve the problem of displaying markup languages and
program sources, so it cannot be that simple ;-)
* Re: Buffer names with R2L characters
2011-06-21 4:33 ` Miles Bader
@ 2011-06-21 6:30 ` Eli Zaretskii
2011-06-21 7:26 ` David Kastrup
1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 6:30 UTC (permalink / raw)
To: Miles Bader; +Cc: cloos, emacs-devel
> From: Miles Bader <miles@gnu.org>
> Cc: James Cloos <cloos@jhcloos.com>, emacs-devel@gnu.org
> Date: Tue, 21 Jun 2011 13:33:13 +0900
>
> Do you think this case will be common enough in practice to be worth
> worrying about in the short term?
I have no idea. But if it's not common, then the kludge I suggested
won't bother anyone, either, right?
> Do you observe otherwise in your environment? [hmm, what's Hebrew for
> "README"... :]
I try very hard to stick to English ;-)
* Re: Buffer names with R2L characters
2011-06-21 4:33 ` Miles Bader
2011-06-21 6:30 ` Eli Zaretskii
@ 2011-06-21 7:26 ` David Kastrup
1 sibling, 0 replies; 21+ messages in thread
From: David Kastrup @ 2011-06-21 7:26 UTC (permalink / raw)
To: emacs-devel
Miles Bader <miles@gnu.org> writes:
> Eli Zaretskii <eliz@gnu.org> writes:
>> I don't think we should wait for such a feature and in the meantime
>> display "12>FEDCBA>" as buffer name. I'd like to have a solution,
>> even if an interim one, for Emacs 24.1.
>
> Do you think this case will be common enough in practice to be worth
> worrying about in the short term?
>
> In my experience redundant buffer/file names are not very common for
> _arbitrary_ files. Rather, they tend to occur with very specific
> names where the name is not chosen by the user, but rather is due to
> some external standard -- e.g. "Makefile", "*shell*", "README".
I have them quite often when comparing/merging versions of files.
--
David Kastrup
* Re: Buffer names with R2L characters
2011-06-21 6:28 ` Eli Zaretskii
@ 2011-06-21 8:44 ` Stephen J. Turnbull
2011-06-21 14:28 ` Eli Zaretskii
0 siblings, 1 reply; 21+ messages in thread
From: Stephen J. Turnbull @ 2011-06-21 8:44 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: cloos, emacs-devel
Eli Zaretskii writes:
> However, I'm not sure what you are suggesting to rethink,
> specifically. What I said is that in foo<1>, the "<1>" part is not a
> markup, it's just part of a string that is a buffer name.
Well, it's not markup in the sense of HTML (ie, display markup), but
it is markup in the sense of XML (semantic markup). It's true that
the appended string is arbitrary, and that the relationship to the
"desired" buffer name is quite arbitrary (you could use alphabetic
characters instead of numerals, for example). But it is used to
disambiguate what to the user would otherwise be identical names, and
the specific form clearly indicates that this is "metadata". Buffers
on main.c and main.c~ clearly represent files with different names,
while main.c and main.c<2> by convention represent files with the same
name. It is this convention, not the use of "<>", that makes the
uniquifier "<1>" into markup.
[Each run of plain text in a markup buffer should be treated as a
separate "stream" of text for bidi display purposes, or something like
that.]
> And it will be, when Emacs is taught to reorder non-plain text.
OK.
> I never said nor thought that we should reorder markup text as if
> it were plain text -- that'd be terribly wrong.
> But the necessary infrastructure is not yet in Emacs, and it won't be
> there in time for Emacs 24.1.
OK, that's what I wanted to say myself, but it wasn't clear to me that
was your reasoning.
> There's no major mode here to help us DTRT with a string that is part
> of the mode line.
Sure there is, "mode line mode". ;-) Mode lines have a syntax, and so
do buffer-names (when uniquifying). The fact that there's no major
mode we use in buffers that's like that is not really relevant.
Of course if you paste a buffer-name into a buffer in some major mode,
you may run into problems. But that's always the case when changing
syntax models on the same object.
> > However, this is an implementation detail. If life is easier if you
> > consider every textual object (buffer or string) as a single stream to
> > which the bidi algorithm should be applied, there's nothing wrong with
> > that, either.
>
> Well, that won't solve the problem of displaying markup languages and
> program sources, so it cannot be that simple ;-)
That's exactly the kind of thing where (at least for the next year or
so) I'm just gonna have to trust you. :-)
* Re: Buffer names with R2L characters
2011-06-21 8:44 ` Stephen J. Turnbull
@ 2011-06-21 14:28 ` Eli Zaretskii
0 siblings, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 14:28 UTC (permalink / raw)
To: Stephen J. Turnbull; +Cc: cloos, emacs-devel
> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: cloos@jhcloos.com,
> emacs-devel@gnu.org
> Date: Tue, 21 Jun 2011 17:44:05 +0900
>
> > There's no major mode here to help us DTRT with a string that is part
> > of the mode line.
>
> Sure there is, "mode line mode". ;-) Mode lines have a syntax, and so
> do buffer-names (when uniquifying).
I think the difference is significant. Mode line in Emacs is fully
programmable, and can be programmed to display anything in any order
and form. There are far fewer rules here than in any major mode,
because the latter is constrained by externally imposed rules of the
"language" supported by the mode. By contrast, I can program my mode
line to break any and all "syntax" that users of the default
mode-line-format are used to. Even uniquifying buffer names can be
done in several different flavors, out of the box.
Anyway, I'm perfectly happy to leave the display of such names as the
UBA would have them, and mark this as a temporarily missing feature.
At least the MS-Windows file manager displays such names the same
(cannot test on GNU/Linux where I'm typing this), so we have nothing
to be ashamed of.
Btw, the numbered backup files suffer from the same problem, their
buffer names are displayed as 1~.RABOOF~ instead of ~1~.RABOOF. So
it's not just the duplicate file names that will trigger this.
* Re: Buffer names with R2L characters
2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii
` (2 preceding siblings ...)
2011-06-20 21:06 ` Kalle Olavi Niemitalo
@ 2011-06-21 16:52 ` Ehud Karni
2011-06-21 17:24 ` Eli Zaretskii
3 siblings, 1 reply; 21+ messages in thread
From: Ehud Karni @ 2011-06-21 16:52 UTC (permalink / raw)
To: eliz; +Cc: stephen, miles, monnier, cloos, emacs-devel
On Mon, 20 Jun 2011 19:21:00 +03:00 Eli Zaretskii wrote:
> A buffer name typed as ABCDEF<2> is displayed in the mode line
> like this:
>
> 2>FEDCBA>
>
> [snip]
>
> I can fix this .... by appending a suitable character to the end of
> the string (after the numeric tail) and making it invisible with text
> properties.
The "right" way to fix this is to have the buffer name between 2 zero
width LRM characters.
On Mon, 20 Jun 2011 14:01:00 -04:00 Stefan Monnier wrote:
SM> Another important case is when you use uniquify ... file buffers
SM> add the directory info to disambiguate the name ...
On Mon, 20 Jun 2011 23:52:44 +03:00 Eli Zaretskii answered:
EZ> You get something like foo/bar/FEDCBA, so there's no problem here,
EZ> I think.
I think Eli is wrong here. An example will help, a file with the
(logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
as: "def ONML|KIHG qrst" which is clearly wrong.
My way to solve it is as above, i.e. add zero width LRM on both sides
of the separator (/ or |) in addition to the enclosing LRMs.
The problem is even greater in `dired' with files that have ALL Hebrew
names. If you have a Hebrew locale, the date has Hebrew in it (the
month name), then it has some digits and ":" (all neutrals and weak
L2R), and then the file name. The bidi algorithm actually exchanges
the month and file name.
File names are trouble - here is a paragraph from UAX#9:
However, in the case of bidirectional text, there are circumstances
where an implicit bidirectional ordering is not sufficient to
produce comprehensible text. To deal with these cases, a minimal
set of directional formatting codes is defined to control the
ordering of characters when rendered. This allows exact control
of the display ordering for legible interchange and ensures that
plain text used for simple items like filenames or labels can
=========
always be correctly ordered for display.
On Tue, 21 Jun 2011 13:33:13 +0900 Miles Bader wrote:
> Do you think this case will be common enough in practice to be worth
> worrying about in the short term?
From working for an Israeli company, I can assure you it is common.
For example, there are 3 different directories named "SHIRIM" (songs),
and at least 5 different directories with the Hebrew name of one of
the clients on a public disk.
Ehud.
--
Ehud Karni Tel: +972-3-7966-561 /"\
Mivtach - Simon Fax: +972-3-7976-561 \ / ASCII Ribbon Campaign
Insurance agencies (USA) voice mail and X Against HTML Mail
http://www.mvs.co.il FAX: 1-815-5509341 / \
GnuPG: 98EA398D <http://www.keyserver.net/> Better Safe Than Sorry
* Re: Buffer names with R2L characters
2011-06-21 16:52 ` Ehud Karni
@ 2011-06-21 17:24 ` Eli Zaretskii
2011-06-21 17:59 ` Ehud Karni
0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 17:24 UTC (permalink / raw)
To: ehud; +Cc: stephen, miles, monnier, cloos, emacs-devel
> Date: Tue, 21 Jun 2011 19:52:18 +0300
> From: "Ehud Karni" <ehud@unix.mvs.co.il>
> Cc: emacs-devel@gnu.org, monnier@iro.umontreal.ca, cloos@jhcloos.com,
> stephen@xemacs.org, miles@gnu.org
>
> > 2>FEDCBA>
> >
> > [snip]
> >
> > I can fix this .... by appending a suitable character to the end of
> > the string (after the numeric tail) and making it invisible with text
> > properties.
>
> The "right" way to fix this is to have the buffer name between 2 zero
> width LRM characters.
First, why do we need the LRM at the beginning? The mode line is
formatted with L2R "paragraph direction", so the leading LRM is
unneeded (though won't do any harm). The "*Buffer List*" buffer is
forced to use L2R paragraph direction as well, so the leading LRM is
not needed there either.
And second, what do you mean by "zero width"? The current facilities
let me change the LRM display only globally, so I cannot make these
LRM characters zero-width only in the mode line -- they will be
displayed as such in all the buffers and strings. Moreover, I'm not
sure TTYs support zero-width.
Instead, I propose to make the LRM invisible. This is supported on
all display types.
> I think Eli is wrong here. An example will help, a file with the
> (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
> as: "def ONML|KIHG qrst" which is clearly wrong.
>
> My way to solve it is as above, i.e. add zero width LRM on both sides
> of the separator (/ or |) in addition to the enclosing LRMs.
I think this is beginning to become gross.
> The problem is even greater in `dired' with files that have ALL Hebrew
> names. If you have a Hebrew locale, the date has Hebrew in it (the
> month name) then it has some digits and ":" (all neutrals and weak
> L2R) and then the file name. The bidi algorithm actually exchanges
> the month and file name.
Yes, Dired (and other similar modes) will "need work" (TM) to give a
plausible display with bidi. Patches are welcome.
* Re: Buffer names with R2L characters
2011-06-21 17:24 ` Eli Zaretskii
@ 2011-06-21 17:59 ` Ehud Karni
2011-06-21 18:10 ` Eli Zaretskii
2011-06-22 22:27 ` Stefan Monnier
0 siblings, 2 replies; 21+ messages in thread
From: Ehud Karni @ 2011-06-21 17:59 UTC (permalink / raw)
To: eliz; +Cc: stephen, miles, monnier, cloos, emacs-devel
On Tue, 21 Jun 2011 20:24:04 +0300, Eli Zaretskii <eliz@gnu.org> wrote:
>
> First, why do we need the LRM at the beginning? The mode line is
> formatted with L2R "paragraph direction", so the leading LRM is
> unneeded (though won't do any harm). The "*Buffer List*" buffer is
> forced to use L2R paragraph direction as well, so the leading LRM is
> not needed there as well.
The 1st LRM may be unneeded, but I will add it anyway for the general
implementation: any substring that contains R2L characters and is intended
to be used in an L2R paragraph will be enclosed on both sides by LRMs, so
it can be inserted without introducing new problems.
I think that we need new functions, something like R2L-quote and
L2R-quote, that will produce strings that will not cause problems when
used in the R2L (or L2R) reading direction.
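As a rough illustration of the quoting idea, here is a minimal sketch in Python rather than Emacs Lisp. The name `l2r_quote` is hypothetical, modeled on the L2R-quote proposed above, and the rule for deciding when to quote (any strong R2L character present) is an assumption.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def l2r_quote(text):
    """Enclose TEXT in LRMs if it contains any strong R2L character."""
    has_r2l = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
    return LRM + text + LRM if has_r2l else text

print(ascii(l2r_quote("README")))        # pure L2R text is left alone
print(ascii(l2r_quote("\u05d0\u05d1")))  # R2L text gains an LRM on each side
```

An R2L-quote counterpart would do the same with RLM (U+200F) for text meant to sit in an R2L paragraph.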
> And second, what do you mean by "zero width"? The current facilities
> let me change the LRM display only globally, so I cannot make these
> LRM characters zero-width only in the mode line -- they will be
> displayed as such in all the buffers and strings. Moreover, I'm not
> sure TTYs support zero-width.
>
> Instead, I propose to make the LRM invisible. This is supported on
> all display types.
Maybe we need two kinds of LRM (and of RLM): the normal "real" one, which is
part of the user text, and a "virtual" one, which is always invisible, ignored
by search, and never saved. This will solve many problems, but will create
others. Maybe use the "virtual" LRM/RLM only in text that is not saved (like
the mode line, Dired buffers and so on).
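As a rough illustration of the "ignored by search, and never saved" part,
the Python sketch below removes directional marks from a string before it
is compared or written out. The helper name strip_direction_marks is
hypothetical. Note that it cannot tell "real" marks from "virtual" ones,
which is one of the new problems such a scheme would create.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Translation table that deletes both marks.
_DELETE_MARKS = str.maketrans("", "", LRM + RLM)


def strip_direction_marks(text):
    """Return text with every LRM and RLM removed."""
    return text.translate(_DELETE_MARKS)


def same_name(decorated, plain):
    """Compare a decorated display string with an undecorated name."""
    return strip_direction_marks(decorated) == strip_direction_marks(plain)
```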
> > I think Eli is wrong here. An example will help, a file with the
> > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
> > as: "def ONML|KIHG qrst" which is clearly wrong.
> >
> > My way to solve it is as above, i.e. add zero width LRM on both sides
> > of the separator (/ or |) in addition to the enclosing LRMs.
>
> I think this is beginning to become gross.
But it is a general solution that is easily implemented.
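To make the proposal concrete, here is a small Python sketch (an
illustration, not the uniquify code): it puts an LRM on both sides of each
separator and encloses the whole name in LRMs. The function name
isolate_separators and the default separator set are assumptions made for
this example.

```python
import re

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def isolate_separators(name, separators="/|"):
    """Surround each separator with LRMs and enclose the name in LRMs."""
    pattern = "[" + re.escape(separators) + "]"
    body = re.sub(pattern, lambda m: LRM + m.group(0) + LRM, name)
    return LRM + body + LRM


if __name__ == "__main__":
    # Hebrew letters stand in for the upper-case R2L text in the example.
    print(repr(isolate_separators("def \u05d0\u05d1|\u05d2\u05d3 qrst")))
```

Removing the marks from the result gives back the original name, so the
decoration only affects display order.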
> > The problem is even greater in `dired' with files that have ALL Hebrew
> > names. If you have a Hebrew locale, the date has Hebrew in it (the
> > month name), then it has some digits and ":" (all neutrals and weak
> > L2R), and then the file name. The bidi algorithm actually exchanges
> > the month and file name.
>
> Yes, Dired (and other similar modes) will "need work" (TM) to give a
> plausible display with bidi. Patches are welcome.
--
Ehud Karni Tel: +972-3-7966-561 /"\
Mivtach - Simon Fax: +972-3-7976-561 \ / ASCII Ribbon Campaign
Insurance agencies (USA) voice mail and X Against HTML Mail
http://www.mvs.co.il FAX: 1-815-5509341 / \
GnuPG: 98EA398D <http://www.keyserver.net/> Better Safe Than Sorry
* Re: Buffer names with R2L characters
2011-06-21 17:59 ` Ehud Karni
@ 2011-06-21 18:10 ` Eli Zaretskii
2011-06-22 22:27 ` Stefan Monnier
1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-21 18:10 UTC (permalink / raw)
To: ehud; +Cc: stephen, miles, monnier, cloos, emacs-devel
> Date: Tue, 21 Jun 2011 20:59:51 +0300
> From: "Ehud Karni" <ehud@unix.mvs.co.il>
> Cc: emacs-devel@gnu.org, monnier@iro.umontreal.ca, cloos@jhcloos.com,
> stephen@xemacs.org, miles@gnu.org
>
> I think that we need new functions, something like R2L-quote and
> L2R-quote, that will produce strings that will not cause problems when
> used in the R2L (or L2R) reading direction.
Patches are welcome.
> > Instead, I propose to make the LRM invisible. This is supported on
> > all display types.
>
> Maybe we need two kinds of LRM (and of RLM): the normal "real" one, which is
> part of the user text, and a "virtual" one, which is always invisible, ignored
> by search, and never saved.
The "never saved" part might need some new infrastructure, because the
only one we have -- overlays -- does not affect reordering.
Another idea would be to cover the real string with an overlay whose
display property is a string computed from the covered text, and
put the directional formatting codes in that display string.
> Maybe use the "virtual" LRM/RLM only in text that is not saved (like the
> mode line, Dired buffers and so on).
A Dired buffer can be saved, e.g. by write-region.
Note that these decorations will have to be used in various additional
places, like the prompt displayed by read-buffer ("(default FOOBAR)"),
for example.
> > > I think Eli is wrong here. An example will help, a file with the
> > > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
> > > as: "def ONML|KIHG qrst" which is clearly wrong.
> > >
> > > My way to solve it is as above, i.e. add zero width LRM on both sides
> > > of the separator (/ or |) in addition to the enclosing LRMs.
> >
> > I think this is beginning to become gross.
>
> But it is a general solution that is easily implemented.
I think it's gross, but I won't object to patches to that effect.
* Re: Buffer names with R2L characters
2011-06-21 17:59 ` Ehud Karni
2011-06-21 18:10 ` Eli Zaretskii
@ 2011-06-22 22:27 ` Stefan Monnier
2011-06-23 9:16 ` Eli Zaretskii
1 sibling, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2011-06-22 22:27 UTC (permalink / raw)
To: ehud; +Cc: eliz, miles, stephen, cloos, emacs-devel
> I think that we need new functions, something like R2L-quote and
> L2R-quote, that will produce strings that will not cause problems when
> used in the R2L (or L2R) reading direction.
That might be a good idea. At least it would let us encapsulate the
solution to the problem, so we can change it later on.
>> And second, what do you mean by "zero width"? The current facilities
>> let me change the LRM display only globally, so I cannot make these
>> LRM characters zero-width only in the mode line -- they will be
>> displayed as such in all the buffers and strings. Moreover, I'm not
>> sure TTYs support zero-width.
>> Instead, I propose to make the LRM invisible. This is supported on
>> all display types.
> Maybe we need two kinds of LRM (and of RLM): the normal "real" one, which is
> part of the user text, and a "virtual" one, which is always invisible, ignored
> by search, and never saved. This will solve many problems, but will create
> others. Maybe use the "virtual" LRM/RLM only in text that is not saved (like
> the mode line, Dired buffers and so on).
Maybe another way to attack the problem is to say that the < and the >
in that string are not neutral but "weak L2R" or something like that.
Maybe this would also work for XML markup.
We could specify such a thing via some char-table overriding the
default bidi properties of specified chars. We would either need to be
able to set this as a text-property over the "<N>", or to have one for
the mode-line.
>> > I think Eli is wrong here. An example will help, a file with the
>> > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
>> > as: "def ONML|KIHG qrst" which is clearly wrong.
>> > My way to solve it is as above, i.e. add zero width LRM on both sides
>> > of the separator (/ or |) in addition to the enclosing LRMs.
>> I think this is beginning to become gross.
> But it is a general solution that is easily implemented.
Indeed, for the buffer names it seems perfectly acceptable since we
generate them ourselves and they don't go very far. I'm not sure why
Eli doesn't like this solution.
Stefan
* Re: Buffer names with R2L characters
2011-06-22 22:27 ` Stefan Monnier
@ 2011-06-23 9:16 ` Eli Zaretskii
2011-06-25 13:25 ` Stefan Monnier
0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2011-06-23 9:16 UTC (permalink / raw)
To: Stefan Monnier; +Cc: ehud, miles, stephen, cloos, emacs-devel
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: eliz@gnu.org, emacs-devel@gnu.org, cloos@jhcloos.com, stephen@xemacs.org, miles@gnu.org
> Date: Wed, 22 Jun 2011 18:27:14 -0400
>
> Maybe another way to attack the problem is to say that the < and the >
> in that string are not neutral but "weak L2R" or something like that.
There's no "weak L2R" bidi type or category in UAX#9. Weak types
include numbers (i.e. digits) and "number separators" (plus and
minus). Changing the type of '<' and '>' to number separator will not
gain us anything, because these separators are treated the same as
neutrals, except when they are between two numbers. Changing the type
to numbers could probably solve the problem, but runs the risk of
getting us in more trouble, since the treatment of numbers makes sense
only for numbers.
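The bidi classes in question can be inspected outside Emacs. The Python
sketch below uses the standard unicodedata module; it reports the classes
from the Unicode Character Database shipped with Python, which is assumed
here to agree with the table bidi.c uses for these characters. The helper
name bidi_classes is made up for the example.

```python
import unicodedata


def bidi_classes(chars):
    """Map each character to its Unicode bidi class abbreviation."""
    return {ch: unicodedata.bidirectional(ch) for ch in chars}


if __name__ == "__main__":
    # '<', '>' and '~' occur in buffer names; '+' and '-' are the number
    # separators mentioned above; '2' is a digit; U+05D0 is HEBREW LETTER ALEF.
    for ch, cls in bidi_classes("<>~+-2\u05d0").items():
        print(f"U+{ord(ch):04X} {cls}")
```

Here '<', '>' and '~' come out as ON (other neutral), '+' and '-' as ES
(European number separator), the digit as EN, and the Hebrew letter as R.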
> Maybe this would also work for XML markup.
It won't. In fact, it could make things worse. To see it, take the
first example in this article:
http://www.sw.it.aoyama.ac.jp/2005/pub/IUC28-bidi/IUC28.html
the one that uses Arabic, copy/paste it into *scratch* in Emacs 24
with bidi-display-reordering turned on, and replace every '<' and '>'
there with either '-' (a number separator) or a digit. The result is
still unreadable gibberish, and in the case of digits it's even less
readable.
> We could specify such a thing via some char-table overriding the
> default bidi properties of specified chars. We would either need to be
> able to set this as a text-property over the "<N>", or to have one for
> the mode-line.
First, there's no need to invent another char-table. The bidi types
used by bidi.c are already specified in a char-table, so all you'd
need to do is to modify it (probably its copy). Assuming we indeed
want to modify the properties of '<' and '>', that is -- which I think
is not a good idea. (Btw, these two characters are not the only ones
that cause trouble in display of buffer names. '~' is another one,
and in fact all the punctuation characters behave in the same way.
Are we going to modify the properties of all of them?)
And second, using text properties for overriding bidi properties is
not a good idea at all, because bidi.c works below the level that pays
attention to text properties. Making it aware of text properties will
slow it down considerably, or require a complete redesign of how the
bidi display works in general, i.e. give up the total separation
between the reordering and the rest of the display engine. I don't
think we want that on behalf of this relatively minor issue.
Bottom line, using the directional control characters is the best way
of adapting the visual appearance to user expectations when displaying
plain text.
XML and other non-plain text buffers are a different kind of problem.
There, we would like to display correctly not just text around '>',
but also comments and strings. The problems there are with all the
punctuation characters near the end of the comments and strings (they
display at the wrong end of the last sentence) and with L2R text
embedded in the otherwise R2L text. IOW, we would like to have a way
to display such comments and strings as if they were in an R2L
paragraph. I don't yet know what would be a good solution to that.
In fact, I don't think we have an exhaustive list of situations where
the default reordering causes trouble and must be augmented by
something else.
> >> > I think Eli is wrong here. An example will help, a file with the
> >> > (logical) name "/abc/def GHIK/LMNO qrst" when uniquified will appear
> >> > as: "def ONML|KIHG qrst" which is clearly wrong.
> >> > My way to solve it is as above, i.e. add zero width LRM on both sides
> >> > of the separator (/ or |) in addition to the enclosing LRMs.
> >> I think this is beginning to become gross.
> > But it is a general solution that is easily implemented.
>
> Indeed, for the buffer names it seems perfectly acceptable since we
> generate them ourselves and they don't go very far. I'm not sure why
> Eli doesn't like this solution.
I don't like the proliferation of directional marks that this will
bring. I had hoped that we would need these directional control characters
only very rarely. These have problems on TTYs, and even in GUI
sessions they are visible by default (as thin spaces), so they will
disrupt the visual appearance and cursor motion. We will need to have
them everywhere, e.g. in the prompt displayed by read-buffer and in
other places, if we want buffer names to look the same in all
contexts. But since this is the best available solution, I'm willing
to try; maybe I'm wrong and the results will not be that bad after
all.
* Re: Buffer names with R2L characters
2011-06-23 9:16 ` Eli Zaretskii
@ 2011-06-25 13:25 ` Stefan Monnier
0 siblings, 0 replies; 21+ messages in thread
From: Stefan Monnier @ 2011-06-25 13:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: ehud, miles, stephen, cloos, emacs-devel
> the one that uses Arabic, copy/paste it into *scratch* in Emacs 24
> with bidi-display-reordering turned on, and replace every '<' and '>'
> there with either '-' (a number separator) or a digit. The result is
> still unreadable gibberish, and in the case of digits it's even less
> readable.
Oh, yes, yuck. Thanks for the explanation.
> And second, using text properties for overriding bidi properties is
> not a good idea at all, because bidi.c works below the level that pays
> attention to text properties.
I know.
> Bottom line, using the directional control characters is the best way
> of adapting the visual appearance to user expectations when displaying
> plain text.
OK.
>> Indeed, for the buffer names it seems perfectly acceptable since we
>> generate them ourselves and they don't go very far. I'm not sure why
>> Eli doesn't like this solution.
> I don't like the proliferation of directional marks that this will
> bring. I had hoped that we would need these directional control characters
> only very rarely. These have problems on TTYs, and even in GUI
> sessions they are visible by default (as thin spaces), so they will
> disrupt the visual appearance and cursor motion. We will need to have
> them everywhere, e.g. in the prompt displayed by read-buffer and in
> other places, if we want buffer names to look the same in all
> contexts. But since this is the best available solution, I'm willing
> to try; maybe I'm wrong and the results will not be that bad after
> all.
We should make them display as nothing at all (but obey the
display-table, of course, so they can be made visible when needed).
Stefan
Thread overview: 21+ messages
2011-06-20 16:21 Buffer names with R2L characters Eli Zaretskii
2011-06-20 18:00 ` Stefan Monnier
2011-06-20 20:52 ` Eli Zaretskii
2011-06-20 20:13 ` James Cloos
2011-06-20 21:08 ` Eli Zaretskii
2011-06-21 4:26 ` Stephen J. Turnbull
2011-06-21 6:28 ` Eli Zaretskii
2011-06-21 8:44 ` Stephen J. Turnbull
2011-06-21 14:28 ` Eli Zaretskii
2011-06-21 4:33 ` Miles Bader
2011-06-21 6:30 ` Eli Zaretskii
2011-06-21 7:26 ` David Kastrup
2011-06-20 21:06 ` Kalle Olavi Niemitalo
2011-06-21 2:51 ` Eli Zaretskii
2011-06-21 16:52 ` Ehud Karni
2011-06-21 17:24 ` Eli Zaretskii
2011-06-21 17:59 ` Ehud Karni
2011-06-21 18:10 ` Eli Zaretskii
2011-06-22 22:27 ` Stefan Monnier
2011-06-23 9:16 ` Eli Zaretskii
2011-06-25 13:25 ` Stefan Monnier