unofficial mirror of emacs-devel@gnu.org 
* bidi properties from uniprop tables
@ 2011-08-18 19:06 Eli Zaretskii
  2011-08-19  4:44 ` Stephen J. Turnbull
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2011-08-18 19:06 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

If a character code is missing from UnicodeData.txt, the uniprop_table
API in C returns zero as its bidi class, which should never happen
(valid classes start at 1).  This causes crashes in redisplay, because
bidi.c is unable to handle a character that has no valid properties.

The get-char-code-property function returns nil for such characters.
Here's an example:

  (get-char-code-property #x378 'bidi-class) => nil

You will not find 0x0378 in UnicodeData.txt.

Such undefined characters should not normally appear in any text, but
`describe-categories' produces such codes, and Emacs crashes when
browsing the buffer created by that command.

I made the code in bidi.c defensive about what it gets from the
uniprop table, but the question is: should we do something so that
these APIs never return nil in Lisp or zero in C?
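
A minimal sketch of the failure mode and of a call-site guard, written
in Python rather than Emacs's own C or Lisp.  The table contents, the
function names, and the choice of "L" as the fallback are assumptions
made for illustration; the thread does not show what the actual
defensive code in bidi.c does.

```python
# Hypothetical stand-in for a property table built only from the
# entries listed in UnicodeData.txt.
BIDI_TABLE = {
    0x0031: "EN",  # DIGIT ONE
    0x0041: "L",   # LATIN CAPITAL LETTER A
    0x05D0: "R",   # HEBREW LETTER ALEF
}


def raw_bidi_class(code_point):
    """Return the stored class, or None when the code point is absent."""
    return BIDI_TABLE.get(code_point)


def safe_bidi_class(code_point, fallback="L"):
    """Call-site guard: never hand a missing class to the caller.

    The fallback value is an assumption of this sketch.
    """
    cls = raw_bidi_class(code_point)
    return fallback if cls is None else cls
```

With this table, raw_bidi_class(0x378) yields None (the analogue of nil
in Lisp or zero in C), while safe_bidi_class(0x378) yields the fallback.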




* bidi properties from uniprop tables
  2011-08-18 19:06 bidi properties from uniprop tables Eli Zaretskii
@ 2011-08-19  4:44 ` Stephen J. Turnbull
  2011-08-19  6:43   ` Eli Zaretskii
  2011-08-23 12:51   ` Kenichi Handa
  0 siblings, 2 replies; 11+ messages in thread
From: Stephen J. Turnbull @ 2011-08-19  4:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa

Eli Zaretskii writes:

 > Such undefined characters should not normally appear in any text,

Somebody misread the standard, I think.  In Unicode, everything in the
space 0 -- 0x10FFFF is a character, with a few well-defined exceptions
called non-characters.  Most, however, do not (yet) appear in the UCD,
but are reserved for future standardization.  Since the standard
regularly adds new characters, older versions of Emacs are very likely
to encounter reserved characters "normally" in "some" texts.

The UCD makes provision for this by providing definitions of some
properties for *all code points* (not merely all defined characters).
Unassigned (including noncharacter) code points automatically get the
General_Category property with value Cn (Unicode 6.0, section 4.5,
Table 4.9).  They also automatically get the Name property, with value
"" (the null string; Unicode 6.0, section 4.8, "Formal Definition of
the Name Property", p. 132).

 > I made the code in bidi.c defensive about what it gets from the

Maybe that should be an assert, since a null return is an Emacs bug.

 > uniprop table, but the question is, should we do something to never
 > have nil in Lisp or zero in C return from these APIs?

Yes, a non-nil property list is required by the standard for all code
points (not merely "all characters"), and it is obvious that in this
case conforming to the standard is useful.




* Re: bidi properties from uniprop tables
  2011-08-19  4:44 ` Stephen J. Turnbull
@ 2011-08-19  6:43   ` Eli Zaretskii
  2011-08-19  9:15     ` Stephen J. Turnbull
  2011-08-23 12:51   ` Kenichi Handa
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2011-08-19  6:43 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel, handa

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Kenichi Handa <handa@m17n.org>,
>     emacs-devel@gnu.org
> Date: Fri, 19 Aug 2011 13:44:48 +0900
> 
>  > I made the code in bidi.c defensive about what it gets from the
> 
> Maybe that should be an assert, since a null return is an Emacs bug.

There's already something that catches such problems, albeit
indirectly, and aborts -- that's how I found this in the first place.
However, it doesn't make sense to have an assert where the bidi
property of a character is looked up as long as we don't make sure
this doesn't happen "normally", because having such an assert now will
cause a predictable crash when moving in a buffer created by
describe-categories.  People use the development version for their
day-to-day work, you know...

>  > uniprop table, but the question is, should we do something to never
>  > have nil in Lisp or zero in C return from these APIs?
> 
> Yes, a non-nil property list is required by the standard for all code
> points

It's not a property list, it's a single property whose value is a
symbol that shouldn't be nil.  See get-char-code-property.




* Re: bidi properties from uniprop tables
  2011-08-19  6:43   ` Eli Zaretskii
@ 2011-08-19  9:15     ` Stephen J. Turnbull
  2011-08-19 10:36       ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen J. Turnbull @ 2011-08-19  9:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, emacs-devel

Eli Zaretskii writes:

 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: Kenichi Handa <handa@m17n.org>,
 > >     emacs-devel@gnu.org
 > > Date: Fri, 19 Aug 2011 13:44:48 +0900
 > > 
 > >  > I made the code in bidi.c defensive about what it gets from the
 > > 
 > > Maybe that should be an assert, since a null return is an Emacs bug.
 > 
 > There's already something that catches such problems, albeit
 > indirectly, and aborts -- that's how I found this in the first place.
 > However, it doesn't make sense to have an assert where the bidi
 > property of a character is looked up as long as we don't make sure
 > this doesn't happen "normally",

The point was to make sure that this doesn't happen normally, as my
understanding was that this is an Emacs bug.

 > because having such an assert now will cause a predictable crash
 > when moving in a buffer created by describe-categories.  People use
 > the development version for their day-to-day work, you know...

Of course I know that; I run with asserts enabled in several of my
mission-critical applications (but save early and often, and have a
stable version ready to substitute).  Doesn't everybody?  So obviously
you fix the bug first, then add the assert.

 > >  > uniprop table, but the question is, should we do something to never
 > >  > have nil in Lisp or zero in C return from these APIs?
 > > 
 > > Yes, a non-nil property list is required by the standard for all code
 > > points
 > 
 > It's not a property list, it's a single property whose value is a
 > symbol that shouldn't be nil.  See get-char-code-property.

Then no, you shouldn't do that for all properties, because not all
properties are defined for all characters.  Given that this is Lisp,
it should be possible to discover that from the value returned.

Some properties, however, are defined for all code points.  If the
properties in question are defined for all code points or all
characters (presumably Emacs should never allow a non-character code
point in a string or buffer?), then it's an Emacs bug in
get-char-code-property (or in the underlying table) if it returns nil.

If they aren't, then IMO it's a bug in the calling code that it's not
prepared for a null return, and your "defensive code" in bidi.c
is a correct bug fix.

With respect to the Bidi_Class property, UAX#9 says:

    Unassigned characters are given strong types in the
    algorithm. This is an explicit exception to the general Unicode
    conformance requirements with respect to unassigned characters. As
    characters become assigned in the future, these bidirectional
    types may change. For assignments to character types, see
    DerivedBidiClass.txt [DerivedBIDI] in the [UCD].

Since Bidi_Class is only used in this algorithm (and explicit property
lookups) AFAIK, it seems reasonable to me that get-char-code-property
et amis should return the "strong type" specified by DerivedBIDI
(which is LTR it seems, but you should check that).




* Re: bidi properties from uniprop tables
  2011-08-19  9:15     ` Stephen J. Turnbull
@ 2011-08-19 10:36       ` Eli Zaretskii
  2011-08-19 12:10         ` Stephen J. Turnbull
  2011-08-20 12:42         ` Kenichi Handa
  0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2011-08-19 10:36 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: handa, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org,
>     handa@m17n.org
> Date: Fri, 19 Aug 2011 18:15:58 +0900
> 
>     Unassigned characters are given strong types in the
>     algorithm. This is an explicit exception to the general Unicode
>     conformance requirements with respect to unassigned characters. As
>     characters become assigned in the future, these bidirectional
>     types may change. For assignments to character types, see
>     DerivedBidiClass.txt [DerivedBIDI] in the [UCD].

Thanks, I've managed to miss that addition to the UBA.

> Since Bidi_Class is only used in this algorithm (and explicit property
> lookups) AFAIK

That's not true, it is also used in regexp search by category.  So we
should decide whether to assign these types in the uniprop table, or
have a fallback for them in bidi.c.  Any opinions?  Handa-san?

> it seems reasonable to me that get-char-code-property
> et amis should return the "strong type" specified by DerivedBIDI
> (which is LTR it seems, but you should check that).

No, the type depends on the block:

  # Unlike other properties, unassigned code points in blocks
  # reserved for right-to-left scripts are given either types R or AL.
  #
  # The unassigned code points that default to AL are in the ranges:
  #     [\u0600-\u07BF \uFB50-\uFDFF \uFE70-\uFEFF]
  #
  #     Arabic:            U+0600  -  U+06FF
  #     Syriac:            U+0700  -  U+074F
  #     Arabic_Supplement: U+0750  -  U+077F
  #     Thaana:            U+0780  -  U+07BF
  #     Arabic_Presentation_Forms_A:
  #                        U+FB50  -  U+FDFF
  #     Arabic_Presentation_Forms_B:
  #                        U+FE70  -  U+FEFF
  #           minus noncharacter code points.
  #
  # The unassigned code points that default to R are in the ranges:
  #     [\u0590-\u05FF \u07C0-\u08FF \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EFFF]
  #
  #     Hebrew:            U+0590  -  U+05FF
  #     NKo:               U+07C0  -  U+07FF
  #     Cypriot_Syllabary: U+10800 - U+1083F
  #     Phoenician:        U+10900 - U+1091F
  #     Lydian:            U+10920 - U+1093F
  #     Kharoshthi:        U+10A00 - U+10A5F
  #     and any others in the ranges:
  #                        U+0800  -  U+08FF,
  #                        U+FB1D  -  U+FB4F,
  #                        U+10840 - U+10FFF,
  #                        U+1E800 - U+1EFFF
  #
  # For all other cases:

  #  All code points not explicitly listed for Bidi_Class
  #  have the value Left_To_Right (L).
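
The block-based defaults quoted above can be sketched as a range
lookup.  This is only an illustration in Python, not the code used by
Emacs: the function and constant names are hypothetical, the ranges are
copied from the excerpt, and the excerpt does not say which class the
excluded noncharacters (U+FDD0 - U+FDEF) receive, so the sketch leaves
them undecided.

```python
# Ranges copied from the DerivedBidiClass.txt excerpt quoted above.
AL_RANGES = [(0x0600, 0x07BF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]
R_RANGES = [
    (0x0590, 0x05FF),
    (0x07C0, 0x08FF),
    (0xFB1D, 0xFB4F),
    (0x10800, 0x10FFF),
    (0x1E800, 0x1EFFF),
]
# Noncharacters inside the AL ranges ("minus noncharacter code points").
NONCHARS_IN_AL = (0xFDD0, 0xFDEF)


def _in_ranges(code_point, ranges):
    return any(lo <= code_point <= hi for lo, hi in ranges)


def default_bidi_class(code_point):
    """Default Bidi_Class for a code point with no explicit entry."""
    if NONCHARS_IN_AL[0] <= code_point <= NONCHARS_IN_AL[1]:
        # Not covered by the excerpt; left undecided in this sketch.
        return None
    if _in_ranges(code_point, AL_RANGES):
        return "AL"
    if _in_ranges(code_point, R_RANGES):
        return "R"
    # "All code points not explicitly listed ... have the value L."
    return "L"
```

Such a function applies only to code points with no explicit listing;
an explicitly assigned class always takes precedence over it.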




* Re: bidi properties from uniprop tables
  2011-08-19 10:36       ` Eli Zaretskii
@ 2011-08-19 12:10         ` Stephen J. Turnbull
  2011-08-20 12:42         ` Kenichi Handa
  1 sibling, 0 replies; 11+ messages in thread
From: Stephen J. Turnbull @ 2011-08-19 12:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, handa

Eli Zaretskii writes:

 > > Since Bidi_Class is only used in this algorithm (and explicit property
 > > lookups) AFAIK
 > 
 > That's not true, it is also used in regexp search by category.  So we
 > should decide whether to assign these types in the uniprop table, or
 > have a fallback for them in bidi.c.  Any opinions?  Handa-san?

Since these actually appear in the DerivedBIDI file, I'd say that they
should be in the uniprop table and available for regexp search and
others.  IIUC, the deviation from the usual Unicode conformance is
that normally you are not supposed to make any assumptions about
character code points that are not assigned, but here assumptions are
made based on code point blocks.  So these defaults are not normative
in the way that the Bidi_Class property is when explicitly assigned to
a character, but they "really are" properties of those code points.

IMHO YMMV.






* Re: bidi properties from uniprop tables
  2011-08-19 10:36       ` Eli Zaretskii
  2011-08-19 12:10         ` Stephen J. Turnbull
@ 2011-08-20 12:42         ` Kenichi Handa
  2011-08-20 13:00           ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Kenichi Handa @ 2011-08-20 12:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, emacs-devel

In article <8339gxsn6h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Since Bidi_Class is only used in this algorithm (and explicit property
> > lookups) AFAIK

> That's not true, it is also used in regexp search by category.  So we
> should decide whether to assign these types in the uniprop table, or
> have a fallback for them in bidi.c.  Any opinions?  Handa-san?

As I'm on vacation now, I can't access the source of Emacs,
but I remember that there's a place in an element of
unidata-SOMETHING-alist (I don't remember what SOMETHING is)
to specify the default property value.  So, it should be
easy to fix the default value if it is a simple one.

But the current code doesn't handle a non-simple default value such
as the one below.

> > it seems reasonable to me that get-char-code-property
> > et amis should return the "strong type" specified by DerivedBIDI
> > (which is LTR it seems, but you should check that).

> No, the type depends on the block:

>   # Unlike other properties, unassigned code points in blocks
>   # reserved for right-to-left scripts are given either types R or AL.
>   #
>   # The unassigned code points that default to AL are in the ranges:
>   #     [\u0600-\u07BF \uFB50-\uFDFF \uFE70-\uFEFF]
>   #
>   #     Arabic:            U+0600  -  U+06FF
>   #     Syriac:            U+0700  -  U+074F
>   #     Arabic_Supplement: U+0750  -  U+077F
>   #     Thaana:            U+0780  -  U+07BF
>   #     Arabic_Presentation_Forms_A:
>   #                        U+FB50  -  U+FDFF
>   #     Arabic_Presentation_Forms_B:
>   #                        U+FE70  -  U+FEFF
>   #           minus noncharacter code points.
>   #
>   # The unassigned code points that default to R are in the ranges:
>   #     [\u0590-\u05FF \u07C0-\u08FF \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EFFF]
>   #
>   #     Hebrew:            U+0590  -  U+05FF
>   #     NKo:               U+07C0  -  U+07FF
>   #     Cypriot_Syllabary: U+10800 - U+1083F
>   #     Phoenician:        U+10900 - U+1091F
>   #     Lydian:            U+10920 - U+1093F
>   #     Kharoshthi:        U+10A00 - U+10A5F
>   #     and any others in the ranges:
>   #                        U+0800  -  U+08FF,
>   #                        U+FB1D  -  U+FB4F,
>   #                        U+10840 - U+10FFF,
>   #                        U+1E800 - U+1EFFF
>   #
>   # For all other cases:

>   #  All code points not explicitly listed for Bidi_Class
>   #  have the value Left_To_Right (L).

I'll fix the code to handle it when I'm back at work next
Monday.

---
Kenichi Handa
handa@m17n.org




* Re: bidi properties from uniprop tables
  2011-08-20 12:42         ` Kenichi Handa
@ 2011-08-20 13:00           ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2011-08-20 13:00 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: stephen, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Sat, 20 Aug 2011 21:42:20 +0900
> Cc: stephen@xemacs.org, emacs-devel@gnu.org
> 
> In article <8339gxsn6h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > Since Bidi_Class is only used in this algorithm (and explicit property
> > > lookups) AFAIK
> 
> > That's not true, it is also used in regexp search by category.  So we
> > should decide whether to assign these types in the uniprop table, or
> > have a fallback for them in bidi.c.  Any opinions?  Handa-san?
> 
> As I'm on vacation now, I can't access the source of Emacs,
> but I remember that there's a place in an element of
> unidata-SOMETHING-alist (I don't remember what SOMETHING is)
> to specify the default property value.  So, it should be
> easy to fix the default value if it is a simple one.

I guess you mean unidata-prop-alist.  If so, it already states that
the default is L:

      (bidi-class
       4 unidata-gen-table-symbol "uni-bidi.el"
       "Unicode bidi class.
  Property value is one of the following symbols:
    L, LRE, LRO, R, AL, RLE, RLO, PDF, EN, ES, ET,
    AN, CS, NSM, BN, B, S, WS, ON"
       unidata-describe-bidi-class
       L  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
       ;; The order of elements must be in sync with bidi_type_t in
       ;; src/dispextern.h.
       (L R EN AN BN B AL LRE LRO RLE RLO PDF ES ET CS NSM S WS ON))

I think the problem is deeper.  The characters in question do not
appear at all in UnicodeData.txt.  I think unidata-gen.el only handles
character codes it finds in UnicodeData.txt, but does nothing for
those it didn't find.
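
A minimal sketch of the gap described here, in Python rather than the
Lisp of unidata-gen.el.  It assumes a generator that records only the
code points it reads from UnicodeData.txt-style lines (field 4 is the
bidi class) and shows a declared default being applied to every code
point that was not read.  The names, the two-line sample, and the
closure-based design are hypothetical, and the First/Last range entries
of the real file are ignored.

```python
# Two sample lines in UnicodeData.txt format; field 4 is Bidi_Class.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;
"""


def build_bidi_lookup(unicode_data_text, default="L"):
    """Return a lookup function that never yields a missing value."""
    explicit = {}
    for line in unicode_data_text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        explicit[int(fields[0], 16)] = fields[4]

    def bidi_class(code_point):
        # Code points absent from the input get the declared default
        # instead of "no value".
        return explicit.get(code_point, default)

    return bidi_class
```

For example, build_bidi_lookup(SAMPLE)(0x378) returns the default even
though 0x0378 never appears in the input.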

> I'll fix the code to handle it when I'm back to work on next
> Monday.

Thank you.




* Re: bidi properties from uniprop tables
  2011-08-19  4:44 ` Stephen J. Turnbull
  2011-08-19  6:43   ` Eli Zaretskii
@ 2011-08-23 12:51   ` Kenichi Handa
  2011-08-23 14:49     ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Kenichi Handa @ 2011-08-23 12:51 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: eliz, emacs-devel

In article <87k4aako1r.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Somebody misread the standard, I think.

It was me.  As far as I remember, the early versions of the UCD
were not clear about the property values of characters not
listed in the UCD, and the first version of unidata-gen.el was
written at that time.  Since then, I have not checked the
precise definitions in the updated UCDs.

> Yes, a non-nil property list is required by the standard for all code
> points (not merely "all characters"), and it is obvious that in this
> case conforming to the standard is useful.

I've just installed fixes.  In the latest code,
get-char-code-property never returns nil for these
properties:
  name, general-category, canonical-combining-class,
  bidi-class, decomposition, mirrored, old-name,
  iso-10646-comment.

But, it still returns nil for these properties ("string
property" in UCD terminology):
  decimal-digit-value, digit-value, numeric-value,
  uppercase, lowercase, titlecase, mirroring
UCD says that the default value for them is the character
itself, but to implement that, we would have to either fill all
char-table elements with the corresponding characters (which
makes the table very big), or implement a special mechanism
that returns the character itself if the value is nil (which I
think is not appropriate now, during the feature freeze).
So, I just added this kind of statement to the docstring.

   The value nil means that the actual property value of a
   character is the character itself.
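
The convention in that docstring can be sketched as a small wrapper on
the caller's side.  This is an illustration in Python, not the
mechanism discussed above for Emacs: the table, its single entry, and
the function name are hypothetical, and only the uppercase mapping is
shown.

```python
# Hypothetical sparse table: only characters whose uppercase form
# differs from the character itself are stored.
UPPERCASE = {ord("a"): ord("A")}


def effective_property(table, code_point):
    """Treat a missing value as "the character itself"."""
    value = table.get(code_point)
    return code_point if value is None else value
```

This keeps the table sparse while still giving every caller a usable
value, at the cost of requiring each caller to go through the wrapper.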

---
Kenichi Handa
handa@m17n.org




* Re: bidi properties from uniprop tables
  2011-08-23 12:51   ` Kenichi Handa
@ 2011-08-23 14:49     ` Eli Zaretskii
  2011-08-23 23:36       ` Kenichi Handa
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2011-08-23 14:49 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: stephen, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: eliz@gnu.org, emacs-devel@gnu.org
> Date: Tue, 23 Aug 2011 21:51:03 +0900
> 
> I've just installed fixes.

Thanks.  I committed a followup changeset, to make the default
bidi-class properties consistent with the latest UCD, to adapt bidi.c
to the changes, and to document in the ELisp manual the default values
for the unassigned codepoints.

> But, it still returns nil for these properties ("string
> property" in UCD terminology):
>   decimal-digit-value, digit-value, numeric-value,
>   uppercase, lowercase, titlecase, mirroring
> UCD says that the default value is a character itself for
> them, but to implement it, we have to fill all char-table
> elements by corresponding characters (which makes the table
> very big), or have to implement a special mechanism to
> return a character itself if the value is nil (which I think
> is not adequate at the current timing of feature freeze).
> So, I just added this kind of statement in the docstring.
> 
>    The value nil means that the actual property value of a
>    character is the character itself.

I agree with this implementation, and documented these properties
accordingly in the ELisp manual.

Thanks.




* Re: bidi properties from uniprop tables
  2011-08-23 14:49     ` Eli Zaretskii
@ 2011-08-23 23:36       ` Kenichi Handa
  0 siblings, 0 replies; 11+ messages in thread
From: Kenichi Handa @ 2011-08-23 23:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, emacs-devel

In article <838vqki3mx.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > I've just installed fixes.

> Thanks.  I committed a followup changeset, to make the default
> bidi-class properties consistent with the latest UCD, to adapt bidi.c
> to the changes, and to document in the ELisp manual the default values
> for the unassigned codepoints.

Thank you for them.

> > But, it still returns nil for these properties ("string
> > property" in UCD terminology):
> >   decimal-digit-value, digit-value, numeric-value,
> >   uppercase, lowercase, titlecase, mirroring
> > UCD says that the default value is a character itself for
> > them, but to implement it, we have to fill all char-table
> > elements by corresponding characters (which makes the table
> > very big), or have to implement a special mechanism to
> > return a character itself if the value is nil (which I think
> > is not adequate at the current timing of feature freeze).
> > So, I just added this kind of statement in the docstring.
> > 
> >    The value nil means that the actual property value of a
> >    character is the character itself.

> I agree with this implementation, and documented these properties
> accordingly in the ELisp manual.

Thank you for that too.

---
Kenichi Handa
handa@m17n.org



