* bidi properties from uniprop tables
From: Eli Zaretskii @ 2011-08-18 19:06 UTC
To: Kenichi Handa; +Cc: emacs-devel

If a character code is missing from UnicodeData.txt, the uniprop_table
API in C returns zero as its bidi class, which should never happen
(valid classes start at 1). This causes crashes in redisplay, because
bidi.c is unable to handle a character that has no valid properties.

The get-char-code-property function returns nil for such characters.
Here's an example:

  (get-char-code-property #x378 'bidi-class) => nil

You will not find 0x0378 in UnicodeData.txt.

Such undefined characters should not normally appear in any text, but
`describe-categories' produces such codes, and Emacs crashes when
browsing the buffer created by that command.

I made the code in bidi.c defensive about what it gets from the
uniprop table, but the question is, should we do something to never
have nil in Lisp or zero in C return from these APIs?
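The defensive check described in the message above can be sketched roughly as follows. This is only an illustration, not the actual change made to bidi.c: the enum values, `raw_bidi_class`, and `safe_bidi_class` are hypothetical names, and the toy lookup stands in for the real uniprop table.

```c
#include <assert.h>

/* Hypothetical class codes.  The only property taken from the thread
   is that 0 is not a valid class and valid classes start at 1.  */
enum { BIDI_UNKNOWN = 0, BIDI_STRONG_L = 1 };

/* Hypothetical stand-in for the table lookup: it knows a single
   character (U+0041) and returns 0 for every other code.  */
int raw_bidi_class(unsigned int cp)
{
    return cp == 0x41 ? BIDI_STRONG_L : BIDI_UNKNOWN;
}

/* Defensive wrapper: never hand the invalid value 0 to the caller.
   Which fallback is appropriate is left to the caller.  */
int safe_bidi_class(unsigned int cp, int fallback)
{
    int cls;

    assert(fallback != BIDI_UNKNOWN);
    cls = raw_bidi_class(cp);
    return cls != BIDI_UNKNOWN ? cls : fallback;
}
```

With this shape, a code such as 0x378 that the table does not know yields the fallback instead of 0, so the caller always sees a class of 1 or above.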
* bidi properties from uniprop tables
From: Stephen J. Turnbull @ 2011-08-19 4:44 UTC
To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa

Eli Zaretskii writes:

> Such undefined characters should not normally appear in any text,

Somebody misread the standard, I think. In Unicode, everything in the
space 0 -- 0x10FFFF is a character, with a few well-defined exceptions
called non-characters. Most, however, do not (yet) appear in the UCD,
but are reserved for future standardization. Since the standard
regularly adds new characters, older versions of Emacs are very likely
to encounter reserved characters "normally" in "some" texts.

The UCD makes provision for this by providing definitions of some
properties for *all code points* (not merely all defined characters).
Unassigned (including noncharacter) code points automatically get the
General_Category property with value Cn (Unicode 6.0, section 4.5,
Table 4.9). They also automatically get the Name property, with value
"" (the null string, Unicode 6.0, section 4.8, "Formal Definition of
the Name Property", p. 132).

> I made the code in bidi.c defensive about what it gets from the

Maybe that should be an assert, since a null return is an Emacs bug.

> uniprop table, but the question is, should we do something to never
> have nil in Lisp or zero in C return from these APIs?

Yes, a non-nil property list is required by the standard for all code
points (not merely "all characters"), and it is obvious that in this
case conforming to the standard is useful.
* Re: bidi properties from uniprop tables
From: Eli Zaretskii @ 2011-08-19 6:43 UTC
To: Stephen J. Turnbull; +Cc: emacs-devel, handa

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Kenichi Handa <handa@m17n.org>,
>     emacs-devel@gnu.org
> Date: Fri, 19 Aug 2011 13:44:48 +0900
>
> > I made the code in bidi.c defensive about what it gets from the
>
> Maybe that should be an assert, since a null return is an Emacs bug.

There's already something that catches such problems, albeit
indirectly, and aborts -- that's how I found this in the first place.
However, it doesn't make sense to have an assert where the bidi
property of a character is looked up as long as we don't make sure
this doesn't happen "normally", because having such an assert now will
cause a predictable crash when moving in a buffer created by
describe-categories. People use the development version for their
day-to-day work, you know...

> > uniprop table, but the question is, should we do something to never
> > have nil in Lisp or zero in C return from these APIs?
>
> Yes, a non-nil property list is required by the standard for all code
> points

It's not a property list, it's a single property whose value is a
symbol that shouldn't be nil. See get-char-code-property.
* Re: bidi properties from uniprop tables
From: Stephen J. Turnbull @ 2011-08-19 9:15 UTC
To: Eli Zaretskii; +Cc: handa, emacs-devel

Eli Zaretskii writes:

> > From: "Stephen J. Turnbull" <stephen@xemacs.org>
> > Cc: Kenichi Handa <handa@m17n.org>,
> >     emacs-devel@gnu.org
> > Date: Fri, 19 Aug 2011 13:44:48 +0900
> >
> > > I made the code in bidi.c defensive about what it gets from the
> >
> > Maybe that should be an assert, since a null return is an Emacs bug.
>
> There's already something that catches such problems, albeit
> indirectly, and aborts -- that's how I found this in the first place.
> However, it doesn't make sense to have an assert where the bidi
> property of a character is looked up as long as we don't make sure
> this doesn't happen "normally",

The point was to make sure that this doesn't happen normally, as my
understanding was that this is an Emacs bug.

> because having such an assert now will cause a predictable crash
> when moving in a buffer created by describe-categories. People use
> the development version for their day-to-day work, you know...

Of course I know that; I run with asserts enabled in several of my
mission-critical applications (but save early and often, and have a
stable version ready to substitute). Doesn't everybody? So obviously
you fix the bug first, then add the assert.

> > > uniprop table, but the question is, should we do something to never
> > > have nil in Lisp or zero in C return from these APIs?
> >
> > Yes, a non-nil property list is required by the standard for all code
> > points
>
> It's not a property list, it's a single property whose value is a
> symbol that shouldn't be nil. See get-char-code-property.

Then no, you shouldn't do that for all properties, because not all
properties are defined for all characters. Given that this is Lisp, it
should be possible to discover that from the value returned. Some
properties, however, are defined for all code points.

If the properties in question are defined for all code points or all
characters (presumably Emacs should never allow a non-character code
point in a string or buffer?), then it's an Emacs bug in
get-char-code-property (or in the underlying table) if it returns nil.
If they aren't, then IMO it's a bug in the calling code that it's not
prepared for a null return, and your "defensive code" in bidi.c is a
correct bug fix.

With respect to the Bidi_Class property, UAX#9 says:

    Unassigned characters are given strong types in the
    algorithm. This is an explicit exception to the general Unicode
    conformance requirements with respect to unassigned characters. As
    characters become assigned in the future, these bidirectional
    types may change. For assignments to character types, see
    DerivedBidiClass.txt [DerivedBIDI] in the [UCD].

Since Bidi_Class is only used in this algorithm (and explicit property
lookups) AFAIK, it seems reasonable to me that get-char-code-property
et amis should return the "strong type" specified by DerivedBIDI
(which is LTR it seems, but you should check that).
* Re: bidi properties from uniprop tables
From: Eli Zaretskii @ 2011-08-19 10:36 UTC
To: Stephen J. Turnbull; +Cc: handa, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org,
>     handa@m17n.org
> Date: Fri, 19 Aug 2011 18:15:58 +0900
>
>     Unassigned characters are given strong types in the
>     algorithm. This is an explicit exception to the general Unicode
>     conformance requirements with respect to unassigned characters. As
>     characters become assigned in the future, these bidirectional
>     types may change. For assignments to character types, see
>     DerivedBidiClass.txt [DerivedBIDI] in the [UCD].

Thanks, I've managed to miss that addition to the UBA.

> Since Bidi_Class is only used in this algorithm (and explicit property
> lookups) AFAIK

That's not true, it is also used in regexp search by category. So we
should decide whether to assign these types in the uniprop table, or
have a fallback for them in bidi.c. Any opinions? Handa-san?

> it seems reasonable to me that get-char-code-property
> et amis should return the "strong type" specified by DerivedBIDI
> (which is LTR it seems, but you should check that).

No, the type depends on the block:

  # Unlike other properties, unassigned code points in blocks
  # reserved for right-to-left scripts are given either types R or AL.
  #
  # The unassigned code points that default to AL are in the ranges:
  #     [\u0600-\u07BF \uFB50-\uFDFF \uFE70-\uFEFF]
  #
  #     Arabic:            U+0600 - U+06FF
  #     Syriac:            U+0700 - U+074F
  #     Arabic_Supplement: U+0750 - U+077F
  #     Thaana:            U+0780 - U+07BF
  #     Arabic_Presentation_Forms_A:
  #                        U+FB50 - U+FDFF
  #     Arabic_Presentation_Forms_B:
  #                        U+FE70 - U+FEFF
  #           minus noncharacter code points.
  #
  # The unassigned code points that default to R are in the ranges:
  #     [\u0590-\u05FF \u07C0-\u08FF \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EFFF]
  #
  #     Hebrew:            U+0590 - U+05FF
  #     NKo:               U+07C0 - U+07FF
  #     Cypriot_Syllabary: U+10800 - U+1083F
  #     Phoenician:        U+10900 - U+1091F
  #     Lydian:            U+10920 - U+1093F
  #     Kharoshthi:        U+10A00 - U+10A5F
  #     and any others in the ranges:
  #                        U+0800 - U+08FF,
  #                        U+FB1D - U+FB4F,
  #                        U+10840 - U+10FFF,
  #                        U+1E800 - U+1EFFF
  #
  # For all other cases:
  #  All code points not explicitly listed for Bidi_Class
  #  have the value Left_To_Right (L).
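The block-dependent defaults quoted above can be sketched as a small C function. This is a minimal illustration under stated assumptions, not code from Emacs: the enum and the names `default_bidi_class`, `is_noncharacter`, and `in_range` are hypothetical, the ranges are copied from the quoted excerpt and may differ in later UCD versions, and the function only answers "what would the default be if this code point is unassigned" (it cannot tell whether a code point is actually assigned). The excerpt says the AL ranges exclude noncharacters but does not say which class those get, so the sketch reports them as unspecified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes; these are not the values Emacs uses.  */
enum default_bidi {
    DEFAULT_UNSPECIFIED = 0, /* noncharacter: excerpt names no class */
    DEFAULT_L,
    DEFAULT_R,
    DEFAULT_AL
};

static bool in_range(uint32_t cp, uint32_t lo, uint32_t hi)
{
    return cp >= lo && cp <= hi;
}

/* Noncharacters are U+FDD0..U+FDEF plus the last two code points of
   every plane (U+xxFFFE and U+xxFFFF).  */
bool is_noncharacter(uint32_t cp)
{
    return in_range(cp, 0xFDD0, 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Default Bidi_Class for an unassigned code point, following the
   ranges in the quoted DerivedBidiClass.txt header.  */
enum default_bidi default_bidi_class(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    if (is_noncharacter(cp))
        return DEFAULT_UNSPECIFIED;

    if (in_range(cp, 0x0600, 0x07BF)
        || in_range(cp, 0xFB50, 0xFDFF)
        || in_range(cp, 0xFE70, 0xFEFF))
        return DEFAULT_AL;

    if (in_range(cp, 0x0590, 0x05FF)
        || in_range(cp, 0x07C0, 0x08FF)
        || in_range(cp, 0xFB1D, 0xFB4F)
        || in_range(cp, 0x10800, 0x10FFF)
        || in_range(cp, 0x1E800, 0x1EFFF))
        return DEFAULT_R;

    return DEFAULT_L;
}
```

For the code point from the original report, `default_bidi_class(0x378)` falls through to `DEFAULT_L`, while a code point such as 0x08FF lands in the R ranges.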
* Re: bidi properties from uniprop tables
From: Stephen J. Turnbull @ 2011-08-19 12:10 UTC
To: Eli Zaretskii; +Cc: emacs-devel, handa

Eli Zaretskii writes:

> > Since Bidi_Class is only used in this algorithm (and explicit property
> > lookups) AFAIK
>
> That's not true, it is also used in regexp search by category. So we
> should decide whether to assign these types in the uniprop table, or
> have a fallback for them in bidi.c. Any opinions? Handa-san?

Since these actually appear in the DerivedBIDI file, I'd say that they
should be in the uniprop table and available for regexp search and
others.

IIUC, the deviation from the usual Unicode conformance is that
normally you are not supposed to make any assumptions about character
code points that are not assigned, but here assumptions are made based
on code point blocks. So these defaults are not normative in the way
that the Bidi_Class property is when explicitly assigned to a
character, but they "really are" properties of those code points.
IMHO YMMV.
* Re: bidi properties from uniprop tables
From: Kenichi Handa @ 2011-08-20 12:42 UTC
To: Eli Zaretskii; +Cc: stephen, emacs-devel

In article <8339gxsn6h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Since Bidi_Class is only used in this algorithm (and explicit property
> > lookups) AFAIK

> That's not true, it is also used in regexp search by category. So we
> should decide whether to assign these types in the uniprop table, or
> have a fallback for them in bidi.c. Any opinions? Handa-san?

As I'm on vacation now, I can't access the source of Emacs,
but I remember that there's a place in an element of
unidata-SOMETHING-alist (I don't remember what SOMETHING is)
to specify the default property value. So, it should be
easy to fix the default value if it is a simple one.

But, the current code doesn't handle the non-simple default
value as below.

> > it seems reasonable to me that get-char-code-property
> > et amis should return the "strong type" specified by DerivedBIDI
> > (which is LTR it seems, but you should check that).

> No, the type depends on the block:

>   # Unlike other properties, unassigned code points in blocks
>   # reserved for right-to-left scripts are given either types R or AL.
>   #
>   # The unassigned code points that default to AL are in the ranges:
>   #     [\u0600-\u07BF \uFB50-\uFDFF \uFE70-\uFEFF]
>   #
>   #     Arabic:            U+0600 - U+06FF
>   #     Syriac:            U+0700 - U+074F
>   #     Arabic_Supplement: U+0750 - U+077F
>   #     Thaana:            U+0780 - U+07BF
>   #     Arabic_Presentation_Forms_A:
>   #                        U+FB50 - U+FDFF
>   #     Arabic_Presentation_Forms_B:
>   #                        U+FE70 - U+FEFF
>   #           minus noncharacter code points.
>   #
>   # The unassigned code points that default to R are in the ranges:
>   #     [\u0590-\u05FF \u07C0-\u08FF \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EFFF]
>   #
>   #     Hebrew:            U+0590 - U+05FF
>   #     NKo:               U+07C0 - U+07FF
>   #     Cypriot_Syllabary: U+10800 - U+1083F
>   #     Phoenician:        U+10900 - U+1091F
>   #     Lydian:            U+10920 - U+1093F
>   #     Kharoshthi:        U+10A00 - U+10A5F
>   #     and any others in the ranges:
>   #                        U+0800 - U+08FF,
>   #                        U+FB1D - U+FB4F,
>   #                        U+10840 - U+10FFF,
>   #                        U+1E800 - U+1EFFF
>   #
>   # For all other cases:
>   #  All code points not explicitly listed for Bidi_Class
>   #  have the value Left_To_Right (L).

I'll fix the code to handle it when I'm back to work on next
Monday.

---
Kenichi Handa
handa@m17n.org
* Re: bidi properties from uniprop tables
From: Eli Zaretskii @ 2011-08-20 13:00 UTC
To: Kenichi Handa; +Cc: stephen, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Sat, 20 Aug 2011 21:42:20 +0900
> Cc: stephen@xemacs.org, emacs-devel@gnu.org
>
> In article <8339gxsn6h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
>
> > > Since Bidi_Class is only used in this algorithm (and explicit property
> > > lookups) AFAIK
>
> > That's not true, it is also used in regexp search by category. So we
> > should decide whether to assign these types in the uniprop table, or
> > have a fallback for them in bidi.c. Any opinions? Handa-san?
>
> As I'm on vacation now, I can't access the source of Emacs,
> but I remember that there's a place in an element of
> unidata-SOMETHING-alist (I don't remember what SOMETHING is)
> to specify the default property value. So, it should be
> easy to fix the default value if it is a simple one.

I guess you mean unidata-prop-alist. If so, it already states that
the default is L:

    (bidi-class
     4 unidata-gen-table-symbol "uni-bidi.el"
     "Unicode bidi class.
  Property value is one of the following symbols:
    L, LRE, LRO, R, AL, RLE, RLO, PDF, EN, ES, ET,
    AN, CS, NSM, BN, B, S, WS, ON"
     unidata-describe-bidi-class
     L      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
     ;; The order of elements must be in sync with bidi_type_t in
     ;; src/dispextern.h.
     (L R EN AN BN B AL LRE LRO RLE RLO PDF ES ET CS NSM S WS ON))

I think the problem is deeper. The characters in question do not
appear at all in UnicodeData.txt. I think unidata-gen.el only handles
character codes it finds in UnicodeData.txt, but does nothing for
those it didn't find.

> I'll fix the code to handle it when I'm back to work on next
> Monday.

Thank you.
* Re: bidi properties from uniprop tables
From: Kenichi Handa @ 2011-08-23 12:51 UTC
To: Stephen J. Turnbull; +Cc: eliz, emacs-devel

In article <87k4aako1r.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Somebody misread the standard, I think.

It's me. Actually, as far as I remember, the early version
of UCD was not clear about the property values of characters
not listed in UCD, and the early version of unidata-gen.el
was created at that time. After that, I have not checked
the precise definitions of updated UCDs.

> Yes, a non-nil property list is required by the standard for all code
> points (not merely "all characters"), and it is obvious that in this
> case conforming to the standard is useful.

I've just installed fixes. In the latest code,
get-char-code-property never returns nil for these
properties:
  name, general-category, canonical-combining-class,
  bidi-class, decomposition, mirrored, old-name,
  iso-10646-comment.

But, it still returns nil for these properties ("string
property" in UCD terminology):
  decimal-digit-value, digit-value, numeric-value,
  uppercase, lowercase, titlecase, mirroring

UCD says that the default value is a character itself for
them, but to implement it, we have to fill all char-table
elements by corresponding characters (which makes the table
very big), or have to implement a special mechanism to
return a character itself if the value is nil (which I think
is not adequate at the current timing of feature freeze).
So, I just added this kind of statement in the docstring.

  The value nil means that the actual property value of a
  character is the character itself.

---
Kenichi Handa
handa@m17n.org
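The convention described in the message above, where a missing table entry means "the property value is the character itself", can be sketched in C as follows. This is only an illustration of the idea, not how Emacs char-tables work: `NO_MAPPING` plays the role of nil, and the two-entry `upper_table`, `raw_uppercase`, and `effective_uppercase` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plays the role of nil: the sparse table has no entry.  */
#define NO_MAPPING (-1)

struct case_entry {
    uint32_t cp;
    int32_t upper;
};

/* Hypothetical sparse table: only characters whose uppercase form
   differs from themselves are stored.  */
static const struct case_entry upper_table[] = {
    { 0x61, 0x41 }, /* U+0061 -> U+0041 */
    { 0xE9, 0xC9 }, /* U+00E9 -> U+00C9 */
};

/* Raw lookup, analogous to a property lookup that may return nil.  */
int32_t raw_uppercase(uint32_t cp)
{
    size_t n = sizeof upper_table / sizeof upper_table[0];

    for (size_t i = 0; i < n; i++)
        if (upper_table[i].cp == cp)
            return upper_table[i].upper;
    return NO_MAPPING;
}

/* Caller-side resolution: a missing entry means the character maps
   to itself, so the table never has to store identity mappings.  */
uint32_t effective_uppercase(uint32_t cp)
{
    int32_t mapped;

    assert(cp <= 0x10FFFF);
    mapped = raw_uppercase(cp);
    return mapped == NO_MAPPING ? cp : (uint32_t) mapped;
}
```

The trade-off is the one named in the message: the table stays small because identity mappings are not stored, at the cost of every caller (or one wrapper) having to interpret the "no entry" result.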
* Re: bidi properties from uniprop tables
From: Eli Zaretskii @ 2011-08-23 14:49 UTC
To: Kenichi Handa; +Cc: stephen, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: eliz@gnu.org, emacs-devel@gnu.org
> Date: Tue, 23 Aug 2011 21:51:03 +0900
>
> I've just installed fixes.

Thanks. I committed a followup changeset, to make the default
bidi-class properties consistent with the latest UCD, to adapt bidi.c
to the changes, and to document in the ELisp manual the default values
for the unassigned codepoints.

> But, it still returns nil for these properties ("string
> property" in UCD terminology):
>   decimal-digit-value, digit-value, numeric-value,
>   uppercase, lowercase, titlecase, mirroring
> UCD says that the default value is a character itself for
> them, but to implement it, we have to fill all char-table
> elements by corresponding characters (which makes the table
> very big), or have to implement a special mechanism to
> return a character itself if the value is nil (which I think
> is not adequate at the current timing of feature freeze).
> So, I just added this kind of statement in the docstring.
>
>   The value nil means that the actual property value of a
>   character is the character itself.

I agree with this implementation, and documented these properties
accordingly in the ELisp manual.

Thanks.
* Re: bidi properties from uniprop tables
From: Kenichi Handa @ 2011-08-23 23:36 UTC
To: Eli Zaretskii; +Cc: stephen, emacs-devel

In article <838vqki3mx.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > I've just installed fixes.

> Thanks. I committed a followup changeset, to make the default
> bidi-class properties consistent with the latest UCD, to adapt bidi.c
> to the changes, and to document in the ELisp manual the default values
> for the unassigned codepoints.

Thank you for them.

> > But, it still returns nil for these properties ("string
> > property" in UCD terminology):
> >   decimal-digit-value, digit-value, numeric-value,
> >   uppercase, lowercase, titlecase, mirroring
> > UCD says that the default value is a character itself for
> > them, but to implement it, we have to fill all char-table
> > elements by corresponding characters (which makes the table
> > very big), or have to implement a special mechanism to
> > return a character itself if the value is nil (which I think
> > is not adequate at the current timing of feature freeze).
> > So, I just added this kind of statement in the docstring.
> >
> >   The value nil means that the actual property value of a
> >   character is the character itself.

> I agree with this implementation, and documented these properties
> accordingly in the ELisp manual.

Thank you for that too.

---
Kenichi Handa
handa@m17n.org