* [PATCH] Unicode general categories
@ 2009-12-24 5:46 Julian Graham
2009-12-24 7:08 ` Mike Gran
0 siblings, 1 reply; 4+ messages in thread
From: Julian Graham @ 2009-12-24 5:46 UTC
To: guile-devel
[-- Attachment #1: Type: text/plain, Size: 433 bytes --]
Hi all,
Find attached a patch that adds support for finding out the Unicode
general category [0] for a character, including documentation and unit
tests. The API is pretty much the same as the one described in R6RS
Standard Libraries 1.1 [1]. I'll push if no one objects.
Regards,
Julian
[0] - http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf
[1] - http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_sec_1.1
[-- Attachment #2: 0001-Support-for-Unicode-general-categories.patch --]
[-- Type: text/x-diff, Size: 4804 bytes --]
From f8fef903d535fa9ceb2677ab0c7dacc7692ea0f3 Mon Sep 17 00:00:00 2001
From: Julian Graham <julian.graham@aya.yale.edu>
Date: Thu, 24 Dec 2009 00:25:19 -0500
Subject: [PATCH] Support for Unicode general categories
* libguile/chars.c, libguile/chars.h (scm_char_general_category): New function.
* test-suite/tests/chars.test: Unit tests for `char-general-category'.
* doc/ref/api-data.texi (Characters): Documentation for
`char-general-category'.
---
doc/ref/api-data.texi | 91 +++++++++++++++++++++++++++++++++++++++++++
libguile/chars.c | 20 +++++++++
libguile/chars.h | 1 +
test-suite/tests/chars.test | 7 +++-
4 files changed, 118 insertions(+), 1 deletions(-)
diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index 6721b12..df5db48 100755
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -1875,6 +1875,97 @@ Return @code{#t} iff @var{chr} is either uppercase or lowercase, else
@code{#f}.
@end deffn
+@deffn {Scheme Procedure} char-general-category chr
+@deffnx {C Function} scm_char_general_category (chr)
+Return a symbol giving the one- or two-letter name of the Unicode
+general category assigned to @var{chr} or @code{#f} if no named category
+is assigned. The following table provides a list of category names
+along with their meanings.
+
+@multitable @columnfractions .1 .4 .1 .4
+@item L
+ @tab Letter
+ @tab Pf
+ @tab Final quote punctuation
+@item Lu
+ @tab Uppercase letter
+ @tab Po
+ @tab Other punctuation
+@item Ll
+ @tab Lowercase letter
+ @tab S
+ @tab Symbol
+@item Lt
+ @tab Titlecase letter
+ @tab Sm
+ @tab Math symbol
+@item Lm
+ @tab Modifier letter
+ @tab Sc
+ @tab Currency symbol
+@item Lo
+ @tab Other letter
+ @tab Sk
+ @tab Modifier symbol
+@item M
+ @tab Mark
+ @tab So
+ @tab Other symbol
+@item Mn
+ @tab Non-spacing mark
+ @tab Z
+ @tab Separator
+@item Mc
+ @tab Combining spacing mark
+ @tab Zs
+ @tab Space separator
+@item Me
+ @tab Enclosing mark
+ @tab Zl
+ @tab Line separator
+@item N
+ @tab Number
+ @tab Zp
+ @tab Paragraph separator
+@item Nd
+ @tab Decimal digit number
+ @tab C
+ @tab Other
+@item Nl
+ @tab Letter number
+ @tab Cc
+ @tab Control
+@item No
+ @tab Other number
+ @tab Cf
+ @tab Format
+@item P
+ @tab Punctuation
+ @tab Cs
+ @tab Surrogate
+@item Pc
+ @tab Connector punctuation
+ @tab Co
+ @tab Private use
+@item Pd
+ @tab Dash punctuation
+ @tab Cn
+ @tab Unassigned
+@item Ps
+ @tab Open punctuation
+ @tab
+ @tab
+@item Pe
+ @tab Close punctuation
+ @tab
+ @tab
+@item Pi
+ @tab Initial quote punctuation
+ @tab
+ @tab
+@end multitable
+@end deffn
+
@rnindex char->integer
@deffn {Scheme Procedure} char->integer chr
@deffnx {C Function} scm_char_to_integer (chr)
diff --git a/libguile/chars.c b/libguile/chars.c
index 1c4d106..36cb08d 100644
--- a/libguile/chars.c
+++ b/libguile/chars.c
@@ -25,6 +25,7 @@
#include <ctype.h>
#include <limits.h>
#include <unicase.h>
+#include <unictype.h>
#include "libguile/_scm.h"
#include "libguile/validate.h"
@@ -467,6 +468,25 @@ SCM_DEFINE (scm_char_titlecase, "char-titlecase", 1, 0, 0,
}
#undef FUNC_NAME
+SCM_DEFINE (scm_char_general_category, "char-general-category", 1, 0, 0,
+ (SCM chr),
+ "Return a symbol representing the Unicode general category of "
+ "@var{chr} or @code{#f} if a named category cannot be found.")
+#define FUNC_NAME s_scm_char_general_category
+{
+ char *sym = NULL;
+ uc_general_category_t cat;
+
+ SCM_VALIDATE_CHAR (1, chr);
+ cat = uc_general_category ((int) SCM_CHAR (chr));
+ sym = uc_general_category_name (cat);
+
+ if (sym != NULL)
+ return scm_from_locale_symbol (sym);
+ return SCM_BOOL_F;
+}
+#undef FUNC_NAME
+
\f
diff --git a/libguile/chars.h b/libguile/chars.h
index 2b00645..488dd25 100644
--- a/libguile/chars.h
+++ b/libguile/chars.h
@@ -81,6 +81,7 @@ SCM_API SCM scm_integer_to_char (SCM n);
SCM_API SCM scm_char_upcase (SCM chr);
SCM_API SCM scm_char_downcase (SCM chr);
SCM_API SCM scm_char_titlecase (SCM chr);
+SCM_API SCM scm_char_general_category (SCM chr);
SCM_API scm_t_wchar scm_c_upcase (scm_t_wchar c);
SCM_API scm_t_wchar scm_c_downcase (scm_t_wchar c);
SCM_API scm_t_wchar scm_c_titlecase (scm_t_wchar c);
diff --git a/test-suite/tests/chars.test b/test-suite/tests/chars.test
index 72805d1..cd1572f 100644
--- a/test-suite/tests/chars.test
+++ b/test-suite/tests/chars.test
@@ -210,7 +210,12 @@
(not (char-is-both? #\newline))
(char-is-both? #\a)
(char-is-both? #\Z)
- (not (char-is-both? #\1)))))
+ (not (char-is-both? #\1))))
+
+ (pass-if "char-general-category"
+ (and (eq? (char-general-category #\a) 'Ll)
+ (eq? (char-general-category #\A) 'Lu)
+ (eq? (char-general-category #\762) 'Lt))))
(with-test-prefix "integer"
--
1.6.3.3
* Re: [PATCH] Unicode general categories
2009-12-24 5:46 [PATCH] Unicode general categories Julian Graham
@ 2009-12-24 7:08 ` Mike Gran
2009-12-24 17:10 ` Julian Graham
0 siblings, 1 reply; 4+ messages in thread
From: Mike Gran @ 2009-12-24 7:08 UTC
To: Julian Graham, guile-devel
> Hi all,
>
> Find attached a patch that adds support for finding out the Unicode
> general category [0] for a character, including documentation and unit
> tests. The API is pretty much the same as the one described in R6RS
> Standard Libraries 1.1 [1]. I'll push if no one objects.
Hi Julian-
Cool. I have two very minor and pedantic suggestions. You say that
it will return a "one- or two-letter name". I'm pretty sure that
this code will always return a two-letter name and not the one-letter
general category.
Also, the output of SCM_CHAR is effectively a 32-bit signed int and the
uc_general_category takes effectively a 32-bit unsigned int, so perhaps the
cast to (int) should be left out or be libunistring's (ucs4_t) instead.
But, of course, the code works fine as it is.
Thanks,
Mike Gran
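
The signedness point above can be sketched in isolation. This is a minimal illustration, not the libguile code: `toy_wchar` and `toy_ucs4` are hypothetical stand-ins for `scm_t_wchar` and `ucs4_t`, assumed here to be 32-bit signed and 32-bit unsigned respectively, as the message describes them. For any valid code point (0 through 0x10FFFF) the value is non-negative, so the signed and unsigned readings agree, which is presumably why the original cast "works fine as it is".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, assumed for this sketch only. */
typedef int32_t  toy_wchar;  /* plays the role of scm_t_wchar */
typedef uint32_t toy_ucs4;   /* plays the role of ucs4_t */

/* Convert an already validated code point to the unsigned type.
   The conversion preserves the value because valid code points
   are never negative. */
static toy_ucs4
toy_to_ucs4 (toy_wchar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  return (toy_ucs4) c;
}
```

Casting to the callee's own parameter type, or omitting the cast and letting the implicit conversion apply, states the intent more directly than a cast to `int`.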
* Re: [PATCH] Unicode general categories
2009-12-24 7:08 ` Mike Gran
@ 2009-12-24 17:10 ` Julian Graham
2009-12-28 10:31 ` Andy Wingo
0 siblings, 1 reply; 4+ messages in thread
From: Julian Graham @ 2009-12-24 17:10 UTC
To: Mike Gran; +Cc: guile-devel
Hi Mike,
> Cool. I have two very minor and pedantic suggestions. You say that
> it will return a "one- or two-letter name". I'm pretty sure that
> this code will always return a two-letter name and not the one-letter
> general category.
Yes, of course you're right -- uc_general_category_name operates in
terms of bits, not uc_general_category_ts returned from
uc_general_category. The requirements for the corresponding R6RS
function confirm this as well. I've updated the docs.
> Also, the output of SCM_CHAR is effectively a 32-bit signed int and the
> uc_general_category takes effectively a 32-bit unsigned int, so perhaps the
> cast to (int) should be left out or be libunistring's (ucs4_t) instead.
> But, of course, the code works fine as it is.
Also true -- I've taken that out (and fixed a missing `const'
specifier). Did we turn off warnings being errors in master? I'm
used to having my builds fail when I'm sloppy.
(Pushed.)
Thanks,
Julian
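
The "operates in terms of bits" explanation can be illustrated with a toy model. Everything named `toy_` or `TOY_` below is hypothetical: the bit assignments are made up for the sketch, and the lookup covers only ASCII letters. It is not libunistring's implementation. The idea it shows is that a name function keyed on bit masks can return a one-letter name for a union of categories, while a per-character lookup only ever yields a single-bit mask and therefore only a two-letter name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit assignments, assumed for this sketch only. */
enum
{
  TOY_MASK_Lu = 1u << 0,
  TOY_MASK_Ll = 1u << 1,
  TOY_MASK_Lt = 1u << 2,
  TOY_MASK_Lm = 1u << 3,
  TOY_MASK_Lo = 1u << 4,
  TOY_MASK_L  = TOY_MASK_Lu | TOY_MASK_Ll | TOY_MASK_Lt
                | TOY_MASK_Lm | TOY_MASK_Lo
};

/* Map a mask to a name: "L" for the union of all letter bits, a
   two-letter name for a single bit, NULL for anything else. */
static const char *
toy_category_name (uint32_t mask)
{
  static const char *const two_letter[] = { "Lu", "Ll", "Lt", "Lm", "Lo" };
  size_t i;

  if (mask == TOY_MASK_L)
    return "L";
  for (i = 0; i < sizeof two_letter / sizeof two_letter[0]; i++)
    if (mask == ((uint32_t) 1 << i))
      return two_letter[i];
  return NULL;
}

/* Per-character lookup (ASCII letters only in this toy): the result
   is always a single-bit mask, or 0 when the toy has no answer. */
static uint32_t
toy_category_of (uint32_t cp)
{
  assert (cp <= 0x10FFFF);
  if (cp >= (uint32_t) 'A' && cp <= (uint32_t) 'Z')
    return TOY_MASK_Lu;
  if (cp >= (uint32_t) 'a' && cp <= (uint32_t) 'z')
    return TOY_MASK_Ll;
  return 0;
}
```

In this model, `toy_category_name (toy_category_of ('a'))` gives "Ll", and "L" is reachable only by passing the union mask directly, never through the per-character lookup.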