unofficial mirror of guile-devel@gnu.org 
* [PATCH] Unicode general categories
@ 2009-12-24  5:46 Julian Graham
  2009-12-24  7:08 ` Mike Gran
  0 siblings, 1 reply; 4+ messages in thread
From: Julian Graham @ 2009-12-24  5:46 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 433 bytes --]

Hi all,

Find attached a patch that adds support for finding out the Unicode
general category [0] for a character, including documentation and unit
tests.  The API is pretty much the same as the one described in R6RS
Standard Libraries 1.1 [1].  I'll push if no one objects.


Regards,
Julian

[0] - http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf
[1] - http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_sec_1.1

[-- Attachment #2: 0001-Support-for-Unicode-general-categories.patch --]
[-- Type: text/x-diff, Size: 4804 bytes --]

From f8fef903d535fa9ceb2677ab0c7dacc7692ea0f3 Mon Sep 17 00:00:00 2001
From: Julian Graham <julian.graham@aya.yale.edu>
Date: Thu, 24 Dec 2009 00:25:19 -0500
Subject: [PATCH] Support for Unicode general categories

* libguile/chars.c, libguile/chars.h (scm_char_general_category): New function.
* test-suite/tests/chars.test: Unit tests for `char-general-category'.
* doc/ref/api-data.texi (Characters): Documentation for
  `char-general-category'.
---
 doc/ref/api-data.texi       |   91 +++++++++++++++++++++++++++++++++++++++++++
 libguile/chars.c            |   20 +++++++++
 libguile/chars.h            |    1 +
 test-suite/tests/chars.test |    7 +++-
 4 files changed, 118 insertions(+), 1 deletions(-)

diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index 6721b12..df5db48 100755
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -1875,6 +1875,97 @@ Return @code{#t} iff @var{chr} is either uppercase or lowercase, else
 @code{#f}.
 @end deffn
 
+@deffn {Scheme Procedure} char-general-category chr
+@deffnx {C Function} scm_char_general_category (chr)
+Return a symbol giving the one- or two-letter name of the Unicode
+general category assigned to @var{chr} or @code{#f} if no named category
+is assigned.  The following table provides a list of category names
+along with their meanings.
+
+@multitable @columnfractions .1 .4 .1 .4
+@item L
+ @tab Letter
+ @tab Pf
+ @tab Final quote punctuation
+@item Lu
+ @tab Uppercase letter
+ @tab Po
+ @tab Other punctuation
+@item Ll
+ @tab Lowercase letter
+ @tab S
+ @tab Symbol
+@item Lt
+ @tab Titlecase letter
+ @tab Sm
+ @tab Math symbol
+@item Lm
+ @tab Modifier letter
+ @tab Sc
+ @tab Currency symbol
+@item Lo
+ @tab Other letter
+ @tab Sk
+ @tab Modifier symbol
+@item M
+ @tab Mark
+ @tab So
+ @tab Other symbol
+@item Mn
+ @tab Non-spacing mark
+ @tab Z
+ @tab Separator
+@item Mc
+ @tab Combining spacing mark
+ @tab Zs
+ @tab Space separator
+@item Me
+ @tab Enclosing mark
+ @tab Zl
+ @tab Line separator
+@item N
+ @tab Number
+ @tab Zp
+ @tab Paragraph separator
+@item Nd
+ @tab Decimal digit number
+ @tab C
+ @tab Other
+@item Nl
+ @tab Letter number
+ @tab Cc
+ @tab Control
+@item No
+ @tab Other number
+ @tab Cf
+ @tab Format
+@item P
+ @tab Punctuation
+ @tab Cs
+ @tab Surrogate
+@item Pc
+ @tab Connector punctuation
+ @tab Co
+ @tab Private use
+@item Pd
+ @tab Dash punctuation
+ @tab Cn
+ @tab Unassigned
+@item Ps
+ @tab Open punctuation
+ @tab
+ @tab
+@item Pe
+ @tab Close punctuation
+ @tab
+ @tab
+@item Pi
+ @tab Initial quote punctuation
+ @tab
+ @tab
+@end multitable
+@end deffn
+
 @rnindex char->integer
 @deffn {Scheme Procedure} char->integer chr
 @deffnx {C Function} scm_char_to_integer (chr)
diff --git a/libguile/chars.c b/libguile/chars.c
index 1c4d106..36cb08d 100644
--- a/libguile/chars.c
+++ b/libguile/chars.c
@@ -25,6 +25,7 @@
 #include <ctype.h>
 #include <limits.h>
 #include <unicase.h>
+#include <unictype.h>
 
 #include "libguile/_scm.h"
 #include "libguile/validate.h"
@@ -467,6 +468,25 @@ SCM_DEFINE (scm_char_titlecase, "char-titlecase", 1, 0, 0,
 }
 #undef FUNC_NAME
 
+SCM_DEFINE (scm_char_general_category, "char-general-category", 1, 0, 0,
+           (SCM chr),
+            "Return a symbol representing the Unicode general category of "
+            "@var{chr} or @code{#f} if a named category cannot be found.")
+#define FUNC_NAME s_scm_char_general_category
+{
+  char *sym = NULL;
+  uc_general_category_t cat;
+
+  SCM_VALIDATE_CHAR (1, chr);
+  cat = uc_general_category ((int) SCM_CHAR (chr));
+  sym = uc_general_category_name (cat);
+
+  if (sym != NULL)
+    return scm_from_locale_symbol (sym);
+  return SCM_BOOL_F;
+}
+#undef FUNC_NAME
+
 \f
 
 
diff --git a/libguile/chars.h b/libguile/chars.h
index 2b00645..488dd25 100644
--- a/libguile/chars.h
+++ b/libguile/chars.h
@@ -81,6 +81,7 @@ SCM_API SCM scm_integer_to_char (SCM n);
 SCM_API SCM scm_char_upcase (SCM chr);
 SCM_API SCM scm_char_downcase (SCM chr);
 SCM_API SCM scm_char_titlecase (SCM chr);
+SCM_API SCM scm_char_general_category (SCM chr);
 SCM_API scm_t_wchar scm_c_upcase (scm_t_wchar c);
 SCM_API scm_t_wchar scm_c_downcase (scm_t_wchar c);
 SCM_API scm_t_wchar scm_c_titlecase (scm_t_wchar c);
diff --git a/test-suite/tests/chars.test b/test-suite/tests/chars.test
index 72805d1..cd1572f 100644
--- a/test-suite/tests/chars.test
+++ b/test-suite/tests/chars.test
@@ -210,7 +210,12 @@
        (not (char-is-both? #\newline))
        (char-is-both? #\a)
        (char-is-both? #\Z)
-       (not (char-is-both? #\1)))))
+       (not (char-is-both? #\1))))
+
+    (pass-if "char-general-category"
+      (and (eq? (char-general-category #\a) 'Ll)
+	   (eq? (char-general-category #\A) 'Lu)
+	   (eq? (char-general-category #\762) 'Lt))))
 
   (with-test-prefix "integer"
 
-- 
1.6.3.3



* Re: [PATCH] Unicode general categories
  2009-12-24  5:46 [PATCH] Unicode general categories Julian Graham
@ 2009-12-24  7:08 ` Mike Gran
  2009-12-24 17:10   ` Julian Graham
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Gran @ 2009-12-24  7:08 UTC (permalink / raw)
  To: Julian Graham, guile-devel

> Hi all,
> 
> Find attached a patch that adds support for finding out the Unicode
> general category [0] for a character, including documentation and unit
> tests.  The API is pretty much the same as the one described in R6RS
> Standard Libraries 1.1 [1].  I'll push if no one objects.

Hi Julian-

Cool.  I have two very minor and pedantic suggestions.  You say that 
it will return a "one- or two-letter name".  I'm pretty sure that
this code will always return a two-letter name and not the one-letter
general category.  

Also, the output of SCM_CHAR is effectively a 32-bit signed int and the 
uc_general_category takes effectively a 32-bit unsigned int, so perhaps the
cast to (int) should be left out or be libunistring's (ucs4_t) instead.
But, of course, the code works fine as it is.

Thanks,

Mike Gran
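
A minimal sketch of the cast point above, assuming (as the message describes) that the character value is effectively a signed 32-bit integer and the lookup argument effectively an unsigned 32-bit one. The names `demo_wchar`, `demo_ucs4` and `demo_to_ucs4` are hypothetical stand-ins, not Guile or libunistring identifiers. It shows why the conversion is harmless for validated characters: every Unicode code point lies in 0..0x10FFFF, so the value is never negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, assumed for illustration only: a signed
   32-bit character value and an unsigned 32-bit code point, matching
   the description in the message above.  */
typedef int32_t demo_wchar;
typedef uint32_t demo_ucs4;

/* Convert an already-validated character value to the unsigned
   code-point type.  Valid code points are 0..0x10FFFF, so the value
   is non-negative and the conversion cannot change it.  */
static demo_ucs4
demo_to_ucs4 (demo_wchar c)
{
  assert (c >= 0 && c <= 0x10FFFF);
  return (demo_ucs4) c;
}
```

For instance, the test character `#\762` in the patch is octal for code point 0x1F2, which passes through this conversion unchanged.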





* Re: [PATCH] Unicode general categories
  2009-12-24  7:08 ` Mike Gran
@ 2009-12-24 17:10   ` Julian Graham
  2009-12-28 10:31     ` Andy Wingo
  0 siblings, 1 reply; 4+ messages in thread
From: Julian Graham @ 2009-12-24 17:10 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

Hi Mike,

> Cool.  I have two very minor and pedantic suggestions.  You say that
> it will return a "one- or two-letter name".  I'm pretty sure that
> this code will always return a two-letter name and not the one-letter
> general category.

Yes, of course you're right -- uc_general_category_name operates in
terms of bits, not uc_general_category_ts returned from
uc_general_category.  The requirements for the corresponding R6RS
function confirm this as well.  I've updated the docs.


> Also, the output of SCM_CHAR is effectively a 32-bit signed int and the
> uc_general_category takes effectively a 32-bit unsigned int, so perhaps the
> cast to (int) should be left out or be libunistring's (ucs4_t) instead.
> But, of course, the code works fine as it is.

Also true -- I've taken that out (and fixed a missing `const'
specifier).  Did we turn off warnings being errors in master?  I'm
used to having my builds fail when I'm sloppy.

(Pushed.)


Thanks,
Julian
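
On the one- versus two-letter question discussed above, here is a toy sketch of a bitmask-based name lookup. It assumes each specific category is a single bit and a broad category is the union of several bits. The `DEMO_MASK_*` constants and `demo_category_name` are hypothetical, and this is not libunistring's actual table or bit layout. It only illustrates why a per-character lookup, which yields exactly one specific category, would come back with a two-letter name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments for the letter categories only.  */
enum
{
  DEMO_MASK_LU = 1 << 0,
  DEMO_MASK_LL = 1 << 1,
  DEMO_MASK_LT = 1 << 2,
  DEMO_MASK_LM = 1 << 3,
  DEMO_MASK_LO = 1 << 4,
  DEMO_MASK_L = (1 << 5) - 1    /* Union of the five bits above.  */
};

/* Return a name for MASK, or NULL if the bit pattern has no name.  */
static const char *
demo_category_name (uint32_t mask)
{
  static const char *const two_letter[] = { "Lu", "Ll", "Lt", "Lm", "Lo" };
  const uint32_t all_letters = DEMO_MASK_L;

  if (mask == 0 || mask > all_letters)
    return NULL;

  if ((mask & (mask - 1)) == 0)
    {
      /* Exactly one bit set: a specific category, two-letter name.  */
      size_t i = 0;
      while ((mask >> i) != 1u)
        i++;
      return two_letter[i];
    }

  /* Several bits set: only the full union has a (one-letter) name.  */
  return mask == all_letters ? "L" : NULL;
}
```

In this sketch a single-bit mask such as `DEMO_MASK_LT` maps to a two-letter name, the full union maps to the one-letter name, and any other combination has no name, which is the case the `#f` return value in the patch covers.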





* Re: [PATCH] Unicode general categories
  2009-12-24 17:10   ` Julian Graham
@ 2009-12-28 10:31     ` Andy Wingo
  0 siblings, 0 replies; 4+ messages in thread
From: Andy Wingo @ 2009-12-28 10:31 UTC (permalink / raw)
  To: Julian Graham; +Cc: guile-devel

On Thu 24 Dec 2009 18:10, Julian Graham <joolean@gmail.com> writes:

> Did we turn off warnings being errors in master? I'm used to having my
> builds fail when I'm sloppy.

Yes, because we want releases to not have -Werror. But we should (IMO)
re-enable -Werror for non-release builds, by default.

Andy
-- 
http://wingolog.org/



