unofficial mirror of guile-devel@gnu.org 
* Program received signal SIGSEGV, Segmentation fault.
@ 2012-11-16 18:00 Bruce Korb
  2012-11-16 19:19 ` Mark H Weaver
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Korb @ 2012-11-16 18:00 UTC (permalink / raw)
  To: guile-devel Development

This is a clumsy way of saying you don't like the '©' character in strings.
I truly do dislike the fact that you changed the behavior of strings.
Yes, I know I can figure out how to change my code to use byte arrays
somehow or another, but it is a lot of work.  More than just "sed 's/string/bytes/'"

Anyway, seg faulting is not a good response.

(gdb) printf "%s\n", pDE->de_val.dvu_text
This program reads or accepts a list of files and prints the names of the
files that are not plain text.  @i{plain text} characters are characters
in the range of 0x20 thru 0x7E (' ' thru '~'), plus backspace,
whitespace characters and the character 0xA9 (© - the circled-C copyright
character).
(gdb) s
Backtrace:
In ice-9/boot-9.scm:
 149: 5 [catch #t #<catch-closure 9b5860> ...]
 157: 4 [#<procedure 9510f0 ()>]
In unknown file:
   ?: 3 [catch-closure]
In ice-9/eval.scm:
 368: 2 [eval # ()]
 368: 1 [eval # ()]
In unknown file:
   ?: 0 [stack "explain"]

ERROR: In procedure stack:
ERROR: Throw to key `decoding-error' with args `("scm_from_stringn" "input locale conversion error" 84 \
#vu8(84 104 105 115 32 112 114 111 103 114 97 109 32 114 101 97 100 115 32 111 114 32 97 99 99 101 112 \
116 115 32 97 32 108 105 115 116 32 111 102 32 102 105 108 101 115 32 97 110 100 32 112 114 105 110 \
116 115 32 116 104 101 32 110 97 109 101 115 32 111 102 32 116 104 101 10 102 105 108 101 115 32 116 \
104 97 116 32 97 114 101 32 110 111 116 32 112 108 97 105 110 32 116 101 120 116 46 32 32 64 105 123 \
112 108 97 105 110 32 116 101 120 116 125 32 99 104 97 114 97 99 116 101 114 115 32 97 114 101 32 99 \
104 97 114 97 99 116 101 114 115 10 105 110 32 116 104 101 32 114 97 110 103 101 32 111 102 32 48 \
120 50 48 32 116 104 114 117 32 48 120 55 69 32 40 39 32 39 32 116 104 114 117 32 39 126 39 41 44 \
32 112 108 117 115 32 98 97 99 107 115 112 97 99 101 44 10 119 104 105 116 101 115 112 97 99 101 32 \
99 104 97 114 97 99 116 101 114 115 32 97 110 100 32 116 104 101 32 99 104 97 114 97 99 116 101 114 \
32 48 120 65 57 32 40 169 32 45 32 116 104 101 32 99 105 114 99 108 101 100 45 67 32 99 111 112 121 \
114 105 103 104 116 10 99 104 97 114 97 99 116 101 114 41 46))'.

Program received signal SIGSEGV, Segmentation fault.




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 18:00 Program received signal SIGSEGV, Segmentation fault Bruce Korb
@ 2012-11-16 19:19 ` Mark H Weaver
  2012-11-16 19:50   ` Bruce Korb
  0 siblings, 1 reply; 11+ messages in thread
From: Mark H Weaver @ 2012-11-16 19:19 UTC (permalink / raw)
  To: Bruce Korb; +Cc: guile-devel Development

Hi Bruce,

Bruce Korb <bruce.korb@gmail.com> writes:
> This is a clumsy way of saying you don't like the '©' character in
> strings.

I'm sorry, but you haven't provided nearly enough information for me to
figure out what caused the SIGSEGV.  I don't even know what function you
called, or what arguments you passed to it.  I guess you called one of
the scm_from_*_string functions with a string, without telling Guile
what string encoding was used.

> I truly do dislike the fact that you changed the behavior of
> strings.

The old (pre-2.0) way was completely broken for multibyte encodings.
Given that UTF-8 is fairly standard now, we simply cannot afford to
keep the old behavior.

If you *really* want something like the old behavior, then you can just
do (setlocale LC_ALL "ISO-8859-1"), but I don't recommend it.

> Yes, I know I can figure out how to change my code to use byte arrays
> somehow or another, but it is a lot of work.  More than just "sed
> 's/string/bytes/'"

Sorry, but the world has gotten more complex with multibyte character
encodings, locales, and the potential for text to/from the user to be in
a different encoding than string literals in your C source code.  If you
want your program to be multilingual, you're going to have to start
thinking about these issues.

In particular, if you use non-ASCII text in your program, then you're
going to have to keep track of which C strings came from the user (and
are thus in the encoding specified by the locale environment variables),
and which C strings came from your program (and are thus in some other
encoding that has nothing to do with the locale).

There is no quick fix, and Guile is not the source of these problems.

> Anyway, seg faulting is not a good response.

If you give me more information, perhaps I can help figure out what
caused the SIGSEGV.

     Mark




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 19:19 ` Mark H Weaver
@ 2012-11-16 19:50   ` Bruce Korb
  2012-11-16 20:20     ` Bruce Korb
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Korb @ 2012-11-16 19:50 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Bruce Korb, guile-devel Development

On 11/16/12 11:19, Mark H Weaver wrote:
> Hi Bruce,
> 
> Bruce Korb <bruce.korb@gmail.com> writes:
>> This is a clumsy way of saying you don't like the '©' character in
>> strings.
> 
> I'm sorry, but you haven't provided nearly enough information for me to

I'm sorry, I did leave off the backtrace.  Here's another GDB session:

460         return AG_SCM_STR02SCM(pE->de_val.dvu_text);
(gdb) printf "%s\n", pE->de_val.dvu_text
This program reads or accepts a list of files and prints the names of the
files that are not plain text.  @i{plain text} characters are characters
in the range of 0x20 thru 0x7E (' ' thru '~'), plus backspace,
whitespace characters and the character 0xA9 (© - the copyright character).
(gdb) s
Backtrace:
In ice-9/boot-9.scm:
 149: 8 [catch #t #<catch-closure 9b5860> ...]
 157: 7 [#<procedure 9510f0 ()>]
In unknown file:
   ?: 6 [catch-closure]
In ice-9/eval.scm:
 407: 5 [eval # ()]
 442: 4 [eval # #]
 368: 3 [eval # #]
 368: 2 [eval # #]
 368: 1 [eval # #]
In unknown file:
   ?: 0 [get "explain" #<undefined>]

ERROR: In procedure get:
ERROR: Throw to key `decoding-error' with args `("scm_from_stringn" "input locale conversion error" \
84 #vu8(84 104 105 115 32 112 114 111 103 114 97 109 32 114 101 97 100 115 32 111 114 32 97 99 99 \
 101 112 116 115 32 97 32 108 105 115 116 32 111 102 32 102 105 108 101 115 32 97 110 100 32 112 \
 114 105 110 116 115 32 116 104 101 32 110 97 109 101 115 32 111 102 32 116 104 101 10 102 105 108 \
 101 115 32 116 104 97 116 32 97 114 101 32 110 111 116 32 112 108 97 105 110 32 116 101 120 116 46 \
 32 32 64 105 123 112 108 97 105 110 32 116 101 120 116 125 32 99 104 97 114 97 99 116 101 114 115 \
 32 97 114 101 32 99 104 97 114 97 99 116 101 114 115 10 105 110 32 116 104 101 32 114 97 110 103 \
 101 32 111 102 32 48 120 50 48 32 116 104 114 117 32 48 120 55 69 32 40 39 32 39 32 116 104 114 \
 117 32 39 126 39 41 44 32 112 108 117 115 32 98 97 99 107 115 112 97 99 101 44 10 119 104 105 116 \
 101 115 112 97 99 101 32 99 104 97 114 97 99 116 101 114 115 32 97 110 100 32 116 104 101 32 99 \
 104 97 114 97 99 116 101 114 32 48 120 65 57 32 40 169 32 45 32 116 104 101 32 99 111 112 121 114 \
 105 103 104 116 32 99 104 97 114 97 99 116 101 114 41 46))'.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78a389e in scm_frame_procedure () from /usr/lib64/libguile-2.0.so.22

"AG_SCM_STR02SCM()" is a Guile-version dependent macro:

$ fgrep GUILE_VERSION ../config.h
#define GUILE_VERSION 200005

an edited extract from my version dependent wrapper header:

#if (GUILE_VERSION >= 200000) && (GUILE_VERSION <= 200003)
# error AutoGen does not work with this version of Guile
  choke me.

#elif (GUILE_VERSION <= 106000)
# error AutoGen does not work with this version of Guile
  choke me.

#elif GUILE_VERSION < 107000
# define AG_SCM_STR02SCM(_s)          scm_makfrom0str(_s)

#elif GUILE_VERSION < 200000
# define AG_SCM_STR02SCM(_s)          scm_from_locale_string(_s)

#elif GUILE_VERSION < 201000
# define AG_SCM_STR02SCM(_s)          scm_from_utf8_string(_s)

[...]
> figure out what caused the SIGSEGV.  I don't even know what function you

scm_from_locale_string().  I had a stack trace that disappeared
from the email.  (Typo of some sort.  Sorry.)

> called, or what arguments you passed to it.  I guess you called one of
> the scm_from_*_string functions with a string, without telling Guile
> what string encoding was used.

Telling it?  Aside from LC_ALL=C ??  Nope.

>> Anyway, seg faulting is not a good response.
> 
> If you give me more information, perhaps I can help figure out what
> caused the SIGSEGV.

I am certain it was the \251 character, because when I remove it, it doesn't happen.




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 19:50   ` Bruce Korb
@ 2012-11-16 20:20     ` Bruce Korb
  2012-11-16 21:23       ` Mark H Weaver
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Korb @ 2012-11-16 20:20 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Bruce Korb, guile-devel Development


> "AG_SCM_STR02SCM()" is a Guile-version dependent macro:
> 
> $ fgrep GUILE_VERSION ../config.h
> #define GUILE_VERSION 200005
> 
> an edited extract from my version dependent wrapper header:
> 

> #elif GUILE_VERSION < 200000
> # define AG_SCM_STR02SCM(_s)          scm_from_locale_string(_s)
> 
> #elif GUILE_VERSION < 201000
> # define AG_SCM_STR02SCM(_s)          scm_from_utf8_string(_s)
> 
> [...]
>> figure out what caused the SIGSEGV.  I don't even know what function you
> 
> scm_from_locale_string().  I had a stack trace that disappeared
> from the email.  (Typo of some sort.  Sorry.)

Actually, it was scm_from_utf8_string, since GUILE_VERSION was 200005

 -- C Function: SCM scm_from_latin1_string (const char *str)
 -- C Function: SCM scm_from_utf8_string (const char *str)
 -- C Function: SCM scm_from_utf32_string (const scm_t_wchar *str)
     Return a scheme string from the null-terminated C string STR,
     which is ISO-8859-1-, UTF-8-, or UTF-32-encoded.  These functions
     should be used to convert hard-coded C string constants into
     Scheme strings.





* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 20:20     ` Bruce Korb
@ 2012-11-16 21:23       ` Mark H Weaver
  2012-11-16 23:32         ` Bruce Korb
  0 siblings, 1 reply; 11+ messages in thread
From: Mark H Weaver @ 2012-11-16 21:23 UTC (permalink / raw)
  To: Bruce Korb; +Cc: guile-devel Development

Bruce Korb <bkorb@gnu.org> writes:
>>> figure out what caused the SIGSEGV.  I don't even know what function you
>> 
>> scm_from_locale_string().  I had a stack trace that disappeared
>> from the email.  (Typo of some sort.  Sorry.)
>
> Actually, it was scm_from_utf8_string, since GUILE_VERSION was 200005

Okay, that's the problem.  You told Guile that the C string was encoded
in UTF-8, but actually it was encoded in Latin-1:

> ERROR: Throw to key `decoding-error' with args `("scm_from_stringn" "input locale conversion error" \
> 84 #vu8(84 104 105 115 32 112 114 111 103 114 97 109 32 114 101 97 100 115 32 111 114 32 97 99 99 \
>  101 112 116 115 32 97 32 108 105 115 116 32 111 102 32 102 105 108 101 115 32 97 110 100 32 112 \
>  114 105 110 116 115 32 116 104 101 32 110 97 109 101 115 32 111 102 32 116 104 101 10 102 105 108 \
>  101 115 32 116 104 97 116 32 97 114 101 32 110 111 116 32 112 108 97 105 110 32 116 101 120 116 46 \
>  32 32 64 105 123 112 108 97 105 110 32 116 101 120 116 125 32 99 104 97 114 97 99 116 101 114 115 \
>  32 97 114 101 32 99 104 97 114 97 99 116 101 114 115 10 105 110 32 116 104 101 32 114 97 110 103 \
>  101 32 111 102 32 48 120 50 48 32 116 104 114 117 32 48 120 55 69 32 40 39 32 39 32 116 104 114 \
>  117 32 39 126 39 41 44 32 112 108 117 115 32 98 97 99 107 115 112 97 99 101 44 10 119 104 105 116 \
>  101 115 112 97 99 101 32 99 104 97 114 97 99 116 101 114 115 32 97 110 100 32 116 104 101 32 99 \
>  104 97 114 97 99 116 101 114 32 48 120 65 57 32 40 169 32 45 32 116 104 101 32 99 111 112 121 114 \
>  105 103 104 116 32 99 104 97 114 97 99 116 101 114 41 46))'.

That 169 on the second-to-the-last line is 0xA9, which is the Latin-1
encoding for the copyright symbol.  It is not legal as a bare character
in UTF-8.  The correct UTF-8 encoding for that symbol is 0xC2 0xA9.
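
The byte-level point can be illustrated with a minimal C sketch. It is
not taken from Guile or AutoGen: the names is_valid_utf8 and
latin1_to_utf8 are hypothetical helpers introduced here, and the
validator only checks lead-byte and continuation-byte structure (it
does not reject every overlong or surrogate form).

```c
#include <stddef.h>

/* Structural UTF-8 check (hypothetical helper, not a Guile function).
   Returns 1 if every lead byte is followed by the right number of
   continuation bytes, 0 otherwise.  A bare 0xA9 is a continuation
   byte with no lead byte, so it is rejected. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80)                    extra = 0;  /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  /* 2-byte lead */
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  /* 3-byte lead */
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  /* 4-byte lead */
        else return 0;          /* stray continuation or invalid lead */
        if (extra > len - i - 1)
            return 0;           /* sequence truncated at end of input */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* expected a continuation byte */
        i += extra + 1;
    }
    return 1;
}

/* Convert Latin-1 bytes to UTF-8 (hypothetical helper).
   'out' must have room for 2 * len bytes.  Returns bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                    /* ASCII unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Under these assumptions, passing the three bytes '(' 0xA9 ')' to
is_valid_utf8 yields 0, while the output of latin1_to_utf8 on the same
bytes contains 0xC2 0xA9 in the middle and passes the check.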

As for the SIGSEGV, that's probably a bug in the backtrace printer for
Guile.  Sorry about that.  Thanks for the info.

     Mark




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 21:23       ` Mark H Weaver
@ 2012-11-16 23:32         ` Bruce Korb
  2012-11-17  2:19           ` Noah Lavine
  2012-11-17  4:22           ` Mark H Weaver
  0 siblings, 2 replies; 11+ messages in thread
From: Bruce Korb @ 2012-11-16 23:32 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel Development

On 11/16/12 13:23, Mark H Weaver wrote:
>> Actually, it was scm_from_utf8_string, since GUILE_VERSION was 200005
> 
> Okay, that's the problem.  You told Guile that the C string was encoded
> in UTF-8, but actually it was encoded in Latin-1:

OK, so I tried latin1, too.  (replacing scm_from_utf8_string with
scm_from_latin1_string).  That also does not work.  It replaced the
0xA9 character with '?'.  What it all boils down to is that
I am looking for string handling functions that will handle the
NUL terminated list of bytes and keep its nose out of the contents
of the string.  Period.  Full stop.

So what is left?

I do not want to write my own ag_scm_from_zbytes_string(n) functions.
Such a function would need to remain portable to any and all internal
wiggling by Guile.  And it would need to work with the string scanning
functions, or I'd have to rewrite all of those too.  Or are strings
tagged with the content type, thus making it possible to call the
correct string scanning functions?  Of course, that would mean adding
the "zbytes" functions anyway....

> As for the SIGSEGV, that's probably a bug in the backtrace printer for
> Guile.  Sorry about that.  Thanks for the info.

You're welcome.  I did figure it was something like that.




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 23:32         ` Bruce Korb
@ 2012-11-17  2:19           ` Noah Lavine
  2012-11-17 20:22             ` Bruce Korb
  2012-11-17  4:22           ` Mark H Weaver
  1 sibling, 1 reply; 11+ messages in thread
From: Noah Lavine @ 2012-11-17  2:19 UTC (permalink / raw)
  To: Bruce Korb; +Cc: Mark H Weaver, guile-devel Development


Hello,

On Fri, Nov 16, 2012 at 6:32 PM, Bruce Korb <bkorb@gnu.org> wrote:

> On 11/16/12 13:23, Mark H Weaver wrote:
> >> Actually, it was scm_from_utf8_string, since GUILE_VERSION was 200005
> >
> > Okay, that's the problem.  You told Guile that the C string was encoded
> > in UTF-8, but actually it was encoded in Latin-1:
>
> OK, so I tried latin1, too.  (replacing scm_from_utf8_string with
> scm_from_latin1_string).  That also does not work.  It replaced the
> 0xA9 character with '?'.


I am no expert on character encodings, but we've seen errors like this
before where it turned out that Guile was attempting to display the
character on a terminal which didn't support it, and then the terminal
converted it into '?'. Could there have been some change in how Guile
displays strings that caused this error? Did it use to show a \-escape
sequence?


> What it all boils down to is that
> I am looking for string handling functions that will handle the
> NUL terminated list of bytes and keep its nose out of the contents
> of the string.  Period.  Full stop.
>

Could you explain what you're trying to do a little more? If you're calling
a function that looks at characters on a string object that doesn't contain
valid characters, then it will fail. If you have a NUL-terminated list of
bytes that contains only characters valid in some encoding, then the
scm_from_*_string functions are supposed to wrap it. So do you intend to
make a string object and then never look inside? Or are you going to roll
your own string-handling starting from byte sequences? The rest of your
email suggests not.

Thanks,
Noah



* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-16 23:32         ` Bruce Korb
  2012-11-17  2:19           ` Noah Lavine
@ 2012-11-17  4:22           ` Mark H Weaver
  2012-11-17 18:12             ` Bruce Korb
  1 sibling, 1 reply; 11+ messages in thread
From: Mark H Weaver @ 2012-11-17  4:22 UTC (permalink / raw)
  To: Bruce Korb; +Cc: guile-devel Development

Bruce Korb <bkorb@gnu.org> writes:
> On 11/16/12 13:23, Mark H Weaver wrote:
>>> Actually, it was scm_from_utf8_string, since GUILE_VERSION was 200005
>> 
>> Okay, that's the problem.  You told Guile that the C string was encoded
>> in UTF-8, but actually it was encoded in Latin-1:
>
> OK, so I tried latin1, too.  (replacing scm_from_utf8_string with
> scm_from_latin1_string).  That also does not work.  It replaced the
> 0xA9 character with '?'.

I think what happened is that scm_from_latin1_string *did* work, but
printing did not work.  That's because when the locale is set to "C",
that means ASCII-only, so it was unable to print the copyright symbol.

> What it all boils down to is that I am looking for string handling
> functions that will handle the NUL terminated list of bytes and keep
> its nose out of the contents of the string.  Period.  Full stop.

Bruce, if you refuse to fix these problems properly, you will end up
with a broken program.  Period.  Full stop.  Most modern distributions
use UTF-8 by default, and if you simply write a bare 0xA9 to the
terminal or output file, that's not going to look right on their
terminal or editor.

But if that's really what you want, fine, here's how you do it:

  (fluid-set! %default-port-encoding "ISO-8859-1")
  (set-port-encoding! (current-output-port) "ISO-8859-1")
  (set-port-encoding! (current-input-port) "ISO-8859-1")
  (set-port-encoding! (current-error-port) "ISO-8859-1")

and make sure to *not* set the locale.
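
As background on why ISO-8859-1 is the encoding suggested for this
purpose: it maps each byte value 0x00..0xFF to the code point with the
same number, so decoding and re-encoding returns the original bytes.
The following is a minimal C sketch of that property only. It does not
call Guile, and latin1_decode and latin1_encode are hypothetical names
introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode Latin-1: each byte becomes the code point of the same value.
   (Hypothetical helper, not a Guile function.) */
void latin1_decode(const unsigned char *in, size_t len, uint32_t *cp)
{
    for (size_t i = 0; i < len; i++)
        cp[i] = in[i];
}

/* Encode Latin-1: each code point up to U+00FF becomes one byte.
   Returns 0 on success, -1 if a code point does not fit. */
int latin1_encode(const uint32_t *cp, size_t len, unsigned char *out)
{
    for (size_t i = 0; i < len; i++) {
        if (cp[i] > 0xFF)
            return -1;          /* not representable in Latin-1 */
        out[i] = (unsigned char)cp[i];
    }
    return 0;
}
```

Decoding any buffer with latin1_decode and feeding the result to
latin1_encode reproduces the input byte for byte, including 0xA9.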

     Mark




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-17  4:22           ` Mark H Weaver
@ 2012-11-17 18:12             ` Bruce Korb
  2012-11-17 19:56               ` Mark H Weaver
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Korb @ 2012-11-17 18:12 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel Development

On 11/16/12 20:22, Mark H Weaver wrote:
> Bruce, if you refuse to fix these problems properly, you will end up

Hi Mark,

My program's intent is to read text from two inputs and weave them
together.  It has no need to know or understand the encoding in any way,
other than to communicate exception messages to the user.
Read the inputs, transform the text, write the output.
ISO-8859-1 or UTF8 or unsimplified Chinese input, doesn't matter.
It only matters that it be NUL byte delimitable, and all encodings
support that.  User messages are not involved here.

> with a broken program.  Period.  Full stop.  Most modern distributions
> use UTF-8 by default, and if you simply write a bare 0xA9 to the
> terminal or output file, that's not going to look right on their
> terminal or editor.

It was a data character, not part of my program.
I am trying to get that character to flow through my program
from my inputs to the output file, but I am having trouble with
the Guile functions that transform the data.  I want to hand the
Guile library a string, a la
   (define my-val (get "val-string"))
where "get" is a function that pulls bytes from the input.
Then, later on,  insert it into the output with:
    (emit my-val)
and have *exactly* what was gotten from the input.
I do not care in the slightest what the current locale is.
I pulled in some arbitrary data with ``(get "val-string")''
and now I want those exact bytes to be emitted where I
have the ``(emit my-val)'' invocation.

> But if that's really what you want, fine, here's how you do it:
> 
>   (fluid-set! %default-port-encoding "ISO-8859-1")
>   (set-port-encoding! (current-output-port) "ISO-8859-1")
>   (set-port-encoding! (current-input-port) "ISO-8859-1")
>   (set-port-encoding! (current-error-port) "ISO-8859-1")
> 
> and make sure to *not* set the locale.

Every time I have a fragment of scheme code, I have a new port.
Doing it this way would require concatenating that text with
the text to invoke.  That adds an allocate, two string copies
and a free to every scheme invocation.  I'll poke around, but
I am guessing there would have to be some more of this set up
for each scheme sequence, yes?
        {
            SCM ln = AG_SCM_INT2SCM(line);
            scm_set_port_filename_x(port, file);
            scm_set_port_line_x(port, ln);
            scm_set_port_column_x(port, SCM_INUM0);
        }




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-17 18:12             ` Bruce Korb
@ 2012-11-17 19:56               ` Mark H Weaver
  0 siblings, 0 replies; 11+ messages in thread
From: Mark H Weaver @ 2012-11-17 19:56 UTC (permalink / raw)
  To: Bruce Korb; +Cc: guile-devel Development

Bruce Korb <bkorb@gnu.org> writes:
> On 11/16/12 20:22, Mark H Weaver wrote:
>> Bruce, if you refuse to fix these problems properly, you will end up
>
> Hi Mark,
>
> My program's intent is to read text from two inputs and weave them
> together.  It has no need to know or understand the encoding in any way,

To weave them together, you need to interpret the input characters to
recognize the start and end points of each segment that you will copy to
the output.  Therefore, you need to know the correct character encoding.

Let me give you an example of what can happen if you blindly interpret
all inputs as a series of bytes, and interpret all of the bytes that
fall within the ASCII range as delimiters, macro invocations,
expressions, or whatever.

Consider the following quoted string containing a single Chinese
character, and stored in a file using the GBK character encoding:

   "甛"

The bytes in the file corresponding to those three characters are:

   22 AE 5C 22  (hex)

These same bytes, interpreted as ISO-8859-1 (Latin-1), correspond to the
following four characters:

   "®\"

So if autogen reads this file as a sequence of bytes (or coaxes Guile
into doing so) it will see a backslash before the closing quote, and
thus treat it as an escape and keep reading the string.  At which point
your Chinese user is scratching his head and wondering what went wrong,
because he sees no backslash; he sees only a single Chinese character
between the quotes.
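
The failure described above can be reproduced with a short C sketch.
This is not AutoGen's scanner: find_closing_quote is a hypothetical
function, and its GBK handling is simplified (it assumes any byte in
0x81..0xFE is a lead byte and skips the byte that follows without
validating it).

```c
#include <stddef.h>

/* Find the closing '"' of a string literal that opens at s[0].
   Returns its index, or -1 if no closing quote is found.
   With gbk_aware == 0 every byte is treated as a character, so a
   0x5C trail byte looks like a backslash escape.  With gbk_aware != 0
   a byte in 0x81..0xFE is taken as a GBK lead byte and the next byte
   is skipped as its trail byte.  (Hypothetical, simplified sketch.) */
long find_closing_quote(const unsigned char *s, size_t len, int gbk_aware)
{
    if (len == 0 || s[0] != '"')
        return -1;
    for (size_t i = 1; i < len; i++) {
        unsigned char c = s[i];
        if (gbk_aware && c >= 0x81 && c <= 0xFE) {
            i++;                /* skip the trail byte of this character */
            continue;
        }
        if (c == '\\') {
            i++;                /* escape: skip the escaped byte */
            continue;
        }
        if (c == '"')
            return (long)i;
    }
    return -1;
}
```

For the four bytes 22 AE 5C 22, the byte-wise mode (gbk_aware == 0)
returns -1 because the final quote is consumed as an escaped character,
while the GBK-aware mode returns index 3.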

> I want to hand the Guile library a string, a la
>    (define my-val (get "val-string"))
> where "get" is a function that pulls bytes from the input.

The example above demonstrates that you expect Guile to parse a string
literal, and thus it needs to know how to interpret the bytes as
characters.  For example, it needs to know whether 5C is really a
backslash, or the second byte of a two-byte character sequence for some
Chinese character.

>> But if that's really what you want, fine, here's how you do it:
>> 
>>   (fluid-set! %default-port-encoding "ISO-8859-1")
>>   (set-port-encoding! (current-output-port) "ISO-8859-1")
>>   (set-port-encoding! (current-input-port) "ISO-8859-1")
>>   (set-port-encoding! (current-error-port) "ISO-8859-1")
>> 
>> and make sure to *not* set the locale.
>
> Every time I have a fragment of scheme code, I have a new port.

The (fluid-set! %default-port-encoding "ISO-8859-1") should cause all
ports opened in the future to use the ISO-8859-1 (Latin-1) character
encoding, as long as you haven't called 'setlocale'.  The only reason we
need to call 'set-port-encoding!' on the other ports is because they've
already been opened.

> Doing it this way would require concatenating that text with
> the text to invoke.  That adds an allocate, two string copies
> and a free to every scheme invocation.

I don't understand what you mean here.

> I'll poke around, but
> I am guessing there would have to be some more of this set up
> for each scheme sequence, yes?
>         {
>             SCM ln = AG_SCM_INT2SCM(line);
>             scm_set_port_filename_x(port, file);
>             scm_set_port_line_x(port, ln);
>             scm_set_port_column_x(port, SCM_INUM0);
>         }

I don't think you should need to add anything here, but this reminds me
of another problem with interpreting the inputs as byte streams: the
column number in error messages will not be correct.  It will be a byte
number instead of a character number on the line.
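
A small C sketch of the difference between a byte column and a
character column, assuming for illustration that the line is UTF-8
(the thread does not fix an encoding here). The function name
utf8_char_count is hypothetical, and it assumes well-formed input.

```c
#include <stddef.h>

/* Count characters in well-formed UTF-8 by counting every byte that
   is not a continuation byte (10xxxxxx).  (Hypothetical helper.) */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the bytes C2 A9 20 78 (the copyright sign, a space, and 'x') the
byte count is 4 but utf8_char_count returns 3, so a column reported in
bytes would point one position past what the user sees.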

     Mark




* Re: Program received signal SIGSEGV, Segmentation fault.
  2012-11-17  2:19           ` Noah Lavine
@ 2012-11-17 20:22             ` Bruce Korb
  0 siblings, 0 replies; 11+ messages in thread
From: Bruce Korb @ 2012-11-17 20:22 UTC (permalink / raw)
  To: Noah Lavine; +Cc: Mark H Weaver, guile-devel Development

Hi Noah, Mark,

On 11/16/12 18:19, Noah Lavine wrote:
>     OK, so I tried latin1, too.  (replacing scm_from_utf8_string with
>     scm_from_latin1_string).  That also does not work.  It replaced the
>     0xA9 character with '?'.
> 
> 
> I am no expert on character encodings, but we've seen errors like this before 
> where it turned out that Guile was attempting to display the character on a 
> terminal which didn't support it, and then the terminal converted it into '?'.

I actually do not use Guile to output anything.
I give Guile a string with either the "define" or "set!" functions
and later pull them out with the verboten function scm_i_string_chars.
As far as I can tell, I have to call this procedure myself because
all other paths seem to lead through u8_uctomb, which is not
helpful for my application.

> Could there have been some change in how Guile displays strings that caused 
> this error? Did it use to show a \-escape sequence?

I have always called some function to obtain the text.
I understand the memory is not stable, so I immediately copy
the text out and forget the returned address.  Much like what
scm_to_utf8_stringn() does, but without going through the
u8_uctomb_aux transmogrifier.  I tend to use memcpy or fwrite.

>     What it all boils down to is that
>     I am looking for string handling functions that will handle the
>     NUL terminated list of bytes and keep its nose out of the contents
>     of the string.  Period.  Full stop.
> 
> Could you explain what you're trying to do a little more?

Read input text, modify it according to some embedded markups,
inserting some auxiliary text gotten from another file, and
emitting the result.  95% of it is for computer program text.
The remainder is for man pages and texi docs.

Sometimes, people like to insert a copyright character in their
program text.  About a decade ago, someone asked me to do
something that would verify that a particular file was pure and
proper text and that copyright characters were okay.  This meant
that the "file" program was insufficient.  I rebuilt my library
of old stuff and my current autogen choked and died.

So here I am investigating the cause.

> If you're calling a function that looks at characters on a string 
> object that doesn't contain valid characters, then it will fail.

The only invalid character in my tiny little world is the NUL byte.

> ... So do you intend to make a 
> string object and then never look inside?

I look inside for a limited set of reasons:

1. to write it to output
2. to move it someplace else
3. to compare it against another sequence of bytes

> Or are you going to 
> roll your own string-handling starting from byte sequences?

I do not want to do this, but I will if I have to.

I really do not want to mess with transforming character sets on my
input and output.  Just read in, adjust as directed, and write.

Thank you for any help on using the Guile interface properly!

Regards, Bruce




end of thread, other threads:[~2012-11-17 20:22 UTC | newest]

Thread overview: 11+ messages
2012-11-16 18:00 Program received signal SIGSEGV, Segmentation fault Bruce Korb
2012-11-16 19:19 ` Mark H Weaver
2012-11-16 19:50   ` Bruce Korb
2012-11-16 20:20     ` Bruce Korb
2012-11-16 21:23       ` Mark H Weaver
2012-11-16 23:32         ` Bruce Korb
2012-11-17  2:19           ` Noah Lavine
2012-11-17 20:22             ` Bruce Korb
2012-11-17  4:22           ` Mark H Weaver
2012-11-17 18:12             ` Bruce Korb
2012-11-17 19:56               ` Mark H Weaver
