bug#10627: char-ready? is broken for multibyte encodings

unofficial mirror of bug-guile@gnu.org 

* bug#10627: char-ready? is broken for multibyte encodings
@ 2012-01-28 10:21 Mark H Weaver
  2013-02-24 19:11 ` Andy Wingo
  0 siblings, 1 reply; 10+ messages in thread
From: Mark H Weaver @ 2012-01-28 10:21 UTC
  To: 10627

The R5RS specifies that if 'char-ready?' returns #t, then the next
'read-char' operation is guaranteed not to hang.  This is not currently
the case for ports using a multibyte encoding.

'char-ready?' currently returns #t whenever at least one _byte_ is
available.  This is not correct in general.  It should return #t only if
there is a complete _character_ available.

     Mark
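
For illustration only (this sketch is not from the thread and is not
Guile's code): the difference between "a byte is available" and "a
character is available" can be made concrete for UTF-8.  The helper
names below are hypothetical, and the sketch assumes the port encoding
is UTF-8; other multibyte encodings would need their own rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the UTF-8 sequence announced by a lead byte, or 0 if the
   byte cannot start a well-formed sequence. */
size_t utf8_expected_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;                                    /* stray or invalid lead byte */
}

/* True if buf[0..len) is enough for a decoder to produce a result
   (a character or a decoding error) without waiting for more bytes. */
bool utf8_char_complete(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len == 0)
        return false;                            /* nothing buffered at all */

    size_t need = utf8_expected_length(buf[0]);
    if (need == 0)
        return true;                             /* invalid lead: error is decidable now */

    /* A non-continuation byte inside the sequence is also decidable now. */
    for (size_t i = 1; i < need && i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            return true;

    return len >= need;                          /* otherwise all bytes must be present */
}
```

With only the first byte of a two-byte character buffered, a byte-level
check answers "ready" while this check answers "not ready", which is the
mismatch described in the report.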

* bug#10627: char-ready? is broken for multibyte encodings
  2012-01-28 10:21 bug#10627: char-ready? is broken for multibyte encodings Mark H Weaver
@ 2013-02-24 19:11 ` Andy Wingo
  2013-02-24 20:14   ` Mark H Weaver
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Wingo @ 2013-02-24 19:11 UTC
  To: Mark H Weaver; +Cc: 10627

On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw@netris.org> writes:

> The R5RS specifies that if 'char-ready?' returns #t, then the next
> 'read-char' operation is guaranteed not to hang.  This is not currently
> the case for ports using a multibyte encoding.
>
> 'char-ready?' currently returns #t whenever at least one _byte_ is
> available.  This is not correct in general.  It should return #t only if
> there is a complete _character_ available.

This procedure is omitted in the R6RS because it is not a good
interface.  Besides its semantic difficulties, can you think of a sane
implementation for multibyte characters?

I suggest we document that this procedure only works correctly in
encodings with 1-byte characters and recommend that people use u8-ready?
instead.

Andy
-- 
http://wingolog.org/

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-24 19:11 ` Andy Wingo
@ 2013-02-24 20:14   ` Mark H Weaver
  2013-02-24 22:15     ` Andy Wingo
  0 siblings, 1 reply; 10+ messages in thread
From: Mark H Weaver @ 2013-02-24 20:14 UTC
  To: Andy Wingo; +Cc: 10627

Hi Andy,

Andy Wingo <wingo@pobox.com> writes:

> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw@netris.org> writes:
>
>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>> 'read-char' operation is guaranteed not to hang.  This is not currently
>> the case for ports using a multibyte encoding.
>>
>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>> available.  This is not correct in general.  It should return #t only if
>> there is a complete _character_ available.
>
> This procedure is omitted in the R6RS because it is not a good
> interface.  Besides its semantic difficulties, can you think of a sane
> implementation for multibyte characters?

Maybe I'm missing something, but I don't see any semantic problem here,
and it seems straightforward to implement.  'char-ready?' should simply
read bytes until either a complete character is available, or no more
bytes are ready.  In either case, all the bytes should then be 'unget'
before returning.  What's the problem?

The only reason I haven't yet fixed this is because it will require some
refactoring in ports.c.  I guess the most straightforward approach is to
generalize 'get_codepoint', 'get_utf8_codepoint', and
'get_iconv_codepoint' to support a non-blocking mode of operation.

What do you think?

  Regards,
    Mark
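
For illustration only (not from the thread, and not Guile's ports.c):
the "read what is ready, then unget everything" idea can be sketched
against a toy in-memory port.  The struct and every function name here
are hypothetical, UTF-8 is assumed, and soft ports or other ports whose
reads have side effects are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy port: data[0..len) are the bytes readable without
   blocking.  This is not Guile's port structure. */
struct toy_port {
    const unsigned char *data;
    size_t len;   /* bytes currently available without blocking */
    size_t pos;   /* read cursor */
    bool eof;     /* true if no further bytes will ever arrive */
};

/* Non-blocking read: the next byte, or -1 if nothing is ready. */
int toy_read_byte_nonblocking(struct toy_port *p)
{
    return p->pos < p->len ? p->data[p->pos++] : -1;
}

/* Push back the most recently read byte. */
void toy_unget_byte(struct toy_port *p)
{
    assert(p->pos > 0);
    p->pos--;
}

/* UTF-8 sequence length for a lead byte; invalid leads count as 1,
   since a decoder can report the error immediately. */
size_t toy_utf8_length(unsigned char lead)
{
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 1;
}

/* Read ready bytes until a character is complete or nothing more is
   ready, then unget every byte that was read. */
bool toy_char_ready(struct toy_port *p)
{
    size_t taken = 0;
    bool ready;
    int b = toy_read_byte_nonblocking(p);

    if (b < 0) {
        ready = p->eof;            /* at EOF, read-char returns at once */
    } else {
        size_t need = toy_utf8_length((unsigned char) b);
        taken = 1;
        while (taken < need && toy_read_byte_nonblocking(p) >= 0)
            taken++;
        ready = taken >= need || p->eof;
    }

    while (taken-- > 0)            /* restore the port to its prior state */
        toy_unget_byte(p);
    return ready;
}
```

After toy_char_ready returns, the read cursor is back where it started,
so a caller working through the port abstraction sees no change.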

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-24 20:14   ` Mark H Weaver
@ 2013-02-24 22:15     ` Andy Wingo
  2013-02-25  0:06       ` Mark H Weaver
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Wingo @ 2013-02-24 22:15 UTC
  To: Mark H Weaver; +Cc: 10627

Hi :)

On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw@netris.org> writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> On Sat 28 Jan 2012 11:21, Mark H Weaver <mhw@netris.org> writes:
>>
>>> The R5RS specifies that if 'char-ready?' returns #t, then the next
>>> 'read-char' operation is guaranteed not to hang.  This is not currently
>>> the case for ports using a multibyte encoding.
>>>
>>> 'char-ready?' currently returns #t whenever at least one _byte_ is
>>> available.  This is not correct in general.  It should return #t only if
>>> there is a complete _character_ available.
>>
>> This procedure is omitted in the R6RS because it is not a good
>> interface.  Besides its semantic difficulties, can you think of a sane
>> implementation for multibyte characters?
>
> Maybe I'm missing something, but I don't see any semantic problem here,
> and it seems straightforward to implement.  'char-ready?' should simply
> read bytes until either a complete character is available, or no more
> bytes are ready.  In either case, all the bytes should then be 'unget'
> before returning.  What's the problem?

The problem is that char-ready? should not read anything.  If you want
to peek, use peek-char.  Note that if the stream is at EOF, char-ready?
should return #t.

Andy
-- 
http://wingolog.org/

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-24 22:15     ` Andy Wingo
@ 2013-02-25  0:06       ` Mark H Weaver
  2013-02-25  1:23         ` Daniel Hartwig
  2013-02-25  8:55         ` Andy Wingo
  0 siblings, 2 replies; 10+ messages in thread
From: Mark H Weaver @ 2013-02-25  0:06 UTC
  To: Andy Wingo; +Cc: 10627

Andy Wingo <wingo@pobox.com> writes:

> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw@netris.org> writes:
>
>> Maybe I'm missing something, but I don't see any semantic problem here,
>> and it seems straightforward to implement.  'char-ready?' should simply
>> read bytes until either a complete character is available, or no more
>> bytes are ready.  In either case, all the bytes should then be 'unget'
>> before returning.  What's the problem?
>
> The problem is that char-ready? should not read anything.

Okay, but if all bytes read are later *unread*, and the reads never
block, then why does it matter?  The reads in my proposed implementation
are just an internal implementation detail, and it seems to me that the
user cannot tell the difference, as long as he does not peek underneath
the Scheme port abstraction.

If you prefer, perhaps a nicer way to think about it is that
'char-ready?' looks ahead in the putback buffer and/or the read buffer
(refilling it in a non-blocking mode if needed), and returns #t iff a
complete character is present in the buffer(s), or EOF is reached.
However, it seems to me that implementing this in terms of read-byte and
unget-byte is simpler, because it avoids duplication of the logic
regarding putback buffers and refilling of buffers.  Maybe there's some
reason why this is a bad idea, but I haven't heard one.

I agree that 'char-ready?' is an antiquated interface, but it is
nonetheless part of the R5RS (and Guile since approximately forever),
and it is the only way to do a non-blocking read in portable R5RS.  It
seems to me that we ought to try to implement it as well as we can, no?

> If you want to peek, use peek-char.

Okay, but that's a totally different tool with a different use case.
It cannot be used to do non-blocking reads.

> Note that if the stream is at EOF, char-ready? should return #t.

Agreed.

More thoughts?

   Thanks,
     Mark

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-25  0:06       ` Mark H Weaver
@ 2013-02-25  1:23         ` Daniel Hartwig
  2013-02-25  8:55         ` Andy Wingo
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel Hartwig @ 2013-02-25  1:23 UTC
  To: Mark H Weaver; +Cc: 10627

On 25 February 2013 08:06, Mark H Weaver <mhw@netris.org> wrote:
> Andy Wingo <wingo@pobox.com> writes:
>
>> On Sun 24 Feb 2013 21:14, Mark H Weaver <mhw@netris.org> writes:
>>
>>> Maybe I'm missing something, but I don't see any semantic problem here,
>>> and it seems straightforward to implement.  'char-ready?' should simply
>>> read bytes until either a complete character is available, or no more
>>> bytes are ready.  In either case, all the bytes should then be 'unget'
>>> before returning.  What's the problem?
>>
>> The problem is that char-ready? should not read anything.
>
> Okay, but if all bytes read are later *unread*, and the reads never
> block, then why does it matter?

Taking care to still use sf_input_waiting for soft ports?  Reading
bytes from a soft port could have side effects (i.e. logging action or
similar).

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-25  0:06       ` Mark H Weaver
  2013-02-25  1:23         ` Daniel Hartwig
@ 2013-02-25  8:55         ` Andy Wingo
  2013-02-26 19:50           ` Mark H Weaver
  1 sibling, 1 reply; 10+ messages in thread
From: Andy Wingo @ 2013-02-25  8:55 UTC
  To: Mark H Weaver; +Cc: 10627

Hi Mark,

Are you proposing that `char-ready?' do a nonblocking read if
the buffer is empty?  That could work.

On Mon 25 Feb 2013 01:06, Mark H Weaver <mhw@netris.org> writes:

> However, it seems to me that implementing this in terms of read-byte and
> unget-byte is simpler, because it avoids duplication of the logic
> regarding putback buffers and refilling of buffers.

Could work, if the port is nonblocking to begin with.

> I agree that 'char-ready?' is an antiquated interface, but it is
> nonetheless part of the R5RS (and Guile since approximately forever),
> and it is the only way to do a non-blocking read in portable R5RS.  It
> seems to me that we ought to try to implement it as well as we can, no?

Do what you like to do :)  But if it were my time, I would simply
document that it checks for a byte and not a character and move on.

Andy
-- 
http://wingolog.org/

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-25  8:55         ` Andy Wingo
@ 2013-02-26 19:50           ` Mark H Weaver
  2013-02-26 19:59             ` Andy Wingo
  0 siblings, 1 reply; 10+ messages in thread
From: Mark H Weaver @ 2013-02-26 19:50 UTC
  To: Andy Wingo; +Cc: 10627

Andy Wingo <wingo@pobox.com> writes:
> Are you proposing that `char-ready?' do a nonblocking read if
> the buffer is empty?  That could work.

Yes.  I suspect that something along these lines is already implemented,
because I don't see how 'u8-ready?' could work properly without it.

> Do what you like to do :)  But if it were my time, I would simply
> document that it checks for a byte and not a character and move on.

I'd like to fix it properly.  Let's keep this bug open until it's done.

     Thanks,
       Mark

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-26 19:50           ` Mark H Weaver
@ 2013-02-26 19:59             ` Andy Wingo
  2016-06-20 19:23               ` Andy Wingo
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Wingo @ 2013-02-26 19:59 UTC
  To: Mark H Weaver; +Cc: 10627

On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw@netris.org> writes:

> Andy Wingo <wingo@pobox.com> writes:
>> Are you proposing that `char-ready?' do a nonblocking read if
>> the buffer is empty?  That could work.
>
> Yes.  I suspect that something along these lines is already implemented,
> because I don't see how 'u8-ready?' could work properly without it.

It does a poll with a timeout of 0.

Andy
-- 
http://wingolog.org/
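
For illustration only (not from the thread, and not Guile's code): a
poll with a timeout of 0 can be sketched with POSIX poll(2) as below.
The function names are hypothetical, and treating POLLHUP as "readable"
is an assumption made so that end-of-file counts as ready.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* True if fd polls as readable right now.  The timeout of 0 makes
   poll() return immediately instead of waiting. */
bool fd_polls_readable(int fd)
{
    assert(fd >= 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, 0);
    return n > 0 && (pfd.revents & (POLLIN | POLLHUP)) != 0;
}

/* Demo on a pipe: an empty pipe is not readable; after writing one
   byte it is.  The byte written is only the first half of a two-byte
   UTF-8 character, so "readable" here does not imply a complete
   character.  Returns 0 if both observations hold, -1 otherwise. */
int demo_pipe_readiness(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    bool before = fd_polls_readable(fds[0]);
    bool wrote = write(fds[1], "\xC3", 1) == 1;
    bool after = fd_polls_readable(fds[0]);

    close(fds[0]);
    close(fds[1]);
    return (!before && wrote && after) ? 0 : -1;
}
```

The demo shows why a byte-level readiness check is not enough for
multibyte encodings: the descriptor polls as readable while the
character is still incomplete.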

* bug#10627: char-ready? is broken for multibyte encodings
  2013-02-26 19:59             ` Andy Wingo
@ 2016-06-20 19:23               ` Andy Wingo
  0 siblings, 0 replies; 10+ messages in thread
From: Andy Wingo @ 2016-06-20 19:23 UTC
  To: Mark H Weaver; +Cc: 10627-done

On Tue 26 Feb 2013 20:59, Andy Wingo <wingo@pobox.com> writes:

> On Tue 26 Feb 2013 20:50, Mark H Weaver <mhw@netris.org> writes:
>
>> Andy Wingo <wingo@pobox.com> writes:
>>> Are you proposing that `char-ready?' do a nonblocking read if
>>> the buffer is empty?  That could work.
>>
>> Yes.  I suspect that something along these lines is already implemented,
>> because I don't see how 'u8-ready?' could work properly without it.
>
> It does a poll with a timeout of 0.

In the end I added this to the manual:

    Note that @code{char-ready?} only works reliably for terminals and
    sockets with one-byte encodings.  Under the hood it will return
    @code{#t} if the port has any input buffered, or if the file descriptor
    that backs the port polls as readable, indicating that Guile can fetch
    more bytes from the kernel.  However being able to fetch one byte
    doesn't mean that a full character is available; @xref{Encoding}.  Also,
    on many systems it's possible for a file descriptor to poll as readable,
    but then block when it comes time to read bytes.  Note also that on
    Linux kernels, all file ports backed by files always poll as readable.
    For non-file ports, this procedure always returns @code{#t}, except for
    soft ports, which have a @code{char-ready?} handler.  @xref{Soft Ports}.

    In short, this is a legacy procedure whose semantics are hard to
    provide.  However it is a useful check to see if any input is buffered.
    @xref{Non-Blocking I/O}.

We could try a non-blocking read but at that point we should just
provide a non-blocking read-char, and allow users to unread-char.  That
would be a different bug :)

Andy