Hi Ludovic,

Thanks for the quick review!  An improved patch is attached below.

ludo@gnu.org (Ludovic Courtès) writes:
> Woow, well thought out.  The semantics seem good.  (It’s interesting to
> see how BOMs complicate things, but that’s life, I guess.)
>
> The patch looks good to me.  The test suite is nice.  It doesn’t seem to
> cover all the corner cases listed above, but that can be added later on
> perhaps?

Yes, the tests are still a work-in-progess, but I've added quite a few
more since you last looked.

> Perhaps the text above could be added to the manual,

In the attached patch, I've added a new node to the "Input and Output"
section.

> Mark H Weaver <mhw@netris.org> skribis:
>> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
>> index 73a788f..cd1746b 100644
>> --- a/libguile/ports-internal.h
>> +++ b/libguile/ports-internal.h
>> @@ -48,14 +48,19 @@ struct scm_port_internal
>>  {
>>    scm_t_port_encoding_mode encoding_mode;
>>    scm_t_iconv_descriptors *iconv_descriptors;
>> +  int at_stream_start_for_bom_read;
>> +  int at_stream_start_for_bom_write;
>
> Add “:1”?

Good idea.

[...more good suggestions that I've incorporated in the new patch...]

>> +{
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +  int result;
>> +  int i = 0;
>> +
>> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
>> +    {
>> +      pt->read_pos++;
>> +      i++;
>> +    }
>> +
>> +  result = (i == len);
>> +
>> +  while (i > 0)
>> +    scm_unget_byte (bytes[--i], port);
>> +
>> +  return result;
>> +}
>
> Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

Yes.  Bytes are only consumed if are equal to bytes[i], so an EOF will
never be consumed or passed to scm_unget_byte.

> What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

All of those details are handled by 'scm_peek_byte_or_eof', which is
guaranteed to leave 'pt->read_pos' pointing at the byte that's returned
(if not EOF).  Therefore, it's always safe to increment that pointer by
one (but no more than one) after calling 'scm_peek_byte_or_eof' if it
returned non-EOF.

Look at the code for 'scm_peek_byte_or_eof' and this will be clear.
Also note that you did the same thing in 'scm_utf8_codepoint' :)

[...more good suggestions, incorporated...]

>> +      /* If the specified encoding is UTF-16 or UTF-32, then make
>> +         that more precise by deciding what endianness to use.  */
>> +      if (strcmp (pt->encoding, "UTF-16") == 0)
>> +        precise_encoding = decide_utf16_encoding (port, mode);
>> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
>> +        precise_encoding = decide_utf32_encoding (port, mode);
>
> Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
> already, but I think it’s a mistake.)

Ouch, good catch!  Indeed, we already had some bugs because of this.  I
pushed a fix for the existing bugs to stable-2.0, and updated this patch
accordingly.

Here's the new patch.  Any more suggestions?

    Thanks!
      Mark