Hi Ludovic, Thanks for the quick review! An improved patch is attached below. ludo@gnu.org (Ludovic Courtès) writes: > Woow, well thought out. The semantics seem good. (It’s interesting to > see how BOMs complicate things, but that’s life, I guess.) > > The patch looks good to me. The test suite is nice. It doesn’t seem to > cover all the corner cases listed above, but that can be added later on > perhaps? Yes, the tests are still a work-in-progess, but I've added quite a few more since you last looked. > Perhaps the text above could be added to the manual, In the attached patch, I've added a new node to the "Input and Output" section. > Mark H Weaver skribis: >> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h >> index 73a788f..cd1746b 100644 >> --- a/libguile/ports-internal.h >> +++ b/libguile/ports-internal.h >> @@ -48,14 +48,19 @@ struct scm_port_internal >> { >> scm_t_port_encoding_mode encoding_mode; >> scm_t_iconv_descriptors *iconv_descriptors; >> + int at_stream_start_for_bom_read; >> + int at_stream_start_for_bom_write; > > Add “:1”? Good idea. [...more good suggestions that I've incorporated in the new patch...] >> +{ >> + scm_t_port *pt = SCM_PTAB_ENTRY (port); >> + int result; >> + int i = 0; >> + >> + while (i < len && scm_peek_byte_or_eof (port) == bytes[i]) >> + { >> + pt->read_pos++; >> + i++; >> + } >> + >> + result = (i == len); >> + >> + while (i > 0) >> + scm_unget_byte (bytes[--i], port); >> + >> + return result; >> +} > > Should it be scm_get_byte_or_eof given that scm_unget_byte is used later? Yes. Bytes are only consumed if are equal to bytes[i], so an EOF will never be consumed or passed to scm_unget_byte. > What if pt->read_buf_size == 1? What if there’s data in saved_read_buf? All of those details are handled by 'scm_peek_byte_or_eof', which is guaranteed to leave 'pt->read_pos' pointing at the byte that's returned (if not EOF). Therefore, it's always safe to increment that pointer by one (but no more than one) after calling 'scm_peek_byte_or_eof' if it returned non-EOF. Look at the code for 'scm_peek_byte_or_eof' and this will be clear. Also note that you did the same thing in 'scm_utf8_codepoint' :) [...more good suggestions, incorporated...] >> + /* If the specified encoding is UTF-16 or UTF-32, then make >> + that more precise by deciding what endianness to use. */ >> + if (strcmp (pt->encoding, "UTF-16") == 0) >> + precise_encoding = decide_utf16_encoding (port, mode); >> + else if (strcmp (pt->encoding, "UTF-32") == 0) >> + precise_encoding = decide_utf32_encoding (port, mode); > > Shouldn’t it be strcasecmp? (Actually there are other uses of strcmp > already, but I think it’s a mistake.) Ouch, good catch! Indeed, we already had some bugs because of this. I pushed a fix for the existing bugs to stable-2.0, and updated this patch accordingly. Here's the new patch. Any more suggestions? Thanks! Mark