I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a
pattern like this:

  (use-modules (ice-9 rdelim))

  (with-input-from-file "rdelim-utf16.txt"
    (lambda ()
      (set-port-encoding! (current-input-port) "UTF16LE")
      (let ((first-line (read-line))
            (second-line (read-line)))
        ...)))

A sample UTF-16 input file is attached.  This hits a couple of
problems.

1. It seems that most (all?) UTF-16 files begin with a byte order
   mark (BOM), \ufeff, which readers are conventionally supposed to
   discard - but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' char to
   determine the end of the first line, which means they're assuming
   that

   - '\n' never occurs as part of some other multibyte sequence

   - when '\n' occurs as part of the newline sequence, it occupies a
     single byte.

   This causes the second line to be read wrongly, because a newline
   in UTF-16LE is actually 2 bytes - \n \0 - and the first (read-line)
   leaves the \0 byte unconsumed.

I think the fixes for these are roughly as follows.

For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible
  BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and
  consume the BOM.  (The first sketch in the P.S. below shows roughly
  what I mean.)

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.

For 2:

- In scm_do_read_line(), keep the current (fast) code for the case
  where the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.  (Second sketch in the P.S.)

Does that sound about right?  Are there any possible optimizations?

For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings
that are allowed to mean the same encoding?  For UTF-16, for example,
it seems to me that many of the following encoding strings will work:

  utf-16
  utf-16-le
  utf16le
  utf16-le
  utf-16le
  utf16

plus the same with different case - and we don't want a map entry for
each one.  I suppose one pseudo-canonical method would be to upcase
and remove all punctuation.  Then we're only left with "UTF16" and
"UTF16LE", which makes sense.  (Third sketch in the P.S.)

Regards,
Neil
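
P.S. Some rough C sketches of the above, in case they make the ideas
more concrete.  All of them are untested, and none of this is
existing Guile code - the struct, field and function names are made
up for illustration.  First, the static BOM map and check from fix 1,
assuming encoding names have already been canonicalized as described
at the end (note that endianness-unspecified "UTF16" gets two
entries, since it can begin with either BOM):

  #include <stddef.h>
  #include <string.h>

  /* Hypothetical static map from canonicalized encoding names to the
     BOMs they may begin with.  */
  struct bom_entry
  {
    const char *encoding;        /* canonicalized name, e.g. "UTF16LE" */
    unsigned char bom[4];        /* the BOM's bytes */
    size_t bom_len;              /* how many of them there are */
  };

  static const struct bom_entry bom_table[] = {
    { "UTF8",    { 0xEF, 0xBB, 0xBF },       3 },
    { "UTF16LE", { 0xFF, 0xFE },             2 },
    { "UTF16BE", { 0xFE, 0xFF },             2 },
    { "UTF16",   { 0xFF, 0xFE },             2 },
    { "UTF16",   { 0xFE, 0xFF },             2 },
    { "UTF32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
    { "UTF32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
  };

  /* Given the first bytes read from a port whose at-start-of-file
     flag is still set, return the number of leading bytes to discard
     as a BOM (0 if there isn't one).  */
  static size_t
  bom_length (const char *encoding, const unsigned char *buf, size_t len)
  {
    size_t i;
    for (i = 0; i < sizeof (bom_table) / sizeof (bom_table[0]); i++)
      if (strcmp (encoding, bom_table[i].encoding) == 0
          && len >= bom_table[i].bom_len
          && memcmp (buf, bom_table[i].bom, bom_table[i].bom_len) == 0)
        return bom_table[i].bom_len;
    return 0;
  }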
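
Second, the conversion-based newline search from fix 2.
u32_conv_from_encoding and u32_chr are real libunistring functions,
but the function around them is made up.  I've used the
length-bounded u32_chr rather than u32_strchr, since the converted
buffer isn't NUL-terminated; and a real version would also have to
cope with a chunk that ends in the middle of a multibyte sequence,
which iconveh_error reports as a plain failure:

  #include <stdint.h>
  #include <stdlib.h>
  #include <uniconv.h>           /* u32_conv_from_encoding */
  #include <unistr.h>            /* u32_chr */

  /* Convert a chunk of raw bytes from the port's encoding to UCS-4
     and search that for the newline.  On successful conversion, the
     malloc'd UCS-4 buffer (which the caller must free) is returned
     through *out_u32/*out_len; the result points at the newline, or
     is NULL if the chunk doesn't contain one yet.  */
  static uint32_t *
  find_newline (const char *encoding,
                const char *bytes, size_t nbytes,
                uint32_t **out_u32, size_t *out_len)
  {
    size_t len = 0;
    uint32_t *u32;

    *out_u32 = NULL;
    *out_len = 0;
    u32 = u32_conv_from_encoding (encoding, iconveh_error,
                                  bytes, nbytes,
                                  NULL,   /* no offset mapping needed */
                                  NULL,   /* let it malloc the result */
                                  &len);
    if (u32 == NULL)
      return NULL;               /* invalid or incomplete input */

    *out_u32 = u32;
    *out_len = len;
    return u32_chr (u32, len, '\n');
  }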
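
Third, the pseudo-canonicalization.  This is just the
upcase-and-strip idea above - I've kept alphanumerics and dropped
everything else, which comes to the same thing for encoding names:

  #include <ctype.h>
  #include <stddef.h>

  /* Map "utf-16-le", "UTF16le", "utf16-le", ... all onto "UTF16LE"
     by upcasing and dropping every non-alphanumeric character.  */
  static void
  canonicalize_encoding (const char *name, char *out, size_t outsize)
  {
    size_t i, j = 0;

    for (i = 0; name[i] != '\0' && j + 1 < outsize; i++)
      if (isalnum ((unsigned char) name[i]))
        out[j++] = toupper ((unsigned char) name[i]);
    out[j] = '\0';
  }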