Multiple encodings in one file

unofficial mirror of help-gnu-emacs@gnu.org

* Multiple encodings in one file
@ 2024-04-29  4:20 Lambert, Joshua D
  2024-04-29  7:22 ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Lambert, Joshua D @ 2024-04-29  4:20 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org

Question: If I open a file that uses one encoding in one part of the file and another encoding in another part of the file, and also uses multiple character sets, can I edit a small part of it in Emacs, using UCS (Unicode), without Emacs changing the rest of the file?

Background: I work in a library (the book kind) and use MARC files (https://www.loc.gov/marc/). MARC is a file transmission format in which one file may hold multiple "records." These records may contain text encoded in ISO/IEC 10646:202 (Unicode) or in MARC-8 (https://www.loc.gov/marc/specifications/speccharintro.html), an encoding used almost exclusively by libraries for MARC files. A library may have records of both types, and a single file may contain records in both encodings and in multiple character sets. I'm attempting to write a major mode to work with these files in Emacs. Emacs does not recognize the MARC-8 encoding.

Thanks for your help,

Joshua Lambert
Librarian
Missouri State University


* Re: Multiple encodings in one file
  2024-04-29  4:20 Multiple encodings in one file Lambert, Joshua D
@ 2024-04-29  7:22 ` Eli Zaretskii
  2024-04-29 18:45   ` Lambert, Joshua D
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2024-04-29  7:22 UTC (permalink / raw)
  To: help-gnu-emacs

> From: "Lambert, Joshua D" <JLambert@MissouriState.edu>
> Date: Mon, 29 Apr 2024 04:20:54 +0000
> 
> Question: If I open a file that uses one encoding in one part of the file and another encoding in another part of the file, and also uses multiple character sets, can I edit a small part of it in Emacs, using UCS (Unicode), without Emacs changing the rest of the file?

No.  The built-in machinery for encoding and decoding a file's contents
when visiting or saving files assumes the same encoding for the entire
file.  To support files whose different parts are encoded differently,
you will need to decode each part "by hand": visit the file literally,
then loop over the parts and decode each one using
decode-coding-region.  When saving, do the opposite.
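
A minimal sketch of the per-part decoding described above, in Emacs
Lisp. It works on a unibyte string rather than on a buffer, so it uses
decode-coding-string, the string counterpart of decode-coding-region.
The name my-decode-parts and the (START END CODING) part list are
hypothetical, and iso-latin-1 stands in for MARC-8, which Emacs does
not provide.

```elisp
;; Hypothetical helper: decode each part of a byte string with its own
;; coding system.  PARTS is a list of (START END CODING) entries with
;; zero-based offsets into BYTES, which must be a unibyte string.
(defun my-decode-parts (bytes parts)
  "Return BYTES decoded part by part, as one multibyte string."
  (mapconcat (lambda (part)
               (decode-coding-string
                (substring bytes (nth 0 part) (nth 1 part))
                (nth 2 part)))
             parts
             ""))

;; Usage: two bytes of UTF-8 followed by one byte of Latin-1
;; (Latin-1 is only a stand-in for MARC-8 here).
(my-decode-parts (unibyte-string #xC3 #xA9 #xE9)
                 '((0 2 utf-8) (2 3 iso-latin-1)))
```

Both parts in the usage example encode the same character (e with an
acute accent), so the result is that character twice. A real mode
would need its own MARC-8 decoder in place of the stand-in coding
system.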


* Re: Multiple encodings in one file
  2024-04-29  7:22 ` Eli Zaretskii
@ 2024-04-29 18:45   ` Lambert, Joshua D
  2024-04-29 19:14     ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Lambert, Joshua D @ 2024-04-29 18:45 UTC (permalink / raw)
  To: Eli Zaretskii, help-gnu-emacs@gnu.org

Thank you for the time. What you said gives me some hope but I have a follow-up question. If I visit a file literally, make a change, and save it, the file seems to be different only where I changed it. Is that true?

If so, does the following seem reasonable?

  1. Find a file literally.
  2. The user will accept that some characters will show octal codes or something similar.
  3. Edit the records where understandable and possible.
  4. Save file.

Furthermore, if I want to try to convert the MARC8 encoded records to UTF8 (mappings are available), is it reasonable/possible for me to do that in the buffer after using find-file-literally or would it be better to do that using hexl-mode, or another method?

Thanks,
Joshua


* Re: Multiple encodings in one file
  2024-04-29 18:45   ` Lambert, Joshua D
@ 2024-04-29 19:14     ` Eli Zaretskii
  2024-04-29 21:07       ` Lambert, Joshua D
  2024-04-30  2:02       ` Stefan Monnier via Users list for the GNU Emacs text editor
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2024-04-29 19:14 UTC (permalink / raw)
  To: help-gnu-emacs

> From: "Lambert, Joshua D" <JLambert@MissouriState.edu>
> Date: Mon, 29 Apr 2024 18:45:52 +0000
> 
> Thank you for the time. What you said gives me some hope but I have a follow-up question. If I visit a file
> literally, make a change, and save it, the file seems to be different only where I changed it. Is that true?

If you save it while binding coding-system-for-write to no-conversion,
yes.  IOW, you need to disable encoding while saving.
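
A minimal sketch of saving with encoding disabled, assuming the buffer
was visited literally. The variable to bind is coding-system-for-write;
the command name my-save-buffer-raw is hypothetical.

```elisp
;; Hypothetical command: save the current buffer byte for byte.
;; Binding `coding-system-for-write' to `no-conversion' for the
;; duration of the save disables encoding and end-of-line conversion.
(defun my-save-buffer-raw ()
  "Save the current buffer without any encoding conversion."
  (interactive)
  (let ((coding-system-for-write 'no-conversion))
    (save-buffer)))
```

The binding is dynamic, so it only affects the save performed inside
the let form and leaves the buffer's own coding system untouched.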

> If so, then does the following seem reasonable.
> 
> 1 Find a file literally.
> 2 The user will accept that some characters will show octal codes or something similar.
> 3 Edit the records where understandable and possible. 
> 4 Save file.

That can be done, of course, but note that in a buffer visited
literally, UTF-8 encoded text is not legible unless the characters are
all ASCII.

> Furthermore, if I want to try to convert the MARC8 encoded records to UTF8 (mappings are available), is it
> reasonable/possible for me to do that in the buffer after using find-file-literally or would it be better to do that
> using hexl-mode, or another method?

You can convert MARC-8 to Unicode (not UTF-8, since Emacs uses an
internal representation that is not exactly UTF-8), yes.  But then you
will have to convert back to MARC-8 when you save the file, at least
in the parts that the user didn't edit.
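
One way to limit the conversion back on save is to keep each record's
original bytes and re-encode only the records that were edited. This
is a different technique from re-encoding everything, sketched here
under assumptions: the plist keys :bytes, :coding and :text and both
function names are hypothetical, and iso-latin-1 stands in for MARC-8.

```elisp
;; Hypothetical record representation: a plist holding the original
;; bytes (:bytes), the record's coding system (:coding), and, only if
;; the user edited the record, the new decoded text (:text).
(defun my-record-bytes (record)
  "Return the bytes to write for RECORD."
  (let ((text (plist-get record :text)))
    (if text
        ;; Edited record: encode the new text.
        (encode-coding-string text (plist-get record :coding))
      ;; Untouched record: write back the original bytes unchanged.
      (plist-get record :bytes))))

(defun my-serialize-records (records)
  "Return one unibyte string holding all RECORDS in order."
  (mapconcat #'my-record-bytes records ""))

;; Usage: the first record is untouched, the second one was edited.
(my-serialize-records
 (list (list :bytes (unibyte-string #xE9) :coding 'iso-latin-1)
       (list :bytes (unibyte-string #xC3 #xA9) :coding 'utf-8
             :text (string #xE8))))
```

Untouched records never pass through a decoder and encoder, so they
cannot be altered by an imperfect MARC-8 mapping.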




* Re: Multiple encodings in one file
  2024-04-29 19:14     ` Eli Zaretskii
@ 2024-04-29 21:07       ` Lambert, Joshua D
  2024-04-30  2:02       ` Stefan Monnier via Users list for the GNU Emacs text editor
  1 sibling, 0 replies; 7+ messages in thread
From: Lambert, Joshua D @ 2024-04-29 21:07 UTC (permalink / raw)
  To: Eli Zaretskii, help-gnu-emacs@gnu.org

Thank you very much. This is helpful.

Joshua

* Re: Multiple encodings in one file
  2024-04-29 19:14     ` Eli Zaretskii
  2024-04-29 21:07       ` Lambert, Joshua D
@ 2024-04-30  2:02       ` Stefan Monnier via Users list for the GNU Emacs text editor
  2024-04-30 17:17         ` Lambert, Joshua D
  1 sibling, 1 reply; 7+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2024-04-30  2:02 UTC (permalink / raw)
  To: help-gnu-emacs

>> Thank you for the time. What you said gives me some hope but I have
>> a follow-up question. If I visit a file literally, make a change, and
>> save it, the file seems to be different only where I changed it. Is
>> that true?
>
> If you save it while binding coding-system-for-write to no-conversion,
> yes.  IOW, you need to disable encoding while saving.

Also, if you open the file as if it were all utf-8, then the utf-8
parts of the file should look just fine (and the MARC-8 parts may look
screwy), and if you edit it and save the result it *should* result in
a valid file where only the part you changed was modified.
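
A small sketch for checking that claim on sample data: decode bytes as
utf-8, encode them again, and compare with the input. The function
name my-utf-8-round-trip-p is hypothetical. It uses utf-8-unix on the
assumption that end-of-line conversion should stay out of the picture.

```elisp
;; Hypothetical check: do BYTES survive a decode/encode cycle as UTF-8?
;; `utf-8-unix' is used so that no end-of-line conversion takes place.
(defun my-utf-8-round-trip-p (bytes)
  "Return non-nil if unibyte string BYTES round-trips through utf-8."
  (equal (encode-coding-string
          (decode-coding-string bytes 'utf-8-unix)
          'utf-8-unix)
         bytes))

;; Usage: valid UTF-8 (C3 A9) followed by a stray byte (E2) and ASCII.
(my-utf-8-round-trip-p (unibyte-string #xC3 #xA9 #xE2 ?e))
```

Bytes that are not valid utf-8 are kept as raw bytes when decoding and
written back as the same bytes when encoding, which is what lets the
unedited MARC-8 parts survive.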

>> If so, then does the following seem reasonable.
>> 
>> 1 Find a file literally.
>> 2 The user will accept that some characters will show octal codes or
>>   something similar.
>> 3 Edit the records where understandable and possible. 
>> 4 Save file.

As a quick&dirty solution, that should work as long as you're doing
limited changes and only in parts that are mostly ASCII.

If you're designing a major mode, maybe a better approach would look
like: read the file literally (i.e. as bytes) and treat it as a kind of
directory or archive (think tar-mode, dired, archive-mode, Rmail) so
only show a summary of the contents, then let the users "open" a record
which is then extracted (and decoded) into another buffer.
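
A rough sketch of the summary step for such an archive-style mode. It
relies on two details of the MARC 21 record structure that are not
stated in this thread and are assumptions here: each record ends with
the byte 0x1D, and position 9 of the record leader is "a" for Unicode
records and blank for MARC-8 records. The function name and the
symbols utf-8 and marc-8 in its result are hypothetical.

```elisp
;; Hypothetical helper: list the records found in BYTES, the literal
;; contents of a MARC file as a unibyte string.
;; Assumptions (not from the thread): records end with byte #x1D, and
;; leader position 9 is ?a for Unicode records, blank for MARC-8.
(defun my-marc-record-summaries (bytes)
  "Return a list of (START END ENCODING) entries, one per record."
  (let ((start 0)
        (summaries '()))
    (while (string-match "\x1d" bytes start)
      (let* ((end (match-end 0))
             ;; Only look at the leader if the record is long enough.
             (flag (and (> (- end start) 9)
                        (aref bytes (+ start 9)))))
        (push (list start end (if (eq flag ?a) 'utf-8 'marc-8))
              summaries)
        (setq start end)))
    (nreverse summaries)))
```

A summary buffer could show one line per entry, and "opening" a record
would extract the bytes between START and END and decode them
according to ENCODING in a separate buffer.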

        Stefan


* Re: Multiple encodings in one file
  2024-04-30  2:02       ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2024-04-30 17:17         ` Lambert, Joshua D
  0 siblings, 0 replies; 7+ messages in thread
From: Lambert, Joshua D @ 2024-04-30 17:17 UTC (permalink / raw)
  To: help-gnu-emacs@gnu.org, Stefan Monnier

Thank you. Your suggestion of editing one record at a time is how most MARC editors work and that was my first thought as well. It will require record by record redisplay of some sort to make it human readable to begin with. That said, I'm thinking of multiple interfaces depending on the user's goal.

MARC is a file transmission format from 1968 (which in part explains the odd encoding) and can include any number of records. It has no line breaks or carriage returns but Emacs' longlines-break-chars seems to work well enough that I can edit files with 10,000 typical sized records, including some fontification.

Thanks to you and all who contribute to Emacs.
Joshua
