unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#15984: 24.3; Problem with combining characters in attachment filename
@ 2013-11-28  8:08 Niels Möller
  2013-11-28 20:25 ` Eli Zaretskii
       [not found] ` <87eh574qmm.fsf@gnu.org>
  0 siblings, 2 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-28  8:08 UTC
  To: 15984

I'm reading email with Gnus. I received an email with an attachment
containing the headers

  Content-Type: application/pdf;
   name="Brev =?UTF-8?B?YWt0aWVhzIhnYXIgMTMxMTI3LnBkZg==?="
  Content-Transfer-Encoding: base64
  Content-Disposition: attachment;
   filename*0*=UTF-8''%42%72%65%76%20%61%6B%74%69%65%61%CC%88%67%61%72%20%31;
   filename*1*=%33%31%31%32%37%2E%70%64%66

Apparently sent by a Mac user,

  User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.1.1

The attachment was displayed in the *Article* buffer as

  [2. application/pdf; Brev aktiea?gar 131127.pdf]...

I was running emacs-24.3 in a tty, in a latin-1 locale, on a sparc
Solaris system. (In a latin-1 tty, emacs ought to display "ä" instead of
"a?", but that's a less severe and possibly unrelated problem).

  SunOS bacon 5.10 Generic_147147-26 sun4u sparc SUNW,Sun-Fire-15000

When I tried to save the attachment by pressing "o" on that button
(gnus-mime-save-part), emacs immediately crashed with a segmentation
violation signal. Since emacs very rarely crashes, I was a bit
surprised. I just restarted emacs and Gnus and tried again, and it
crashed again. So at least for me, the problem is reproducible.

And a crash triggered by untrusted data in a received email is always
scary. After fixing the bug, exploit possibilities ought to be analyzed.

The gdb backtrace, based on the generated core file, looks like this:

(gdb) bt
#0  0xfec4ebd4 in _lwp_kill () from /lib/libc.so.1
#1  0xfebe7bb8 in raise () from /lib/libc.so.1
#2  0x000e7f78 in terminate_due_to_signal ()
#3  0x00103d04 in handle_fatal_signal ()
#4  0x001037d0 in deliver_thread_signal ()
#5  0xfec4b014 in __sighndlr () from /lib/libc.so.1
#6  0xfec3f6c4 in call_user_handler () from /lib/libc.so.1
#7  <signal handler called>
#8  0x000b5748 in char_table_ref ()
#9  0x001ad54c in composition_compute_stop_pos ()
#10 0x001266ec in scan_for_column ()
#11 0x00127328 in current_column ()
#12 0x00114cec in read_minibuf ()
#13 0x00115688 in Fread_from_minibuffer ()
#14 0x0015c538 in Ffuncall ()
#15 0x00190de0 in exec_byte_code ()
#16 0x0015c368 in Ffuncall ()
#17 0x001158a0 in Fcompleting_read ()
#18 0x0015c4e4 in Ffuncall ()
#19 0x00190de0 in exec_byte_code ()
#20 0x0015c368 in Ffuncall ()
#21 0x00190de0 in exec_byte_code ()
#22 0x0015c368 in Ffuncall ()
#23 0x00190de0 in exec_byte_code ()
#24 0x0015bf18 in funcall_lambda ()
#25 0x0015c368 in Ffuncall ()
#26 0x00190de0 in exec_byte_code ()
#27 0x0015bf18 in funcall_lambda ()
#28 0x0015c368 in Ffuncall ()
#29 0x0015cbf0 in apply1 ()
#30 0x001573b4 in Fcall_interactively ()
#31 0x0015c574 in Ffuncall ()
#32 0x0015c77c in call3 ()
#33 0x000f0ac0 in Fcommand_execute ()
#34 0x000f829c in command_loop_1 ()
#35 0x001591dc in internal_condition_case ()
#36 0x000ea2a0 in command_loop_2 ()
#37 0x001590c0 in internal_catch ()
#38 0x000ea11c in recursive_edit_1 ()
#39 0x000ea264 in Frecursive_edit ()
#40 0x000e9b28 in main ()

The emacs binary I use appears to have been stripped, so bt full gives
no additional information, and xbacktrace fails with

  No symbol "CHECK_LISP_OBJECT_TYPE" in current context.

If I decode the base-64 part of the Content-type "name" value, I get

  $ od -tx1c fname.txt 
  0000000  61  6b  74  69  65  61  cc  88  67  61  72  20  31  33  31  31
            a   k   t   i   e   a 314 210   g   a   r       1   3   1   1
  0000020  32  37  2e  70  64  66
            2   7   .   p   d   f
  0000026

So it appears to contain the character "ä" (a with two dots), coded as
"a" followed by a unicode combining character. All in utf-8. If I run
cat fname.txt in xterm with a utf-8 locale, it displays the string as
"aktieägar 131127.pdf", which seems correct.

I don't understand the meaning of the Content-disposition: header, but I
guess it's possible that Content-type: ...; name= *is* processed
correctly, and it's the code processing Content-disposition which
crashes. But looking at the backtrace, it looks like the problem is
related to handling of combining characters.
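
For comparison, here is a minimal Python sketch (not Gnus code, and not
how rfc2047.el or rfc2231.el work internally) that decodes both header
values quoted above. It assumes the RFC 2047 "B" encoded word is plain
base64 over UTF-8, and that the RFC 2231 continuation segments are
concatenated and then percent-decoded. Under those assumptions both
headers carry the same string, with "a" followed by a combining mark.

```python
import base64
import unicodedata
from urllib.parse import unquote

# Values copied from the Content-Type and Content-Disposition headers.
encoded_word = "YWt0aWVhzIhnYXIgMTMxMTI3LnBkZg=="
segments = [
    "%42%72%65%76%20%61%6B%74%69%65%61%CC%88%67%61%72%20%31",
    "%33%31%31%32%37%2E%70%64%66",
]

# RFC 2047 "B" encoding: base64 over the UTF-8 bytes of the text.
name = "Brev " + base64.b64decode(encoded_word).decode("utf-8")

# RFC 2231 continuations: join filename*0* and filename*1*, then
# percent-decode the result as UTF-8.
filename = unquote("".join(segments), encoding="utf-8")

# Show the code points around the problematic character.
for ch in filename[9:13]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

If both decodings agree, the Content-Disposition parameter is at least
not malformed, which is consistent with the backtrace pointing at
composition handling rather than at header parsing.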

Below is the info generated by report-emacs-bug, except that I deleted
recent input and recent messages, since the problem was in the emacs
process which crashed, not in this one where I'm composing this message.
Environment should otherwise be identical (same emacs, same Gnus, same
machine, same tty).

Regards,
/Niels

In GNU Emacs 24.3.1 (sparc-sun-solaris2.10, X toolkit, Xaw scroll bars)
 of 2013-03-15 on stalhein
Configured using:
 `configure '--prefix=/pkg/emacs/sparc-sol10/24.3' '--with-gif=no'
 '--with-jpeg=no' '--with-tiff=no' '--with-png=no' '--with-dbus=no'
 '--with-gsettings=no' '--with-gnutls=no' 'CC=gcc' 'CFLAGS=-O2 -mcpu=v9'
 'LDFLAGS=-L/usr/local/lib -R/usr/local/lib'
 'CPPFLAGS=-I/usr/local/include''

Important settings:
  value of $LC_COLLATE: C
  value of $LC_CTYPE: sv_SE.ISO8859-1
  value of $LC_MESSAGES: C
  value of $LC_MONETARY: en_US.ISO8859-1
  value of $LC_NUMERIC: en_US.ISO8859-1
  value of $LC_TIME: en_US.ISO8859-1
  locale-coding-system: iso-latin-1-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  type-break-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  line-number-mode: t
  transient-mark-mode: t

Recent input: [omitted]

Recent messages: [omitted]

Features:
(shadow emacsbug help-mode sort ansi-color gnus-cite flow-fill
mm-archive mail-extr gnus-async gnus-bcklg qp parse-time gnus-ml
disp-table misearch multi-isearch gnus-topic byte-opt bytecomp
byte-compile cconv nndraft nnmh nnml gnus-agent gnus-srvr gnus-score
score-mode nnvirtual gnus-msg gnus-art mm-uu mml2015 epg-config mm-view
mml-smime smime password-cache dig mailcap nntp gnus-cache gnus-sum nnoo
gnus-group gnus-undo nnmail mail-source gnus-start gnus-spec gnus-int
gnus-range message sendmail format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 rfc2047 rfc2045 ietf-drums
mailabbrev gmm-utils mailheader gnus-win gnus gnus-ems nnheader
gnus-util mail-utils mm-util mail-prsvr wid-edit bbdb-autoloads package
cl-macs gv bookmark pp recurse cl time-date type-break uniquify advice
help-fns cl-lib advice-preload info easymenu tooltip ediff-hook vc-hooks
lisp-float-type mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt
fringe tabulated-list newcomment lisp-mode register page menu-bar
rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax
facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak
czech european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face macroexp files text-properties overlay sha1 md5 base64 format
env code-pages mule custom widget hashtable-print-readable backquote
make-network-process dynamic-setting x-toolkit x multi-tty emacs)

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.







* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-28  8:08 bug#15984: 24.3; Problem with combining characters in attachment filename Niels Möller
@ 2013-11-28 20:25 ` Eli Zaretskii
  2013-11-28 22:17   ` Niels Möller
  2013-11-29 13:11   ` Kenichi Handa
       [not found] ` <87eh574qmm.fsf@gnu.org>
  1 sibling, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-28 20:25 UTC
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Date: Thu, 28 Nov 2013 09:08:54 +0100
> 
> I'm reading email with Gnus. I received an email with an attachment
> containing the headers
> 
>   Content-Type: application/pdf;
>    name="Brev =?UTF-8?B?YWt0aWVhzIhnYXIgMTMxMTI3LnBkZg==?="
>   Content-Transfer-Encoding: base64
>   Content-Disposition: attachment;
>    filename*0*=UTF-8''%42%72%65%76%20%61%6B%74%69%65%61%CC%88%67%61%72%20%31;
>    filename*1*=%33%31%31%32%37%2E%70%64%66
> 
> Apparently sent by a Mac user,
> 
>   User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.1.1
> 
> The attachement was displayed in the *Article* buffer as
> 
>   [2. application/pdf; Brev aktiea?gar 131127.pdf]...
> 
> I was running emacs-24.3 in a tty, in a latin-1 locale, on a sparc
> Solaris system. (In a latin-1 tty, emacs ought to display "ä" instead of
> "a?", but that's a less severe and possibly unrelated problem).

If ä was supposed to be produced by character compositions, then Emacs
cannot do that on a TTY, because compositions require drawing one
glyph over the other (with certain offsets).

If you expected Emacs to perform normalization in this case, then I
don't think we do this automatically (or at all).

> When I tried to save the attachment by pressing "o" on that button
> (gnus-mime-save-part), emacs immediately crashed with a segmentation
> violation signal. Since emacs very rarely crashes, I was a bit
> surprised. I just restarted emacs and Gnus and tried again, and it
> crashed again. So at least for me, the problem is reproducible.

Can you send that message as a binary attachment?

> And a crash triggered by untrusted data in a received email is always
> scary. After fixing the bug, exploit possibilities ought to be analyzed.

I suggest trying a recent development trunk; several similar crashes
were fixed a few months ago.  If that doesn't help, please reproduce
the problem in a non-optimized non-stripped build, and show the
variables from char_table_ref that are involved in the crash.  (I'm
guessing char_table_ref got a bogus character code.)






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-28 20:25 ` Eli Zaretskii
@ 2013-11-28 22:17   ` Niels Möller
  2013-11-28 22:46     ` Niels Möller
  2013-11-29  7:16     ` Eli Zaretskii
  2013-11-29 13:11   ` Kenichi Handa
  1 sibling, 2 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-28 22:17 UTC
  To: Eli Zaretskii; +Cc: 15984

Eli Zaretskii <eliz@gnu.org> writes:

> If you expected Emacs to perform normalization in this case, then I
> don't think we do this automatically (or at all).

I think for display, normalizing is definitely the right thing to do
(the unicode spec, as I understand it, requires that a "compliant"
implementation treats different ways to code "ä" equivalently).
But I understand if emacs currently doesn't do that.
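
A small Python sketch of what normalizing for display would buy on a
Latin-1 terminal. It says nothing about how Emacs encodes text for the
tty; the helper name fits_latin1 is made up for this illustration. The
decomposed form contains a code point with no Latin-1 equivalent, while
its NFC form does not.

```python
import unicodedata


def fits_latin1(text):
    """Return True if every character of text is encodable in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


decomposed = "aktiea\u0308gar"  # "a" + COMBINING DIAERESIS, as received
precomposed = unicodedata.normalize("NFC", decomposed)  # uses U+00E4

print(fits_latin1(decomposed))   # False: U+0308 has no Latin-1 code
print(fits_latin1(precomposed))  # True: U+00E4 is Latin-1 0xE4
```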

(Digression: I think a text processor supporting unicode really ought to
represent "characters" as interned strings of unicode (or utf-8) code
points. These characters can have relations such as "normalized to", and
glyphs should usually be associated only with the normalized form. One
could also have configurable rules for character boundaries, as is
described in the unicode book, or at least was in the version which was
current when I tried to read up on this some years ago).

> Can you send that message as a binary attachment?

It's not very sensitive (it's about shares and options for a company I
used to be employed by), but I'd prefer it not to be posted publicly on
the bugtracker, or widely distributed among emacs hackers.

I'll try to send you a private mail with the bulk of the message with
the body of the attachment replaced (the base64 text in the raw message;
if the problem really is with the attachment headers, that shouldn't
matter); if that's for some reason not usable, I'll send you the
complete message.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-28 22:17   ` Niels Möller
@ 2013-11-28 22:46     ` Niels Möller
  2013-11-29  7:16     ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-28 22:46 UTC
  To: Eli Zaretskii; +Cc: 15984

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

nisse@lysator.liu.se (Niels Möller) writes:

>> Can you send that message as a binary attachment?
>
> It's not very sensitive (it's about shares and options for a company I
> used to be employed by), but I'd prefer it not to be posted publicly on
> the bugtracker, or widely distributed among emacs hackers.

I've now created a smaller, anonymized example. I tried to mail it to
myself with sendmail -t, to confirm it still crashes emacs. Mailing for
some reason didn't work, but the bounce I got back is a good enough
example: It is displayed by Gnus with a button looking like

[5. application/pdf; Brev aktieägar 131127.pdf]...

and pressing "o" on that makes emacs crash, just as with the original
message.

Attached in gzip form. I hope emacs doesn't automagically unpack and
display the buttons for the embedded attachment when you read this in
emacs, but if it does, be careful.

Regards,
/Niels


[-- Attachment #2: Compressed problem message --]
[-- Type: application/octet-stream, Size: 2160 bytes --]

[-- Attachment #3: Type: text/plain, Size: 133 bytes --]


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-28 22:17   ` Niels Möller
  2013-11-28 22:46     ` Niels Möller
@ 2013-11-29  7:16     ` Eli Zaretskii
  2013-11-29  8:49       ` Niels Möller
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29  7:16 UTC
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Cc: 15984@debbugs.gnu.org
> Date: Thu, 28 Nov 2013 23:17:06 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > If you expected Emacs to perform normalization in this case, then I
> > don't think we do this automatically (or at all).
> 
> I think for display, normalizing is definitely the right thing to do
> (the unicode spec, as I understand it, require that a "compliant"
> implementation treats different ways to code "ä" equivalently).
> But I understand if emacs currenty doesn't do that.

Someone(TM) should write the code to do that.

> (Digression: I think text-processor supporting unicode really ought to
> represent "characters" as interned strings of unicode (or utf-8) code
> points.

That's what Emacs does since v23.1 (except that we extend the range of
Unicode codepoints to represent some non-unified characters and binary
raw bytes).

> These characters can have relations such as "normalized to"

This part requires incorporation of tables and supporting code, which
needs to be written.

> glyphs should usually be associated only with the normalized form.

Here I disagree.  There are definitely situations where this is not
TRT, and they aren't "unusual".

> I'll try to send you a private mail with the bulk of the message with
> the body of the attachment replaced (the base64 text in the raw message;
> if the problem really is with the attachment headers, that shouldn't
> matter); if that's for some reason not usable, I'll send you the
> complete message.

Thanks.  I'd also need instructions to display that message in Gnus
after saving it to a file, starting with "emacs -Q", as I don't have
Gnus set up and don't use it in my day-to-day work.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29  7:16     ` Eli Zaretskii
@ 2013-11-29  8:49       ` Niels Möller
  2013-11-29  9:00         ` Eli Zaretskii
  0 siblings, 1 reply; 21+ messages in thread
From: Niels Möller @ 2013-11-29  8:49 UTC
  To: Eli Zaretskii; +Cc: 15984

Eli Zaretskii <eliz@gnu.org> writes:

>> (Digression: I think text-processor supporting unicode really ought to
>> represent "characters" as interned strings of unicode (or utf-8) code
>> points.
>
> That's what Emacs does since v23.1 (except that we extend the range of
> Unicode codepoints to represent some non-unified characters and binary
> raw bytes).

Good! I thought emacs used a simpler mapping character <-> a single
unicode value.

> > glyphs should usually be associated only with the normalized form.
> 
> Here I disagree.  There are definitely situations where this is not
> TRT, and they aren't "unusual".

Ok. What's the typical use case where you'd want to have different
glyphs for "Å", "A" + ring above combining char, and Ångström unit sign?
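
For reference, a short Python sketch of the three encodings in question.
It only shows how the standard normalization forms relate them, and
takes no side on how they should be displayed.

```python
import unicodedata

forms = {
    "precomposed": "\u00c5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
    "decomposed": "A\u030a",    # "A" + COMBINING RING ABOVE
    "angstrom sign": "\u212b",  # ANGSTROM SIGN
}


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, text in forms.items():
    nfc = unicodedata.normalize("NFC", text)
    print(f"{label:14} {code_points(text):14} -> NFC {code_points(nfc)}")

# All three inputs are distinct strings, but they share one NFC form.
nfc_forms = {unicodedata.normalize("NFC", text) for text in forms.values()}
```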

> Thanks.  I'd also need instructions to display that message in Gnus
> after saving it to a file, starting with "emacs -Q", as I don't have
> Gnus set up and don't use it in my day-to-day work.

I'm also not sure how to do that, but I'll try to figure it out.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29  8:49       ` Niels Möller
@ 2013-11-29  9:00         ` Eli Zaretskii
  2013-11-29 10:43           ` Niels Möller
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29  9:00 UTC
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Cc: 15984@debbugs.gnu.org
> Date: Fri, 29 Nov 2013 09:49:15 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> (Digression: I think text-processor supporting unicode really ought to
> >> represent "characters" as interned strings of unicode (or utf-8) code
> >> points.
> >
> > That's what Emacs does since v23.1 (except that we extend the range of
> > Unicode codepoints to represent some non-unified characters and binary
> > raw bytes).
> 
> Good! I thought emacs used a simpler mapping character <-> a single
> unicode value.

Maybe I misunderstood you: what's the difference between those two
alternatives?

> > > glyphs should usually be associated only with the normalized form.
> > 
> > Here I disagree.  There are definitely situations where this is not
> > TRT, and they aren't "unusual".
> 
> Ok. What's the typical use case where you'd want to have different
> glyphs for "Å", "A" + ring above combining char, and Ångström unit sign?

MacOS file names, I think.  Also, display in "C-u C-x =", which is
very important for understanding and debugging Emacs display features.

> > Thanks.  I'd also need instructions to display that message in Gnus
> > after saving it to a file, starting with "emacs -Q", as I don't have
> > Gnus set up and don't use it in my day-to-day work.
> 
> I'm also not sure how to do that, but I'll try to figure out.

Thanks.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29  9:00         ` Eli Zaretskii
@ 2013-11-29 10:43           ` Niels Möller
  2013-11-29 11:26             ` Eli Zaretskii
  2013-11-29 15:04             ` Stefan Monnier
  0 siblings, 2 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-29 10:43 UTC
  To: Eli Zaretskii; +Cc: 15984

Eli Zaretskii <eliz@gnu.org> writes:

>> Good! I thought emacs used a simpler mapping character <-> a single
>> unicode value.
>
> Maybe I misunderstood you: what's the difference between those two
> alternatives?

What I think is the right thing, is to allow a sequence of unicode
values, e.g., "A" + combining character, or "A" + any random sequence of
combining characters, intern this string, and treat this as a single
"character".

The idea is that this character object should correspond to what the
user thinks of as a single character. E.g, one glyph per character, and
treated as a unit by forward-char, and regexp matching with "." and
character sets.

When reading text files, the character boundaries may be configurable.
E.g, there could be a mode which makes each and every unicode value a
single character, which will then be displayed as separate glyphs,
separate characters for regexp matching, etc.

>> > Thanks.  I'd also need instructions to display that message in Gnus
>> > after saving it to a file, starting with "emacs -Q", as I don't have
>> > Gnus set up and don't use it in my day-to-day work.

Move away any gnus-related configuration files (~/.gnus, ~/.newsrc*).

Create a spool-like directory, e.g, "~/tmp/mail". Copy the file to
"~/tmp/mail/1". Start emacs -Q -nw -f gnus-no-server. In the *Group* buffer,
press G d to create a directory group, enter ~/tmp/mail. You should now
be able to enter that group, and select the message in the *Summary*
buffer.

To mimic my setup, do this in an xterm running in a latin-1 locale. (I
have to send this off now, I'll try later to really see if this recipe
reproduces the problem for me).

I also tried to reproduce the problem on another machine, with debian
gnu/linux and emacs-23.4. This version worked fine, no crash.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 10:43           ` Niels Möller
@ 2013-11-29 11:26             ` Eli Zaretskii
  2013-11-29 12:41               ` Niels Möller
  2013-11-29 15:04             ` Stefan Monnier
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29 11:26 UTC
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Cc: 15984@debbugs.gnu.org
> Date: Fri, 29 Nov 2013 11:43:45 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Good! I thought emacs used a simpler mapping character <-> a single
> >> unicode value.
> >
> > Maybe I misunderstood you: what's the difference between those two
> > alternatives?
> 
> What I think is the right thing, is to allow a sequence of unicode
> values, e.g., "A" + combining character, or "A" + any random sequence of
> combining characters, intern this string, and treat this as a single
> "character".

That's not how Emacs represents and treats characters.  The
composition happens only at display time, and normalization, as it's
currently implemented, happens when text is read into a buffer.
Thereafter, each Unicode character is a single character, and there's
no combining of them for any purpose except display.

> The idea is that this character object should correspond to what the
> user thinks of as a single character. E.g, one glyph per character, and
> treated as a unit by forward-char, and regexp matching with "." and
> character sets.

What gets displayed as a single unit is a "grapheme cluster", not a
single glyph.  Whether a grapheme cluster that corresponds to "A" +
any random sequence of combining characters maps to a single glyph
depends on the font being used, which is something the user should not
need to worry about.  However, we do want to give the user a way to
delete only one or more of the combining characters, so forcing the
entire combination to be a single indivisible entity would not be TRT
for users.  Cursor motion does consider the entire thing as a single
entity and moves across all of it, but that requires special code.

IOW, things are not that simple, and I think the design you are
suggesting is problematic in that it will remove several important
features, or make them harder to implement.

> When reading text files, the character boundaries may be configurble.

The important question is what to do by default, as many users will
not be happy if asked too many questions or requested to specify too
many parameters for reading text.  Compare this with the need to
specify the encoding in too many cases in the early days of
multilingual Emacs -- there was a user outcry about that.

> E.g, there could be a mode which makes each and every unicode value a
> single character, which will then be displayed as separate glyphs,
> separate characters for regexp matching, etc.

You are mixing display issues with editing issues and with how
characters are represented internally in an Emacs buffer.  These all
are separate, and do not necessarily need to handle characters in the
same rigid way.

> Move away any gnus-related configuration files (~/.gnus, ~/.newsrc*).
> 
> Create a spool-like directory, e.g, "~/tmp/mail". Copy the file to
> "~/tmp/mail/1". Start emacs -Q -nw -f gnus-no-server. In the *Group* buffer,
> press G d to create a directory group, enter ~/tmp/mail. You should now
> be able to enter that group, and select the message in the *Summary*
> buffer.
> 
> To mimic my setup, do this in an xterm running in a latin-1 locale. (I
> have to send this off now, I'll try later to really see if this recipe
> reproduces the problem for me).

Thanks, I will try that.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 11:26             ` Eli Zaretskii
@ 2013-11-29 12:41               ` Niels Möller
  2013-11-29 14:50                 ` Eli Zaretskii
  2013-11-29 16:18                 ` Eli Zaretskii
  0 siblings, 2 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-29 12:41 UTC
  To: Eli Zaretskii; +Cc: 15984

Eli Zaretskii <eliz@gnu.org> writes:

> However, we do want to give the user a way to
> delete only one or more of the combining characters, so forcing the
> entire combination to be a single indivisible entity would not be TRT
> for users.

Good question, how to handle this.

Today, to remove the dots from an "ä" character, I'll have to delete the
complete "ä" character and insert a new "a" character. Or similarly for
the reverse edit. I think this "atomic" handling is the desired
behaviour in many cases. And I don't think it should behave differently
depending on the representation of "ä" in the original file. But if you
have a complex sequence of unicode combining characters, I agree there's
some need to be able to edit it. Maybe put point on the character and
invoke edit-char to go in some special mode which explodes the usually
"atomic" character into smaller pieces.

And such a character edit mode might be useful for more things than
unicode composing characters, e.g, manipulating the different sub-parts
of a chinese character. Anyway, this user interface is not intimately
tied to the internal character representation; its overall effect on the
buffer will be the same as replacing any substring.

>> When reading text files, the character boundaries may be configurble.
>
> The important question is what to do by default,

I'm pretty sure the default should be that a sequence of one unicode
base char and all following unicode combining chars is interned as a
single "emacs character". (I think the detailed rules for this are
spelled out in the unicode book). With some arbitrary limit to prevent a
GByte file with only unicode combining characters from being read as a
single emacs character; say at most 10 combining characters.
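
A rough Python sketch of that default rule, under loud assumptions: a
"combining char" is taken to be any code point with a non-zero canonical
combining class, which is a simplification of the real Unicode cluster
rules, and the function name and max_marks parameter are invented here.

```python
import unicodedata


def clusters(text, max_marks=10):
    """Group each code point with the combining marks that follow it.

    At most max_marks marks are appended to the first code point of a
    cluster; any further mark starts a new cluster of its own.
    """
    out = []
    for ch in text:
        is_mark = unicodedata.combining(ch) != 0
        if is_mark and out and len(out[-1]) <= max_marks:
            out[-1] += ch  # attach the mark to the current cluster
        else:
            out.append(ch)  # base character, or limit reached
    return out


print(clusters("aktiea\u0308gar"))
print(len(clusters("a" + "\u0308" * 25)))
```

The cap keeps a pathological input from collapsing into one unit: the
25 marks in the second call end up split over several clusters instead
of one.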

> You are mixing display issues with editing issues and with how
> characters are represented internally in an Emacs buffer.

I think it's confusing for users if the units of text which forward-char
skips over, do not correspond to the units matched by "." in
isearch-forward-regexp.
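
The mismatch can be shown with Python's re module standing in for a
regexp engine that matches "." against one code point at a time (an
analogy only, not a statement about Emacs's matcher). The second
pattern is a hand-written approximation that lets a character absorb
following marks, and it only covers the U+0300..U+036F block.

```python
import re

text = "aktiea\u0308gar"  # decomposed "ä": "a" + COMBINING DIAERESIS

# "." consumes exactly one code point, so the mark is matched separately.
per_code_point = re.findall(r".", text)

# Approximate "user-perceived character": one code point plus any marks
# from the Combining Diacritical Marks block that follow it.
per_cluster = re.findall(r".[\u0300-\u036f]*", text)

print(len(per_code_point), len(per_cluster))
```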

My suggested internal representation seems to be a natural way to get
this correspondence right, at the cost of some memory (or lots of
complexity in reducing memory usage). I'm sure there are other ways, and
maybe also a lot better ways, to implement the same thing.

> Thanks, I will try that.

Now I've also reproduced it on the same machine, without my normal Gnus
setup getting in the way. I start emacs with

  $ rm -rf ~/tmp/home/ && mkdir ~/tmp/home/ && HOME=$HOME/tmp/home emacs -nw -Q -l bug.el

where bug.el contains

  (setq gnus-init-file nil)
  (setq gnus-nntp-server nil)
  (gnus-no-server)

Then create the group with G d, pointing out the spool-like directory,
enter the group (RET), view the message (RET), try to write out the
attachment ("o" on the attachment button). Still crashes for me.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-28 20:25 ` Eli Zaretskii
  2013-11-28 22:17   ` Niels Möller
@ 2013-11-29 13:11   ` Kenichi Handa
  1 sibling, 0 replies; 21+ messages in thread
From: Kenichi Handa @ 2013-11-29 13:11 UTC
  To: Eli Zaretskii; +Cc: 15984, nisse

In article <83iovc8eaq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> If you expected Emacs to perform normalization in this case, then I
> don't think we do this automatically (or at all).

The library "ucs-normalize" (under lisp/international/)
provides the coding system utf-8-hfs which may be appropriate
for file-name-coding-system on Mac.
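
To illustrate why a Mac sender produces this byte sequence, here is a
Python sketch using plain NFD as a stand-in. As far as I know, the HFS+
file system uses its own variant of decomposition that differs from
standard NFD for some code point ranges, so this is only an
approximation of what utf-8-hfs does. The function name is hypothetical.

```python
import unicodedata


def to_decomposed(name):
    """Approximate a decomposed (Mac-style) file name with standard NFD."""
    return unicodedata.normalize("NFD", name)


typed = "Brev aktie\u00e4gar 131127.pdf"  # name with precomposed U+00E4
stored = to_decomposed(typed)

# The precomposed letter becomes "a" + U+0308, i.e. bytes 61 CC 88 in
# UTF-8, matching the od dump earlier in this thread.
print(stored.encode("utf-8")[10:13].hex(" "))

# Normalizing back to NFC recovers the original spelling.
roundtrip = unicodedata.normalize("NFC", stored)
```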

---
Kenichi Handa
handa@m17n.org






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 12:41               ` Niels Möller
@ 2013-11-29 14:50                 ` Eli Zaretskii
  2013-11-29 16:18                 ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29 14:50 UTC
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Cc: 15984@debbugs.gnu.org
> Date: Fri, 29 Nov 2013 13:41:01 +0100
> 
> Today, to remove the dots from an "ä" character, I'll have to delete the
> complete "ä" character and insert a new "a" character.

Not if they were originally two or more characters which were composed
into one.  In that case, we let the user edit them separately.

> I think this "atomic" handling is the desired behaviour in many
> cases.

For "ä", this is arguable.  For more complex scripts, this is
definitely wrong: users want to be able to edit each component
separately.

> But if you have a complex sequence of unicode combining characters,
> I agree there's some need to be able to edit it. Maybe put point on
> the character and invoke edit-char to go in some special mode which
> explodes the usually "atomic" character into smaller pieces.

We already do that, but if the characters were combined, and Emacs
doesn't even know they were separate to begin with, it cannot do that,
can it?

> > You are mixing display issues with editing issues and with how
> > characters are represented internally in an Emacs buffer.
> 
> I think it's confusing for users if the units of text which forward-char
> skips over, do not correspond to the units matched by "." in
> isearch-forward-regexp.

What happens under the hood with matching and what is shown to the
user doesn't have to be identical.  In fact, it cannot be identical.
Again, please don't mix internal implementation and UI; they cannot
possibly be identical anyway, because there are conflicting user
requirements in different situations.

> My suggested internal representation seems to be a natural way to get
> this correspondence right, at the cost of some memory (or lots of
> complexity in reducing memory usage).

It only seems to be that.  Real life is much more messy, and defeats
such simplicity on many levels.

> Now I've also reproduced it on the same machine, without my normal Gnus
> setup getting in the way. I start emacs with
> 
>   $ rm -rf ~/tmp/home/ && mkdir ~/tmp/home/ && HOME=$HOME/tmp/home emacs -nw -Q -l bug.el
> 
> where bug.el contains
> 
>   (setq gnus-init-file nil)
>   (setq gnus-nntp-server nil)
>   (gnus-no-server)
> 
> Then create the group with G d, pointing out the spool-like directory,
> enter the group (RET), view the message (RET), try to write out the
> attachment ("o" on the attachment button). Still crashes for me.

Thanks.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 10:43           ` Niels Möller
  2013-11-29 11:26             ` Eli Zaretskii
@ 2013-11-29 15:04             ` Stefan Monnier
  2013-11-29 15:27               ` Eli Zaretskii
  2013-11-30  8:53               ` Niels Möller
  1 sibling, 2 replies; 21+ messages in thread
From: Stefan Monnier @ 2013-11-29 15:04 UTC (permalink / raw)
  To: Niels Möller; +Cc: 15984

> What I think is the right thing, is to allow a sequence of unicode
> values, e.g., "A" + combining character, or "A" + any random sequence of
> combining characters, intern this string, and treat this as a single
> "character".

For the Lisp-level notion of "character", I think this would require too
many deep changes.

> The idea is that this character object should correspond to what the
> user thinks of as a single character. E.g, one glyph per character, and
> treated as a unit by forward-char, and regexp matching with "." and
> character sets.

For forward-char, we do try to fake that behavior (e.g. a `forward-char'
command will skip over the whole A+ring combo) but not faithfully
(e.g. `C-u 2 forward-char' will also just skip that combo, and not the
subsequent char).  It's not perfect, but it seems "close enough" that it
hasn't proved problematic.
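
To make the difference between stepping by code point and stepping by
"base character plus its combining marks" concrete, here is a minimal
sketch.  It is not how Emacs implements forward-char (Emacs relies on
its composition machinery), and it assumes, for simplicity, that only
U+0300..U+036F are combining marks; real grapheme cluster rules are
considerably more involved.  The names combining_mark_p, next_cluster
and forward_clusters are hypothetical.

```c
#include <stddef.h>

/* Simplifying assumption: only the Combining Diacritical Marks block
   (U+0300..U+036F) is treated as combining.  */
int
combining_mark_p (unsigned c)
{
  return c >= 0x0300 && c <= 0x036F;
}

/* Return the index just past the cluster starting at POS: one code
   point plus any combining marks that immediately follow it.  */
size_t
next_cluster (const unsigned *text, size_t len, size_t pos)
{
  if (pos < len)
    pos++;
  while (pos < len && combining_mark_p (text[pos]))
    pos++;
  return pos;
}

/* Move forward N clusters from POS, stopping at the end of TEXT.  */
size_t
forward_clusters (const unsigned *text, size_t len, size_t pos, size_t n)
{
  while (n-- > 0 && pos < len)
    pos = next_cluster (text, len, pos);
  return pos;
}
```

For the code points A, U+030A, b, c, moving forward one cluster from
index 0 ends at index 2 (past the whole A+ring combo), and moving
forward two clusters ends at index 3.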

Adjusting . in regexps would indeed help solve some
unexpected behaviors.  We would probably want to keep the ability to match
a single "code point", so we'd need to introduce a new regexp operator.

Maybe we could follow the lead of the POSIX collation thingy, IIRC,
where [ß] in case-folding mode wants to be able to match SS in
a German locale.  So maybe [[:any:]] could match A+ring.

> E.g, there could be a mode which makes each and every unicode value a
> single character, which will then be displayed as separate glyphs,
> separate characters for regexp matching, etc.

I think we wouldn't want to use different modes (too coarse) but
different commands instead.

In any case, a first step would be to find a name for that notion of "multi
character character".  "Grapheme cluster" doesn't sound too good if we
want to expose the concept to the end user.


        Stefan






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 15:04             ` Stefan Monnier
@ 2013-11-29 15:27               ` Eli Zaretskii
  2013-11-30  8:53               ` Niels Möller
  1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29 15:27 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15984, nisse

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz@gnu.org>,  15984@debbugs.gnu.org
> Date: Fri, 29 Nov 2013 10:04:04 -0500
> 
> In any case, a first step would be to find a name for that notion of "multi
> character character".  "Grapheme cluster" doesn't sound too good if we
> want to expose the concept to the end user.

Why should we invent terminology where one already exists and is
widely accepted and used?  It sounds like a waste of energy.  Explain
the term well enough, and users will have no difficulty.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 12:41               ` Niels Möller
  2013-11-29 14:50                 ` Eli Zaretskii
@ 2013-11-29 16:18                 ` Eli Zaretskii
  2013-11-30 13:20                   ` Eli Zaretskii
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-29 16:18 UTC (permalink / raw)
  To: Niels Möller; +Cc: 15984

> From: nisse@lysator.liu.se (Niels Möller)
> Cc: 15984@debbugs.gnu.org
> Date: Fri, 29 Nov 2013 13:41:01 +0100
> 
>   $ rm -rf ~/tmp/home/ && mkdir ~/tmp/home/ && HOME=$HOME/tmp/home emacs -nw -Q -l bug.el
> 
> where bug.el contains
> 
>   (setq gnus-init-file nil)
>   (setq gnus-nntp-server nil)
>   (gnus-no-server)
> 
> Then create the group with G d, pointing out the spool-like directory,
> enter the group (RET), view the message (RET), try to write out the
> attachment ("o" on the attachment button). Still crashes for me.

It crashes in the current development trunk as well, but only if the
locale is set to Latin-1, like yours.

I'm looking at this.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 15:04             ` Stefan Monnier
  2013-11-29 15:27               ` Eli Zaretskii
@ 2013-11-30  8:53               ` Niels Möller
  1 sibling, 0 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-30  8:53 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 15984

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> What I think is the right thing, is to allow a sequence of unicode
>> values, e.g., "A" + combining character, or "A" + any random sequence of
>> combining characters, intern this string, and treat this as a single
>> "character".
>
> For the Lisp-level notion of "character", I think this would require too
> many deep changes.

I can understand that. I'm actually impressed by the move from MULE
encodings to unicode, which to a user appeared to be very smooth.

But I still think that type of "character" abstraction is the right thing
for unicode text processing in general.

> For forward-char, we do try to fake that behavior (e.g. a `forward-char'
> command will skip over the whole A+ring combo) but not faithfully
> (e.g. `C-u 2 forward-char' will also just skip that combo, and not the
> subsequent char).  It's not perfect, but it seems "close enough" that it
> hasn't proved problematic.

Didn't know, that's a bit weird. I just tried, as Eli suggested, editing
text with "ä" represented as "a" plus a combining character. In
emacs-23.4, pressing DEL after the "ä" deletes the dots only. I now
understand why, but it's not what I had expected, and I think deleting
the entire A + dots would be preferable. Plain C-x = on the "a" shows
just "Char: a (97, #o141, #x61) point=443 of 455 (97%) column=1", but
C-u C-x = also shows the combining char.

However, emacs-24.3 behaves differently: the 'a' and the '"' get
displayed differently, and are not combined at all for display.
The buffer shows 'a"', and according to C-u C-x = the '"' is a
"COMBINING DIAERESIS". These tests were done in an X11 frame, so maybe
they're just picking up different fonts?

>> E.g, there could be a mode which makes each and every unicode value a
>> single character, which will then be displayed as separate glyphs,
>> separate characters for regexp matching, etc.
>
> I think we wouldn't want to use different modes (too coarse) but
> different commands instead.

I didn't mean an emacs major or minor mode. It would be more like a
special coding system, applied when reading the text from file.

> In any case, a first step would be to find a name for that notion of "multi
> character character".  "Grapheme cluster" doesn't sound too good if we
> want to expose the concept to the end user.

I think "character" is the right word, the main source of confusion is
that unicode code points are often referred to as "characters".

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-29 16:18                 ` Eli Zaretskii
@ 2013-11-30 13:20                   ` Eli Zaretskii
  2013-11-30 14:25                     ` Kenichi Handa
  2013-11-30 15:50                     ` Niels Möller
  0 siblings, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-30 13:20 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 15984, nisse

> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 15984@debbugs.gnu.org
> 
> > From: nisse@lysator.liu.se (Niels Möller)
> > Cc: 15984@debbugs.gnu.org
> > Date: Fri, 29 Nov 2013 13:41:01 +0100
> > 
> >   $ rm -rf ~/tmp/home/ && mkdir ~/tmp/home/ && HOME=$HOME/tmp/home emacs -nw -Q -l bug.el
> > 
> > where bug.el contains
> > 
> >   (setq gnus-init-file nil)
> >   (setq gnus-nntp-server nil)
> >   (gnus-no-server)
> > 
> > Then create the group with G d, pointing out the spool-like directory,
> > enter the group (RET), view the message (RET), try to write out the
> > attachment ("o" on the attachment button). Still crashes for me.
> 
> It crashes in the current development trunk as well, but only if the
> locale is set to Latin-1, like yours.
> 
> I'm looking at this.

There's something strange going on here; I'm CC'ing Handa-san, because
the problem is related to processing character compositions on a TTY.

The reason for the crash is simple: the following code from
indent.c:scan_for_column

      /* Check composition sequence.  */
      if (cmp_it.id >= 0
	  || (scan == cmp_it.stop_pos
	      && composition_reseat_it (&cmp_it, scan, scan_byte, end,
					w, NULL, Qnil)))
	composition_update_it (&cmp_it, scan, scan_byte, Qnil);
      if (cmp_it.id >= 0)
	{
	  scan += cmp_it.nchars;
	  scan_byte += cmp_it.nbytes;
	  if (scan <= end)
	    col += cmp_it.width;
	  if (cmp_it.to == cmp_it.nglyphs)
	    {
	      cmp_it.id = -1;
	      composition_compute_stop_pos (&cmp_it, scan, scan_byte, end,
					    Qnil);
	    }
	  else
	    cmp_it.from = cmp_it.to;
	  continue;
	}

incorrectly steps into the middle of a multibyte sequence #xCC #x88
for the character u+0308, the Combining Diaeresis, because
cmp_it.nbytes is computed as 1 instead of 2.  The question is why it
does so.

From stepping through composition_reseat_it and composition_update_it,
it looks like the code contradicts itself: it thinks that 'a' and the
combining diaeresis should be composed, but then acts as if no
composition should happen.  As a result, this code in
composition_update_it:

      glyph = LGSTRING_GLYPH (gstring, cmp_it->from);
      cmp_it->nchars = LGLYPH_TO (glyph) + 1 - from;
      cmp_it->nbytes = 0;
      cmp_it->width = 0;
      for (i = cmp_it->nchars - 1; i >= 0; i--)
	{
	  c = XINT (LGSTRING_CHAR (gstring, i));
	  cmp_it->nbytes += CHAR_BYTES (c);
	  cmp_it->width += CHAR_WIDTH (c);
	}

always considers only 'a', never the diaeresis, and so cmp_it->nbytes
is always computed as 1.  So scan_for_column advances only 1 byte,
instead of 2, and finds itself in the middle of a multibyte sequence.
From there, it's a sure way to a crash.
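
To illustrate why a byte count of 1 is fatal at this point, here is a
minimal standalone sketch.  It is not Emacs code: the helper names
utf8_bytes and utf8_continuation_p are hypothetical, and utf8_bytes
only approximates the role CHAR_BYTES plays above, assuming plain
UTF-8 and code points up to U+10FFFF.

```c
/* Number of bytes UTF-8 uses to encode code point C.
   Assumes C <= 0x10FFFF.  */
int
utf8_bytes (unsigned c)
{
  if (c < 0x80)
    return 1;
  if (c < 0x800)
    return 2;
  if (c < 0x10000)
    return 3;
  return 4;
}

/* Nonzero if B is a UTF-8 continuation byte (bit pattern 10xxxxxx),
   that is, a byte at which no character can start.  */
int
utf8_continuation_p (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}
```

With the bytes 61 CC 88 67 ('a', U+0308, 'g'), the combining diaeresis
starts at byte offset 1.  Advancing from there by utf8_bytes ('a'),
which is 1, lands on the continuation byte 88; advancing by
utf8_bytes (0x0308), which is 2, lands on 'g' as it should.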

I hope Handa-san will be able to find the problem.  The crash is 100%
reproducible with the steps described above and a mail message that
Niels can send you off-list.

TIA






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-30 13:20                   ` Eli Zaretskii
@ 2013-11-30 14:25                     ` Kenichi Handa
  2013-11-30 16:09                       ` Eli Zaretskii
  2013-11-30 15:50                     ` Niels Möller
  1 sibling, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2013-11-30 14:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15984, nisse

In article <83siue58mq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> There's something strange going on here; I'm CC'ing Handa-san, because
> the problem is related to processing character compositions on a TTY.

[...]

> I hope Handa-san will be able to find the problem.  The crash is 100%
> reproducible with the steps described above and a mail message that
> Niels can send you off-list.

Thank you for tracking down the bug.  I'll investigate
the cause of the problem.

---
Kenichi Handa
handa@m17n.org






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-30 13:20                   ` Eli Zaretskii
  2013-11-30 14:25                     ` Kenichi Handa
@ 2013-11-30 15:50                     ` Niels Möller
  1 sibling, 0 replies; 21+ messages in thread
From: Niels Möller @ 2013-11-30 15:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15984

Eli Zaretskii <eliz@gnu.org> writes:

> I hope Handa-san will be able to find the problem.  The crash is 100%
> reproducible with the steps described above and a mail message that
> Niels can send you off-list.

I ended up sending an anonymized example message to the list, see
http://debbugs.gnu.org/cgi/bugreport.cgi?msg=14;filename=bounce.gz;att=1;bug=15984

Thanks for looking into this.

Regards,
/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.






* bug#15984: 24.3; Problem with combining characters in attachment filename
  2013-11-30 14:25                     ` Kenichi Handa
@ 2013-11-30 16:09                       ` Eli Zaretskii
  0 siblings, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2013-11-30 16:09 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 15984, nisse

> From: Kenichi Handa <handa@gnu.org>
> Cc: nisse@lysator.liu.se, 15984@debbugs.gnu.org
> Date: Sat, 30 Nov 2013 23:25:06 +0900
> 
> > I hope Handa-san will be able to find the problem.  The crash is 100%
> > reproducible with the steps described above and a mail message that
> > Niels can send you off-list.
> 
> Thank you for tracking down the bug.  I'll investigate
> the cause of the problem.

Thanks.  To save you some time, the problem only happens in a Latin-1
locale, so I used this command to invoke Emacs:

   HOME=$HOME/tmp LC_CTYPE=sv_SE.ISO8859-1 src/emacs -Q -l bug.el






* bug#15984: 24.3; Problem with combining characters in attachment filename
       [not found] ` <87eh574qmm.fsf@gnu.org>
@ 2014-01-17 13:30   ` K. Handa
  0 siblings, 0 replies; 21+ messages in thread
From: K. Handa @ 2014-01-17 13:30 UTC (permalink / raw)
  To: K. Handa; +Cc: 15984, nisse

In article <87eh574qmm.fsf@gnu.org>, handa@gnu.org (K. Handa) writes:

> I'll keep trying to find why the trunk doesn't crash with
> your recipe, and once I find the whole story, I'll install a
> proper patch (which may be the same as what I sent) to the
> trunk.

I couldn't reproduce that bug with the trunk code.  I
rewound to 2013-03-11, the day 24.3 was released, and I
can reproduce the bug with 24.3.  So, I
am now very puzzled.

Anyway, I installed that fix to the trunk because the
previous code was apparently wrong.

---
Kenichi Handa
handa@gnu.org

PS. I've just noticed that recent mails exchanged on this
matter were not CC:ed to 15984@debbugs.gnu.org.  So, to
provide the context, I attach some key mails here.

-1--------------------------------------------------------------------
From: nisse@lysator.liu.se (Niels Möller)
To: handa@gnu.org (K. Handa)
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

handa@gnu.org (K. Handa) writes:

> In article <83siue58mq.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
>
>> I hope Handa-san will be able to find the problem.  The crash is 100%
>> reproducible with the steps described above and a mail message that
>> Niels can send you off-list.
>
> Could you please send me that mail message?  I'll delete it
> as soon as I can find a fix.

I believe the smaller bounce message I posted in the bugtracker exhibits
the problem. That's the same file Eli was using when reproducing the
problem. Described at

  http://debbugs.gnu.org/cgi/bugreport.cgi?msg=14;bug=15984

actual message (gzipped):

  http://debbugs.gnu.org/cgi/bugreport.cgi?msg=14;filename=bounce.gz;att=1;bug=15984

Steps to reproduce the problem (this info is spread out in the bug thread):

1. Create a new directory, say mail-tmp. Copy the message (uncompressed)
   into that directory, with filename "1".

2. Start emacs in tty mode, with a latin-1 locale, like

   HOME=$HOME/tmp LC_CTYPE=sv_SE.ISO8859-1 src/emacs -Q -l bug.el

   with bug.el containing

     (setq gnus-init-file nil)
     (setq gnus-nntp-server nil)
     (gnus-no-server)

3. Then, in Gnus' *Group* buffer, create the group with G d, pointing
   out the mail-tmp directory, enter the group (RET), view the message
   (RET), try to write out the attachment ("o" on the attachment
   button). Still crashes for me.

Let me know if you need any further info.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

-2--------------------------------------------------------------------
From: Eli Zaretskii <eliz@gnu.org>
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename
To: handa@gnu.org (K. Handa)
Cc: nisse@lysator.liu.se, handa@gnu.org

> From: handa@gnu.org (K. Handa)
> Cc: eliz@gnu.org, handa@gnu.org
> Date: Fri, 13 Dec 2013 23:15:00 +0900
> 
> In article <nn4n6dag53.fsf@bacon.lysator.liu.se>, nisse@lysator.liu.se (Niels Möller) writes:
> 
> > And tty mode, no X frame (I used an xterm, started in a latin-1 locale).
> 
> Yes.  I did add the "-nw" argument, and I tried the recipe
> with xterm and lxterminal.

I cannot reproduce this either, with today's trunk.  Perhaps you
could try with the trunk as it was on Nov 30, or with Emacs 24.3?

> By the way, I noticed that buffer-file-coding-system of
> Gnus's message buffer (the buffer showing that bounce mail)
> is raw-text-unix.  Is it the same with you?

Yes.  This might be part of the problem, or it could be the trigger
for the crash.


-3---------------------------------------------------------------------
From: handa@gnu.org (K. Handa)
To: nisse@lysator.liu.se (Niels Möller)
Cc: eliz@gnu.org, handa@gnu.org
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

In article <nn4n6dag53.fsf@bacon.lysator.liu.se>, nisse@lysator.liu.se (Niels Möller) writes:

> And tty mode, no X frame (I used an xterm, started in a latin-1 locale).

Yes.  I did add the "-nw" argument, and I tried the recipe
with xterm and lxterminal.

By the way, I noticed that buffer-file-coding-system of
Gnus's message buffer (the buffer showing that bounce mail)
is raw-text-unix.  Is it the same with you?

---
Kenichi Handa
handa@gnu.org

-4--------------------------------------------------------------------
From: nisse@lysator.liu.se (Niels Möller)
To: handa@gnu.org (K. Handa)
Cc: eliz@gnu.org
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

handa@gnu.org (K. Handa) writes:

> By the way, I noticed that buffer-file-coding-system of
> Gnus's message buffer (the buffer showing that bounce mail)
> is raw-text-unix.  Is it the same with you?

Yes. Probably wasn't in the original mail (if you like, I can look into
that further, but I don't want to crash the emacs I'm writing this in
right now).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

-5--------------------------------------------------------------------
From: handa@gnu.org (K. Handa)
To: Eli Zaretskii <eliz@gnu.org>
Cc: nisse@lysator.liu.se, handa@gnu.org
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

In article <838uvo6cjx.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> I cannot reproduce this either, with today's trunk.  Perhaps you
> could try with the trunk as it was on Nov 30, or with Emacs 24.3?

With Emacs 24.3, I could reproduce the bug and the patch
attached at the tail seems to fix it.  Could you please try
it?  It is applicable to the latest code too.

But, with the trunk, I have not yet succeeded in reproducing
the bug.  I tried from the revision of Nov 30 and went
back to April, one month at a time.

> > By the way, I noticed that buffer-file-coding-system of
> > Gnus's message buffer (the buffer showing that bounce mail)
> > is raw-text-unix.  Is it the same with you?

> Yes.  This might be part of the problem, or it could be the trigger
> for the crash.

With Emacs 24.3, the bug can be reproduced with a multibyte
buffer.

---
Kenichi Handa
handa@gnu.org

=== modified file 'src/composite.c'
--- src/composite.c	2013-01-01 09:11:05 +0000
+++ src/composite.c	2013-12-19 13:49:53 +0000
@@ -1426,7 +1426,7 @@
       cmp_it->width = 0;
       for (i = cmp_it->nchars - 1; i >= 0; i--)
 	{
-	  c = XINT (LGSTRING_CHAR (gstring, i));
+	  c = XINT (LGSTRING_CHAR (gstring, cmp_it->from + i));
 	  cmp_it->nbytes += CHAR_BYTES (c);
 	  cmp_it->width += CHAR_WIDTH (c);
 	}
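
The effect of this one-line change can be modelled outside Emacs.  The
sketch below is a simplification under stated assumptions: a plain
array of code points stands in for the gstring, and char_bytes,
cluster_bytes_old and cluster_bytes_new are hypothetical names, not
functions in composite.c.

```c
/* UTF-8 length of code point C; a stand-in for CHAR_BYTES, assuming
   C <= 0x10FFFF.  */
int
char_bytes (unsigned c)
{
  if (c < 0x80)
    return 1;
  if (c < 0x800)
    return 2;
  if (c < 0x10000)
    return 3;
  return 4;
}

/* Old loop: always indexes from the start of CHARS, so FROM is
   ignored and the wrong characters are measured when FROM > 0.  */
int
cluster_bytes_old (const unsigned *chars, int from, int nchars)
{
  int i, nbytes = 0;

  (void) from;
  for (i = nchars - 1; i >= 0; i--)
    nbytes += char_bytes (chars[i]);
  return nbytes;
}

/* Patched loop: measures the NCHARS characters starting at FROM.  */
int
cluster_bytes_new (const unsigned *chars, int from, int nchars)
{
  int i, nbytes = 0;

  for (i = nchars - 1; i >= 0; i--)
    nbytes += char_bytes (chars[from + i]);
  return nbytes;
}
```

For the sequence 'a', U+0308 with from = 1 and nchars = 1, the old
loop measures 'a' again and yields 1 byte, while the patched loop
measures U+0308 and yields 2.  With from = 0 both agree.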


-6--------------------------------------------------------------------
From: nisse@lysator.liu.se (Niels Möller)
To: handa@gnu.org (K. Handa)
Cc: Eli Zaretskii <eliz@gnu.org>
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

handa@gnu.org (K. Handa) writes:

> With Emacs 24.3, I could reproduce the bug and the patch
> attached at the tail seems to fix it.  Could you please try
> it?  It is applicable to the latest code too.

I compiled 24.3.1 with the patch applied. It no longer crashes. Great!

Behavior is that on saving the attachment, the default filename is
displayed as "Brev aktiea?gar 131127.pdf", where the question mark
really is a COMBINING DIAERESIS (according to C-u C-x =). When I press
enter, the file is saved under the file name "Brev aktiea gar
131127.pdf", with the combining diaeresis replaced by a SPC character
(checked with GNU ls -N | od -tx1c).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

-7---------------------------------------------------------------------
From: handa@gnu.org (K. Handa)
To: nisse@lysator.liu.se (Niels Möller)
Cc: eliz@gnu.org, handa@gnu.org
Subject: Re: bug#15984: 24.3;	Problem with combining characters in attachment filename

In article <nn4n64m18i.fsf@bacon.lysator.liu.se>, nisse@lysator.liu.se (Niels Möller) writes:

> handa@gnu.org (K. Handa) writes:
> > With Emacs 24.3, I could reproduce the bug and the patch
> > attached at the tail seems to fix it.  Could you please try
> > it?  It is applicable to the latest code too.

> I compiled 24.3.1 with the patch applied. It no longer crashes. Great!

Thank you for testing that.

> Behavior is that on saving the attachment, the default filename is
> displayed as "Brev aktiea?gar 131127.pdf", where the question mark
> really is a COMBINING DIAERESIS (according to C-u C-x =). When I press
> enter, the file is saved under the file name "Brev aktiea gar
> 131127.pdf", with the combining diaeresis replaced by a SPC character
> (checked with GNU ls -N | od -tx1c).

This is just my guess, but as long as you are in an ISO-8859-1
locale, there's no way to encode that combining diaeresis,
so gnus uses SPC as a replacement character.
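
As an illustration of that guess only (this is not the code Gnus or
Emacs actually runs, and encode_latin1 is a hypothetical name), a
lossy conversion to ISO-8859-1 could look like the sketch below.  It
relies on the fact that Latin-1 covers exactly U+0000..U+00FF, so
U+0308 has no Latin-1 byte while the precomposed U+00E4 does.

```c
#include <stddef.h>

/* Encode LEN code points from SRC into DST as ISO-8859-1.  Code
   points above U+00FF cannot be represented and are replaced by
   REPLACEMENT.  DST must have room for LEN bytes.  Returns the
   number of code points that had to be replaced.  */
size_t
encode_latin1 (const unsigned *src, size_t len,
               unsigned char *dst, unsigned char replacement)
{
  size_t i, replaced = 0;

  for (i = 0; i < len; i++)
    {
      if (src[i] <= 0xFF)
        dst[i] = (unsigned char) src[i];
      else
        {
          dst[i] = replacement;
          replaced++;
        }
    }
  return replaced;
}
```

Encoding 'a', U+0308, 'g' with SPC as the replacement gives "a g",
which matches the file name observed above; encoding the precomposed
U+00E4 followed by 'g' loses nothing.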

Perhaps, gnus should warn you about that and ask you how to
encode the file name.

Anyway, that is a completely different matter from bug#15984.

I'll keep trying to find why the trunk doesn't crash with
your recipe, and once I find the whole story, I'll install a
proper patch (which may be the same as what I sent) to the
trunk.

---
Kenichi Handa
handa@gnu.org






Thread overview: 21+ messages
2013-11-28  8:08 bug#15984: 24.3; Problem with combining characters in attachment filename Niels Möller
2013-11-28 20:25 ` Eli Zaretskii
2013-11-28 22:17   ` Niels Möller
2013-11-28 22:46     ` Niels Möller
2013-11-29  7:16     ` Eli Zaretskii
2013-11-29  8:49       ` Niels Möller
2013-11-29  9:00         ` Eli Zaretskii
2013-11-29 10:43           ` Niels Möller
2013-11-29 11:26             ` Eli Zaretskii
2013-11-29 12:41               ` Niels Möller
2013-11-29 14:50                 ` Eli Zaretskii
2013-11-29 16:18                 ` Eli Zaretskii
2013-11-30 13:20                   ` Eli Zaretskii
2013-11-30 14:25                     ` Kenichi Handa
2013-11-30 16:09                       ` Eli Zaretskii
2013-11-30 15:50                     ` Niels Möller
2013-11-29 15:04             ` Stefan Monnier
2013-11-29 15:27               ` Eli Zaretskii
2013-11-30  8:53               ` Niels Möller
2013-11-29 13:11   ` Kenichi Handa
     [not found] ` <87eh574qmm.fsf@gnu.org>
2014-01-17 13:30   ` K. Handa
