unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#39506: patch
@ 2020-02-08  0:40 dick
  2020-02-08 15:10 ` dick.r.chiang
  0 siblings, 1 reply; 7+ messages in thread
From: dick @ 2020-02-08  0:40 UTC (permalink / raw)
  To: 39506

[-- Attachment #1: patch --]
[-- Type: text/x-diff, Size: 3637 bytes --]

From 2674b6a08a90b9a97d3adf2b3e4497b61880e173 Mon Sep 17 00:00:00 2001
From: dickmao <none>
Date: Fri, 7 Feb 2020 19:33:13 -0500
Subject: [PATCH] Question assumption that mm source buffer is unibyte

In my Gnus experience, the source buffer that mm-shr acts upon can be
multibyte.

* lisp/gnus/mm-decode.el (mm-with-part): propagate multibyte/unibyte setting
from source buffer to target.
* test/lisp/gnus/mm-decode-tests.el (test-mm-decode-multibyte): add test
---
 lisp/gnus/mm-decode.el            |  8 ++-----
 test/lisp/gnus/mm-decode-tests.el | 39 +++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 6 deletions(-)
 create mode 100644 test/lisp/gnus/mm-decode-tests.el

diff --git a/lisp/gnus/mm-decode.el b/lisp/gnus/mm-decode.el
index d33bb56dc9..19a18b4f45 100644
--- a/lisp/gnus/mm-decode.el
+++ b/lisp/gnus/mm-decode.el
@@ -1255,16 +1255,12 @@ mm-handle-displayed-p
 
 (defmacro mm-with-part (handle &rest forms)
   "Run FORMS in the temp buffer containing the contents of HANDLE."
-  ;; The handle-buffer's content is a sequence of bytes, not a sequence of
-  ;; chars, so the buffer should be unibyte.  It may happen that the
-  ;; handle-buffer is multibyte for some reason, in which case now is a good
-  ;; time to adjust it, since we know at this point that it should
-  ;; be unibyte.
   `(let* ((handle ,handle))
      (when (and (mm-handle-buffer handle)
 		(buffer-name (mm-handle-buffer handle)))
        (with-temp-buffer
-	 (mm-disable-multibyte)
+         (set-buffer-multibyte (buffer-local-value 'enable-multibyte-characters
+                                                   (mm-handle-buffer handle)))
 	 (insert-buffer-substring (mm-handle-buffer handle))
 	 (mm-decode-content-transfer-encoding
 	  (mm-handle-encoding handle)
diff --git a/test/lisp/gnus/mm-decode-tests.el b/test/lisp/gnus/mm-decode-tests.el
new file mode 100644
index 0000000000..8a9d471b74
--- /dev/null
+++ b/test/lisp/gnus/mm-decode-tests.el
@@ -0,0 +1,39 @@
+;;; mm-decode-tests.el --- tests for gnus/mm-decode.el    -*- lexical-binding:t -*-
+
+;; Copyright (C) 2019-2020 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+(require 'mm-decode)
+
+(ert-deftest test-mm-decode-multibyte ()
+  (should
+   (or (not (fboundp 'libxml-parse-html-region))
+       (with-temp-buffer
+         (set-buffer-multibyte t)
+         (save-excursion
+           (insert "<p>最近也想尝试,但是感觉蛮难的,比如不知道如何在"))
+         (let ((handle (mm-make-handle
+                        (current-buffer)
+                        (rfc2231-parse-qp-string
+                         "Content-Type: text/html; charset=UTF-8"))))
+           (not (zerop (length (with-temp-buffer (mm-shr handle)
+                                                 (buffer-string))))))))))
+
+;;; mm-decode-tests.el ends here
-- 
2.24.1






^ permalink raw reply related	[flat|nested] 7+ messages in thread

* bug#39506: patch
  2020-02-08  0:40 bug#39506: patch dick
@ 2020-02-08 15:10 ` dick.r.chiang
       [not found]   ` <87v9ohdr21.fsf@dick>
  0 siblings, 1 reply; 7+ messages in thread
From: dick.r.chiang @ 2020-02-08 15:10 UTC (permalink / raw)
  To: 39506, smonnier

In the changelog message of commit d4eb2b7, you claim to "set the buffer to unibyte
before inserting" but it's not clear where.  Then in a later comment, you say
"may happen that the handle-buffer is multibyte... in which case now is a good
time to adjust it, since we know ... it should be unibyte."  The revision before that
preserves the multibyte-p setting, so I do the same in this patch.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#39506: patch
       [not found]   ` <87v9ohdr21.fsf@dick>
@ 2020-02-08 18:32     ` Stefan Monnier
  2020-02-08 19:01       ` dick.r.chiang
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Monnier @ 2020-02-08 18:32 UTC (permalink / raw)
  To: dick.r.chiang; +Cc: 39506

> In the changelog message of commit d4eb2b7, you claim to "set the
> buffer to unibyte before inserting" but it's not clear where.

The focus of the sentence is on "before": the previous code already set
the buffer to unibyte, but it did it afterwards.

> Then in a later comment, you say "may happen that the handle-buffer is
> multibyte... in which case now is a good time to adjust it, since we
> know ... it should be unibyte."  The revision before that preserves
> the multibyte-p setting, so I do the same in this patch.

As the motivation for your patch, you write:
> In my Gnus experience, the source buffer that mm-shr acts upon can be
> multibyte.

Two questions:
- How does `mm-with-part` relate to `mm-shr`?
- Before deciding whether unibyte or multibyte is the right choice, the
  main question is whether the buffer contains bytes or chars.
  AFAIK `mm-with-part` should only ever handle bytes (otherwise calling
  `mm-decode-content-transfer-encoding` doesn't make much sense).
  Your patch suggests you have a use case where this is not true.
  Can you give more details about this case?
  Maybe a backtrace showing how we got to this particular call to
  `mm-with-part`?


-- Stefan






^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#39506: patch
  2020-02-08 18:32     ` Stefan Monnier
@ 2020-02-08 19:01       ` dick.r.chiang
  2020-02-08 19:51         ` Stefan Monnier
  0 siblings, 1 reply; 7+ messages in thread
From: dick.r.chiang @ 2020-02-08 19:01 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 39506

> The focus of the sentence is on "before": the previous code already set the
> buffer to unibyte, but it did it afterwards.

Ah, so the buffer in question is merely the working temp-buffer, not the
handle-buffer.

Indeed, the previous code did it afterwards, and therein lies my salvation
because insert-buffer-substring of multibytes to a unibyte buffer corrupts.
I think the Miles Bader code replicated the multibyteness of the
handle-buffer to the temp-buffer, called insert-buffer-substring, and then
converted the temp-buffer to unibyte.

> - How does `mm-with-part` relate to `mm-shr`?

mm-shr calls mm-with-part.

> - Before deciding whether unibyte or multibyte is the right choice, the
>   main question is whether the buffer contains bytes or chars.

My buffer contained some Chinese multibytes.  You can see my unit test in the
patch.

>   AFAIK `mm-with-part` should only ever handle bytes (otherwise calling
>   `mm-decode-content-transfer-encoding` doesn't make much sense).

Okay, maybe mm-decode-content-transfer-encoding is a noop when it doesn't make
sense.

I'm not completely on top of all this, but my individual use case rather
prefers the old Miles Bader order of ops.  I can easily work around it if you feel
your 2008 change genuinely fixed something that was broken (my patch is largely
speculative and inconsiderate of broader ramifications).





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#39506: patch
  2020-02-08 19:01       ` dick.r.chiang
@ 2020-02-08 19:51         ` Stefan Monnier
  2020-02-20 13:24           ` Lars Ingebrigtsen
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Monnier @ 2020-02-08 19:51 UTC (permalink / raw)
  To: dick.r.chiang; +Cc: 39506

>> - Before deciding whether unibyte or multibyte is the right choice, the
>>   main question is whether the buffer contains bytes or chars.
> My buffer contained some Chinese multibytes.

That suggests it contains characters rather than bytes.  How did
that happen?  Where does this buffer ('s contents) come from?

> You can see my unit test in the patch.

In your unit test, you artificially create a multibyte buffer with
chinese chars, so that doesn't answer my question ;-)

AFAIK `mm-with-part` is designed for MIME parts and MIME parts can only
contain bytes at that point.  Only after we extract them as bytes and
apply `mm-decode-content-transfer-encoding` to it can we consider
decoding those bytes into chars.

So I suspect that the source of your problem is earlier, where some code
incorrectly decodes some content too early.  Hence the need to better
understand where those chinese chars come from.


        Stefan






^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#39506: patch
  2020-02-08 19:51         ` Stefan Monnier
@ 2020-02-20 13:24           ` Lars Ingebrigtsen
  2020-02-21 14:34             ` dick.r.chiang
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Ingebrigtsen @ 2020-02-20 13:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 39506, dick.r.chiang

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> So I suspect that the source of your problem is earlier, where some code
> incorrectly decodes some content too early.  Hence the need to better
> understand where those chinese chars come from.

Yes, I think that's correct -- the mm handle buffers should contain
bytes only, so it would be interesting to know how an mm handle buffer
ended up containing characters.

Dick, do you have a way to reproduce that bug?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#39506: patch
  2020-02-20 13:24           ` Lars Ingebrigtsen
@ 2020-02-21 14:34             ` dick.r.chiang
  0 siblings, 0 replies; 7+ messages in thread
From: dick.r.chiang @ 2020-02-21 14:34 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 39506, Stefan Monnier

[-- Attachment #1: Type: text/html, Size: 94 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-02-21 14:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-08  0:40 bug#39506: patch dick
2020-02-08 15:10 ` dick.r.chiang
     [not found]   ` <87v9ohdr21.fsf@dick>
2020-02-08 18:32     ` Stefan Monnier
2020-02-08 19:01       ` dick.r.chiang
2020-02-08 19:51         ` Stefan Monnier
2020-02-20 13:24           ` Lars Ingebrigtsen
2020-02-21 14:34             ` dick.r.chiang

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).