* bug#39506: patch @ 2020-02-08 0:40 dick 2020-02-08 15:10 ` dick.r.chiang 0 siblings, 1 reply; 7+ messages in thread From: dick @ 2020-02-08 0:40 UTC (permalink / raw) To: 39506 [-- Attachment #1: patch --] [-- Type: text/x-diff, Size: 3637 bytes --] From 2674b6a08a90b9a97d3adf2b3e4497b61880e173 Mon Sep 17 00:00:00 2001 From: dickmao <none> Date: Fri, 7 Feb 2020 19:33:13 -0500 Subject: [PATCH] Question assumption that mm source buffer is unibyte In my Gnus experience, the source buffer that mm-shr acts upon can be multibyte. * lisp/gnus/mm-decode.el (mm-with-part): propagate multibyte/unibyte setting from source buffer to target. * test/lisp/gnus/mm-decode-tests.el (test-mm-decode-multibyte): add test --- lisp/gnus/mm-decode.el | 8 ++----- test/lisp/gnus/mm-decode-tests.el | 39 +++++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+), 6 deletions(-) create mode 100644 test/lisp/gnus/mm-decode-tests.el diff --git a/lisp/gnus/mm-decode.el b/lisp/gnus/mm-decode.el index d33bb56dc9..19a18b4f45 100644 --- a/lisp/gnus/mm-decode.el +++ b/lisp/gnus/mm-decode.el @@ -1255,16 +1255,12 @@ mm-handle-displayed-p (defmacro mm-with-part (handle &rest forms) "Run FORMS in the temp buffer containing the contents of HANDLE." - ;; The handle-buffer's content is a sequence of bytes, not a sequence of - ;; chars, so the buffer should be unibyte. It may happen that the - ;; handle-buffer is multibyte for some reason, in which case now is a good - ;; time to adjust it, since we know at this point that it should - ;; be unibyte. `(let* ((handle ,handle)) (when (and (mm-handle-buffer handle) (buffer-name (mm-handle-buffer handle))) (with-temp-buffer - (mm-disable-multibyte) + (set-buffer-multibyte (buffer-local-value 'enable-multibyte-characters + (mm-handle-buffer handle))) (insert-buffer-substring (mm-handle-buffer handle)) (mm-decode-content-transfer-encoding (mm-handle-encoding handle) diff --git a/test/lisp/gnus/mm-decode-tests.el b/test/lisp/gnus/mm-decode-tests.el new file mode 100644 index 0000000000..8a9d471b74 --- /dev/null +++ b/test/lisp/gnus/mm-decode-tests.el @@ -0,0 +1,39 @@ +;;; mm-decode-tests.el --- tests for gnus/mm-decode.el -*- lexical-binding:t -*- + +;; Copyright (C) 2019-2020 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>. + +;;; Code: + +(require 'ert) +(require 'mm-decode) + +(ert-deftest test-mm-decode-multibyte () + (should + (or (not (fboundp 'libxml-parse-html-region)) + (with-temp-buffer + (set-buffer-multibyte t) + (save-excursion + (insert "<p>最近也想尝试,但是感觉蛮难的,比如不知道如何在")) + (let ((handle (mm-make-handle + (current-buffer) + (rfc2231-parse-qp-string + "Content-Type: text/html; charset=UTF-8")))) + (not (zerop (length (with-temp-buffer (mm-shr handle) + (buffer-string)))))))))) + +;;; mm-decode-tests.el ends here -- 2.24.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* bug#39506: patch 2020-02-08 0:40 bug#39506: patch dick @ 2020-02-08 15:10 ` dick.r.chiang [not found] ` <87v9ohdr21.fsf@dick> 0 siblings, 1 reply; 7+ messages in thread From: dick.r.chiang @ 2020-02-08 15:10 UTC (permalink / raw) To: 39506, smonnier In the changelog message of commit d4eb2b7, you claim to "set the buffer to unibyte before inserting" but it's not clear where. Then in a later comment, you say "may happen that the handle-buffer is multibyte... in which case now is a good time to adjust it, since we know ... it should be unibyte." The revision before that preserves the multibyte-p setting, so I do the same in this patch. ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <87v9ohdr21.fsf@dick>]
* bug#39506: patch [not found] ` <87v9ohdr21.fsf@dick> @ 2020-02-08 18:32 ` Stefan Monnier 2020-02-08 19:01 ` dick.r.chiang 0 siblings, 1 reply; 7+ messages in thread From: Stefan Monnier @ 2020-02-08 18:32 UTC (permalink / raw) To: dick.r.chiang; +Cc: 39506 > In the changelog message of commit d4eb2b7, you claim to "set the > buffer to unibyte before inserting" but it's not clear where. The focus of the sentence is on "before": the previous code already set the buffer to unibyte, but it did it afterwards. > Then in a later comment, you say "may happen that the handle-buffer is > multibyte... in which case now is a good time to adjust it, since we > know ... it should be unibyte." The revision before that preserves > the multibyte-p setting, so I do the same in this patch. As the motivation for your patch, you write: > In my Gnus experience, the source buffer that mm-shr acts upon can be > multibyte. Two questions: - How does `mm-with-part` relate to `mm-shr`? - Before deciding whether unibyte or multibyte is the right choice, the main question is whether the buffer contains bytes or chars. AFAIK `mm-with-part` should only ever handle bytes (otherwise calling `mm-decode-content-transfer-encoding` doesn't make much sense). Your patch suggests you have a use case where this is not true. Can you give more details about this case? Maybe a backtrace showing how we got to this particular call to `mm-with-part`? -- Stefan ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#39506: patch 2020-02-08 18:32 ` Stefan Monnier @ 2020-02-08 19:01 ` dick.r.chiang 2020-02-08 19:51 ` Stefan Monnier 0 siblings, 1 reply; 7+ messages in thread From: dick.r.chiang @ 2020-02-08 19:01 UTC (permalink / raw) To: Stefan Monnier; +Cc: 39506 > The focus of the sentence is on "before": the previous code already set the > buffer to unibyte, but it did it afterwards. Ah, so the buffer in question is merely the working temp-buffer, not the handle-buffer. Indeed, the previous code did it afterwards, and therein lies my salvation because insert-buffer-substring of multibytes to a unibyte buffer corrupts. I think the Miles Bader code replicated the multibyteness of the handle-buffer to the temp-buffer, called insert-buffer-substring, and then converted the temp-buffer to unibyte. > - How does `mm-with-part` relate to `mm-shr`? mm-shr calls mm-with-part. > - Before deciding whether unibyte or multibyte is the right choice, the > main question is whether the buffer contains bytes or chars. My buffer contained some Chinese multibytes. You can see my unit test in the patch. > AFAIK `mm-with-part` should only ever handle bytes (otherwise calling > `mm-decode-content-transfer-encoding` doesn't make much sense). Okay, maybe mm-decode-content-transfer-encoding is a noop when it doesn't make sense. I'm not completely on top of all this, but my individual use case rather prefers the old Miles Bader order of ops. I can easily work around it if you feel your 2008 change genuinely fixed something that was broken (my patch is largely speculative and inconsiderate of broader ramifications). ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#39506: patch 2020-02-08 19:01 ` dick.r.chiang @ 2020-02-08 19:51 ` Stefan Monnier 2020-02-20 13:24 ` Lars Ingebrigtsen 0 siblings, 1 reply; 7+ messages in thread From: Stefan Monnier @ 2020-02-08 19:51 UTC (permalink / raw) To: dick.r.chiang; +Cc: 39506 >> - Before deciding whether unibyte or multibyte is the right choice, the >> main question is whether the buffer contains bytes or chars. > My buffer contained some Chinese multibytes. That suggests it contains characters rather than bytes. How did that happen? Where does this buffer ('s contents) come from? > You can see my unit test in the patch. In your unit test, you artificially create a multibyte buffer with chinese chars, so that doesn't answer my question ;-) AFAIK `mm-with-part` is designed for MIME parts and MIME parts can only contain bytes at that point. Only after we extract them as bytes and apply `mm-decode-content-transfer-encoding` to it can we consider decoding those bytes into chars. So I suspect that the source of your problem is earlier, where some code incorrectly decodes some content too early. Hence the need to better understand where those chinese chars come from. Stefan ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#39506: patch 2020-02-08 19:51 ` Stefan Monnier @ 2020-02-20 13:24 ` Lars Ingebrigtsen 2020-02-21 14:34 ` dick.r.chiang 0 siblings, 1 reply; 7+ messages in thread From: Lars Ingebrigtsen @ 2020-02-20 13:24 UTC (permalink / raw) To: Stefan Monnier; +Cc: 39506, dick.r.chiang Stefan Monnier <monnier@iro.umontreal.ca> writes: > So I suspect that the source of your problem is earlier, where some code > incorrectly decodes some content too early. Hence the need to better > understand where those chinese chars come from. Yes, I think that's correct -- the mm handle buffers should contain bytes only, so it would be interesting to know how an mm handle buffer ended up containing characters. Dick, do you have a way to reproduce that bug? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#39506: patch 2020-02-20 13:24 ` Lars Ingebrigtsen @ 2020-02-21 14:34 ` dick.r.chiang 0 siblings, 0 replies; 7+ messages in thread From: dick.r.chiang @ 2020-02-21 14:34 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 39506, Stefan Monnier [-- Attachment #1: Type: text/html, Size: 94 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-02-21 14:34 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-02-08 0:40 bug#39506: patch dick 2020-02-08 15:10 ` dick.r.chiang [not found] ` <87v9ohdr21.fsf@dick> 2020-02-08 18:32 ` Stefan Monnier 2020-02-08 19:01 ` dick.r.chiang 2020-02-08 19:51 ` Stefan Monnier 2020-02-20 13:24 ` Lars Ingebrigtsen 2020-02-21 14:34 ` dick.r.chiang
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).