bug#21055: Info reader fails to follow xrefs to anchors

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#21055: Info reader fails to follow xrefs to anchors
       [not found]   ` <87twt7x18d.fsf@gnu.org>
@ 2015-07-14 14:57     ` Eli Zaretskii
  2015-07-14 23:16       ` Juri Linkov
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2015-07-14 14:57 UTC (permalink / raw)
  To: Ludovic Courtès, Juri Linkov; +Cc: 21055

Redirecting the Emacs part to bug-gnu-emacs; see
http://lists.gnu.org/archive/html/bug-texinfo/2015-07/msg00051.html
for the related Texinfo discussion.

CC to Juri, who made the offending change.

> From: ludo@gnu.org (Ludovic Courtès)
> Cc: bug-texinfo@gnu.org
> Date: Mon, 13 Jul 2015 22:16:02 +0200
> 
> The standalone Info reader in Texinfo 6.0 fails to follow
> cross-references to anchors: Following such a link leads to an unrelated
> place in the document.  This is a regression compared to Texinfo 5.2
> (guix.texi is one example that illustrates the bug.)
> 
> Unfortunately the Emacs Info reader has had the same problem for a long
> time, but I suppose this one should go to bug-emacs?
> 
> That’s with 24.5.1, and I remember experience that with earlier
> versions too.

There are two issues here.  One is that Emacs 24.4 introduced a
change, as part of fixing bug #14125 (see
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=14125), which caused the
Emacs Info reader to go to the wrong place when it follows
cross-references to anchors (as opposed to references to nodes).  The
other problem is the generous use of UTF-8 encoded characters in
guix.info, including in the preamble, which makes Emacs's job even
harder, because references in Info files are given in bytes, not
characters.

The second problem needs an infrastructure, part of which was
introduced only recently: how to convert a file byte offset to an
Emacs buffer position (which counts characters), accounting correctly
for the file's encoding and EOL format.  It sounds like we would need
the reverse conversion for fixing this present problem, see below.

As for the first part: I've read the discussions in bug #14125, and
tried playing with the test file provided there, and I must say that I
understand neither the problem nor its solution.  The analysis of the
problem (see http://debbugs.gnu.org/cgi/bugreport.cgi?bug=14125#11)
was this:

> Makeinfo 4.13 produced the character positions of indirect subfiles
> relative to the beginning of the first node, but Makeinfo 5.0 produces the
> positions relative to the beginning of the subfile.  The Emacs Info reader
> fails when the distance between the beginning of the subfile and
> the beginning of its first node is longer than a thousand characters.
> [...]
> The expression (+ (- nodepos lastfilepos) (point)) in `Info-read-subfile'
> assumes that `lastfilepos' in `Info-read-subfile' is the beginning of the
> first node, so for Info files produced by Makeinfo 4.13 it returns the
> length of the summary segment, but for Makeinfo 5.0 it returns
> two lengths of the summary segment.

Perhaps I don't understand what this says, but the conclusion sounds
incorrect to me.

The actual difference between makeinfo 4.13 and makeinfo 5.0 and later
is that with makeinfo 5 the starting position of the 2nd, 3rd,
etc. subfile includes the length of the preamble text that precedes
the first node in the subfile.  In makeinfo 4, only the beginning of
the first subfile included the preamble, and all the rest excluded it.

But that doesn't matter, IMO, because with both versions of makeinfo,
if a subfile's beginning is recorded in the tag table as byte position
N, the first node in that subfile is also recorded to start at byte
position N.  Therefore, to find the byte offset of a node/anchor from
the beginning of a subfile, one needs to do this:

   (+ (- nodepos lastfilepos) preamble-length)

in both the old and the new versions.  To find the length of the
preamble, one needs to search from the beginning of the subfile for
the start of the first node, and then compute the file's byte number
of that position.  Therefore, the original code in Info-read-subfile,
viz.:

  (+ (- nodepos lastfilepos) (point))

was an approximation that did TRT for ASCII Info files.  It is easy to
extend this to UTF-8 encoded files:

  (+ (- nodepos lastfilepos) (position-bytes (point)))

Other encodings, as well as DOS end-of-line format, will need a
dedicated function similar to filepos-to-bufferpos, but in the
reverse direction.  (We also need to subtract 1 from the above
expression, since we need a zero-based offset.)

Juri, do you see any flaws in the above description?  I couldn't
reproduce the problem reported in bug #14125, so I'm not sure why the
fix you installed was even needed, or where my reasoning is wrong.  I
tried both Emacs 24.3 (for which the bug was filed) and later
versions, and they all work correctly with the Info file produced from
the Texinfo source attached to that bug report, no matter if I produce
the Info file with makeinfo 4.13 or makeinfo 5.1 or 6.0.  So I'm
unsure what problems you saw with the original code in
Info-read-subfile.  Could you describe those problems in more detail
than you did in the bug discussions?

Why are these problems invisible when following references to nodes,
you ask?  Because in that case we search for the node's header line
after going to the recorded position.  So going to a position that
undershoots (which is what that change caused) doesn't do any visible
harm.  But for references to anchors, we don't have any text to
search, so the position where we place the reader should be reasonably
exact.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* bug#21055: Info reader fails to follow xrefs to anchors
  2015-07-14 14:57     ` bug#21055: Info reader fails to follow xrefs to anchors Eli Zaretskii
@ 2015-07-14 23:16       ` Juri Linkov
  2015-07-15 15:09         ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Juri Linkov @ 2015-07-14 23:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ludo, 21055

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]

> Juri, do you see any flaws in the above description?  I couldn't
> reproduce the problem reported in bug #14125, so I'm not sure why the
> fix you installed was even needed, or where my reasoning is wrong.  I
> tried both Emacs 24.3 (for which the bug was filed) and later
> versions, and they all work correctly with the Info file produced from
> the Texinfo source attached to that bug report, no matter if I produce
> the Info file with makeinfo 4.13 or makeinfo 5.1 or 6.0.  So I'm
> unsure what problems you saw with the original code in
> Info-read-subfile.  Could you describe those problems in more detail
> than you did in the bug discussions?

I'm attaching here all the files that I used to fix bug#14125,
so you could compare the output of different makeinfo versions
and see the problem.  The command line used to translate
Texinfo files was: makeinfo --split-size=2000 test.texi


[-- Attachment #2: bug_14125.zip --]
[-- Type: application/zip, Size: 7762 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* bug#21055: Info reader fails to follow xrefs to anchors
  2015-07-14 23:16       ` Juri Linkov
@ 2015-07-15 15:09         ` Eli Zaretskii
  2015-07-16 22:49           ` Juri Linkov
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2015-07-15 15:09 UTC (permalink / raw)
  To: Juri Linkov; +Cc: ludo, 21055

> From: Juri Linkov <juri@jurta.org>
> Cc: ludo@gnu.org (Ludovic Courtès),
>   bug-gnu-emacs@gnu.org
> Date: Wed, 15 Jul 2015 02:16:32 +0300
> 
> I'm attaching here all the files that I used to fix bug#14125,
> so you could compare the output of different makeinfo versions
> and see the problem.  The command line used to translate
> Texinfo files was: makeinfo --split-size=2000 test.texi

Thanks.

I see the problem now.  It only happened in makeinfo 5.0 and 5.1, and
is fixed since 5.2.  Furthermore, it only rears its ugly head if the
Texinfo source has an @ifnottex block before the Top node; any other
blurbs usually put there, like @copying, @direntry, etc. -- don't
trigger the problem even in those 2 versions of makeinfo.  Moreover,
when this problem happens, it only affects the 1st subfile; the rest
have their offsets set correctly.  So it's a pretty rare combination
of conditions.

Therefore, I think we should fix the anchor use case by making the
value returned from Info-read-subfile as accurate as possible, and
then cater to the problematic output of makeinfo 5.0 and 5.1 by
attempting another search for a node with a larger slop value.

So any objections to the patch below?  It introduces a new
infrastructure, and then uses it to get the file byte offset
corresponding to the first node on a subfile.

--- lisp/international/mule-util.el~0	2015-06-21 06:45:33.000000000 +0300
+++ lisp/international/mule-util.el	2015-07-15 18:00:57.053036400 +0300
@@ -412,6 +412,79 @@
                        (decode-coding-region (point-min)
                                              (min (point-max) (+ pm byte))
                                              coding-system t))))))))))))
+;;;###autoload
+(defun bufferpos-to-filepos (position &optional quality coding-system)
+  "Try to return the file byte corresponding to a particular buffer POSITION.
+Value is the file position given as a (0-based) byte count.
+The function presumes the file is encoded with CODING-SYSTEM, which defaults
+to `buffer-file-coding-system'.
+QUALITY can be:
+  `approximate', in which case we may cut some corners to avoid
+    excessive work.
+  `exact', in which case we may end up re-(en/de)coding a large
+    part of the file/buffer.
+  nil, in which case we may return nil rather than an approximation."
+  (unless coding-system (setq coding-system buffer-file-coding-system))
+  (let* ((eol (coding-system-eol-type coding-system))
+         (lineno (if (= eol 1) (1- (line-number-at-pos position)) 0))
+         (type (coding-system-type coding-system))
+         (base (coding-system-base coding-system))
+         byte)
+    (and (eq type 'utf-8)
+         ;; Any post-read/pre-write conversions mean it's not really UTF-8.
+         (not (null (coding-system-get coding-system :post-read-conversion)))
+         (setq type 'not-utf-8))
+    (and (memq type '(charset raw-text undecided))
+         ;; The following are all of type 'charset', but they are
+         ;; actually variable-width encodings.
+         (not (memq base '(chinese-gbk chinese-gb18030 euc-tw euc-jis-2004
+                                       korean-iso-8bit chinese-iso-8bit
+                                       japanese-iso-8bit chinese-big5-hkscs
+                                       japanese-cp932 korean-cp949)))
+         (setq type 'single-byte))
+    (pcase type
+      (`utf-8
+       (setq byte (position-bytes position))
+       (when (null byte)
+         (if (<= position 0)
+             (setq byte 1)
+           (setq byte (position-bytes (point-max)))))
+       (setq byte (1- byte))
+       (+ byte
+          ;; Account for BOM, if any.
+          (if (coding-system-get coding-system :bom) 3 0)
+          ;; Account for CR in CRLF pairs.
+          lineno))
+      (`single-byte
+       (+ position -1 lineno))
+      ((and `utf-16
+            ;; FIXME: For utf-16, we could use the same approach as used for
+            ;; dos EOLs (counting the number of non-BMP chars instead of the
+            ;; number of lines).
+            (guard (not (eq quality 'exact))))
+       ;; In approximate mode, assume all characters are within the
+       ;; BMP, i.e. each one takes up 2 bytes.
+       (+ (* (1- position) 2)
+          ;; Account for BOM, if any.
+          (if (coding-system-get coding-system :bom) 2 0)
+          ;; Account for CR in CRLF pairs.
+          lineno))
+      (_
+       (pcase quality
+         (`approximate (+ (position-bytes position) -1 lineno))
+         (`exact
+          ;; Rather than assume that the file exists and still holds the right
+          ;; data, we reconstruct its relevant portion.
+          (let ((buf (current-buffer)))
+            (with-temp-buffer
+              (set-buffer-multibyte nil)
+              (let ((tmp-buf (current-buffer)))
+                (with-current-buffer buf
+                  (save-restriction
+                    (widen)
+                    (encode-coding-region (point-min) (min (point-max) position)
+                                          coding-system tmp-buf)))
+                (1- (point-max)))))))))))
 \f
 (provide 'mule-util)
 
--- lisp/info.el~0	2015-06-16 10:34:22.000000000 +0300
+++ lisp/info.el	2015-07-15 18:08:58.585385400 +0300
@@ -1217,6 +1217,18 @@
 		  (goto-char pos)
 		  (throw 'foo t)))
 
+              ;; If the Texinfo source had an @ifnottex block of text
+              ;; before the Top node, makeinfo 5.0 and 5.1 mistakenly
+              ;; omitted that block's size from the starting position
+              ;; of the 1st subfile, which makes GUESSPOS overshoot
+              ;; the correct position by the length of that text.  So
+              ;; we try again with a larger slop.
+              (goto-char (max (point-min) (- guesspos 10000)))
+	      (let ((pos (Info-find-node-in-buffer regexp strict-case)))
+		(when pos
+		  (goto-char pos)
+		  (throw 'foo t)))
+
               (when (string-match "\\([^.]+\\)\\." nodename)
                 (let (Info-point-loc)
                   (Info-find-node-2
@@ -1553,10 +1565,13 @@
     (if (looking-at "\^_")
 	(forward-char 1)
       (search-forward "\n\^_"))
-    ;; Don't add the length of the skipped summary segment to
-    ;; the value returned to `Info-find-node-2'.  (Bug#14125)
     (if (numberp nodepos)
-	(- nodepos lastfilepos))))
+        ;; Our caller ('Info-find-node-2') wants the (zero-based) byte
+        ;; offset corresponding to NODEPOS, from the beginning of the
+        ;; subfile.  This is especially important if NODEPOS is for an
+        ;; anchor reference, because for those the position is all we
+        ;; have.
+	(+ (- nodepos lastfilepos) (bufferpos-to-filepos (point) 'exact)))))
 
 (defun Info-unescape-quotes (value)
   "Unescape double quotes and backslashes in VALUE."






^ permalink raw reply	[flat|nested] 5+ messages in thread

* bug#21055: Info reader fails to follow xrefs to anchors
  2015-07-15 15:09         ` Eli Zaretskii
@ 2015-07-16 22:49           ` Juri Linkov
  2015-07-18 10:24             ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Juri Linkov @ 2015-07-16 22:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ludo, 21055

> So any objections to the patch below?  It introduces a new
> infrastructure, and then uses it to get the file byte offset
> corresponding to the first node on a subfile.

I tested your patch on the output of makeinfo versions 4.13, 5.0
and the newest version 6.0, and it works flawlessly.





^ permalink raw reply	[flat|nested] 5+ messages in thread

* bug#21055: Info reader fails to follow xrefs to anchors
  2015-07-16 22:49           ` Juri Linkov
@ 2015-07-18 10:24             ` Eli Zaretskii
  0 siblings, 0 replies; 5+ messages in thread
From: Eli Zaretskii @ 2015-07-18 10:24 UTC (permalink / raw)
  To: Juri Linkov; +Cc: ludo, 21055-done

> From: Juri Linkov <juri@jurta.org>
> Cc: ludo@gnu.org,  21055@debbugs.gnu.org
> Date: Fri, 17 Jul 2015 01:49:19 +0300
> 
> > So any objections to the patch below?  It introduces a new
> > infrastructure, and then uses it to get the file byte offset
> > corresponding to the first node on a subfile.
> 
> I tested your patch on the output of makeinfo versions 4.13, 5.0
> and the newest version 6.0, and it works flawlessly.

Thanks, pushed to master.





^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-07-18 10:24 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87615o2l0e.fsf@gnu.org>
     [not found] ` <83h9p884w2.fsf@gnu.org>
     [not found]   ` <87twt7x18d.fsf@gnu.org>
2015-07-14 14:57     ` bug#21055: Info reader fails to follow xrefs to anchors Eli Zaretskii
2015-07-14 23:16       ` Juri Linkov
2015-07-15 15:09         ` Eli Zaretskii
2015-07-16 22:49           ` Juri Linkov
2015-07-18 10:24             ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).