unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: A Soare <alinsoar@voila.fr>
To: "Emacs   Dev  [emacs-devel]" <emacs-devel@gnu.org>
Subject: Embedding Html in Lisp
Date: Sun, 22 Jun 2008 22:56:15 +0200 (CEST)	[thread overview]
Message-ID: <6020102.8572091214168175976.JavaMail.www@wwinf4622> (raw)

[-- Attachment #1: Type: text/plain, Size: 7157 bytes --]

Bonjour à tous.



I have to prepare an exam and I needed a good access to a French
dictionary. I considered that the best choice is http://www.cnrtl.fr/,
a database of the "Centre national de la recherche scientifique". This
dictionary is provided in html format.

So I thought how to get access to a html page. I tried to define html.

I define html as « html is a programming language like any other
programming language ».

Being a language, it can be transformed in lisp and evaluated using
the lisp evaluator.

Apart a few special tags (like BR, HR, <tag expr />, etc), the grammar
of the html is identical ( via an isomorphism ) to the lisp grammar:

<tag expr> text </tag> <----> (tag (expr) text)

So I transformed html in lisp via this iso.

<table>
 <tr>
  <td>x</td>
  <td>y</td>
 </tr>
 <tr>
  <td>a</td>
  <td>b</td>
 </tr>
</table>

will become

(table (tr (td "x") (td "y"))
       (tr (td "a") (td "b")))

and this form will be evaluated after the rules of the emacs lisp
evaluator.

Now I have to give the tags' definitions.

I observe that the lisp evaluator will first call the functions `td'
with a string as parameter, and then will call the functions `tr' with
the output of `td' as parameter, and finally will call `table' with a
list of elements that are the ouputs from `tr' functions as parameter.

Hence html is a subset of lisp, in which the functions are the symbols
like table, div, input, center, I, B, et ctera.

So I did so:

1. I downloaded the page from CNRTL.

2. I filtered it to extract just the definition, and cut the unuseful
informations. (using a signal->filter->accumulate library (not
included in the example I send here))

3. I transformed the result using the grammar isomorphism

4. I evaluated the resulted lisp structure.

The result of the 4th step is the context of the filtered html
page (just the definition).

So that was my implementation of html.



Now if I look in the dictionary, to see the definition of html:

http://fr.wikipedia.org/wiki/Hypertext_Markup_Language

« L'Hypertext Markup Language, généralement abrégé HTML, est le
**format de données** conçu pour représenter les pages web. Il permet
notamment d'implanter de l'hypertexte dans le contenu des pages et
repose sur un **langage de balisage**, d’où son nom »

More, look at the definition of «langage de balisage»:

http://fr.wikipedia.org/wiki/Langage_de_balisage

« L'inclusion de balises permet de transférer à la fois la structure
du document et son contenu. Cette structure est compréhensible par un
**programme informatique** , ce qui autorise un affichage personnalisé
selon des règles pré-établies »

So, in the dictionaries, html is considered as a data structure, that
can be rendered by a program. Maybe the lisp evaluator is that program
in my code... and the data structure is a lisp structure obtained by
html->lisp...

When I first read this definition, it sounded me very strange.

I redefined html as a programming language «en tant que tel», as any
other programming language, neither as a «langage de balisage», nor as
«format de données».

In this new implementation of html the emacs lisp user will have all
the liberty to redefine the rendering of every html (= lisp) symbol as
one wishes. For example, if the user does not like the default
implementation of the tag `table', he will know that the tag `table'
is a lisp function that receives as parameter a list that contains
many list (rows), and every such list contains strings that are the
information of columns (depending of the implementation of TD and
TR). The function must return a string that is the image of the
table. More than that, one could attach to this tag everything
semantically, like a cond form in lisp, or progn. But in this case the
html standard would not be accomplished. In lisp a procedure must
always return an object (Lisp_Object) that can be of a few types. In
html we return an object that must be a string or a list, depending on
the html function (id est: we chop, like Edward Scissorhands, the
lisp evaluator and we obtain so a html evaluator).

When a table is inside another table (or any object inside another
object), the inner table will be evaluated first, then the outside
table. Maybe the outside object (function) will not like the
dimensions of the returned image of the inner object, and want to
rescale it. To solve this, during parsing one can add the quoted text
of every procedure after its definition, for example:

(table (tr (td "x") '(td "x") (td "y") '(td "y"))
       '(tr (td "x") '(td "x") (td "y") '(td "y"))
       (tr
         (td
           (table (tr (td "a") '(td "a") (td "b") '(td "b"))
                  (tr (td "x") '(td "x") (td "b") '(td "b")))
           '(table (tr (td "a") '(td "a") (td "b") '(td "b"))
                   (tr (td "x") '(td "x") (td "b") '(td "b"))))
         (td "o")))

and, in case the table "x y" will be able to rescale the inner table
"a b". This has the disavantage that the lisp code grows exponentially
with the html tree's deep. Another solution is to insert the
percentage of the current object during the transformation
html->lisp. For example:


(table (tr (td 50 "x") (td 50 "y"))
       (tr (td 40 "a") (td 60 "b")))

When the lisp evaluator will call every function, it will know exactly
how huch the width of that element must me.

There are many possibilities to solve this problem.


The transformation of structures from html to lisp in html->lisp is
very inefficient in my implementation (it's just for test) and have to
be embedded in C (with DEFUN) for a good speed.

For the rest, I will write in the near future for myself a few more
filters for a few French newspapers that I read dayly, and I will add
some more definitions like <table, <a href= , <img src= , <hr>,
etc. But I will not define all the tags of the html language; I will
define just that ones that I need for these pages.

The current definitions of div, span, etc, are adapted for the CNRTL
dictionnary. That was just what I needed here. One can test it using
the call (cnrtl-get 'french-word).

I believe that the ideas from this implementation of html (cnrtl), can
be used in future to implement a complete web browser embedded in
emacs lisp, more customizable than any other browser of emacs (in fact
emacs does not have an incorporated web browser).

Html is lisp.



Finally, I wish to dedicate this implementation of html as a
programming language to Julie B. White from San Diego with all the
gratitude for her encouragements and for a beautiful friendship.


  Alin C. Soare




PS: I promised 1 year ago that I would send the indentation of
the lisp code in O(n) time complexity, and I will send it, I hope
soon, because I have much to work and I do not have time at this
instant to check it out before.



____________________________________________________

En quelques secondes, créez-vous une autre adresse mail ! http://mail.voila.fr

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: /mnt/lisp/elisp/cnrtl.el --]
[-- Type: text/x-emacs-lisp; name="/mnt/lisp/elisp/cnrtl.el", Size: 14873 bytes --]


(defun concat* (text)
  (let ((r ""))
    (dolist (i text)
      (and (equal i "-") (setq i "\n-"))
      (setq r (concat r (format "%s " i ))))
    r))

(defun html->read-structure-name (limit buffer offset back)
  (with-current-buffer (or buffer (current-buffer))
    (skip-chars-forward " ")
    (forward-char offset)
    (let ((w (upcase (buffer-substring-no-properties
                      (point) (prog2 (search-forward-regexp limit nil t) (match-beginning 0))))))
      (backward-char back)
      (if (member (elt w 0) '(?#))
          (format "\"%s\"" w)
        w))))

(defun html->add-return (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (goto-char (point-min))
    (search-forward "<div " nil t)
    (while (search-forward "<div " nil t)
      (goto-char (match-beginning 0))
      (insert-char ?\n 2)
      (goto-char (match-end 0)))
    (goto-char (point-min))
    ;;(while (search-forward "</div>" nil t)
    ;;  (goto-char (match-beginning 0))
    ;;  (insert-char ?\n 1)
    ;;  (goto-char (match-end 0))))
    ))

(defun html->extract-article (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (goto-char (point-min))
    (if (search-forward "<div id=\"art" nil t)
        (let ((beg (match-beginning 0)) end (n 1))
           (while (not (zerop n))
             (search-forward "div" nil t)
             (cond ((and (equal (char-before (match-beginning 0)) ?/)
                         (equal (char-before (1- (match-beginning 0))) ?<))
                    (setq n (1- n)))
                   ((and (equal (char-before (match-beginning 0)) ?<)
                         (equal (char-after (match-end 0)) ? ))
                    (setq n (1+ n)))))
           (setq end (1+ (match-end 0)))
           (list beg end))
      (html-fail-proposed-words))))
  
(defun html-fail-proposed-words (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (let (beg end list-definitions)
      (goto-char 1)
      (if (search-forward "Le portail vous propose les termes suivants:" nil t)
          (progn
            (while (search-forward "*" nil t)
              (search-forward "<a href=")
              (forward-char)
              (setq beg (point))
              (search-forward "\"")
              (backward-char)
              (setq end (point))
              (setq list-definitions (append list-definitions (list (buffer-substring-no-properties beg end)))))
            (erase-buffer)
            (insert "\n\nLe portail vous propose les termes suivants: ")
            (dolist (i list-definitions)
              (insert (format "\n%s"
                              (substring i (1+ (and i
                                                    (equal (elt i 0) ?/)
                                                    (string-match "/" i 1))))))))
        (erase-buffer)
        (insert (format "Le terme %s est introuvable." mot)))
      ())))

(defun html->filter (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (let ((lim (html->extract-article buffer)))
      (if lim
          (progn
            (setq other-definitions (html->extract-definitions buffer))
            (delete-region (cadr lim) (point-max))
            (delete-region 1 (car lim))
            t)
        (switch-to-buffer buffer)
        ()))))

(defun html->extract-definitions (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (let (beg end a w list-definitions)
      (goto-char 1)
      (while (search-forward "return sendRequest(5," nil t)
        (search-forward "\'")
        (setq beg (match-end 0))
        (search-forward "\'")
        (setq end (match-beginning 0))
        (setq a (buffer-substring-no-properties beg end))
        (search-forward "<h1>")
        (setq beg (match-end 0))
        (search-forward "</h1>")
        (setq end (match-beginning 0))
        (setq w (buffer-substring-no-properties beg end))
        (search-forward "<h2>")
        (setq beg (match-end 0))
        (search-forward "</h2>")
        (setq end (match-beginning 0))
        (setq w (concat w "-" (buffer-substring-no-properties beg end)))
        (setq list-definitions (append list-definitions (list (cons w a)))))
      list-definitions )))

(defun html-step-forward (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (and (eobp) (error "il y a une erreur dans le fichier (fin dy fichier)"))
    (skip-chars-forward " \f\t\n\r\v")
    (let ((p (point)))
      (if (equal (char-after) ?<)
          (prog1 nil
            (search-forward ">" nil t)
            (save-excursion (search-backward "<" nil t)
                            (when (> (point) p)
                              (error "Charactère défendu dans le texte (position %d - %d)" p (point)))))
        (search-forward "<" nil t)
        (backward-char)
        (skip-chars-backward " \f\t\n\r\v")))))

(defun html-step-backward (&optional buffer)
  (with-current-buffer (or buffer (current-buffer))
    (skip-chars-backward " \f\t\n\r\v")
    (let ((p (point)))
      (if (equal (char-before) ?>)
          (prog1 nil
            (search-backward "<" nil t)
            (save-excursion (search-forward ">" nil t)
                            (when (< (point) p)
                              (error "Charactère défendu dans le texte (position %d - %d)" (point) p))))
        (search-backward ">" nil t)
        (forward-char)
        (skip-chars-forward " \f\t\n\r\v")))))

(defun cnrtl-get (mot)
  (let* ((address (concat "http://www.cnrtl.fr/definition/" (symbol-name mot)))
         (coding-system-for-write 'utf-8)
         (coding-system-for-read 'utf-8)
         (temporary-file-directory
          "/tmp/cnrtl/")
         (f (prog2 (or (file-exists-p temporary-file-directory)
                       (make-directory temporary-file-directory))
                (make-temp-file (concat "cnrtl" "-" "mot"))))
         (buffer (url-retrieve-synchronously address))
         other-definitions
         )
    ;;(html-check-coding-system buffer)
    (when (html->filter buffer)
      (html->add-return buffer)
      (with-current-buffer (or buffer (current-buffer)) (write-region nil nil f))
      (setq buffer (find-file-noselect f))
      (with-current-buffer buffer
        (html->lisp)
        (goto-char (point-max))
        (message "evaluating the html cpde...")
        (let ((r (eval-last-sexp t)))
          (erase-buffer)
          (insert r)
          ))
      (switch-to-buffer buffer)
      (goto-char (point-min)))))

(defun html->read-tag nil
  (if (equal (char-after) ?<)
      (let ((tag-name (prog2
                          (forward-char)
                          (html->read-structure-name "[ >]" (current-buffer) 0 1)))
            info
            name
            )
        (catch 'END-TAG
          (while t
            (when (or (looking-at ">") (looking-back ">"))
              (throw 'END-TAG 0))
            (setq name (html->read-structure-name "[ =>]" (current-buffer) 0 1))
            (if (looking-at "=\"")
                (setq info (append info (list (cons name (html->read-structure-name
                                                          "\"" (current-buffer) 2 0)))))
              (setq info (cons (list name) info)))))
        (and (looking-at ">") (forward-char))
        (cons tag-name info))
    nil))

(defun html->read-text nil
  (let ((w (buffer-substring-no-properties (point) (prog2 (html-step-forward) (point)))))
    (cond ((string-equal w (string 9830)) (setq w "\n\n §"))
          ((string-equal w (string 8722)) (setq w "\n\n #")))
    w))

(defconst html-special-tags
  '(("BR" . "NEWLINE")))

(defun html->struct nil
  (skip-chars-forward " \f\t\n\r\v")
  (let ((p (point)))
    (setq s (or (html->read-tag)
                (html->read-text)))
    (delete-region p (point))
    (setq type (cond ((listp s)
                      (cond ((assoc (car s) html-special-tags)
                             'SPECIAL-TAG)
                            ((member "/" (cadr s))
                             'UNUSED-TAG)
                            ((equal (elt (car s) 0) ?/)
                             'CLOSE-TAG)
                            (t
                             'OPEN-TAG)))
                     ('TEXT)))
    (setq expr
          (cond ((eq type 'SPECIAL-TAG)
                 (format "(%s) " (cdr (assoc (car s) html-special-tags))))
                ((eq type 'UNUSED-TAG)
                 "")
                ((eq type 'CLOSE-TAG)
                 (setq depth (1- depth))
                 ") ")
                ((eq type 'OPEN-TAG)
                 (setq depth (1+ depth))
                 (format "(%s \'%s " (car s) (cdr s)))
                ((eq type 'TEXT)
                 (format "\"%s\" " s))))))

(defun html->lisp ()
  (message "html->lisp start")
  (catch 'END
    (let ((depth 0)
          s
          type
          expr)
      (while t
        (html->struct)
        (insert expr)
        (when (zerop depth)
          (throw 'END nil)))))
  (message "html->lisp end")
  )

(defun html->pretty-print (text indent)
  (let (r s)
    (setq s (split-string text))
    (catch 'STOP
      (while t
        (let (nl)
          (while (or (and s (< (length nl) (- 100 indent)))
                     (and s (null (string-match "[[:alpha:]]" (substring (car s) 0 1)))
                          (not (member (elt (car s) 0) '(?\( ?\«) ))))
            (setq nl (concat nl (if (member (elt (car s) 0) '(?, ?.)) "" " ")
                             (car s))
                  s (cdr s)))
          (setq r (concat r (concat "\n" (make-string indent ? ) nl))))
        (and (null s)
             (throw 'STOP r))))))

;; html definition

(defun B (prop &rest text)
  (concat* text))

(defun SUP (prop &rest text)
  (concat* text))

(defun I (prop &rest text)
  (concat* text))

(defun CENTER (prop &rest text)
  (concat* text))

(defun INPUT (prop &rest text)
  "")

(defconst div-forbidden-classes nil)

(defun DIV (prop &rest text)
  (let* ((dclass (cdr (assoc 'CLASS prop)))
         (did (cdr (assoc 'ID prop)))
         indent
         (E "")
         )
    (cond ((member dclass div-forbidden-classes)
           )
          ((string-equal (caar prop) "ID")
           (setq indent 4))
          ((string-equal "TLF_PARAPUTIR" dclass)
           (and text (setcar (cdr text) (substring (cadr text) (string-match "[[:graph:]]" (cadr text)))))
           (setq E (concat* text))
           (add-text-properties 0 1 '(face (:foreground "green" ))  E)
           (setq E (concat "\n\n " E)))
          ((and text
                (string-equal "TLF_PAROTHERS" dclass)
                (or (string-match "Prononc." (car text))
                    (string-match "BBG." (car text))
                    (string-match "STAT." (car text))
                    (string-match "Étymol." (car text))))
           )
          ((and text
                (string-equal "TLF_PAROTHERS" dclass)
                (string-match "Rem." (car text)))
           (setq E (concat* text))
           (add-text-properties 0 (length E) '(face (:foreground "yellow" ))  E)
           (setq E (html->pretty-print E 7))
           (setq E (concat "\n" E))
           )
          ((string-equal "TLF_CVEDETTE" dclass)
           (let* (z
                  (def (mapcar 'car other-definitions)))
             (setq header-line-format (concat "« " (concat* text) " »" (format "  %s " def))))
           (setq indent nil)
           )
          ((string-equal "TLF_PARAH" dclass)
           (and text (setcar text (substring (car text) 0 (string-match " " (car text)))))
           (and text (setcar (cdr text) (substring (cadr text) (string-match "[[:graph:]]" (cadr text)))))
           (setq E (concat* text))
           (setq E (concat "\n\n " E))
           )
          ((string-equal "TLF_TABULATION" dclass)
           (setq indent 10))
          (t
           (setq indent 0)
           (setq text (append text '("\n")))))
    (if indent
        (concat "\n" (make-string indent ? ) (concat* text))
      E)))

(defconst span-forbidden-classes nil)

(defun SPAN (prop &rest text)
  (let ((sclass (cdar prop)) r i c forbid)
    (cond ((member sclass span-forbidden-classes)
           "" )
        ((string-equal sclass "TLF_CMOT")
         (setq c '(face (:foreground "red" ))) (concat* text))
        ((string-equal sclass "TLF_CCODE")
         (setq c '(face (:foreground "green" ))) (concat* text))
        ((string-equal sclass "TLF_CPLAN")
         (setq c '(face (:foreground "DeepPink1" ))) (concat* text))
        ((string-equal sclass "TLF_CEMPLOI")
         (setq c '(face (:foreground "cyan" ))
               i 3))
        ((string-equal sclass "TLF_CDEFINITION")
         (setq c '(face (:foreground "DodgerBlue" ))
               i 5))
        ((string-equal sclass "TLF_PARAH")
         (setq c '(face (:foreground "SlateBlue" ))) (concat* text))
        ((string-equal sclass "TLF_CSYNONIME")
         (setq c '(face (:foreground "MediumSeaGreen" ))
               i 10))
        ((string-equal sclass "TLF_CSYNTAGME")
         (setq c '(face (:foreground "DarkGreen" ))
               i 12))
        ((string-equal sclass "TLF_CTITRE")
         (setq c '(face (:foreground "LawnGreen" ))
               forbid t))
        ((string-equal sclass "TLF_CCROCHET")
         (setq c '(face (:foreground "IndianRed" ))
               i 5))
        ((string-equal sclass "TLF_CAUTEUR")
         (setq c '(face (:foreground "DarkKhaki" ))
               forbid t))
        ((string-equal sclass "TLF_SMALLCAPS")
         (when text (setq text (list (upcase (concat* text)))
                          c '(face (:foreground "orange" ))))
         )
        ((string-equal sclass "TLF_CEXEMPLE")
         (setq c '(face (:foreground "tomato1" ))
               i 15)
         )
        ((string-equal sclass "TLF_CDATE")
         (setq c '(face (:foreground "BlueViolet" ))
               forbid t))
        ((string-equal sclass "TLF_CCONSTRUCTION")
         (setq c '(face (:foreground "LemonChiffon3" ))
               i 14))
        ((string-equal sclass "TLF_CSOURCE")
         (setq c '(face (:foreground "LavenderBlush3" ))) (concat* text))
        ((string-equal sclass "TLF_CDOMAINE")
         (setq c '(face (:foreground "DeepSkyBlue4 " ))) (concat* text))
        ((string-equal sclass "TLF_CPUBLICATION")
         (setq c '(face (:foreground "LemonChiffon3" ))) (concat* text))
        (t
         (message "balise %s inconnue" sclass)
         (setq c '(face (:foreground "LemonChiffon3" )))))
    (setq r (concat* text))
    (and i (setq r (html->pretty-print r i)))
    (add-text-properties 0 (length r) c  r)
    (if forbid "X" r)))

(defun TR (prop &rest text)
  text)

(defun TD (prop &rest text)
  text)

(defun TABLE (prop &rest rows)
  rows)

(defun NEWLINE (&rest x)
  "\n")

;; (cnrtl-get 'marmite)

             reply	other threads:[~2008-06-22 20:56 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-22 20:56 A Soare [this message]
2008-06-22 21:29 ` Embedding Html in Lisp Drew Adams
2008-06-23  8:33 ` Thien-Thi Nguyen
  -- strict thread matches above, loose matches on Subject: below --
2008-07-01 19:05 A Soare
2008-06-24  3:09 A Soare
2008-06-24 10:19 ` Thien-Thi Nguyen
2008-06-24  0:32 A Soare
2008-06-24  0:51 ` Stefan Monnier
2008-06-24  1:34 ` Thomas Lord
2008-06-23 21:38 A Soare
2008-06-23 14:08 A Soare
2008-06-24 13:36 ` T. V. Raman
2008-06-24 14:41   ` joakim
2008-06-23 13:21 A Soare
2008-06-23 13:51 ` tomas
2008-06-23 16:56 ` Thomas Lord
2008-06-23 17:04 ` Thien-Thi Nguyen
2008-06-23 20:07 ` Stephen J. Turnbull
2008-06-23 12:23 A Soare
2008-06-23 11:49 A Soare
2008-06-22 22:27 A Soare
2008-06-22 20:53 Embedding HTML " A Soare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6020102.8572091214168175976.JavaMail.www@wwinf4622 \
    --to=alinsoar@voila.fr \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).