Letter-case conversions in network protocols

unofficial mirror of emacs-devel@gnu.org 

* Letter-case conversions in network protocols
@ 2021-05-08  9:50 Eli Zaretskii
  2021-05-08 10:20 ` Lars Ingebrigtsen
  2021-05-08 13:47 ` Stefan Monnier
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2021-05-08  9:50 UTC (permalink / raw)
  To: emacs-devel; +Cc: Fatih Aydin

For the immediate trigger, see bug#44604.

The general problem that I'd like us to discuss is how to deal with
letter-case conversions in code that deals with protocols, such as
network-related protocols, that need to recognize certain keywords.

The problem here is that when Emacs starts in certain locales, or
changes to the corresponding language-environments, we modify the case
tables to comply with the rules of those locales.  An example (though
not the only one) is the Turkish locale; see
turkish-case-conversion-enable.  As a result of calling that, downcasing
'I' no longer produces 'i', and code which attempts to match keywords
including 'i' case-insensitively fails.

Since we generally use the same text-search and matching APIs both for
implementing the keyword-based protocols and for more general
processing of human-readable code, there are no easy solutions when we
need to ignore language-specific case-conversion rules in some of the
code.  For example, let-binding case-table cannot be done at too
high a level, because it will then affect any text processing below that
level, and a high-level function has no way of knowing what kind of
text processing will be needed by the code it calls, directly or
indirectly.

So what would be the best/easiest solution to this class of problems?

An immediate, but not necessarily easy, candidate is to use

  (with-case-table ascii-case-table ...)

everywhere we use text-search facilities for keyword processing.
However, this means we will have to go over all the places which do
this, and manually change the code there, and so will developers of
any 3rd-party packages.
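
A minimal sketch of what such a manual change might look like at one
call site.  The helper name `my-protocol-looking-at-p' is hypothetical
and not part of Emacs; `with-case-table', `ascii-case-table' and
`looking-at-p' are the only existing facilities it relies on.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-protocol-looking-at-p (regexp)
  "Return non-nil if the text after point matches REGEXP.
Case is folded with the ASCII case table only, so the result does
not depend on the current language environment."
  (with-case-table ascii-case-table
    (let ((case-fold-search t))
      (looking-at-p regexp))))
```

The case table is switched only around the single match, so text
processing elsewhere in the call chain keeps the locale's rules.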

Are there better solutions?  Ideas are welcome.


* Re: Letter-case conversions in network protocols
  2021-05-08  9:50 Letter-case conversions in network protocols Eli Zaretskii
@ 2021-05-08 10:20 ` Lars Ingebrigtsen
  2021-05-08 13:47 ` Stefan Monnier
  1 sibling, 0 replies; 7+ messages in thread
From: Lars Ingebrigtsen @ 2021-05-08 10:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Fatih Aydin, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> As a result of calling that, downcasing 'I' no longer produces 'i', and
> code which attempts to match keywords including 'i' case-insensitively
> fails.

(For those who wondered what the concrete problem we had was, it was
basically

(with-temp-buffer
  (let ((case-fold-search t))
    (insert "DIRECT\n")
    (goto-char (point-min))
    (looking-at "direct")))

If you start Emacs with

  LANG=tr_TR src/emacs

this will return nil, while in most (all?) other locales it will return
t, and this made HTTP proxying fail completely if started in that locale.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Letter-case conversions in network protocols
  2021-05-08  9:50 Letter-case conversions in network protocols Eli Zaretskii
  2021-05-08 10:20 ` Lars Ingebrigtsen
@ 2021-05-08 13:47 ` Stefan Monnier
  2021-05-08 15:41   ` Daniel Martín
  1 sibling, 1 reply; 7+ messages in thread
From: Stefan Monnier @ 2021-05-08 13:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Fatih Aydin

> The general problem that I'd like us to discuss is how to deal with
> letter-case conversions in code that deals with protocols, such as
> network-related protocols, that need to recognize certain keywords.
[...]
> Are there better solutions?  Ideas are welcome.

FWIW, I think the sanest way to solve this problem is to make sure
localization doesn't affect the case-mapping of ASCII chars.

I have no idea how "unacceptable" that is for someone using a locale
such as Turkish, tho.

To do better, we could add a separate nonascii case table and those
places in the code that manipulate "text" would then have to use

    (with-case-table nonascii-case-table ...)

This said, another approach is to improve our handling of case-fold: instead
of applying `downcase` on both sides and checking that it gives the same
result, we should be using a "normalization" function which will return
"the representative" of a given equivalence class.  E.g. in a Turkish
locale, `i`, `I`, and `ı` should all belong to the same equivalence
class and this normalization function should hence return the same value
for all three.
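
A rough sketch of the normalization idea, under stated assumptions: the
table and the names `my-fold-representatives', `my-fold-char' and
`my-fold-string-equal-p' are hypothetical, and this is not how Emacs
currently implements case folding.  Only the three characters named
above are put in one class; everything else falls back to `downcase'.

```elisp
;; Hypothetical table mapping members of a case-equivalence class to
;; the representative of that class.  An illustration only.
(defconst my-fold-representatives
  '((?I . ?i)
    (?\u0131 . ?i))   ; U+0131 LATIN SMALL LETTER DOTLESS I
  "Alist of (CHAR . REPRESENTATIVE) pairs.")

(defun my-fold-char (char)
  "Return the representative of CHAR's case-equivalence class.
Characters not listed in `my-fold-representatives' are downcased."
  (alist-get char my-fold-representatives (downcase char)))

(defun my-fold-string-equal-p (a b)
  "Return non-nil if strings A and B are equal after folding each character."
  (equal (mapcar #'my-fold-char a)
         (mapcar #'my-fold-char b)))
```

With such a function, "DIRECT" and "direct" compare equal whether or
not the locale maps `I' to `ı', because both sides are reduced to the
same representative instead of being compared after `downcase'.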

        Stefan


* Re: Letter-case conversions in network protocols
  2021-05-08 13:47 ` Stefan Monnier
@ 2021-05-08 15:41   ` Daniel Martín
  2021-05-08 19:45     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Martín @ 2021-05-08 15:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, emacs-devel, Fatih Aydin

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
> This said, another approach is to improve our handling of case-fold: instead
> of applying `downcase` on both sides and checking that it gives the same
> result, we should be using a "normalization" function which will return
> "the representative" of a given equivalence class.  E.g. in a Turkish
> locale, `i`, `I`, and `ı` should all belong to the same equivalence
> class and this normalization function should hence return the same value
> for all three.
>

I like this, but I think it should be a different option (say
`case-fold-culture-invariant-search' or something like that), because
both ways to perform case-insensitive comparisons have their purpose,
depending on the context.

Then we could audit instances where 'case-fold-search' is set to non-nil
in the codebase and see if replacing them with the culture-invariant
form would be TRT to avoid this kind of subtle bug, which can especially
affect our Turkish and Azeri users.

Of course, code can still be broken if people explicitly do their own
thing with `downcase', etc. instead of using the case-folding string
APIs, but that's sort of an anti-pattern, anyway.


* Re: Letter-case conversions in network protocols
  2021-05-08 15:41   ` Daniel Martín
@ 2021-05-08 19:45     ` Lars Ingebrigtsen
  2022-01-17  9:41       ` Fatih Aydin
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Ingebrigtsen @ 2021-05-08 19:45 UTC (permalink / raw)
  To: Daniel Martín
  Cc: Eli Zaretskii, Stefan Monnier, Fatih Aydin, emacs-devel

Daniel Martín <mardani29@yahoo.es> writes:

> Of course, code can still be broken if people explicitly do their own
> thing with `downcase', etc. instead of using the case-folding string
> APIs, but that's sort of an anti-pattern, anyway.

But code does this sort of thing -- for instance, Message allows headers
to be specified in various ways, but will run the header names through
`capitalize'.

So this isn't just about doing comparisons, but separating out text
transformations that are done according to a protocol specification
(i.e., octets that happen to be ASCII) vs. the normal DWIM text
transformations.
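
A small sketch of a protocol-level transformation kept apart from the
DWIM ones.  The function name `my-capitalize-header-name' is
hypothetical; it is not what Message actually does, just one way to
pin the case rules for header names to ASCII.

```elisp
;; Hypothetical helper, not part of Message or Emacs.
(defun my-capitalize-header-name (name)
  "Return header NAME capitalized using ASCII case rules only.
The language environment's case table is bypassed, so ASCII `i'
always upcases to ASCII `I'."
  (with-case-table ascii-case-table
    (capitalize name)))
```

Word boundaries still come from the current buffer's syntax table, so
only the case mapping, not the splitting into words, is fixed here.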

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* Re: Letter-case conversions in network protocols
  2021-05-08 19:45     ` Lars Ingebrigtsen
@ 2022-01-17  9:41       ` Fatih Aydin
  2022-01-20  9:21         ` Lars Ingebrigtsen
  0 siblings, 1 reply; 7+ messages in thread
From: Fatih Aydin @ 2022-01-17  9:41 UTC (permalink / raw)
  To: Lars Ingebrigtsen
  Cc: Eli Zaretskii, emacs-devel, Stefan Monnier, Daniel Martín


There is still a weird problem, not with network protocols but still in eww.
Run eww and visit Google and check non-ASCII chars, you will see the chars
correctly. No problems.
The bug is:
1) Set language environment to Turkish
2) Visit www.google.com.tr
3) Try to search for something, or just observe the buttons
You will see that some chars are displayed as \345. It's weird because I
have tried other websites, and it only happens with Google.



* Re: Letter-case conversions in network protocols
  2022-01-17  9:41       ` Fatih Aydin
@ 2022-01-20  9:21         ` Lars Ingebrigtsen
  0 siblings, 0 replies; 7+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-20  9:21 UTC (permalink / raw)
  To: Fatih Aydin
  Cc: Eli Zaretskii, emacs-devel, Stefan Monnier, Daniel Martín

Fatih Aydin <fataydin138@gmail.com> writes:

> There is still a weird problem, not with network protocols but still in eww.
> Run eww and visit Google and check non-ASCII chars, you will see the chars
> correctly. No problems.
> The bug is:
> 1) Set language environment to Turkish
> 2) Visit www.google.com.tr
> 3) Try to search for something, or just observe the buttons
> You will see that some chars are displayed as \345. It's weird because I have tried
> other websites, and it only happens with Google.

As far as I can tell, it's because the Google web site returns invalid
data.  It's not returning utf-8 but a different charset, but the headers
claim that it's utf-8.  I've seen this before with various Google web
sites -- they return other data when not using Chrome/Firefox, and that
data is often invalid.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


