Request for discussion - how to make MC unicode capable

Egmont Koblinger egmont at uhulinux.hu
Mon Feb 26 12:17:40 UTC 2007


On Sat, Feb 24, 2007 at 02:57:44PM +0200, Pavel Tsekov wrote:

> I'd like to initiate a discussion on how to make MC
> unicode capable and able to deal with multibyte character sets.

Hi,

Here are some of my thoughts:

- First of all, before doing any work, this is a must-read for everyone:
  http://joelonsoftware.com/articles/Unicode.html
  One of its main points: from the users' point of view it absolutely
  doesn't matter what the bytes are; the only thing that matters is that
  users see every _letter_ correctly on the display. Byte sequences must
  always be converted accordingly. On the other hand, we'll see that it's
  often a must for mc to keep byte sequences unchanged. The other main
  point: for _all_ byte sequences inside mc, in the config and history
  files, in the vfs interface, everywhere, you _must_ know which
  character set the string is encoded in.

- Currently KDE has many more bugs with accented filenames than Gnome
  has. This is probably because they have different philosophies. Gnome
  treats filenames as byte sequences (as every Unix does) and only
  converts them to characters for display purposes, while KDE treats them
  as character sequences (QString or something like that). Probably due
  to this, KDE has a lot of trouble: it is unable to correctly handle
  filenames that are invalid byte sequences according to the locale, and
  it often performs extra, erroneous conversions. So I think the right
  way is to internally _think_ in byte sequences, and only convert them
  to/from characters when displaying them, doing regexp matches and so
  on; see the sketch below.

- The same goes for file contents. Even in a UTF-8 environment, people
  want to display (read) and edit files in different encodings, and even
  if every text file used UTF-8, there would still be other (non-text)
  files. We shouldn't drop support for editing binary files, hex editor
  mode and so on.

- When the author of the well-known text editor "joe" began to implement
  UTF-8 support, I helped him with advice and later with bug reports. (He
  managed to implement a working version 2 weeks after he first heard of
  UTF-8 :-)) The result is IMHO a very well designed editor and I'd like
  to see something similar in mcview/mcedit. In order to help people
  migrate from 8-bit charsets to UTF-8, and in order to be able to view
  older files, it's important to support differing file encodings and
  terminal charsets. For example, it should be possible to view a Latin-1
  file inside a Latin-1 mc, to view a UTF-8 file in a Latin-1 mc
  (replacing non-representable characters with an inverted question mark
  or something like that), to view a Latin-1 file in a UTF-8 mc, and to
  view a UTF-8 file in a UTF-8 mc.

  - The terminal charset should be taken from nl_langinfo(CODESET) (that
    is, from the LANG, LC_CTYPE and LC_ALL variables) and (as opposed to
    vim) I believe there should be _no_ way to override it in mc. No-one
    can expect correct behavior from any terminal application if these
    variables do not reflect the terminal's actual encoding, so it's the
    users' or software vendors' job to set them correctly; there is no
    reason why anyone would want to fix it in only one particular
    application. MC is not the place to fix it, and once it's fixed
    outside mc, mc should not provide an option to mess it up. (I have no
    experience with platforms that lack locale support; on such platforms
    it might make sense to create a "terminal encoding" option, and the
    need for it could be detected by the ./configure script.) See the
    sketch below for the detection itself.
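
    Something along these lines (a minimal sketch; spellings like "utf8"
    may need normalizing on some systems):

      #include <locale.h>
      #include <langinfo.h>
      #include <stdio.h>
      #include <string.h>

      int main (void)
      {
          const char *codeset;

          /* Must run once at startup so that nl_langinfo() honours the
             LANG / LC_CTYPE / LC_ALL environment variables. */
          setlocale (LC_ALL, "");

          codeset = nl_langinfo (CODESET);
          printf ("terminal codeset: %s%s\n", codeset,
                  strcmp (codeset, "UTF-8") == 0 ? " (UTF-8 mode)" : "");
          return 0;
      }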

  - The file encoding should probably default to the terminal encoding,
    but should be easily altered in the viewer or editor (and in fact,
    some auto-detection might be added, e.g. if the file is not valid
    UTF-8 then automatically fall back to the locale's legacy charset, or
    automatically assume UTF-8 if the file is valid. Joe has two boolean
    options controlling these two ways of auto-guessing the file
    encoding.) This setting alters the way the file's content is
    interpreted (displayed on the screen, searched case-insensitively
    etc.) and alters how pressed keys are inserted into the file, but
    does not alter the file itself (i.e. no iconv is performed on it).
    This way the editor remains completely binary-safe. Obviously,
    displaying the file requires conversion from the file encoding to the
    terminal encoding; interpreting pressed keys requires conversion in
    the reverse direction. See the validity check sketched below.
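
    The auto-detection could be built on a simple validity scan like this
    (a simplified sketch: it does not reject overlong 3- and 4-byte forms
    or UTF-16 surrogates, which a real implementation should):

      #include <stddef.h>

      /* Return 1 if the buffer is valid UTF-8, 0 otherwise.  If this
         fails, the viewer/editor falls back to the locale's legacy
         charset.  (Hypothetical helper, not an existing mc function.) */
      static int
      is_valid_utf8 (const unsigned char *p, size_t len)
      {
          size_t i = 0, j, follow;

          while (i < len) {
              if (p[i] < 0x80)
                  follow = 0;                       /* ASCII */
              else if ((p[i] & 0xE0) == 0xC0 && p[i] >= 0xC2)
                  follow = 1;                       /* 2-byte sequence */
              else if ((p[i] & 0xF0) == 0xE0)
                  follow = 2;                       /* 3-byte sequence */
              else if ((p[i] & 0xF8) == 0xF0 && p[i] <= 0xF4)
                  follow = 3;                       /* 4-byte sequence */
              else
                  return 0;                         /* invalid lead byte */

              if (follow > 0 && i + follow >= len)
                  return 0;                         /* truncated sequence */

              for (j = 1; j <= follow; j++)
                  if ((p[i + j] & 0xC0) != 0x80)
                      return 0;                     /* bad continuation */

              i += follow + 1;
          }
          return 1;
      }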

- Currently mc with the UTF-8 patches has a bug: when you run it in a
  UTF-8 environment and copy a file whose name is invalid UTF-8 (copy
  means F5 then Enter), the file name gets mangled: the invalid parts
  (characters that are _shown_ as question marks) are replaced with
  literal question marks. Care should be taken to always _think_ in bytes
  and only convert to characters for displaying and similar purposes, so
  that the byte sequences always remain the same.

- In UTF-8, the "size" (memory consumption), "length" (number of Unicode
  entities) and "width" (columns occupied on the terminal) are three
  different notions. The difference between the first two is trivial. The
  third differs because there are zero-width characters (e.g. combining
  accents, used e.g. in MacOS accented filenames) and double-width (CJK)
  characters too. I think it is a must to handle them correctly, and it
  should not be hard. I highly recommend that the often-misunderstood
  Hungarian Notation be used
  ( http://joelonsoftware.com/articles/Wrong.html -- read it!), so that
  for every function and variable that handles any of these three, the
  name reflects whether it stores a size, a length or a width. Currently
  a lot of CJK-related bugs originate from not distinguishing between
  length and width. For example, there's a function called mbstrlen()
  that returns the width and not the length -- this _must_ be fixed ASAP.
  It's a good question whether to support more complicated languages,
  e.g. right-to-left scripts; I'm not aware of the technical issues that
  arise there. The sketch below illustrates the three notions.
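
  To illustrate (a standalone sketch assuming a UTF-8 locale; the string
  contains a combining accent and a CJK ideograph):

    #define _XOPEN_SOURCE 700    /* for wcswidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main (void)
    {
        /* "e" + U+0301 combining acute (zero width)
               + U+6F22 CJK ideograph (double width) */
        const char *s = "e\xcc\x81\xe6\xbc\xa2";
        wchar_t wbuf[16];

        setlocale (LC_ALL, "");

        printf ("size:   %zu bytes\n", strlen (s));            /* 6 */
        printf ("length: %zu chars\n", mbstowcs (NULL, s, 0)); /* 3 */

        mbstowcs (wbuf, s, 16);
        printf ("width:  %d columns\n", wcswidth (wbuf, 16));  /* 3 = 1+0+2 */
        return 0;
    }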

- The vfs specification might need a major review. It should be decided
  and clearly documented what character set to use. I think there are two
  possible ways: always UTF-8, or use the locale settings. However, in
  both cases, invalid byte sequences should be tolerated. The story gets
  a little more complicated, as there are e.g. file system types where
  filenames are encoded in one particular encoding. For example, Windows
  filesystems always use UTF-16. Let's suppose you use a Latin-1 locale
  and enter a Joliet (non-RockRidge) .iso image and copy files out of it.
  The filename _must_ be converted, since UTF-16 is not usable on Unices.
  The user expects the software (mc+vfs) to convert it to Latin-1, since
  this is his locale and most likely all his other files have accents in
  this encoding. But this conversion might fail due to unrepresentable
  characters. What to do in this case? Probably the best way is to
  imitate the kernel's behavior when you mount such an image with
  iocharset=iso-8859-1 and try to do the same operation. It is one
  particular error code, I think. And how to list the contents of that
  directory? Nice questions... Since it's quite unlikely that all the
  software the vfs plugins invoke can handle this situation in the same
  consistent way, my guess is that it's cleaner to force UTF-8 in the vfs
  communication, and then a Latin-1 mc can handle invalid entries it
  receives from the vfs plugin; a sketch of such a conversion follows
  below.
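
  Converting a UTF-8 name coming from the vfs layer to the user's locale
  could look roughly like this (a sketch: the helper name and the error
  policy are made up, and the target charset would really come from
  nl_langinfo(CODESET)):

    #include <errno.h>
    #include <iconv.h>
    #include <string.h>

    /* Convert a UTF-8 vfs name to Latin-1.  Returns 0 on success, -2 if
       the name contains characters Latin-1 cannot represent (or invalid
       UTF-8), -1 on other errors.  The caller decides what to do with
       unrepresentable names: refuse the operation, show a placeholder... */
    static int
    vfs_name_to_latin1 (const char *in, char *out, size_t outsize)
    {
        iconv_t cd;
        char *inp = (char *) in, *outp = out;
        size_t inleft = strlen (in), outleft = outsize - 1, r;

        cd = iconv_open ("ISO-8859-1", "UTF-8");
        if (cd == (iconv_t) -1)
            return -1;

        r = iconv (cd, &inp, &inleft, &outp, &outleft);
        iconv_close (cd);
        *outp = '\0';

        if (r == (size_t) -1)
            return (errno == EILSEQ || errno == EINVAL) ? -2 : -1;
        return 0;
    }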

  (Just a side note: once the vfs interface is cleaned up, it's time to
  revisit other issues, e.g. 32-bit (64-bit??) UID/GID, >2GB files,
  nanosecond timestamp resolution etc... - are all these supported?)

- Currently mc supports both the ncurses and slang backends via a common
  wrapper. Both libraries support Unicode, but in different ways: slang
  works with UTF-8, while ncurses works with wchar (practically UCS-4).
  If only lower-level ncurses routines are used, UTF-8 can be used too; I
  don't know whether this is the case in mc. Someone experienced with
  mc's internals should examine whether dropping support for one of these
  libraries would save noticeable developer resources. At this moment the
  resources to develop mc are IMHO much scarcer than the resources at any
  site where either ncurses or slang has to be installed in order to
  install mc. So if keeping support for only one of these libraries would
  save work, I believe that is the way to go. Which library to support is
  a good question; a long time ago I wrote an e-mail here with my
  opinions on this. The sketch below shows what the wchar route looks
  like.
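
  For example, with ncursesw a UTF-8 byte string (a filename, say) has to
  be widened before it can be drawn (a minimal sketch; mc's real drawing
  of course goes through its screen library wrapper):

    #define _XOPEN_SOURCE_EXTENDED 1   /* for addwstr() */
    #include <locale.h>
    #include <stdlib.h>
    #include <ncurses.h>               /* link with -lncursesw */

    int main (void)
    {
        /* UTF-8 bytes, e.g. a filename from readdir() */
        const char *utf8_name = "f\xc3\xa1jln\xc3\xa9v.txt";
        wchar_t wname[64];

        setlocale (LC_ALL, "");          /* wide-character curses needs this */
        mbstowcs (wname, utf8_name, 64); /* UTF-8 bytes -> wchar_t */

        initscr ();
        addwstr (wname);   /* ncursesw draws wchar_t strings; slang would
                              accept the UTF-8 bytes directly instead */
        refresh ();
        getch ();
        endwin ();
        return 0;
    }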



-- 
Egmont


