Request for discussion - how to make MC unicode capable
Egmont Koblinger
egmont at uhulinux.hu
Mon Feb 26 12:17:40 UTC 2007
On Sat, Feb 24, 2007 at 02:57:44PM +0200, Pavel Tsekov wrote:
> I'd like to initiate a discussion on how to make MC
> deal with multibyte character sets.
Hi,
Here are some of my thoughts:
- First of all, before doing any work, this is a must read for everyone:
http://joelonsoftware.com/articles/Unicode.html
One of the main points is: from the users' point of view, it absolutely
doesn't matter what bytes are stored; the only thing that matters is that
users see every _letter_ correctly on the display. Byte sequences must
always be converted accordingly. On the other hand, we'll see that mc
often must keep byte sequences unchanged. The other main point is: for
_all_ byte sequences inside mc, in the config and history files, in the
vfs interface, everywhere, you _must_ know which character set the
string is in.
- Currently KDE has many more bugs with accented filenames than Gnome has.
This is probably because they have different philosophies. Gnome treats
filenames as byte sequences (as every Unix does) and only converts them to
characters for display purposes, while KDE treats them as character
sequences (QString or something like that). Probably due to this, KDE has
a lot of trouble: it is absolutely unable to correctly handle filenames
that are invalid byte sequences according to the locale, and it often
performs extra, erroneous conversions. So I think the right way is to
internally _think_ in byte sequences, and only convert them to/from
characters when displaying them, doing regexp matches and so on.
- The same goes for file contents. Even in a UTF-8 environment, people want
to display (read) and edit files in different encodings, and even if
every text file used UTF-8 there would still be other (non-text) files. We
shouldn't drop support for editing binary files, hex editor mode and so
on.
- When the author of the well-known text editor "joe" began to implement
UTF-8 support, I helped him with advice and later with bug reports. (He
managed to implement a working version 2 weeks after he first heard of
UTF-8 :-)) The result is IMHO a very well designed editor and I'd like to
see something similar in mcview/mcedit. In order to help people migrate
from 8-bit charsets to UTF-8, and in order to be able to view older files,
it's important to support different file encodings and terminal charsets.
For example, it should be possible to view a Latin-1 file inside a Latin-1
mc, to view a UTF-8 file in a Latin-1 mc (replacing non-representable
characters with an inverted question mark or something like that), to view
a Latin-1 file in a UTF-8 mc, and to view a UTF-8 file in a UTF-8 mc.
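All four viewing combinations above reduce to one display-time conversion
step. Here is a minimal sketch of that step using POSIX iconv(3);
display_convert() is an illustrative name, not an mc function, and skipping
a single byte on error is a simplification (a real implementation would
skip the whole invalid character):

```c
/* Sketch: convert file bytes into the terminal charset for display
 * only, substituting '?' for anything invalid or unrepresentable.
 * The file's own bytes are never modified. */
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert src (in from_cs) to a freshly allocated string in to_cs;
 * returns NULL if the conversion pair is unsupported. */
char *display_convert(const char *src, const char *from_cs, const char *to_cs)
{
    iconv_t cd = iconv_open(to_cs, from_cs);
    if (cd == (iconv_t) -1)
        return NULL;

    size_t inleft = strlen(src);
    size_t outsize = inleft * 4 + 1;        /* generous worst case */
    char *out = malloc(outsize);
    char *inp = (char *) src, *outp = out;
    size_t outleft = outsize - 1;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
            if (errno == EILSEQ || errno == EINVAL) {
                /* invalid or unrepresentable input: emit '?', skip one
                 * byte (simplified), and reset the shift state */
                if (outleft > 0) { *outp++ = '?'; outleft--; }
                inp++; inleft--;
                iconv(cd, NULL, NULL, NULL, NULL);
            } else {
                break;                      /* E2BIG or other error */
            }
        }
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

The caller frees the result; the source buffer stays byte-identical, which
is exactly the binary-safety property argued for above.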
- The terminal charset should be taken from nl_langinfo(CODESET) (that is,
the LANG, LC_CTYPE and LC_ALL variables) and (as opposed to vim) I do
believe that there should be _no_ way to override it in mc. No-one can
expect correct behavior from any terminal application if these variables
do not reflect the terminal's actual encoding, so it's the users' or
software vendors' job to set it correctly; there is no reason why
anyone would want to fix it in only one particular application. MC is not
the place to fix it, and once it's fixed outside mc, mc should not
provide an option to mess with it. (I have no experience with platforms
that lack locale support; on such platforms it might make sense to create
a "terminal encoding" option, and the need for it could be detected by
the ./configure script.)
- The file encoding should probably default to the terminal encoding, but
should be easily altered in the viewer or editor (and in fact, some
auto-detection might be added, e.g. if the file is not valid UTF-8 then
automatically fall back to the locale's legacy charset, or automatically
assume UTF-8 if the file is valid. Joe does have two boolean options
whether to enable these two ways of auto-guessing file encoding.) This
setting alters the way the file's content is interpreted (displayed on
the screen, searched case insensitively etc.) and alters how the pressed
keys are inserted in the file, but does not alter the file itself (i.e.
do not perform iconv on it). This way the editor remains completely
binary-safe. Obviously, displaying the file requires conversion from the
file encoding to the terminal encoding; interpreting pressed keys
requires conversion in the reverse direction.
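The auto-detection mentioned above boils down to a validity check. A
minimal sketch in plain C, with no library help (buffer_is_valid_utf8()
is an illustrative name); it also rejects overlong forms and surrogates:

```c
/* Sketch: report whether a buffer is entirely valid UTF-8, so the
 * viewer/editor can fall back to the locale's legacy charset if not. */
#include <stddef.h>

int buffer_is_valid_utf8(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t n;                      /* continuation bytes expected */
        unsigned int cp;               /* decoded code point */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;                 /* stray continuation or 0xF8+ */

        if (i + n >= len) return 0;    /* truncated sequence */
        for (size_t j = 1; j <= n; j++) {
            if ((p[i + j] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (p[i + j] & 0x3F);
        }
        /* reject overlong encodings, surrogates, values > U+10FFFF */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;
        i += n + 1;
    }
    return 1;
}
```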
- Currently mc with the UTF-8 patches has a bug: when you run it in a UTF-8
environment and copy a file whose name is invalid UTF-8 (copy meaning F5
then Enter), the file name is mangled: the invalid parts (characters
that are _shown_ as question marks) are replaced with literal question
marks. Care should be taken to always _think_ in bytes and only convert to
characters for displaying and similar purposes, so that the byte sequences
always remain the same.
- In UTF-8, the "size" (memory consumption), "length" (number of Unicode
entities) and "width" (width occupied in the terminal) are three different
notions. The difference between the first two is trivial. The third is
different because there are zero-width characters (e.g. combining accents,
used e.g. in MacOS accented filenames) and double-width (CJK) characters
too. I think it is a must to handle them correctly, and it should not be
hard. I highly recommend that the often-misunderstood Hungarian Notation
be used ( http://joelonsoftware.com/articles/Wrong.html -- read it!), so
that for every function and variable that handles any of these three
notions, its name reflects whether it stores a size, a length or a
width. Currently a lot of CJK-related bugs originate from not
distinguishing between length and width. For example, there's a function
called mbstrlen() that returns the width and not the length -- this _must_
be fixed ASAP. It's a good question whether to support more complicated
languages, e.g. right-to-left scripts; I'm not aware of the technical
issues that arise there.
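To make the three notions concrete, here is a sketch using the standard
mbrtowc(3)/wcwidth(3) calls, with the proposed naming convention visible
in the identifiers (str_size/str_length/str_width are illustrative names,
not mc functions; a UTF-8 locale must be active for multibyte input):

```c
/* Sketch: the three different measures of a multibyte string. */
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* size: bytes of memory */
size_t str_size(const char *s) { return strlen(s); }

/* length: number of Unicode characters */
size_t str_length(const char *s)
{
    mbstate_t st; memset(&st, 0, sizeof st);
    size_t len = 0, n;
    wchar_t wc;
    while ((n = mbrtowc(&wc, s, MB_CUR_MAX, &st)) != 0) {
        if (n == (size_t) -1 || n == (size_t) -2) {
            memset(&st, 0, sizeof st);   /* invalid byte: resync */
            s++; len++;
            continue;
        }
        s += n; len++;
    }
    return len;
}

/* width: terminal cells occupied (0 for combining marks, 2 for CJK) */
size_t str_width(const char *s)
{
    mbstate_t st; memset(&st, 0, sizeof st);
    size_t width = 0, n;
    wchar_t wc;
    while ((n = mbrtowc(&wc, s, MB_CUR_MAX, &st)) != 0) {
        if (n == (size_t) -1 || n == (size_t) -2) {
            memset(&st, 0, sizeof st);
            s++; width++;      /* shown as one replacement cell */
            continue;
        }
        int w = wcwidth(wc);
        width += (w >= 0) ? (size_t) w : 1;
        s += n;
    }
    return width;
}
```

For plain ASCII all three agree; for "e" + combining acute, size is 3,
length is 2 and width is 1, which is exactly why one name per notion
matters.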
- vfs specification might need a major review. It should be decided and
clearly documented what character set to use. I think there are two
possible ways: always UTF-8, or use the locale settings. However, in both
cases, invalid byte sequences should be tolerated. The story gets a little
more complicated, as there are filesystem types where filenames are stored
in one fixed encoding; Windows-originated filesystems (e.g. Joliet, VFAT,
NTFS) store them in UTF-16. Let's suppose you use a Latin-1 locale and
enter a Joliet (non-RockRidge) .iso image and copy files out of it. The
filename _must_
be converted since UTF-16 is not usable on Unices. The user expects the
software (mc+vfs) to convert it to Latin1 since this is his locale and
most likely all other files have accents in this locale. But this
conversion might fail due to unrepresentable characters. What to do in
this case? Probably the best way is to imitate the kernel's behavior when
you mount such an image with iocharset=iso-8859-1 and try to do the same
operation. It is one particular error code, I think. And how to list the
contents of that directory? Nice questions... Since it's quite unlikely
that all software the vfs plugins invoke are able to handle this situation
in the same consistent way, my guess is that it's cleaner to force UTF-8
in the vfs communication, and then a Latin-1 mc can handle invalid entries
it receives from the vfs plugin.
(Just a side note: once the vfs interface is cleaned up, it's time to
revisit other issues, e.g. 32-bit (64-bit??) UID/GID, >2GB files,
nanosecond timestamp resolution etc... - are all these supported?)
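The Joliet dilemma above can be illustrated with iconv(3): converting a
UTF-16 name to UTF-8 always succeeds, while converting it to a legacy
charset such as Latin-1 fails on unrepresentable characters.
convert_name() is an illustrative sketch, not a vfs API:

```c
/* Sketch: convert a UTF-16LE filename (as stored on a Joliet image)
 * into a given target charset, refusing any lossy conversion. */
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Returns 0 on success, -1 (errno set) on failure. */
int convert_name(const char *to_cs, const char *in, size_t inlen,
                 char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to_cs, "UTF-16LE");
    if (cd == (iconv_t) -1)
        return -1;

    char *inp = (char *) in, *outp = out;
    size_t inleft = inlen, outleft = outsize - 1;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    int saved = errno;
    iconv_close(cd);

    if (r == (size_t) -1) { errno = saved; return -1; }
    if (r != 0) { errno = EILSEQ; return -1; }  /* lossy conversion */
    *outp = '\0';
    return 0;
}
```

With to_cs = "UTF-8" this never fails for valid input, which is the
argument for forcing UTF-8 in the vfs communication; with "ISO-8859-1" a
name containing, say, a euro sign makes it return -1, and the caller must
then decide what the panel shows and what the copied file gets named.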
- Currently mc supports both ncurses and slang backend via a common wrapper.
Both libraries support Unicode, but in a different way: slang works with
UTF-8 while ncurses works with wchar (practically UCS4). If only lower
level ncurses routines are used, UTF-8 can be used too, I don't know if
this is the case in mc. Someone experienced with mc's internals should
examine whether dropping support for one of these libraries would save
noticeable developer resources or not. At this moment the resources to
develop mc are IMHO much scarcer than the resources needed on any site
where either ncurses or slang has to be installed in order to install mc.
So if keeping support for only one of these libraries would save work, I
believe that is the way to go. Which library to keep is a good question;
a long time ago I wrote an e-mail here about my opinions on this.
--
Egmont