Request for discussion - how to make MC unicode capable

Egmont Koblinger egmont at uhulinux.hu
Mon Feb 26 12:56:49 UTC 2007


On Sun, Feb 25, 2007 at 02:41:45PM +0100, Leonard den Ottolander wrote:
> Just a few thoughts:
> 
> - Because multibyte is rather more memory hungry I think the user should
> still have the option to toggle the use of an 8bit path either in the
> interface or at compile time. This means where the UTF-8 patches replace
> paths we should preferably implement two paths.

Multibyte is memory hungry if you use UCS-4 internally, which I don't
recommend (e.g. viewing a 10MB log file would need 40MB of memory - this
would really be awful). But if you use Latin-1, UTF-8, or whatever internally,
there's no problem. My proposal is to keep storing the original byte
sequences in memory; that way memory consumption doesn't grow in the
8-bit case.

On the other hand, separate execution paths should be avoided as much as
possible; I hope it's needless to explain why. Most of the glibc functions,
and the wrappers we could write around them, are perfectly able to handle
every charset, no matter whether it's 8-bit or UTF-8 or something else. E.g.
if we implement a general mbstrlen() that returns the number of Unicode
entities, and a strwidth() that returns the display width, they'll work both
in UTF-8 and in Latin-1. In Latin-1 they'll always return the same value,
but it's not worth branching the code just because of this and using separate
code for the 8-bit cases. Just write and test one piece of code: the general
case that covers both UTF-8 and the 8-bit ones, and probably EUC-JP and
others too.
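
To make that concrete, here is a minimal sketch of what such charset-agnostic
wrappers could look like, built on the standard mbrtowc()/wcwidth() calls and
assuming the charset comes from the current locale (set up elsewhere with
setlocale(LC_CTYPE, "")). The names mbstrlen() and strwidth() are just the
ones used above, not an actual mc API, and the error handling is deliberately
simplistic:

#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Number of characters (Unicode entities) in a multibyte string. */
size_t mbstrlen(const char *s)
{
    mbstate_t st;
    size_t len = 0, n;

    memset(&st, 0, sizeof st);
    while (*s != '\0') {
        n = mbrtowc(NULL, s, MB_CUR_MAX, &st);
        if (n == (size_t) -1 || n == (size_t) -2) {
            memset(&st, 0, sizeof st);  /* invalid/incomplete: count the byte, resync */
            n = 1;
        } else if (n == 0) {
            n = 1;
        }
        s += n;
        len++;
    }
    return len;
}

/* Display width in terminal cells; invalid bytes count as one cell. */
int strwidth(const char *s)
{
    mbstate_t st;
    wchar_t wc;
    size_t n;
    int width = 0, w;

    memset(&st, 0, sizeof st);
    while (*s != '\0') {
        n = mbrtowc(&wc, s, MB_CUR_MAX, &st);
        if (n == (size_t) -1 || n == (size_t) -2) {
            memset(&st, 0, sizeof st);  /* resync after bad input */
            n = 1;
            w = 1;                      /* shown as a replacement character */
        } else {
            if (n == 0)
                n = 1;
            w = wcwidth(wc);
            if (w < 0)
                w = 1;                  /* non-printable: also one replacement cell */
        }
        s += n;
        width += w;
    }
    return width;
}

In a Latin-1 locale both functions simply return the byte count, so the 8-bit
case costs nothing extra and needs no separate path.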

> - I suppose a lot of the code of the UTF-8 patch can be reused, only we
> will need to add iconv() calls in the appropriate places. libiconv is
> already expected so not much trouble with the make files there. Iconv
> should only be used for the multibyte path, not the 8bit path. Using the
> multibyte path would still enable users to translate from one 8bit
> charset to another.

As said above, I think different paths should be avoided. As discussed in my
previous mail, the story is not so black and white (8-bit vs. UTF-8); there
are mixed scenarios as well (viewing a UTF-8 file in a Latin-1 terminal, or a
Latin-1 file in a UTF-8 terminal, etc.).
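
For illustration only, one of those mixed cases could be handled with a plain
iconv() pass from the file's charset to the terminal's. The function name,
the '?' fallback and the fixed-size output buffer below are my own
assumptions, not a design proposal:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Convert 'in' from the file charset to the terminal charset into 'out';
 * unrepresentable or invalid input is replaced with '?'. */
static void convert_for_display(const char *from, const char *to,
                                char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);
    size_t inleft = strlen(in), outleft = outsize - 1;

    if (cd == (iconv_t) -1) {
        snprintf(out, outsize, "%s", in);    /* give up, show the raw bytes */
        return;
    }
    while (inleft > 0 && outleft > 0) {
        if (iconv(cd, &in, &inleft, &out, &outleft) != (size_t) -1)
            continue;                        /* everything converted so far */
        if ((errno != EILSEQ && errno != EINVAL) || outleft == 0)
            break;                           /* E2BIG or no room left */
        in++; inleft--;                      /* skip the offending byte */
        *out++ = '?'; outleft--;             /* and emit a substitute   */
    }
    *out = '\0';
    iconv_close(cd);
}

Viewing a UTF-8 file on a Latin-1 terminal would then be something like
convert_for_display("UTF-8", "ISO-8859-1", line, buf, sizeof buf), and the
opposite case just swaps the charset names.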

> - Unsupported character substitution character should be an ini option
> (and define some defaults for all/many character sets). (I'm not sure
> question mark is supported in all character sets.)

I don't think mc should support any non-ASCII-compatible (e.g. EBCDIC)
character sets. They'd make things much, much more complicated and would
result in a feature probably no one would ever use. The question mark is
available in all other character sets.

I really don't care whether the "unsupported character" (e.g. when mc wants
to display a kanji but is unable to do so since the terminal is Latin-1) is
configurable or not (actually a hardcoded inverted question mark is fine for
me) -- it's not an important issue at all.

If a UTF-8 terminal is used (which is now the case in most Linux
distributions, at least in every distribution that matters, in my eyes),
then U+FFFD is the right character for invalid byte sequences, and I think it
could also be used to denote non-printable (!iswprint()) characters.
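
As a rough sketch of what that substitution could look like on a UTF-8
terminal (again assuming a UTF-8 locale has been set with
setlocale(LC_CTYPE, ""); print_sanitized() is a made-up name, not mc code):

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Print 's' to stdout, replacing anything unprintable with U+FFFD. */
static void print_sanitized(const char *s)
{
    static const char replacement[] = "\xEF\xBF\xBD";   /* U+FFFD in UTF-8 */
    mbstate_t st;
    wchar_t wc;
    size_t n;

    memset(&st, 0, sizeof st);
    while (*s != '\0') {
        n = mbrtowc(&wc, s, MB_CUR_MAX, &st);
        if (n == (size_t) -1 || n == (size_t) -2) {
            memset(&st, 0, sizeof st);     /* resync after an invalid byte */
            fputs(replacement, stdout);
            s++;
            continue;
        }
        if (n == 0)
            n = 1;
        if (iswprint(wc))
            fwrite(s, 1, n, stdout);       /* pass valid bytes through untouched */
        else
            fputs(replacement, stdout);
        s += n;
    }
}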

> - Users should be able to set character set per directory (mount). Of
> course there should be a system wide default taken from the environment
> (but also overridable).

No. MC should not try to fix what's incorrect even outside mc. For vfat-like
file systems, there's the iocharset= mount option. For Linux file systems
there's no such option, so it really sucks if you have two file systems
using two different encodings; but if it bothers you, either use convmv to
convert these files, or patch the kernel so that Linux file systems support
iocharset=... and filesystemcharset=... mount options that do this
conversion. If there's no way for you to see the filenames correctly
throughout your system with the echo or ls commands, it's not mc's job to
fix it.

I do believe there are many, many things to do in mc with small developer
resources. I can't even see when mc will be able to properly support UTF-8
on systems that are properly set up. Based on the experience of converting
a full distro from Latin-2 to UTF-8, I must say that mc is the only
important piece of software where UTF-8 would be necessary but the mainstream
version completely lacks it. It is quite urgent to do something about it.
Unfortunately I can't invest much time in it either. Let's try to do no
more than supporting properly set up systems. Let's not try to provide
workarounds for system misconfigurations and such.

I believe that if you see different filename encodings on your system, then
your system is probably not set up properly. Go and fix it _there_ and leave
mc's developers alone. Having various encodings _inside_ text files is a
different issue, however; mc should deal with that...

And, by the way, is there any reason not to trust the environment variables,
such that we'd need a way to override them? Yet again it's a system
configuration issue: if your env vars are broken, go and fix them.
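
To show how little is needed when the environment is trusted, here's a tiny
self-contained example using the standard nl_langinfo(CODESET); the
MC_CHARSET override variable in it is purely hypothetical, just to show where
an override could hook in if anyone insisted on one:

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *charset;

    setlocale(LC_ALL, "");                 /* honour LANG / LC_CTYPE / LC_ALL */

    charset = getenv("MC_CHARSET");        /* hypothetical override */
    if (charset == NULL || *charset == '\0')
        charset = nl_langinfo(CODESET);    /* e.g. "UTF-8", "ISO-8859-1" */

    printf("using charset: %s\n", charset);
    return 0;
}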

> - Copy/move dialogs should have a toggle to iconv the file name or do a
> binary name copy.

No. See above.

> - Maybe copy/move dialogs should also have a toggle to iconv file
> content, which could be quite usable for text files.

Maybe, but in that case why should I have to copy/move a file in order to
convert it? It could be a completely separate module, under the File or
Command menu. Yes, it would be nice to see it, but IMHO it's quite irrelevant
to the topic we're about to discuss.



-- 
Egmont


