Midnight Commander within console in UTF-8 mode

Pavel Roskin proski at gnu.org
Mon Oct 22 07:51:04 UTC 2001


> I mentioned that, but probably I should be more specific - in this
> mode console commands work just fine. All the "ls -l", "date" work
> fine and display all set of symbols for current locale OK, and when I
> try to do "more utf8.txt" it's just fine as well, and the file is in
> UTF-8 mode. Console itself does all (well, most :) tricks to keep
> UTF-8 display correctly.

Just in case, make sure that the symbols that MC shows as dotted squares
are displayed correctly by other programs.  Run "ls -l" in that directory,
capture its output and examine it in "more".  For comparison try "less",
which is linked to ncurses.

> MC does show some (e.g. cyrillic) symbols but others just break it
> somehow.

My guess is that MC is most likely not responsible for that.  You could
check how the screen library (included S-Lang, system installed S-Lang,
ncurses) affects this behavior.  Also you could look into
"Options->Display bits..." in MC and make sure that you have "Full 8 bits
output" enabled.

You could also describe your environment better so that I could try to
reproduce it.  I cannot even get "ls -l" to work properly.  What version of
KDE and Konsole are you using?  What locale settings does "konsole" see
(i.e. what does "locale" show when you just run "konsole")?  Do you have
the same effect with xterm?  What version?

> > in Ukrainian, but may happen e.g. in Japanese).  MC assumes that one
> > byte is one character wide (see e.g. name_trunc in src/util.c).
> Actually all non-ASCII (incl. cyrillic) symbols take 2 bytes in UTF-8
> or even 2-3-... in Korean, Japanese etc.
> I don't know how much output of console and MC differ but the code
> can always be borrowed as far as it's GPL. May be this is more the
> issue with Curses library and less with MC itself.
> But if the world is moving to Unicode this should be done somehow.

I'm afraid that you missed my point.  The problem is not that the MC code
"differs" from something else.  The problem is that the code makes the
wrong assumption that one byte occupies one character cell.

I don't think there is anything that can be borrowed from the programs
that don't care much about the layout of their output on the screen (such
as "ls" and "date").

The right solution would be using mbswidth() (from gettext) instead of
strlen() to calculate the width of the strings on the screen.
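
To illustrate the difference between bytes and character cells, here is a
minimal sketch.  It is not MC code and it is not mbswidth(): the helper
name utf8_columns() is made up for this example, it assumes the input is
valid UTF-8, and it simply counts code points.  That is only an
approximation, since double-width (e.g. Japanese) and combining characters
would still be counted as one cell each, which a real width function has
to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of MC): estimate how many character
 * cells a UTF-8 string occupies by counting code points.
 * Continuation bytes have the bit pattern 10xxxxxx and never start
 * a new character, so they are skipped.
 * Assumption: every code point is exactly one cell wide.
 */
size_t utf8_columns(const char *s)
{
    size_t cols = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            cols++;
    }
    return cols;
}
```

For two Cyrillic letters encoded as "\xd0\xbc\xd1\x81", strlen() reports 4
while utf8_columns() reports 2, which is the number of cells they take on
the screen.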

There are also places in MC where it splits strings at a certain point.  
It should be ensured that the split is only done at multibyte character
boundaries.  A multibyte-aware implementation of name_trunc() would be
very non-trivial.  The worst thing is that it can affect performance.
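
As a rough sketch of what "splitting only at character boundaries" means
for UTF-8, the function below finds a safe cut position at or before a
given byte offset.  The name utf8_safe_cut() is hypothetical, the input is
assumed to be valid UTF-8, and other multibyte encodings would need the
mbrtowc() family instead.  This is not how name_trunc() works today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of MC): return the largest length
 * not exceeding max_bytes at which the UTF-8 string s can be cut
 * without splitting a multibyte sequence.
 */
size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len, cut;

    assert(s != NULL);
    len = strlen(s);
    if (max_bytes >= len)
        return len;             /* nothing needs to be cut */

    cut = max_bytes;
    /* s[cut] is the first byte that would be dropped.  If it is a
     * continuation byte (10xxxxxx), the cut falls inside a character,
     * so back up to the lead byte of that character. */
    while (cut > 0 && ((unsigned char) s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For valid UTF-8 the loop backs up at most three bytes, so the boundary
check itself is cheap.  The expensive part is more likely to be
recomputing widths for every file name on each redraw.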

> I can't live w/out MC and that's the only thing that stops me turning
> completely to UTF, I just got enough of all those pesky encodings
> (especially cyrillic set).

Either you or somebody else fixes MC, or you will need some workarounds.  
First of all, don't use LC_ALL - it cannot be overridden by other locale
variables.  Use LANG instead.  Set LC_TIME and LC_MESSAGES to "C" either 
globally or for MC only.  Use an external UTF-8 aware viewer and editor.

This will help you "survive" until MC supports multibyte characters.

-- 
Regards,
Pavel Roskin







