unicode encoding issues

Adam Williamson

2015-05-01 02:16:55 UTC

So at the time we turned anaconda translations into unicodes I guessed
we'd just be swapping one set of UnicodeDecodeErrors for another on the
live images; unfortunately it seems like that's what's happening. We've
already found two new UnicodeDecodeErrors in 22 Final TC1 that have been
caused by turning translations into unicodes:

https://bugzilla.redhat.com/show_bug.cgi?id=1217504
https://bugzilla.redhat.com/show_bug.cgi?id=1217610

and I suspect this bug is somehow caused by the same thing, though it's
not as clear cut:

https://bugzilla.redhat.com/show_bug.cgi?id=1217411

Looking at the first two bugs, I checked through anaconda for instances
of str(e) (on the basis that 'e' is conventionally used for exceptions):
there are 30. I also used a somewhat dumb grep to try and find cases
where we do a %s substitution into a translation:

grep -R "_(.*%" *

and that gives 244 cases.
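
To make the failure mode concrete, here's a minimal sketch of what I
understand is going on in those cases. Python 2 does the decode
implicitly whenever a byte string meets a unicode; the sketch does it
explicitly so it behaves the same on any Python. py2_implicit_coerce()
and the sample message are made up for illustration, they are not
anaconda code:

```python
# Hypothetical helper, not part of anaconda: spells out the decode that
# Python 2 performs implicitly when a byte string is combined with a
# unicode object (e.g. substituting str(e) into a unicode translation).
def py2_implicit_coerce(byte_string, default_encoding="ascii"):
    return byte_string.decode(default_encoding)

# UTF-8 encoded bytes for u"caf\xe9"; stands in for a non-ASCII message
message = b"caf\xc3\xa9"

try:
    py2_implicit_coerce(message)
    ascii_ok = True
except UnicodeDecodeError:
    # This is the traceback we keep seeing: ascii can't decode byte 0xc3
    ascii_ok = False

# With utf-8 as the default encoding the same bytes decode cleanly
utf8_result = py2_implicit_coerce(message, "utf-8")
```

So any one of those 30 + 244 sites can blow up, but only when the byte
string involved actually contains non-ASCII bytes.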

One strand of this whole nightmare that we kinda lost track of is that
*this doesn't always go wrong*. Sometimes Python does somehow know to
use utf-8 rather than ascii. sgallagh and I got some way towards
investigating this back in the F21 timeframe, but eventually we moved on
to other angles. I thought it might have something to do with the
setup_locale() call in welcome.py, but the anaconda script itself
already calls that much earlier, so now I'm not so sure.
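
If anyone wants to poke at this, something like the following could be
dropped in at a few points to see what Python thinks its encodings are.
It's only a debugging sketch: encoding_report() is a name I made up, and
as far as I know only the "default" entry governs the implicit
str/unicode conversions; the others are there for comparison.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings the running interpreter reports."""
    return {
        # used for implicit str <-> unicode conversions in Python 2
        "default": sys.getdefaultencoding(),
        # used for file names
        "filesystem": sys.getfilesystemencoding(),
        # derived from the locale settings, without changing the locale
        "preferred": locale.getpreferredencoding(False),
        # may be None when stdout is not a terminal
        "stdout": getattr(sys.stdout, "encoding", None),
    }

if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%s: %s" % (name, value))
```

Comparing the output before and after setup_locale(), on a live and a
non-live, might at least narrow down where the difference comes from.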

We still have the option of the big hammer to force the 'default'
encoding to be utf-8 on the lives as well as non-lives. I am (with
extreme regret) reading
http://www.gossamer-threads.com/lists/engine?do=post_view_flat;post=800861;page=1;sb=post_latest_reply;so=ASC;mh=25;list=python
again, which I think is where we get the objections to doing that. And
it sure sounds bad:

===================

"If you change these, you are on your own and strange things will
start to happen. The default encoding does not only affect
the translation between Python and the outside world, but also
all internal conversions between 8-bit strings and Unicode.

Hacks like [this] are just
downright wrong and will cause serious problems since Unicode
objects cache their default encoded representation."

"The key problem is that objects that compare equal should also hash
equal. String and Unicode hashing has been constructed so that byte
strings hash the same as if interpreted as latin-1. If, say, utf-8
would be the system encoding, then, for some values of S,

S == unicode(S) and hash(S) != hash(unicode(S))

That, in turn, *will* break dictionaries."

===================

But then - none of this seems unique to the live image case. On
non-lives, we *already use exactly the hack they say is so terrible* -
it's the whole reason we have pyanaconda/sitecustomize.py:

import sys
# pylint: disable=no-member
sys.setdefaultencoding('utf-8')

I may be missing something, but so far as I can see, while we would have
to implement the hack slightly differently in the live case, the
different implementation isn't any *more* dangerous than the one we're
already using in the non-live case. The only thing different about the
live case is the use of reload(sys) vs. using the sitecustomize trick,
and so far as I can see, none of the objections to this hack are about
the use of reload(sys), they're about the use of
sys.setdefaultencoding().

If I'm wrong about that, do enlighten me :)
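
For reference, the reload(sys) variant I have in mind would look roughly
like this. It's a sketch, not a patch: force_utf8_default() is a
hypothetical name, and where exactly it would get called on the lives is
left open. The version check is only there so the snippet also runs on
Python 3, where the default is already utf-8 and setdefaultencoding()
doesn't exist.

```python
import sys

def force_utf8_default():
    """Hypothetical helper: make utf-8 the default encoding."""
    if sys.version_info[0] == 2:
        # site.py deletes sys.setdefaultencoding at startup, which is why
        # sitecustomize.py can call it but later code can't. reload(sys)
        # brings the attribute back.
        reload(sys)  # pylint: disable=undefined-variable
        sys.setdefaultencoding("utf-8")  # pylint: disable=no-member
    return sys.getdefaultencoding()
```

Either way the end state is the same call to sys.setdefaultencoding()
that sitecustomize.py already makes.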

Otherwise, though, what exactly do we have to lose? I'm happy with the
idea that it's the wrong thing to do. We do lots of wrong things. Some
days I do 30 wrong things before breakfast. If the only other
alternative is poking through the entire installer trying to trigger
every goddamn translated string to find all the broken cases, let's do
something wrong.

--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net