
Either clear the Unicode air--or make a release-blocker? (was: Unicode cheatsheet for Perl)

From:  Tom Christiansen <>
To:  Leon Timmermans <>
Subject:  Either clear the Unicode air--or make a release-blocker? (was: Unicode cheatsheet for Perl)
Date:  Sat, 25 Feb 2012 15:28:42 -0700
Message-ID:  <2747.1330208922@chthon>
Cc:  Karl Williamson <>

Maybe that's too strong a subject line, but I *really* want to know *exactly*
why we can't tell people they can use Perl with Unicode in anything but the
most excruciating of all possible ways -- if this is indeed true.

Leon Timmermans <> on Tue, 21 Feb 2012 09:44:15 +0100 wrote:

> I'm not entirely sure what gets fixed by that and what doesn't, it isn't
> documented at all. Looking at the source makes me feel it's a hack IMO, and
> I strongly suspect it is not quite a complete fix: there are just too many
> places that would need to get fixed. I believe the utf8 layer would be the
> right place to do it because that's the only place that almost all Input
> passes through. Fixing it there fixes it almost everywhere (except sysread I
> suppose, but that can be fixed too).

Some folks claim the only "safe" way to use Unicode in Perl is to always make
explicit calls to encode/decode with a bonus FB_CROAK argument.  They claim
that all nine of these perfectly reasonable and common-to-the-99th-percentile
approaches

    #1.   $ perl -C...
    #2.   $ export PERL_UNICODE=...

    #3.     use utf8;

    #4.     use open qw[ :std :utf8            ];
    #5.     use open qw[ :std :encoding(UTF-8) ];

    #6.     binmode(FH, ":utf8");
    #7.     binmode(FH, ":encoding(UTF-8)");

    #8.     open(FH,  "< :utf8",            $path);
    #9.     open(FH,  "< :encoding(UTF-8)", $path);

...are all of them flawed in their not raising exceptions on UTF-8
encoding errors of one sort or another, and that somehow not even...

    #0.     use warnings qw(FATAL utf8);

...is good enough to fix it.

I do not know whether these claims are true.  My own tests suggest this may
not be the whole story, because this behaves as I think it should:

  darwin$ perl -C0 -E 'say for "caf\xE9", "stuff"' | 
	  perl -CS -Mwarnings=FATAL,utf8 -pe 'print "$. "'
  utf8 "\xE9" does not map to Unicode, <> line 1.
  Exit 255


Which seems to say that #0 makes at least #1 safe.  Again, I'm fuzzy on what
the perceived problem actually is.  Maybe they're using autodie or something,
which is known broken. I'm trying to get more info.
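
For anyone who wants to poke at this from a script rather than a shell
pipeline, here is a minimal sketch of the same experiment.  It combines #0
with a ":utf8" layer on an in-memory filehandle.  The helper name
read_lines_utf8 is made up for illustration, and this only exercises one
of the nine forms; it says nothing about the others.

```perl
use strict;
use warnings;
use warnings qw(FATAL utf8);   # item #0 from the numbered list

# Hypothetical helper (not from the original post): read every line of a
# byte string through a :utf8 layer.  With utf8 warnings made fatal,
# malformed input is expected to raise an exception in readline.
sub read_lines_utf8 {
    my ($bytes) = @_;
    open(my $fh, "<:utf8", \$bytes)
        or die "cannot open in-memory handle: $!";
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# Well-formed UTF-8: "caf" followed by the two-byte encoding of U+00E9.
my @good = eval { read_lines_utf8("caf\xC3\xA9\nstuff\n") };
print "well-formed input: read ", scalar(@good), " lines\n";

# Latin-1 byte \xE9, which is not valid UTF-8 on its own.
my @bad = eval { read_lines_utf8("caf\xE9\nstuff\n") };
print "malformed input: ", ($@ ? "died: $@" : "no exception raised\n");
```

If the second call does not die on a given Perl build, that is precisely
the sort of data point this thread needs.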

I also do not know the precise details of these so-called "security" bugs that
Christian references.  I've no reason to disbelieve Christian; I just don't
know the details myself, nor am I asking that they be splatted all over.

What I do know is that telling people that the only "right" or "safe" or
"acceptable" way to use Unicode in Perl is via myriad explicit calls to
&Encode::{utf8_,}{en,de}code(..., FB_CROAK) just doesn't cut it—full stop.
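
For concreteness, the explicit style being pushed looks roughly like the
sketch below, repeated at every input boundary.  The wrapper name
strict_decode is hypothetical, and adding LEAVE_SRC is an assumption on my
part so that the caller's buffer is left untouched.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): decode UTF-8 bytes
# strictly, dying on malformed input instead of substituting characters.
sub strict_decode {
    my ($bytes) = @_;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    return decode("UTF-8", $bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $ok = strict_decode("caf\xC3\xA9");         # well-formed UTF-8
print "decoded ", length($ok), " characters\n";

my $bad = eval { strict_decode("caf\xE9") };   # Latin-1 byte, not UTF-8
print "malformed input rejected: $@" unless defined $bad;
```

One such call is harmless; the complaint is about having to remember it
for every read, on every handle, in every program.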

If there's something so important that it must be done every time to ensure
correct behavior, then that is too important to be left up to the programmer
to forget to do.  It needs to be done for him.

We should not have to endure five more tedious years of people getting
tonguelashed and flamenagged into writing horribly complicated code
all because something deep down in Perl's dwimmer is flawed.

I say five years, not one year, because of how long it takes to get vendors
to get themselves updated.  If this is a legit issue, and we push it off till
2013's v5.18, then it will be a further 2–4 years past that until people can
have reasonable Unicode processing in their vendor Perl.  That puts us into
the 2015–2017 (!) time frame, and... that's just not acceptable, eh?

By that time, one or both of two things will have happened.  Either untold
zillions of lines of code will have been written that either conform to
this ridiculous amount of monkeywork, entrenching a bad pattern forever, or
else untold zillions of lines of code will have been written that ignore it
and are themselves open to the kind of spooky catastrophic failure that
people allude to, thereby rendering all Perl a security hole of Chicken
Little proportions.

Since we won't let either of those happen, I figure that either 

 —— all #1 .. #9 of my numbered points above are completely safe and
    proper, preferably always but minimally along with the #0 fatalization
 —— or else they need to be made so before we dare release v5.16.

Why?  Simple: remove those 9 simple ways to approach implicit Unicode
processing in Perl, and you so gut Perl's Unicode dwimmer that nobody
save the very most diligent and élite [sic] of Perl gurus will ever dare
use Perl with Unicode.  That would be tragic, maybe disastrous even.

Possibly this is all well known, and I just haven't been listening.  Maybe
it's even been fixed, assuming it was ever broken in the first place, which
I'm highly fuzzy on and can neither prove nor disprove with my own meagre
poking at the problem.  If so, I apologize for making much ado about nothing.

I do have notions about what should be happening with encoding layers,
including backwards compatibility concerns versus security concerns;
something along the lines that we're under no obligation to leave our
backdoors standing open for eternity.  Also, I strongly feel that all
encoding errors should croak by default.  I hate garbage in files as
things silently fill them with manglings.  Those should croak if you haven't
explicitly asked to get garbage out.  That shouldn't be a default.
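
A small sketch of the silent mangling in question, using Encode directly
rather than an I/O layer: when decode() is called with no CHECK argument,
a malformed byte is replaced with U+FFFD instead of raising an error.  The
sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# With no CHECK argument, decode() does not die on malformed input;
# it substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
my $text = decode("UTF-8", "caf\xE9 au lait");   # \xE9 is a Latin-1 byte
printf "character at index 3 is U+%04X\n", ord(substr($text, 3, 1));
```

Write $text back out and the original byte is gone for good, with nothing
to say it ever happened.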

I'm still troubled that Encode is *not* one of "our" modules, yet a whole lot
of what we do seems dependent on it.  We can't usefully create new warnings
and classes, errors, encoding names, and exceptions for encodings used
internally if they don't sync up with Encode.  But we have and they don't:
already the utf8 warnings subclasses are broken with Encode.  That makes it
even more important that we get the internal stuff "right". There's a bunch
more where that came from, but I'll save the rest till someone tells me where
we actually stand.

Karl, Leon, and Christian, thank you for your time and insights, both past
and future.  Nothing would please me more than to learn that I've had my cage
rattled for nothing, and that these are all non-issues.


PS: Christian Hansen, if you say your name fast enough, 
                      it rather sounds like my surname. :)

