Perl 5.28.0 released

Version 5.28.0 of the Perl language has been released. "Perl 5.28.0 represents approximately 13 months of development since Perl 5.26.0 and contains approximately 730,000 lines of changes across 2,200 files from 77 authors". The full list of changes can be found in the release's perldelta document; some highlights include Unicode 10.0 support, string- and number-specific bitwise operators, a change to more secure hash functions, and safer in-place editing.

From:		Sawyer X <xsawyerx-AT-gmail.com>
To:		"perl5-porters-AT-perl.org" <perl5-porters-AT-perl.org>
Subject:		Perl 5.28.0 is now available!
Date:		Fri, 22 Jun 2018 20:08:48 -0600
Message-ID:		<ec5fb9f4-afd9-9bad-0c4e-d2288376f3a5@gmail.com>
Cc:		noc-AT-metacpan.org

  When we look at modern man we have to face the fact that modern man
  suffers from a kind of poverty of the spirit which stands in glaring
  contrast with his scientific and technological abundance. We've
  learned to fly the air as birds, we've learned to swim the seas as
  fish, yet we haven't learned to walk the earth as brothers and
  sisters.

    -- Martin Luther King Jr., 1967

We are delighted to announce perl v5.28.0, the first stable release of
version 28 of Perl 5.

You will soon be able to download Perl 5.28.0 from your favorite CPAN
mirror or find it at:

https://metacpan.org/release/XSAWYERX/perl-5.28.0/

SHA1 digests for this release are:

  0622f86160e8969633cbd21a2cca9e11ae1f8c5a  perl-5.28.0.tar.gz
  c0e9e7a0dea97ec9816687d865fd461a99ef185c  perl-5.28.0.tar.xz

You can find a full list of changes in the file "perldelta.pod" located
in the "pod" directory inside the release and on the web at

https://metacpan.org/pod/release/XSAWYERX/perl-5.28.0/pod...

Perl 5.28.0 represents approximately 13 months of development since Perl
5.26.0 and contains approximately 730,000 lines of changes across 2,200
files from 77 authors.

Excluding auto-generated files, documentation and release tools, there
were approximately 580,000 lines of changes to 1,300 .pm, .t, .c and .h
files.

Perl continues to flourish into its fourth decade thanks to a vibrant
community of users and developers. The following people are known to
have contributed the improvements that became Perl 5.28.0:

Aaron Crane, Abigail, Ævar Arnfjörð Bjarmason, Alberto Simões, Alexandr
Savca, Andrew Fresh, Andy Dougherty, Andy Lester, Aristotle Pagaltzis,
Ask Bjørn Hansen, Chris 'BinGOs' Williams, Craig A. Berry, Dagfinn
Ilmari Mannsåker, Dan Collins, Daniel Dragan, David Cantrell, David
Mitchell, Dmitry Ulanov, Dominic Hargreaves, E. Choroba, Eric Herman,
Eugen Konkov, Father Chrysostomos, Gene Sullivan, George Hartzell,
Graham Knop, Harald Jörg, H.Merijn Brand, Hugo van der Sanden, Jacques
Germishuys, James E Keenan, Jarkko Hietaniemi, Jerry D. Hedden, J. Nick
Koston, John Lightsey, John Peacock, John P. Linderman, John SJ
Anderson, Karen Etheridge, Karl Williamson, Ken Brown, Ken Cotterill,
Leon Timmermans, Lukas Mai, Marco Fontani, Marc-Philip Werner, Matthew
Horsfall, Neil Bowers, Nicholas Clark, Nicolas R., Niko Tyni, Pali, Paul
Marquess, Peter John Acklam, Reini Urban, Renee Baecker, Ricardo Signes,
Robin Barker, Sawyer X, Scott Lanning, Sergey Aleynikov, Shirakata
Kentaro, Shoichi Kaji, Slaven Rezic, Smylers, Steffen Müller, Steve Hay,
Sullivan Beck, Thomas Sibley, Todd Rinaldo, Tomasz Konojacki, Tom
Hukins, Tom Wyant, Tony Cook, Vitali Peil, Yves Orton, Zefram.

The list above is almost certainly incomplete as it is automatically
generated from version control history. In particular, it does not
include the names of the (very much appreciated) contributors who
reported issues to the Perl bug tracker.

Many of the changes included in this version originated in the CPAN
modules included in Perl's core. We're grateful to the entire CPAN
community for helping Perl to flourish.

For a more complete list of all of Perl's historical contributors,
please see the AUTHORS file in the Perl source distribution.

We expect to release perl v5.29.0 tomorrow, followed by v5.29.1 on July
20th. The next major stable release of Perl 5, version 30.0, should
appear in June 2019.

In hugs and bugs,
Sawyer X.

Posted Jun 25, 2018 14:56 UTC (Mon) by excors (subscriber, #95769)

The changelog says "Mixed Unicode scripts are now detectable", and gives an example like /(*script_run:\d+)/ to match digits only if they come from the same script. That surprised me since I thought \d was just a synonym for [0-9], i.e. ASCII only; but apparently it defaults to matching any Unicode digit, and has done so for ages.

That seems rather weird behaviour - if you're using a regexp to check if something looks like a number, surely that's because you're about to pass it into int($n) or $n+0 etc and use it as an actual number? And int() only understands ASCII digits. If you're going to do some Unicode-aware number parsing, you need a library to do that for you, and then it would be no extra hassle to use that library's regexps like (hypothetically) /${Unicode::NumberParser::re_digit}+/ when you specifically want that behaviour, with the bonus of ensuring compatibility between the matching and parsing. And you can still use /\p{Digit}/ if you simply want all Unicode digits. But defaulting to Unicode for \d seems like it's going to achieve little beyond a proliferation of bugs.
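The mismatch described above can be seen in a few lines. This is only a minimal sketch: the variable names are made up for illustration, and the digit is written as an escape so the source stays ASCII.

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE (illustrative choice of a non-ASCII digit).
my $arabic_three = "\x{0663}";

# \d matches it: the string holds a code point above 255, so Unicode
# rules apply to the match.
my $matches_d = $arabic_three =~ /^\d$/ ? 1 : 0;

# An explicit ASCII range does not match it.
my $matches_ascii = $arabic_three =~ /^[0-9]$/ ? 1 : 0;

# Numeric conversion only understands ASCII digits, so the value is 0.
# Perl would normally warn that the argument isn't numeric; the warning
# is silenced here to keep the output clean.
my $as_number = do { no warnings 'numeric'; $arabic_three + 0 };

print "matches \\d: $matches_d, matches [0-9]: $matches_ascii, as number: $as_number\n";
```

So a string can pass a /^\d+$/ check and still turn into 0 when used as a number.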

Posted Jun 25, 2018 15:49 UTC (Mon) by MarcB (guest, #101804)

Well, if you don't want to extend \d to Unicode, then you also cannot extend \w - and that would be weird.

But I somewhat agree that Perl got the defaults backwards, for the simple reason that matching less than expected causes errors that are easier to spot than matching more than expected (because positive tests are much more common than negative ones).
I suspect a lot of people wrote expressions that match much more than they expect them to match (I know I certainly did, until about two years ago).

In any case, if you want ASCII-characters only, Perl has the "a" modifier.

An example using "unichars" from Unicode::Tussle (it prints out all characters matching a given regular expression):
$ ./unichars '/\d/' | wc -l
370
$ ./unichars '/\d/a' | wc -l
10
$ ./unichars '/\w/' | wc -l
11286
$ ./unichars '/\w/a' | wc -l
63
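The same effect can be checked without Unicode::Tussle. The sketch below tests two arbitrarily chosen non-ASCII characters with and without the /a modifier; the variable names are illustrative only, and /a requires Perl 5.14 or later.

```perl
use strict;
use warnings;

my $kannada_five = "\x{0CEB}";   # KANNADA DIGIT FIVE
my $greek_alpha  = "\x{03B1}";   # GREEK SMALL LETTER ALPHA

# Default rules: both characters match their Unicode-aware class.
my $d_default = $kannada_five =~ /^\d$/ ? 1 : 0;
my $w_default = $greek_alpha  =~ /^\w$/ ? 1 : 0;

# With /a, \d and \w are restricted to their ASCII ranges.
my $d_ascii = $kannada_five =~ /^\d$/a ? 1 : 0;
my $w_ascii = $greek_alpha  =~ /^\w$/a ? 1 : 0;

# ASCII input is unaffected by /a.
my $plain_ascii = "7" =~ /^\d$/a ? 1 : 0;

print "default: \\d=$d_default \\w=$w_default; with /a: \\d=$d_ascii \\w=$w_ascii; '7' with /a: $plain_ascii\n";
```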

Posted Jun 25, 2018 16:17 UTC (Mon) by epa (subscriber, #39769)

There is a difference in that you can expect to capture a string of digits \d+ and then treat it as a number and use it in arithmetic -- particularly in languages like Perl where a number and its string representation are more or less the same thing. But there aren't arithmetic operations on 'words' or any equivalent expectation that you could match \w+ and then do operations on it. If the programming language itself supports Unicode in identifier names, it makes some sense for \w+ to continue to match an identifier. But Perl doesn't, AFAIK, support oddball non-ASCII number digits in numeric literals.

Added to that is the fact that the numerals 0-9 are commonly used to write numbers in every language, even if it doesn't use the letters A-Z and even if it has its own separate set of number characters. So I would be quite happy to restrict \d to match digits 0-9 while still extending the meaning of \w. In fact, I can't see that matching "any character which might be a numeric one in some script somewhere" is even a useful thing to match in any practical program.

Posted Jun 25, 2018 17:07 UTC (Mon) by JoeBuck (subscriber, #2330)

A number of languages have distinct characters for writing numbers. For example Kannada, the local language in the Indian state that includes Bangalore, has Unicode code points U+0CE6 through U+0CEF for its ten digits. (I've been there several times and love the curlicue Kannada characters, though I can't read a bit of it; fortunately for me, English is widely spoken and used on signs.)

Posted Jun 26, 2018 15:16 UTC (Tue) by epa (subscriber, #39769)

Right, and in English too we have "twenty-three" as an alternative to "23". Do people in Bangalore, if their mobile phone is localized to the Kannada language, enter phone numbers using the Kannada digits? Do spreadsheet applications display them instead of 0-9? Does your computer keyboard have them on the numeric keypad? My point is not that other ways of writing numbers don't exist, but that the digits 0-9 are used almost everywhere for the kind of technical or scientific number representation that you'd typically want to match with a regular expression \d+ and then process further.

Posted Jun 26, 2018 20:43 UTC (Tue) by karkhaz (subscriber, #99844)

Not sure about Bangalore and Kannada, but in most Arabic-speaking countries the answer to all three of your questions is "yes".

This isn't such a challenge to deal with, because the numerical system is exactly the same as in the West (i.e., the position of a digit within the number gives its magnitude); the only difference is that the actual numerals are the characters ٠١٢٣٤٥٦٧٨٩ rather than 0-9. They're even written with the highest-magnitude digits on the left, just like Western numbers, even though Arabic text is written right-to-left.

So to convert an East-Arabic number string into an int, it suffices to subtract a constant from each character in the string (to turn it into an ASCII number string) and then do the type conversion.
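In Perl the constant shift can be written as a transliteration. The following is a rough sketch, assuming the input uses only the Arabic-Indic digits U+0660 through U+0669; the helper name arabic_indic_to_int is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map Arabic-Indic digits (U+0660..U+0669) onto
# ASCII 0-9, then let Perl's ordinary numeric conversion do the rest.
sub arabic_indic_to_int {
    my ($text) = @_;
    (my $ascii = $text) =~ tr/\x{0660}-\x{0669}/0-9/;
    # Refuse anything that is not purely digits after the mapping.
    return $ascii =~ /\A[0-9]+\z/ ? $ascii + 0 : undef;
}

# Digits four, two, one in storage order, most significant digit first.
my $n = arabic_indic_to_int("\x{0664}\x{0662}\x{0661}");
print "$n\n";   # prints 421
```

The string is not reversed before conversion: the digits are stored most-significant first, and only their placement within surrounding right-to-left text is a display matter.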

Posted Jul 16, 2018 7:43 UTC (Mon) by epa (subscriber, #39769)

Don't you also have to reverse the string?

Posted Jul 16, 2018 18:58 UTC (Mon) by dtlin (subscriber, #36537)

Nope. http://unicode.org/reports/tr9/ opens with exactly this case.

> However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.
> However, because these right-to-left scripts use digits that are written from left to right, the text is actually bidirectional: a mixture of right-to-left and left-to-right text.

Arabic letters have Bidi_Class=AL (Arabic letter, strongly RTL), while Arabic digits have Bidi_Class=AN (Arabic number, weakly LTR).

Posted Jul 16, 2018 20:31 UTC (Mon) by zlynx (guest, #2285)

Claiming that Arabic numerals are RTL is funny because Western Latin languages copied Arabic numerals from Arabic including the direction which is LTR.

Arabic numerals are /supposed to be/ read right to left in little-endian order. Notice that when reading a number, we have to first count all of the digits to determine hundreds, thousands, millions, etc, before we start talking. Instead all these years we could have been reading them as "1 and 20 and 400" if only we'd written them the other direction.

We also have all of the strange formatting exceptions for numbers so that they align to the right. Note that in English that's the only thing we right-align. A big hint that we write them in the wrong order.

Posted Jul 16, 2018 20:46 UTC (Mon) by karkhaz (subscriber, #99844)

> Claiming that Arabic numerals are RTL is funny because Western Latin languages copied Arabic numerals

I don't think dtlin claimed that at all; their comment was that Arabic digits have a LTR class. However, there's a subtle point here: what we call "Arabic numerals" (0123456789) were indeed copied from Arabic, but I was talking about the numerals that are currently used in most Arabic-speaking countries (٠١٢٣٤٥٦٧٨٩, which I referred to as East Arabic numerals to disambiguate).

> Arabic numerals are /supposed to be/ read right to left in little-endian order

I'm not sure what your source for this is. I suppose it makes sense when you have a number embedded in some RTL text. However, I speak Arabic (though I can neither read nor write it), and numbers are not pronounced as "1 and 20 and 400". The order is actually a bit jumbled: that particular number is pronounced "four hundred and one and twenty".

In general, higher-magnitude digits are uttered before lower-magnitude ones in spoken Arabic, just as in English. The exceptions are that units are uttered before tens ("one and twenty"), and that the numbers from eleven to twenty have special names (as they do in English, i.e. we say "eleven" as opposed to "one and ten").

Posted Jul 16, 2018 21:02 UTC (Mon) by zlynx (guest, #2285)

Ah. I must have been confused by the "1 and 20" bit. I'd been told that somewhere and I thought it generalized to higher multiples of ten.

Posted Jun 25, 2018 16:34 UTC (Mon) by excors (subscriber, #95769)

> Well, if you don't want to extend \d to Unicode, then you also cannot extend \w

Why not?

\w is documented as matching "word" characters, so it's inherently quite vague - if someone wants to strictly validate a string against [a-zA-Z0-9_] then I suspect they'd nearly always do it explicitly with that character range, instead of trying to use \w as shorthand, because the precise meaning of \w is non-obvious even in the ASCII world. And when people understand it as matching "words" in some vague best-effort sense, it's unsurprising that it should include French words and Greek words etc, so the Unicode behavior makes sense. That contrasts with \d which is documented as matching decimal digits, and obviously digits are [0-9], so it's very tempting to (wrongly) use \d when you really want [0-9]. Given the different situation for \w and \d, it seems reasonable to have different solutions for them.

Posted Jun 25, 2018 17:24 UTC (Mon) by MarcB (guest, #101804)

That depends on what you typically use regular expressions for. If you are often dealing with protocols that are defined as ASCII, \w is almost as useful as \d, and in fact very precisely defined. We use \w - and also \s - a lot, to precisely match the characters they match. (I consider it another oversight that \h is not affected by /a - this makes \h mostly useless.)

And if you are doing input validation, capturing too much with \w is as bad as doing so with \d. In fact, it might be worse: Accidentally letting through non-ASCII digits is very likely to explode noisily, yet early, while doing so with word characters can lead to subtle, yet nasty errors that might strike years later.

Imagine letting through non-normalized Unicode into a system that is able to store Unicode, but was designed with ASCII in mind, and produces wrong or strange results if "encoded bit strings are identical <=> decoded strings are identical" is no longer true. Examples include Linux filesystems, but also many databases, which for all intents and purposes would suddenly seem to violate unique constraints and wrongly fail equality checks.
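A small illustration of that failure mode, using the core Unicode::Normalize module. The sample character is an arbitrary choice and the variable names are illustrative: both spellings pass a \w+ check, yet they are different strings until normalized.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of the same visible letter:
my $precomposed = "\x{0144}";    # LATIN SMALL LETTER N WITH ACUTE
my $decomposed  = "n\x{0301}";   # "n" followed by COMBINING ACUTE ACCENT

# Both pass a default (Unicode-aware) \w+ validation, because \w also
# matches combining marks.
my $both_pass = ($precomposed =~ /^\w+$/ && $decomposed =~ /^\w+$/) ? 1 : 0;

# A plain string comparison treats them as different values.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to NFC they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "both pass \\w+: $both_pass, raw equal: $raw_equal, NFC equal: $nfc_equal\n";
```

With /a on the validation regex, neither spelling would have been accepted in the first place.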

I do not think special-casing \d makes much sense. Yes, it would prevent some errors, but those are the obvious ones. And as I said: /a should be the default, not /u. The additional character to type would not hurt anyone who wants Unicode semantics and it would be the safer thing. People will quickly notice that something they expect to match is not matching, while the inverse is likely only to be discovered through bugs or even by attackers.

Posted Jun 28, 2018 13:32 UTC (Thu) by jrw (subscriber, #69959)

Yep, /a should obviously have been the default, absent something like a use unicode pragma. Especially when your code has to run on servers still running 5.8 where /a is not available.

Posted Jul 11, 2018 19:46 UTC (Wed) by epa (subscriber, #39769)

It reminds me of sort(1) where the behaviour changed to start sorting by locale (even when the input data is 7-bit clean) so you now have to set LC_ALL=C to get back sorting by byte values. Again, there's nothing wrong with locale-based and Unicode-aware behaviour, and it can even be the default in user interface code, but in programming languages and tools the fancy new scheme should be a new option, with the default staying as it was.

Posted Jun 25, 2018 20:20 UTC (Mon) by flussence (guest, #85566)

Human-readable named regex operators are a huge improvement. I've never been able to remember the (?$operator$operand) combinations in Perl 5 when I needed them without resorting to perldoc perlre.
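For comparison, here is a lookahead written both ways. This is a sketch that assumes Perl 5.28 or later; the named forms were introduced as experimental, so the corresponding warning category is switched off. The test strings are arbitrary.

```perl
use 5.028;       # the alphabetic assertion names need at least Perl 5.28
use warnings;
no warnings 'experimental::alpha_assertions';

# Classic punctuation-based lookahead.
my $classic = "foobar" =~ /foo(?=bar)/ ? 1 : 0;

# The same assertion spelled out, and its short alias.
my $named = "foobar" =~ /foo(*positive_lookahead:bar)/ ? 1 : 0;
my $short = "foobar" =~ /foo(*pla:bar)/ ? 1 : 0;

# All three reject input where "foo" is not followed by "bar".
my $reject = "foobaz" =~ /foo(*positive_lookahead:bar)/ ? 1 : 0;

print "classic=$classic named=$named short=$short reject=$reject\n";
```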