Perl 5.28.0 released
"Perl 5.28.0 represents approximately 13 months of development since Perl 5.26.0 and contains approximately 730,000 lines of changes across 2,200 files from 77 authors". The full list of changes can be found in perldelta; some highlights include Unicode 10.0 support, string- and number-specific bitwise operators, a change to more secure hash functions, and safer in-place editing.
From: Sawyer X <xsawyerx-AT-gmail.com>
To: "perl5-porters-AT-perl.org" <perl5-porters-AT-perl.org>
Subject: Perl 5.28.0 is now available!
Date: Fri, 22 Jun 2018 20:08:48 -0600
Message-ID: <ec5fb9f4-afd9-9bad-0c4e-d2288376f3a5@gmail.com>
Cc: noc-AT-metacpan.org
When we look at modern man we have to face the fact that modern man suffers from a kind of poverty of the spirit which stands in glaring contrast with his scientific and technological abundance. We've learned to fly the air as birds, we've learned to swim the seas as fish, yet we haven't learned to walk the earth as brothers and sisters.
    -- Martin Luther King Jr., 1967

We are delighted to announce perl v5.28.0, the first stable release of version 28 of Perl 5.

You will soon be able to download Perl 5.28.0 from your favorite CPAN mirror or find it at:

https://metacpan.org/release/XSAWYERX/perl-5.28.0/

SHA1 digests for this release are:

0622f86160e8969633cbd21a2cca9e11ae1f8c5a perl-5.28.0.tar.gz
c0e9e7a0dea97ec9816687d865fd461a99ef185c perl-5.28.0.tar.xz

You can find a full list of changes in the file "perldelta.pod" located in the "pod" directory inside the release and on the web at https://metacpan.org/pod/release/XSAWYERX/perl-5.28.0/pod...

Perl 5.28.0 represents approximately 13 months of development since Perl 5.26.0 and contains approximately 730,000 lines of changes across 2,200 files from 77 authors.

Excluding auto-generated files, documentation and release tools, there were approximately 580,000 lines of changes to 1,300 .pm, .t, .c and .h files.

Perl continues to flourish into its fourth decade thanks to a vibrant community of users and developers. The following people are known to have contributed the improvements that became Perl 5.28.0:

Aaron Crane, Abigail, Ævar Arnfjörð Bjarmason, Alberto Simões, Alexandr Savca, Andrew Fresh, Andy Dougherty, Andy Lester, Aristotle Pagaltzis, Ask Bjørn Hansen, Chris 'BinGOs' Williams, Craig A. Berry, Dagfinn Ilmari Mannsåker, Dan Collins, Daniel Dragan, David Cantrell, David Mitchell, Dmitry Ulanov, Dominic Hargreaves, E. Choroba, Eric Herman, Eugen Konkov, Father Chrysostomos, Gene Sullivan, George Hartzell, Graham Knop, Harald Jörg, H.Merijn Brand, Hugo van der Sanden, Jacques Germishuys, James E Keenan, Jarkko Hietaniemi, Jerry D. Hedden, J. Nick Koston, John Lightsey, John Peacock, John P. Linderman, John SJ Anderson, Karen Etheridge, Karl Williamson, Ken Brown, Ken Cotterill, Leon Timmermans, Lukas Mai, Marco Fontani, Marc-Philip Werner, Matthew Horsfall, Neil Bowers, Nicholas Clark, Nicolas R., Niko Tyni, Pali, Paul Marquess, Peter John Acklam, Reini Urban, Renee Baecker, Ricardo Signes, Robin Barker, Sawyer X, Scott Lanning, Sergey Aleynikov, Shirakata Kentaro, Shoichi Kaji, Slaven Rezic, Smylers, Steffen Müller, Steve Hay, Sullivan Beck, Thomas Sibley, Todd Rinaldo, Tomasz Konojacki, Tom Hukins, Tom Wyant, Tony Cook, Vitali Peil, Yves Orton, Zefram.

The list above is almost certainly incomplete as it is automatically generated from version control history. In particular, it does not include the names of the (very much appreciated) contributors who reported issues to the Perl bug tracker.

Many of the changes included in this version originated in the CPAN modules included in Perl's core. We're grateful to the entire CPAN community for helping Perl to flourish.

For a more complete list of all of Perl's historical contributors, please see the AUTHORS file in the Perl source distribution.

We expect to release perl v5.29.0 tomorrow, followed by v5.29.1 on July 20th. The next major stable release of Perl 5, version 30.0, should appear in June 2019.

In hugs and bugs,
Sawyer X.
Posted Jun 25, 2018 14:56 UTC (Mon) by excors (subscriber, #95769)
That seems rather weird behaviour - if you're using a regexp to check if something looks like a number, surely that's because you're about to pass it into int($n) or $n+0 etc and use it as an actual number? And int() only understands ASCII digits. If you're going to do some Unicode-aware number parsing, you need a library to do that for you, and then it would be no extra hassle to use that library's regexps like (hypothetically) /${Unicode::NumberParser::re_digit}+/ when you specifically want that behaviour, with the bonus of ensuring compatibility between the matching and parsing. And you can still use /\p{Digit}/ if you simply want all Unicode digits. But defaulting to Unicode for \d seems like it's going to achieve little beyond a proliferation of bugs.
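A minimal sketch of the mismatch described above, assuming a string of Arabic-Indic digits as input (the variable names are illustrative only): the `\d` check accepts the string, but ordinary numeric conversion does not parse it.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT FOUR followed by ARABIC-INDIC DIGIT TWO,
# written as escapes so the file itself stays ASCII.
my $input = "\x{0664}\x{0662}";

# The string contains code points above 255, so \d uses Unicode rules
# and the "looks like a number" check passes.
my $looks_numeric = $input =~ /\A\d+\z/ ? 1 : 0;

# Numeric conversion only understands ASCII digits. The conversion
# would normally emit an "isn't numeric" warning, silenced here.
my $value = do { no warnings 'numeric'; $input + 0 };

print "matches \\d+: $looks_numeric, numeric value: $value\n";
```

On a Unicode-aware Perl this reports a successful match together with a numeric value of 0, which is the kind of silent disagreement between validation and parsing the comment warns about.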
Posted Jun 25, 2018 15:49 UTC (Mon) by MarcB (guest, #101804)
But I somewhat agree that Perl got the defaults backwards, for the simple reason that matching less than expected causes easier-to-spot errors than matching more than expected (because positive tests are much more common than negative ones).
In any case, if you want ASCII characters only, Perl has the "a" modifier.
An example using "unichars" from Unicode::Tussle (it prints out all characters matching a given regular expression):
Posted Jun 25, 2018 16:17 UTC (Mon) by epa (subscriber, #39769)
Added to that is the fact that the numerals 0-9 are commonly used to write numbers in every language, even if it doesn't use the letters A-Z and even if it has its own separate set of number characters. So I would be quite happy to restrict \d to match digits 0-9 while still extending the meaning of \w. In fact, I can't see that matching "any character which might be a numeric one in some script somewhere" is even a useful thing to match in any practical program.
Posted Jun 25, 2018 17:07 UTC (Mon) by JoeBuck (subscriber, #2330)
Posted Jun 26, 2018 15:16 UTC (Tue) by epa (subscriber, #39769)
Posted Jun 26, 2018 20:43 UTC (Tue) by karkhaz (subscriber, #99844)
This isn't such a challenge to deal with, because the numerical system is exactly the same as in the West (i.e., the position of a digit within the number gives its magnitude); the only difference is that the numerals are the characters ٠١٢٣٤٥٦٧٨٩ rather than 0-9. They're even written with the highest-magnitude digits on the left, just like Western numbers, even though Arabic text is written right-to-left.
So to convert an East-Arabic number string into an int, it suffices to subtract a constant from each character in the string (to turn it into an ASCII number string) and then do the type conversion.
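The conversion described above can be sketched as follows. The helper name `arabic_indic_to_int` is hypothetical, and the sketch assumes the input uses only the digits U+0660 through U+0669, stored most significant digit first.

```perl
use strict;
use warnings;

# Hypothetical helper: shift each Arabic-Indic digit (U+0660..U+0669)
# down to the matching ASCII digit, then convert the result to a number.
sub arabic_indic_to_int {
    my ($text) = @_;

    # Reject anything that is not purely Arabic-Indic digits.
    return unless $text =~ /\A[\x{0660}-\x{0669}]+\z/;

    # Constant distance between the two digit ranges.
    my $offset = 0x0660 - ord('0');

    my $ascii = join '', map { chr(ord($_) - $offset) } split //, $text;
    return $ascii + 0;
}

# The digits 4, 2, 1 written with Arabic-Indic numerals.
my $n = arabic_indic_to_int("\x{0664}\x{0662}\x{0661}");
print "$n\n";   # prints 421
```

Validating the range first keeps the subtraction from silently mangling mixed input.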
Posted Jul 16, 2018 7:43 UTC (Mon) by epa (subscriber, #39769)
Posted Jul 16, 2018 18:58 UTC (Mon) by dtlin (subscriber, #36537)
Nope. http://unicode.org/reports/tr9/ opens with exactly this case.
However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.
However, because these right-to-left scripts use digits that are written from left to right, the text is actually bidirectional: a mixture of right-to-left and left-to-right text.
Arabic letters have Bidi_Class=AL (Arabic letter, strongly RTL), while Arabic digits have Bidi_Class=AN (Arabic number, weakly LTR).
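These classes can be checked from Perl itself. A small sketch using the core `Unicode::UCD` module, with ARABIC LETTER ALEF (U+0627) and ARABIC-INDIC DIGIT ZERO (U+0660) chosen here as representative code points:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# charinfo() returns a hash reference whose 'bidi' key holds the
# short name of the character's Bidi_Class.
my $alef_class  = charinfo(0x0627)->{bidi};   # ARABIC LETTER ALEF
my $digit_class = charinfo(0x0660)->{bidi};   # ARABIC-INDIC DIGIT ZERO

print "letter: $alef_class, digit: $digit_class\n";   # letter: AL, digit: AN
```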
Posted Jul 16, 2018 20:31 UTC (Mon) by zlynx (guest, #2285)
Arabic numerals are /supposed to be/ read right to left in little-endian order. Notice that when reading a number, we have to first count all of the digits to determine hundreds, thousands, millions, etc, before we start talking. Instead all these years we could have been reading them as "1 and 20 and 400" if only we'd written them the other direction.
We also have all of the strange formatting exceptions for numbers so that they align to the right. Note that in English that's the only thing we right-align. A big hint that we write them in the wrong order.
Posted Jul 16, 2018 20:46 UTC (Mon) by karkhaz (subscriber, #99844)
I don't think dtlin claimed that at all; their comment was that Arabic digits have a LTR class. However, there's a subtle point here: what we call "Arabic numerals" (0123456789) were indeed copied from Arabic, but I was talking about the numerals that are currently used in most Arabic-speaking countries (٠١٢٣٤٥٦٧٨٩, which I referred to as East Arabic numerals to disambiguate).
> Arabic numerals are /supposed to be/ read right to left in little-endian order
I'm not sure what your source for this is. I suppose it makes sense when you have a number embedded in some RTL text. However, I speak Arabic (though I cannot read nor write), and numbers are not pronounced as "1 and 20 and 400". The order is actually a bit jumbled: that particular number is pronounced "four hundred and one and twenty".
In general, higher-magnitude digits are uttered before lower-magnitude ones in spoken Arabic, just as in English. The exceptions are that units are uttered before tens ("one and twenty"), and that the numbers from eleven to twenty have special names (as they do in English, i.e. we say "eleven" as opposed to "one and ten").
Posted Jul 16, 2018 21:02 UTC (Mon) by zlynx (guest, #2285)
Posted Jun 25, 2018 16:34 UTC (Mon) by excors (subscriber, #95769)
Why not?
\w is documented as matching "word" characters, so it's inherently quite vague - if someone wants to strictly validate a string against [a-zA-Z0-9_] then I suspect they'd nearly always do it explicitly with that character range, instead of trying to use \w as shorthand, because the precise meaning of \w is non-obvious even in the ASCII world. And when people understand it as matching "words" in some vague best-effort sense, it's unsurprising that it should include French words and Greek words etc, so the Unicode behavior makes sense. That contrasts with \d which is documented as matching decimal digits, and obviously digits are [0-9], so it's very tempting to (wrongly) use \d when you really want [0-9]. Given the different situation for \w and \d, it seems reasonable to have different solutions for them.
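To illustrate the difference between `\w` and the explicit range, here is a small sketch that checks a Greek word (chosen arbitrarily for this example) against both patterns, plus `\w` under the `/a` modifier:

```perl
use strict;
use warnings;

# The Greek word "logos", written as escapes so the file stays ASCII.
my $greek = "\x{03bb}\x{03cc}\x{03b3}\x{03bf}\x{03c2}";

# \w uses Unicode rules on this string, so Greek letters count as word characters.
my $matches_w = $greek =~ /\A\w+\z/ ? 1 : 0;

# The explicit ASCII range rejects the same string.
my $matches_explicit = $greek =~ /\A[a-zA-Z0-9_]+\z/ ? 1 : 0;

# With /a, \w is restricted to the ASCII range as well.
my $matches_w_ascii = $greek =~ /\A\w+\z/a ? 1 : 0;

print "\\w: $matches_w, [a-zA-Z0-9_]: $matches_explicit, \\w with /a: $matches_w_ascii\n";
# prints: \w: 1, [a-zA-Z0-9_]: 0, \w with /a: 0
```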
Posted Jun 25, 2018 17:24 UTC (Mon) by MarcB (guest, #101804)
And if you are doing input validation, capturing too much with \w is as bad as doing so with \d. In fact, it might be worse: Accidentally letting through non-ASCII digits is very likely to explode noisily, yet early, while doing so with word characters can lead to subtle, yet nasty errors that might strike years later.
Imagine letting through non-normalized Unicode into a system that is able to store Unicode, but was designed with ASCII in mind, and produces wrong or strange results if "encoded bit strings are identical <=> decoded strings are identical" is no longer true. For example Linux filesystems, but also many databases, which for all intents and purposes would suddenly seem to violate unique constraints and wrongly fail equality checks.
I do not think special-casing \d makes much sense. Yes, it would prevent some errors, but those are the obvious ones. And as I said: /a should be the default, not /u. The additional character to type would not hurt anyone who wants Unicode semantics and it would be the safer thing. People will quickly notice that something they expect to match is not matching, while the inverse is likely only to be discovered through bugs or even by attackers.
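A minimal sketch of the normalization hazard mentioned above, using the core `Unicode::Normalize` module. The two spellings of "é" are an assumed example; both pass a Unicode `\w+` check, yet they compare as different strings until normalized.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{00e9}";    # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{0301}";   # "e" followed by COMBINING ACUTE ACCENT

# Both spellings are accepted by \w+ under Unicode rules (/u).
my $both_pass = ($precomposed =~ /\A\w+\z/u && $decomposed =~ /\A\w+\z/u) ? 1 : 0;

# They render identically but are different code point sequences.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to NFC, they compare equal.
my $normalized_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "both pass \\w+: $both_pass, raw eq: $raw_equal, NFC eq: $normalized_equal\n";
# prints: both pass \w+: 1, raw eq: 0, NFC eq: 1
```

A system that stores the raw strings would treat these as two distinct keys, which is how a unique constraint can appear to be violated.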
Posted Jun 28, 2018 13:32 UTC (Thu) by jrw (subscriber, #69959)
Posted Jul 11, 2018 19:46 UTC (Wed) by epa (subscriber, #39769)
Posted Jun 25, 2018 20:20 UTC (Mon) by flussence (guest, #85566)
I suspect a lot of people wrote expressions that match much more than they expect them to match (I know I certainly did, until about two years ago).
$ ./unichars '/\d/' | wc -l
370
$ ./unichars '/\d/a' | wc -l
10
$ ./unichars '/\w/' | wc -l
11286
$ ./unichars '/\w/a' | wc -l
63
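For readers without Unicode::Tussle installed, a rough sketch of the same kind of count in plain Perl follows. It walks every code point itself, so it is not how unichars works, and the Unicode `\d` total depends on the Unicode version of the Perl in use. The counts under `/a` are fixed by the ASCII definition.

```perl
use strict;
use warnings;

my %count = (d => 0, d_ascii => 0, w_ascii => 0);

for my $cp (0 .. 0x10FFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates, which are not characters

    my $ch = chr $cp;
    $count{d}++       if $ch =~ /\d/u;   # Unicode decimal digits
    $count{d_ascii}++ if $ch =~ /\d/a;   # ASCII 0-9 only
    $count{w_ascii}++ if $ch =~ /\w/a;   # ASCII a-z, A-Z, 0-9 and underscore
}

print "\\d: $count{d}\n";
print "\\d with /a: $count{d_ascii}\n";   # 10
print "\\w with /a: $count{w_ascii}\n";   # 63
```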
A number of languages have distinct characters to write numbers. For example Kannada, the local language in the Indian state that includes Bangalore, has Unicode codepoints from 0CE6 through 0CEF to represent digits (I've been there several times and love the curlicue Kannada characters, though I can't read a bit of it, and fortunately for me English is widely spoken and used on signs).