
Julia v1.10: Performance, a new parser, and more

January 17, 2024

This article was contributed by Lee Phillips

The new year arrived bearing a new version of Julia, a general-purpose, open-source programming language with a focus on high-performance scientific computing. Some of Julia's unusual features are Lisp-inspired metaprogramming, the ability to examine compiled representations of code in the REPL or in a "reactive notebook", an advanced type and dispatch system, and a sophisticated, built-in package manager. Version 1.10 brings big increases in speed and developer convenience, especially improvements in code precompilation and loading times. It also features a new parser written in Julia.

Time to first x

The time needed to import packages has been a common annoyance since the first public release of Julia. This is sometimes called the "time to first plot", or, more generally, the "time to first x", where "x" is the output of some function when called for the first time from a freshly loaded package. Nearly every major release of Julia has improved compilation and loading times.

The latest release decreases these delays to the point where the strategy of creating custom system images with pre-loaded packages is largely no longer needed, as Chris Rackauckas pointed out. This strategy used the PackageCompiler module to make a custom Julia binary with one's commonly-used packages baked in, so it could start up instantly, ready to plot or perform other tasks without delays. The section "Creating binaries" below discusses this further.

I borrowed an example from that Hacker News discussion, and timed a simple "using Plots; plot(sin)" using Julia v1.10 on a low-powered laptop. It reported a wall-clock time of 2.16 seconds; in that time, it needed to load the Julia runtime, import the Plots package, and create a simple plot. Note that this experiment does not use a fresh Julia install. Plots has already been downloaded and precompiled (along with its dependencies), which can take some time, but that's a one-time cost. To get a sense of the improvements in v1.10, I repeated the same timed command with the two previous versions of Julia that I still have installed. Julia v1.9 takes 3.6 seconds, and v1.8 takes 18.1 seconds.

The Julia developers employed several strategies to achieve this latest improvement in code loading. The previous major release introduced native code caching and package extensions for weak dependencies, as well as the ability for package authors to ship precompiled, native binary code with their software. Some of the recent improvement in loading time is due to package developers taking greater advantage of these two mechanisms; as these practices spread through the community we'll see further decreases in latencies even without improvements to Julia itself.

The latest release corrects an oversight in the generation of code caches: previously, if a user was compiling code with more than one instance of Julia (a situation that can arise in parallel computing environments), a race condition might ensue, with different Julia processes writing to the same cache files simultaneously. Now this problem has been eliminated using lock files.

Part of the reduction in package load time is the result of parallelizing the initial stage of precompilation. The precompilation that happens upon installation of packages has been done in parallel for a long time, but the later stage, when the user imports functions into a program or environment, was a serial process until v1.10. It is this stage of precompilation that affects responsiveness in daily use, so speeding this up has a greater impact when working in the REPL or running programs.

Julia ships with a version of LLVM containing patches to fix bugs, mainly in numerical methods. The upgrade to LLVM 15 that came with v1.10 also had an impact on Julia's responsiveness, as the new compiler version has better performance.

For users who'd like to keep a closer eye on package precompilation times in the REPL, the command Pkg.precompile(timing=true) will precompile any packages in your environment that require it, and supply a per-package report of compilation times.
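In practice this is a one-liner in the REPL (the exact format of the timing report may vary between versions):

```julia
using Pkg

# Precompile any stale packages in the active environment and
# print a per-package report of compilation times when done.
Pkg.precompile(timing=true)
```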

Language, libraries, and runtime

In addition to faster precompilation, v1.10 brings improved runtime performance and some developer conveniences.

From v1.0 until now, the Julia parser has been implemented in Femtolisp. This Scheme dialect is described in its README by its creator, Jeff Bezanson (one of the inventors of Julia), as "an attempt to write the fastest lisp interpreter I could in under 1000 lines of C".

With v1.10, the Femtolisp parser has been replaced with a parser written in Julia. The new parser brings three advantages: it is faster, it produces more useful syntax-error messages, and it provides better source-code mapping, which associates locations in compiled code to their corresponding lines in the source. That last improvement also leads to better error messages and makes it possible to write more sophisticated debuggers and linters.

The next figure illustrates the improved error reporting in v1.10:

[Syntax error]

As can be seen, the exact position of the error is indicated. The same error committed using v1.9 produces a somewhat less friendly message, mentioning the extra space but not pointing out its location visually.

Femtolisp still performs code lowering for the language. In the latest version, as in earlier iterations, starting an interactive session with the --lisp flag takes you to the Femtolisp REPL.

Stack traces have also been made more concise and useful by omitting unnecessary information, addressing a frequent complaint from Julia programmers.

Julia has always used a tracing garbage collector. Application programmers have never had to know anything about this, although, as an optimization, it is often possible to avoid garbage collection entirely by taking care not to allocate memory dynamically. For some algorithms, though, that strategy won't work.

Run-time performance in v1.10 was improved by parallelizing the mark phase of the garbage collector, achieving nearly linear speedup as a function of the number of threads allocated to garbage collection. Julia can be started with any number of worker threads, set with the -t flag. The number of threads used for garbage collection is half the number of worker threads, by default, but the user can set this with the new --gcthreads flag.
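The flags fit together as follows (a sketch; the script name is a placeholder, and thread counts should be adjusted for the machine at hand):

```shell
# Start Julia with 8 worker threads; by default, v1.10 devotes half
# of them (here, 4) to the parallel GC mark phase.
julia -t 8 myprogram.jl

# Override the default and use 2 GC threads instead:
julia -t 8 --gcthreads=2 myprogram.jl
```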

The use of Unicode characters has received some tweaks and refinements, as usual in any Julia release. There are two new fancy arrows (⥺ and ⥷) available to be used as names for binary operators. Physicists will be relieved to learn that two similar glyphs for hbar are now treated as identical. The glyph ∜ computes the fourth root, unsurprisingly, using a new function in Base.Math called fourthroot(). Its implementer was motivated by the desire to not "let a perfectly good Unicode symbol go to waste", and also points out that fourthroot() is faster than what some programmers might turn to: x^(1/4).
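A quick illustration (since fourthroot() is computed as two square roots, 16.0 gives the exact answer):

```julia
# The new function lives in Base.Math:
Base.Math.fourthroot(16.0)   # 2.0

# The ∜ glyph parses as a call to fourthroot:
∜16.0                        # 2.0
```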

Speaking of new functions, there is a generous handful in the new release. The tanpi(x) function calculates tan(xπ) more accurately for large arguments. For example, noting that tan(π/4) = 1 exactly, and the tangent has a period of π:

    julia> tanpi(1/4 + 2e10)
    0.9999999999999999

    julia> tan((1/4 + 2e10)*π)
    0.999995497014194

Some new memory-related functions, memmove(), memset(), and memcpy(), were added to Julia's Libc module (its interface to the C standard library), to aid in interacting with C library routines.
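A minimal sketch of the Julia-side usage, assuming these functions mirror the C-style (dest, src, nbytes) signature:

```julia
src = [1.0, 2.0, 3.0, 4.0]
dst = zeros(4)

# These functions operate on raw pointers, so keep both arrays
# rooted while their pointers are in use.
GC.@preserve src dst begin
    Libc.memcpy(pointer(dst), pointer(src), sizeof(src))
end

dst == src   # true: dst now holds a copy of src
```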

The function in Base for calculating binomial coefficients, binomial(x, n) (x choose n), now accepts non-integer x, applying the standard definition for extended binomial coefficients; n must still be an integer.
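For instance (the second value worked out from the falling-factorial definition x·(x-1)/2! for n = 2):

```julia
binomial(5, 2)     # 10, the familiar integer case

# New in v1.10: a non-integer first argument
binomial(2.5, 2)   # 2.5 * 1.5 / 2 == 1.875
```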

Julia has a printstyled() function that writes messages to the screen with optional effects including colors, underline, reverse video, and others; whether the user sees the effects depends on the capabilities of the terminal emulator in use. Julia v1.10 adds an italic option.
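For example (whether anything visibly changes depends on the terminal emulator):

```julia
# The italic keyword is new in v1.10; the other effects shown
# were already available.
printstyled("emphasized "; italic=true)
printstyled("and colored\n"; italic=true, color=:green)
```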

A View in Julia is a section of an array (called the parent), with which it shares memory. The function parent() returns the View's parent array. A SubString applies the same idea to strings: it's a piece of a parent string that doesn't create a new object. Previously parent() only worked with array Views. Now the function has been extended to work with SubString types:

    julia> s = "abcdefgh"
    "abcdefgh"

    julia> ss = SubString(s, 6:8)
    "fgh"

    julia> ss[1]
    'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)

    julia> parent(ss)
    "abcdefgh"

The startswith() function determines whether a string starts with a given character or substring. This has now been extended to work with I/O streams (for example, files). It looks at the stream not from the beginning, but from the current read position, as shown in this example:

    julia> maybePNG = open("fig1.png")
    IOStream(<file fig1.png>)

    julia> seek(maybePNG, 1);

    julia> startswith(maybePNG, "PNG")
    true

    julia> position(maybePNG)
    1

This example checks for the magic number one byte into the file to determine if it might be a PNG image. The startswith() function peeks at the required extent of the stream without changing the read position, as shown in the last line of the example.

In Julia rational numbers are defined and displayed with a double slash, such as 3//4. In the latest release, when printing an array, rational numbers are displayed as integers if they have an integral value (if they can be reduced to having a denominator of one). Here's an example:

    julia> TM = Matrix{Rational}(undef, 5, 5);

    julia> for i ∈ 1:5, j ∈ 1:5
              if j<i
                  TM[i, j] = i//j
              else
                  TM[i, j] = 0//1
              end
           end

    julia> TM
    5×5 Matrix{Rational}:
    0   0     0     0    0
    2   0     0     0    0
    3  3//2   0     0    0
    4   2    4//3   0    0
    5  5//2  5//3  5//4  0

This constructs a lower triangular matrix whose structure is easy to see at a glance. Here is the way previous versions of Julia display the same array:

    0//1  0//1  0//1  0//1  0//1
    2//1  0//1  0//1  0//1  0//1
    3//1  3//2  0//1  0//1  0//1
    4//1  2//1  4//3  0//1  0//1
    5//1  5//2  5//3  5//4  0//1

This style of displaying rational numbers only applies to members of arrays; outside of that context, there is no change.

The new release comes with a few technical refinements to the linear algebra library, which is part of the standard library. One new addition is the hermitianpart() function, which efficiently calculates the Hermitian part of a square matrix.
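A short example with a small complex matrix:

```julia
using LinearAlgebra

A = [1 2+3im;
     4-1im 5]

# hermitianpart(A) equals (A + A')/2, wrapped as a Hermitian matrix
# so that downstream linear-algebra code can exploit the structure.
H = hermitianpart(A)   # [1 3+2im; 3-2im 5]
```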

Creating binaries

The normal way to use Julia is to download and install the runtime and to work either in an interactive environment, such as the REPL or a Pluto notebook, or to run program files from the command line, using the julia shell command. In both styles, the compiler is available to generate new native machine code whenever it encounters a function call with a combination of argument types that has not previously been compiled. This allows making full use of multiple dispatch and user-defined types without sacrificing performance, aside from the necessary compile times when code for a new method needs to be generated.

There are other usage scenarios, however, that are not satisfied by the standard mechanisms. We may want to eliminate as much compiler delay as possible, for example if a Julia program is called repeatedly in a shell script. Even the small compilation penalties still present in v1.10 could be undesirable. Or we may want to give our program to someone who, for some odd reason, does not have the Julia runtime installed. In this case, they would need a standalone binary version of our program.

There are two main packages that provide utilities for generating various types of compiled Julia programs. StaticCompiler is a package that can make small, statically compiled binaries from sources written in a severely limited subset of Julia. It's an experimental package intended for adventurous programmers. StaticCompiler is the basis of ongoing experiments in compiling Julia to WebAssembly for running in web browsers. The past year has seen significant progress in that area.

A more generally useful tool is the package PackageCompiler. It can compile programs written in unconstrained Julia into either sysimages or into distributable, standalone binaries. The former target is used in order to have a Julia environment with a set of packages baked in that starts up instantly. This was a great help in earlier versions of Julia, when precompilation times interfered with using Julia programs as routine utilities (a script that took half a minute to plot a small bit of data was inconvenient). Now that precompilation and load times are so much smaller, this use of PackageCompiler is almost obsolete.

But it is still the standard utility for creating binaries for distribution or for use in environments where the Julia runtime is not installed. While still under active development, it's a mature tool. Until recently, the main complaint has been the enormous size of the compiled programs: PackageCompiler would turn a "hello world" program into a 900MB monstrosity. This is because everything was included; not only the entire Julia runtime, but everything in the standard library. For example, the BLAS routines were included even if the program didn't do any linear algebra.
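As a sketch of both uses of PackageCompiler (the project path, package names, and output locations here are placeholders, not taken from the article):

```julia
using PackageCompiler

# Standalone application: "MyApp" must be a Julia package defining a
# julia_main() entry point; the output directory bundles the runtime.
create_app("path/to/MyApp", "build/MyApp")

# The sysimage workflow that v1.10's faster loading largely obsoletes:
create_sysimage(["Plots"]; sysimage_path="plots.so")
# ...then start Julia with:  julia --sysimage plots.so
```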

In the past year, the size of a compiled "hello world" program has come down to 50MB, making the PackageCompiler approach far more practical. This progress results from stripping out unneeded parts of the runtime and unused libraries. This work is ongoing, and is a strong focus of developer activity. Bezanson gave an informative talk about the history of this progress.

Learning resources

Julia's official manual is compendious and accurate, but its organization can make it difficult to find what you need, especially if you're a beginner. Fortunately Julia has been in use long enough now that a constellation of articles and books has grown around it, written by authors from a wide variety of backgrounds and aimed at readers from beginning programmers to working scientists experienced with computation. Several books have recently appeared by authors whose Julia articles over the past several years I've found illuminating; for example, Bogumił Kamiński, Erik Engheim, and Logan Kilpatrick. Their books, as well as my volume, are collected into a list maintained on the official Julia site.

Another good resource for learning more about Julia and the projects taking advantage of it is the collection of talks from JuliaCon on YouTube.

Finally, once you're exploring the language and ecosystem in earnest, the Julia Discourse discussion platform will become a valuable resource. Julia developers, package authors, and experienced users participate actively; the community is welcoming, patient, and helpful.

Conclusion

Between the improvements in precompilation and loading times, and the progress in making small binaries, two major and perennial complaints, of beginners and seasoned Julia users alike, have been addressed. Work on both these areas continues and is likely to see more improvement in the near future, especially in the area of binary generation. StaticCompiler and related WebAssembly tools will make it easier to write web applications in Julia for direct execution in the browser; it is already possible, but may become more convenient over the next few years.



Julia v1.10: Performance, a new parser, and more

Posted Jan 17, 2024 10:19 UTC (Wed) by ianmcc (subscriber, #88379) [Link] (6 responses)

It's only sort-of true that the tanpi() function is more accurate for large arguments. Like most (all?) functions defined in IEEE that have a much smaller range than the input (trig functions, fmod, etc.), it only has full accuracy when the argument is exact. In that example of calculating tanpi(1/4 + 2e10), fluctuations in the least significant digit of the argument still get amplified by 10 orders of magnitude:
println(tanpi(1/4 + 2e10))
println(tanpi(nextfloat(1/4 + 2e10)))
println(tanpi(prevfloat(1/4 + 2e10)))

Standard out:
0.9999999999999999
1.0000239687370587
0.999976031837428

So it is only useful for this purpose if you can arrange that there is no rounding of the argument. It's exactly the same as if you wanted to calculate tan of (1/4 + 2e10) radians. The IEEE functions will give a result that is accurate to full precision, but calculating it as tanpi((1/4 + 2e10) / π) will only have ~5 digits of accuracy (-0.966688933802304, accurate to ~16 digits, for tan(1/4 + 2e10) versus -0.966687605232931, accurate to ~5 digits, for tanpi((1/4 + 2e10) / π)).

The bottom line is that if you want to calculate trig functions of very large arguments you need to know what you are doing, and using tanpi instead of tan isn't automatically going to help.

Julia v1.10: Performance, a new parser, and more

Posted Jan 17, 2024 11:15 UTC (Wed) by joib (subscriber, #8541) [Link] (5 responses)

The big advantage of tanpi() and the other *pi() trigonometric functions is that argument reduction is easy and fast to do exactly. The standard trigonometric functions require the argument to be reduced by a multiple of pi, which requires a high-precision approximation of pi and arbitrary-precision calculations. That is slower, and it introduces possibilities of bugs and accumulating roundoff error unless the implementation is done very carefully (which historically hasn't really been the case in libcs, although that has started to change).

But of course an implementation can only calculate with whatever value is given to them. And yes, tan(x) is almost certainly going to be more accurate than calculating tanpi(x/pi). So the *pi() functions will only be useful if the argument you want to calculate needs to be multiplied by a multiple of pi (which per se isn't entirely uncommon).

Julia v1.10: Performance, a new parser, and more

Posted Jan 17, 2024 12:56 UTC (Wed) by ianmcc (subscriber, #88379) [Link] (1 responses)

Well, in ordinary cases (i.e. when the argument is close to 1), tanpi(x) and tan(pi*x) will have the same accuracy:
println(tanpi(1/4))
println(tan(π/4))
Program stdout
0.9999999999999999
0.9999999999999999
The problem with trig functions with very large arguments is that the calculation is ill-conditioned, and roundoff error in the argument produces large errors in the result. That it is even possible to implement the trig functions so that the result only has +/- 1 ulp error for any input is non-obvious (and wasn't the case in the original 287 coprocessor), and while it is nice to have, in practice it isn't all that useful since you usually have at least 1 ulp error on the input as well and there is no way to avoid that 1 ulp error on input getting magnified into a large error on the output.

In fact the error in tanpi is bigger than it could be, since 1/4 + 2e10 is exactly represented as Float64, and the exact result of tanpi(1/4 + 2e10) is 1.0, but Julia instead gives 0.9999999999999999. So for example, if the result is used as input for another function, then that error can blow up:

print(tanpi(tanpi(1/4 + 2e10) * 2e10 + 1/4))
Program stdout
0.999976031837428
although in principle all of the numbers here are exactly representable in Float64 and the exact result is 1.0.

If you need to calculate tan of numbers ~ 1e10 using double precision, and those numbers are calculated from some other function, then you ought to "not care" about errors in the result of order 1e-5. If you want the result to be more accurate then you need your input value to be specified to more digits than are available in double precision.

In order to rely on the special properties of tanpi(1/4 + 2e10) to produce 1.0 (+/- 1 ulp), the input number can't have been subject to any previous rounding. I've never seen a real-world example where that is the case (except trivial examples where the argument is a constant).

Julia v1.10: Performance, a new parser, and more

Posted Jan 17, 2024 14:31 UTC (Wed) by joib (subscriber, #8541) [Link]

Again, the implementation of a trigonometric function can only work with the argument it's given. It can't know whether the argument is exact, or if not, how much error there is and how much error is acceptable in the output. And thus it has to do the best it can, assuming the argument is exact. For a quality implementation, that is; of course an implementation can take the position that the input probably contains error, so why should it bother with being particularly accurate. But for some applications this isn't good enough.

And yes, a particularly large argument to a trigonometric function probably implies that problem should be formulated in some other way. But again, the implementation of a trigonometric function can't know that, and the best it can do is to assume the argument is exact and calculate an answer that is as accurate as possible given that assumption.

And yes, the x87 was notoriously bad with reducing large arguments. Thankfully those days are now past us, and most libc's AFAIK do argument reduction roughly per the famous(?) paper by Ng.

Julia v1.10: Performance, a new parser, and more

Posted Jan 17, 2024 14:41 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

As a trivial example of when the *pi functions are useful, consider a case where your measurement is not in radians; for example, you have a measurement of 200° or 97 gon. The conversion from degrees to radians is multiply by π and divide by 180, while the conversion from gon to radians is multiply by π and divide by 200. In both cases, the *pi() functions help because it's incredibly cheap to reduce the argument to the range 0 to 2, whereas it's a lot more challenging to reduce it to 0 to 2π without loss of accuracy.

Julia v1.10: Performance, a new parser, and more

Posted Jan 18, 2024 9:55 UTC (Thu) by ianmcc (subscriber, #88379) [Link] (1 responses)

it's a lot more challenging to reduce it to 0 to 2π without loss of accuracy.

That isn't true: in IEEE arithmetic the remainder function is exact, and very fast. In fact I don't think there is any performance advantage to taking a remainder modulo 2 compared with 2π.

Julia v1.10: Performance, a new parser, and more

Posted Jan 18, 2024 10:22 UTC (Thu) by farnz (subscriber, #17727) [Link]

The IEEE arithmetic remainder function is not able to give an exact remainder when dividing by 2π; it can only give an exact remainder when the divisor's IEEE 754 representation is exact, and there is no exact, finite, representation of 2π (in any binary floating point system, not just in IEEE 754). You will thus, per IEEE 754's own rules, get an inexact remainder when you take the remainder after division by 2π, since you did not supply an exact representation of 2π to the remainder function.

In contrast, the real number 2 can be represented exactly in IEEE 754 floating point, and thus the remainder function will give you an exact result. This is, FWIW, the actual rationale given by P754 working group for the inclusion of sinPi, cosPi and atanPi in table 9.1; the only reason that IEEE 754 has traditionally omitted tanPi, acosPi, and asinPi is that these functions have problems where there's two or more valid output values for certain inputs, and there's arguments for both possible valid output values.

Julia v1.10: Performance, a new parser, and more

Posted Jan 18, 2024 1:41 UTC (Thu) by anachronicnomad (guest, #119989) [Link]

This is awesome! Exciting developments in Julia!


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds