Walking down the hall towards the next set of talks, I am accosted by Miguel de Icaza and friends. He is standing outside the door to this room redirecting people to Nat's talk on GNU Rope, promising entertainment and an overall wild time. Since I hadn't made up my mind which talk to go to, I decide to give it a try. I'll be missing the Netatalk discussion by Adrian Sun and "Public Relations" by Tom Geller, but Miguel is providing lots of entertainment here, even before the talk begins.
Nat, meanwhile, is from MIT. He's working on setup now. Miguel seems to be fairly effective. This is the first room for talks you come to, but also the smallest. Many of the talks in this room have been pretty sparsely attended compared to the other rooms. Anyone heading to another talk has to get past Miguel first, though, which sounds like it is becoming difficult.
"Please go in here. Just do it ...", says Miguel. Wow, the room is filling up now. "Nat! Make some funny faces!" This is called out every few minutes or so, apparently to demonstrate that the talk will be amusing as well as enlightening. "Nat! Wave to the nice people!" Not Miguel, this time, but the other procurer. Alan Cox has grabbed Miguel and forced him to sit down. The two of them are heading to the front. Apparently the harassment in the hallway had reached too high a level. No! He's escaped! Miguel heads to the back of the room. Well, it is still quiet, so he must have given up the task.
He achieved his goal, though. The room is comfortably full, with maybe ten or twenty open chairs.
Soon I may even find out what GNU Rope is ...
The presentation was done with MagicPoint, they were proud to announce.
What is GNU rope? Ah, this is the "grope" I've heard about.
It was inspired by SGI's cord program. It rearranges functions and optimizes programs so that they load faster, use less memory and are nicer to the cache. Miguel learned about the program and said, "We *must* have this program for Linux!" "rope" is a pun on "cord", and makes a great word when combined with GNU.
Grope produces a 30% reduction in memory use when used on the gcc program, for example.
"We are going to explain why rearranging the functions speeds things up, how we do it, and how you can use it."
Code is loaded into memory a page at a time. If functions that call each other are arranged to load onto the same pages, paging is reduced. If we take functions that have "a high affinity" for each other and keep them on the same page, we can reduce memory consumption, improve load times, and improve cache performance.
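As a toy illustration of that point (my sketch, not grope's code; the function names and sizes are made up), here is how packing related functions together shrinks the set of pages a call trace touches:

```python
PAGE_SIZE = 4096
FUNC_SIZE = 1024  # pretend every function occupies 1 KB

def pages_touched(layout, trace):
    """layout: function names in link order; trace: functions actually called.
    Returns how many distinct pages the trace pulls in."""
    offset = {name: i * FUNC_SIZE for i, name in enumerate(layout)}
    return len({offset[f] // PAGE_SIZE for f in trace})

trace = ["main", "parse", "emit", "parse", "emit"]
# Hot functions separated by cold ones vs. packed together:
scattered = ["main", "a", "b", "c", "parse", "d", "e", "f", "emit"]
grouped   = ["main", "parse", "emit", "a", "b", "c", "d", "e", "f"]
print(pages_touched(scattered, trace))  # 3 pages
print(pages_touched(grouped, trace))    # 1 page
```

With the same trace, the grouped layout keeps the working set to a single resident page.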
Typically "code coverage" is relatively low for big programs. For example, when gcc is used to compile haifa-sched.c, only 976 functions out of 3000-some were actually used. Obviously optimization that focuses on the functions that are actually used will be more effective.
How does Grope work?
For example, everyone goes to slashdot (audience laughs), so we optimize Netscape for slashdot. The profiling tools that existed before Nat started were basically gprof. It has a call graph, which shows which function calls which. It also has a PC histogram, which answers the question of where the program spends its time. He wanted to gather other types of data, namely: how much memory does my program use over time? He also needed a framework for profiling modes, so he created the PROFMODES environment variable, which lets you modify the profiling modes at run time.
He showed a graph of memory use by gcc over time. The red lines showed the unoptimized (unmolested) gcc and the green one is the one that he "groped". (Richard Stallman was reportedly happy to see another "naughty" name in the GNU collection.) It showed a 30% reduction in memory use.
So after profiling, you need to determine your ordering. What are your goals? You want to reduce the working set size, the number of pages in memory at any time. How do we do this? We put related functions together. There are special cases where you want to do a really good job. One friend told him, "Dude, to do a really good job, you have to get the UltraSPARC case. If you can do this, that would be *awesome*." If you can take into account the specific behavior of cache on a specific architecture, it will make a tremendous difference.
"Your input is your profiling data and executable. Your output is architecture-specific orderings of functions." Nat's options for algorithms included gprof, the Node Grouping Algorithm, and a few others. The gprof algorithm was written by Jeff Law. It uses normal gprof callgraph data. It figures out what functions haven't been called and sticks them at the end. It figures out what is called a lot and sticks that at the beginning. Then the remaining functions are put together as a chain of function calls: A calls B calls C. However, recent papers no longer use call graph data; it is not sufficient for the best optimization. We need more information about how child functions are called relative to each other. Perhaps B and C need to be grouped together and D and E need to be grouped together.
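A minimal sketch of that gprof-style ordering (my reconstruction from the description, not Jeff Law's actual code): never-called functions go to the end, heavily called functions to the front. The chaining of A-calls-B-calls-C sequences is omitted here for brevity:

```python
def gprof_order(functions, call_counts):
    """Order hot functions first, never-called functions last.
    call_counts maps each function name to how often it was called."""
    used = [f for f in functions if call_counts.get(f, 0) > 0]
    unused = [f for f in functions if call_counts.get(f, 0) == 0]
    # Hottest first; the real algorithm then also chains caller/callee runs.
    used.sort(key=lambda f: call_counts[f], reverse=True)
    return used + unused

print(gprof_order(["a", "b", "c", "d"], {"a": 3, "c": 10}))
# ['c', 'a', 'b', 'd']
```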
So Nat stored the information in a graph. He takes the sequence of function accesses and sections off a window of them. For every window, make arcs between the functions that share that window. This generates a sequence graph, showing a relationship between functions that don't call each other directly. A call graph is a degenerate case of a sequence graph with a window of two.
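The windowing step above might look like this sketch (window size and names are arbitrary choices of mine). Note how, with a window of two, it degenerates to arcs between adjacent accesses, which is essentially the call-graph case the talk mentions:

```python
from collections import Counter
from itertools import combinations

def sequence_graph(trace, window):
    """Slide a window over the function-access trace; every pair of
    distinct functions sharing a window gets an arc, and the arc's
    weight counts how often the pair co-occurs."""
    arcs = Counter()
    for i in range(len(trace) - window + 1):
        for a, b in combinations(sorted(set(trace[i:i + window])), 2):
            arcs[(a, b)] += 1
    return arcs

print(sequence_graph(["a", "b", "a", "c"], window=2))
# Counter({('a', 'b'): 2, ('a', 'c'): 1})
```

A wider window relates functions that never call each other directly but run close together in time.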
So how do you coalesce this into an ordering? He looked at an algorithm described in a paper by Pettis and Hansen from HP. They used it on undirected callgraphs; Nat used it differently. "Take all the arcs in the graph and sort them by weight. Take the top arc, the one with the highest weight. Contract the two functions in that arc into a new node with both names." Nat then showed a graph to demonstrate this. The a and c functions had the heaviest arc between them, so they are contracted into node ac. Now look for the next heaviest arc; ac inherited all the arcs from both a and c. Eventually you have one node left, acedb for example, which gives you your ordering. This is distance-based, with no knowledge of how close they should be, just "as close as possible".
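My sketch of that contraction step, modelled on the description above (not on Pettis and Hansen's or Nat's code): repeatedly merge the heaviest arc's two endpoints into one node that carries both names and inherits all their arcs:

```python
def contract_order(arcs):
    """arcs: {(a, b): weight} with each pair stored sorted. Repeatedly
    contract the heaviest arc's endpoints into a combined node until one
    node remains; its name is the resulting function ordering."""
    arcs = dict(arcs)
    merged = None
    while arcs:
        (a, b), _ = max(arcs.items(), key=lambda kv: kv[1])
        merged = a + b  # contracted node carries both names, in order
        remaining = {}
        for (x, y), w in arcs.items():
            x = merged if x in (a, b) else x
            y = merged if y in (a, b) else y
            if x == y:  # the contracted arc itself disappears
                continue
            key = tuple(sorted((x, y)))
            remaining[key] = remaining.get(key, 0) + w  # inherit weights
        arcs = remaining
    return merged

print(contract_order({("a", "c"): 5, ("a", "b"): 2, ("b", "c"): 1}))
# acb
```

Here a and c are contracted first (weight 5); the merged node ac then inherits weight 2 + 1 = 3 toward b, giving the final ordering acb.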
For calculating the best distance, someone at SGI recommended Simulated Annealing. It has its origins in metallurgical annealing. Say you want to strengthen metals. You do this by minimizing the grain boundaries. Brittle metal has clear grain boundaries where metal can break. At high temperatures, the grains move around. Slowly cool the metal and grains fall into a minimum level, a type of "nature's optimization."
So some physicists applied it to another system. If you have a system you want to optimize: tweak it slightly and ask if it's better. If it is, switch to it; if it isn't, maybe switch to it anyway. Then repeat the process.
First, you need to be able to tell whether or not the system is "better". For this you need an energy function. If the new system has a lower energy, switch to it. If the energy is higher, switch to it on a certain probability basis. You need to start at a really high temperature, where you will have a higher probability of moving to a higher energy, and then move to lower temperatures, where you will tend to move towards lower energy. Otherwise, you can get caught in "dead-end lows", which don't reflect the true potential minimum of the system.
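The loop described above is standard simulated annealing. A generic sketch (the temperature schedule, parameters, and the toy problem below are my arbitrary choices, not grope's):

```python
import math
import random

def anneal(initial, energy, step, t0=10.0, cooling=0.995, iters=5000, seed=1):
    """Accept any lower-energy neighbor; accept a higher-energy one with
    probability exp(-dE / T). Cool T gradually and remember the best state
    seen, so we can escape dead-end lows without losing good solutions."""
    rng = random.Random(seed)
    state, e = initial, energy(initial)
    best, best_e = state, e
    t = t0
    for _ in range(iters):
        cand = step(state, rng)
        ce = energy(cand)
        if ce < e or rng.random() < math.exp((e - ce) / t):
            state, e = cand, ce
            if e < best_e:
                best, best_e = state, e
        t *= cooling
    return best, best_e

# Toy use: reorder a list to minimize adjacent out-of-order pairs.
def swap_two(s, rng):
    s = list(s)
    i, j = rng.randrange(len(s)), rng.randrange(len(s))
    s[i], s[j] = s[j], s[i]
    return s

disorder = lambda s: sum(1 for i in range(len(s) - 1) if s[i] > s[i + 1])
best, best_e = anneal([5, 3, 1, 4, 2], disorder, swap_two)
```

The high initial temperature makes uphill moves likely early on; as T decays, the walk settles toward a minimum.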
So how do you apply this to computer systems? First create a random step function. This is very easy to do, but it needs the ability to insert a hole (not done yet). Next you need an energy function. The energy function is very cool because it allows you to handle all sorts of special cases easily.
So he went back to the sequence graph. It would be best if arcs with a high affinity end up on the same page. So if a system puts them on the same page, the energy is 1. If it puts them are on two pages, the energy is 2. If they are scattered all over the place, the energy is very high ... Then you can add tweaks for cache design.
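A toy version of that energy function (my sketch; grope's real one also accounts for details like cache design): lay the functions out in a given order and charge each arc the number of pages its two endpoints span, weighted by affinity:

```python
PAGE_SIZE = 4096

def layout_energy(order, sizes, arcs):
    """order: function names in layout order; sizes: bytes per function;
    arcs: {(a, b): affinity}. Mirrors the scheme from the talk: an arc on
    one page contributes 1 per unit affinity, on two pages it contributes 2."""
    offset, pos = {}, 0
    for f in order:
        offset[f] = pos
        pos += sizes[f]
    return sum(w * len({offset[a] // PAGE_SIZE, offset[b] // PAGE_SIZE})
               for (a, b), w in arcs.items())

sizes = {f: 1024 for f in "abcde"}
arcs = {("a", "b"): 5}
print(layout_energy(["a", "b", "c", "d", "e"], sizes, arcs))  # 5: same page
print(layout_energy(["a", "c", "d", "e", "b"], sizes, arcs))  # 10: two pages
```

Summed over many arcs, a layout that scatters high-affinity pairs across pages scores a much higher energy, which is what the annealer then drives down.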
He worked on simulated annealing for a while. It works, but it is very slow (much like the metallurgical process). He learned things such as: taking only downhill steps actually kills you. Nat's long-term contribution is that he knows a lot about the architecture, so he can contribute a great energy function someday.
A code maintainer can gather profiling data, compute the function orderings and provide information so that the executable is ordered by the linker. The user compiles normally. However, the maintainer has to compile twice.
SGI uses pixie to get around the problem of compiling the program twice. You can just relink to turn on profiling and the linker will redirect function calls. He hasn't done anything quite so nice yet.
The post-link optimizer operates on functional executables. It needs to optimize a binary without relinking it. It requires a linker modification which causes the linker to emit modification data into the executable. That way, you don't have to keep all the object files around.
The link-time optimizer is fully working, benchmarks indicate 30% less memory usage, and it is better than twice as fast on many machines.
The ordering algorithms need tuning. The post-link optimizer needs debugging. It will be released soon; a paper with more detailed information is available.
How do you feel about dynamic linking?
If you are asking, can we optimize dynamic libraries, yes, we can. Not for specific programs, but for general use. You must gather profiling data for a lot of programs.
How parallelizable is the energy function?
Not very. You want to traverse the landscape in a random manner. The landscape is huge.
How many arcs in gcc?
Doesn't remember, but he could look it up. He has some graphs, etc.
Is there a mechanism to manually specify special cases?
I was working on stuff like that a week ago.
Have you found many cases where that is needed?
Yes, you may have a case where you know information and you don't want to wait while the simulated annealer figures it out.
You want to optimize differently for different CPUs. Are you going to make it easy to tack on different CPUs and what needs to be done for this?
Yes, that would go into the energy function.
In spite of the call graph, you may want to optimize something that doesn't even get called as much. Are there mechanisms for this?
Yes, as mentioned above.
Simulated annealing lends itself to heuristics. You may be able to cut down your annealing time.
Yes, you may be able to optimize by starting with a likely ordering scheme.
Have you considered tabu search?
Don't know what it is.
It is a further application of heuristics to simulated annealing.
Comment: reordering the kernel is worth trying, but with 4MB pages, you don't get as much optimization.
In terms of parallelizing it more, have you looked at genetic orderings?
I mentioned it to someone, but I got a disgusted look in return, so I haven't looked at it.
Final note, the room did fill up totally, with a few people standing at the back.