January 18, 2012
This article was contributed by Dan Magenheimer
Over the last fifty years, thousands of very bright system
software engineers, including many of you reading this today,
have invested large parts of their careers trying to solve
a single problem: How to divide up a fixed amount of physical
RAM to maximize a machine's performance across a wide variety
of workloads. We can call this "the MM problem."
Because RAM has become incredibly large and inexpensive, because
the ratios and topologies of CPU speeds, disk access times,
and memory controllers have grown ever more complex, and
because the workloads have changed dramatically and have
become ever more diverse, this single MM problem has continued
to offer fresh challenges and excite the imagination of kernel
MM developers. But at the same time, the measure
of success in solving the problem has become increasingly
difficult to define.
So, although this problem has never been considered "solved",
it is about to become much more complex, because those same
industry changes have also brought new business computing models.
Gone are the days when optimizing a single machine and a single
workload was a winning outcome. Instead, dozens, hundreds,
thousands, perhaps millions of machines run an even larger
number of workloads. The "winners" in the future industry
are those that figure out how to get the most work done at
the lowest cost in this ever-growing environment. And that
means resource optimization. No matter how inexpensive a
resource is, a million times that small expense is a large
expense. Anything that can be done to reduce that large
expense, without a corresponding reduction in throughput,
results in greater profit for the winners.
Some call this (disdainfully or otherwise) "cloud computing",
but no matter what you call it, the trend is impossible to
ignore. Where it is both feasible and prudent to consolidate
workloads, it is increasingly cost-effective to execute them
in certain data center environments where the time-varying
demands of the work can be statistically load-balanced to
reduce the peak resources required.
A decade ago, studies showed that only about 10% of the
CPU in a typical pizza-box server was being utilized... wouldn't
it be nice, they said, if we could consolidate workloads and buy 10x
fewer servers? This would not only save money on servers, but
would also save a lot on power, cooling, and space. While many
organizations had some success in consolidating some workloads
"manually", many other workloads broke or became organizationally
unmanageable when they were combined onto the same system and/or OS.
As a result, scale-out has continued and different virtualization
and partitioning technologies have rapidly grown in popularity
to optimize CPU resources.
But let's get back to "MM", memory management. The management
of RAM has not changed much to track this trend toward optimizing
resources. Since "RAM is cheap", the common response to performance
problems is "buy more RAM". Sadly, in this evolving world where
workloads may run on different machines at different times, this
classic response results in harried IT organizations all
buying more RAM for most or all of the machines in a data center.
A further result is that the ratio of total RAM in a data center
to the sum of the working sets of its workloads is often at
least 2x and sometimes as much as 10x. This means that somewhere
between half and 90% of the RAM in an average data center is
wasted, which is decidedly not cheap. So the question arises:
Is it possible to apply similar resource optimization techniques
to RAM?
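To make that arithmetic concrete: with R bytes of total RAM in a data
center and a combined working set of W, the idle fraction is (R - W)/R:

    R = 2W   ->  (2W - W)/2W    = 50% idle
    R = 10W  ->  (10W - W)/10W  = 90% idle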
A thought experiment
Bear with me and open your imagination for the following
thought experiment:
Let's assume that the next generation of processors has two new
instructions: PageOffline and PageOnline. When PageOffline is
executed (with a physical address as a parameter), that (4K)
page of memory is marked by the hardware as inaccessible and
any attempts to load/store from/to that location result in an
exception until a corresponding PageOnline is executed.
And through some performance registers, it is possible to
measure which pages are in the offline state and which are not.
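No real processor offers these instructions, of course, but the
semantics are easy to mimic in user space with mprotect(), if only as a
toy: marking a page PROT_NONE makes any load or store to it fault, and
restoring PROT_READ|PROT_WRITE brings it back. A minimal sketch (the
function names are mine, purely for illustration):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* A user-space toy: "offline" a page by revoking all access to it,
     * "online" it again by restoring access.  The hypothetical hardware
     * would do this to physical pages; mprotect() does it to one
     * virtual page, which is enough to demonstrate the semantics. */
    static int page_offline(void *page, size_t page_size)
    {
        return mprotect(page, page_size, PROT_NONE);
    }

    static int page_online(void *page, size_t page_size)
    {
        return mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        size_t page_size = sysconf(_SC_PAGESIZE);
        char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        page[0] = 42;                    /* online: access works        */
        page_offline(page, page_size);   /* any access would now fault  */
        /* page[0] = 43;  <- would raise the "exception" (SIGSEGV)      */
        page_online(page, page_size);    /* accessible again            */
        page[0] = 43;
        printf("page back online, value = %d\n", page[0]);

        munmap(page, page_size);
        return 0;
    }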
Let's further assume that John and Joe are kernel MM developers
and their employer "GreenCloud" is "green" and
enlightened. The employer offers the following bargain to John
and Joe and the thousands of other software engineers working at
GreenCloud:
"RAM is cheap but not free. We'd like to encourage you
to use only the RAM necessary to do your job. So, for every page
that you keep offline, on average over the course of the year,
we will add one-hundredth of one cent to your end-of-year bonus.
Of course, if you turn off too much RAM, you will be less efficient
at getting your job done, which will reflect negatively on your
year-end bonus. So it is up to you to find the right balance."
John and Joe quickly do some arithmetic: since my
machine has 8GB of RAM, if I keep 4GB offline on average, I will
be $100 richer. They immediately start scheming about how
to dynamically measure their working sets and optimize page
offlining.
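Their back-of-the-envelope numbers, using the story's round figures,
look something like this:

    4GB offline / 4KB per page           ≈ 1 million pages offline
    1 million pages x $0.0001/page-year  ≈ $100 per year

(Strictly, 4GB is 1,048,576 4KB pages, or about $105, but $100 is close
enough for daydreaming about a bonus.)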
But the employer goes on: "And for any individual page that you
have offline for the entire year, we will double that to two-hundredths
of a cent. But once you've chosen the "permanent offline"
option on a page, you are stuck with that decision until the
next calendar year."
John, anticipating the extra $200, decides immediately to try to
shut off 4GB for the whole year. Sure, there will be some workload
peaks where his machine will get into a swapstorm and he won't get
any work done at all, but that will happen rarely and he can
pretend he is on a coffee break when it happens. Maybe the
boss won't notice.
Joe starts crafting a grander vision; he realizes that, if he
can come up with a way to efficiently allow others' machines that
are short on RAM capacity to utilize the RAM capacity on his machine,
then the "spread" between temporary offlining and permanent offlining
could create a nice RAM market that he could exploit. He could
ensure that he always has enough RAM to get his job done, but
dynamically "sell" excess RAM capacity to those, like John, who
have underestimated their RAM needs ... at, say, fifteen thousandths
of a cent per page-year. If he can implement this "RAM capacity
sharing capability" in the kernel MM subsystem, he may be
able to turn his machine into a "RAM server" and make a tidy
profit. If he can do this ...
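For the record, one way to read the arithmetic behind Joe's scheme,
using the story's (entirely invented) prices:

    Permanent-offline bonus:   0.020 cents per page-year
    Temporary-offline bonus:   0.010 cents per page-year
    Joe's asking price:        0.015 cents per page-year

    A buyer like John keeps his 0.020-cent permanent-offline bonus, pays
    Joe 0.015, and nets 0.005 per page-year while avoiding the swapstorm.
    Joe forgoes the 0.010-cent temporary-offline bonus on each page he
    lends out, collects 0.015 from John, and also nets 0.005.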
Analysis
In the GreenCloud story, we have: (1) a mechanism for offlining
and onlining RAM one page at a time; (2) an incentive for using
less RAM than is physically available; and, (3) a market for load-
balancing RAM capacity dynamically. If Joe successfully figures
out a way to make his excess RAM capacity available to others
and get it back when he needs it for his own work, we may have
solved (at least in theory) the resource optimization problem for
RAM for the cloud.
While the specifics of the GreenCloud story may not be realistic
or accurate, some of the same factors do exist in the real
world. In a virtual environment, "ballooning" allows individual
pages to be onlined/offlined in one VM and made available to other
VMs; in a bare-metal environment, the RAMster project provides a similar capability. So, though primitive and
not available in all environments, we do have a mechanism.
Substantially reducing the total amount of RAM across a huge number
of machines in a data center would reduce both capital outlay and
power and cooling costs, improving resource efficiency and thus
potential profit. So we have an incentive and the foundation for
a market.
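For the curious, the balloon idea is easy to caricature in user space,
though this is emphatically not how a real balloon driver works (those
live in the guest kernel and hand physical pages back to the
hypervisor). The sketch below just shows capacity being taken away
from, and returned to, the other consumers of RAM on a machine; the
function names are mine:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* A caricature of ballooning: "inflating" grabs pages so nothing
     * else on the machine can use them; "deflating" gives them back.
     * In a real guest, the reclaimed pages would be handed to the
     * hypervisor for use by some other virtual machine. */
    static void *balloon;
    static size_t balloon_size;

    static int balloon_inflate(size_t bytes)
    {
        balloon = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (balloon == MAP_FAILED)
            return -1;
        memset(balloon, 0, bytes);  /* actually touch (and so take) the RAM */
        balloon_size = bytes;
        return 0;
    }

    static void balloon_deflate(void)
    {
        munmap(balloon, balloon_size);   /* give the RAM back */
        balloon = NULL;
        balloon_size = 0;
    }

    int main(void)
    {
        if (balloon_inflate(64UL << 20) == 0) {     /* take away 64MB  */
            printf("64MB ballooned away from other users\n");
            balloon_deflate();                      /* ... and returned */
        }
        return 0;
    }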
Interestingly, the missing piece, and where this article started, is
that most OS MM developers are laser-focused on the existing problem
from the classic single-machine world, which is, you will recall: how to
divide up a fixed amount of physical RAM to maximize a single
machine's performance across a wide variety of workloads.
The future version of this problem is this: how to vary the amount
of physical RAM provided by the kernel and divide it up to maximize
the performance of a workload. In the past, this was irrelevant:
you own the RAM, you paid for it, it's always on, so just use it.
But in this different and future world with virtualization, containers,
and/or RAMster, it's an intriguing problem. It will ultimately allow
us to optimize the utilization of RAM, as a resource, across a data
center.
It's also a hard problem, for three reasons. The first is that we can't
predict, but can only estimate, the future RAM demands of any workload.
But this is already true today; the only difference is whether the result
is "buy more RAM" or not. The second is that we need to understand
the instantaneous benefit (performance) of each additional page of
RAM (cost); my math is very rusty, but this reminds me of differential
calculus, where "dy" is performance and "dx" is RAM size. At every
point in time, increasing dx past a certain size will have no
corresponding increase in dy. Perhaps this suggests control theory
more than calculus, but the needed result is a true dynamic
representation of "working set" size. The third is that there is some
cost to moving capacity around efficiently; this cost (and its impact on
performance) must somehow be measured and taken into account as well.
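To put that hand-waving in slightly more concrete terms (my notation,
not an established formulation): if P(m,t) is a workload's performance
when given m pages of RAM at time t, and W(t) is its true working-set
size, then

    dP/dm  >  0    for m < W(t)
    dP/dm  ≈  0    for m > W(t)

and the kernel's job becomes choosing m(t) just large enough that the
marginal benefit of one more page has dropped to roughly zero, while
charging against P(m,t) whatever it costs to shift the surplus capacity
elsewhere and get it back later.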
But, in my opinion, this "calculus" is the future of memory management.
I have no answers and only a few ideas, but there are a lot of bright
people who know memory management far better than I do. My hope is to
stimulate discussion about this very possible future and how the kernel
MM subsystem should deal with it.