Kernel prepatch 4.0-rc7
Posted Apr 8, 2015 9:41 UTC (Wed) by tao (subscriber, #17563)
In reply to: Kernel prepatch 4.0-rc7 by kloczek
Parent article: Kernel prepatch 4.0-rc7
Out of the top 500 supercomputers in the world, 485 (!) run Linux. One runs a mix of Linux and something else. One runs Windows. The rest run AIX.
None of them run Solaris...
Posted Apr 8, 2015 10:07 UTC (Wed) by kloczek (guest, #6391)
These are not supercomputers but HPC clusters, usually built from horizontally scaled two-socket kits. Effectively these "supercomputers" are sometimes running thousands of separate Linux/Windows/AIX systems.
A few months ago a Solaris bug affecting systems with >=32TB of RAM (terabytes) was fixed. Please show me an example of a single Linux kernel running on a kit with that much RAM.
Even the few kits with a few hundred CPUs in a single machine cannot run a single Linux kernel across all those CPUs. These boxes work under a hypervisor or partitioning software which lets you run any system inside a partition. Effectively such systems work like a set of 1-2 CPU machines with faster interconnects.
Nevertheless, the number of workloads requiring that much memory in a single-image system is at the moment very small (we are talking about probably only a few hundred such systems in the whole world). Linux is not ready for that scale, and so far the only system able to run on such big-ass kits is Solaris.
At that scale the problem is not only Linux but the hardware as well. Intel CPUs cannot be used here, not because of address-space limitations but because the bandwidth to the memory subsystem is too low.
Try to compare the maximum memory bandwidth of the biggest Intel CPUs and the Sparc M7 CPU.
Posted Apr 8, 2015 13:58 UTC (Wed) by zdzichu (subscriber, #17118)
Posted Apr 8, 2015 22:11 UTC (Wed) by kloczek (guest, #6391)
Each CPU/memory node has 8 DIMM slots per CPU, so your application is limited by those 8 DIMMs.
To utilize this hardware you must have an MPI-oriented application, which additionally needs to be recompiled and linked against the MPI libraries delivered by SGI.
So if anyone asks "can I have one of those kits to run my in-memory MySQL DB?", the answer will be: no.
Posted Apr 8, 2015 22:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
You most certainly can. SGI's MPI library is a nice-to-have thingie that simply uses the cache-coherence protocols more efficiently than most regular software does. But it's by no means essential.
And ANY system of that scale is NUMA-ish, because you simply can't have one central controller overseeing all the action (simple light-speed delay in long conductors becomes an issue at this scale!).
Posted Apr 9, 2015 0:32 UTC (Thu) by kloczek (guest, #6391)
Theoretically? Of course :)
In practice .. so where can I download MySQL source code redesigned to use MPI for access to the shared innodb pool?
Posted Apr 9, 2015 1:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Moreover, you can run multiple apps in parallel without any problems; NUMALink will simply migrate the required RAM pages onto the nodes with the CPUs using them.
You only need NUMA awareness if you want really heavy communication between multiple concurrent threads/processes.
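(On Linux you can also steer placement by hand; a minimal sketch with the stock numactl tool - the binary name my_app is only a placeholder:)
$ numactl --hardware # list nodes, per-node free memory, node distances
$ numactl --interleave=all ./my_app # spread allocations round-robin across all nodes
$ numactl --cpunodebind=0 --membind=0 ./my_app # pin both CPU and memory to node 0
$ numastat -p $(pidof my_app) # check where the pages actually landed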
Posted Apr 9, 2015 3:53 UTC (Thu) by kloczek (guest, #6391)
OKi doki .. let's say I have a 1TB in-memory database with only one table. Such a memory region cannot fit within one UV node's memory (8 DIMMs per node).
It will be interesting how big the ratio is between the time spent doing the memory scan and the time spent waiting for the right pages to be delivered over the interconnect to the right node.
So when doing a full table scan, either my DB process will be jumping between NUMA nodes or the interconnects will be constantly in use, and that will be the biggest bottleneck. Each interconnect transaction probably takes a few thousand CPU cycles (much more than an L2 cache transaction).
(googling) .. ok, I found: http://www.hpcresearch.nl/euroben/Overview/web12/altix.php
"The distinguishing factor of the UV systems is their distributed shared memory that can be up to 8.2 TB. Every blade can carry up to 128 GB of memory that is shared in a ccNUMA fashion through hubs and the 6th generation of SGI's proprietary NumaLink6. A very high-speed interconnect with a point-to-point bandwidth of 6.7 GB/s per direction; doubled with respect to the former NumaLink5."
WOW .. "6.7 GB/s per direction"!!!
Remember that HPC interconnects are usually optimized for the lowest possible latency (not for maximum bandwidth).
So .. really, the UV interconnect is not about bandwidth :)
IIRC one socket of the latest v3 Intel CPU can do a memory scan at something like 150 or 250 GB/s.
The current Sparc M7 can do this with a single CPU socket, which can handle up to 2TB of RAM per socket. Each socket has 16 cores and each core can run up to 8 CPU threads. Each M7 can scan its own node's memory at up to 1TB/s per CPU socket.
$ echo "1024/6.7" | bc
152
So such an operation on a UV 2000 can (theoretically) be up to 150 times slower than on a single-socket M7 kit.
Let's forget temporarily that the M7 CPU can do a memory scan on a database compressed with columnar compression, using a special CPU accelerator subsystem for such operations, which means that my 1TB in-memory table can take less than 1TB of RAM (usually a compression ratio of 6-7 is OK, but sometimes it can be even 20 or 30).
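(As a quick sanity check of that claim, with plain bc arithmetic - the 7:1 ratio is just a figure from the range assumed above:)
$ echo "1024/7" | bc
146
A 1TB table at a 7:1 compression ratio would occupy roughly 146 GB, i.e. two of the 128 GB UV blades quoted above instead of eight.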
Less physical memory used means that warming up the database can be done faster by a factor of the compression ratio (let's forget that each UV node has only one x16 PCI slot, which will probably be a real pain in the a*s when warming up the DB).
During such scans on the M7 the CPU caches are not used, which means that this in-memory operation will not kick my DB app code and other data out of the CPU caches .. which means -> fewer other hiccups.
Still, it looks like investing in M7 hardware may make more sense by a factor of 10, if not close to 100, per buck or quid, if someone will be using TB-scale in-memory DBs. All of the above is not relevant to the Linux-or-Solaris dilemma :)
MPI could easily solve saturating the interconnects by spreading such a memory scan across multiple NUMA nodes. However, there is no mysql or Oracle DB using the MPI API for accessing the SGA or the innodb memory pool :->
I know that I've made a few assumptions above, but these calculations are probably not far from the real numbers :)
Posted Apr 9, 2015 9:41 UTC (Thu) by tao (subscriber, #17563)
What's relevant is:
Out of the top 500 supercomputers, *none* of them use Solaris. Zip. Zilch. Nada. Hell, even Windows is more popular than Solaris on such hardware.
According to you Solaris performs much better than Linux on identical hardware. Hence it'd make sense for all those supercomputers to run Solaris rather than Linux. Why don't they?
Surely by now at least a few of them should've discovered that Solaris is oh so much better than Linux, right?
Posted Apr 9, 2015 11:16 UTC (Thu) by kloczek (guest, #6391)
Because my workloads are usually DB-generated IO-hog workloads.
Again: using the name "supercomputer" here is very misleading.
HPC installations are usually highly optimized for an exact, CPU-intensive, well-localized workload. You can do such optimization well enough if you can customize some parts of the base OS or the userspace environment. However, there is a quite big probability that such an optimization cannot be repeated in another customer's HPC environment (because other customers may have different HPC needs). Most HPC environments are supported at the OS/app layer by an on-site engineering team, and all they need is decent hardware support.
HPC is all about doing massive computations as cheaply as possible, using every available trick, which quite often involves quite big customization. Software support and/or the help of a professional OS support team are not at the top of the external-help priority list. Try observing the linux kernel lists to see how often the HPC guys need some help (I don't remember even one case in more than the last decade).
Example: the Top500 tests are not testing interconnects. A bunch of 1k boxes connected over RS-232 cables and over IB will generate almost the same index.
The strongest aspects of Solaris, like HA, security and high data integrity, are usually not at the top of HPC priorities, as some calculations are simply run many times (software-specific factors).
Security? The whole HPC environment is usually secured at the "physical layer" (receptions, guards, bars, doors, network segments physically separated from the outside world, etc).
HA? During the longest calculations checkpointing is very often used. If in the middle of processing there is a total power failure, or a single node burns out, the whole computation can be quickly continued from the last checkpoint state. A few hours or even days of downtime is not a big deal, as almost all of the hourly cost of running the whole environment is power and cooling.
However, such aspects are not counted in the overall Top500 indexes.
Part of a single HPC ecosystem is usually some set of systems with IO-hog workloads (backups, data import/export/(pre/post)processing etc). It is quite a usual case that those parts run under Solaris (ZFS is a very welcome friend here).
Posted Apr 11, 2015 18:05 UTC (Sat) by Wol (subscriber, #4433)
> > According to you Solaris performs much better than Linux on identical hardware. Hence it'd make sense for all those supercomputers to run Solaris rather than Linux. Why don't they?
> Because my workloads are usually DB generated IO hog workloads.
So why not ditch relational and move from an (inappropriate) theory-based database to an engineering-based database that actually scales? :-)
Cheers,
Wol
Posted Apr 14, 2015 14:03 UTC (Tue) by mathstuf (subscriber, #69389)
Posted Apr 14, 2015 11:30 UTC (Tue) by nix (subscriber, #2304)
Posted Apr 8, 2015 17:40 UTC (Wed) by Wol (subscriber, #4433)
Cheers,
Wol
Posted Apr 8, 2015 19:17 UTC (Wed) by vonbrand (subscriber, #4458)
Around 2000 we migrated our ageing Suns from Solaris to Linux. The difference in performance was not funny (it gave the machines a couple of years of life extension). I remember some discussions here about Solaris, illumos, and related software. None for quite some time now (except for trolls around ZFS). In the same vein, I've seen nothing about Oracle Linux here either. Looks to me like all of that is dead as a doornail.
Posted Apr 8, 2015 19:22 UTC (Wed) by dlang (guest, #313)
Solaris "scaled" better than Linux, in that a 64 cpu box was closer to 64x better than a 1 cpu box, but the base performance of Linux was so much better than until you got to a massive box, Linux would still outperform Solaris. And over time Linux has gained scalability, far beyond anything Solaris ever ran.
Posted Apr 9, 2015 13:15 UTC (Thu) by pr1268 (guest, #24648)
I find that surprising, seeing how Sun designed and built the hardware, OS kernel, a matching suite of shell utilities, and even the compiler for all this. Perfect vertical integration! (Except for third-party userspace apps.) You'd think that Sun would have optimized all this to Timbuktu and back given their knowledge of the architecture. Or, I could propose a conspiracy theory that Sun intentionally crippled performance in an attempt to get their customers to upgrade frequently. Or perhaps everything was rushed out the door by the marketing department. ;-)
Posted Apr 9, 2015 18:03 UTC (Thu) by dlang (guest, #313)
At the time, Linux wouldn't perform well with 8 CPUs and would have been horrible on 64 CPUs,
but on 1-2 CPUs, Linux absolutely trounced Solaris.
As Linux has matured, there has been an ongoing emphasis on keeping performance good on small systems while improving it on large systems. When something new is introduced, Linus is very vocal in demanding that it not hurt performance when it's not needed/used.
Yes, the smallest systems have been getting larger. I don't like that. But the range between the smallest systems that work well and the largest keeps growing dramatically.
Posted Apr 8, 2015 21:36 UTC (Wed) by kloczek (guest, #6391)
> I remember some discussions here about Solaris, illumos, and related software. None for quite some time now (except for trolls around ZFS). In the same vein, I've seen nothing about Oracle Linux here either. Looks to me like all of that is dead as a doornail.
ZFS was introduced in November 2005. Whatever you were doing 5 years earlier, you were not able to evaluate ZFS on Sun hardware.
If you are talking about comparing Linux vs Solaris on very old hardware which was not supported by Solaris 10, it is really pure bollocks.
In 2005 all the Sun UltraSPARC hardware on which Linux worked was at end-of-support or very close to it.
Just try to take 5-year-old (or older) disks and try to put together some x86 hardware to run a few benchmarks.
What is the sense of doing this? I have completely no idea.
Posted Apr 8, 2015 22:09 UTC (Wed) by vonbrand (subscriber, #4458)
Just that I won't believe that Solaris suddenly got performant after that little affair.
Posted Apr 8, 2015 22:33 UTC (Wed) by kloczek (guest, #6391)
Since the Solaris 10 Express development cycle (pre-GA Solaris 10), every microbenchmark showing that something is slower on Solaris than on Linux has been treated by Solaris support as a *critical bug*, and nothing has changed up to now.
The first Solaris 10 GA was released in January 2005. It seems you've lost almost 10 years of new Solaris features.
In the last 10 years of using Solaris I've seen many examples where Solaris running on the same commodity x86 hardware with paid support (even on non-Oracle/Sun HW) was cheaper than free Linux, only because it was possible to stay longer on the same hardware, while with Linux it was necessary to buy more powerful hardware.
If you think that, for example, ZFS is worthless, I can only tell you that a few months ago I migrated some MySQL DB on the same hardware, and just by switching from ext4 to ZFS it was possible to observe a drop in physical IOs/s by a factor of up to 3 (not 3% but up to three times fewer IOs). Such an example is not quite unique.
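(A hedged sketch of how such a drop can be observed on both setups; the device and pool names are made up:)
$ iostat -x 5 # before, on ext4: watch r/s and w/s for the DB device
$ zpool iostat -v tank 5 # after, on ZFS: per-vdev read/write ops for pool "tank"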
Try to have a look at:
https://www.youtube.com/watch?v=HRnGZYEBpFg&list=PLH8...
You can start from:
https://www.youtube.com/watch?v=TrfD3pC0VSs&index=6&...
FYI: at the moment OpenSolaris/illumos is in many areas behind the latest Oracle Solaris.
Posted Apr 8, 2015 22:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
Yeah, sure. I worked with Slowlaris and ZFS back in 2008-2009 and it definitely was waaaay slower than Linux in many aspects. In particular, we had very concurrent workloads and Linux scheduling was vastly superior.
And of course, let's not forget the historical: http://cryptnet.net/mirrors/texts/kissedagirl.html which sums it up perfectly.
Posted Apr 9, 2015 0:04 UTC (Thu) by kloczek (guest, #6391)
Sorry, do you really want to say that IO scheduling can beat using free lists instead of allocation structures, or using COW semantics?
Do you really understand the impact of using COW semantics?
Maybe some quotes:
https://storagegaga.wordpress.com/tag/copy-on-write/
"btrfs is going to be the new generation of file systems for Linux and even Ted T’so, the CTO of Linux Foundation and principal developer admitted that he believed btrfs is the better direction because “it offers improvements in scalability, reliability, and ease of management”."
If you don't know: btrfs uses COW semantics by default.
Just below the above is the next paragraph:
"For those who has studied computer science, B-Tree is a data structure that is used in databases and file systems. B-Tree is an excellent data structure to store billions and billions of objects/data and is able to provide fast data retrieval in logarithmic time. And the B-Tree implementation is already present in some of the file systems such as JFS, XFS and ReiserFS. However, these file systems are not shadow-paging filesystems (popularly known as copy-on-write file systems).
You see, B-Tree, in its native form of implementation, is very incompatible with COW file systems. In fact, the implementation was thought of impossible, until someone by the name of Ohad Rodeh came along. He presented a paper in Usenix FAST ’07 which described the challenges of combining the B-Tree concept with shadow paging file systems. The solution, as he described it, was to introduce insert/remove key into the tree structure, and removing the dependency of intra-leaf linking"
So maybe you will be able to explain why ext4 has no COW?
And a second one, a few years earlier:
https://lkml.org/lkml/2008/9/27/217
"What do you mean by "copy on write", precisely? Do you mean at the
How was it possible that Ted Ts'o changed his mind about COW in the last few years????
Maybe you don't care about using COW, but a core Linux fs developer does.
In real scenarios LVM snapshotting is not enough, because every new LVM snapshot slows down interaction with the snapshotted block device. With ZFS you can create as many snapshots as you want and the performance will stay the same. That effect is a combination of using free lists and COW.
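(For comparison, the two snapshot workflows - the volume and dataset names are examples only:)
$ lvcreate --size 1G --snapshot --name dbsnap /dev/vg0/dblv # LVM: every extra snapshot adds copy-out overhead to writes on the origin
$ zfs snapshot tank/db@before-upgrade # ZFS: constant-time, no ongoing write penalty on the origin
$ zfs list -t snapshot # list the snapshots just taken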
BTW: try to read https://rudd-o.com/linux-and-free-software/ways-in-which-...
If you want to continue this discussion, please .. tell us more about what you were really testing. I'm really interested in what exactly you did and what kind of results you got :)
> And of course, let's not forget the historical: http://cryptnet.net/mirrors/texts/kissedagirl.html which sums it up perfectly
Of course you didn't notice that the above link points to a text which contains the line:
Date: 1996/10/29
Do you want to say that you were testing ZFS or Solaris 10 in 1996?
Solaris 10 GA was January 2005 .. ~8 years after this email was sent.
Posted Apr 9, 2015 0:18 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
We had a project doing lots and lots of IO interspersed with heavy multithreading. Lots of this IO was O_DIRECT, so it didn't care a fig about COW.
And most certainly, ZFS is not the _fastest_ FS. Ext4 or XFS are usually faster than either ZFS or btrfs in many benchmarks, simply because they need to do much less work for a typical IO request, doubly so for many metadata-heavy workloads.
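(That kind of IO is trivial to reproduce from the shell; the file name is arbitrary:)
$ dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct # O_DIRECT writes, page cache bypassed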
> Do you want to say that you were testing ZFS or Solaris 10 in 1996?
Errr... I haven't really parsed the meaning of this.
There's a reason why Solaris disappeared from the Top500. Think about it.
Posted Apr 9, 2015 1:09 UTC (Thu) by kloczek (guest, #6391)
It seems you don't understand COW. It has to be plugged in below the VFS layer, at the block allocation stage.
Using COW means that random reads at the VFS layer cause random reads at the block layer as well.
However, that is not the case for write IOs. COW can transform a random VFS write workload into sequential write IOs, or into a smaller number of such IOs. With clever IO scheduling you are able to reduce the number of physical write IOs.
THIS is the main advantage, even on SSDs: doing fewer, bigger write IOs instead of batches of small IOs gives you better performance (and reduces the number of interrupts as well).
> We had a project doing lots and lots of IO interspersed with heavy multithreading. Lots of this IO was O_DIRECT, so it didn't care a fig about COW.
ZFS at the lower layers does not care about the multithreaded sources of IOs; from the app's point of view ZFS is multithread-agnostic. At the same time ZFS, as an in-kernel-space application, is internally multithreaded and can balance or spread even a single stream of IOs across pooled storage much better, using threads.
O_DIRECT was designed for in-place filesystems, to let IO bypass the filesystem layer and caching. Generally, bypassing ZFS caching is probably the most stupid thing that can happen when someone doesn't understand ZFS or doesn't understand what exactly the application is doing.
However, if you really understand your application and really know what you are doing and want to bypass ZFS caching, you can do this without magic .. by setting, per volume, primarycache=none or primarycache=metadata. Everything OOTB :)
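(Concretely, assuming a dataset named tank/db:)
$ zfs set primarycache=metadata tank/db # keep only metadata in the ARC
$ zfs set primarycache=none tank/db # or bypass the ARC entirely for this dataset
$ zfs get primarycache tank/db # confirm the setting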
Posted Apr 9, 2015 1:30 UTC (Thu) by kloczek (guest, #6391)
For example, during the initial import of a MySQL database from a text dump you can switch the volume setting to sync=disabled, which can make such an import waaaay faster :)
http://milek.blogspot.co.uk/2010/05/zfs-synchronous-vs-as...
The above feature was implemented by a friend of mine when we were working together at the same company.
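(The whole import workflow then looks roughly like this, with a made-up pool/dataset name:)
$ zfs set sync=disabled tank/mysql # acknowledge writes without waiting for the ZIL
$ mysql mydb < dump.sql # bulk import runs without per-commit sync stalls
$ zfs set sync=standard tank/mysql # restore the safe default afterwards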
BTW, claiming that other FSes are as fast as ZFS when none of them can concatenate write IOs is really not true.
Posted Apr 9, 2015 1:53 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
If you mean "IO request coalescing" then Linux can do it since 2.4 days.
Posted Apr 9, 2015 2:28 UTC (Thu) by kloczek (guest, #6391)
No, this is not about that.
If something at the VFS layer does two update operations in two different files, and those files use blocks in separate locations (i.e. one at the front of the block device and the second at the end of the same bdev), COW at the block layer means that neither of those two regions will be overwritten in place during the random updates inside those files; instead, new space will be allocated and written with a single (bigger) IO.
Again: the consequence of using COW is a high chance of transforming a random-write IO workload into one with sequential-write characteristics. Fewer seeks, and a high chance of concatenating VFS write IOs when doing IOs at the block device layer. This allows you to reduce the required physical-layer IO bandwidth.
http://faif.objectis.net/download-copy-on-write-based-fil...
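(One crude way to watch this effect, assuming fio is installed and the pool is called tank:)
$ fio --name=rndwr --filename=/tank/db/testfile --rw=randwrite --bs=8k --size=1G # random 8k writes at the VFS level
$ zpool iostat -v tank 1 # meanwhile: ZFS issues far fewer, larger writes to the disks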
Posted Apr 9, 2015 1:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
> ZFS at the lower layers does not care about the multithreaded sources of IOs; from the app's point of view ZFS is multithread-agnostic. At the same time ZFS, as an in-kernel-space application, is internally multithreaded and can balance or spread even a single stream of IOs across pooled storage much better, using threads.
That's not an advantage if applications themselves can do it. Applications like, you know, Oracle databases.
No, nothing is "multithread agnostic". ZFS has its own complicated metadata that needs additional locking compared to simplistic filesystems like ext4/ext3.
Surely, CoW and other tricks in ZFS give the ability to easily make snapshots, do checksumming and so on.
Except that quite often I don't care about them - right now I'm tuning a Hadoop cluster, and ext4 with barriers and journaling disabled totally wins over XFS and btrfs.
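(For reference, the tuning in question is roughly the following - /dev/sdX1 is a placeholder:)
$ mkfs.ext4 /dev/sdX1
$ tune2fs -O ^has_journal /dev/sdX1 # drop the journal; a crash now means a full fsck
$ mount -o noatime,barrier=0 /dev/sdX1 /data # and disable write barriers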
Posted Apr 9, 2015 1:44 UTC (Thu) by kloczek (guest, #6391)
Try to estimate how much income all these HPC systems generate for all the software vendors.
Posted Apr 9, 2015 2:42 UTC (Thu) by rodgerd (guest, #58896)
Posted Apr 9, 2015 4:10 UTC (Thu) by kloczek (guest, #6391)
No, I'm not. If I can discuss anything here, it is only the financial aspects of supporting Solaris or another OS on HPC platforms, from both the consumers' and the hardware/software vendors' points of view.
> you'll need to explain why Solaris is so successful that Sun went broke and Oracle refuse to break out their Solaris sales on their earnings reports.
I have no idea about the real reasons for the above, but I know that since the Sun days the number of developers involved in Solaris development has grown several times over. I don't think Oracle hired more developers to work on Solaris just "for fun". From this fact alone, I don't think your suspicions are correct/relevant.
Posted Apr 10, 2015 0:22 UTC (Fri) by cesarb (subscriber, #6266)
> How was it possible that Ted Ts'o changed his mind about COW in the last few years????
As far as I know, there are still no plans to implement copy-on-write in ext4. So no, he didn't change his mind.