MARS could help here (for some parts)
Posted Mar 13, 2016 8:01 UTC (Sun) by schoebel (guest, #107651)
Parent article: Filesystem support for SMR devices
MARS turns many random writes into (near-)sequential ones by its memory buffer architecture.
Please look at the appendix slide titled "MARS Light Data Flow Principle" at https://github.com/schoebel/mars/blob/master/docu/MARS_GU... . Also look into mars-manual.pdf in the same directory.
Not only are the transaction logfiles written sequentially; the writeback into the underlying disk is also re-ordered by ascending sector numbers. This type of access pattern comes _close_ to what SMR drives probably want.
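To illustrate the re-ordering idea only, here is a minimal Python sketch. It is not MARS code (MARS is a kernel module); the function name `plan_writeback` and the (sector, data) tuple format are assumptions made for this example. Buffered writes are sorted by ascending sector number, and a later write to the same sector replaces an earlier one.

```python
from typing import Dict, Iterable, List, Tuple


def plan_writeback(writes: Iterable[Tuple[int, bytes]]) -> List[Tuple[int, bytes]]:
    """Return buffered writes ordered by ascending sector number.

    Hypothetical helper: a later write to the same sector replaces an
    earlier one, so each sector is written back at most once per flush.
    """
    latest: Dict[int, bytes] = {}
    for sector, data in writes:
        latest[sector] = data  # keep only the newest data per sector
    # Sector numbers are unique dict keys, so this sorts by sector.
    return sorted(latest.items())


# Random-looking arrival order, as it might come from an application.
incoming = [(900, b"a"), (17, b"b"), (512, b"c"), (17, b"d"), (18, b"e")]
for sector, data in plan_writeback(incoming):
    print(sector, data)
```

The random arrival order 900, 17, 512, 17, 18 becomes the ascending writeback order 17, 18, 512, 900, with the second write to sector 17 winning.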
Notice that MARS can also be used standalone (without replication of the transaction logfiles), just for the sake of performance.
Example: assume we have 10 MARS resources on a storage server in a datacenter. You will then get 10 near-sequential logfile write streams, and 1 near-sequential writeback stream which cycles through the resources. Thus, by design, only about 11 zones will be open at a time.
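The arithmetic can be written down as a small Python sketch. It assumes, as a simplification, that every write stream keeps exactly one zone open; the function `open_zones` and its parameters are hypothetical and only serve this estimate. The `rotating` parameter counts resources that briefly keep a second logfile open.

```python
def open_zones(resources: int, writeback_streams: int = 1, rotating: int = 0) -> int:
    """Estimate the number of concurrently open SMR zones.

    Assumption: one zone per logfile write stream, one per writeback
    stream, and one extra per resource that briefly has a second
    logfile open.
    """
    if not 0 <= rotating <= resources:
        raise ValueError("rotating must be between 0 and resources")
    return resources + writeback_streams + rotating


print(open_zones(10))               # steady state: 10 logfiles + 1 writeback
print(open_zones(10, rotating=10))  # every resource has two logfiles open
```

With 10 resources this gives 11 zones in the steady state and 21 when every resource has two logfiles open at once.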
The actual situation is slightly more complicated because of the following effects. They are only relevant when the /mars filesystem is placed on SMR drives (which can be avoided in datacenter setups):
1) During the MARS log-rotate operation, two transaction logfiles may be open for a short transitional period. Thus the number of open zones could double in the worst case. This could be reduced by running the log-rotate operations sequentially, resource by resource, instead of in parallel.
2) The current MARS implementation assumes _hardware_ RAID controllers with BBUs. These provide a very fast RAM cache for the frequent inode updates caused by fsync(). This pain point should be resolved, either through cooperation with hardware RAID vendors, and/or via modification of MARS (e.g. using fallocate() to pre-allocate larger logfile chunks could be feasible in the future).
3) Currently all metadata information is stored in symlinks in the /mars filesystem. Fortunately, updating them is not performance-critical. The symlinks will be replaced anyway for kernel upstream submission of MARS. Details are to be discussed.
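To make the fallocate() idea from point 2) more concrete, here is a userspace Python sketch. It is not MARS code and not a proposed implementation: the function `append_record`, the 64 MiB chunk size and the bookkeeping variables are assumptions chosen for illustration. The logfile is extended in large pre-allocated chunks, so that most appends do not change the file size. Whether this really saves inode updates depends on the filesystem.

```python
import os
import tempfile
from typing import Tuple

CHUNK = 64 * 1024 * 1024  # hypothetical pre-allocation chunk size (64 MiB)


def append_record(fd: int, pos: int, allocated: int, record: bytes,
                  chunk: int = CHUNK) -> Tuple[int, int]:
    """Append a record at offset pos, pre-allocating space in big chunks.

    Returns the new (pos, allocated) pair. The caller has to track pos
    itself, because the file size no longer marks the end of the log.
    """
    while pos + len(record) > allocated:
        # Reserve the next chunk; this also grows the file size.
        os.posix_fallocate(fd, allocated, chunk)
        allocated += chunk
    os.pwrite(fd, record, pos)
    os.fdatasync(fd)  # flush data; the file size did not change here
    return pos + len(record), allocated


# Usage with a temporary file and a small chunk size for demonstration.
fd, path = tempfile.mkstemp()
try:
    pos, allocated = 0, 0
    for payload in (b"first record\n", b"second record\n"):
        pos, allocated = append_record(fd, pos, allocated, payload, chunk=4096)
    print(pos, allocated, os.fstat(fd).st_size)
finally:
    os.close(fd)
    os.unlink(path)
```

One consequence of this design is that the valid end of the log must be stored or detected separately (for example via record checksums), since the pre-allocated tail of the file contains no valid data.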
In order to get the best out of MARS as quickly as possible, I would suggest the following:
A) First of all, don't try to offload performance characteristics into higher layers of the operating system at all. I support the opinions of Hannes and Dave 100%. Don't introduce new interfaces just for the sake of performance. Performance issues should be solved inside of black boxes whenever possible. Please _emulate_ a classical block layer interface (preferably in the drive firmware).
B) On that basis, start with a /mars filesystem on dedicated _conventional_ spindles, attached to a hardware RAID controller with BBU. Typically less than 1 TB is sufficient for /mars. Only the _mass_ data (in the range of several hundred TB) should be placed on SMR drives.
C) Cooperate with hardware RAID vendors to optimize RAID parameters such as stripe sizes for best performance of the internal writeback strategy from the BBU cache to the SMR media. In particular, delayed writeback (which is typically already implemented) could be _tuned_ for better coalescing of SMR zone updates.
D) Hint: look at blkreplay.org for real-life workloads (recorded via blktrace at 1&1 datacenters). They differ vastly from artificial benchmarks. Otherwise you might be misled by wrong assumptions about real workload behaviour. Compare MARS vs non-MARS setups using such real-life workloads.
E) After that, talk to me about improvements to MARS.
Cheers, Thomas
Posted Mar 19, 2016 6:59 UTC (Sat) by schoebel (guest, #107651)
1) Short term: SMR drives need to be established in their market niche (regardless how big this niche might be at the beginning).
If SMR drives cannot gain their own market niche, the following steps would be pointless. So I think this is priority #1.
Here MARS could help at the block layer, needing only minimal modifications itself (quick win).
Altering higher layers will need some time anyway (e.g. major developments at the FS layer are measured in decades nowadays).
2) Medium term: adapt block layer components (not limited to MARS) to the specifics of SMR.
This is easier than 3) because of lower complexity.
Please talk with me, I have some ideas about this.
Best would be a discussion at some Linux conference with all block layer people, changing the title to "Block Layer Support for SMR devices".
3) Long term: FS layer adaptations, which the discussion here has already started on.
This is best done after SMR drives have really established their niche. That would be a much better motivation for long-lasting, tedious work ;)
