M7350v1_en_gpl

This commit is contained in:
T
2024-09-09 08:52:07 +00:00
commit f9cc65cfda
65988 changed files with 26357421 additions and 0 deletions


@@ -0,0 +1,18 @@
00-INDEX
- This file
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
- The members of struct request (in include/linux/blkdev.h)
stat.txt
- Block layer statistics in /sys/block/<dev>/stat
switching-sched.txt
- Switching I/O schedulers at runtime
writeback_cache_control.txt
- Control of volatile write back caches

[File diff suppressed because it is too large]


@@ -0,0 +1,15 @@
Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability.

capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h
Capability Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY 4
When this bit is set, the disk supports Asynchronous Notification
of media change events. These events will be broadcast to user
space via kernel uevent.
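
For illustration, a minimal user-space sketch (sda is a placeholder
device name) could read the capability file and test this bit:

#include <stdio.h>

#define GENHD_FL_MEDIA_CHANGE_NOTIFY	4

int main(void)
{
	unsigned int cap;
	FILE *f = fopen("/sys/block/sda/capability", "r");

	if (!f) {
		perror("open");
		return 1;
	}
	if (fscanf(f, "%x", &cap) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("media change notification: %s\n",
	       (cap & GENHD_FL_MEDIA_CHANGE_NOTIFY) ? "supported" : "not supported");
	return 0;
}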


@@ -0,0 +1,116 @@
CFQ ioscheduler tunables
========================
slice_idle
----------
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.
By default slice_idle is a non-zero value, which means that by default we
idle on queues/service trees. This can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where we can cut down on the overall
number of seeks and see improved throughput.
Setting slice_idle to 0 will remove all idling at the queue/service tree
level, and one should see overall improved throughput on faster storage
devices like multiple SATA/SAS disks in a hardware RAID configuration. The
downside is that the isolation provided from WRITES also goes down and the
notion of IO priority becomes weaker.
So depending on storage and workload, it might be useful to set slice_idle=0.
In general, for SATA/SAS disks and software RAID of SATA/SAS disks, keeping
slice_idle enabled should be useful. For any configuration where there are
multiple spindles behind a single LUN (host-based hardware RAID controller
or storage arrays), setting slice_idle=0 might end up giving better
throughput and acceptable latencies.
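
For example, on a CFQ-managed disk (sda is a placeholder; the value 8,
in milliseconds, is what CFQ commonly ships with as the default):

# cat /sys/block/sda/queue/iosched/slice_idle
8
# echo 0 > /sys/block/sda/queue/iosched/slice_idle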
CFQ IOPS Mode for group scheduling
===================================
Basic CFQ design is to provide priority-based time slices: a higher priority
process gets a bigger time slice and a lower priority process gets a smaller
time slice. Measuring time becomes harder if the storage is fast and supports
NCQ, where it is better to dispatch multiple requests from multiple cfq
queues into the request queue at a time. In such a scenario, it is not
possible to accurately measure the time consumed by a single queue.
What is possible, though, is to measure the number of requests dispatched
from a single queue while also allowing dispatch from multiple cfq queues at
the same time. This effectively becomes fairness in terms of IOPS (IO
operations per second).
If one sets slice_idle=0 and the storage supports NCQ, CFQ internally
switches to IOPS mode and starts providing fairness in terms of the number
of requests dispatched. Note that this mode switch takes effect only for
group scheduling. For non-cgroup users nothing should change.
CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to arrive
on the same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.
The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read is issued only after completion of the
previous one), then not dispatching requests from another queue should help,
as we did not move the disk head and kept on dispatching sequential IO from
one queue.
CFQ has the following service trees, and the various queues are put on these trees:
sync-idle sync-noidle async
All cfq queues doing synchronous sequential IO go onto the sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go onto the sync-noidle tree. Any
requests marked with REQ_NOIDLE also go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch,
we will idle only once the last queue has dispatched its IO and there is
no more IO on this service tree.
All async writes go on the async service tree. There is no idling on async
queues.
CFQ has some optimizations for SSDs: if it detects non-rotational media
which can support a higher queue depth (multiple requests in flight at a
time), then it cuts down on the idling of individual queues; all the queues
move to the sync-noidle tree and only tree idling remains. This tree idling
provides isolation from the buffered write queues on the async tree.
FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?
A1. We only do tree idling (across all queues on the sync-noidle tree) for
    queues marked with REQ_NOIDLE. This helps in providing isolation from
    all the sync-idle queues. Otherwise, in the presence of many sequential
    readers, other synchronous IO might not get a fair share of the disk.
    For example, if there are 10 sequential readers doing IO and they get
    100ms each, a newly arrived REQ_NOIDLE request will be scheduled
    roughly 1 second later. If, after completion of the REQ_NOIDLE request,
    we do not idle and a couple of milliseconds later another REQ_NOIDLE
    request comes in, it will again be scheduled roughly 1 second later.
    Repeat this and notice how a workload can lose its disk share and
    suffer due to multiple sequential readers.
    fsync can generate dependent IO where a bunch of data is written in the
    context of fsync, and later some journaling data is written. The
    journaling data comes in only after fsync has finished its IO (at least
    for ext4 that seemed to be the case). Now if one decides not to idle on
    the fsync thread due to REQ_NOIDLE, then the next journaling write will
    not get scheduled for another second. A process doing small fsyncs will
    suffer badly in the presence of multiple sequential readers.
    Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
    provides isolation from multiple sequential readers while at the same
    time we do not idle on individual threads.
Q2. When to specify REQ_NOIDLE?
A2. Whenever one is doing a synchronous write and is not expecting more
    writes to be dispatched from the same context soon, one should be able
    to specify REQ_NOIDLE on the writes, and that should work well for
    most cases.
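
    As a sketch of A2 in kernel code: in kernels of this vintage the
    WRITE_SYNC request flags already include REQ_NOIDLE (an assumption to
    verify against include/linux/fs.h of the tree at hand), so a one-off
    synchronous write would simply be submitted as:

	/* one-off synchronous write; CFQ will only tree-idle for it
	 * because WRITE_SYNC carries REQ_NOIDLE in kernels of this era */
	submit_bio(WRITE_SYNC, bio);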


@@ -0,0 +1,327 @@
----------------------------------------------------------------------
1. INTRODUCTION
Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.
The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family of
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. For
some protection schemes it also ensures that the I/O is written to the
right place on disk.
Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.
----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS
As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.
The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.
Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.
The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.
----------------------------------------------------------------------
3. KERNEL CHANGES
The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.
The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.
Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.
However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.
The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.
Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as it sees fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.
----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS
4.1 BIO
The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
to a struct bip which contains the bio integrity payload. Essentially
a bip is a trimmed down struct bio which holds a bio_vec containing
the integrity metadata and the required housekeeping information (bvec
pool, vector count, etc.)
A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.
Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().
bio_free() will automatically free the bip.
4.2 BLOCK DEVICE
Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().
The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.
----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API
5.1 NORMAL FILESYSTEM
The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.
IMD generation and verification can be toggled using the
/sys/block/<bdev>/integrity/write_generate
and
/sys/block/<bdev>/integrity/read_verify
flags.
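
For example (sda is a placeholder device name; the output shown is
illustrative):

# cat /sys/block/sda/integrity/write_generate
1
# echo 0 > /sys/block/sda/integrity/write_generate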
5.2 INTEGRITY-AWARE FILESYSTEM
A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.
int bdev_integrity_enabled(block_device, int rw);
bdev_integrity_enabled() will return 1 if the block device
supports integrity metadata transfer for the data direction
specified in 'rw'.
bdev_integrity_enabled() honors the write_generate and
read_verify flags in sysfs and will respond accordingly.
int bio_integrity_prep(bio);
To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).
Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
bio_integrity_prep() should only be called if
bdev_integrity_enabled() returned 1.
int bio_integrity_tag_size(bio);
If the filesystem wants to use the application tag space it will
first have to find out how much storage space is available.
Because tag space is generally limited (usually 2 bytes per
sector regardless of sector size), the integrity framework
supports interleaving the information between the sectors in an
I/O.
Filesystems can call bio_integrity_tag_size(bio) to find out how
many bytes of storage are available for that particular bio.
Another option is bdev_get_tag_size(block_device) which will
return the number of available bytes per hardware sector.
int bio_integrity_set_tag(bio, void *tag_buf, len);
After a successful return from bio_integrity_prep(),
bio_integrity_set_tag() can be used to attach an opaque tag
buffer to a bio. Obviously this only makes sense if the I/O is
a WRITE.
int bio_integrity_get_tag(bio, void *tag_buf, len);
Similarly, at READ I/O completion time the filesystem can
retrieve the tag buffer using bio_integrity_get_tag().
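
Taken together, a write path in an integrity-aware filesystem might look
roughly like the sketch below, which uses only the calls documented in
this section (error handling and the surrounding bio setup are elided;
return-value conventions are assumptions):

/* sketch: generate IMD and attach an opaque filesystem tag */
static void fs_submit_write(struct bio *bio, void *tag, unsigned int tag_len)
{
	/* direction, start sector and data pages are already set up */
	if (bdev_integrity_enabled(bio->bi_bdev, WRITE)) {
		bio_integrity_prep(bio);	/* generate IMD for the WRITE */
		if (tag_len <= bio_integrity_tag_size(bio))
			bio_integrity_set_tag(bio, tag, tag_len);
	}
	submit_bio(WRITE, bio);
}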
5.3 PASSING EXISTING INTEGRITY METADATA
Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicate how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).
The integrity payload will be freed at bio_free() time.
int bio_integrity_add_page(bio, page, len, offset);
Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.
Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).
Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.
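
A sketch of that flow (gfp flags and page bookkeeping are illustrative):

	/* attach pre-generated protection information to a bio */
	bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
	for (i = 0; i < nr_pages; i++)
		bio_integrity_add_page(bio, prot_pages[i], PAGE_SIZE, 0);
	/* the first reference tag must equal bip->bip_sector, and the
	 * pages may be modified in flight as sector numbers are remapped */
	submit_bio(WRITE, bio);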
5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
METADATA
To enable integrity exchange on a block device the gendisk must be
registered as capable:
int blk_integrity_register(gendisk, blk_integrity);
The blk_integrity struct is a template and should contain the
following:
static struct blk_integrity my_profile = {
	.name		= "STANDARDSBODY-TYPE-VARIANT-CSUM",
	.generate_fn	= my_generate_fn,
	.verify_fn	= my_verify_fn,
	.get_tag_fn	= my_get_tag_fn,
	.set_tag_fn	= my_set_tag_fn,
	.tuple_size	= sizeof(struct my_tuple_size),
	.tag_size	= <tag bytes per hw sector>,
};
'name' is a text string which will be visible in sysfs. This is
part of the userland API, so choose it carefully and never change
it. The format is standards body-type-variant,
e.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
'generate_fn' generates appropriate integrity metadata (for WRITE).
'verify_fn' verifies that the data buffer matches the integrity
metadata.
'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.
'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.
See 6.2 for a description of get_tag_fn and set_tag_fn.
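
Registration itself is then a single call at gendisk setup time
(sketch; 'disk' is whatever gendisk the driver owns):

	blk_integrity_register(disk, &my_profile);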
----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>


@@ -0,0 +1,75 @@
Deadline IO scheduler tunables
==============================
This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
********************************************************************************
read_expire (in ms)
-----------
The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.
write_expire (in ms)
-----------
Similar to read_expire mentioned above, but for writes.
fifo_batch (number of requests)
----------
Requests are grouped into ``batches'' of a particular data direction (read or
write) which are serviced in increasing sector order. To limit extra seeking,
deadline expiries are only checked between batches. fifo_batch controls the
maximum number of requests per batch.
This parameter tunes the balance between per-request latency and aggregate
throughput. When low latency is the primary concern, smaller is better (where
a value of 1 yields first-come first-served behaviour). Increasing fifo_batch
generally improves throughput, at the cost of latency variation.
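
For example, to favor latency on an interactive disk (sda is a
placeholder; 16 is the usual compiled-in default):

# cat /sys/block/sda/queue/iosched/fifo_batch
16
# echo 1 > /sys/block/sda/queue/iosched/fifo_batch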
writes_starved (number of dispatches)
--------------
When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.
front_merges (bool)
------------
Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some workloads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.
Nov 11 2002, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,183 @@
Block io priorities
===================
Intro
-----
With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities thus far.
Scheduling classes
------------------
CFQ implements three generic scheduling classes that determine how io is
served for a process.
IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is
given higher priority than any other in the system; processes from this
class are given first access to the disk every time. Thus it needs to be
used with some care: one io RT process can starve the entire system. Within
the RT class, there are 8 levels of class data that determine exactly how
much time this process needs the disk for on each service. In the future
this might change to be more directly mappable to performance, by passing
in a wanted data rate instead.
IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the
default for any process that hasn't set a specific io priority. The class
data determines how much io bandwidth the process will get; it's directly
mappable to the cpu nice levels, just more coarsely implemented. 0 is the
highest BE prio level, 7 is the lowest. The mapping between cpu nice level
and io nice level is determined as: io_nice = (cpu_nice + 20) / 5. For
example, the default cpu nice of 0 maps to io nice 4, and the weakest cpu
nice of 19 maps to io nice 7.
IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.
Tools
-----
See below for a sample ionice tool. Usage:
# ionice -c<class> -n<level> -p<pid>
If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:
# ionice -c2 -n0 /bin/ls
will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:
# ionice -c1 -n2 -p100
will change pid 100 to run at the realtime scheduling class, at priority 2.
---> snip ionice.c tool <---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#if defined(__i386__)
#define __NR_ioprio_set 289
#define __NR_ioprio_get 290
#elif defined(__ppc__)
#define __NR_ioprio_set 273
#define __NR_ioprio_get 274
#elif defined(__x86_64__)
#define __NR_ioprio_set 251
#define __NR_ioprio_get 252
#elif defined(__ia64__)
#define __NR_ioprio_set 1274
#define __NR_ioprio_get 1275
#else
#error "Unsupported arch"
#endif
static inline int ioprio_set(int which, int who, int ioprio)
{
	return syscall(__NR_ioprio_set, which, who, ioprio);
}

static inline int ioprio_get(int which, int who)
{
	return syscall(__NR_ioprio_get, which, who);
}
enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};
/* the scheduling class lives in the top bits of the priority word;
 * the class data (level) occupies the low bits */
#define IOPRIO_CLASS_SHIFT 13
const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };
int main(int argc, char *argv[])
{
	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
	int c, pid = 0;

	while ((c = getopt(argc, argv, "+n:c:p:")) != -1) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'c':
			ioprio_class = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtol(optarg, NULL, 10);
			break;
		}
	}

	switch (ioprio_class) {
	case IOPRIO_CLASS_NONE:
		ioprio_class = IOPRIO_CLASS_BE;
		break;
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
		break;
	case IOPRIO_CLASS_IDLE:
		/* the idle class has no class data; use the lowest level */
		ioprio = 7;
		break;
	default:
		printf("bad prio class %d\n", ioprio_class);
		return 1;
	}

	if (!set) {
		if (!pid && argv[optind])
			pid = strtol(argv[optind], NULL, 10);
		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);
		printf("pid=%d, %d\n", pid, ioprio);
		if (ioprio == -1)
			perror("ioprio_get");
		else {
			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
			ioprio = ioprio & 0xff;
			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
		}
	} else {
		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
			perror("ioprio_set");
			return 1;
		}
		if (argv[optind])
			execvp(argv[optind], &argv[optind]);
	}
	return 0;
}
---> snip ionice.c tool <---
March 11 2005, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,67 @@
Queue sysfs files
=================
This text file details the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions as a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
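
For example (sda is a placeholder device name):

# cat /sys/block/sda/queue/nomerges
0
# echo 2 > /sys/block/sda/queue/nomerges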
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
cpu "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion
processing setting this option to '2' forces the completion to run on the
requesting cpu (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
Jens Axboe <jens.axboe@oracle.com>, February 2009


@@ -0,0 +1,88 @@
struct request documentation
Jens Axboe <jens.axboe@oracle.com> 27/05/02
1.0 Index

2.0 Struct request members classification

	2.1 struct request members explanation

3.0
2.0
Short explanation of request members
Classification flags:

	D	driver member
	B	block layer member
	I	I/O scheduler member

Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
accessed through certain macros or functions (eg ->flags).
<linux/blkdev.h>
2.1
Member                          Flag    Comment
------                          ----    -------
struct list_head queuelist      BI      Organization on various internal
                                        queues
void *elevator_private          I       I/O scheduler private data
unsigned char cmd[16]           D       Driver can use this for setting up
                                        a cdb before execution, see
                                        blk_queue_prep_rq
unsigned long flags             DBI     Contains info about data direction,
                                        request type, etc.
int rq_status                   D       Request status bits
kdev_t rq_dev                   DBI     Target device
int errors                      DB      Error counts
sector_t sector                 DBI     Target location
sector_t hard_sector            B       Used to keep sector sane
unsigned long nr_sectors        DBI     Total number of sectors in request
unsigned long hard_nr_sectors   B       Used to keep nr_sectors sane
unsigned short nr_phys_segments DB      Number of physical scatter gather
                                        segments in a request
unsigned short nr_hw_segments   DB      Number of hardware scatter gather
                                        segments in a request
unsigned int current_nr_sectors DB      Number of sectors in first segment
                                        of request
unsigned int hard_cur_sectors   B       Used to keep current_nr_sectors sane
int tag                         DB      TCQ tag, if assigned
void *special                   D       Free to be used by driver
char *buffer                    D       Map of first segment, also see
                                        the section on bouncing
struct completion *waiting      D       Can be used by driver to get signalled
                                        on request completion
struct bio *bio                 DBI     First bio in request
struct bio *biotail             DBI     Last bio in request
struct request_queue *q         DB      Request queue this request belongs to
struct request_list *rl         B       Request list this request came from
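
As an illustration of how a driver of this era consumes these members,
here is a sketch against the 2.5-style queueing API (elv_next_request(),
end_request() and rq_data_dir() are assumed from <linux/blkdev.h> of that
period; my_hw_xfer() is hypothetical):

static void my_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = elv_next_request(q)) != NULL) {
		/* ->sector, ->current_nr_sectors and ->buffer are among the
		 * D-classified members a driver may touch directly */
		my_hw_xfer(rq->sector, rq->current_nr_sectors, rq->buffer,
			   rq_data_dir(rq));
		end_request(rq, 1);	/* 1 == success in this era */
	}
}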


@@ -0,0 +1,134 @@
Introduction
============
The ROW scheduling algorithm will be used in mobile devices as the default
block layer IO scheduling algorithm. ROW stands for "READ Over WRITE",
which is the main request dispatch policy of this algorithm.
The ROW IO scheduler was developed with the needs of mobile devices in
mind. In mobile devices we favor the user experience above everything else,
so we want to give READ IO requests as much priority as possible.
The main idea of the ROW scheduling policy is just that:
- If there are READ requests in the pipe, dispatch them, while keeping
  write starvation in check.
Software description
====================
The elevator defines a registering mechanism for different IO scheduler
to implement. This makes implementing a new algorithm quite straight
forward and requires almost no changes to block/elevator framework. A
new IO scheduler just has to implement a set of callback functions
defined by the elevator.
These callbacks cover all the required IO operations such as
adding/removing request to/from the scheduler, merging two requests,
dispatching a request etc.
Design
======
The requests are kept in queues according to their priority. The
dispatching of requests is done in a Round Robin manner with a
different slice for each queue. The dispatch quantum for a specific
queue is set according to the queues priority. READ queues are
given bigger dispatch quantum than the WRITE queues, within a dispatch
cycle.
At the moment there are 6 types of queues the requests are
distributed to:
- High priority READ queue
- High priority Synchronous WRITE queue
- Regular priority READ queue
- Regular priority Synchronous WRITE queue
- Regular priority WRITE queue
- Low priority READ queue
The marking of a request as high/low priority will be done by the
application adding the request, not by the scheduler. See the TODO section.
If the request is not marked in any way (high/low), the scheduler
assigns it to one of the regular priority queues:
read/write/sync write.
If in a certain dispatch cycle one of the queues was empty and didn't
use its quantum, that queue will be marked as "un-served". If we're in
the middle of a dispatch cycle dispatching from queue Y and a request
arrives for queue X that was un-served in the previous cycle, then if X's
priority is higher than Y's, queue Y will be preempted in favor of
queue X.
For READ request queues the ROW IO scheduler allows idling within a
dispatch quantum in order to give the application a chance to insert
more requests. Idling means adding some extra time for serving a
certain queue even if the queue is empty. Idling is enabled if the ROW
IO scheduler identifies that the application is inserting requests at a
high frequency.
Not all queues can idle. The ROW scheduler exposes an enablement struct
for idling.
For idling on READ queues, the ROW IO scheduler uses a timer mechanism:
when the timer expires, we schedule a delayed work item that signals the
device driver to fetch another request for dispatch.
The ROW scheduler will support additional services for block devices that
support Urgent Requests. That is, the scheduler may inform the
device driver of urgent requests using a newly defined callback.
In addition it will support rescheduling of requests that were
interrupted, for example when the device driver issues a long write
request and a sudden urgent request is received by the scheduler.
The scheduler will inform the device driver about the urgent request,
so the device driver can stop the current write request and serve the
urgent request. In such a case the device driver may also insert the
remainder of the interrupted write request back into the scheduler, so
that the scheduler may continue sending urgent requests without the
need to interrupt the ongoing write again and again. The write
remainder will be sent later on according to the scheduler policy.
SMP/multi-core
==============
At the moment the code is accessed from 2 contexts:
- Application context (from block/elevator layer): adding the requests.
- device driver thread: dispatching the requests and notifying on
completion.
One lock is used to synchronize between the two. This lock is provided
by the block device driver along with the dispatch queue.
Config options
==============
1. hp_read_quantum: dispatch quantum for the high priority READ queue
(default is 100 requests)
2. rp_read_quantum: dispatch quantum for the regular priority READ
queue (default is 100 requests)
3. hp_swrite_quantum: dispatch quantum for the high priority
   Synchronous WRITE queue (default is 2 requests)
4. rp_swrite_quantum: dispatch quantum for the regular priority
   Synchronous WRITE queue (default is 1 request)
5. rp_write_quantum: dispatch quantum for the regular priority WRITE
   queue (default is 1 request)
6. lp_read_quantum: dispatch quantum for the low priority READ queue
   (default is 1 request)
7. lp_swrite_quantum: dispatch quantum for the low priority Synchronous
   WRITE queue (default is 1 request)
8. read_idle: how long to idle on a read queue, in msec (in case idling
   is enabled on that queue; default is 5 msec)
9. read_idle_freq: the READ request insertion frequency that will
   trigger idling, expressed as the maximum time in msec between
   inserting two READ requests (default is 8 msec)
Note: the dispatch quantum is the number of requests that will be
dispatched from a certain queue in a dispatch cycle.
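
Assuming ROW exposes these knobs in the usual per-queue iosched
directory (mmcblk0 is a placeholder device and the output is the
documented default):

# echo 75 > /sys/block/mmcblk0/queue/iosched/hp_read_quantum
# cat /sys/block/mmcblk0/queue/iosched/read_idle
5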
To do
=====
The ROW algorithm takes the scheduling policy one step further, making
it a bit more "user-needs oriented", by allowing the application to
hint at the urgency of its requests. For example, even among the READ
requests, some may be more urgent to complete than others. The former
will go to the High priority READ queue, which is given a bigger
dispatch quantum than any other queue.
We still need to design the way applications will "hint" at the urgency
of their requests. It may be done via ioctl(); we need to look into
concrete use-cases in order to determine the best solution for this.
This will be implemented as a second phase.
Design and implement additional services for block devices that
support High Priority Requests.


@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name units description
---- ----- -----------
read I/Os requests number of read I/Os processed
read merges requests number of read I/Os merged with in-queue I/O
read sectors sectors number of sectors read
read ticks milliseconds total wait time for read requests
write I/Os requests number of write I/Os processed
write merges requests number of write I/Os merged with in-queue I/O
write sectors sectors number of sectors written
write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).
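
As a worked example of consuming this file, the sketch below reads the
11 fields in the order listed above and derives the average request
wait time (sda is a placeholder device name):

#include <stdio.h>

int main(void)
{
	unsigned long long v[11];	/* fields in the order of the table above */
	FILE *f = fopen("/sys/block/sda/stat", "r");
	int i;

	if (!f)
		return 1;
	for (i = 0; i < 11; i++)
		if (fscanf(f, "%llu", &v[i]) != 1)
			return 1;
	fclose(f);

	/* time_in_queue divided by completed I/Os gives the average wait */
	if (v[0] + v[4])
		printf("average wait: %llu ms\n", v[10] / (v[0] + v[4]));
	return 0;
}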


@@ -0,0 +1,37 @@
To choose an IO scheduler at boot time, use the kernel argument
'elevator=deadline'. 'noop' and 'cfq' (the default) are also available.
The boot-time argument applies globally; per-device switching at runtime
is described below.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:
/sys/block/<device>/queue/iosched
assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:
# mount none /sys -t sysfs
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq


@@ -0,0 +1,39 @@
Test IO scheduler
==================
The test scheduler allows testing a block device by dispatching
specific requests according to the test case and declaring PASS/FAIL
according to the requests' completion error codes.
The test IO scheduler implements the no-op scheduler operations, and uses
them in order to dispatch the non-test requests when no test is running.
This allows normal FS operation to continue in parallel with the test
capability.
The test IO scheduler keeps two different queues, one for real-world
requests (inserted by the FS) and the other for the test requests.
The test IO scheduler chooses the queue to dispatch requests from
according to the test state (IDLE/RUNNING).
The test IO scheduler is compiled by default as a dynamic module and is
enabled only if CONFIG_DEBUG_FS is defined.
Each block device test utility that would like to use the test-iosched test
services should register as a blk_dev_test_type and supply init and exit
callbacks. Those callbacks are called upon selection (or removal) of
test-iosched as the active scheduler. From that point the block device test
can start a test and supply its own callbacks for preparing, running, result
checking and cleanup of the test.
Each test is exposed via debugfs and can be triggered by writing to
the debugfs file. In order to add a new test one should expose a new debugfs
file for the new test.
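
A registration sketch might look like the following; the struct layout
and the test_iosched_register() helper are assumptions based on the
description above, not interfaces taken from this document:

/* sketch: hooking a block device test utility into test-iosched */
static void my_test_init(void)
{
	/* expose a debugfs file that triggers the test, per the text above */
}

static void my_test_exit(void)
{
	/* remove the debugfs file */
}

static struct blk_dev_test_type my_bdt = {
	.init_fn	= my_test_init,
	.exit_fn	= my_test_exit,
};

static int __init my_test_setup(void)
{
	return test_iosched_register(&my_bdt);
}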
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
May 10 2012, Maya Erez <merez@codeaurora.org>


@@ -0,0 +1,86 @@
Explicit volatile write back cache control
==========================================
Introduction
------------
Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.
The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.
Explicit cache flushes
----------------------
The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
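
For instance, a filesystem sync path could issue a pure cache flush like
this (a sketch; the three-argument form of blkdev_issue_flush() matches
kernels of this era, and sb->s_bdev is the filesystem's block device):

	/* flush the device's volatile cache, with no dependent data I/O */
	ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);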
Forced Unit Access
------------------
The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
Implementation details for filesystems
--------------------------------------
Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry about whether the underlying devices need any explicit cache flushing
or how Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
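
As a sketch, a journal commit block would typically be written with both
bits set via the WRITE_FLUSH_FUA shorthand (assumed to expand to
WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH | REQ_FUA in kernels of this era):

	/* flush previously completed writes, then write the commit block
	 * such that its completion implies it is on stable storage */
	submit_bio(WRITE_FLUSH_FUA, commit_bio);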
Implementation details for make_request_fn based block drivers
--------------------------------------------------------------
These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
Implementation details for request_fn based block drivers
--------------------------------------------------------------
For devices that do not support volatile write caches there is no driver
support required; the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of an empty REQ_FLUSH request followed by the actual write by the block
layer. For devices that also support the FUA bit the block layer needs
to be told to pass through the REQ_FUA bit using:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
and the driver must handle write requests that have the REQ_FUA bit set
in prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.