M7350v1_en_gpl

This commit is contained in:
T
2024-09-09 08:52:07 +00:00
commit f9cc65cfda
65988 changed files with 26357421 additions and 0 deletions


@@ -0,0 +1,18 @@
00-INDEX
- This file
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
- The members of struct request (in include/linux/blkdev.h)
stat.txt
- Block layer statistics in /sys/block/<dev>/stat
switching-sched.txt
- Switching I/O schedulers at runtime
writeback_cache_control.txt
- Control of volatile write back caches

[File diff suppressed because it is too large]


@@ -0,0 +1,15 @@
Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability.

capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h
Capability Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY 4
When this bit is set, the disk supports Asynchronous Notification
of media change events. These events will be broadcast to user
space via kernel uevent.
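
For illustration, a minimal user-space sketch (sda is a placeholder
device name) could read the capability file and test this bit:

#include <stdio.h>

#define GENHD_FL_MEDIA_CHANGE_NOTIFY	4

int main(void)
{
	unsigned int cap;
	FILE *f = fopen("/sys/block/sda/capability", "r");

	if (!f) {
		perror("open");
		return 1;
	}
	if (fscanf(f, "%x", &cap) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("media change notification: %s\n",
	       (cap & GENHD_FL_MEDIA_CHANGE_NOTIFY) ? "supported" : "not supported");
	return 0;
}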


@@ -0,0 +1,116 @@
CFQ ioscheduler tunables
========================
slice_idle
----------
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.
By default slice_idle is a non-zero value, which means that by default we
idle on queues/service trees. This can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where we can cut down on the overall
number of seeks and see improved throughput.
Setting slice_idle to 0 will remove all idling at the queue/service tree
level, and one should see overall improved throughput on faster storage
devices like multiple SATA/SAS disks in a hardware RAID configuration. The
downside is that the isolation provided from WRITES also goes down and the
notion of IO priority becomes weaker.
So depending on storage and workload, it might be useful to set slice_idle=0.
In general, for SATA/SAS disks and software RAID of SATA/SAS disks, keeping
slice_idle enabled should be useful. For any configuration where there are
multiple spindles behind a single LUN (host-based hardware RAID controller
or storage arrays), setting slice_idle=0 might end up giving better
throughput and acceptable latencies.
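
For example, on a CFQ-managed disk (sda is a placeholder; the value 8,
in milliseconds, is what CFQ commonly ships with as the default):

# cat /sys/block/sda/queue/iosched/slice_idle
8
# echo 0 > /sys/block/sda/queue/iosched/slice_idle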
CFQ IOPS Mode for group scheduling
===================================
Basic CFQ design is to provide priority-based time slices: a higher priority
process gets a bigger time slice and a lower priority process gets a smaller
time slice. Measuring time becomes harder if the storage is fast and supports
NCQ, where it is better to dispatch multiple requests from multiple cfq
queues into the request queue at a time. In such a scenario, it is not
possible to accurately measure the time consumed by a single queue.
What is possible, though, is to measure the number of requests dispatched
from a single queue while also allowing dispatch from multiple cfq queues at
the same time. This effectively becomes fairness in terms of IOPS (IO
operations per second).
If one sets slice_idle=0 and the storage supports NCQ, CFQ internally
switches to IOPS mode and starts providing fairness in terms of the number
of requests dispatched. Note that this mode switch takes effect only for
group scheduling. For non-cgroup users nothing should change.
CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to arrive
on the same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.
The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read is issued only after completion of the
previous one), then not dispatching requests from another queue should help,
as we did not move the disk head and kept on dispatching sequential IO from
one queue.
CFQ has the following service trees, and the various queues are put on these trees:
sync-idle sync-noidle async
All cfq queues doing synchronous sequential IO go onto the sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go onto the sync-noidle tree. Any
requests marked with REQ_NOIDLE also go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch,
we will idle only once the last queue has dispatched its IO and there is
no more IO on this service tree.
All async writes go on the async service tree. There is no idling on async
queues.
CFQ has some optimizations for SSDs: if it detects non-rotational media
which can support a higher queue depth (multiple requests in flight at a
time), then it cuts down on the idling of individual queues; all the queues
move to the sync-noidle tree and only tree idling remains. This tree idling
provides isolation from the buffered write queues on the async tree.
FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?
A1. We only do tree idling (across all queues on the sync-noidle tree) for
    queues marked with REQ_NOIDLE. This helps in providing isolation from
    all the sync-idle queues. Otherwise, in the presence of many sequential
    readers, other synchronous IO might not get a fair share of the disk.
    For example, if there are 10 sequential readers doing IO and they get
    100ms each, a newly arrived REQ_NOIDLE request will be scheduled
    roughly 1 second later. If, after completion of the REQ_NOIDLE request,
    we do not idle and a couple of milliseconds later another REQ_NOIDLE
    request comes in, it will again be scheduled roughly 1 second later.
    Repeat this and notice how a workload can lose its disk share and
    suffer due to multiple sequential readers.
    fsync can generate dependent IO where a bunch of data is written in the
    context of fsync, and later some journaling data is written. The
    journaling data comes in only after fsync has finished its IO (at least
    for ext4 that seemed to be the case). Now if one decides not to idle on
    the fsync thread due to REQ_NOIDLE, then the next journaling write will
    not get scheduled for another second. A process doing small fsyncs will
    suffer badly in the presence of multiple sequential readers.
    Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
    provides isolation from multiple sequential readers while at the same
    time we do not idle on individual threads.
Q2. When to specify REQ_NOIDLE?
A2. Whenever one is doing a synchronous write and is not expecting more
    writes to be dispatched from the same context soon, one should be able
    to specify REQ_NOIDLE on the writes, and that should work well for
    most cases.
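
    As a sketch of A2 in kernel code: in kernels of this vintage the
    WRITE_SYNC request flags already include REQ_NOIDLE (an assumption to
    verify against include/linux/fs.h of the tree at hand), so a one-off
    synchronous write would simply be submitted as:

	/* one-off synchronous write; CFQ will only tree-idle for it
	 * because WRITE_SYNC carries REQ_NOIDLE in kernels of this era */
	submit_bio(WRITE_SYNC, bio);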


@@ -0,0 +1,327 @@
----------------------------------------------------------------------
1. INTRODUCTION
Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.
The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family of
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. For
some protection schemes it also ensures that the I/O is written to the
right place on disk.
Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.
----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS
As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.
The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.
Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.
The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.
----------------------------------------------------------------------
3. KERNEL CHANGES
The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.
The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.
Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.
However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.
The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.
Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as it sees fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.
----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS
4.1 BIO
The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
to a struct bip which contains the bio integrity payload. Essentially
a bip is a trimmed down struct bio which holds a bio_vec containing
the integrity metadata and the required housekeeping information (bvec
pool, vector count, etc.)
A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.
Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().
bio_free() will automatically free the bip.
4.2 BLOCK DEVICE
Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().
The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.
----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API
5.1 NORMAL FILESYSTEM
The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.
IMD generation and verification can be toggled using the
/sys/block/<bdev>/integrity/write_generate
and
/sys/block/<bdev>/integrity/read_verify
flags.
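
For example (sda is a placeholder device name; the output shown is
illustrative):

# cat /sys/block/sda/integrity/write_generate
1
# echo 0 > /sys/block/sda/integrity/write_generate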
5.2 INTEGRITY-AWARE FILESYSTEM
A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.
int bdev_integrity_enabled(block_device, int rw);
bdev_integrity_enabled() will return 1 if the block device
supports integrity metadata transfer for the data direction
specified in 'rw'.
bdev_integrity_enabled() honors the write_generate and
read_verify flags in sysfs and will respond accordingly.
int bio_integrity_prep(bio);
To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).
Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
bio_integrity_prep() should only be called if
bdev_integrity_enabled() returned 1.
int bio_integrity_tag_size(bio);
If the filesystem wants to use the application tag space it will
first have to find out how much storage space is available.
Because tag space is generally limited (usually 2 bytes per
sector regardless of sector size), the integrity framework
supports interleaving the information between the sectors in an
I/O.
Filesystems can call bio_integrity_tag_size(bio) to find out how
many bytes of storage are available for that particular bio.
Another option is bdev_get_tag_size(block_device) which will
return the number of available bytes per hardware sector.
int bio_integrity_set_tag(bio, void *tag_buf, len);
After a successful return from bio_integrity_prep(),
bio_integrity_set_tag() can be used to attach an opaque tag
buffer to a bio. Obviously this only makes sense if the I/O is
a WRITE.
int bio_integrity_get_tag(bio, void *tag_buf, len);
Similarly, at READ I/O completion time the filesystem can
retrieve the tag buffer using bio_integrity_get_tag().
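
Taken together, a write path in an integrity-aware filesystem might look
roughly like the sketch below, which uses only the calls documented in
this section (error handling and the surrounding bio setup are elided;
return-value conventions are assumptions):

/* sketch: generate IMD and attach an opaque filesystem tag */
static void fs_submit_write(struct bio *bio, void *tag, unsigned int tag_len)
{
	/* direction, start sector and data pages are already set up */
	if (bdev_integrity_enabled(bio->bi_bdev, WRITE)) {
		bio_integrity_prep(bio);	/* generate IMD for the WRITE */
		if (tag_len <= bio_integrity_tag_size(bio))
			bio_integrity_set_tag(bio, tag, tag_len);
	}
	submit_bio(WRITE, bio);
}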
5.3 PASSING EXISTING INTEGRITY METADATA
Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicate how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).
The integrity payload will be freed at bio_free() time.
int bio_integrity_add_page(bio, page, len, offset);
Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.
Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).
Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.
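
A sketch of that flow (gfp flags and page bookkeeping are illustrative):

	/* attach pre-generated protection information to a bio */
	bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
	for (i = 0; i < nr_pages; i++)
		bio_integrity_add_page(bio, prot_pages[i], PAGE_SIZE, 0);
	/* the first reference tag must equal bip->bip_sector, and the
	 * pages may be modified in flight as sector numbers are remapped */
	submit_bio(WRITE, bio);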
5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
METADATA
To enable integrity exchange on a block device the gendisk must be
registered as capable:
int blk_integrity_register(gendisk, blk_integrity);
The blk_integrity struct is a template and should contain the
following:
static struct blk_integrity my_profile = {
	.name		= "STANDARDSBODY-TYPE-VARIANT-CSUM",
	.generate_fn	= my_generate_fn,
	.verify_fn	= my_verify_fn,
	.get_tag_fn	= my_get_tag_fn,
	.set_tag_fn	= my_set_tag_fn,
	.tuple_size	= sizeof(struct my_tuple_size),
	.tag_size	= <tag bytes per hw sector>,
};
'name' is a text string which will be visible in sysfs. This is
part of the userland API, so choose it carefully and never change
it. The format is standards body-type-variant,
e.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
'generate_fn' generates appropriate integrity metadata (for WRITE).
'verify_fn' verifies that the data buffer matches the integrity
metadata.
'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.
'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.
See 6.2 for a description of get_tag_fn and set_tag_fn.
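
Registration itself is then a single call at gendisk setup time
(sketch; 'disk' is whatever gendisk the driver owns):

	blk_integrity_register(disk, &my_profile);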
----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>


@@ -0,0 +1,75 @@
Deadline IO scheduler tunables
==============================
This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
********************************************************************************
read_expire (in ms)
-----------
The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.
write_expire (in ms)
-----------
Similar to read_expire mentioned above, but for writes.
fifo_batch (number of requests)
----------
Requests are grouped into ``batches'' of a particular data direction (read or
write) which are serviced in increasing sector order. To limit extra seeking,
deadline expiries are only checked between batches. fifo_batch controls the
maximum number of requests per batch.
This parameter tunes the balance between per-request latency and aggregate
throughput. When low latency is the primary concern, smaller is better (where
a value of 1 yields first-come first-served behaviour). Increasing fifo_batch
generally improves throughput, at the cost of latency variation.
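
For example, to favor latency on an interactive disk (sda is a
placeholder; 16 is the usual compiled-in default):

# cat /sys/block/sda/queue/iosched/fifo_batch
16
# echo 1 > /sys/block/sda/queue/iosched/fifo_batch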
writes_starved (number of dispatches)
--------------
When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.
front_merges (bool)
------------
Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some workloads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.
Nov 11 2002, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,183 @@
Block io priorities
===================
Intro
-----
With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities thus far.
Scheduling classes
------------------
CFQ implements three generic scheduling classes that determine how io is
served for a process.
IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is
given higher priority than any other in the system; processes from this
class are given first access to the disk every time. Thus it needs to be
used with some care: one io RT process can starve the entire system. Within
the RT class, there are 8 levels of class data that determine exactly how
much time this process needs the disk for on each service. In the future
this might change to be more directly mappable to performance, by passing
in a wanted data rate instead.
IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the
default for any process that hasn't set a specific io priority. The class
data determines how much io bandwidth the process will get; it's directly
mappable to the cpu nice levels, just more coarsely implemented. 0 is the
highest BE prio level, 7 is the lowest. The mapping between cpu nice level
and io nice level is determined as: io_nice = (cpu_nice + 20) / 5. For
example, the default cpu nice of 0 maps to io nice 4, and the weakest cpu
nice of 19 maps to io nice 7.
IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.
Tools
-----
See below for a sample ionice tool. Usage:
# ionice -c<class> -n<level> -p<pid>
If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:
# ionice -c2 -n0 /bin/ls
will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:
# ionice -c1 -n2 -p100
will change pid 100 to run at the realtime scheduling class, at priority 2.
---> snip ionice.c tool <---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/unistd.h>
#if defined(__i386__)
#define __NR_ioprio_set 289
#define __NR_ioprio_get 290
#elif defined(__ppc__)
#define __NR_ioprio_set 273
#define __NR_ioprio_get 274
#elif defined(__x86_64__)
#define __NR_ioprio_set 251
#define __NR_ioprio_get 252
#elif defined(__ia64__)
#define __NR_ioprio_set 1274
#define __NR_ioprio_get 1275
#else
#error "Unsupported arch"
#endif
static inline int ioprio_set(int which, int who, int ioprio)
{
	return syscall(__NR_ioprio_set, which, who, ioprio);
}

static inline int ioprio_get(int which, int who)
{
	return syscall(__NR_ioprio_get, which, who);
}
enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};
/* the scheduling class lives in the top bits of the priority word;
 * the class data (level) occupies the low bits */
#define IOPRIO_CLASS_SHIFT 13
const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };
int main(int argc, char *argv[])
{
	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
	int c, pid = 0;

	while ((c = getopt(argc, argv, "+n:c:p:")) != -1) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'c':
			ioprio_class = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtol(optarg, NULL, 10);
			break;
		}
	}

	switch (ioprio_class) {
	case IOPRIO_CLASS_NONE:
		ioprio_class = IOPRIO_CLASS_BE;
		break;
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
		break;
	case IOPRIO_CLASS_IDLE:
		/* the idle class has no class data; use the lowest level */
		ioprio = 7;
		break;
	default:
		printf("bad prio class %d\n", ioprio_class);
		return 1;
	}

	if (!set) {
		if (!pid && argv[optind])
			pid = strtol(argv[optind], NULL, 10);
		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);
		printf("pid=%d, %d\n", pid, ioprio);
		if (ioprio == -1)
			perror("ioprio_get");
		else {
			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
			ioprio = ioprio & 0xff;
			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
		}
	} else {
		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
			perror("ioprio_set");
			return 1;
		}
		if (argv[optind])
			execvp(argv[optind], &argv[optind]);
	}
	return 0;
}
---> snip ionice.c tool <---
March 11 2005, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,67 @@
Queue sysfs files
=================
This text file details the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions as a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
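
For example (sda is a placeholder device name):

# cat /sys/block/sda/queue/nomerges
0
# echo 2 > /sys/block/sda/queue/nomerges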
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
cpu "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion
processing setting this option to '2' forces the completion to run on the
requesting cpu (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
Jens Axboe <jens.axboe@oracle.com>, February 2009


@@ -0,0 +1,88 @@
struct request documentation
Jens Axboe <jens.axboe@oracle.com> 27/05/02
1.0 Index

2.0 Struct request members classification

	2.1 struct request members explanation

3.0
2.0
Short explanation of request members
Classification flags:

	D	driver member
	B	block layer member
	I	I/O scheduler member

Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
accessed through certain macros or functions (eg ->flags).
<linux/blkdev.h>
2.1
Member                          Flag    Comment
------                          ----    -------
struct list_head queuelist      BI      Organization on various internal
                                        queues
void *elevator_private          I       I/O scheduler private data
unsigned char cmd[16]           D       Driver can use this for setting up
                                        a cdb before execution, see
                                        blk_queue_prep_rq
unsigned long flags             DBI     Contains info about data direction,
                                        request type, etc.
int rq_status                   D       Request status bits
kdev_t rq_dev                   DBI     Target device
int errors                      DB      Error counts
sector_t sector                 DBI     Target location
sector_t hard_sector            B       Used to keep sector sane
unsigned long nr_sectors        DBI     Total number of sectors in request
unsigned long hard_nr_sectors   B       Used to keep nr_sectors sane
unsigned short nr_phys_segments DB      Number of physical scatter gather
                                        segments in a request
unsigned short nr_hw_segments   DB      Number of hardware scatter gather
                                        segments in a request
unsigned int current_nr_sectors DB      Number of sectors in first segment
                                        of request
unsigned int hard_cur_sectors   B       Used to keep current_nr_sectors sane
int tag                         DB      TCQ tag, if assigned
void *special                   D       Free to be used by driver
char *buffer                    D       Map of first segment, also see
                                        the section on bouncing
struct completion *waiting      D       Can be used by driver to get signalled
                                        on request completion
struct bio *bio                 DBI     First bio in request
struct bio *biotail             DBI     Last bio in request
struct request_queue *q         DB      Request queue this request belongs to
struct request_list *rl         B       Request list this request came from
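
As an illustration of how a driver of this era consumes these members,
here is a sketch against the 2.5-style queueing API (elv_next_request(),
end_request() and rq_data_dir() are assumed from <linux/blkdev.h> of that
period; my_hw_xfer() is hypothetical):

static void my_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = elv_next_request(q)) != NULL) {
		/* ->sector, ->current_nr_sectors and ->buffer are among the
		 * D-classified members a driver may touch directly */
		my_hw_xfer(rq->sector, rq->current_nr_sectors, rq->buffer,
			   rq_data_dir(rq));
		end_request(rq, 1);	/* 1 == success in this era */
	}
}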


@@ -0,0 +1,134 @@
Introduction
============
The ROW scheduling algorithm will be used in mobile devices as the default
block layer IO scheduling algorithm. ROW stands for "READ Over WRITE",
which is the main request dispatch policy of this algorithm.
The ROW IO scheduler was developed with the needs of mobile devices in
mind. In mobile devices we favor the user experience above everything else,
so we want to give READ IO requests as much priority as possible.
The main idea of the ROW scheduling policy is just that:
- If there are READ requests in the pipe, dispatch them, while keeping
  write starvation in check.
Software description
====================
The elevator defines a registering mechanism for different IO scheduler
to implement. This makes implementing a new algorithm quite straight
forward and requires almost no changes to block/elevator framework. A
new IO scheduler just has to implement a set of callback functions
defined by the elevator.
These callbacks cover all the required IO operations such as
adding/removing request to/from the scheduler, merging two requests,
dispatching a request etc.
Design
======
The requests are kept in queues according to their priority. The
dispatching of requests is done in a Round Robin manner with a
different slice for each queue. The dispatch quantum for a specific
queue is set according to the queues priority. READ queues are
given bigger dispatch quantum than the WRITE queues, within a dispatch
cycle.
At the moment there are 6 types of queues the requests are
distributed to:
- High priority READ queue
- High priority Synchronous WRITE queue
- Regular priority READ queue
- Regular priority Synchronous WRITE queue
- Regular priority WRITE queue
- Low priority READ queue
The marking of a request as high/low priority will be done by the
application adding the request, not by the scheduler. See the TODO section.
If the request is not marked in any way (high/low), the scheduler
assigns it to one of the regular priority queues:
read/write/sync write.
If in a certain dispatch cycle one of the queues was empty and didn't
use its quantum, that queue will be marked as "un-served". If we're in
the middle of a dispatch cycle dispatching from queue Y and a request
arrives for queue X that was un-served in the previous cycle, then if X's
priority is higher than Y's, queue Y will be preempted in favor of
queue X.
For READ request queues the ROW IO scheduler allows idling within a
dispatch quantum in order to give the application a chance to insert
more requests. Idling means adding some extra time for serving a
certain queue even if the queue is empty. Idling is enabled if the ROW
IO scheduler identifies that the application is inserting requests at a
high frequency.
Not all queues can idle. The ROW scheduler exposes an enablement struct
for idling.
For idling on READ queues, the ROW IO scheduler uses a timer mechanism:
when the timer expires, we schedule a delayed work item that signals the
device driver to fetch another request for dispatch.
The ROW scheduler will support additional services for block devices that
support Urgent Requests. That is, the scheduler may inform the
device driver of urgent requests using a newly defined callback.
In addition it will support rescheduling of requests that were
interrupted, for example when the device driver issues a long write
request and a sudden urgent request is received by the scheduler.
The scheduler will inform the device driver about the urgent request,
so the device driver can stop the current write request and serve the
urgent request. In such a case the device driver may also insert the
remainder of the interrupted write request back into the scheduler, so
that the scheduler may continue sending urgent requests without the
need to interrupt the ongoing write again and again. The write
remainder will be sent later on according to the scheduler policy.
SMP/multi-core
==============
At the moment the code is accessed from 2 contexts:
- Application context (from block/elevator layer): adding the requests.
- device driver thread: dispatching the requests and notifying on
completion.
One lock is used to synchronize between the two. This lock is provided
by the block device driver along with the dispatch queue.
Config options
==============
1. hp_read_quantum: dispatch quantum for the high priority READ queue
(default is 100 requests)
2. rp_read_quantum: dispatch quantum for the regular priority READ
queue (default is 100 requests)
3. hp_swrite_quantum: dispatch quantum for the high priority
   Synchronous WRITE queue (default is 2 requests)
4. rp_swrite_quantum: dispatch quantum for the regular priority
   Synchronous WRITE queue (default is 1 request)
5. rp_write_quantum: dispatch quantum for the regular priority WRITE
   queue (default is 1 request)
6. lp_read_quantum: dispatch quantum for the low priority READ queue
   (default is 1 request)
7. lp_swrite_quantum: dispatch quantum for the low priority Synchronous
   WRITE queue (default is 1 request)
8. read_idle: how long to idle on a read queue, in msec (in case idling
   is enabled on that queue; default is 5 msec)
9. read_idle_freq: the READ request insertion frequency that will
   trigger idling, expressed as the maximum time in msec between
   inserting two READ requests (default is 8 msec)
Note: the dispatch quantum is the number of requests that will be
dispatched from a certain queue in a dispatch cycle.
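
Assuming ROW exposes these knobs in the usual per-queue iosched
directory (mmcblk0 is a placeholder device and the output is the
documented default):

# echo 75 > /sys/block/mmcblk0/queue/iosched/hp_read_quantum
# cat /sys/block/mmcblk0/queue/iosched/read_idle
5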
To do
=====
The ROW algorithm takes the scheduling policy one step further, making
it a bit more "user-needs oriented", by allowing the application to
hint at the urgency of its requests. For example, even among the READ
requests, some may be more urgent to complete than others. The former
will go to the High priority READ queue, which is given a bigger
dispatch quantum than any other queue.
We still need to design the way applications will "hint" at the urgency
of their requests. It may be done via ioctl(); we need to look into
concrete use-cases in order to determine the best solution for this.
This will be implemented as a second phase.
Design and implement additional services for block devices that
support High Priority Requests.


@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name units description
---- ----- -----------
read I/Os requests number of read I/Os processed
read merges requests number of read I/Os merged with in-queue I/O
read sectors sectors number of sectors read
read ticks milliseconds total wait time for read requests
write I/Os requests number of write I/Os processed
write merges requests number of write I/Os merged with in-queue I/O
write sectors sectors number of sectors written
write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).
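
As a worked example of consuming this file, the sketch below reads the
11 fields in the order listed above and derives the average request
wait time (sda is a placeholder device name):

#include <stdio.h>

int main(void)
{
	unsigned long long v[11];	/* fields in the order of the table above */
	FILE *f = fopen("/sys/block/sda/stat", "r");
	int i;

	if (!f)
		return 1;
	for (i = 0; i < 11; i++)
		if (fscanf(f, "%llu", &v[i]) != 1)
			return 1;
	fclose(f);

	/* time_in_queue divided by completed I/Os gives the average wait */
	if (v[0] + v[4])
		printf("average wait: %llu ms\n", v[10] / (v[0] + v[4]));
	return 0;
}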


@@ -0,0 +1,37 @@
To choose an IO scheduler at boot time, use the kernel argument
'elevator=deadline'. 'noop' and 'cfq' (the default) are also available.
The boot-time argument applies globally; per-device switching at runtime
is described below.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:
/sys/block/<device>/queue/iosched
assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:
# mount none /sys -t sysfs
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq


@@ -0,0 +1,39 @@
Test IO scheduler
==================
The test scheduler allows testing a block device by dispatching
specific requests according to the test case and declaring PASS/FAIL
according to the requests' completion error codes.
The test IO scheduler implements the no-op scheduler operations, and uses
them in order to dispatch the non-test requests when no test is running.
This allows normal FS operation to continue in parallel with the test
capability.
The test IO scheduler keeps two different queues, one for real-world
requests (inserted by the FS) and the other for the test requests.
The test IO scheduler chooses the queue to dispatch requests from
according to the test state (IDLE/RUNNING).
The test IO scheduler is compiled by default as a dynamic module and is
enabled only if CONFIG_DEBUG_FS is defined.
Each block device test utility that would like to use the test-iosched test
services should register as a blk_dev_test_type and supply init and exit
callbacks. Those callbacks are called upon selection (or removal) of
test-iosched as the active scheduler. From that point the block device test
can start a test and supply its own callbacks for preparing, running, result
checking and cleanup of the test.
Each test is exposed via debugfs and can be triggered by writing to
the debugfs file. In order to add a new test one should expose a new debugfs
file for the new test.
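
A registration sketch might look like the following; the struct layout
and the test_iosched_register() helper are assumptions based on the
description above, not interfaces taken from this document:

/* sketch: hooking a block device test utility into test-iosched */
static void my_test_init(void)
{
	/* expose a debugfs file that triggers the test, per the text above */
}

static void my_test_exit(void)
{
	/* remove the debugfs file */
}

static struct blk_dev_test_type my_bdt = {
	.init_fn	= my_test_init,
	.exit_fn	= my_test_exit,
};

static int __init my_test_setup(void)
{
	return test_iosched_register(&my_bdt);
}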
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
May 10 2012, Maya Erez <merez@codeaurora.org>


@@ -0,0 +1,86 @@
Explicit volatile write back cache control
==========================================
Introduction
------------
Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.
The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.
Explicit cache flushes
----------------------
The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
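
For instance, a filesystem sync path could issue a pure cache flush like
this (a sketch; the three-argument form of blkdev_issue_flush() matches
kernels of this era, and sb->s_bdev is the filesystem's block device):

	/* flush the device's volatile cache, with no dependent data I/O */
	ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);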
Forced Unit Access
------------------
The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
Implementation details for filesystems
--------------------------------------
Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry about whether the underlying devices need any explicit cache flushing
or how Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
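
As a sketch, a journal commit block would typically be written with both
bits set via the WRITE_FLUSH_FUA shorthand (assumed to expand to
WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH | REQ_FUA in kernels of this era):

	/* flush previously completed writes, then write the commit block
	 * such that its completion implies it is on stable storage */
	submit_bio(WRITE_FLUSH_FUA, commit_bio);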
Implementation details for make_request_fn based block drivers
--------------------------------------------------------------
These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
Implementation details for request_fn based block drivers
--------------------------------------------------------------
For devices that do not support volatile write caches there is no driver
support required; the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of an empty REQ_FLUSH request followed by the actual write by the block
layer. For devices that also support the FUA bit the block layer needs
to be told to pass through the REQ_FUA bit using:
blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
and the driver must handle write requests that have the REQ_FUA bit set
in prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.