M7350v1_en_gpl

commit f9cc65cfda
65988 changed files with 26357421 additions and 0 deletions
@@ -0,0 +1,22 @@
00-INDEX
- this file (nfs-related documentation).
Exporting
- explanation of how to make filesystems exportable.
fault_injection.txt
- information for using fault injection on the server.
knfsd-stats.txt
- statistics which the NFS server makes available to user space.
nfs.txt
- nfs client, and DNS resolution for fs_locations.
nfs41-server.txt
- info on the Linux server implementation of NFSv4 minor version 1.
nfs-rdma.txt
- how to install and set up the Linux NFS/RDMA client and server software.
nfsroot.txt
- short guide on setting up a diskless box with NFS root filesystem.
pnfs.txt
- short explanation of some of the internals of the pnfs client code.
rpc-cache.txt
- introduction to the caching mechanisms in the sunrpc layer.
idmapper.txt
- information for configuring request-keys to be used by idmapper.
@@ -0,0 +1,154 @@
Making Filesystems Exportable
=============================
Overview
--------
All filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.
The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".
Dcache Issues
-------------
The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).
However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the
proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as
it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
prefix.
b/ A per-superblock list "s_anon" of dentries which are the roots of
subtrees that are not in the proper prefix. These dentries, as
well as the proper prefix, need to be released at unmount time. As
these dentries will not be hashed, they are linked together on the
d_hash list_head.
c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
d_alloc_anon(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry
can ever be attached.
d_splice_alias(inode, dentry) will make sure that there is a
dentry with the same name and parent as the given dentry, and
which refers to the given inode.
If the inode is a directory and already has a dentry, then that
dentry is d_moved over the given dentry.
If the passed dentry gets attached, care is taken that this is
mutually exclusive to a d_alloc_anon operation.
If the passed dentry is used, NULL is returned, else the used
dentry is returned. This corresponds to the calling pattern of
->lookup.
Filesystem Issues
-----------------
For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name.
If inode is NULL, d_splice_alias(inode, dentry) is equivalent to
d_add(dentry, inode), NULL
Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
Typically the ->lookup routine will simply end with a:
return d_splice_alias(inode, dentry);
}
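Fleshed out, a complete ->lookup for an illustrative filesystem might look
like the sketch below. The filesystem name and the exfs_find_inode() helper
are invented for the example, and the exact ->lookup prototype varies
between kernel versions:

    static struct dentry *exfs_lookup(struct inode *dir, struct dentry *dentry,
                                      unsigned int flags)
    {
            struct inode *inode;

            /* Hypothetical helper: returns the inode for the name, NULL if
             * the name does not exist, or an ERR_PTR() on error. */
            inode = exfs_find_inode(dir, &dentry->d_name);

            /* d_splice_alias() handles all three cases: a NULL inode acts
             * like d_add(dentry, NULL), an ERR_PTR is propagated, and an
             * existing (possibly disconnected) directory dentry for the
             * inode is reused and moved into place. */
            return d_splice_alias(inode, dentry);
    }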
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
encode_fh (optional)
Takes a dentry and creates a filehandle fragment which can later be used
to find or create a dentry for the same object. The default
implementation creates a filehandle fragment that encodes a 32-bit inode
and generation number for the inode encoded, and if necessary the
same information for the parent.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
create a dentry for it (possibly with d_alloc_anon).
fh_to_parent (optional but strongly recommended)
Given a filehandle fragment, this should find the parent of the
implied object and create a dentry for it (possibly with d_alloc_anon).
May fail if the filehandle fragment is too small.
get_parent (optional but strongly recommended)
When given a dentry for a directory, this should return a dentry for
the parent. Quite possibly the parent dentry will have been allocated
by d_alloc_anon. The default get_parent function just returns an error
so any filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".." entries
in the dcache which are too messy to work with.
get_name (optional)
When given a parent dentry and a child dentry, this should find a name
in the directory identified by the parent dentry, which leads to the
object identified by the child dentry. If no get_name function is
supplied, a default implementation is provided which uses vfs_readdir
to find potential names, and matches inode numbers to find the correct
match.
A filehandle fragment consists of an array of one or more 4-byte words,
together with a one byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it. This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates to the decode_fh routine how much of the filehandle is valid,
and how it should be interpreted.
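To make the above concrete, here is a sketch of what a filesystem using the
default inode-number-plus-generation scheme might provide. All names are
invented, exfs_iget() is a hypothetical helper, and the callback prototypes
shown here have changed across kernel versions:

    #define EXFS_FH_TYPE_INO_GEN    1       /* our filehandle "type" */

    static int exfs_encode_fh(struct dentry *de, __u32 *fh, int *max_len,
                              int connectable)
    {
            struct inode *inode = de->d_inode;

            if (*max_len < 2)
                    return 255;     /* fragment does not fit in the buffer */
            fh[0] = inode->i_ino;
            fh[1] = inode->i_generation;
            *max_len = 2;
            return EXFS_FH_TYPE_INO_GEN;
    }

    static struct dentry *exfs_fh_to_dentry(struct super_block *sb,
                                            struct fid *fid,
                                            int fh_len, int fh_type)
    {
            struct inode *inode;

            /* The "type" tells us how much of the (possibly padded)
             * filehandle to trust. */
            if (fh_type != EXFS_FH_TYPE_INO_GEN || fh_len < 2)
                    return NULL;

            /* Hypothetical helper: look up by inode number and verify
             * the generation. */
            inode = exfs_iget(sb, fid->raw[0], fid->raw[1]);
            if (IS_ERR(inode))
                    return ERR_CAST(inode);
            return d_alloc_anon(inode);     /* newer kernels: d_obtain_alias() */
    }

    static const struct export_operations exfs_export_ops = {
            .encode_fh    = exfs_encode_fh,
            .fh_to_dentry = exfs_fh_to_dentry,
    };

The filesystem would then set sb->s_export_op = &exfs_export_ops when filling
in its super_block.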
@@ -0,0 +1,69 @@
Fault Injection
===============
Fault injection is a method for forcing errors that may not normally occur, or
may be difficult to reproduce. Forcing these errors in a controlled environment
can help the developer find and fix bugs before their code is shipped in a
production system. Injecting an error on the Linux NFS server will allow us to
observe how the client reacts and if it manages to recover its state correctly.
NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this
feature.
Using Fault Injection
=====================
On the client, mount the fault injection server through NFS v4.0+ and do some
work over NFS (open files, take locks, ...).
On the server, mount the debugfs filesystem to <debug_dir> and ls
<debug_dir>/nfsd. This will show a list of files that will be used for
injecting faults on the NFS server. As root, write a number n to the file
corresponding to the action you want the server to take. The server will then
process the first n items it finds. So if you want to forget 5 locks, echo '5'
to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget
all corresponding items. A log message will be created containing the number
of items forgotten (check dmesg).
Go back to work on the client and check if the client recovered from the error
correctly.
Available Faults
================
forget_clients:
The NFS server keeps a list of clients that have placed a mount call. If
this list is cleared, the server will have no knowledge of who the client
is, forcing the client to reauthenticate with the server.
forget_openowners:
The NFS server keeps a list of what files are currently opened and who
they were opened by. Clearing this list will force the client to reopen
its files.
forget_locks:
The NFS server keeps a list of what files are currently locked in the VFS.
Clearing this list will force the client to reclaim its locks (files are
unlocked through the VFS as they are cleared from this list).
forget_delegations:
A delegation is used to assure the client that a file, or part of a file,
has not changed since the delegation was awarded. Clearing this list will
force the client to reacquire its delegation before accessing the file
again.
recall_delegations:
Delegations can be recalled by the server when another client attempts to
access a file. This test will notify the client that its delegation has
been revoked, forcing the client to reacquire the delegation before using
the file again.
tools/nfs/inject_faults.sh script
=================================
This script has been created to ease the fault injection process. This script
will detect the mounted debugfs directory and write to the files located there
based on the arguments passed by the user. For example, running
`inject_faults.sh forget_locks 1` as root will instruct the server to forget
one lock. Running `inject_faults.sh forget_locks` will instruct the server to
forget all locks.
@@ -0,0 +1,75 @@
=========
ID Mapper
=========
Id mapper is used by NFS to translate user and group ids into names, and to
translate user and group names into ids. Part of this translation involves
performing an upcall to userspace to request the information. There are two
ways NFS could obtain this information: by placing a call to /sbin/request-key
or by placing a call to the rpc.idmap daemon.
NFS will attempt to call /sbin/request-key first. If this succeeds, the
result will be cached using the generic request-key cache. This call should
only fail if /etc/request-key.conf is not configured for the id_resolver key
type, see the "Configuring" section below if you wish to use the request-key
method.
If the call to /sbin/request-key fails (if /etc/request-key.conf is not
configured with the id_resolver key type), then the idmapper will ask the
legacy rpc.idmap daemon for the id mapping. This result will be stored
in a custom NFS idmap cache.
===========
Configuring
===========
The file /etc/request-key.conf will need to be modified so /sbin/request-key can
direct the upcall. The following line should be added:
#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
#====== ======= =============== =============== ===============================
create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
The last parameter, 600, defines how many seconds into the future the key will
expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
is not specified, nfs.idmap will default to 600 seconds.
The idmapper system uses four key descriptions:
uid: Find the UID for the given user
gid: Find the GID for the given group
user: Find the user name for the given UID
group: Find the group name for the given GID
You can handle any of these individually, rather than using the generic upcall
program. If you would like to use your own program for a uid lookup then you
would edit your request-key.conf so it looks similar to this:
#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
#====== ======= =============== =============== ===============================
create id_resolver uid:* * /some/other/program %k %d 600
create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
Notice that the new line was added above the line for the generic program.
request-key will find the first matching line and corresponding program. In
this case, /some/other/program will handle all uid lookups and
/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
See <file:Documentation/security/keys-request-key.txt> for more information
about the request-key function.
=========
nfs.idmap
=========
nfs.idmap is designed to be called by request-key, and should not be run "by
hand". This program takes two arguments, a serialized key and a key
description. The serialized key is first converted into a key_serial_t, and
then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
determines the correct function to call by looking at the first part of the
description string. For example, a uid lookup description will appear as
"uid:user@domain".
nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.
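For illustration, the core of such a request-key handler might look like the
sketch below; only the "uid:" case is shown, error handling is trimmed, and
the nfs4_name_to_uid() signature is assumed from libnfsidmap:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <keyutils.h>
    #include <nfsidmap.h>

    /* Invoked by request-key as: <program> <key-serial> <description> */
    int main(int argc, char **argv)
    {
            key_serial_t key;
            uid_t uid;
            char buf[32];

            if (argc < 3)
                    return 1;
            key = (key_serial_t)strtol(argv[1], NULL, 10);

            /* Dispatch on the description prefix; gid:, user: and
             * group: would be handled analogously. */
            if (strncmp(argv[2], "uid:", 4) != 0)
                    return 1;
            if (nfs4_name_to_uid(argv[2] + 4, &uid))        /* assumed API */
                    return 1;

            /* Instantiate the key: the payload becomes the kernel's
             * answer to its upcall. */
            snprintf(buf, sizeof(buf), "%u", uid);
            return keyctl_instantiate(key, buf, strlen(buf), 0) < 0;
    }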
@@ -0,0 +1,159 @@
Kernel NFS Server Statistics
============================
This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
which is described separately below.
In most cases you don't need to know these formats, as the nfsstat(8)
program from the nfs-utils distribution provides a helpful command-line
interface for extracting and printing them.
All the files described here are formatted as a sequence of text lines,
separated by newline '\n' characters. Lines beginning with a hash
'#' character are comments intended for humans and should be ignored
by parsing routines. All other lines contain a sequence of fields
separated by whitespace.
/proc/fs/nfsd/pool_stats
------------------------
This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).
The first line is a comment which describes the fields present in
all the other lines. The other lines present the following data as
a sequence of unsigned decimal numeric fields. One line is shown
for each NFS thread pool.
All counters are 64 bits wide and wrap naturally. There is no way
to zero these counters, instead applications should do their own
rate conversion.
pool
The id number of the NFS thread pool to which this line applies.
This number does not change.
Thread pool ids are a contiguous set of small integers starting
at zero. The maximum value depends on the thread pool mode, but
currently cannot be larger than the number of CPUs in the system.
Note that in the default case there will be a single thread pool
which contains all the nfsd threads and all the CPUs in the system,
and thus this file will have a single line with a pool id of "0".
packets-arrived
Counts how many NFS packets have arrived. More precisely, this
is the number of times that the network stack has notified the
sunrpc server layer that new data may be available on a transport
(e.g. a TCP or UDP socket or an NFS/RDMA endpoint).
Depending on the NFS workload patterns and various network stack
effects (such as Large Receive Offload) which can combine packets
on the wire, this may be either more or less than the number
of NFS calls received (which statistic is available elsewhere).
However this is a more accurate and less workload-dependent measure
of how much CPU load is being placed on the sunrpc server layer
due to NFS network traffic.
sockets-enqueued
Counts how many times an NFS transport is enqueued to wait for
an nfsd thread to service it, i.e. no nfsd thread was considered
available.
The circumstance this statistic tracks indicates that there was NFS
network-facing work to be done but it couldn't be done immediately,
thus introducing a small delay in servicing NFS calls. The ideal
rate of change for this counter is zero; significantly non-zero
values may indicate a performance limitation.
This can happen either because there are too few nfsd threads in the
thread pool for the NFS workload (the workload is thread-limited),
or because the NFS workload needs more CPU time than is available in
the thread pool (the workload is CPU-limited). In the former case,
configuring more nfsd threads will probably improve the performance
of the NFS workload. In the latter case, the sunrpc server layer is
already choosing not to wake idle nfsd threads because there are too
many nfsd threads which want to run but cannot, so configuring more
nfsd threads will make no difference whatsoever. The overloads-avoided
statistic (see below) can be used to distinguish these cases.
threads-woken
Counts how many times an idle nfsd thread is woken to try to
receive some data from an NFS transport.
This statistic tracks the circumstance where incoming
network-facing NFS work is being handled quickly, which is a good
thing. The ideal rate of change for this counter will be close
to but less than the rate of change of the packets-arrived counter.
overloads-avoided
Counts how many times the sunrpc server layer chose not to wake an
nfsd thread, despite the presence of idle nfsd threads, because
too many nfsd threads had been recently woken but could not get
enough CPU time to actually run.
This statistic counts a circumstance where the sunrpc layer
heuristically avoids overloading the CPU scheduler with too many
runnable nfsd threads. The ideal rate of change for this counter
is zero. Significant non-zero values indicate that the workload
is CPU limited. Usually this is associated with heavy CPU usage
on all the CPUs in the nfsd thread pool.
If a sustained large overloads-avoided rate is detected on a pool,
the top(1) utility should be used to check for the following
pattern of CPU usage on all the CPUs associated with the given
nfsd thread pool.
- %us ~= 0 (as you're *NOT* running applications on your NFS server)
- %wa ~= 0
- %id ~= 0
- %sy + %hi + %si ~= 100
If this pattern is seen, configuring more nfsd threads will *not*
improve the performance of the workload. If this pattern is not
seen, then something more subtle is wrong.
threads-timedout
Counts how many times an nfsd thread triggered an idle timeout,
i.e. was not woken to handle any incoming network packets for
some time.
This statistic counts a circumstance where there are more nfsd
threads configured than can be used by the NFS workload. This is
a clue that the number of nfsd threads can be reduced without
affecting performance. Unfortunately, it's only a clue and not
a strong indication, for a couple of reasons:
- Currently the rate at which the counter is incremented is quite
slow; the idle timeout is 60 minutes. Unless the NFS workload
remains constant for hours at a time, this counter is unlikely
to be providing information that is still useful.
- It is usually a wise policy to provide some slack,
i.e. configure a few more nfsds than are currently needed,
to allow for future spikes in load.
Note that incoming packets on NFS transports will be dealt with in
one of three ways. An nfsd thread can be woken (threads-woken counts
this case), or the transport can be enqueued for later attention
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
counted, but can be inferred from the other counters thus:
packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
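As a sketch of how a monitoring tool might consume these counters and derive
packets-deferred (assuming the column order matches the field descriptions
above), consider:

    #include <stdio.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    unsigned long long pool, arrived, enqueued, woken,
                                       avoided, timedout;

                    if (line[0] == '#')     /* skip the comment header */
                            continue;
                    if (sscanf(line, "%llu %llu %llu %llu %llu %llu",
                               &pool, &arrived, &enqueued, &woken,
                               &avoided, &timedout) != 6)
                            continue;
                    printf("pool %llu: packets-deferred = %llu\n",
                           pool, arrived - (enqueued + woken));
            }
            fclose(f);
            return 0;
    }

A real tool would sample the file periodically and report rates rather than
raw wrapping counters.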
More
----
Descriptions of the other statistics file should go here.
Greg Banks <gnb@sgi.com>
26 Mar 2009
@@ -0,0 +1,271 @@
################################################################################
# #
# NFS/RDMA README #
# #
################################################################################
Author: NetApp and Open Grid Computing
Date: May 29, 2008
Table of Contents
~~~~~~~~~~~~~~~~~
- Overview
- Getting Help
- Installation
- Check RDMA and NFS Setup
- NFS/RDMA Setup
Overview
~~~~~~~~
This document describes how to install and set up the Linux NFS/RDMA client
and server software.
The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
was first included in the following release, Linux 2.6.25.
In our testing, we have obtained excellent performance results (full 10Gbit
wire bandwidth at minimal client CPU) under many workloads. The code passes
the full Connectathon test suite and operates over both Infiniband and iWARP
RDMA adapters.
Getting Help
~~~~~~~~~~~~
If you get stuck, you can ask questions on the
nfs-rdma-devel@lists.sourceforge.net
mailing list.
Installation
~~~~~~~~~~~~
These instructions are a step by step guide to building a machine for
use with NFS/RDMA.
- Install an RDMA device
Any device supported by the drivers in drivers/infiniband/hw is acceptable.
Testing has been performed using several Mellanox-based IB cards, the
Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
- Install a Linux distribution and tools
The first kernel release to contain both the NFS/RDMA client and server was
Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
Linux kernel releases should be installed.
The procedures described in this document have been tested with
distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
- Install nfs-utils-1.1.2 or greater on the client
An NFS/RDMA mount point can be obtained by using the mount.nfs command in
nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
version with support for NFS/RDMA mounts, but for various reasons we
recommend using nfs-utils-1.1.2 or greater). To see which version of
mount.nfs you are using, type:
$ /sbin/mount.nfs -V
If the version is less than 1.1.2 or the command does not exist,
you should install the latest version of nfs-utils.
Download the latest package from:
http://www.kernel.org/pub/linux/utils/nfs
Uncompress the package and follow the installation instructions.
If you will not need the idmapper and gssd executables (you do not need
these to create an NFS/RDMA enabled mount command), the installation
process can be simplified by disabling these features when running
configure:
$ ./configure --disable-gss --disable-nfsv4
To build nfs-utils you will need the tcp_wrappers package installed. For
more information on this see the package's README and INSTALL files.
After building the nfs-utils package, there will be a mount.nfs binary in
the utils/mount directory. This binary can be used to initiate NFS v2, v3,
or v4 mounts. To initiate a v4 mount, the binary must be called
mount.nfs4. The standard technique is to create a symlink called
mount.nfs4 to mount.nfs.
This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
$ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
In this location, mount.nfs will be invoked automatically for NFS mounts
by the system mount command.
NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
on the NFS client machine. You do not need this specific version of
nfs-utils on the server. Furthermore, only the mount.nfs command from
nfs-utils-1.1.2 is needed on the client.
- Install a Linux kernel with NFS/RDMA
The NFS/RDMA client and server are both included in the mainline Linux
kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
kernel can be found at:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
Download the sources and place them in an appropriate location.
- Configure the RDMA stack
Make sure your kernel configuration has RDMA support enabled. Under
Device Drivers -> InfiniBand support, update the kernel configuration
to enable InfiniBand support [NOTE: the option name is misleading. Enabling
InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
iWARP adapter support (amso, cxgb3, etc.).
If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
- Configure the NFS client and server
Your kernel configuration must also have NFS file system support and/or
NFS server support enabled. These and other NFS related configuration
options can be found under File Systems -> Network File Systems.
- Build, install, reboot
The NFS/RDMA code will be enabled automatically if NFS and RDMA
are turned on. The NFS/RDMA client and server are configured via the hidden
SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
value of SUNRPC_XPRT_RDMA will be:
- N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client
and server will not be built
- M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M,
in this case the NFS/RDMA client and server will be built as modules
- Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client
and server will be built into the kernel
Therefore, if you have followed the steps above and turned on NFS and RDMA,
the NFS/RDMA client and server will be built.
Build a new kernel, install it, boot it.
Check RDMA and NFS Setup
~~~~~~~~~~~~~~~~~~~~~~~~
Before configuring the NFS/RDMA software, it is a good idea to test
your new kernel to ensure that the kernel is working correctly.
In particular, it is a good idea to verify that the RDMA stack
is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
is working properly.
- Check RDMA Setup
If you built the RDMA components as modules, load them at
this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
card:
$ modprobe ib_mthca
$ modprobe ib_ipoib
If you are using InfiniBand, make sure there is a Subnet Manager (SM)
running on the network. If your IB switch has an embedded SM, you can
use it. Otherwise, you will need to run an SM, such as OpenSM, on one
of your end nodes.
If an SM is running on your network, you should see the following:
$ cat /sys/class/infiniband/driverX/ports/1/state
4: ACTIVE
where driverX is mthca0, ipath5, ehca3, etc.
To further test the InfiniBand software stack, use IPoIB (this
assumes you have two IB hosts named host1 and host2):
host1$ ifconfig ib0 a.b.c.x
host2$ ifconfig ib0 a.b.c.y
host1$ ping a.b.c.y
host2$ ping a.b.c.x
For other device types, follow the appropriate procedures.
- Check NFS Setup
For the NFS components enabled above (client and/or server),
test their functionality over standard Ethernet using TCP/IP or UDP/IP.
NFS/RDMA Setup
~~~~~~~~~~~~~~
We recommend that you use two machines, one to act as the client and
one to act as the server.
One time configuration:
- On the server system, configure the /etc/exports file and
start the NFS/RDMA server.
Exports entries with the following formats have been tested:
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
The IP address(es) is(are) the client's IPoIB address for an InfiniBand
HCA or the client's iWARP address(es) for an RNIC.
NOTE: The "insecure" option must be used because the NFS/RDMA client does
not use a reserved port.
Each time a machine boots:
- Load and configure the RDMA drivers
For InfiniBand using a Mellanox adapter:
$ modprobe ib_mthca
$ modprobe ib_ipoib
$ ifconfig ib0 a.b.c.d
NOTE: use unique addresses for the client and server
- Start the NFS server
If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
kernel config), load the RDMA transport module:
$ modprobe svcrdma
Regardless of how the server was built (module or built-in), start the
server:
$ /etc/init.d/nfs start
or
$ service nfs start
Instruct the server to listen on the RDMA transport:
$ echo rdma 20049 > /proc/fs/nfsd/portlist
- On the client system
If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
kernel config), load the RDMA client module:
$ modprobe xprtrdma
Regardless of how the client was built (module or built-in), use this
command to mount the NFS/RDMA server:
$ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
To verify that the mount is using RDMA, run "cat /proc/mounts" and check
the "proto" field for the given mount.
Congratulations! You're using NFS/RDMA!
@@ -0,0 +1,98 @@
The NFS client
==============
The NFS version 2 protocol was first documented in RFC1094 (March 1989).
Since then two more major releases of NFS have been published, with NFSv3
being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
2003).
The Linux NFS client currently supports all the above published versions,
and work is in progress on adding support for minor version 1 of the NFSv4
protocol.
The purpose of this document is to provide information on some of the
upcall interfaces that are used in order to provide the NFS client with
some of the information that it requires in order to fully comply with
the NFS spec.
The DNS resolver
================
NFSv4 allows for one server to refer the NFS client to data that has been
migrated onto another server by means of the special "fs_locations"
attribute. See
http://tools.ietf.org/html/rfc3530#section-6
and
http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
The fs_locations information can take the form of either an ip address and
a path, or a DNS hostname and a path. The latter requires the NFS client to
do a DNS lookup in order to mount the new volume, and hence the need for an
upcall to allow userland to provide this service.
Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
(1) The process checks the dns_resolve cache to see if it contains a
valid entry. If so, it returns that entry and exits.
(2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
(may be changed using the 'nfs.cache_getent' kernel boot parameter)
is run, with two arguments:
- the cache name, "dns_resolve"
- the hostname to resolve
(3) After looking up the corresponding ip address, the helper script
writes the result into the rpc_pipefs pseudo-file
'/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
in the following (text) format:
"<ip address> <hostname> <ttl>\n"
Where <ip address> is in the usual IPv4 (e.g. 192.168.78.90) or IPv6
(ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
<hostname> is identical to the second argument of the helper
script, and <ttl> is the 'time to live' of this cache entry (in
units of seconds).
Note: If <ip address> is invalid, say the string "0", then a negative
entry is created, which will cause the kernel to treat the hostname
as having no valid DNS translation.
A basic sample /sbin/nfs_cache_getent
=====================================
#!/bin/bash
#
ttl=600
#
cut=/usr/bin/cut
getent=/usr/bin/getent
rpc_pipefs=/var/lib/nfs/rpc_pipefs
#
die()
{
echo "Usage: $0 cache_name entry_name"
exit 1
}
[ $# -lt 2 ] && die
cachename="$1"
cache_path=${rpc_pipefs}/cache/${cachename}/channel
case "${cachename}" in
dns_resolve)
name="$2"
result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
[ -z "${result}" ] && result="0"
;;
*)
die
;;
esac
echo "${result} ${name} ${ttl}" >${cache_path}
@@ -0,0 +1,208 @@
NFSv4.1 Server Implementation
Server support for minorversion 1 can be controlled using the
/proc/fs/nfsd/versions control file. The string output returned
by reading this file will contain either "+4.1" or "-4.1"
correspondingly.
Currently, server support for minorversion 1 is disabled by default.
It can be enabled at run time by writing the string "+4.1" to
the /proc/fs/nfsd/versions control file. Note that to write this
control file, the nfsd service must be taken down. Use your user-mode
nfs-utils to set this up; see rpc.nfsd(8)
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
"-4", respectively. Therefore, code meant to work on both new and old
kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.)
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on RFC 5661.
Of the many new features in NFSv4.1 the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
Other NFSv4.1 features, Parallel NFS operations in particular,
are still under development out of tree.
See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
for more information.
The current implementation is intended for developers only: while it
does support ordinary file operations on clients we have tested against
(including the linux client), it is incomplete in ways which may limit
features unexpectedly, cause known bugs in rare cases, or cause
interoperability problems with future clients. Known issues:
- gss support is questionable: currently mounts with kerberos
from a linux client are possible, but we aren't really
conformant with the spec (for example, we don't use kerberos
on the backchannel correctly).
- Incomplete backchannel support: incomplete backchannel gss
support and no support for BACKCHANNEL_CTL mean that
callbacks (hence delegations and layouts) may not be
available and clients confused by the incomplete
implementation may fail.
- We do not support SSV, which provides security for shared
client-server state (thus preventing unauthorized tampering
with locks and opens, for example). It is mandatory for
servers to support this, though no clients use it yet.
- Mandatory operations which we do not support, such as
DESTROY_CLIENTID, are not currently used by clients, but will be
(and the spec recommends their uses in common cases), and
clients should not be expected to know how to recover from the
case where they are not supported. This will eventually cause
interoperability failures.
In addition, some limitations are inherited from the current NFSv4
implementation:
- Incomplete delegation enforcement: if a file is renamed or
unlinked by a local process, a client holding a delegation may
continue to indefinitely allow opens of the file under the old
name.
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and NFSv4.0 operations that are required not to implement (MNI)
in minor version 1. The first column indicates the operations that
are not supported yet by the linux server implementation.
The OPTIONAL features identified and their abbreviations are as follows:
pNFS Parallel NFS
FDELG File Delegations
DDELG Directory Delegations
The following abbreviations indicate the linux server implementation status.
I Implemented NFSv4.1 operations.
NS Not Supported.
NS* unimplemented optional feature.
P pNFS features implemented out of tree.
PNS pNFS features that are not supported yet (out of tree).
Operations
+----------------------+------------+--------------+----------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+----------------------+------------+--------------+----------------+
| ACCESS | REQ | | Section 18.1 |
NS | BACKCHANNEL_CTL | REQ | | Section 18.33 |
I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
| CLOSE | REQ | | Section 18.2 |
| COMMIT | REQ | | Section 18.3 |
| CREATE | REQ | | Section 18.4 |
I | CREATE_SESSION | REQ | | Section 18.36 |
NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
| DELEGRETURN | OPT | FDELG, | Section 18.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | DESTROY_CLIENTID | REQ | | Section 18.50 |
I | DESTROY_SESSION | REQ | | Section 18.37 |
I | EXCHANGE_ID | REQ | | Section 18.35 |
I | FREE_STATEID | REQ | | Section 18.38 |
| GETATTR | REQ | | Section 18.7 |
P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
| GETFH | REQ | | Section 18.8 |
NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
| LINK | OPT | | Section 18.9 |
| LOCK | REQ | | Section 18.10 |
| LOCKT | REQ | | Section 18.11 |
| LOCKU | REQ | | Section 18.12 |
| LOOKUP | REQ | | Section 18.13 |
| LOOKUPP | REQ | | Section 18.14 |
| NVERIFY | REQ | | Section 18.15 |
| OPEN | REQ | | Section 18.16 |
NS*| OPENATTR | OPT | | Section 18.17 |
| OPEN_CONFIRM | MNI | | N/A |
| OPEN_DOWNGRADE | REQ | | Section 18.18 |
| PUTFH | REQ | | Section 18.19 |
| PUTPUBFH | REQ | | Section 18.20 |
| PUTROOTFH | REQ | | Section 18.21 |
| READ | REQ | | Section 18.22 |
| READDIR | REQ | | Section 18.23 |
| READLINK | OPT | | Section 18.24 |
| RECLAIM_COMPLETE | REQ | | Section 18.51 |
| RELEASE_LOCKOWNER | MNI | | N/A |
| REMOVE | REQ | | Section 18.25 |
| RENAME | REQ | | Section 18.26 |
| RENEW | MNI | | N/A |
| RESTOREFH | REQ | | Section 18.27 |
| SAVEFH | REQ | | Section 18.28 |
| SECINFO | REQ | | Section 18.29 |
I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
| | | layout (REQ) | Section 13.12 |
I | SEQUENCE | REQ | | Section 18.46 |
| SETATTR | REQ | | Section 18.30 |
| SETCLIENTID | MNI | | N/A |
| SETCLIENTID_CONFIRM | MNI | | N/A |
NS | SET_SSV | REQ | | Section 18.47 |
I | TEST_STATEID | REQ | | Section 18.48 |
| VERIFY | REQ | | Section 18.31 |
NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
| WRITE | REQ | | Section 18.32 |
Callback Operations
+-------------------------+-----------+-------------+---------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+-------------------------+-----------+-------------+---------------+
| CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
| CB_RECALL | OPT | FDELG, | Section 20.2 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
| | | (REQ) | |
I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
| | | DDELG, pNFS | |
| | | (REQ) | |
+-------------------------+-----------+-------------+---------------+
Implementation notes:
DELEGPURGE:
* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
persist across client reboots). Thus we need not implement this for
now.
EXCHANGE_ID:
* only SP4_NONE state protection supported
* implementation ids are ignored
CREATE_SESSION:
* backchannel attributes are ignored
* backchannel security parameters are ignored
SEQUENCE:
* no support for dynamic slot table renegotiation (optional)
Nonstandard compound limitations:
* No support for a sessions fore channel RPC compound that requires both a
ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
fail to live up to the promise we made in CREATE_SESSION fore channel
negotiation.
* No more than one IO operation (read, write, readdir) allowed per
compound.
See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
@@ -0,0 +1,294 @@
Mounting the root filesystem via NFS (nfsroot)
===============================================
Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
Updated 2006 by Horms <horms@verge.net.au>
In order to use a diskless system, such as an X-terminal or printer server
for example, it is necessary for the root filesystem to be present on a
non-disk device. This may be an initramfs (see Documentation/filesystems/
ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a
filesystem mounted via NFS. The following text describes how to use NFS
for the root filesystem. For the rest of this text 'client' means the
diskless system, and 'server' means the NFS server.
1.) Enabling nfsroot capabilities
-----------------------------
In order to use nfsroot, NFS client support needs to be selected as
built-in during configuration. Once this has been selected, the nfsroot
option will become available, which should also be selected.
In the networking options, kernel level autoconfiguration can be selected,
along with the types of autoconfiguration to support. Selecting all of
DHCP, BOOTP and RARP is safe.
2.) Kernel command line
-------------------
When the kernel has been loaded by a boot loader (see below) it needs to be
told what root fs device to use. And in the case of nfsroot, where to find
both the server and the name of the directory on the server to mount as root.
This can be established using the following kernel command line parameters:
root=/dev/nfs
This is necessary to enable the pseudo-NFS-device. Note that it's not a
real device but just a synonym to tell the kernel to use NFS instead of
a real device.
nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
If the `nfsroot' parameter is NOT given on the command line,
the default "/tftpboot/%s" will be used.
<server-ip> Specifies the IP address of the NFS server.
The default address is determined by the `ip' parameter
(see below). This parameter allows the use of different
servers for IP autoconfiguration and NFS.
<root-dir> Name of the directory on the server to mount as root.
If there is a "%s" token in the string, it will be
replaced by the ASCII-representation of the client's
IP address.
<nfs-options> Standard NFS options. All options are separated by commas.
The following defaults are used:
port = as given by server portmap daemon
rsize = 4096
wsize = 4096
timeo = 7
retrans = 3
acregmin = 3
acregmax = 60
acdirmin = 30
acdirmax = 60
flags = hard, nointr, noposix, cto, ac
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
This parameter tells the kernel how to configure IP addresses of devices
and also how to set up the IP routing table. It was originally called
`nfsaddrs', but now the boot-time IP configuration works independently of
NFS, so it was renamed to `ip' and the old name remained as an alias for
compatibility reasons.
If this parameter is missing from the kernel command line, all fields are
assumed to be empty, and the defaults mentioned below apply. In general
this means that the kernel tries to configure everything using
autoconfiguration.
The <autoconf> parameter can appear alone as the value to the `ip'
parameter (without all the ':' characters before). If the value is
"ip=off" or "ip=none", no autoconfiguration will take place, otherwise
autoconfiguration will take place. The most common way to use this
is "ip=dhcp".
<client-ip> IP address of the client.
Default: Determined using autoconfiguration.
<server-ip> IP address of the NFS server. If RARP is used to determine
the client address and this parameter is NOT empty only
replies from the specified server are accepted.
Only required for NFS root. That is, autoconfiguration
will not be triggered if it is missing and NFS root is not
in operation.
Default: Determined using autoconfiguration.
The address of the autoconfiguration server is used.
<gw-ip> IP address of a gateway if the server is on a different subnet.
Default: Determined using autoconfiguration.
<netmask> Netmask for local network interface. If unspecified
the netmask is derived from the client IP address assuming
classful addressing.
Default: Determined using autoconfiguration.
<hostname> Name of the client. May be supplied by autoconfiguration,
but its absence will not trigger autoconfiguration.
If specified and DHCP is used, the user provided hostname will
be carried in the DHCP request to hopefully update DNS record.
Default: Client IP address is used in ASCII notation.
<device> Name of network device to use.
Default: If the host only has one device, it is used.
Otherwise the device is determined using
autoconfiguration. This is done by sending
autoconfiguration requests out of all devices,
and using the device that received the first reply.
<autoconf> Method to use for autoconfiguration. In the case of options
which specify multiple autoconfiguration protocols,
requests are sent using all protocols, and the first one
to reply is used.
Only autoconfiguration protocols that have been compiled
into the kernel will be used, regardless of the value of
this option.
off or none: don't use autoconfiguration
(do static IP assignment instead)
on or any: use any protocol available in the kernel
(default)
dhcp: use DHCP
bootp: use BOOTP
rarp: use RARP
both: use both BOOTP and RARP but not DHCP
(old option kept for backwards compatibility)
Default: any
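For example (the server address and export path below are purely
illustrative), a complete set of parameters for a diskless client
using DHCP might be:

     root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot,rsize=8192,wsize=8192 ip=dhcp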
nfsrootdebug
This parameter enables debugging messages to appear in the kernel
log at boot time so that administrators can verify that the correct
NFS mount options, server address, and root path are passed to the
NFS client.
rdinit=<executable file>
To specify which file contains the program that starts system
initialization, administrators can use this command line parameter.
The default value of this parameter is "/init". If the specified
file exists and the kernel can execute it, root filesystem related
kernel command line parameters, including `nfsroot=', are ignored.
A description of the process of mounting the root file system can be
found in:
Documentation/early-userspace/README
3.) Boot Loader
----------
To get the kernel into memory different approaches can be used.
They depend on various facilities being available:
3.1) Booting from a floppy using syslinux
When building kernels, an easy way to create a boot floppy that uses
syslinux is to use the zdisk or bzdisk make targets which use zimage
and bzimage images respectively. Both targets accept the
FDARGS parameter which can be used to set the kernel command line.
e.g.
make bzdisk FDARGS="root=/dev/nfs"
Note that the user running this command will need to have
access to the floppy drive device, /dev/fd0
For more information on syslinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
N.B: Previously it was possible to write a kernel directly to
a floppy using dd, configure the boot device using rdev, and
boot using the resulting floppy. Linux no longer supports this
method of booting.
3.2) Booting from a cdrom using isolinux
When building kernels, an easy way to create a bootable cdrom that
uses isolinux is to use the isoimage target which uses a bzimage
image. Like zdisk and bzdisk, this target accepts the FDARGS
parameter which can be used to set the kernel command line.
e.g.
make isoimage FDARGS="root=/dev/nfs"
The resulting iso image will be arch/<ARCH>/boot/image.iso
This can be written to a cdrom using a variety of tools including
cdrecord.
e.g.
cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso
For more information on isolinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
3.3) Using LILO
When using LILO all the necessary command line parameters may be
specified using the 'append=' directive in the LILO configuration
file.
However, to use the 'root=' directive you also need to create
a dummy root device, which may be removed after LILO is run.
mknod /dev/boot255 c 0 255
For information on configuring LILO, please refer to its documentation.
3.4) Using GRUB
When using GRUB, kernel parameters are simply appended after the kernel
specification: kernel <kernel> <parameters>
3.5) Using loadlin
loadlin may be used to boot Linux from a DOS command prompt without
requiring a local hard disk to mount as root. This has not been
thoroughly tested by the authors of this document, but in general
it should be possible to configure the kernel command line similarly
to the configuration of LILO.
Please refer to the loadlin documentation for further information.
3.6) Using a boot ROM
This is probably the most elegant way of booting a diskless client.
With a boot ROM the kernel is loaded using the TFTP protocol. The
authors of this document are not aware of any commercial boot
ROMs that support booting Linux over the network. However, there
are two free implementations of a boot ROM, netboot-nfs and
etherboot, both of which are available on sunsite.unc.edu, and both
of which contain everything you need to boot a diskless Linux client.
3.7) Using pxelinux
Pxelinux may be used to boot linux using the PXE boot loader
which is present on many modern network cards.
When using pxelinux, the kernel image is specified using
"kernel <relative-path-below /tftpboot>". The nfsroot parameters
are passed to the kernel by adding them to the "append" line.
It is common to use serial console in conjunction with pxelinux,
see Documentation/serial-console.txt for more information.
For more information on pxelinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
4.) Credits
-------
The nfsroot code in the kernel and the RARP support have been written
by Gero Kuhlmann <gero@gkminix.han.de>.
The rest of the IP layer autoconfiguration code has been written
by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
In order to write the initial version of nfsroot I would like to thank
Jens-Uwe Mager <jum@anubis.han.de> for his help.
@@ -0,0 +1,109 @@
Reference counting in pnfs:
==========================
There are several inter-related caches. We have layouts which can
reference multiple devices, each of which can reference multiple data servers.
Each data server can be referenced by multiple devices. Each device
can be referenced by multiple layouts. To keep all of this straight,
we need to reference count.
struct pnfs_layout_hdr
----------------------
The on-the-wire command LAYOUTGET corresponds to struct
pnfs_layout_segment, usually referred to by the variable name lseg.
Each nfs_inode may hold a pointer to a cache of these layout
segments in nfsi->layout, of type struct pnfs_layout_hdr.
We reference the header for the inode pointing to it, across each
outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
LAYOUTCOMMIT), and for each lseg held within.
Each header is also (when non-empty) put on a list associated with
struct nfs_client (cl_layouts). Being put on this list does not bump
the reference count, as the layout is kept around by the lseg that
keeps it in the list.
deviceid_cache
--------------
lsegs reference device ids, which are resolved per nfs_client and
layout driver type. The device ids are held in a RCU cache (struct
nfs4_deviceid_cache). The cache itself is referenced across each
mount. The entries (struct nfs4_deviceid) themselves are held across
the lifetime of each lseg referencing them.
RCU is used because the deviceid is basically a write once, read many
data structure. The hlist size of 32 buckets needs better
justification, but seems reasonable given that we can have multiple
deviceid's per filesystem, and multiple filesystems per nfs_client.
The hash code is copied from the nfsd code base. A discussion of
hashing and variations of this algorithm can be found at:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
data server cache
-----------------
file driver devices refer to data servers, which are kept in a module
level cache. Its reference is held over the lifetime of the deviceid
pointing to it.
lseg
----
lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
bit which holds it in the pnfs_layout_hdr's list. When the final lseg
is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
bit is set, preventing any new lsegs from being added.
layout drivers
--------------
pNFS utilizes what are called layout drivers. The STD defines 3 basic
layout types: "files", "objects" and "blocks". For each of these types
there is a layout driver with a common function-vector table which
is called by the nfs-client pnfs-core to implement the different layout
types.
Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c
Objects-layout-driver code is in: fs/nfs/objlayout/.. directory
Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
objects-layout setup
--------------------
As part of the full STD implementation the objlayoutdriver.ko needs, at times,
to automatically login to yet undiscovered iscsi/osd devices. For this the
driver makes up-calls to a user-mode script called *osd_login*
The path_name of the script to use is by default:
/sbin/osd_login.
This name can be overridden by the Kernel module parameter:
objlayoutdriver.osd_login_prog
If the kernel does not find the osd_login_prog path it will zero it out
and will not attempt further logins. An admin can then write a new value
to the objlayoutdriver.osd_login_prog kernel parameter to re-enable it.
The /sbin/osd_login is part of the nfs-utils package, and should usually
be installed on distributions that support this Kernel version.
The API to the login script is as follows:
Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
Options:
-u target uri e.g. iscsi://<ip>:<port>
(always exists)
(More protocols can be defined in the future.
The client does not interpret this string; it is
passed unchanged as received from the Server)
-o osdname of the requested target OSD
(Might be empty)
(A string which denotes the OSD name; there is a
limit of 64 chars on this string)
-s systemid of the requested target OSD
(Might be empty)
(This string, if not empty, is always a hex
representation of the 20-byte osd_system_id)
blocks-layout setup
-------------------
TODO: Document the setup needs of the blocks layout driver
@@ -0,0 +1,202 @@
This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.
CACHES
======
The caching replaces the old exports table and allows for
a wide variety of values to be cached.
There are a number of caches that are similar in structure though
quite possibly very different in content and use. There is a corpus
of common code for managing these caches.
Examples of caches that are likely to be needed are:
- mapping from IP address to client name
- mapping from client name and filesystem to export options
- mapping from UID to list of GIDs, to work around NFS's limitation
of 16 gids.
- mappings between local UID/GID and remote UID/GID for sites that
do not have uniform uid assignment
- mapping from network identity to public key for crypto authentication.
The common code handles such things as:
- general cache lookup with correct locking
- supporting 'NEGATIVE' as well as positive entries
- allowing an EXPIRED time on cache items, and removing
items after they expire, and are no longer in-use.
- making requests to user-space to fill in cache entries
- allowing user-space to directly set entries in the cache
- delaying RPC requests that depend on as-yet incomplete
cache entries, and replaying those requests when the cache entry
is complete.
- cleaning out old entries as they expire.
Creating a Cache
----------------
1/ A cache needs a datum to store. This is in the form of a
structure definition that must contain a
struct cache_head
as an element, usually the first.
It will also contain a key and some content.
Each cache element is reference counted and contains
expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
describes the cache. This stores the hash table, some
parameters for cache management, and some operations detailing how
to work with particular cache items.
The operations required are:
struct cache_head *alloc(void)
This simply allocates appropriate memory and returns
a pointer to the cache_head embedded within the
structure.
void cache_put(struct kref *)
This is called when the last reference to an item is
dropped. The pointer passed is to the 'ref' field
in the cache_head. cache_put should release any
references created by 'cache_init' and, if CACHE_VALID
is set, any references created by cache_update.
It should then release the memory allocated by
'alloc'.
int match(struct cache_head *orig, struct cache_head *new)
test if the keys in the two structures match. Return
1 if they do, 0 if they don't.
void init(struct cache_head *orig, struct cache_head *new)
Set the 'key' fields in 'new' from 'orig'. This may
include taking references to shared objects.
void update(struct cache_head *orig, struct cache_head *new)
Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file *m, struct cache_detail *cd,
struct cache_head *h)
Optional. Used to provide a /proc file that lists the
contents of a cache. This should show one item,
usually on just one line.
int cache_request(struct cache_detail *cd, struct cache_head *h,
char **bpp, int *blen)
Format a request to be sent to user-space for an item
to be instantiated. *bpp is a buffer of size *blen.
bpp should be moved forward over the encoded message,
and *blen should be reduced to show how much free
space remains. Return 0 on success or <0 if not
enough room or other problem.
int cache_parse(struct cache_detail *cd, char *buf, int len)
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
cache with sunrpc_cache_lookup, and update the item
with sunrpc_cache_update.
3/ A cache needs to be registered using cache_register(). This
includes it on a list of caches that will be regularly
cleaned to discard old data.
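Pulling the pieces above together, a skeleton for a hypothetical cache might
look like the following. All names are invented; note that the argument
order for init and update follows the in-kernel callers, which pass the new
entry first and the template item second:

    static struct cache_head *mycache_table[32];    /* hash buckets */

    struct mycache_entry {
            struct cache_head h;    /* embedded, first member */
            int     key;
            int     value;
    };

    static struct cache_head *mycache_alloc(void)
    {
            struct mycache_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);

            return e ? &e->h : NULL;
    }

    static void mycache_put(struct kref *ref)
    {
            struct cache_head *h = container_of(ref, struct cache_head, ref);

            /* No shared references to release in this simple example. */
            kfree(container_of(h, struct mycache_entry, h));
    }

    static int mycache_match(struct cache_head *a, struct cache_head *b)
    {
            return container_of(a, struct mycache_entry, h)->key ==
                   container_of(b, struct mycache_entry, h)->key;
    }

    static void mycache_init(struct cache_head *cnew, struct cache_head *citem)
    {
            /* copy the key fields from the template item */
            container_of(cnew, struct mycache_entry, h)->key =
                    container_of(citem, struct mycache_entry, h)->key;
    }

    static void mycache_update(struct cache_head *cnew, struct cache_head *citem)
    {
            /* copy the content fields from the template item */
            container_of(cnew, struct mycache_entry, h)->value =
                    container_of(citem, struct mycache_entry, h)->value;
    }

    static struct cache_detail mycache_detail = {
            .owner          = THIS_MODULE,
            .hash_size      = 32,
            .hash_table     = mycache_table,
            .name           = "mycache",
            .cache_put      = mycache_put,
            .match          = mycache_match,
            .init           = mycache_init,
            .update         = mycache_update,
            .alloc          = mycache_alloc,
    };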
Using a cache
-------------
To find a value in a cache, call sunrpc_cache_lookup passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry. If no
entry is found, a new entry will be created, added to the cache, and
marked as not containing valid data.
The item returned is typically passed to cache_check which will check
if the data is valid, and may initiate an up-call to get fresh data.
cache_check will return -ENOENT if the entry is negative or if an
upcall is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid.
cache_check can be passed a "struct cache_req *". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not uptodate, but there is reason to
believe that userspace might provide information soon. When the cache
item does become valid, the deferred copy of the request will be
revisited (->revisit). It is expected that this method will
reschedule the request for processing.
The value returned by sunrpc_cache_lookup can also be passed to
sunrpc_cache_update to set the content for the item. A second item is
passed which should hold the content. If the item found by _lookup
has valid data, then it is discarded and a new item is created. This
saves any user of an item from worrying about content changing while
it is being inspected. If the item found by _lookup does not contain
valid data, then the content is copied across and CACHE_VALID is set.
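In terms of the hypothetical cache sketched earlier, a lookup wrapper might
read as follows (a sketch only; reference handling on the error paths is
trimmed):

    static struct mycache_entry *mycache_find(int key, struct cache_req *rq)
    {
            struct mycache_entry sample = { .key = key };
            struct cache_head *ch;

            /* Find or create an entry whose key matches the sample. */
            ch = sunrpc_cache_lookup(&mycache_detail, &sample.h,
                                     hash_32(key, 5));  /* 2^5 = 32 buckets */
            if (!ch)
                    return NULL;

            /* Validate the entry, possibly upcalling to user space; a
             * negative return (e.g. -EAGAIN) means the caller should
             * defer or fail the request. */
            if (cache_check(&mycache_detail, ch, rq) < 0)
                    return NULL;
            return container_of(ch, struct mycache_entry, h);
    }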
Populating a cache
------------------
Each cache has a name, and when the cache is registered, a directory
with that name is created in /proc/net/rpc
This directory contains a file called 'channel' which is a channel
for communicating between kernel and user for populating the cache.
This directory may later contain other files for interacting
with the cache.
The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
- a key
- an expiry time
- a content.
with the intention that an item in the cache with the given key
should be created or updated to have the given content, and the
expiry time should be set on that item.
Reading from a channel is a bit more interesting. When a cache
lookup fails, or when it succeeds but finds an entry that may soon
expire, a request is lodged for that cache item to be updated by
user-space. These requests appear in the channel file.
Successive reads will return successive requests.
If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.
Thus a user-space helper is likely to:
open the channel.
select for readable
read a request
write a response
loop.
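In C, the skeleton of such a helper might look like this (a sketch: the
cache name and the reply format are illustrative, and real request parsing
is omitted):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char req[4096];
            struct pollfd pfd;
            int n;

            pfd.fd = open("/proc/net/rpc/mycache/channel", O_RDWR);
            if (pfd.fd < 0)
                    return 1;
            pfd.events = POLLIN;

            for (;;) {
                    char *key;

                    if (poll(&pfd, 1, -1) < 0)
                            break;
                    n = read(pfd.fd, req, sizeof(req) - 1);
                    if (n <= 0)
                            continue;       /* EOF: nothing outstanding */
                    req[n] = '\0';
                    /* A real helper would parse the request and look up
                     * the answer; here we echo back an invented reply of
                     * the form "<key> <expiry> <content>". */
                    key = strtok(req, " \n");
                    if (key)
                            dprintf(pfd.fd, "%s 3600 42\n", key);
            }
            return 0;
    }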
If it dies and needs to be restarted, any requests that have not been
answered will still appear in the file and will be read by the new
instance of the helper.
Each cache should define a "cache_parse" method which takes a message
written from user-space and processes it. It should return an error
(which propagates back to the write syscall) or 0.
Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.
Note: If a cache has no active readers on the channel, and has had no
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.
request/response format
-----------------------
While each cache is free to use its own format for requests
and responses over the channel, the following is recommended as
appropriate and support routines are available to help:
Each request or response record should be printable ASCII
with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
must be quoted. Two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
hex digits, and pairs of these digits provide the bytes in the
field.
2/ otherwise a \ in the field must be followed by 3 octal digits
which give the code for a byte. Other characters are treated
as themselves. At the very least, space, newline, nul, and
'\' must be quoted in this way.
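For instance, a small helper that emits one field using the octal mechanism
of 2/ might be written as follows (a sketch; nfs-utils ships its own support
routines for this):

    #include <stdio.h>

    /* Write 'field' to 'out' followed by a separating space, quoting
     * space, newline and backslash as \ooo octal escapes.  A nul cannot
     * occur inside a C string, so it needs no handling here. */
    static void quote_field(FILE *out, const char *field)
    {
            const char *p;

            for (p = field; *p; p++) {
                    if (*p == ' ' || *p == '\n' || *p == '\\')
                            fprintf(out, "\\%03o", (unsigned char)*p);
                    else
                            fputc(*p, out);
            }
            fputc(' ', out);
    }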