M7350v1_en_gpl

This commit is contained in:
T
2024-09-09 08:52:07 +00:00
commit f9cc65cfda
65988 changed files with 26357421 additions and 0 deletions

View File

@@ -0,0 +1,44 @@
00-INDEX
- This file
apm-acpi.txt
- basic info about the APM and ACPI support.
basic-pm-debugging.txt
- Debugging suspend and resume
devices.txt
- How drivers interact with system-wide power management
drivers-testing.txt
- Testing suspend and resume support in device drivers
freezing-of-tasks.txt
- How processes and controlled during suspend
interface.txt
- Power management user interface in /sys/power
notifiers.txt
- Registering suspend notifiers in device drivers
opp.txt
- Operating Performance Point library
pci.txt
- How the PCI Subsystem Does Power Management
pm_qos_interface.txt
- info on Linux PM Quality of Service interface
power_supply_class.txt
- Tells userspace about battery, UPS, AC or DC power supply properties
s2ram.txt
- How to get suspend to ram working (and debug it when it isn't)
states.txt
- System power management states
suspend-and-cpuhotplug.txt
- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
swsusp-and-swap-files.txt
- Using swap files with software suspend (to disk)
swsusp-dmcrypt.txt
- How to use dm-crypt and software suspend (to disk) together
swsusp.txt
- Goals, implementation, and usage of software suspend (ACPI S3)
tricks.txt
- How to trick software suspend (to disk) into working when it isn't
userland-swsusp.txt
- Experimental implementation of software suspend in userspace
video_extension.txt
- ACPI video extensions
video.txt
- Video issues during resume from suspend

View File

@@ -0,0 +1,32 @@
APM or ACPI?
------------
If you have a relatively recent x86 mobile, desktop, or server system,
odds are it supports either Advanced Power Management (APM) or
Advanced Configuration and Power Interface (ACPI). ACPI is the newer
of the two technologies and puts power management in the hands of the
operating system, allowing for more intelligent power management than
is possible with BIOS controlled APM.
The best way to determine which, if either, your system supports is to
build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is
enabled by default). If a working ACPI implementation is found, the
ACPI driver will override and disable APM, otherwise the APM driver
will be used.
No, sorry, you cannot have both ACPI and APM enabled and running at
once. Some people with broken ACPI or broken APM implementations
would like to use both to get a full set of working features, but you
simply cannot mix and match the two. Only one power management
interface can be in control of the machine at once. Think about it..
User-space Daemons
------------------
Both APM and ACPI rely on user-space daemons, apmd and acpid
respectively, to be completely functional. Obtain both of these
daemons from your Linux distribution or from the Internet (see below)
and be sure that they are started sometime in the system boot process.
Go ahead and start both. If ACPI or APM is not available on your
system the associated daemon will exit gracefully.
apmd: http://ftp.debian.org/pool/main/a/apmd/
acpid: http://acpid.sf.net/

View File

@@ -0,0 +1,227 @@
Debugging hibernation and suspend
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
1. Testing hibernation (aka suspend to disk or STD)
To check if hibernation works, you can try to hibernate in the "reboot" mode:
# echo reboot > /sys/power/disk
# echo disk > /sys/power/state
and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you have started the transition. If that happens,
hibernation is most likely to work correctly. Still, you need to repeat the
test at least a couple of times in a row for confidence. [This is necessary,
because some problems only show up on a second attempt at suspending and
resuming the system.] Moreover, hibernating in the "reboot" and "shutdown"
modes causes the PM core to skip some platform-related callbacks which on ACPI
systems might be necessary to make hibernation work. Thus, if your machine fails
to hibernate or resume in the "reboot" mode, you should try the "platform" mode:
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
which is the default and recommended mode of hibernation.
Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work:
# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state
(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.
a) Test modes of hibernation
To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:
freezer
- test the freezing of processes
devices
- test the freezing of processes and suspending of devices
platform
- test the freezing of processes, suspending of devices and platform
global control methods(*)
processors
- test the freezing of processes, suspending of devices, platform
global control methods(*) and the disabling of nonboot CPUs
core
- test the freezing of processes, suspending of devices, platform global
control methods(*), the disabling of nonboot CPUs and suspending of
platform/system devices
(*) the platform global control methods are only available on ACPI systems
and are only tested if the hibernation mode is set to "platform"
To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (eg. "devices" to test the freezing of processes and
suspending devices) and issue the standard hibernation commands. For example,
to use the "devices" test mode along with the "platform" mode of hibernation,
you should do the following:
# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
Then, the kernel will try to freeze processes, suspend devices, wait 5 seconds,
resume devices and thaw processes. If "platform" is written to
/sys/power/pm_test , then after suspending devices the kernel will additionally
invoke the global control methods (eg. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait 5 seconds and
invoke the platform (eg. ACPI) global methods used to cancel hibernation etc.
Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test
contains a space-separated list of all available tests (including "none" that
represents the normal functionality) in which the current test level is
indicated by square brackets.
Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to make sure that any random factors are avoided).
If the "freezer" test fails, there is a task that cannot be frozen (in that case
it usually is possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.
If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:
- if the test fails, unload a half of the drivers currently loaded and repeat
(that would probably involve rebooting the system, so always note what drivers
have been loaded before the test),
- if the test succeeds, load a half of the drivers you have unloaded most
recently and repeat.
Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation. In that case please
make sure to report the problem with the driver.
It is also possible that the "devices" test will still fail after you have
unloaded all modules. In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules). You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".
If the "platform" test fails, there is a problem with the handling of the
platform (eg. ACPI) firmware on your system. In that case the "platform" mode
of hibernation is not likely to work. You can try the "shutdown" mode, but that
is rather a poor man's workaround.
If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this only may be an issue on SMP systems) and the problem
should be reported. In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.
If the "core" test fails, which means that suspending of the system/platform
devices has failed (these devices are suspended on one CPU with interrupts off),
the problem is most probably hardware-related and serious, so it should be
reported.
A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware. Such a failure usually
indicates a serious problem that very well may be related to the hardware, but
please report it anyway.
b) Testing minimal configuration
If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes. If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually. Otherwise, there is a problem with a modular driver and you can
find it by loading a half of the modules you normally use and binary searching
in accordance with the algorithm:
- if there are n modules loaded and the attempt to suspend and resume fails,
unload n/2 of the modules and try again (that would probably involve rebooting
the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
load n/2 modules more and try again.
Again, if you find the offending module(s), it(they) must be unloaded every time
before hibernation, and please report the problem with it(them).
c) Advanced debugging
In case that hibernation does not work on your system even in the minimal
configuration and compiling more drivers as modules is not practical or some
modules cannot be unloaded, you can use one of the more advanced debugging
techniques to find the problem. First, if there is a serial port in your box,
you can boot the kernel with the 'no_console_suspend' parameter and try to log
kernel messages using the serial console. This may provide you with some
information about the reasons of the suspend (resume) failure. Alternatively,
it may be possible to use a FireWire port for debugging with firescope
(ftp://ftp.firstfloor.org/pub/ak/firescope/). On x86 it is also possible to
use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
2. Testing suspend to RAM (STR)
To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/SDB:Suspend_to_RAM.
Namely, after writing "freezer", "devices", "platform", "processors", or "core"
into /sys/power/pm_test (available if the kernel is compiled with
CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding
to given string. The STR test modes are defined in the same way as for
hibernation, so please refer to Section 1 for more information about them. In
particular, the "core" test allows you to test everything except for the actual
invocation of the platform firmware in order to put the system into the sleep
state.
Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.
Next, you can follow the instructions at http://en.opensuse.org/s2ram to test
the system, but if it does not work "out of the box", you may need to boot it
with "init=/bin/bash" and test s2ram in the minimal configuration. In that
case, you may be able to search for failing drivers by following the procedure
analogous to the one described in section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (ie. before
you run s2ram), and please report the problems with them.
There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output.
# mount -t debugfs none /sys/kernel/debug
# cat /sys/kernel/debug/suspend_stats
success: 20
fail: 5
failed_freeze: 0
failed_prepare: 0
failed_suspend: 5
failed_suspend_noirq: 0
failed_resume: 0
failed_resume_noirq: 0
failures:
last_failed_dev: alarm
adc
last_failed_errno: -16
-16
last_failed_step: suspend
suspend
Field success means the success number of suspend to RAM, and field fail means
the failure number. Others are the failure number of different steps of suspend
to RAM. suspend_stats just lists the last 2 failed devices, error number and
failed step of suspend.

View File

@@ -0,0 +1,163 @@
Charger Manager
(C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL
Charger Manager provides in-kernel battery charger management that
requires temperature monitoring during suspend-to-RAM state
and where each battery may have multiple chargers attached and the userland
wants to look at the aggregated information of the multiple chargers.
Charger Manager is a platform_driver with power-supply-class entries.
An instance of Charger Manager (a platform-device created with Charger-Manager)
represents an independent battery with chargers. If there are multiple
batteries with their own chargers acting independently in a system,
the system may need multiple instances of Charger Manager.
1. Introduction
===============
Charger Manager supports the following:
* Support for multiple chargers (e.g., a device with USB, AC, and solar panels)
A system may have multiple chargers (or power sources) and some of
they may be activated at the same time. Each charger may have its
own power-supply-class and each power-supply-class can provide
different information about the battery status. This framework
aggregates charger-related information from multiple sources and
shows combined information as a single power-supply-class.
* Support for in suspend-to-RAM polling (with suspend_again callback)
While the battery is being charged and the system is in suspend-to-RAM,
we may need to monitor the battery health by looking at the ambient or
battery temperature. We can accomplish this by waking up the system
periodically. However, such a method wakes up devices unncessary for
monitoring the battery health and tasks, and user processes that are
supposed to be kept suspended. That, in turn, incurs unnecessary power
consumption and slow down charging process. Or even, such peak power
consumption can stop chargers in the middle of charging
(external power input < device power consumption), which not
only affects the charging time, but the lifespan of the battery.
Charger Manager provides a function "cm_suspend_again" that can be
used as suspend_again callback of platform_suspend_ops. If the platform
requires tasks other than cm_suspend_again, it may implement its own
suspend_again callback that calls cm_suspend_again in the middle.
Normally, the platform will need to resume and suspend some devices
that are used by Charger Manager.
2. Global Charger-Manager Data related with suspend_again
========================================================
In order to setup Charger Manager with suspend-again feature
(in-suspend monitoring), the user should provide charger_global_desc
with setup_charger_manager(struct charger_global_desc *).
This charger_global_desc data for in-suspend monitoring is global
as the name suggests. Thus, the user needs to provide only once even
if there are multiple batteries. If there are multiple batteries, the
multiple instances of Charger Manager share the same charger_global_desc
and it will manage in-suspend monitoring for all instances of Charger Manager.
The user needs to provide all the two entries properly in order to activate
in-suspend monitoring:
struct charger_global_desc {
char *rtc_name;
: The name of rtc (e.g., "rtc0") used to wakeup the system from
suspend for Charger Manager. The alarm interrupt (AIE) of the rtc
should be able to wake up the system from suspend. Charger Manager
saves and restores the alarm value and use the previously-defined
alarm if it is going to go off earlier than Charger Manager so that
Charger Manager does not interfere with previously-defined alarms.
bool (*rtc_only_wakeup)(void);
: This callback should let CM know whether
the wakeup-from-suspend is caused only by the alarm of "rtc" in the
same struct. If there is any other wakeup source triggered the
wakeup, it should return false. If the "rtc" is the only wakeup
reason, it should return true.
};
3. How to setup suspend_again
=============================
Charger Manager provides a function "extern bool cm_suspend_again(void)".
When cm_suspend_again is called, it monitors every battery. The suspend_ops
callback of the system's platform_suspend_ops can call cm_suspend_again
function to know whether Charger Manager wants to suspend again or not.
If there are no other devices or tasks that want to use suspend_again
feature, the platform_suspend_ops may directly refer to cm_suspend_again
for its suspend_again callback.
The cm_suspend_again() returns true (meaning "I want to suspend again")
if the system was woken up by Charger Manager and the polling
(in-suspend monitoring) results in "normal".
4. Charger-Manager Data (struct charger_desc)
=============================================
For each battery charged independently from other batteries (if a series of
batteries are charged by a single charger, they are counted as one independent
battery), an instance of Charger Manager is attached to it.
struct charger_desc {
char *psy_name;
: The power-supply-class name of the battery. Default is
"battery" if psy_name is NULL. Users can access the psy entries
at "/sys/class/power_supply/[psy_name]/".
enum polling_modes polling_mode;
: CM_POLL_DISABLE: do not poll this battery.
CM_POLL_ALWAYS: always poll this battery.
CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if
an external power source is attached.
CM_POLL_CHARGING_ONLY: poll this battery if and only if the
battery is being charged.
unsigned int fullbatt_uV;
: If specified with a non-zero value, Charger Manager assumes
that the battery is full (capacity = 100) if the battery is not being
charged and the battery voltage is equal to or greater than
fullbatt_uV.
unsigned int polling_interval_ms;
: Required polling interval in ms. Charger Manager will poll
this battery every polling_interval_ms or more frequently.
enum data_source battery_present;
CM_FUEL_GAUGE: get battery presence information from fuel gauge.
CM_CHARGER_STAT: get battery presence from chargers.
char **psy_charger_stat;
: An array ending with NULL that has power-supply-class names of
chargers. Each power-supply-class should provide "PRESENT" (if
battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an
external power source is attached or not), and "STATUS" (shows whether
the battery is {"FULL" or not FULL} or {"FULL", "Charging",
"Discharging", "NotCharging"}).
int num_charger_regulators;
struct regulator_bulk_data *charger_regulators;
: Regulators representing the chargers in the form for
regulator framework's bulk functions.
char *psy_fuel_gauge;
: Power-supply-class name of the fuel gauge.
int (*temperature_out_of_range)(int *mC);
bool measure_battery_temp;
: This callback returns 0 if the temperature is safe for charging,
a positive number if it is too hot to charge, and a negative number
if it is too cold to charge. With the variable mC, the callback returns
the temperature in 1/1000 of centigrade.
The source of temperature can be battery or ambient one according to
the value of measure_battery_temp.
};
5. Other Considerations
=======================
At the charger/battery-related events such as battery-pulled-out,
charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped,
and others critical to chargers, the system should be configured to wake up.
At least the following should wake up the system from a suspend:
a) charger-on/off b) external-power-in/out c) battery-in/out (while charging)
It is usually accomplished by configuring the PMIC as a wakeup source.

View File

@@ -0,0 +1,669 @@
Device Power Management
Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu>
Most of the code in Linux is device drivers, so most of the Linux power
management (PM) code is also driver-specific. Most drivers will do very
little; others, especially for platforms with small batteries (like cell
phones), will do a lot.
This writeup gives an overview of how drivers interact with system-wide
power management goals, emphasizing the models and interfaces that are
shared by everything that hooks up to the driver model core. Read it as
background for the domain-specific work you'd do with any specific driver.
Two Models for Device Power Management
======================================
Drivers will use one or both of these models to put devices into low-power
states:
System Sleep model:
Drivers can enter low-power states as part of entering system-wide
low-power states like "suspend" (also known as "suspend-to-RAM"), or
(mostly for systems with disks) "hibernation" (also known as
"suspend-to-disk").
This is something that device, bus, and class drivers collaborate on
by implementing various role-specific suspend and resume methods to
cleanly power down hardware and software subsystems, then reactivate
them without loss of data.
Some drivers can manage hardware wakeup events, which make the system
leave the low-power state. This feature may be enabled or disabled
using the relevant /sys/devices/.../power/wakeup file (for Ethernet
drivers the ioctl interface used by ethtool may also be used for this
purpose); enabling it may cost some power usage, but let the whole
system enter low-power states more often.
Runtime Power Management model:
Devices may also be put into low-power states while the system is
running, independently of other power management activity in principle.
However, devices are not generally independent of each other (for
example, a parent device cannot be suspended unless all of its child
devices have been suspended). Moreover, depending on the bus type the
device is on, it may be necessary to carry out some bus-specific
operations on the device for this purpose. Devices put into low power
states at run time may require special handling during system-wide power
transitions (suspend or hibernation).
For these reasons not only the device driver itself, but also the
appropriate subsystem (bus type, device type or device class) driver and
the PM core are involved in runtime power management. As in the system
sleep power management case, they need to collaborate by implementing
various role-specific suspend and resume methods, so that the hardware
is cleanly powered down and reactivated without data or service loss.
There's not a lot to be said about those low-power states except that they are
very system-specific, and often device-specific. Also, that if enough devices
have been put into low-power states (at runtime), the effect may be very similar
to entering some system-wide low-power state (system sleep) ... and that
synergies exist, so that several drivers using runtime PM might put the system
into a state where even deeper power saving options are available.
Most suspended devices will have quiesced all I/O: no more DMA or IRQs (except
for wakeup events), no more data read or written, and requests from upstream
drivers are no longer accepted. A given bus or platform may have different
requirements though.
Examples of hardware wakeup events include an alarm from a real time clock,
network wake-on-LAN packets, keyboard or mouse activity, and media insertion
or removal (for PCMCIA, MMC/SD, USB, and so on).
Interfaces for Entering System Sleep States
===========================================
There are programming interfaces provided for subsystems (bus type, device type,
device class) and device drivers to allow them to participate in the power
management of devices they are concerned with. These interfaces cover both
system sleep and runtime power management.
Device Power Management Operations
----------------------------------
Device power management operations, at the subsystem level as well as at the
device driver level, are implemented by defining and populating objects of type
struct dev_pm_ops:
struct dev_pm_ops {
int (*prepare)(struct device *dev);
void (*complete)(struct device *dev);
int (*suspend)(struct device *dev);
int (*resume)(struct device *dev);
int (*freeze)(struct device *dev);
int (*thaw)(struct device *dev);
int (*poweroff)(struct device *dev);
int (*restore)(struct device *dev);
int (*suspend_late)(struct device *dev);
int (*resume_early)(struct device *dev);
int (*freeze_late)(struct device *dev);
int (*thaw_early)(struct device *dev);
int (*poweroff_late)(struct device *dev);
int (*restore_early)(struct device *dev);
int (*suspend_noirq)(struct device *dev);
int (*resume_noirq)(struct device *dev);
int (*freeze_noirq)(struct device *dev);
int (*thaw_noirq)(struct device *dev);
int (*poweroff_noirq)(struct device *dev);
int (*restore_noirq)(struct device *dev);
int (*runtime_suspend)(struct device *dev);
int (*runtime_resume)(struct device *dev);
int (*runtime_idle)(struct device *dev);
};
This structure is defined in include/linux/pm.h and the methods included in it
are also described in that file. Their roles will be explained in what follows.
For now, it should be sufficient to remember that the last three methods are
specific to runtime power management while the remaining ones are used during
system-wide power transitions.
There also is a deprecated "old" or "legacy" interface for power management
operations available at least for some subsystems. This approach does not use
struct dev_pm_ops objects and it is suitable only for implementing system sleep
power management methods. Therefore it is not described in this document, so
please refer directly to the source code for more information about it.
Subsystem-Level Methods
-----------------------
The core methods to suspend and resume devices reside in struct dev_pm_ops
pointed to by the ops member of struct dev_pm_domain, or by the pm member of
struct bus_type, struct device_type and struct class. They are mostly of
interest to the people writing infrastructure for platforms and buses, like PCI
or USB, or device type and device class drivers. They also are relevant to the
writers of device drivers whose subsystems (PM domains, device types, device
classes and bus types) don't provide all power management methods.
Bus drivers implement these methods as appropriate for the hardware and the
drivers using it; PCI works differently from USB, and so on. Not many people
write subsystem-level drivers; most driver code is a "device driver" that builds
on top of bus-specific framework code.
For more information on these driver calls, see the description later;
they are called in phases for every device, respecting the parent-child
sequencing in the driver model tree.
/sys/devices/.../power/wakeup files
-----------------------------------
All device objects in the driver model contain fields that control the handling
of system wakeup events (hardware signals that can force the system out of a
sleep state). These fields are initialized by bus or device driver code using
device_set_wakeup_capable() and device_set_wakeup_enable(), defined in
include/linux/pm_wakeup.h.
The "power.can_wakeup" flag just records whether the device (and its driver) can
physically support wakeup events. The device_set_wakeup_capable() routine
affects this flag. The "power.wakeup" field is a pointer to an object of type
struct wakeup_source used for controlling whether or not the device should use
its system wakeup mechanism and for notifying the PM core of system wakeup
events signaled by the device. This object is only present for wakeup-capable
devices (i.e. devices whose "can_wakeup" flags are set) and is created (or
removed) by device_set_wakeup_capable().
Whether or not a device is capable of issuing wakeup events is a hardware
matter, and the kernel is responsible for keeping track of it. By contrast,
whether or not a wakeup-capable device should issue wakeup events is a policy
decision, and it is managed by user space through a sysfs attribute: the
"power/wakeup" file. User space can write the strings "enabled" or "disabled"
to it to indicate whether or not, respectively, the device is supposed to signal
system wakeup. This file is only present if the "power.wakeup" object exists
for the given device and is created (or removed) along with that object, by
device_set_wakeup_capable(). Reads from the file will return the corresponding
string.
The "power/wakeup" file is supposed to contain the "disabled" string initially
for the majority of devices; the major exceptions are power buttons, keyboards,
and Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with
ethtool. It should also default to "enabled" for devices that don't generate
wakeup requests on their own but merely forward wakeup requests from one bus to
another (like PCI Express ports).
The device_may_wakeup() routine returns true only if the "power.wakeup" object
exists and the corresponding "power/wakeup" file contains the string "enabled".
This information is used by subsystems, like the PCI bus type code, to see
whether or not to enable the devices' wakeup mechanisms. If device wakeup
mechanisms are enabled or disabled directly by drivers, they also should use
device_may_wakeup() to decide what to do during a system sleep transition.
Device drivers, however, are not supposed to call device_set_wakeup_enable()
directly in any case.
It ought to be noted that system wakeup is conceptually different from "remote
wakeup" used by runtime power management, although it may be supported by the
same physical mechanism. Remote wakeup is a feature allowing devices in
low-power states to trigger specific interrupts to signal conditions in which
they should be put into the full-power state. Those interrupts may or may not
be used to signal system wakeup events, depending on the hardware design. On
some systems it is impossible to trigger them from system sleep states. In any
case, remote wakeup should always be enabled for runtime power management for
all devices and drivers that support it.
/sys/devices/.../power/control files
------------------------------------
Each device in the driver model has a flag to control whether it is subject to
runtime power management. This flag, called runtime_auto, is initialized by the
bus type (or generally subsystem) code using pm_runtime_allow() or
pm_runtime_forbid(); the default is to allow runtime power management.
The setting can be adjusted by user space by writing either "on" or "auto" to
the device's power/control sysfs file. Writing "auto" calls pm_runtime_allow(),
setting the flag and allowing the device to be runtime power-managed by its
driver. Writing "on" calls pm_runtime_forbid(), clearing the flag, returning
the device to full power if it was in a low-power state, and preventing the
device from being runtime power-managed. User space can check the current value
of the runtime_auto flag by reading the file.
The device's runtime_auto flag has no effect on the handling of system-wide
power transitions. In particular, the device can (and in the majority of cases
should and will) be put into a low-power state during a system-wide transition
to a sleep state even though its runtime_auto flag is clear.
For more information about the runtime power management framework, refer to
Documentation/power/runtime_pm.txt.
Calling Drivers to Enter and Leave System Sleep States
======================================================
When the system goes into a sleep state, each device's driver is asked to
suspend the device by putting it into a state compatible with the target
system state. That's usually some version of "off", but the details are
system-specific. Also, wakeup-enabled devices will usually stay partly
functional in order to wake the system.
When the system leaves that low-power state, the device's driver is asked to
resume it by returning it to full power. The suspend and resume operations
always go together, and both are multi-phase operations.
For simple drivers, suspend might quiesce the device using class code
and then turn its hardware as "off" as possible during suspend_noirq. The
matching resume calls would then completely reinitialize the hardware
before reactivating its class I/O queues.
More power-aware drivers might prepare the devices for triggering system wakeup
events.
Call Sequence Guarantees
------------------------
To ensure that bridges and similar links needing to talk to a device are
available when the device is suspended or resumed, the device tree is
walked in a bottom-up order to suspend devices. A top-down order is
used to resume those devices.
The ordering of the device tree is defined by the order in which devices
get registered: a child can never be registered, probed or resumed before
its parent; and can't be removed or suspended after that parent.
The policy is that the device tree should match hardware bus topology.
(Or at least the control bus, for devices which use multiple busses.)
In particular, this means that a device registration may fail if the parent of
the device is suspending (i.e. has been chosen by the PM core as the next
device to suspend) or has already suspended, as well as after all of the other
devices have been suspended. Device drivers must be prepared to cope with such
situations.
System Power Management Phases
------------------------------
Suspending or resuming the system is done in several phases. Different phases
are used for standby or memory sleep states ("suspend-to-RAM") and the
hibernation state ("suspend-to-disk"). Each phase involves executing callbacks
for every device before the next phase begins. Not all busses or classes
support all these callbacks and not all drivers use all the callbacks. The
various phases always run after tasks have been frozen and before they are
unfrozen. Furthermore, the *_noirq phases run at a time when IRQ handlers have
been disabled (except for those marked with the IRQF_NO_SUSPEND flag).
All phases use PM domain, bus, type, class or driver callbacks (that is, methods
defined in dev->pm_domain->ops, dev->bus->pm, dev->type->pm, dev->class->pm or
dev->driver->pm). These callbacks are regarded by the PM core as mutually
exclusive. Moreover, PM domain callbacks always take precedence over all of the
other callbacks and, for example, type callbacks take precedence over bus, class
and driver callbacks. To be precise, the following rules are used to determine
which callback to execute in the given phase:
1. If dev->pm_domain is present, the PM core will choose the callback
included in dev->pm_domain->ops for execution
2. Otherwise, if both dev->type and dev->type->pm are present, the callback
included in dev->type->pm will be chosen for execution.
3. Otherwise, if both dev->class and dev->class->pm are present, the
callback included in dev->class->pm will be chosen for execution.
4. Otherwise, if both dev->bus and dev->bus->pm are present, the callback
included in dev->bus->pm will be chosen for execution.
This allows PM domains and device types to override callbacks provided by bus
types or device classes if necessary.
The PM domain, type, class and bus callbacks may in turn invoke device- or
driver-specific methods stored in dev->driver->pm, but they don't have to do
that.
If the subsystem callback chosen for execution is not present, the PM core will
execute the corresponding method from dev->driver->pm instead if there is one.
Entering System Suspend
-----------------------
When the system goes into the standby or memory sleep state, the phases are:
prepare, suspend, suspend_late, suspend_noirq.
1. The prepare phase is meant to prevent races by preventing new devices
from being registered; the PM core would never know that all the
children of a device had been suspended if new children could be
registered at will. (By contrast, devices may be unregistered at any
time.) Unlike the other suspend-related phases, during the prepare
phase the device tree is traversed top-down.
After the prepare callback method returns, no new children may be
registered below the device. The method may also prepare the device or
driver in some way for the upcoming system power transition, but it
should not put the device into a low-power state.
2. The suspend methods should quiesce the device to stop it from performing
I/O. They also may save the device registers and put it into the
appropriate low-power state, depending on the bus type the device is on,
and they may enable wakeup events.
3 For a number of devices it is convenient to split suspend into the
"quiesce device" and "save device state" phases, in which cases
suspend_late is meant to do the latter. It is always executed after
runtime power management has been disabled for all devices.
4. The suspend_noirq phase occurs after IRQ handlers have been disabled,
which means that the driver's interrupt handler will not be called while
the callback method is running. The methods should save the values of
the device's registers that weren't saved previously and finally put the
device into the appropriate low-power state.
The majority of subsystems and device drivers need not implement this
callback. However, bus types allowing devices to share interrupt
vectors, like PCI, generally need it; otherwise a driver might encounter
an error during the suspend phase by fielding a shared interrupt
generated by some other device after its own device had been set to low
power.
At the end of these phases, drivers should have stopped all I/O transactions
(DMA, IRQs), saved enough state that they can re-initialize or restore previous
state (as needed by the hardware), and placed the device into a low-power state.
On many platforms they will gate off one or more clock sources; sometimes they
will also switch off power supplies or reduce voltages. (Drivers supporting
runtime PM may already have performed some or all of these steps.)
If device_may_wakeup(dev) returns true, the device should be prepared for
generating hardware wakeup signals to trigger a system wakeup event when the
system is in the sleep state. For example, enable_irq_wake() might identify
GPIO signals hooked up to a switch or other external hardware, and
pci_enable_wake() does something similar for the PCI PME signal.
If any of these callbacks returns an error, the system won't enter the desired
low-power state. Instead the PM core will unwind its actions by resuming all
the devices that were suspended.
Leaving System Suspend
----------------------
When resuming from standby or memory sleep, the phases are:
resume_noirq, resume_early, resume, complete.
1. The resume_noirq callback methods should perform any actions needed
before the driver's interrupt handlers are invoked. This generally
means undoing the actions of the suspend_noirq phase. If the bus type
permits devices to share interrupt vectors, like PCI, the method should
bring the device and its driver into a state in which the driver can
recognize if the device is the source of incoming interrupts, if any,
and handle them correctly.
For example, the PCI bus type's ->pm.resume_noirq() puts the device into
the full-power state (D0 in the PCI terminology) and restores the
standard configuration registers of the device. Then it calls the
device driver's ->pm.resume_noirq() method to perform device-specific
actions.
2. The resume_early methods should prepare devices for the execution of
the resume methods. This generally involves undoing the actions of the
preceding suspend_late phase.
3 The resume methods should bring the the device back to its operating
state, so that it can perform normal I/O. This generally involves
undoing the actions of the suspend phase.
4. The complete phase should undo the actions of the prepare phase. Note,
however, that new children may be registered below the device as soon as
the resume callbacks occur; it's not necessary to wait until the
complete phase.
At the end of these phases, drivers should be as functional as they were before
suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
gated on. Even if the device was in a low-power state before the system sleep
because of runtime power management, afterwards it should be back in its
full-power state. There are multiple reasons why it's best to do this; they are
discussed in more detail in Documentation/power/runtime_pm.txt.
However, the details here may again be platform-specific. For example,
some systems support multiple "run" states, and the mode in effect at
the end of resume might not be the one which preceded suspension.
That means availability of certain clocks or power supplies changed,
which could easily affect how a driver works.
Drivers need to be able to handle hardware which has been reset since the
suspend methods were called, for example by complete reinitialization.
This may be the hardest part, and the one most protected by NDA'd documents
and chip errata. It's simplest if the hardware state hasn't changed since
the suspend was carried out, but that can't be guaranteed (in fact, it usually
is not the case).
Drivers must also be prepared to notice that the device has been removed
while the system was powered down, whenever that's physically possible.
PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
where common Linux platforms will see such removal. Details of how drivers
will notice and handle such removals are currently bus-specific, and often
involve a separate thread.
These callbacks may return an error value, but the PM core will ignore such
errors since there's nothing it can do about them other than printing them in
the system log.
Entering Hibernation
--------------------
Hibernating the system is more complicated than putting it into the standby or
memory sleep state, because it involves creating and saving a system image.
Therefore there are more phases for hibernation, with a different set of
callbacks. These phases always run after tasks have been frozen and memory has
been freed.
The general procedure for hibernation is to quiesce all devices (freeze), create
an image of the system memory while everything is stable, reactivate all
devices (thaw), write the image to permanent storage, and finally shut down the
system (poweroff). The phases used to accomplish this are:
prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early,
thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq
1. The prepare phase is discussed in the "Entering System Suspend" section
above.
2. The freeze methods should quiesce the device so that it doesn't generate
IRQs or DMA, and they may need to save the values of device registers.
However the device does not have to be put in a low-power state, and to
save time it's best not to do so. Also, the device should not be
prepared to generate wakeup events.
3. The freeze_late phase is analogous to the suspend_late phase described
above, except that the device should not be put in a low-power state and
should not be allowed to generate wakeup events by it.
4. The freeze_noirq phase is analogous to the suspend_noirq phase discussed
above, except again that the device should not be put in a low-power
state and should not be allowed to generate wakeup events.
At this point the system image is created. All devices should be inactive and
the contents of memory should remain undisturbed while this happens, so that the
image forms an atomic snapshot of the system state.
5. The thaw_noirq phase is analogous to the resume_noirq phase discussed
above. The main difference is that its methods can assume the device is
in the same state as at the end of the freeze_noirq phase.
6. The thaw_early phase is analogous to the resume_early phase described
above. Its methods should undo the actions of the preceding
freeze_late, if necessary.
7. The thaw phase is analogous to the resume phase discussed above. Its
methods should bring the device back to an operating state, so that it
can be used for saving the image if necessary.
8. The complete phase is discussed in the "Leaving System Suspend" section
above.
At this point the system image is saved, and the devices then need to be
prepared for the upcoming system shutdown. This is much like suspending them
before putting the system into the standby or memory sleep state, and the phases
are similar.
9. The prepare phase is discussed above.
10. The poweroff phase is analogous to the suspend phase.
11. The poweroff_late phase is analogous to the suspend_late phase.
12. The poweroff_noirq phase is analogous to the suspend_noirq phase.
The poweroff, poweroff_late and poweroff_noirq callbacks should do essentially
the same things as the suspend, suspend_late and suspend_noirq callbacks,
respectively. The only notable difference is that they need not store the
device register values, because the registers should already have been stored
during the freeze, freeze_late or freeze_noirq phases.
Leaving Hibernation
-------------------
Resuming from hibernation is, again, more complicated than resuming from a sleep
state in which the contents of main memory are preserved, because it requires
a system image to be loaded into memory and the pre-hibernation memory contents
to be restored before control can be passed back to the image kernel.
Although in principle, the image might be loaded into memory and the
pre-hibernation memory contents restored by the boot loader, in practice this
can't be done because boot loaders aren't smart enough and there is no
established protocol for passing the necessary information. So instead, the
boot loader loads a fresh instance of the kernel, called the boot kernel, into
memory and passes control to it in the usual way. Then the boot kernel reads
the system image, restores the pre-hibernation memory contents, and passes
control to the image kernel. Thus two different kernels are involved in
resuming from hibernation. In fact, the boot kernel may be completely different
from the image kernel: a different configuration and even a different version.
This has important consequences for device drivers and their subsystems.
To be able to load the system image into memory, the boot kernel needs to
include at least a subset of device drivers allowing it to access the storage
medium containing the image, although it doesn't need to include all of the
drivers present in the image kernel. After the image has been loaded, the
devices managed by the boot kernel need to be prepared for passing control back
to the image kernel. This is very similar to the initial steps involved in
creating a system image, and it is accomplished in the same way, using prepare,
freeze, and freeze_noirq phases. However the devices affected by these phases
are only those having drivers in the boot kernel; other devices will still be in
whatever state the boot loader left them.
Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases, and then continue running normally. This
happens only rarely. Most often the pre-hibernation memory contents are
restored successfully and control is passed to the image kernel, which then
becomes responsible for bringing the system back to the working state.
To achieve this, the image kernel must restore the devices' pre-hibernation
functionality. The operation is much like waking up from the memory sleep
state, although it involves different phases:
restore_noirq, restore_early, restore, complete
1. The restore_noirq phase is analogous to the resume_noirq phase.
2. The restore_early phase is analogous to the resume_early phase.
3. The restore phase is analogous to the resume phase.
4. The complete phase is discussed above.
The main difference from resume[_early|_noirq] is that restore[_early|_noirq]
must assume the device has been accessed and reconfigured by the boot loader or
the boot kernel. Consequently the state of the device may be different from the
state remembered from the freeze, freeze_late and freeze_noirq phases. The
device may even need to be reset and completely re-initialized. In many cases
this difference doesn't matter, so the resume[_early|_noirq] and
restore[_early|_norq] method pointers can be set to the same routines.
Nevertheless, different callback pointers are used in case there is a situation
where it actually does matter.
Device Power Management Domains
-------------------------------
Sometimes devices share reference clocks or other power resources. In those
cases it generally is not possible to put devices into low-power states
individually. Instead, a set of devices sharing a power resource can be put
into a low-power state together at the same time by turning off the shared
power resource. Of course, they also need to be put into the full-power state
together, by turning the shared power resource on. A set of devices with this
property is often referred to as a power domain.
Support for power domains is provided through the pm_domain field of struct
device. This field is a pointer to an object of type struct dev_pm_domain,
defined in include/linux/pm.h, providing a set of power management callbacks
analogous to the subsystem-level and device driver callbacks that are executed
for the given device during all power transitions, instead of the respective
subsystem-level callbacks. Specifically, if a device's pm_domain pointer is
not NULL, the ->suspend() callback from the object pointed to by it will be
executed instead of its subsystem's (e.g. bus type's) ->suspend() callback and
anlogously for all of the remaining callbacks. In other words, power management
domain callbacks, if defined for the given device, always take precedence over
the callbacks provided by the device's subsystem (e.g. bus type).
The support for device power management domains is only relevant to platforms
needing to use the same device driver power management callbacks in many
different power domain configurations and wanting to avoid incorporating the
support for power domains into subsystem-level callbacks, for example by
modifying the platform bus type. Other platforms need not implement it or take
it into account in any way.
Device Low Power (suspend) States
---------------------------------
Device low-power states aren't standard. One device might only handle
"on" and "off, while another might support a dozen different versions of
"on" (how many engines are active?), plus a state that gets back to "on"
faster than from a full "off".
Some busses define rules about what different suspend states mean. PCI
gives one example: after the suspend sequence completes, a non-legacy
PCI device may not perform DMA or issue IRQs, and any wakeup events it
issues would be issued through the PME# bus signal. Plus, there are
several PCI-standard device states, some of which are optional.
In contrast, integrated system-on-chip processors often use IRQs as the
wakeup event sources (so drivers would call enable_irq_wake) and might
be able to treat DMA completion as a wakeup event (sometimes DMA can stay
active too, it'd only be the CPU and some peripherals that sleep).
Some details here may be platform-specific. Systems may have devices that
can be fully active in certain sleep states, such as an LCD display that's
refreshed using DMA while most of the system is sleeping lightly ... and
its frame buffer might even be updated by a DSP or other non-Linux CPU while
the Linux control processor stays idle.
Moreover, the specific actions taken may depend on the target system state.
One target system state might allow a given device to be very operational;
another might require a hard shut down with re-initialization on resume.
And two different target systems might use the same device in different
ways; the aforementioned LCD might be active in one product's "standby",
but a different product using the same SOC might work differently.
Power Management Notifiers
--------------------------
There are some operations that cannot be carried out by the power management
callbacks discussed above, because the callbacks occur too late or too early.
To handle these cases, subsystems and device drivers may register power
management notifiers that are called before tasks are frozen and after they have
been thawed. Generally speaking, the PM notifiers are suitable for performing
actions that either require user space to be available, or at least won't
interfere with user space.
For details refer to Documentation/power/notifiers.txt.
Runtime Power Management
========================
Many devices are able to dynamically power down while the system is still
running. This feature is useful for devices that are not being used, and
can offer significant power savings on a running system. These devices
often support a range of runtime power states, which might use names such
as "off", "sleep", "idle", "active", and so on. Those states will in some
cases (like PCI) be partially constrained by the bus the device uses, and will
usually include hardware states that are also used in system sleep states.
A system-wide power transition can be started while some devices are in low
power states due to runtime power management. The system sleep PM callbacks
should recognize such situations and react to them appropriately, but the
necessary actions are subsystem-specific.
In some cases the decision may be made at the subsystem level while in other
cases the device driver may be left to decide. In some cases it may be
desirable to leave a suspended device in that state during a system-wide power
transition, but in other cases the device must be put back into the full-power
state temporarily, for example so that its system wakeup capability can be
disabled. This all depends on the hardware and the design of the subsystem and
device driver in question.
During system-wide resume from a sleep state it's easiest to put devices into
the full-power state, as explained in Documentation/power/runtime_pm.txt. Refer
to that document for more information regarding this particular issue as well as
for information on the device runtime power management framework in general.

View File

@@ -0,0 +1,46 @@
Testing suspend and resume support in device drivers
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
1. Preparing the test system
Unfortunately, to effectively test the support for the system-wide suspend and
resume transitions in a driver, it is necessary to suspend and resume a fully
functional system with this driver loaded. Moreover, that should be done
several times, preferably several times in a row, and separately for hibernation
(aka suspend to disk or STD) and suspend to RAM (STR), because each of these
cases involves slightly different operations and different interactions with
the machine's BIOS.
Of course, for this purpose the test system has to be known to suspend and
resume without the driver being tested. Thus, if possible, you should first
resolve all suspend/resume-related problems in the test system before you start
testing the new driver. Please see Documentation/power/basic-pm-debugging.txt
for more information about the debugging of suspend/resume functionality.
2. Testing the driver
Once you have resolved the suspend/resume-related problems with your test system
without the new driver, you are ready to test it:
a) Build the driver as a module, load it and try the test modes of hibernation
(see: Documentation/power/basic-pm-debugging.txt, 1).
b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and
"platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1).
c) Compile the driver directly into the kernel and try the test modes of
hibernation.
d) Attempt to hibernate with the driver compiled directly into the kernel
in the "reboot", "shutdown" and "platform" modes.
e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt,
2). [As far as the STR tests are concerned, it should not matter whether or
not the driver is built as a module.]
f) Attempt to suspend to RAM using the s2ram tool with the driver loaded
(see: Documentation/power/basic-pm-debugging.txt, 2).
Each of the above tests should be repeated several times and the STD tests
should be mixed with the STR tests. If any of them fails, the driver cannot be
regarded as suspend/resume-safe.

View File

@@ -0,0 +1,225 @@
Freezing of tasks
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
I. What is the freezing of tasks?
The freezing of tasks is a mechanism by which user space processes and some
kernel threads are controlled during hibernation or system-wide suspend (on some
architectures).
II. How does it work?
There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have
PF_NOFREEZE unset (all user space processes and some kernel threads) are
regarded as 'freezable' and treated in a special way before the system enters a
suspend state as well as before a hibernation image is created (in what follows
we only consider hibernation, but the description also applies to suspend).
Namely, as the first step of the hibernation procedure the function
freeze_processes() (defined in kernel/power/process.c) is called. A system-wide
variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate
whether the system is to undergo a freezing operation. And freeze_processes()
sets this variable. After this, it executes try_to_freeze_tasks() that sends a
fake signal to all user space processes, and wakes up all the kernel threads.
All freezable tasks must react to that by calling try_to_freeze(), which
results in a call to __refrigerator() (defined in kernel/freezer.c), which sets
the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes
it loop until PF_FROZEN is cleared for it. Then, we say that the task is
'frozen' and therefore the set of functions handling this mechanism is referred
to as 'the freezer' (these functions are defined in kernel/power/process.c,
kernel/freezer.c & include/linux/freezer.h). User space processes are generally
frozen before kernel threads.
__refrigerator() must not be called directly. Instead, use the
try_to_freeze() function (defined in include/linux/freezer.h), that checks
if the task is to be frozen and makes the task enter __refrigerator().
For user space processes try_to_freeze() is called automatically from the
signal-handling code, but the freezable kernel threads need to call it
explicitly in suitable places or use the wait_event_freezable() or
wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
that combine interruptible sleep with checking if the task is to be frozen and
calling try_to_freeze(). The main loop of a freezable kernel thread may look
like the following one:
set_freezable();
do {
hub_events();
wait_event_freezable(khubd_wait,
!list_empty(&hub_event_list) ||
kthread_should_stop());
} while (!kthread_should_stop() || !list_empty(&hub_event_list));
(from drivers/usb/core/hub.c::hub_thread()).
If a freezable kernel thread fails to call try_to_freeze() after the freezer has
initiated a freezing operation, the freezing of tasks will fail and the entire
hibernation operation will be cancelled. For this reason, freezable kernel
threads must call try_to_freeze() somewhere or use one of the
wait_event_freezable() and wait_event_freezable_timeout() macros.
After the system memory state has been restored from a hibernation image and
devices have been reinitialized, the function thaw_processes() is called in
order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that
have been frozen leave __refrigerator() and continue running.
Rationale behind the functions dealing with freezing and thawing of tasks:
-------------------------------------------------------------------------
freeze_processes():
- freezes only userspace tasks
freeze_kernel_threads():
- freezes all tasks (including kernel threads) because we can't freeze
kernel threads without freezing userspace tasks
thaw_kernel_threads():
- thaws only kernel threads; this is particularly useful if we need to do
anything special in between thawing of kernel threads and thawing of
userspace tasks, or if we want to postpone the thawing of userspace tasks
thaw_processes():
- thaws all tasks (including kernel threads) because we can't thaw userspace
tasks without thawing kernel threads
III. Which kernel threads are freezable?
Kernel threads are not freezable by default. However, a kernel thread may clear
PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
directly is not allowed). From this point it is regarded as freezable
and must call try_to_freeze() in a suitable place.
IV. Why do we do that?
Generally speaking, there is a couple of reasons to use the freezing of tasks:
1. The principal reason is to prevent filesystems from being damaged after
hibernation. At the moment we have no simple means of checkpointing
filesystems, so if there are any modifications made to filesystem data and/or
metadata on disks, we cannot bring them back to the state from before the
modifications. At the same time each hibernation image contains some
filesystem-related information that must be consistent with the state of the
on-disk data and metadata after the system memory state has been restored from
the image (otherwise the filesystems will be damaged in a nasty way, usually
making them almost impossible to repair). We therefore freeze tasks that might
cause the on-disk filesystems' data and metadata to be modified after the
hibernation image has been created and before the system is finally powered off.
The majority of these are user space processes, but if any of the kernel threads
may cause something like this to happen, they have to be freezable.
2. Next, to create the hibernation image we need to free a sufficient amount of
memory (approximately 50% of available RAM) and we need to do that before
devices are deactivated, because we generally need them for swapping out. Then,
after the memory for the image has been freed, we don't want tasks to allocate
additional memory and we prevent them from doing that by freezing them earlier.
[Of course, this also means that device drivers should not allocate substantial
amounts of memory from their .suspend() callbacks before hibernation, but this
is a separate issue.]
3. The third reason is to prevent user space processes and some kernel threads
from interfering with the suspending and resuming of devices. A user space
process running on a second CPU while we are suspending devices may, for
example, be troublesome and without the freezing of tasks we would need some
safeguards against race conditions that might occur in such a case.
Although Linus Torvalds doesn't like the freezing of tasks, he said this in one
of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608):
"RJW:> Why we freeze tasks at all or why we freeze kernel threads?
Linus: In many ways, 'at all'.
I _do_ realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.
So in practice, the 'at all' may become a 'why freeze kernel threads?' and
freezing user threads I don't find really objectionable."
Still, there are kernel threads that may want to be freezable. For example, if
a kernel thread that belongs to a device driver accesses the device directly, it
in principle needs to know when the device is suspended, so that it doesn't try
to access it at that time. However, if the kernel thread is freezable, it will
be frozen before the driver's .suspend() callback is executed and it will be
thawed after the driver's .resume() callback has run, so it won't be accessing
the device while it's suspended.
4. Another reason for freezing tasks is to prevent user space processes from
realizing that hibernation (or suspend) operation takes place. Ideally, user
space processes should not notice that such a system-wide operation has occurred
and should continue running without any problems after the restore (or resume
from suspend). Unfortunately, in the most general case this is quite difficult
to achieve without the freezing of tasks. Consider, for example, a process
that depends on all CPUs being online while it's running. Since we need to
disable nonboot CPUs during the hibernation, if this process is not frozen, it
may notice that the number of CPUs has changed and may start to work incorrectly
because of that.
V. Are there any problems related to the freezing of tasks?
Yes, there are.
First of all, the freezing of kernel threads may be tricky if they depend one
on another. For example, if kernel thread A waits for a completion (in the
TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B
and B is frozen in the meantime, then A will be blocked until B is thawed, which
may be undesirable. That's why kernel threads are not freezable by default.
Second, there are the following two problems related to the freezing of user
space processes:
1. Putting processes into an uninterruptible sleep distorts the load average.
2. Now that we have FUSE, plus the framework for doing device drivers in
userspace, it gets even more complicated because some userspace processes are
now doing the sorts of things that kernel threads do
(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html).
The problem 1. seems to be fixable, although it hasn't been fixed so far. The
other one is more serious, but it seems that we can work around it by using
hibernation (and suspend) notifiers (in that case, though, we won't be able to
avoid the realization by the user space processes that the hibernation is taking
place).
There are also problems that the freezing of tasks tends to expose, although
they are not directly related to it. For example, if request_firmware() is
called from a device driver's .resume() routine, it will timeout and eventually
fail, because the user land process that should respond to the request is frozen
at this point. So, seemingly, the failure is due to the freezing of tasks.
Suppose, however, that the firmware file is located on a filesystem accessible
only through another device that hasn't been resumed yet. In that case,
request_firmware() will fail regardless of whether or not the freezing of tasks
is used. Consequently, the problem is not really related to the freezing of
tasks, since it generally exists anyway.
A driver must have all firmwares it may need in RAM before suspend() is called.
If keeping them is not practical, for example due to their size, they must be
requested early enough using the suspend notifier API described in notifiers.txt.
VI. Are there any precautions to be taken to prevent freezing failures?
Yes, there are.
First of all, grabbing the 'pm_mutex' lock to mutually exclude a piece of code
from system-wide sleep such as suspend/hibernation is not encouraged.
If possible, that piece of code must instead hook onto the suspend/hibernation
notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
(kernel/cpu.c) for an example.
However, if that is not feasible, and grabbing 'pm_mutex' is deemed necessary,
it is strongly discouraged to directly call mutex_[un]lock(&pm_mutex) since
that could lead to freezing failures, because if the suspend/hibernate code
successfully acquired the 'pm_mutex' lock, and hence that other entity failed
to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE
state. As a consequence, the freezer would not be able to freeze that task,
leading to freezing failure.
However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
since they ask the freezer to skip freezing this task, since it is anyway
"frozen enough" as it is blocked on 'pm_mutex', which will be released
only after the entire suspend/hibernation sequence is complete.
So, to summarize, use [un]lock_system_sleep() instead of directly using
mutex_[un]lock(&pm_mutex). That would prevent freezing failures.

View File

@@ -0,0 +1,75 @@
Power Management Interface
The power management subsystem provides a unified sysfs interface to
userspace, regardless of what architecture or platform one is
running. The interface exists in /sys/power/ directory (assuming sysfs
is mounted at /sys).
/sys/power/state controls system power state. Reading from this file
returns what states are supported, which is hard-coded to 'standby'
(Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
(Suspend-to-Disk).
Writing to this file one of those strings causes the system to
transition into that state. Please see the file
Documentation/power/states.txt for a description of each of those
states.
/sys/power/disk controls the operating mode of the suspend-to-disk
mechanism. Suspend-to-disk can be handled in several ways. We have a
few options for putting the system to sleep - using the platform driver
(e.g. ACPI or other suspend_ops), powering off the system or rebooting the
system (for testing).
Additionally, /sys/power/disk can be used to turn on one of the two testing
modes of the suspend-to-disk mechanism: 'testproc' or 'test'. If the
suspend-to-disk mechanism is in the 'testproc' mode, writing 'disk' to
/sys/power/state will cause the kernel to disable nonboot CPUs and freeze
tasks, wait for 5 seconds, unfreeze tasks and enable nonboot CPUs. If it is
in the 'test' mode, writing 'disk' to /sys/power/state will cause the kernel
to disable nonboot CPUs and freeze tasks, shrink memory, suspend devices, wait
for 5 seconds, resume devices, unfreeze tasks and enable nonboot CPUs. Then,
we are able to look in the log messages and work out, for example, which code
is being slow and which device drivers are misbehaving.
Reading from this file will display all supported modes and the currently
selected one in brackets, for example
[shutdown] reboot test testproc
Writing to this file will accept one of
'platform' (only if the platform supports it)
'shutdown'
'reboot'
'testproc'
'test'
/sys/power/image_size controls the size of the image created by
the suspend-to-disk mechanism. It can be written a string
representing a non-negative integer that will be used as an upper
limit of the image size, in bytes. The suspend-to-disk mechanism will
do its best to ensure the image size will not exceed that number. However,
if this turns out to be impossible, it will try to suspend anyway using the
smallest image possible. In particular, if "0" is written to this file, the
suspend image will be as small as possible.
Reading from this file will display the current image size limit, which
is set to 2/5 of available RAM by default.
/sys/power/pm_trace controls the code which saves the last PM event point in
the RTC across reboots, so that you can debug a machine that just hangs
during suspend (or more commonly, during resume). Namely, the RTC is only
used to save the last PM event point if this file contains '1'. Initially it
contains '0' which may be changed to '1' by writing a string representing a
nonzero integer into it.
To use this debugging feature you should attempt to suspend the machine, then
reboot it and run
dmesg -s 1000000 | grep 'hash matches'
CAUTION: Using it will cause your machine's real-time (CMOS) clock to be
set to a random invalid time after a resume.

View File

@@ -0,0 +1,53 @@
Suspend notifiers
(C) 2007-2011 Rafael J. Wysocki <rjw@sisk.pl>, GPL
There are some operations that subsystems or drivers may want to carry out
before hibernation/suspend or after restore/resume, but they require the system
to be fully functional, so the drivers' and subsystems' .suspend() and .resume()
or even .prepare() and .complete() callbacks are not suitable for this purpose.
For example, device drivers may want to upload firmware to their devices after
resume/restore, but they cannot do it by calling request_firmware() from their
.resume() or .complete() routines (user land processes are frozen at these
points). The solution may be to load the firmware into memory before processes
are frozen and upload it from there in the .resume() routine.
A suspend/hibernation notifier may be used for this purpose.
The subsystems or drivers having such needs can register suspend notifiers that
will be called upon the following events by the PM core:
PM_HIBERNATION_PREPARE The system is going to hibernate or suspend, tasks will
be frozen immediately.
PM_POST_HIBERNATION The system memory state has been restored from a
hibernation image or an error occurred during
hibernation. Device drivers' restore callbacks have
been executed and tasks have been thawed.
PM_RESTORE_PREPARE The system is going to restore a hibernation image.
If all goes well, the restored kernel will issue a
PM_POST_HIBERNATION notification.
PM_POST_RESTORE An error occurred during restore from hibernation.
Device drivers' restore callbacks have been executed
and tasks have been thawed.
PM_SUSPEND_PREPARE The system is preparing for suspend.
PM_POST_SUSPEND The system has just resumed or an error occurred during
suspend. Device drivers' resume callbacks have been
executed and tasks have been thawed.
It is generally assumed that whatever the notifiers do for
PM_HIBERNATION_PREPARE, should be undone for PM_POST_HIBERNATION. Analogously,
operations performed for PM_SUSPEND_PREPARE should be reversed for
PM_POST_SUSPEND. Additionally, all of the notifiers are called for
PM_POST_HIBERNATION if one of them fails for PM_HIBERNATION_PREPARE, and
all of the notifiers are called for PM_POST_SUSPEND if one of them fails for
PM_SUSPEND_PREPARE.
The hibernation and suspend notifiers are called with pm_mutex held. They are
defined in the usual way, but their last argument is meaningless (it is always
NULL). To register and/or unregister a suspend notifier use the functions
register_pm_notifier() and unregister_pm_notifier(), respectively, defined in
include/linux/suspend.h . If you don't need to unregister the notifier, you can
also use the pm_notifier() macro defined in include/linux/suspend.h .

View File

@@ -0,0 +1,380 @@
*=============*
* OPP Library *
*=============*
(C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated
Contents
--------
1. Introduction
2. Initial OPP List Registration
3. OPP Search Functions
4. OPP Availability Control Functions
5. OPP Data Retrieval Functions
6. Cpufreq Table Generation
7. Data Structures
1. Introduction
===============
Complex SoCs of today consists of a multiple sub-modules working in conjunction.
In an operational system executing varied use cases, not all modules in the SoC
need to function at their highest performing frequency all the time. To
facilitate this, sub-modules in a SoC are grouped into domains, allowing some
domains to run at lower voltage and frequency while other domains are loaded
more. The set of discrete tuples consisting of frequency and voltage pairs that
the device will support per domain are called Operating Performance Points or
OPPs.
OPP library provides a set of helper functions to organize and query the OPP
information. The library is located in drivers/base/power/opp.c and the header
is located in include/linux/opp.h. OPP library can be enabled by enabling
CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
optionally boot at a certain OPP without needing cpufreq.
Typical usage of the OPP library is as follows:
(users) -> registers a set of default OPPs -> (library)
SoC framework -> modifies on required cases certain OPPs -> OPP layer
-> queries to search/retrieve information ->
Architectures that provide a SoC framework for OPP should select ARCH_HAS_OPP
to make the OPP layer available.
OPP layer expects each domain to be represented by a unique device pointer. SoC
framework registers a set of initial OPPs per device with the OPP layer. This
list is expected to be an optimally small number typically around 5 per device.
This initial list contains a set of OPPs that the framework expects to be safely
enabled by default in the system.
Note on OPP Availability:
------------------------
As the system proceeds to operate, SoC framework may choose to make certain
OPPs available or not available on each device based on various external
factors. Example usage: Thermal management or other exceptional situations where
SoC framework might choose to disable a higher frequency OPP to safely continue
operations until that OPP could be re-enabled if possible.
OPP library facilitates this concept in it's implementation. The following
operational functions operate only on available opps:
opp_find_freq_{ceil, floor}, opp_get_voltage, opp_get_freq, opp_get_opp_count
and opp_init_cpufreq_table
opp_find_freq_exact is meant to be used to find the opp pointer which can then
be used for opp_enable/disable functions to make an opp available as required.
WARNING: Users of OPP library should refresh their availability count using
get_opp_count if opp_enable/disable functions are invoked for a device, the
exact mechanism to trigger these or the notification mechanism to other
dependent subsystems such as cpufreq are left to the discretion of the SoC
specific framework which uses the OPP library. Similar care needs to be taken
care to refresh the cpufreq table in cases of these operations.
WARNING on OPP List locking mechanism:
-------------------------------------------------
OPP library uses RCU for exclusivity. RCU allows the query functions to operate
in multiple contexts and this synchronization mechanism is optimal for a read
intensive operations on data structure as the OPP library caters to.
To ensure that the data retrieved are sane, the users such as SoC framework
should ensure that the section of code operating on OPP queries are locked
using RCU read locks. The opp_find_freq_{exact,ceil,floor},
opp_get_{voltage, freq, opp_count} fall into this category.
opp_{add,enable,disable} are updaters which use mutex and implement it's own
RCU locking mechanisms. opp_init_cpufreq_table acts as an updater and uses
mutex to implment RCU updater strategy. These functions should *NOT* be called
under RCU locks and other contexts that prevent blocking functions in RCU or
mutex operations from working.
2. Initial OPP List Registration
================================
The SoC implementation calls opp_add function iteratively to add OPPs per
device. It is expected that the SoC framework will register the OPP entries
optimally- typical numbers range to be less than 5. The list generated by
registering the OPPs is maintained by OPP library throughout the device
operation. The SoC framework can subsequently control the availability of the
OPPs dynamically using the opp_enable / disable functions.
opp_add - Add a new OPP for a specific domain represented by the device pointer.
The OPP is defined using the frequency and voltage. Once added, the OPP
is assumed to be available and control of it's availability can be done
with the opp_enable/disable functions. OPP library internally stores
and manages this information in the opp struct. This function may be
used by SoC framework to define a optimal list as per the demands of
SoC usage environment.
WARNING: Do not use this function in interrupt context.
Example:
soc_pm_init()
{
/* Do things */
r = opp_add(mpu_dev, 1000000, 900000);
if (!r) {
pr_err("%s: unable to register mpu opp(%d)\n", r);
goto no_cpufreq;
}
/* Do cpufreq things */
no_cpufreq:
/* Do remaining things */
}
3. OPP Search Functions
=======================
High level framework such as cpufreq operates on frequencies. To map the
frequency back to the corresponding OPP, OPP library provides handy functions
to search the OPP list that OPP library internally manages. These search
functions return the matching pointer representing the opp if a match is
found, else returns error. These errors are expected to be handled by standard
error checks such as IS_ERR() and appropriate actions taken by the caller.
opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
availability. This function is especially useful to enable an OPP which
is not available by default.
Example: In a case when SoC framework detects a situation where a
higher frequency could be made available, it can use this function to
find the OPP prior to call the opp_enable to actually make it available.
rcu_read_lock();
opp = opp_find_freq_exact(dev, 1000000000, false);
rcu_read_unlock();
/* dont operate on the pointer.. just do a sanity check.. */
if (IS_ERR(opp)) {
pr_err("frequency not disabled!\n");
/* trigger appropriate actions.. */
} else {
opp_enable(dev,1000000000);
}
NOTE: This is the only search function that operates on OPPs which are
not available.
opp_find_freq_floor - Search for an available OPP which is *at most* the
provided frequency. This function is useful while searching for a lesser
match OR operating on OPP information in the order of decreasing
frequency.
Example: To find the highest opp for a device:
freq = ULONG_MAX;
rcu_read_lock();
opp_find_freq_floor(dev, &freq);
rcu_read_unlock();
opp_find_freq_ceil - Search for an available OPP which is *at least* the
provided frequency. This function is useful while searching for a
higher match OR operating on OPP information in the order of increasing
frequency.
Example 1: To find the lowest opp for a device:
freq = 0;
rcu_read_lock();
opp_find_freq_ceil(dev, &freq);
rcu_read_unlock();
Example 2: A simplified implementation of a SoC cpufreq_driver->target:
soc_cpufreq_target(..)
{
/* Do stuff like policy checks etc. */
/* Find the best frequency match for the req */
rcu_read_lock();
opp = opp_find_freq_ceil(dev, &freq);
rcu_read_unlock();
if (!IS_ERR(opp))
soc_switch_to_freq_voltage(freq);
else
/* do something when we can't satisfy the req */
/* do other stuff */
}
4. OPP Availability Control Functions
=====================================
A default OPP list registered with the OPP library may not cater to all possible
situation. The OPP library provides a set of functions to modify the
availability of a OPP within the OPP list. This allows SoC frameworks to have
fine grained dynamic control of which sets of OPPs are operationally available.
These functions are intended to *temporarily* remove an OPP in conditions such
as thermal considerations (e.g. don't use OPPx until the temperature drops).
WARNING: Do not use these functions in interrupt context.
opp_enable - Make a OPP available for operation.
Example: Lets say that 1GHz OPP is to be made available only if the
SoC temperature is lower than a certain threshold. The SoC framework
implementation might choose to do something as follows:
if (cur_temp < temp_low_thresh) {
/* Enable 1GHz if it was disabled */
rcu_read_lock();
opp = opp_find_freq_exact(dev, 1000000000, false);
rcu_read_unlock();
/* just error check */
if (!IS_ERR(opp))
ret = opp_enable(dev, 1000000000);
else
goto try_something_else;
}
opp_disable - Make an OPP to be not available for operation
Example: Lets say that 1GHz OPP is to be disabled if the temperature
exceeds a threshold value. The SoC framework implementation might
choose to do something as follows:
if (cur_temp > temp_high_thresh) {
/* Disable 1GHz if it was enabled */
rcu_read_lock();
opp = opp_find_freq_exact(dev, 1000000000, true);
rcu_read_unlock();
/* just error check */
if (!IS_ERR(opp))
ret = opp_disable(dev, 1000000000);
else
goto try_something_else;
}
5. OPP Data Retrieval Functions
===============================
Since OPP library abstracts away the OPP information, a set of functions to pull
information from the OPP structure is necessary. Once an OPP pointer is
retrieved using the search functions, the following functions can be used by SoC
framework to retrieve the information represented inside the OPP layer.
opp_get_voltage - Retrieve the voltage represented by the opp pointer.
Example: At a cpufreq transition to a different frequency, SoC
framework requires to set the voltage represented by the OPP using
the regulator framework to the Power Management chip providing the
voltage.
soc_switch_to_freq_voltage(freq)
{
/* do things */
rcu_read_lock();
opp = opp_find_freq_ceil(dev, &freq);
v = opp_get_voltage(opp);
rcu_read_unlock();
if (v)
regulator_set_voltage(.., v);
/* do other things */
}
opp_get_freq - Retrieve the freq represented by the opp pointer.
Example: Lets say the SoC framework uses a couple of helper functions
we could pass opp pointers instead of doing additional parameters to
handle quiet a bit of data parameters.
soc_cpufreq_target(..)
{
/* do things.. */
max_freq = ULONG_MAX;
rcu_read_lock();
max_opp = opp_find_freq_floor(dev,&max_freq);
requested_opp = opp_find_freq_ceil(dev,&freq);
if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
r = soc_test_validity(max_opp, requested_opp);
rcu_read_unlock();
/* do other things */
}
soc_test_validity(..)
{
if(opp_get_voltage(max_opp) < opp_get_voltage(requested_opp))
return -EINVAL;
if(opp_get_freq(max_opp) < opp_get_freq(requested_opp))
return -EINVAL;
/* do things.. */
}
opp_get_opp_count - Retrieve the number of available opps for a device
Example: Lets say a co-processor in the SoC needs to know the available
frequencies in a table, the main processor can notify as following:
soc_notify_coproc_available_frequencies()
{
/* Do things */
rcu_read_lock();
num_available = opp_get_opp_count(dev);
speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
/* populate the table in increasing order */
freq = 0;
while (!IS_ERR(opp = opp_find_freq_ceil(dev, &freq))) {
speeds[i] = freq;
freq++;
i++;
}
rcu_read_unlock();
soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available);
/* Do other things */
}
6. Cpufreq Table Generation
===========================
opp_init_cpufreq_table - cpufreq framework typically is initialized with
cpufreq_frequency_table_cpuinfo which is provided with the list of
frequencies that are available for operation. This function provides
a ready to use conversion routine to translate the OPP layer's internal
information about the available frequencies into a format readily
providable to cpufreq.
WARNING: Do not use this function in interrupt context.
Example:
soc_pm_init()
{
/* Do things */
r = opp_init_cpufreq_table(dev, &freq_table);
if (!r)
cpufreq_frequency_table_cpuinfo(policy, freq_table);
/* Do other things */
}
NOTE: This function is available only if CONFIG_CPU_FREQ is enabled in
addition to CONFIG_PM as power management feature is required to
dynamically scale voltage and frequency in a system.
opp_free_cpufreq_table - Free up the table allocated by opp_init_cpufreq_table
7. Data Structures
==================
Typically an SoC contains multiple voltage domains which are variable. Each
domain is represented by a device pointer. The relationship to OPP can be
represented as follows:
SoC
|- device 1
| |- opp 1 (availability, freq, voltage)
| |- opp 2 ..
... ...
| `- opp n ..
|- device 2
...
`- device m
OPP library maintains a internal list that the SoC framework populates and
accessed by various functions as described above. However, the structures
representing the actual OPPs and domains are internal to the OPP library itself
to allow for suitable abstraction reusable across systems.
struct opp - The internal data structure of OPP library which is used to
represent an OPP. In addition to the freq, voltage, availability
information, it also contains internal book keeping information required
for the OPP library to operate on. Pointer to this structure is
provided back to the users such as SoC framework to be used as a
identifier for OPP in the interactions with OPP layer.
WARNING: The struct opp pointer should not be parsed or modified by the
users. The defaults of for an instance is populated by opp_add, but the
availability of the OPP can be modified by opp_enable/disable functions.
struct device - This is used to identify a domain to the OPP layer. The
nature of the device and it's implementation is left to the user of
OPP library such as the SoC framework.
Overall, in a simplistic view, the data structure operations is represented as
following:
Initialization / modification:
+-----+ /- opp_enable
opp_add --> | opp | <-------
| +-----+ \- opp_disable
\-------> domain_info(device)
Search functions:
/-- opp_find_freq_ceil ---\ +-----+
domain_info<---- opp_find_freq_exact -----> | opp |
\-- opp_find_freq_floor ---/ +-----+
Retrieval functions:
+-----+ /- opp_get_voltage
| opp | <---
+-----+ \- opp_get_freq
domain_info <- opp_get_opp_count

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,148 @@
PM Quality Of Service Interface.
This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.
Two different PM QoS frameworks are available:
1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
2. the per-device PM QoS framework provides the API to manage the per-device latency
constraints.
Each parameters have defined units:
* latency: usec
* timeout: usec
* throughput: kbs (kilo bit / sec)
1. PM QoS framework
The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implement is defined by pm_qos_power_init()
and pm_qos.h. This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.
For each parameter a list of performance requests is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the request list or elements of the list. Typically the
aggregated target value is simply the max or min of the request values held
in the parameter list elements.
Note: the aggregated target value is implemented as an atomic variable so that
reading the aggregated value does not require any locking mechanism.
From kernel mode the use of this interface is simple:
void pm_qos_add_request(handle, param_class, target_value):
Will insert an element into the list for that identified PM QoS class with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of pm_qos need to save the returned handle for future use in other
pm_qos API functions.
void pm_qos_update_request(handle, new_target_value):
Will update the list element pointed to by the handle with the new target value
and recompute the new aggregated target, calling the notification tree if the
target is changed.
void pm_qos_remove_request(handle):
Will remove the element. After removal it will update the aggregate target and
call the notification tree if the target was changed as a result of removing
the request.
int pm_qos_request(param_class):
Returns the aggregated value for a given PM QoS class.
int pm_qos_request_active(handle):
Returns if the request is still active, i.e. it has not been removed from a
PM QoS class constraints list.
int pm_qos_add_notifier(param_class, notifier):
Adds a notification callback function to the PM QoS class. The callback is
called when the aggregated value for the PM QoS class is changed.
int pm_qos_remove_notifier(int param_class, notifier):
Removes the notification callback function for the PM QoS class.
From user mode:
Only processes can register a pm_qos request. To provide for automatic
cleanup of a process, the interface requires the process to register its
parameter requests in the following way:
To register the default pm_qos target for the specific parameter, the process
must open one of /dev/[cpu_dma_latency, network_latency, network_throughput]
As long as the device node is held open that process has a registered
request on the parameter.
To change the requested target value the process needs to write an s32 value to
the open device node. Alternatively the user mode program could write a hex
string for the value using 10 char long format e.g. "0x12345678". This
translates to a pm_qos_update_request call.
To remove the user mode request for a target value simply close the device
node.
2. PM QoS per-device latency framework
For each device a list of performance requests is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the request list or elements of the list. Typically the
aggregated target value is simply the max or min of the request values held
in the parameter list elements.
Note: the aggregated target value is implemented as an atomic variable so that
reading the aggregated value does not require any locking mechanism.
From kernel mode the use of this interface is the following:
int dev_pm_qos_add_request(device, handle, value):
Will insert an element into the list for that identified device with the
target value. Upon change to this list the new target is recomputed and any
registered notifiers are called only if the target value is now different.
Clients of dev_pm_qos need to save the handle for future use in other
dev_pm_qos API functions.
int dev_pm_qos_update_request(handle, new_value):
Will update the list element pointed to by the handle with the new target value
and recompute the new aggregated target, calling the notification trees if the
target is changed.
int dev_pm_qos_remove_request(handle):
Will remove the element. After removal it will update the aggregate target and
call the notification trees if the target was changed as a result of removing
the request.
s32 dev_pm_qos_read_value(device):
Returns the aggregated value for a given device's constraints list.
Notification mechanisms:
The per-device PM QoS framework has 2 different and distinct notification trees:
a per-device notification tree and a global notification tree.
int dev_pm_qos_add_notifier(device, notifier):
Adds a notification callback function for the device.
The callback is called when the aggregated value of the device constraints list
is changed.
int dev_pm_qos_remove_notifier(device, notifier):
Removes the notification callback function for the device.
int dev_pm_qos_add_global_notifier(notifier):
Adds a notification callback function in the global notification tree of the
framework.
The callback is called when the aggregated value for any device is changed.
int dev_pm_qos_remove_global_notifier(notifier):
Removes the notification callback function from the global notification tree
of the framework.
From user mode:
No API for user space access to the per-device latency constraints is provided
yet - still under discussion.

View File

@@ -0,0 +1,180 @@
Linux power supply class
========================
Synopsis
~~~~~~~~
Power supply class used to represent battery, UPS, AC or DC power supply
properties to user-space.
It defines core set of attributes, which should be applicable to (almost)
every power supply out there. Attributes are available via sysfs and uevent
interfaces.
Each attribute has well defined meaning, up to unit of measure used. While
the attributes provided are believed to be universally applicable to any
power supply, specific monitoring hardware may not be able to provide them
all, so any of them may be skipped.
Power supply class is extensible, and allows to define drivers own attributes.
The core attribute set is subject to the standard Linux evolution (i.e.
if it will be found that some attribute is applicable to many power supply
types or their drivers, it can be added to the core set).
It also integrates with LED framework, for the purpose of providing
typically expected feedback of battery charging/fully charged status and
AC/USB power supply online status. (Note that specific details of the
indication (including whether to use it at all) are fully controllable by
user and/or specific machine defaults, per design principles of LED
framework).
Attributes/properties
~~~~~~~~~~~~~~~~~~~~~
Power supply class has predefined set of attributes, this eliminates code
duplication across drivers. Power supply class insist on reusing its
predefined attributes *and* their units.
So, userspace gets predictable set of attributes and their units for any
kind of power supply, and can process/present them to a user in consistent
manner. Results for different power supplies and machines are also directly
comparable.
See drivers/power/ds2760_battery.c and drivers/power/pda_power.c for the
example how to declare and handle attributes.
Units
~~~~~
Quoting include/linux/power_supply.h:
All voltages, currents, charges, energies, time and temperatures in µV,
µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise
stated. It's driver's job to convert its raw values to units in which
this class operates.
Attributes/properties detailed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~
~ ~
~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~
~ of battery, this class distinguish these terms. Don't mix them! ~
~ ~
~ CHARGE_* attributes represents capacity in µAh only. ~
~ ENERGY_* attributes represents capacity in µWh only. ~
~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~
~ ~
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Postfixes:
_AVG - *hardware* averaged value, use it if your hardware is really able to
report averaged values.
_NOW - momentary/instantaneous values.
STATUS - this attribute represents operating status (charging, full,
discharging (i.e. powering a load), etc.). This corresponds to
BATTERY_STATUS_* values, as defined in battery.h.
CHARGE_TYPE - batteries can typically charge at different rates.
This defines trickle and fast charges. For batteries that
are already charged or discharging, 'n/a' can be displayed (or
'unknown', if the status is not known).
HEALTH - represents health of the battery, values corresponds to
POWER_SUPPLY_HEALTH_*, defined in battery.h.
VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and
minimal power supply voltages. Maximal/minimal means values of voltages
when battery considered "full"/"empty" at normal conditions. Yes, there is
no direct relation between voltage and battery capacity, but some dumb
batteries use voltage for very approximated calculation of capacity.
Battery driver also can use this attribute just to inform userspace
about maximal and minimal voltage thresholds of a given battery.
VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that
these ones should be used if hardware could only guess (measure and
retain) the thresholds of a given power supply.
CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when
battery considered full/empty.
ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy.
CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value
of charge when battery became full/empty". It also could mean "value of
charge when battery considered full/empty at given conditions (temperature,
age)". I.e. these attributes represents real thresholds, not design values.
CHARGE_COUNTER - the current charge counter (in µAh). This could easily
be negative; there is no empty or full value. It is only useful for
relative, time-based measurements.
ENERGY_FULL, ENERGY_EMPTY - same as above but for energy.
CAPACITY - capacity in percents.
CAPACITY_LEVEL - capacity level. This corresponds to
POWER_SUPPLY_CAPACITY_LEVEL_*.
TEMP - temperature of the power supply.
TEMP_AMBIENT - ambient temperature.
TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e.
while battery powers a load)
TIME_TO_FULL - seconds left for battery to be considered full (i.e.
while battery is charging)
Battery <-> external power supply interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often power supplies are acting as supplies and supplicants at the same
time. Batteries are good example. So, batteries usually care if they're
externally powered or not.
For that case, power supply class implements notification mechanism for
batteries.
External power supply (AC) lists supplicants (batteries) names in
"supplied_to" struct member, and each power_supply_changed() call
issued by external power supply will notify supplicants via
external_power_changed callback.
QA
~~
Q: Where is POWER_SUPPLY_PROP_XYZ attribute?
A: If you cannot find attribute suitable for your driver needs, feel free
to add it and send patch along with your driver.
The attributes available currently are the ones currently provided by the
drivers written.
Good candidates to add in future: model/part#, cycle_time, manufacturer,
etc.
Q: I have some very specific attribute (e.g. battery color), should I add
this attribute to standard ones?
A: Most likely, no. Such attribute can be placed in the driver itself, if
it is useful. Of course, if the attribute in question applicable to
large set of batteries, provided by many drivers, and/or comes from
some general battery specification/standard, it may be a candidate to
be added to the core attribute set.
Q: Suppose, my battery monitoring chip/firmware does not provides capacity
in percents, but provides charge_{now,full,empty}. Should I calculate
percentage capacity manually, inside the driver, and register CAPACITY
attribute? The same question about time_to_empty/time_to_full.
A: Most likely, no. This class is designed to export properties which are
directly measurable by the specific hardware available.
Inferring not available properties using some heuristics or mathematical
model is not subject of work for a battery driver. Such functionality
should be factored out, and in fact, apm_power, the driver to serve
legacy APM API on top of power supply class, uses a simple heuristic of
approximating remaining battery capacity based on its charge, current,
voltage and so on. But full-fledged battery model is likely not subject
for kernel at all, as it would require floating point calculation to deal
with things like differential equations and Kalman filters. This is
better be handled by batteryd/libbattery, yet to be written.

View File

@@ -0,0 +1,182 @@
Regulator Consumer Driver Interface
===================================
This text describes the regulator interface for consumer device drivers.
Please see overview.txt for a description of the terms used in this text.
1. Consumer Regulator Access (static & dynamic drivers)
=======================================================
A consumer driver can get access to its supply regulator by calling :-
regulator = regulator_get(dev, "Vcc");
The consumer passes in its struct device pointer and power supply ID. The core
then finds the correct regulator by consulting a machine specific lookup table.
If the lookup is successful then this call will return a pointer to the struct
regulator that supplies this consumer.
To release the regulator the consumer driver should call :-
regulator_put(regulator);
Consumers can be supplied by more than one regulator e.g. codec consumer with
analog and digital supplies :-
digital = regulator_get(dev, "Vcc"); /* digital core */
analog = regulator_get(dev, "Avdd"); /* analog */
The regulator access functions regulator_get() and regulator_put() will
usually be called in your device drivers probe() and remove() respectively.
2. Regulator Output Enable & Disable (static & dynamic drivers)
====================================================================
A consumer can enable its power supply by calling:-
int regulator_enable(regulator);
NOTE: The supply may already be enabled before regulator_enabled() is called.
This may happen if the consumer shares the regulator or the regulator has been
previously enabled by bootloader or kernel board initialization code.
A consumer can determine if a regulator is enabled by calling :-
int regulator_is_enabled(regulator);
This will return > zero when the regulator is enabled.
A consumer can disable its supply when no longer needed by calling :-
int regulator_disable(regulator);
NOTE: This may not disable the supply if it's shared with other consumers. The
regulator will only be disabled when the enabled reference count is zero.
Finally, a regulator can be forcefully disabled in the case of an emergency :-
int regulator_force_disable(regulator);
NOTE: this will immediately and forcefully shutdown the regulator output. All
consumers will be powered off.
3. Regulator Voltage Control & Status (dynamic drivers)
======================================================
Some consumer drivers need to be able to dynamically change their supply
voltage to match system operating points. e.g. CPUfreq drivers can scale
voltage along with frequency to save power, SD drivers may need to select the
correct card voltage, etc.
Consumers can control their supply voltage by calling :-
int regulator_set_voltage(regulator, min_uV, max_uV);
Where min_uV and max_uV are the minimum and maximum acceptable voltages in
microvolts.
NOTE: this can be called when the regulator is enabled or disabled. If called
when enabled, then the voltage changes instantly, otherwise the voltage
configuration changes and the voltage is physically set when the regulator is
next enabled.
The regulators configured voltage output can be found by calling :-
int regulator_get_voltage(regulator);
NOTE: get_voltage() will return the configured output voltage whether the
regulator is enabled or disabled and should NOT be used to determine regulator
output state. However this can be used in conjunction with is_enabled() to
determine the regulator physical output voltage.
4. Regulator Current Limit Control & Status (dynamic drivers)
===========================================================
Some consumer drivers need to be able to dynamically change their supply
current limit to match system operating points. e.g. LCD backlight driver can
change the current limit to vary the backlight brightness, USB drivers may want
to set the limit to 500mA when supplying power.
Consumers can control their supply current limit by calling :-
int regulator_set_current_limit(regulator, min_uA, max_uA);
Where min_uA and max_uA are the minimum and maximum acceptable current limit in
microamps.
NOTE: this can be called when the regulator is enabled or disabled. If called
when enabled, then the current limit changes instantly, otherwise the current
limit configuration changes and the current limit is physically set when the
regulator is next enabled.
A regulators current limit can be found by calling :-
int regulator_get_current_limit(regulator);
NOTE: get_current_limit() will return the current limit whether the regulator
is enabled or disabled and should not be used to determine regulator current
load.
5. Regulator Operating Mode Control & Status (dynamic drivers)
=============================================================
Some consumers can further save system power by changing the operating mode of
their supply regulator to be more efficient when the consumers operating state
changes. e.g. consumer driver is idle and subsequently draws less current
Regulator operating mode can be changed indirectly or directly.
Indirect operating mode control.
--------------------------------
Consumer drivers can request a change in their supply regulator operating mode
by calling :-
int regulator_set_optimum_mode(struct regulator *regulator, int load_uA);
This will cause the core to recalculate the total load on the regulator (based
on all its consumers) and change operating mode (if necessary and permitted)
to best match the current operating load.
The load_uA value can be determined from the consumers datasheet. e.g.most
datasheets have tables showing the max current consumed in certain situations.
Most consumers will use indirect operating mode control since they have no
knowledge of the regulator or whether the regulator is shared with other
consumers.
Direct operating mode control.
------------------------------
Bespoke or tightly coupled drivers may want to directly control regulator
operating mode depending on their operating point. This can be achieved by
calling :-
int regulator_set_mode(struct regulator *regulator, unsigned int mode);
unsigned int regulator_get_mode(struct regulator *regulator);
Direct mode will only be used by consumers that *know* about the regulator and
are not sharing the regulator with other consumers.
6. Regulator Events
===================
Regulators can notify consumers of external events. Events could be received by
consumers under regulator stress or failure conditions.
Consumers can register interest in regulator events by calling :-
int regulator_register_notifier(struct regulator *regulator,
struct notifier_block *nb);
Consumers can uregister interest by calling :-
int regulator_unregister_notifier(struct regulator *regulator,
struct notifier_block *nb);
Regulators use the kernel notifier framework to send event to their interested
consumers.

View File

@@ -0,0 +1,33 @@
Regulator API design notes
==========================
This document provides a brief, partially structured, overview of some
of the design considerations which impact the regulator API design.
Safety
------
- Errors in regulator configuration can have very serious consequences
for the system, potentially including lasting hardware damage.
- It is not possible to automatically determine the power confugration
of the system - software-equivalent variants of the same chip may
have different power requirments, and not all components with power
requirements are visible to software.
=> The API should make no changes to the hardware state unless it has
specific knowledge that these changes are safe to do perform on
this particular system.
Consumer use cases
------------------
- The overwhelming majority of devices in a system will have no
requirement to do any runtime configuration of their power beyond
being able to turn it on or off.
- Many of the power supplies in the system will be shared between many
different consumers.
=> The consumer API should be structured so that these use cases are
very easy to handle and so that consumers will work with shared
supplies without any additional effort.

View File

@@ -0,0 +1,100 @@
Regulator Machine Driver Interface
===================================
The regulator machine driver interface is intended for board/machine specific
initialisation code to configure the regulator subsystem.
Consider the following machine :-
Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V]
|
+-> [Consumer B @ 3.3V]
The drivers for consumers A & B must be mapped to the correct regulator in
order to control their power supply. This mapping can be achieved in machine
initialisation code by creating a struct regulator_consumer_supply for
each regulator.
struct regulator_consumer_supply {
const char *dev_name; /* consumer dev_name() */
const char *supply; /* consumer supply - e.g. "vcc" */
};
e.g. for the machine above
static struct regulator_consumer_supply regulator1_consumers[] = {
{
.dev_name = "dev_name(consumer B)",
.supply = "Vcc",
},};
static struct regulator_consumer_supply regulator2_consumers[] = {
{
.dev = "dev_name(consumer A"),
.supply = "Vcc",
},};
This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2
to the 'Vcc' supply for Consumer A.
Constraints can now be registered by defining a struct regulator_init_data
for each regulator power domain. This structure also maps the consumers
to their supply regulator :-
static struct regulator_init_data regulator1_data = {
.constraints = {
.name = "Regulator-1",
.min_uV = 3300000,
.max_uV = 3300000,
.valid_modes_mask = REGULATOR_MODE_NORMAL,
},
.num_consumer_supplies = ARRAY_SIZE(regulator1_consumers),
.consumer_supplies = regulator1_consumers,
};
The name field should be set to something that is usefully descriptive
for the board for configuration of supplies for other regulators and
for use in logging and other diagnostic output. Normally the name
used for the supply rail in the schematic is a good choice. If no
name is provided then the subsystem will choose one.
Regulator-1 supplies power to Regulator-2. This relationship must be registered
with the core so that Regulator-1 is also enabled when Consumer A enables its
supply (Regulator-2). The supply regulator is set by the supply_regulator
field below and co:-
static struct regulator_init_data regulator2_data = {
.supply_regulator = "Regulator-1",
.constraints = {
.min_uV = 1800000,
.max_uV = 2000000,
.valid_ops_mask = REGULATOR_CHANGE_VOLTAGE,
.valid_modes_mask = REGULATOR_MODE_NORMAL,
},
.num_consumer_supplies = ARRAY_SIZE(regulator2_consumers),
.consumer_supplies = regulator2_consumers,
};
Finally the regulator devices must be registered in the usual manner.
static struct platform_device regulator_devices[] = {
{
.name = "regulator",
.id = DCDC_1,
.dev = {
.platform_data = &regulator1_data,
},
},
{
.name = "regulator",
.id = DCDC_2,
.dev = {
.platform_data = &regulator2_data,
},
},
};
/* register regulator 1 device */
platform_device_register(&regulator_devices[0]);
/* register regulator 2 device */
platform_device_register(&regulator_devices[1]);

View File

@@ -0,0 +1,171 @@
Linux voltage and current regulator framework
=============================================
About
=====
This framework is designed to provide a standard kernel interface to control
voltage and current regulators.
The intention is to allow systems to dynamically control regulator power output
in order to save power and prolong battery life. This applies to both voltage
regulators (where voltage output is controllable) and current sinks (where
current limit is controllable).
(C) 2008 Wolfson Microelectronics PLC.
Author: Liam Girdwood <lrg@slimlogic.co.uk>
Nomenclature
============
Some terms used in this document:-
o Regulator - Electronic device that supplies power to other devices.
Most regulators can enable and disable their output whilst
some can control their output voltage and or current.
Input Voltage -> Regulator -> Output Voltage
o PMIC - Power Management IC. An IC that contains numerous regulators
and often contains other subsystems.
o Consumer - Electronic device that is supplied power by a regulator.
Consumers can be classified into two types:-
Static: consumer does not change its supply voltage or
current limit. It only needs to enable or disable it's
power supply. Its supply voltage is set by the hardware,
bootloader, firmware or kernel board initialisation code.
Dynamic: consumer needs to change it's supply voltage or
current limit to meet operation demands.
o Power Domain - Electronic circuit that is supplied its input power by the
output power of a regulator, switch or by another power
domain.
The supply regulator may be behind a switch(s). i.e.
Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A]
| |
| +-> [Consumer B], [Consumer C]
|
+-> [Consumer D], [Consumer E]
That is one regulator and three power domains:
Domain 1: Switch-1, Consumers D & E.
Domain 2: Switch-2, Consumers B & C.
Domain 3: Consumer A.
and this represents a "supplies" relationship:
Domain-1 --> Domain-2 --> Domain-3.
A power domain may have regulators that are supplied power
by other regulators. i.e.
Regulator-1 -+-> Regulator-2 -+-> [Consumer A]
|
+-> [Consumer B]
This gives us two regulators and two power domains:
Domain 1: Regulator-2, Consumer B.
Domain 2: Consumer A.
and a "supplies" relationship:
Domain-1 --> Domain-2
o Constraints - Constraints are used to define power levels for performance
and hardware protection. Constraints exist at three levels:
Regulator Level: This is defined by the regulator hardware
operating parameters and is specified in the regulator
datasheet. i.e.
- voltage output is in the range 800mV -> 3500mV.
- regulator current output limit is 20mA @ 5V but is
10mA @ 10V.
Power Domain Level: This is defined in software by kernel
level board initialisation code. It is used to constrain a
power domain to a particular power range. i.e.
- Domain-1 voltage is 3300mV
- Domain-2 voltage is 1400mV -> 1600mV
- Domain-3 current limit is 0mA -> 20mA.
Consumer Level: This is defined by consumer drivers
dynamically setting voltage or current limit levels.
e.g. a consumer backlight driver asks for a current increase
from 5mA to 10mA to increase LCD illumination. This passes
to through the levels as follows :-
Consumer: need to increase LCD brightness. Lookup and
request next current mA value in brightness table (the
consumer driver could be used on several different
personalities based upon the same reference device).
Power Domain: is the new current limit within the domain
operating limits for this domain and system state (e.g.
battery power, USB power)
Regulator Domains: is the new current limit within the
regulator operating parameters for input/output voltage.
If the regulator request passes all the constraint tests
then the new regulator value is applied.
Design
======
The framework is designed and targeted at SoC based devices but may also be
relevant to non SoC devices and is split into the following four interfaces:-
1. Consumer driver interface.
This uses a similar API to the kernel clock interface in that consumer
drivers can get and put a regulator (like they can with clocks atm) and
get/set voltage, current limit, mode, enable and disable. This should
allow consumers complete control over their supply voltage and current
limit. This also compiles out if not in use so drivers can be reused in
systems with no regulator based power control.
See Documentation/power/regulator/consumer.txt
2. Regulator driver interface.
This allows regulator drivers to register their regulators and provide
operations to the core. It also has a notifier call chain for propagating
regulator events to clients.
See Documentation/power/regulator/regulator.txt
3. Machine interface.
This interface is for machine specific code and allows the creation of
voltage/current domains (with constraints) for each regulator. It can
provide regulator constraints that will prevent device damage through
overvoltage or over current caused by buggy client drivers. It also
allows the creation of a regulator tree whereby some regulators are
supplied by others (similar to a clock tree).
See Documentation/power/regulator/machine.txt
4. Userspace ABI.
The framework also exports a lot of useful voltage/current/opmode data to
userspace via sysfs. This could be used to help monitor device power
consumption and status.
See Documentation/ABI/testing/sysfs-class-regulator

View File

@@ -0,0 +1,31 @@
Regulator Driver Interface
==========================
The regulator driver interface is relatively simple and designed to allow
regulator drivers to register their services with the core framework.
Registration
============
Drivers can register a regulator by calling :-
struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc,
struct device *dev, struct regulator_init_data *init_data,
void *driver_data, struct device_node *of_node);
This will register the regulators capabilities and operations to the regulator
core.
Regulators can be unregistered by calling :-
void regulator_unregister(struct regulator_dev *rdev);
Regulator Events
================
Regulators can send events (e.g. over temp, under voltage, etc) to consumer
drivers by calling :-
int regulator_notifier_call_chain(struct regulator_dev *rdev,
unsigned long event, void *data);

View File

@@ -0,0 +1,887 @@
Runtime Power Management Framework for I/O Devices
(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
(C) 2010 Alan Stern <stern@rowland.harvard.edu>
1. Introduction
Support for runtime power management (runtime PM) of I/O devices is provided
at the power management core (PM core) level by means of:
* The power management workqueue pm_wq in which bus types and device drivers can
put their PM-related work items. It is strongly recommended that pm_wq be
used for queuing all work items related to runtime PM, because this allows
them to be synchronized with system-wide power transitions (suspend to RAM,
hibernation and resume from system sleep states). pm_wq is declared in
include/linux/pm_runtime.h and defined in kernel/power/main.c.
* A number of runtime PM fields in the 'power' member of 'struct device' (which
is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
be used for synchronizing runtime PM operations with one another.
* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in
include/linux/pm.h).
* A set of helper functions defined in drivers/base/power/runtime.c that can be
used for carrying out runtime PM operations in such a way that the
synchronization between them is taken care of by the PM core. Bus types and
device drivers are encouraged to use these functions.
The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM
fields of 'struct dev_pm_info' and the core helper functions provided for
runtime PM are described below.
2. Device Runtime PM Callbacks
There are three device runtime PM callbacks defined in 'struct dev_pm_ops':
struct dev_pm_ops {
...
int (*runtime_suspend)(struct device *dev);
int (*runtime_resume)(struct device *dev);
int (*runtime_idle)(struct device *dev);
...
};
The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
are executed by the PM core for the device's subsystem that may be either of
the following:
1. PM domain of the device, if the device's PM domain object, dev->pm_domain,
is present.
2. Device type of the device, if both dev->type and dev->type->pm are present.
3. Device class of the device, if both dev->class and dev->class->pm are
present.
4. Bus type of the device, if both dev->bus and dev->bus->pm are present.
If the subsystem chosen by applying the above rules doesn't provide the relevant
callback, the PM core will invoke the corresponding driver callback stored in
dev->driver->pm directly (if present).
The PM core always checks which callback to use in the order given above, so the
priority order of callbacks from high to low is: PM domain, device type, class
and bus type. Moreover, the high-priority one will always take precedence over
a low-priority one. The PM domain, bus type, device type and class callbacks
are referred to as subsystem-level callbacks in what follows.
By default, the callbacks are always invoked in process context with interrupts
enabled. However, the pm_runtime_irq_safe() helper function can be used to tell
the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume()
and ->runtime_idle() callbacks for the given device in atomic context with
interrupts disabled. This implies that the callback routines in question must
not block or sleep, but it also means that the synchronous helper functions
listed at the end of Section 4 may be used for that device within an interrupt
handler or generally in an atomic context.
The subsystem-level suspend callback, if present, is _entirely_ _responsible_
for handling the suspend of the device as appropriate, which may, but need not
include executing the device driver's own ->runtime_suspend() callback (from the
PM core's point of view it is not necessary to implement a ->runtime_suspend()
callback in a device driver as long as the subsystem-level suspend callback
knows what to do to handle the device).
* Once the subsystem-level suspend callback (or the driver suspend callback,
if invoked directly) has completed successfully for the given device, the PM
core regards the device as suspended, which need not mean that it has been
put into a low power state. It is supposed to mean, however, that the
device will not process data and will not communicate with the CPU(s) and
RAM until the appropriate resume callback is executed for it. The runtime
PM status of a device after successful execution of the suspend callback is
'suspended'.
* If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
status remains 'active', which means that the device _must_ be fully
operational afterwards.
* If the suspend callback returns an error code different from -EBUSY and
-EAGAIN, the PM core regards this as a fatal error and will refuse to run
the helper functions described in Section 4 for the device until its status
is directly set to either'active', or 'suspended' (the PM core provides
special helper functions for this purpose).
In particular, if the driver requires remote wakeup capability (i.e. hardware
mechanism allowing the device to request a change of its power state, such as
PCI PME) for proper functioning and device_run_wake() returns 'false' for the
device, then ->runtime_suspend() should return -EBUSY. On the other hand, if
device_run_wake() returns 'true' for the device and the device is put into a
low-power state during the execution of the suspend callback, it is expected
that remote wakeup will be enabled for the device. Generally, remote wakeup
should be enabled for all input devices put into low-power states at run time.
The subsystem-level resume callback, if present, is _entirely_ _responsible_ for
handling the resume of the device as appropriate, which may, but need not
include executing the device driver's own ->runtime_resume() callback (from the
PM core's point of view it is not necessary to implement a ->runtime_resume()
callback in a device driver as long as the subsystem-level resume callback knows
what to do to handle the device).
* Once the subsystem-level resume callback (or the driver resume callback, if
invoked directly) has completed successfully, the PM core regards the device
as fully operational, which means that the device _must_ be able to complete
I/O operations as needed. The runtime PM status of the device is then
'active'.
* If the resume callback returns an error code, the PM core regards this as a
fatal error and will refuse to run the helper functions described in Section
4 for the device, until its status is directly set to either 'active', or
'suspended' (by means of special helper functions provided by the PM core
for this purpose).
The idle callback (a subsystem-level one, if present, or the driver one) is
executed by the PM core whenever the device appears to be idle, which is
indicated to the PM core by two counters, the device's usage counter and the
counter of 'active' children of the device.
* If any of these counters is decreased using a helper function provided by
the PM core and it turns out to be equal to zero, the other counter is
checked. If that counter also is equal to zero, the PM core executes the
idle callback with the device as its argument.
The action performed by the idle callback is totally dependent on the subsystem
(or driver) in question, but the expected and recommended action is to check
if the device can be suspended (i.e. if all of the conditions necessary for
suspending the device are satisfied) and to queue up a suspend request for the
device in that case. The value returned by this callback is ignored by the PM
core.
The helper functions provided by the PM core, described in Section 4, guarantee
that the following constraints are met with respect to runtime PM callbacks for
one device:
(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
->runtime_suspend() in parallel with ->runtime_resume() or with another
instance of ->runtime_suspend() for the same device) with the exception that
->runtime_suspend() or ->runtime_resume() can be executed in parallel with
->runtime_idle() (although ->runtime_idle() will not be started while any
of the other callbacks is being executed for the same device).
(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active'
devices (i.e. the PM core will only execute ->runtime_idle() or
->runtime_suspend() for the devices the runtime PM status of which is
'active').
(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device
the usage counter of which is equal to zero _and_ either the counter of
'active' children of which is equal to zero, or the 'power.ignore_children'
flag of which is set.
(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the
PM core will only execute ->runtime_resume() for the devices the runtime
PM status of which is 'suspended').
Additionally, the helper functions provided by the PM core obey the following
rules:
* If ->runtime_suspend() is about to be executed or there's a pending request
to execute it, ->runtime_idle() will not be executed for the same device.
* A request to execute or to schedule the execution of ->runtime_suspend()
will cancel any pending requests to execute ->runtime_idle() for the same
device.
* If ->runtime_resume() is about to be executed or there's a pending request
to execute it, the other callbacks will not be executed for the same device.
* A request to execute ->runtime_resume() will cancel any pending or
scheduled requests to execute the other callbacks for the same device,
except for scheduled autosuspends.
3. Runtime PM Device Fields
The following device runtime PM fields are present in 'struct dev_pm_info', as
defined in include/linux/pm.h:
struct timer_list suspend_timer;
- timer used for scheduling (delayed) suspend and autosuspend requests
unsigned long timer_expires;
- timer expiration time, in jiffies (if this is different from zero, the
timer is running and will expire at that time, otherwise the timer is not
running)
struct work_struct work;
- work structure used for queuing up requests (i.e. work items in pm_wq)
wait_queue_head_t wait_queue;
- wait queue used if any of the helper functions needs to wait for another
one to complete
spinlock_t lock;
- lock used for synchronisation
atomic_t usage_count;
- the usage counter of the device
atomic_t child_count;
- the count of 'active' children of the device
unsigned int ignore_children;
- if set, the value of child_count is ignored (but still updated)
unsigned int disable_depth;
- used for disabling the helper funcions (they work normally if this is
equal to zero); the initial value of it is 1 (i.e. runtime PM is
initially disabled for all devices)
unsigned int runtime_error;
- if set, there was a fatal error (one of the callbacks returned error code
as described in Section 2), so the helper funtions will not work until
this flag is cleared; this is the error code returned by the failing
callback
unsigned int idle_notification;
- if set, ->runtime_idle() is being executed
unsigned int request_pending;
- if set, there's a pending request (i.e. a work item queued up into pm_wq)
enum rpm_request request;
- type of request that's pending (valid if request_pending is set)
unsigned int deferred_resume;
- set if ->runtime_resume() is about to be run while ->runtime_suspend() is
being executed for that device and it is not practical to wait for the
suspend to complete; means "start a resume as soon as you've suspended"
unsigned int run_wake;
- set if the device is capable of generating runtime wake-up events
enum rpm_status runtime_status;
- the runtime PM status of the device; this field's initial value is
RPM_SUSPENDED, which means that each device is initially regarded by the
PM core as 'suspended', regardless of its real hardware status
unsigned int runtime_auto;
- if set, indicates that the user space has allowed the device driver to
power manage the device at run time via the /sys/devices/.../power/control
interface; it may only be modified with the help of the pm_runtime_allow()
and pm_runtime_forbid() helper functions
unsigned int no_callbacks;
- indicates that the device does not use the runtime PM callbacks (see
Section 8); it may be modified only by the pm_runtime_no_callbacks()
helper function
unsigned int irq_safe;
- indicates that the ->runtime_suspend() and ->runtime_resume() callbacks
will be invoked with the spinlock held and interrupts disabled
unsigned int use_autosuspend;
- indicates that the device's driver supports delayed autosuspend (see
Section 9); it may be modified only by the
pm_runtime{_dont}_use_autosuspend() helper functions
unsigned int timer_autosuspends;
- indicates that the PM core should attempt to carry out an autosuspend
when the timer expires rather than a normal suspend
int autosuspend_delay;
- the delay time (in milliseconds) to be used for autosuspend
unsigned long last_busy;
- the time (in jiffies) when the pm_runtime_mark_last_busy() helper
function was last called for this device; used in calculating inactivity
periods for autosuspend
All of the above fields are members of the 'power' member of 'struct device'.
4. Runtime PM Device Helper Functions
The following runtime PM helper functions are defined in
drivers/base/power/runtime.c and include/linux/pm_runtime.h:
void pm_runtime_init(struct device *dev);
- initialize the device runtime PM fields in 'struct dev_pm_info'
void pm_runtime_remove(struct device *dev);
- make sure that the runtime PM of the device will be disabled after
removing the device from device hierarchy
int pm_runtime_idle(struct device *dev);
- execute the subsystem-level idle callback for the device; returns 0 on
success or error code on failure, where -EINPROGRESS means that
->runtime_idle() is already being executed
int pm_runtime_suspend(struct device *dev);
- execute the subsystem-level suspend callback for the device; returns 0 on
success, 1 if the device's runtime PM status was already 'suspended', or
error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt
to suspend the device again in future and -EACCES means that
'power.disable_depth' is different from 0
int pm_runtime_autosuspend(struct device *dev);
- same as pm_runtime_suspend() except that the autosuspend delay is taken
into account; if pm_runtime_autosuspend_expiration() says the delay has
not yet expired then an autosuspend is scheduled for the appropriate time
and 0 is returned
int pm_runtime_resume(struct device *dev);
- execute the subsystem-level resume callback for the device; returns 0 on
success, 1 if the device's runtime PM status was already 'active' or
error code on failure, where -EAGAIN means it may be safe to attempt to
resume the device again in future, but 'power.runtime_error' should be
checked additionally, and -EACCES means that 'power.disable_depth' is
different from 0
int pm_request_idle(struct device *dev);
- submit a request to execute the subsystem-level idle callback for the
device (the request is represented by a work item in pm_wq); returns 0 on
success or error code if the request has not been queued up
int pm_request_autosuspend(struct device *dev);
- schedule the execution of the subsystem-level suspend callback for the
device when the autosuspend delay has expired; if the delay has already
expired then the work item is queued up immediately
int pm_schedule_suspend(struct device *dev, unsigned int delay);
- schedule the execution of the subsystem-level suspend callback for the
device in future, where 'delay' is the time to wait before queuing up a
suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work
item is queued up immediately); returns 0 on success, 1 if the device's PM
runtime status was already 'suspended', or error code if the request
hasn't been scheduled (or queued up if 'delay' is 0); if the execution of
->runtime_suspend() is already scheduled and not yet expired, the new
value of 'delay' will be used as the time to wait
int pm_request_resume(struct device *dev);
- submit a request to execute the subsystem-level resume callback for the
device (the request is represented by a work item in pm_wq); returns 0 on
success, 1 if the device's runtime PM status was already 'active', or
error code if the request hasn't been queued up
void pm_runtime_get_noresume(struct device *dev);
- increment the device's usage counter
int pm_runtime_get(struct device *dev);
- increment the device's usage counter, run pm_request_resume(dev) and
return its result
int pm_runtime_get_sync(struct device *dev);
- increment the device's usage counter, run pm_runtime_resume(dev) and
return its result
void pm_runtime_put_noidle(struct device *dev);
- decrement the device's usage counter
int pm_runtime_put(struct device *dev);
- decrement the device's usage counter; if the result is 0 then run
pm_request_idle(dev) and return its result
int pm_runtime_put_autosuspend(struct device *dev);
- decrement the device's usage counter; if the result is 0 then run
pm_request_autosuspend(dev) and return its result
int pm_runtime_put_sync(struct device *dev);
- decrement the device's usage counter; if the result is 0 then run
pm_runtime_idle(dev) and return its result
int pm_runtime_put_sync_suspend(struct device *dev);
- decrement the device's usage counter; if the result is 0 then run
pm_runtime_suspend(dev) and return its result
int pm_runtime_put_sync_autosuspend(struct device *dev);
- decrement the device's usage counter; if the result is 0 then run
pm_runtime_autosuspend(dev) and return its result
void pm_runtime_enable(struct device *dev);
- decrement the device's 'power.disable_depth' field; if that field is equal
to zero, the runtime PM helper functions can execute subsystem-level
callbacks described in Section 2 for the device
int pm_runtime_disable(struct device *dev);
- increment the device's 'power.disable_depth' field (if the value of that
field was previously zero, this prevents subsystem-level runtime PM
callbacks from being run for the device), make sure that all of the pending
runtime PM operations on the device are either completed or canceled;
returns 1 if there was a resume request pending and it was necessary to
execute the subsystem-level resume callback for the device to satisfy that
request, otherwise 0 is returned
int pm_runtime_barrier(struct device *dev);
- check if there's a resume request pending for the device and resume it
(synchronously) in that case, cancel any other pending runtime PM requests
regarding it and wait for all runtime PM operations on it in progress to
complete; returns 1 if there was a resume request pending and it was
necessary to execute the subsystem-level resume callback for the device to
satisfy that request, otherwise 0 is returned
void pm_suspend_ignore_children(struct device *dev, bool enable);
- set/unset the power.ignore_children flag of the device
int pm_runtime_set_active(struct device *dev);
- clear the device's 'power.runtime_error' flag, set the device's runtime
PM status to 'active' and update its parent's counter of 'active'
children as appropriate (it is only valid to use this function if
'power.runtime_error' is set or 'power.disable_depth' is greater than
zero); it will fail and return error code if the device has a parent
which is not active and the 'power.ignore_children' flag of which is unset
void pm_runtime_set_suspended(struct device *dev);
- clear the device's 'power.runtime_error' flag, set the device's runtime
PM status to 'suspended' and update its parent's counter of 'active'
children as appropriate (it is only valid to use this function if
'power.runtime_error' is set or 'power.disable_depth' is greater than
zero)
bool pm_runtime_suspended(struct device *dev);
- return true if the device's runtime PM status is 'suspended' and its
'power.disable_depth' field is equal to zero, or false otherwise
bool pm_runtime_status_suspended(struct device *dev);
- return true if the device's runtime PM status is 'suspended'
void pm_runtime_allow(struct device *dev);
- set the power.runtime_auto flag for the device and decrease its usage
counter (used by the /sys/devices/.../power/control interface to
effectively allow the device to be power managed at run time)
void pm_runtime_forbid(struct device *dev);
- unset the power.runtime_auto flag for the device and increase its usage
counter (used by the /sys/devices/.../power/control interface to
effectively prevent the device from being power managed at run time)
void pm_runtime_no_callbacks(struct device *dev);
- set the power.no_callbacks flag for the device and remove the runtime
PM attributes from /sys/devices/.../power (or prevent them from being
added when the device is registered)
void pm_runtime_irq_safe(struct device *dev);
- set the power.irq_safe flag for the device, causing the runtime-PM
callbacks to be invoked with interrupts off
void pm_runtime_mark_last_busy(struct device *dev);
- set the power.last_busy field to the current time
void pm_runtime_use_autosuspend(struct device *dev);
- set the power.use_autosuspend flag, enabling autosuspend delays
void pm_runtime_dont_use_autosuspend(struct device *dev);
- clear the power.use_autosuspend flag, disabling autosuspend delays
void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
- set the power.autosuspend_delay value to 'delay' (expressed in
milliseconds); if 'delay' is negative then runtime suspends are
prevented
unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
- calculate the time when the current autosuspend delay period will expire,
based on power.last_busy and power.autosuspend_delay; if the delay time
is 1000 ms or larger then the expiration time is rounded up to the
nearest second; returns 0 if the delay period has already expired or
power.use_autosuspend isn't set, otherwise returns the expiration time
in jiffies
It is safe to execute the following helper functions from interrupt context:
pm_request_idle()
pm_request_autosuspend()
pm_schedule_suspend()
pm_request_resume()
pm_runtime_get_noresume()
pm_runtime_get()
pm_runtime_put_noidle()
pm_runtime_put()
pm_runtime_put_autosuspend()
pm_runtime_enable()
pm_suspend_ignore_children()
pm_runtime_set_active()
pm_runtime_set_suspended()
pm_runtime_suspended()
pm_runtime_mark_last_busy()
pm_runtime_autosuspend_expiration()
If pm_runtime_irq_safe() has been called for a device then the following helper
functions may also be used in interrupt context:
pm_runtime_idle()
pm_runtime_suspend()
pm_runtime_autosuspend()
pm_runtime_resume()
pm_runtime_get_sync()
pm_runtime_put_sync()
pm_runtime_put_sync_suspend()
pm_runtime_put_sync_autosuspend()
5. Runtime PM Initialization, Device Probing and Removal
Initially, the runtime PM is disabled for all devices, which means that the
majority of the runtime PM helper funtions described in Section 4 will return
-EAGAIN until pm_runtime_enable() is called for the device.
In addition to that, the initial runtime PM status of all devices is
'suspended', but it need not reflect the actual physical state of the device.
Thus, if the device is initially active (i.e. it is able to process I/O), its
runtime PM status must be changed to 'active', with the help of
pm_runtime_set_active(), before pm_runtime_enable() is called for the device.
However, if the device has a parent and the parent's runtime PM is enabled,
calling pm_runtime_set_active() for the device will affect the parent, unless
the parent's 'power.ignore_children' flag is set. Namely, in that case the
parent won't be able to suspend at run time, using the PM core's helper
functions, as long as the child's status is 'active', even if the child's
runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for
the child yet or pm_runtime_disable() has been called for it). For this reason,
once pm_runtime_set_active() has been called for the device, pm_runtime_enable()
should be called for it too as soon as reasonably possible or its runtime PM
status should be changed back to 'suspended' with the help of
pm_runtime_set_suspended().
If the default initial runtime PM status of the device (i.e. 'suspended')
reflects the actual state of the device, its bus type's or its driver's
->probe() callback will likely need to wake it up using one of the PM core's
helper functions described in Section 4. In that case, pm_runtime_resume()
should be used. Of course, for this purpose the device's runtime PM has to be
enabled earlier by calling pm_runtime_enable().
If the device bus type's or driver's ->probe() callback runs
pm_runtime_suspend() or pm_runtime_idle() or their asynchronous counterparts,
they will fail returning -EAGAIN, because the device's usage counter is
incremented by the driver core before executing ->probe(). Still, it may be
desirable to suspend the device as soon as ->probe() has finished, so the driver
core uses pm_runtime_put_sync() to invoke the subsystem-level idle callback for
the device at that time.
Moreover, the driver core prevents runtime PM callbacks from racing with the bus
notifier callback in __device_release_driver(), which is necessary, because the
notifier is used by some subsystems to carry out operations affecting the
runtime PM functionality. It does so by calling pm_runtime_get_sync() before
driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This
resumes the device if it's in the suspended state and prevents it from
being suspended again while those routines are being executed.
To allow bus types and drivers to put devices into the suspended state by
calling pm_runtime_suspend() from their ->remove() routines, the driver core
executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER
notifications in __device_release_driver(). This requires bus types and
drivers to make their ->remove() callbacks avoid races with runtime PM directly,
but also it allows of more flexibility in the handling of devices during the
removal of their drivers.
The user space can effectively disallow the driver of the device to power manage
it at run time by changing the value of its /sys/devices/.../power/control
attribute to "on", which causes pm_runtime_forbid() to be called. In principle,
this mechanism may also be used by the driver to effectively turn off the
runtime power management of the device until the user space turns it on.
Namely, during the initialization the driver can make sure that the runtime PM
status of the device is 'active' and call pm_runtime_forbid(). It should be
noted, however, that if the user space has already intentionally changed the
value of /sys/devices/.../power/control to "auto" to allow the driver to power
manage the device at run time, the driver may confuse it by using
pm_runtime_forbid() this way.
6. Runtime PM and System Sleep
Runtime PM and system sleep (i.e., system suspend and hibernation, also known
as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of
ways. If a device is active when a system sleep starts, everything is
straightforward. But what should happen if the device is already suspended?
The device may have different wake-up settings for runtime PM and system sleep.
For example, remote wake-up may be enabled for runtime suspend but disallowed
for system sleep (device_may_wakeup(dev) returns 'false'). When this happens,
the subsystem-level system suspend callback is responsible for changing the
device's wake-up setting (it may leave that to the device driver's system
suspend routine). It may be necessary to resume the device and suspend it again
in order to do so. The same is true if the driver uses different power levels
or other settings for runtime suspend and system sleep.
During system resume, the simplest approach is to bring all devices back to full
power, even if they had been suspended before the system suspend began. There
are several reasons for this, including:
* The device might need to switch power levels, wake-up settings, etc.
* Remote wake-up events might have been lost by the firmware.
* The device's children may need the device to be at full power in order
to resume themselves.
* The driver's idea of the device state may not agree with the device's
physical state. This can happen during resume from hibernation.
* The device might need to be reset.
* Even though the device was suspended, if its usage counter was > 0 then most
likely it would need a runtime resume in the near future anyway.
If the device had been suspended before the system suspend began and it's
brought back to full power during resume, then its runtime PM status will have
to be updated to reflect the actual post-system sleep status. The way to do
this is:
pm_runtime_disable(dev);
pm_runtime_set_active(dev);
pm_runtime_enable(dev);
The PM core always increments the runtime usage counter before calling the
->suspend() callback and decrements it after calling the ->resume() callback.
Hence disabling runtime PM temporarily like this will not cause any runtime
suspend attempts to be permanently lost. If the usage count goes to zero
following the return of the ->resume() callback, the ->runtime_idle() callback
will be invoked as usual.
On some systems, however, system sleep is not entered through a global firmware
or hardware operation. Instead, all hardware components are put into low-power
states directly by the kernel in a coordinated way. Then, the system sleep
state effectively follows from the states the hardware components end up in
and the system is woken up from that state by a hardware interrupt or a similar
mechanism entirely under the kernel's control. As a result, the kernel never
gives control away and the states of all devices during resume are precisely
known to it. If that is the case and none of the situations listed above takes
place (in particular, if the system is not waking up from hibernation), it may
be more efficient to leave the devices that had been suspended before the system
suspend began in the suspended state.
The PM core does its best to reduce the probability of race conditions between
the runtime PM and system suspend/resume (and hibernation) callbacks by carrying
out the following operations:
* During system suspend it calls pm_runtime_get_noresume() and
pm_runtime_barrier() for every device right before executing the
subsystem-level .suspend() callback for it. In addition to that it calls
pm_runtime_disable() for every device right after executing the
subsystem-level .suspend() callback for it.
* During system resume it calls pm_runtime_enable() and pm_runtime_put_sync()
for every device right before and right after executing the subsystem-level
.resume() callback for it, respectively.
7. Generic subsystem callbacks
Subsystems may wish to conserve code space by using the set of generic power
management callbacks provided by the PM core, defined in
driver/base/power/generic_ops.c:
int pm_generic_runtime_idle(struct device *dev);
- invoke the ->runtime_idle() callback provided by the driver of this
device, if defined, and call pm_runtime_suspend() for this device if the
return value is 0 or the callback is not defined
int pm_generic_runtime_suspend(struct device *dev);
- invoke the ->runtime_suspend() callback provided by the driver of this
device and return its result, or return -EINVAL if not defined
int pm_generic_runtime_resume(struct device *dev);
- invoke the ->runtime_resume() callback provided by the driver of this
device and return its result, or return -EINVAL if not defined
int pm_generic_suspend(struct device *dev);
- if the device has not been suspended at run time, invoke the ->suspend()
callback provided by its driver and return its result, or return 0 if not
defined
int pm_generic_suspend_noirq(struct device *dev);
- if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq()
callback provided by the device's driver and return its result, or return
0 if not defined
int pm_generic_resume(struct device *dev);
- invoke the ->resume() callback provided by the driver of this device and,
if successful, change the device's runtime PM status to 'active'
int pm_generic_resume_noirq(struct device *dev);
- invoke the ->resume_noirq() callback provided by the driver of this device
int pm_generic_freeze(struct device *dev);
- if the device has not been suspended at run time, invoke the ->freeze()
callback provided by its driver and return its result, or return 0 if not
defined
int pm_generic_freeze_noirq(struct device *dev);
- if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq()
callback provided by the device's driver and return its result, or return
0 if not defined
int pm_generic_thaw(struct device *dev);
- if the device has not been suspended at run time, invoke the ->thaw()
callback provided by its driver and return its result, or return 0 if not
defined
int pm_generic_thaw_noirq(struct device *dev);
- if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq()
callback provided by the device's driver and return its result, or return
0 if not defined
int pm_generic_poweroff(struct device *dev);
- if the device has not been suspended at run time, invoke the ->poweroff()
callback provided by its driver and return its result, or return 0 if not
defined
int pm_generic_poweroff_noirq(struct device *dev);
- if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq()
callback provided by the device's driver and return its result, or return
0 if not defined
int pm_generic_restore(struct device *dev);
- invoke the ->restore() callback provided by the driver of this device and,
if successful, change the device's runtime PM status to 'active'
int pm_generic_restore_noirq(struct device *dev);
- invoke the ->restore_noirq() callback provided by the device's driver
These functions can be assigned to the ->runtime_idle(), ->runtime_suspend(),
->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() callback
pointers in the subsystem-level dev_pm_ops structures.
If a subsystem wishes to use all of them at the same time, it can simply assign
the GENERIC_SUBSYS_PM_OPS macro, defined in include/linux/pm.h, to its
dev_pm_ops structure pointer.
Device drivers that wish to use the same function as a system suspend, freeze,
poweroff and runtime suspend callback, and similarly for system resume, thaw,
restore, and runtime resume, can achieve this with the help of the
UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its
last argument to NULL).
8. "No-Callback" Devices
Some "devices" are only logical sub-devices of their parent and cannot be
power-managed on their own. (The prototype example is a USB interface. Entire
USB devices can go into low-power mode or send wake-up requests, but neither is
possible for individual interfaces.) The drivers for these devices have no
need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend()
and ->runtime_resume() would always return 0 without doing anything else and
->runtime_idle() would always call pm_runtime_suspend().
Subsystems can tell the PM core about these devices by calling
pm_runtime_no_callbacks(). This should be done after the device structure is
initialized and before it is registered (although after device registration is
also okay). The routine will set the device's power.no_callbacks flag and
prevent the non-debugging runtime PM sysfs attributes from being created.
When power.no_callbacks is set, the PM core will not invoke the
->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks.
Instead it will assume that suspends and resumes always succeed and that idle
devices should be suspended.
As a consequence, the PM core will never directly inform the device's subsystem
or driver about runtime power changes. Instead, the driver for the device's
parent must take responsibility for telling the device's driver when the
parent's power state changes.
9. Autosuspend, or automatically-delayed suspends
Changing a device's power state isn't free; it requires both time and energy.
A device should be put in a low-power state only when there's some reason to
think it will remain in that state for a substantial time. A common heuristic
says that a device which hasn't been used for a while is liable to remain
unused; following this advice, drivers should not allow devices to be suspended
at runtime until they have been inactive for some minimum period. Even when
the heuristic ends up being non-optimal, it will still prevent devices from
"bouncing" too rapidly between low-power and full-power states.
The term "autosuspend" is an historical remnant. It doesn't mean that the
device is automatically suspended (the subsystem or driver still has to call
the appropriate PM routines); rather it means that runtime suspends will
automatically be delayed until the desired period of inactivity has elapsed.
Inactivity is determined based on the power.last_busy field. Drivers should
call pm_runtime_mark_last_busy() to update this field after carrying out I/O,
typically just before calling pm_runtime_put_autosuspend(). The desired length
of the inactivity period is a matter of policy. Subsystems can set this length
initially by calling pm_runtime_set_autosuspend_delay(), but after device
registration the length should be controlled by user space, using the
/sys/devices/.../power/autosuspend_delay_ms attribute.
In order to use autosuspend, subsystems or drivers must call
pm_runtime_use_autosuspend() (preferably before registering the device), and
thereafter they should use the various *_autosuspend() helper functions instead
of the non-autosuspend counterparts:
Instead of: pm_runtime_suspend use: pm_runtime_autosuspend;
Instead of: pm_schedule_suspend use: pm_request_autosuspend;
Instead of: pm_runtime_put use: pm_runtime_put_autosuspend;
Instead of: pm_runtime_put_sync use: pm_runtime_put_sync_autosuspend.
Drivers may also continue to use the non-autosuspend helper functions; they
will behave normally, not taking the autosuspend delay into account.
Similarly, if the power.use_autosuspend field isn't set then the autosuspend
helper functions will behave just like the non-autosuspend counterparts.
Under some circumstances a driver or subsystem may want to prevent a device
from autosuspending immediately, even though the usage counter is zero and the
autosuspend delay time has expired. If the ->runtime_suspend() callback
returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is
in the future (as it normally would be if the callback invoked
pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
autosuspend. The ->runtime_suspend() callback can't do this rescheduling
itself because no suspend requests of any kind are accepted while the device is
suspending (i.e., while the callback is running).
The implementation is well suited for asynchronous use in interrupt contexts.
However such use inevitably involves races, because the PM core can't
synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
This synchronization must be handled by the driver, using its private lock.
Here is a schematic pseudo-code example:
foo_read_or_write(struct foo_priv *foo, void *data)
{
lock(&foo->private_lock);
add_request_to_io_queue(foo, data);
if (foo->num_pending_requests++ == 0)
pm_runtime_get(&foo->dev);
if (!foo->is_suspended)
foo_process_next_request(foo);
unlock(&foo->private_lock);
}
foo_io_completion(struct foo_priv *foo, void *req)
{
lock(&foo->private_lock);
if (--foo->num_pending_requests == 0) {
pm_runtime_mark_last_busy(&foo->dev);
pm_runtime_put_autosuspend(&foo->dev);
} else {
foo_process_next_request(foo);
}
unlock(&foo->private_lock);
/* Send req result back to the user ... */
}
int foo_runtime_suspend(struct device *dev)
{
struct foo_priv foo = container_of(dev, ...);
int ret = 0;
lock(&foo->private_lock);
if (foo->num_pending_requests > 0) {
ret = -EBUSY;
} else {
/* ... suspend the device ... */
foo->is_suspended = 1;
}
unlock(&foo->private_lock);
return ret;
}
int foo_runtime_resume(struct device *dev)
{
struct foo_priv foo = container_of(dev, ...);
lock(&foo->private_lock);
/* ... resume the device ... */
foo->is_suspended = 0;
pm_runtime_mark_last_busy(&foo->dev);
if (foo->num_pending_requests > 0)
foo_process_requests(foo);
unlock(&foo->private_lock);
return 0;
}
The important point is that after foo_io_completion() asks for an autosuspend,
the foo_runtime_suspend() callback may race with foo_read_or_write().
Therefore foo_runtime_suspend() has to check whether there are any pending I/O
requests (while holding the private lock) before allowing the suspend to
proceed.
In addition, the power.autosuspend_delay field can be changed by user space at
any time. If a driver cares about this, it can call
pm_runtime_autosuspend_expiration() from within the ->runtime_suspend()
callback while holding its private lock. If the function returns a nonzero
value then the delay has not yet expired and the callback should return
-EAGAIN.

View File

@@ -0,0 +1,81 @@
How to get s2ram working
~~~~~~~~~~~~~~~~~~~~~~~~
2006 Linus Torvalds
2006 Pavel Machek
1) Check suspend.sf.net, program s2ram there has long whitelist of
"known ok" machines, along with tricks to use on each one.
2) If that does not help, try reading tricks.txt and
video.txt. Perhaps problem is as simple as broken module, and
simple module unload can fix it.
3) You can use Linus' TRACE_RESUME infrastructure, described below.
Using TRACE_RESUME
~~~~~~~~~~~~~~~~~~
I've been working at making the machines I have able to STR, and almost
always it's a driver that is buggy. Thank God for the suspend/resume
debugging - the thing that Chuck tried to disable. That's often the _only_
way to debug these things, and it's actually pretty powerful (but
time-consuming - having to insert TRACE_RESUME() markers into the device
driver that doesn't resume and recompile and reboot).
Anyway, the way to debug this for people who are interested (have a
machine that doesn't boot) is:
- enable PM_DEBUG, and PM_TRACE
- use a script like this:
#!/bin/sh
sync
echo 1 > /sys/power/pm_trace
echo mem > /sys/power/state
to suspend
- if it doesn't come back up (which is usually the problem), reboot by
holding the power button down, and look at the dmesg output for things
like
Magic number: 4:156:725
hash matches drivers/base/power/resume.c:28
hash matches device 0000:01:00.0
which means that the last trace event was just before trying to resume
device 0000:01:00.0. Then figure out what driver is controlling that
device (lspci and /sys/devices/pci* is your friend), and see if you can
fix it, disable it, or trace into its resume function.
If no device matches the hash (or any matches appear to be false positives),
the culprit may be a device from a loadable kernel module that is not loaded
until after the hash is checked. You can check the hash against the current
devices again after more modules are loaded using sysfs:
cat /sys/power/pm_trace_dev_match
For example, the above happens to be the VGA device on my EVO, which I
used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out
that "radeonfb" simply cannot resume that device - it tries to set the
PLL's, and it just _hangs_. Using the regular VGA console and letting X
resume it instead works fine.
NOTE
====
pm_trace uses the system's Real Time Clock (RTC) to save the magic number.
Reason for this is that the RTC is the only reliably available piece of
hardware during resume operations where a value can be set that will
survive a reboot.
Consequence is that after a resume (even if it is successful) your system
clock will have a value corresponding to the magic number instead of the
correct date/time! It is therefore advisable to use a program like ntp-date
or rdate to reset the correct date/time from an external time source when
using this trace option.
As the clock keeps ticking it is also essential that the reboot is done
quickly after the resume failure. The trace option does not use the seconds
or the low order bits of the minutes of the RTC, but a too long delay will
corrupt the magic value.

View File

@@ -0,0 +1,80 @@
System Power Management States
The kernel supports three power management states generically, though
each is dependent on platform support code to implement the low-level
details for each state. This file describes each state, what they are
commonly called, what ACPI state they map to, and what string to write
to /sys/power/state to enter that state
State: Standby / Power-On Suspend
ACPI State: S1
String: "standby"
This state offers minimal, though real, power savings, while providing
a very low-latency transition back to a working system. No operating
state is lost (the CPU retains power), so the system easily starts up
again where it left off.
We try to put devices in a low-power state equivalent to D1, which
also offers low power savings, but low resume latency. Not all devices
support D1, and those that don't are left on.
A transition from Standby to the On state should take about 1-2
seconds.
State: Suspend-to-RAM
ACPI State: S3
String: "mem"
This state offers significant power savings as everything in the
system is put into a low-power state, except for memory, which is
placed in self-refresh mode to retain its contents.
System and device state is saved and kept in memory. All devices are
suspended and put into D3. In many cases, all peripheral buses lose
power when entering STR, so devices must be able to handle the
transition back to the On state.
For at least ACPI, STR requires some minimal boot-strapping code to
resume the system from STR. This may be true on other platforms.
A transition from Suspend-to-RAM to the On state should take about
3-5 seconds.
State: Suspend-to-disk
ACPI State: S4
String: "disk"
This state offers the greatest power savings, and can be used even in
the absence of low-level platform support for power management. This
state operates similarly to Suspend-to-RAM, but includes a final step
of writing memory contents to disk. On resume, this is read and memory
is restored to its pre-suspend state.
STD can be handled by the firmware or the kernel. If it is handled by
the firmware, it usually requires a dedicated partition that must be
setup via another operating system for it to use. Despite the
inconvenience, this method requires minimal work by the kernel, since
the firmware will also handle restoring memory contents on resume.
For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used
to write memory contents to free swap space. swsusp has some restrictive
requirements, but should work in most cases. Some, albeit outdated,
documentation can be found in Documentation/power/swsusp.txt.
Alternatively, userspace can do most of the actual suspend to disk work,
see userland-swsusp.txt.
Once memory state is written to disk, the system may either enter a
low-power state (like ACPI S4), or it may simply power down. Powering
down offers greater savings, and allows this mechanism to work on any
system. However, entering a real low-power state allows the user to
trigger wake up events (e.g. pressing a key or opening a laptop lid).
A transition from Suspend-to-Disk to the On state should take about 30
seconds, though it's typically a bit more with the current
implementation.

View File

@@ -0,0 +1,275 @@
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
(C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?
Well, a picture is worth a thousand words... So ASCII art follows :-)
[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]
On a high level, the suspend-resume cycle goes like this:
|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
|tasks | | cpus | | | | cpus | |tasks|
More details follow:
Suspend call path
-----------------
Write 'mem' to
/sys/power/state
syfs file
|
v
Acquire pm_mutex lock
|
v
Send PM_SUSPEND_PREPARE
notifications
|
v
Freeze tasks
|
|
v
disable_nonboot_cpus()
/* start */
|
v
Acquire cpu_add_remove_lock
|
v
Iterate over CURRENTLY
online CPUs
|
|
| ----------
v | L
======> _cpu_down() |
| [This takes cpuhotplug.lock |
Common | before taking down the CPU |
code | and releases it when done] | O
| While it is at it, notifications |
| are sent when notable events occur, |
======> by running all registered callbacks. |
| | O
| |
| |
v |
Note down these cpus in | P
frozen_cpus mask ----------
|
v
Disable regular cpu hotplug
by setting cpu_hotplug_disabled=1
|
v
Release cpu_add_remove_lock
|
v
/* disable_nonboot_cpus() complete */
|
v
Do suspend
Resuming back is likewise, with the counterparts being (in the order of
execution during resume):
* enable_nonboot_cpus() which involves:
| Acquire cpu_add_remove_lock
| Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
| Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
| Release cpu_add_remove_lock
v
* thaw tasks
* send PM_POST_SUSPEND notifications
* Release pm_mutex lock.
It is to be noted here that the pm_mutex lock is acquired at the very
beginning, when we are just starting out to suspend, and then released only
after the entire cycle is complete (i.e., suspend + resume).
Regular CPU hotplug call path
-----------------------------
Write 0 (or 1) to
/sys/devices/system/cpu/cpu*/online
sysfs file
|
|
v
cpu_down()
|
v
Acquire cpu_add_remove_lock
|
v
If cpu_hotplug_disabled is 1
return gracefully
|
|
v
======> _cpu_down()
| [This takes cpuhotplug.lock
Common | before taking down the CPU
code | and releases it when done]
| While it is at it, notifications
| are sent when notable events occur,
======> by running all registered callbacks.
|
|
v
Release cpu_add_remove_lock
[That's it!, for
regular CPU hotplug]
So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions,
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
argument. But during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]
Important files and functions/entry points:
------------------------------------------
kernel/power/process.c : freeze_processes(), thaw_processes()
kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
II. What are the issues involved in CPU hotplug?
-------------------------------------------
There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:
[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_class.c]
a. When all the CPUs are identical:
This is the most common situation and it is quite straightforward: we want
to apply the same microcode revision to each of the CPUs.
To give an example of x86, the collect_cpu_info() function defined in
arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
and thereby in applying the correct microcode revision to it.
But note that the kernel does not maintain a common microcode image for the
all CPUs, in order to handle case 'b' described below.
b. When some of the CPUs are different than the rest:
In this case since we probably need to apply different microcode revisions
to different CPUs, the kernel maintains a copy of the correct microcode
image for each CPU (after appropriate CPU type/model discovery using
functions such as collect_cpu_info()).
c. When a CPU is physically hot-unplugged and a new (and possibly different
type of) CPU is hot-plugged into the system:
In the current design of the kernel, whenever a CPU is taken offline during
a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
(which is sent by the CPU hotplug code), the microcode update driver's
callback for that event reacts by freeing the kernel's copy of the
microcode image for that CPU.
Hence, when a new CPU is brought online, since the kernel finds that it
doesn't have the microcode image, it does the CPU type/model discovery
afresh and then requests the userspace for the appropriate microcode image
for that CPU, which is subsequently applied.
For example, in x86, the mc_cpu_callback() function (which is the microcode
update driver's callback registered for CPU hotplug events) calls
microcode_update_cpu() which would call microcode_init_cpu() in this case,
instead of microcode_resume_cpu() when it finds that the kernel doesn't
have a valid microcode image. This ensures that the CPU type/model
discovery is performed and the right microcode is applied to the CPU after
getting it from userspace.
d. Handling microcode update during suspend/hibernate:
Strictly speaking, during a CPU hotplug operation which does not involve
physically removing or inserting CPUs, the CPUs are not actually powered
off during a CPU offline. They are just put to the lowest C-states possible.
Hence, in such a case, it is not really necessary to re-apply microcode
when the CPUs are brought back online, since they wouldn't have lost the
image during the CPU offline operation.
This is the usual scenario encountered during a resume after a suspend.
However, in the case of hibernation, since all the CPUs are completely
powered off, during restore it becomes necessary to apply the microcode
images to all the CPUs.
[Note that we don't expect someone to physically pull out nodes and insert
nodes with a different type of CPUs in-between a suspend-resume or a
hibernate/restore cycle.]
In the current design of the kernel however, during a CPU offline operation
as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
the existing copy of microcode image in the kernel is not freed up.
And during the CPU online operations (during resume/restore), since the
kernel finds that it already has copies of the microcode images for all the
CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
type/model and the need for validating whether the microcode revisions are
right for the CPUs or not (due to the above assumption that physical CPU
hotplug will not be done in-between suspend/resume or hibernate/restore
cycles).
III. Are there any known problems when regular CPU hotplug and suspend race
with each other?
Yes, they are listed below:
1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
the _cpu_down() and _cpu_up() functions is *always* 0.
This might not reflect the true current state of the system, since the
tasks could have been frozen by an out-of-band event such as a suspend
operation in progress. Hence, it will lead to wrong notifications being
sent during the cpu online/offline events (eg, CPU_ONLINE notification
instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
inappropriate code by the callbacks registered for such CPU hotplug events.
2. If a regular CPU hotplug stress test happens to race with the freezer due
to a suspend operation in progress at the same time, then we could hit the
situation described below:
* A regular cpu online operation continues its journey from userspace
into the kernel, since the freezing has not yet begun.
* Then freezer gets to work and freezes userspace.
* If cpu online has not yet completed the microcode update stuff by now,
it will now start waiting on the frozen userspace in the
TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
* Now the freezer continues and tries to freeze the remaining tasks. But
due to this wait mentioned above, the freezer won't be able to freeze
the cpu online hotplug task and hence freezing of tasks fails.
As a result of this task freezing failure, the suspend operation gets
aborted.

View File

@@ -0,0 +1,60 @@
Using swap files with software suspend (swsusp)
(C) 2006 Rafael J. Wysocki <rjw@sisk.pl>
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:
(1) swap files need not be contiguous,
(2) the header of a swap file is not in the first block of the partition that
holds it. From the swsusp's point of view (1) is not a problem, because it is
already taken care of by the swap-handling code, but (2) has to be taken into
consideration.
In principle the location of a swap file's header may be determined with the
help of appropriate filesystem driver. Unfortunately, however, it requires the
filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during resume from disk. For this reason to
identify a swap file swsusp uses the name of the partition that holds the file
and the offset from the beginning of the partition at which the swap file's
header is located. For convenience, this offset is expressed in <PAGE_SIZE>
units.
In order to use a swap file with swsusp, you need to:
1) Create the swap file and make it active, eg.
# dd if=/dev/zero of=<swap_file_path> bs=1024 count=<swap_file_size_in_k>
# mkswap <swap_file_path>
# swapon <swap_file_path>
2) Use an application that will bmap the swap file with the help of the
FIBMAP ioctl and determine the location of the file's swap header, as the
offset, in <PAGE_SIZE> units, from the beginning of the partition which
holds the swap file.
3) Add the following parameters to the kernel command line:
resume=<swap_file_partition> resume_offset=<swap_file_offset>
where <swap_file_partition> is the partition on which the swap file is located
and <swap_file_offset> is the offset of the swap header determined by the
application in 2) (of course, this step may be carried out automatically
by the same application that determines the swap file's header offset using the
FIBMAP ioctl)
OR
Use a userland suspend application that will set the partition and offset
with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in
Documentation/power/userland-swsusp.txt (this is the only method to suspend
to a swap file allowing the resume to be initiated from an initrd or initramfs
image).
Now, swsusp will use the swap file in the same way in which it would use a swap
partition. In particular, the swap file has to be active (ie. be present in
/proc/swaps) so that it can be used for suspending.
Note that if the swap file used for suspending is deleted and recreated,
the location of its header need not be the same as before. Thus every time
this happens the value of the "resume_offset=" kernel command line parameter
has to be updated.

View File

@@ -0,0 +1,138 @@
Author: Andreas Steinmetz <ast@domdv.de>
How to use dm-crypt and swsusp together:
========================================
Some prerequisites:
You know how dm-crypt works. If not, visit the following web page:
http://www.saout.de/misc/dm-crypt/
You have read Documentation/power/swsusp.txt and understand it.
You did read Documentation/initrd.txt and know how an initrd works.
You know how to create or how to modify an initrd.
Now your system is properly set up, your disk is encrypted except for
the swap device(s) and the boot partition which may contain a mini
system for crypto setup and/or rescue purposes. You may even have
an initrd that does your current crypto setup already.
At this point you want to encrypt your swap, too. Still you want to
be able to suspend using swsusp. This, however, means that you
have to be able to either enter a passphrase or that you read
the key(s) from an external device like a pcmcia flash disk
or an usb stick prior to resume. So you need an initrd, that sets
up dm-crypt and then asks swsusp to resume from the encrypted
swap device.
The most important thing is that you set up dm-crypt in such
a way that the swap device you suspend to/resume from has
always the same major/minor within the initrd as well as
within your running system. The easiest way to achieve this is
to always set up this swap device first with dmsetup, so that
it will always look like the following:
brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0
Now set up your kernel to use /dev/mapper/swap0 as the default
resume partition, so your kernel .config contains:
CONFIG_PM_STD_PARTITION="/dev/mapper/swap0"
Prepare your boot loader to use the initrd you will create or
modify. For lilo the simplest setup looks like the following
lines:
image=/boot/vmlinuz
initrd=/boot/initrd.gz
label=linux
append="root=/dev/ram0 init=/linuxrc rw"
Finally you need to create or modify your initrd. Lets assume
you create an initrd that reads the required dm-crypt setup
from a pcmcia flash disk card. The card is formatted with an ext2
fs which resides on /dev/hde1 when the card is inserted. The
card contains at least the encrypted swap setup in a file
named "swapkey". /etc/fstab of your initrd contains something
like the following:
/dev/hda1 /mnt ext3 ro 0 0
none /proc proc defaults,noatime,nodiratime 0 0
none /sys sysfs defaults,noatime,nodiratime 0 0
/dev/hda1 contains an unencrypted mini system that sets up all
of your crypto devices, again by reading the setup from the
pcmcia flash disk. What follows now is a /linuxrc for your
initrd that allows you to resume from encrypted swap and that
continues boot with your mini system on /dev/hda1 if resume
does not happen:
#!/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
mount /proc
mount /sys
mapped=0
noresume=`grep -c noresume /proc/cmdline`
if [ "$*" != "" ]
then
noresume=1
fi
dmesg -n 1
/sbin/cardmgr -q
for i in 1 2 3 4 5 6 7 8 9 0
do
if [ -f /proc/ide/hde/media ]
then
usleep 500000
mount -t ext2 -o ro /dev/hde1 /mnt
if [ -f /mnt/swapkey ]
then
dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1
fi
umount /mnt
break
fi
usleep 500000
done
killproc /sbin/cardmgr
dmesg -n 6
if [ $mapped = 1 ]
then
if [ $noresume != 0 ]
then
mkswap /dev/mapper/swap0 > /dev/null 2>&1
fi
echo 254:0 > /sys/power/resume
dmsetup remove swap0
fi
umount /sys
mount /mnt
umount /proc
cd /mnt
pivot_root . mnt
mount /proc
umount -l /mnt
umount /proc
exec chroot . /sbin/init $* < dev/console > dev/console 2>&1
Please don't mind the weird loop above, busybox's msh doesn't know
the let statement. Now, what is happening in the script?
First we have to decide if we want to try to resume, or not.
We will not resume if booting with "noresume" or any parameters
for init like "single" or "emergency" as boot parameters.
Then we need to set up dmcrypt with the setup data from the
pcmcia flash disk. If this succeeds we need to reset the swap
device if we don't want to resume. The line "echo 254:0 > /sys/power/resume"
then attempts to resume from the first device mapper device.
Note that it is important to set the device in /sys/power/resume,
regardless if resuming or not, otherwise later suspend will fail.
If resume starts, script execution terminates here.
Otherwise we just remove the encrypted swap device and leave it to the
mini system on /dev/hda1 to set the whole crypto up (it is up to
you to modify this to your taste).
What then follows is the well known process to change the root
file system and continue booting from there. I prefer to unmount
the initrd prior to continue booting but it is up to you to modify
this.

View File

@@ -0,0 +1,408 @@
Some warnings, first.
* BIG FAT WARNING *********************************************************
*
* If you touch anything on disk between suspend and resume...
* ...kiss your data goodbye.
*
* If you do resume from initrd after your filesystems are mounted...
* ...bye bye root partition.
* [this is actually same case as above]
*
* If you have unsupported (*) devices using DMA, you may have some
* problems. If your disk driver does not support suspend... (IDE does),
* it may cause some problems, too. If you change kernel command line
* between suspend and resume, it may do something wrong. If you change
* your hardware while system is suspended... well, it was not good idea;
* but it will probably only crash.
*
* (*) suspend/resume support is needed to make it safe.
*
* If you have any filesystems on USB devices mounted before software suspend,
* they won't be accessible after resume and you may lose data, as though
* you have unplugged the USB devices with mounted filesystems on them;
* see the FAQ below for details. (This is not true for more traditional
* power states like "standby", which normally don't turn USB off.)
You need to append resume=/dev/your_swap_partition to kernel command
line. Then you suspend by
echo shutdown > /sys/power/disk; echo disk > /sys/power/state
. If you feel ACPI works pretty well on your system, you might try
echo platform > /sys/power/disk; echo disk > /sys/power/state
. If you have SATA disks, you'll need recent kernels with SATA suspend
support. For suspend and resume to work, make sure your disk drivers
are built into kernel -- not modules. [There's way to make
suspend/resume with modular disk drivers, see FAQ, but you probably
should not do that.]
If you want to limit the suspend image size to N bytes, do
echo N > /sys/power/image_size
before suspend (it is limited to 500 MB by default).
Article about goals and implementation of Software Suspend for Linux
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Author: G‚ábor Kuti
Last revised: 2003-10-20 by Pavel Machek
Idea and goals to achieve
Nowadays it is common in several laptops that they have a suspend button. It
saves the state of the machine to a filesystem or to a partition and switches
to standby mode. Later resuming the machine the saved state is loaded back to
ram and the machine can continue its work. It has two real benefits. First we
save ourselves the time machine goes down and later boots up, energy costs
are real high when running from batteries. The other gain is that we don't have to
interrupt our programs so processes that are calculating something for a long
time shouldn't need to be written interruptible.
swsusp saves the state of the machine into active swaps and then reboots or
powerdowns. You must explicitly specify the swap partition to resume from with
``resume='' kernel option. If signature is found it loads and restores saved
state. If the option ``noresume'' is specified as a boot parameter, it skips
the resuming. If the option ``hibernate=nocompress'' is specified as a boot
parameter, it saves hibernation image without compression.
In the meantime while the system is suspended you should not add/remove any
of the hardware, write to the filesystems, etc.
Sleep states summary
====================
There are three different interfaces you can use, /proc/acpi should
work like this:
In a really perfect world:
echo 1 > /proc/acpi/sleep # for standby
echo 2 > /proc/acpi/sleep # for suspend to ram
echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative
echo 4 > /proc/acpi/sleep # for suspend to disk
echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system
and perhaps
echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios
Frequently Asked Questions
==========================
Q: well, suspending a server is IMHO a really stupid thing,
but... (Diego Zuccato):
A: You bought new UPS for your server. How do you install it without
bringing machine down? Suspend to disk, rearrange power cables,
resume.
You have your server on UPS. Power died, and UPS is indicating 30
seconds to failure. What do you do? Suspend to disk.
Q: Maybe I'm missing something, but why don't the regular I/O paths work?
A: We do use the regular I/O paths. However we cannot restore the data
to its original location as we load it. That would create an
inconsistent kernel state which would certainly result in an oops.
Instead, we load the image into unused memory and then atomically copy
it back to it original location. This implies, of course, a maximum
image size of half the amount of memory.
There are two solutions to this:
* require half of memory to be free during suspend. That way you can
read "new" data onto free spots, then cli and copy
* assume we had special "polling" ide driver that only uses memory
between 0-640KB. That way, I'd have to make sure that 0-640KB is free
during suspending, but otherwise it would work...
suspend2 shares this fundamental limitation, but does not include user
data and disk caches into "used memory" by saving them in
advance. That means that the limitation goes away in practice.
Q: Does linux support ACPI S4?
A: Yes. That's what echo platform > /sys/power/disk does.
Q: What is 'suspend2'?
A: suspend2 is 'Software Suspend 2', a forked implementation of
suspend-to-disk which is available as separate patches for 2.4 and 2.6
kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB
highmem and preemption. It also has a extensible architecture that
allows for arbitrary transformations on the image (compression,
encryption) and arbitrary backends for writing the image (eg to swap
or an NFS share[Work In Progress]). Questions regarding suspend2
should be sent to the mailing list available through the suspend2
website, and not to the Linux Kernel Mailing List. We are working
toward merging suspend2 into the mainline kernel.
Q: What is the freezing of tasks and why are we using it?
A: The freezing of tasks is a mechanism by which user space processes and some
kernel threads are controlled during hibernation or system-wide suspend (on some
architectures). See freezing-of-tasks.txt for details.
Q: What is the difference between "platform" and "shutdown"?
A:
shutdown: save state in linux, then tell bios to powerdown
platform: save state in linux, then tell bios to powerdown and blink
"suspended led"
"platform" is actually right thing to do where supported, but
"shutdown" is most reliable (except on ACPI systems).
Q: I do not understand why you have such strong objections to idea of
selective suspend.
A: Do selective suspend during runtime power management, that's okay. But
it's useless for suspend-to-disk. (And I do not see how you could use
it for suspend-to-ram, I hope you do not want that).
Lets see, so you suggest to
* SUSPEND all but swap device and parents
* Snapshot
* Write image to disk
* SUSPEND swap device and parents
* Powerdown
Oh no, that does not work, if swap device or its parents uses DMA,
you've corrupted data. You'd have to do
* SUSPEND all but swap device and parents
* FREEZE swap device and parents
* Snapshot
* UNFREEZE swap device and parents
* Write
* SUSPEND swap device and parents
Which means that you still need that FREEZE state, and you get more
complicated code. (And I have not yet introduce details like system
devices).
Q: There don't seem to be any generally useful behavioral
distinctions between SUSPEND and FREEZE.
A: Doing SUSPEND when you are asked to do FREEZE is always correct,
but it may be unnecessarily slow. If you want your driver to stay simple,
slowness may not matter to you. It can always be fixed later.
For devices like disk it does matter, you do not want to spindown for
FREEZE.
Q: After resuming, system is paging heavily, leading to very bad interactivity.
A: Try running
cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
after resume. swapoff -a; swapon -a may also be useful.
Q: What happens to devices during swsusp? They seem to be resumed
during system suspend?
A: That's correct. We need to resume them if we want to write image to
disk. Whole sequence goes like
Suspend part
~~~~~~~~~~~~
running system, user asks for suspend-to-disk
user processes are stopped
suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
with state snapshot
state snapshot: copy of whole used memory is taken with interrupts disabled
resume(): devices are woken up so that we can write image to swap
write image to swap
suspend(PMSG_SUSPEND): suspend devices so that we can power off
turn the power off
Resume part
~~~~~~~~~~~
(is actually pretty similar)
running system, user asks for suspend-to-disk
user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows)
read image from disk
suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
with image restoration
image restoration: rewrite memory with image
resume(): devices are woken up so that system can continue
thaw all user processes
Q: What is this 'Encrypt suspend image' for?
A: First of all: it is not a replacement for dm-crypt encrypted swap.
It cannot protect your computer while it is suspended. Instead it does
protect from leaking sensitive data after resume from suspend.
Think of the following: you suspend while an application is running
that keeps sensitive data in memory. The application itself prevents
the data from being swapped out. Suspend, however, must write these
data to swap to be able to resume later on. Without suspend encryption
your sensitive data are then stored in plaintext on disk. This means
that after resume your sensitive data are accessible to all
applications having direct access to the swap device which was used
for suspend. If you don't need swap after resume these data can remain
on disk virtually forever. Thus it can happen that your system gets
broken in weeks later and sensitive data which you thought were
encrypted and protected are retrieved and stolen from the swap device.
To prevent this situation you should use 'Encrypt suspend image'.
During suspend a temporary key is created and this key is used to
encrypt the data written to disk. When, during resume, the data was
read back into memory the temporary key is destroyed which simply
means that all data written to disk during suspend are then
inaccessible so they can't be stolen later on. The only thing that
you must then take care of is that you call 'mkswap' for the swap
partition used for suspend as early as possible during regular
boot. This asserts that any temporary key from an oopsed suspend or
from a failed or aborted resume is erased from the swap device.
As a rule of thumb use encrypted swap to protect your data while your
system is shut down or suspended. Additionally use the encrypted
suspend image to prevent sensitive data from being stolen after
resume.
Q: Can I suspend to a swap file?
A: Generally, yes, you can. However, it requires you to use the "resume=" and
"resume_offset=" kernel command line parameters, so the resume from a swap file
cannot be initiated from an initrd or initramfs image. See
swsusp-and-swap-files.txt for details.
Q: Is there a maximum system RAM size that is supported by swsusp?
A: It should work okay with highmem.
Q: Does swsusp (to disk) use only one swap partition or can it use
multiple swap partitions (aggregate them into one logical space)?
A: Only one swap partition, sorry.
Q: If my application(s) causes lots of memory & swap space to be used
(over half of the total system RAM), is it correct that it is likely
to be useless to try to suspend to disk while that app is running?
A: No, it should work okay, as long as your app does not mlock()
it. Just prepare big enough swap partition.
Q: What information is useful for debugging suspend-to-disk problems?
A: Well, last messages on the screen are always useful. If something
is broken, it is usually some kernel driver, therefore trying with as
little as possible modules loaded helps a lot. I also prefer people to
suspend from console, preferably without X running. Booting with
init=/bin/bash, then swapon and starting suspend sequence manually
usually does the trick. Then it is good idea to try with latest
vanilla kernel.
Q: How can distributions ship a swsusp-supporting kernel with modular
disk drivers (especially SATA)?
A: Well, it can be done, load the drivers, then do echo into
/sys/power/disk/resume file from initrd. Be sure not to mount
anything, not even read-only mount, or you are going to lose your
data.
Q: How do I make suspend more verbose?
A: If you want to see any non-error kernel messages on the virtual
terminal the kernel switches to during suspend, you have to set the
kernel console loglevel to at least 4 (KERN_WARNING), for example by
doing
# save the old loglevel
read LOGLEVEL DUMMY < /proc/sys/kernel/printk
# set the loglevel so we see the progress bar.
# if the level is higher than needed, we leave it alone.
if [ $LOGLEVEL -lt 5 ]; then
echo 5 > /proc/sys/kernel/printk
fi
IMG_SZ=0
read IMG_SZ < /sys/power/image_size
echo -n disk > /sys/power/state
RET=$?
#
# the logic here is:
# if image_size > 0 (without kernel support, IMG_SZ will be zero),
# then try again with image_size set to zero.
if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size
echo 0 > /sys/power/image_size
echo -n disk > /sys/power/state
RET=$?
fi
# restore previous loglevel
echo $LOGLEVEL > /proc/sys/kernel/printk
exit $RET
Q: Is this true that if I have a mounted filesystem on a USB device and
I suspend to disk, I can lose data unless the filesystem has been mounted
with "sync"?
A: That's right ... if you disconnect that device, you may lose data.
In fact, even with "-o sync" you can lose data if your programs have
information in buffers they haven't written out to a disk you disconnect,
or if you disconnect before the device finished saving data you wrote.
Software suspend normally powers down USB controllers, which is equivalent
to disconnecting all USB devices attached to your system.
Your system might well support low-power modes for its USB controllers
while the system is asleep, maintaining the connection, using true sleep
modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the
/sys/power/state file; write "standby" or "mem".) We've not seen any
hardware that can use these modes through software suspend, although in
theory some systems might support "platform" modes that won't break the
USB connections.
Remember that it's always a bad idea to unplug a disk drive containing a
mounted filesystem. That's true even when your system is asleep! The
safest thing is to unmount all filesystems on removable media (such USB,
Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays)
before suspending; then remount them after resuming.
There is a work-around for this problem. For more information, see
Documentation/usb/persist.txt.
Q: Can I suspend-to-disk using a swap partition under LVM?
A: No. You can suspend successfully, but you'll not be able to
resume. uswsusp should be able to work with LVM. See suspend.sf.net.
Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were
compiled with the similar configuration files. Anyway I found that
suspend to disk (and resume) is much slower on 2.6.16 compared to
2.6.15. Any idea for why that might happen or how can I speed it up?
A: This is because the size of the suspend image is now greater than
for 2.6.15 (by saving more data we can get more responsive system
after resume).
There's the /sys/power/image_size knob that controls the size of the
image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as
root), the 2.6.15 behavior should be restored. If it is still too
slow, take a look at suspend.sf.net -- userland suspend is faster and
supports LZF compression to speed it up further.

View File

@@ -0,0 +1,27 @@
swsusp/S3 tricks
~~~~~~~~~~~~~~~~
Pavel Machek <pavel@ucw.cz>
If you want to trick swsusp/S3 into working, you might want to try:
* go with minimal config, turn off drivers like USB, AGP you don't
really need
* turn off APIC and preempt
* use ext2. At least it has working fsck. [If something seems to go
wrong, force fsck when you have a chance]
* turn off modules
* use vga text console, shut down X. [If you really want X, you might
want to try vesafb later]
* try running as few processes as possible, preferably go to single
user mode.
* due to video issues, swsusp should be easier to get working than
S3. Try that first.
When you make it work, try to find out what exactly was it that broke
suspend, and preferably fix that.

View File

@@ -0,0 +1,170 @@
Documentation for userland software suspend interface
(C) 2006 Rafael J. Wysocki <rjw@sisk.pl>
First, the warnings at the beginning of swsusp.txt still apply.
Second, you should read the FAQ in swsusp.txt _now_ if you have not
done it already.
Now, to use the userland interface for software suspend you need special
utilities that will read/write the system memory snapshot from/to the
kernel. Such utilities are available, for example, from
<http://suspend.sourceforge.net>. You may want to have a look at them if you
are going to develop your own suspend/resume utilities.
The interface consists of a character device providing the open(),
release(), read(), and write() operations as well as several ioctl()
commands defined in include/linux/suspend_ioctls.h . The major and minor
numbers of the device are, respectively, 10 and 231, and they can
be read from /sys/class/misc/snapshot/dev.
The device can be open either for reading or for writing. If open for
reading, it is considered to be in the suspend mode. Otherwise it is
assumed to be in the resume mode. The device cannot be open for simultaneous
reading and writing. It is also impossible to have the device open more than
once at a time.
Even opening the device has side effects. Data structures are
allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are
called.
The ioctl() commands recognized by the device are:
SNAPSHOT_FREEZE - freeze user space processes (the current process is
not frozen); this is required for SNAPSHOT_CREATE_IMAGE
and SNAPSHOT_ATOMIC_RESTORE to succeed
SNAPSHOT_UNFREEZE - thaw user space processes frozen by SNAPSHOT_FREEZE
SNAPSHOT_CREATE_IMAGE - create a snapshot of the system memory; the
last argument of ioctl() should be a pointer to an int variable,
the value of which will indicate whether the call returned after
creating the snapshot (1) or after restoring the system memory state
from it (0) (after resume the system finds itself finishing the
SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot
has been created the read() operation can be used to transfer
it out of the kernel
SNAPSHOT_ATOMIC_RESTORE - restore the system memory state from the
uploaded snapshot image; before calling it you should transfer
the system memory snapshot back to the kernel using the write()
operation; this call will not succeed if the snapshot
image is not available to the kernel
SNAPSHOT_FREE - free memory allocated for the snapshot image
SNAPSHOT_PREF_IMAGE_SIZE - set the preferred maximum size of the image
(the kernel will do its best to ensure the image size will not exceed
this number, but if it turns out to be impossible, the kernel will
create the smallest image possible)
SNAPSHOT_GET_IMAGE_SIZE - return the actual size of the hibernation image
SNAPSHOT_AVAIL_SWAP_SIZE - return the amount of available swap in bytes (the
last argument should be a pointer to an unsigned int variable that will
contain the result if the call is successful).
SNAPSHOT_ALLOC_SWAP_PAGE - allocate a swap page from the resume partition
(the last argument should be a pointer to a loff_t variable that
will contain the swap page offset if the call is successful)
SNAPSHOT_FREE_SWAP_PAGES - free all swap pages allocated by
SNAPSHOT_ALLOC_SWAP_PAGE
SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in <PAGE_SIZE>
units) from the beginning of the partition at which the swap header is
located (the last ioctl() argument should point to a struct
resume_swap_area, as defined in kernel/power/suspend_ioctls.h,
containing the resume device specification and the offset); for swap
partitions the offset is always 0, but it is different from zero for
swap files (see Documentation/power/swsusp-and-swap-files.txt for
details).
SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support,
depending on the argument value (enable, if the argument is nonzero)
SNAPSHOT_POWER_OFF - make the kernel transition the system to the hibernation
state (eg. ACPI S4) using the platform (eg. ACPI) driver
SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to
immediately enter the suspend-to-RAM state, so this call must always
be preceded by the SNAPSHOT_FREEZE call and it is also necessary
to use the SNAPSHOT_UNFREEZE call after the system wakes up. This call
is needed to implement the suspend-to-both mechanism in which the
suspend image is first created, as though the system had been suspended
to disk, and then the system is suspended to RAM (this makes it possible
to resume the system from RAM if there's enough battery power or restore
its state on the basis of the saved suspend image otherwise)
The device's read() operation can be used to transfer the snapshot image from
the kernel. It has the following limitations:
- you cannot read() more than one virtual memory page at a time
- read()s across page boundaries are impossible (ie. if ypu read() 1/2 of
a page in the previous call, you will only be able to read()
_at_ _most_ 1/2 of the page in the next call)
The device's write() operation is used for uploading the system memory snapshot
into the kernel. It has the same limitations as the read() operation.
The release() operation frees all memory allocated for the snapshot image
and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if any).
Thus it is not necessary to use either SNAPSHOT_FREE or
SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also
unfreeze user space processes frozen by SNAPSHOT_UNFREEZE if they are
still frozen when the device is being closed).
Currently it is assumed that the userland utilities reading/writing the
snapshot image from/to the kernel will use a swap partition, called the resume
partition, or a swap file as storage space (if a swap file is used, the resume
partition is the partition that holds this file). However, this is not really
required, as they can use, for example, a special (blank) suspend partition or
a file on a partition that is unmounted before SNAPSHOT_CREATE_IMAGE and
mounted afterwards.
These utilities MUST NOT make any assumptions regarding the ordering of
data within the snapshot image. The contents of the image are entirely owned
by the kernel and its structure may be changed in future kernel releases.
The snapshot image MUST be written to the kernel unaltered (ie. all of the image
data, metadata and header MUST be written in _exactly_ the same amount, form
and order in which they have been read). Otherwise, the behavior of the
resumed system may be totally unpredictable.
While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the
structure of the snapshot image is consistent with the information stored
in the image header. If any inconsistencies are detected,
SNAPSHOT_ATOMIC_RESTORE will not succeed. Still, this is not a fool-proof
mechanism and the userland utilities using the interface SHOULD use additional
means, such as checksums, to ensure the integrity of the snapshot image.
The suspending and resuming utilities MUST lock themselves in memory,
preferably using mlockall(), before calling SNAPSHOT_FREEZE.
The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE
in the memory location pointed to by the last argument of ioctl() and proceed
in accordance with it:
1. If the value is 1 (ie. the system memory snapshot has just been
created and the system is ready for saving it):
(a) The suspending utility MUST NOT close the snapshot device
_unless_ the whole suspend procedure is to be cancelled, in
which case, if the snapshot image has already been saved, the
suspending utility SHOULD destroy it, preferably by zapping
its header. If the suspend is not to be cancelled, the
system MUST be powered off or rebooted after the snapshot
image has been saved.
(b) The suspending utility SHOULD NOT attempt to perform any
file system operations (including reads) on the file systems
that were mounted before SNAPSHOT_CREATE_IMAGE has been
called. However, it MAY mount a file system that was not
mounted at that time and perform some operations on it (eg.
use it for saving the image).
2. If the value is 0 (ie. the system state has just been restored from
the snapshot image), the suspending utility MUST close the snapshot
device. Afterwards it will be treated as a regular userland process,
so it need not exit.
The resuming utility SHOULD NOT attempt to mount any file systems that could
be mounted before suspend and SHOULD NOT attempt to perform any operations
involving such file systems.
For details, please refer to the source code.

View File

@@ -0,0 +1,185 @@
Video issues with S3 resume
~~~~~~~~~~~~~~~~~~~~~~~~~~~
2003-2006, Pavel Machek
During S3 resume, hardware needs to be reinitialized. For most
devices, this is easy, and kernel driver knows how to do
it. Unfortunately there's one exception: video card. Those are usually
initialized by BIOS, and kernel does not have enough information to
boot video card. (Kernel usually does not even contain video card
driver -- vesafb and vgacon are widely used).
This is not problem for swsusp, because during swsusp resume, BIOS is
run normally so video card is normally initialized. It should not be
problem for S1 standby, because hardware should retain its state over
that.
We either have to run video BIOS during early resume, or interpret it
using vbetool later, or maybe nothing is necessary on particular
system because video state is preserved. Unfortunately different
methods work on different systems, and no known method suits all of
them.
Userland application called s2ram has been developed; it contains long
whitelist of systems, and automatically selects working method for a
given system. It can be downloaded from CVS at
www.sf.net/projects/suspend . If you get a system that is not in the
whitelist, please try to find a working solution, and submit whitelist
entry so that work does not need to be repeated.
Currently, VBE_SAVE method (6 below) works on most
systems. Unfortunately, vbetool only runs after userland is resumed,
so it makes debugging of early resume problems
hard/impossible. Methods that do not rely on userland are preferable.
Details
~~~~~~~
There are a few types of systems where video works after S3 resume:
(1) systems where video state is preserved over S3.
(2) systems where it is possible to call the video BIOS during S3
resume. Unfortunately, it is not correct to call the video BIOS at
that point, but it happens to work on some machines. Use
acpi_sleep=s3_bios.
(3) systems that initialize video card into vga text mode and where
the BIOS works well enough to be able to set video mode. Use
acpi_sleep=s3_mode on these.
(4) on some systems s3_bios kicks video into text mode, and
acpi_sleep=s3_bios,s3_mode is needed.
(5) radeon systems, where X can soft-boot your video card. You'll need
a new enough X, and a plain text console (no vesafb or radeonfb). See
http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information.
Alternatively, you should use vbetool (6) instead.
(6) other radeon systems, where vbetool is enough to bring system back
to life. It needs text console to be working. Do vbetool vbestate
save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool
vbestate restore < /tmp/delme; setfont <whatever>, and your video
should work.
(7) on some systems, it is possible to boot most of kernel, and then
POSTing bios works. Ole Rohne has patch to do just that at
http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
(8) on some systems, you can use the video_post utility and or
do echo 3 > /sys/power/state && /usr/sbin/video_post - which will
initialize the display in console mode. If you are in X, you can switch
to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get
the display working in graphical mode again.
Now, if you pass acpi_sleep=something, and it does not work with your
bios, you'll get a hard crash during resume. Be careful. Also it is
safest to do your experiments with plain old VGA console. The vesafb
and radeonfb (etc) drivers have a tendency to crash the machine during
resume.
You may have a system where none of above works. At that point you
either invent another ugly hack that works, or write proper driver for
your video card (good luck getting docs :-(). Maybe suspending from X
(proper X, knowing your hardware, not XF68_FBcon) might have better
chance of working.
Table of known working notebooks:
Model hack (or "how to do it")
------------------------------------------------------------------------------
Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI
Acer TM 230 s3_bios (2)
Acer TM 242FX vbetool (6)
Acer TM C110 video_post (8)
Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
Acer TM 4052LCi s3_bios (2)
Acer TM 636Lci s3_bios,s3_mode (4)
Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back
Acer TM 660 ??? (*)
Acer TM 800 vga=normal, X patches, see webpage (5) or vbetool (6)
Acer TM 803 vga=normal, X patches, see webpage (5) or vbetool (6)
Acer TM 803LCi vga=normal, vbetool (6)
Arima W730a vbetool needed (6)
Asus L2400D s3_mode (3)(***) (S1 also works OK)
Asus L3350M (SiS 740) (6)
Asus L3800C (Radeon M7) s3_bios (2) (S1 also works OK)
Asus M6887Ne vga=normal, s3_bios (2), use radeon driver instead of fglrx in x.org
Athlon64 desktop prototype s3_bios (2)
Compal CL-50 ??? (*)
Compaq Armada E500 - P3-700 none (1) (S1 also works OK)
Compaq Evo N620c vga=normal, s3_bios (2)
Dell 600m, ATI R250 Lf none (1), but needs xorg-x11-6.8.1.902-1
Dell D600, ATI RV250 vga=normal and X, or try vbestate (6)
Dell D610 vga=normal and X (possibly vbestate (6) too, but not tested)
Dell Inspiron 4000 ??? (*)
Dell Inspiron 500m ??? (*)
Dell Inspiron 510m ???
Dell Inspiron 5150 vbetool needed (6)
Dell Inspiron 600m ??? (*)
Dell Inspiron 8200 ??? (*)
Dell Inspiron 8500 ??? (*)
Dell Inspiron 8600 ??? (*)
eMachines athlon64 machines vbetool needed (6) (someone please get me model #s)
HP NC6000 s3_bios, may not use radeonfb (2); or vbetool (6)
HP NX7000 ??? (*)
HP Pavilion ZD7000 vbetool post needed, need open-source nv driver for X
HP Omnibook XE3 athlon version none (1)
HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV
HP Omnibook XE3L-GF vbetool (6)
HP Omnibook 5150 none (1), (S1 also works OK)
IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work.
IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(]
IBM TP R32 / Type 2658-MMG none (1)
IBM TP R40 2722B3G ??? (*)
IBM TP R50p / Type 1832-22U s3_bios (2)
IBM TP R51 none (1)
IBM TP T30 236681A ??? (*)
IBM TP T40 / Type 2373-MU4 none (1)
IBM TP T40p none (1)
IBM TP R40p s3_bios (2)
IBM TP T41p s3_bios (2), switch to X after resume
IBM TP T42 s3_bios (2)
IBM ThinkPad T42p (2373-GTG) s3_bios (2)
IBM TP X20 ??? (*)
IBM TP X30 s3_bios, s3_mode (4)
IBM TP X31 / Type 2672-XXH none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
IBM TP X32 none (1), but backlight is on and video is trashed after long suspend. s3_bios,s3_mode (4) works too. Perhaps that gets better results?
IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4)
IBM TP 600e none(1), but a switch to console and back to X is needed
Medion MD4220 ??? (*)
Samsung P35 vbetool needed (6)
Sharp PC-AR10 (ATI rage) none (1), backlight does not switch off
Sony Vaio PCG-C1VRX/K s3_bios (2)
Sony Vaio PCG-F403 ??? (*)
Sony Vaio PCG-GRT995MP none (1), works with 'nv' X driver
Sony Vaio PCG-GR7/K none (1), but needs radeonfb, use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
Sony Vaio PCG-N505SN ??? (*)
Sony Vaio vgn-s260 X or boot-radeon can init it (5)
Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will be blank unless you return to X.
Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4)
Toshiba Libretto L5 none (1)
Toshiba Libretto 100CT/110CT vbetool (6)
Toshiba Portege 3020CT s3_mode (3)
Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK)
Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK)
Toshiba Satellite 4090XCDT ??? (*)
Toshiba Satellite P10-554 s3_bios,s3_mode (4)(****)
Toshiba M30 (2) xor X with nvidia driver using internal AGP
Uniwill 244IIO ??? (*)
Known working desktop systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mainboard Graphics card hack (or "how to do it")
------------------------------------------------------------------------------
Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4)
(*) from https://wiki.ubuntu.com/HoaryPMResults, not sure
which options to use. If you know, please tell me.
(***) To be tested with a newer kernel.
(****) Not with SMP kernel, UP only.

View File

@@ -0,0 +1,37 @@
ACPI video extensions
~~~~~~~~~~~~~~~~~~~~~
This driver implement the ACPI Extensions For Display Adapters for
integrated graphics devices on motherboard, as specified in ACPI 2.0
Specification, Appendix B, allowing to perform some basic control like
defining the video POST device, retrieving EDID information or to
setup a video output, etc. Note that this is an ref. implementation
only. It may or may not work for your integrated video device.
Interfaces exposed to userland through /proc/acpi/video:
VGA/info : display the supported video bus device capability like Video ROM, CRT/LCD/TV.
VGA/ROM : Used to get a copy of the display devices' ROM data (up to 4k).
VGA/POST_info : Used to determine what options are implemented.
VGA/POST : Used to get/set POST device.
VGA/DOS : Used to get/set ownership of output switching:
Please refer ACPI spec B.4.1 _DOS
VGA/CRT : CRT output
VGA/LCD : LCD output
VGA/TVO : TV output
VGA/*/brightness : Used to get/set brightness of output device
Notify event through /proc/acpi/event:
#define ACPI_VIDEO_NOTIFY_SWITCH 0x80
#define ACPI_VIDEO_NOTIFY_PROBE 0x81
#define ACPI_VIDEO_NOTIFY_CYCLE 0x82
#define ACPI_VIDEO_NOTIFY_NEXT_OUTPUT 0x83
#define ACPI_VIDEO_NOTIFY_PREV_OUTPUT 0x84
#define ACPI_VIDEO_NOTIFY_CYCLE_BRIGHTNESS 0x82
#define ACPI_VIDEO_NOTIFY_INC_BRIGHTNESS 0x83
#define ACPI_VIDEO_NOTIFY_DEC_BRIGHTNESS 0x84
#define ACPI_VIDEO_NOTIFY_ZERO_BRIGHTNESS 0x85
#define ACPI_VIDEO_NOTIFY_DISPLAY_OFF 0x86