$Id: README,v 1.2 2001/06/21 23:07:06 dwmw2 Exp $
$Log: README,v $
Revision 1.2  2001/06/21 23:07:06  dwmw2
Initial import to MTD CVS

Revision 1.1  2001/06/11 19:34:40  vipin
Added README file to dir.

This is the README file for the "checkfs" power fail test program.
By: Vipin Malik

NOTE: This program requires an external "power cycling box"
connected to one of the com ports of the system under test.
This power cycling box should wait for a random amount of time
after it receives an "ok to power me down" message over the
serial port, and then yank power to the system under test.
(The box that I rigged up and tested with waits anywhere from
0 to ~40 seconds.)

It should then restore power after a few seconds and wait for the
message again.

ABOUT:

This program's primary purpose is to test the reliability
of various file systems under Linux.

SETUP:

You need to set up the file system you want to test and run the
"makefiles" program ONCE. This creates a set of files that are
required by the "checkfs" program.

Also copy the "checkfs" executable program to the same dir.

Then you need to make sure that the program "checkfs" is called
automatically on startup. You can customise the operation of
the "checkfs" program by passing it various cmd line arguments.
Run "checkfs -?" for more details.

****NOTE*******
Make sure that you call the checkfs program only after you have
mounted the file system you want to test (this is obvious), but
also after you have run any "scan" utilities to check for and
fix any file system errors. e2fsck is one such utility for the
ext2 file system. For an automated setup you of course need to
make these scan programs run in standalone (non-interactive) mode
(the -f -y flags for e2fsck, for example).

File systems like JFFS and JFFS2 do not have any such external
utilities, and you may call "checkfs" right after you have mounted
the respective file system under test.

There are two ways you can mount the file system under test:

1. Mount your root fs on a "standard" fs like ext2 and then
mount the file system under test (which may be ext2 on another
partition or device) and then run "checkfs" on this mounted
partition OR

2. Make the fs under test (AND the device that you have put this
fs on) your root fs, and run "checkfs" on the root device (i.e. "/").
You can of course still run checkfs under a separate dir
under your "/" root dir.

I have found the second method to be a particularly stringent
arrangement (and thus preferred when you are trying to break
something).

Using this arrangement I was able to find that JFFS clobbered
some "sister" files on the root fs even though "checkfs" would
run fine through all its own check files.

(I found this out when one of the clobbered sister files happened
to be /bin/bash. The system refused to run rc.local, thus
preventing my "checkfs" program from being launched :)

"checkfs":

The "formatting" reliability of the fs as well as the file data integrity
of files on the fs can be checked using this program.

"Formatting" reliability can only be checked via an indirect method.
If there are severe formatting reliability issues with the file system,
they will most likely cause other system failures that will prevent this
program from running successfully on a power up. This will prevent
an "ok to power me down" message from going out to the power cycling
black box and prevent power from being turned off again.

File data reliability is checked more directly. A fixed number of
files are created in the current dir (using the program "makefiles").

Each file has a random number of bytes in it (set by using the
-s cmd line flag). The number of "ints" in the file is stored as the
first "int" in it (note: 0 length files are not allowed). Each file
is then filled with random data and a 16 bit CRC is appended at the end.

When "checkfs" is run, it runs through all files (with predetermined
file names), one at a time, and checks the number of "ints"
in each as well as the ending CRC.

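As an illustration only, the file layout and the check described above
could be sketched in C roughly as follows. This is NOT the actual
"makefiles"/"checkfs" source. Several details are assumptions made for the
sketch: the function names (crc16, image_size, build_image, image_ok) are
made up, the CRC is taken to be CRC-16/CCITT-FALSE (the text only says
"16 bit CRC"), the leading count is taken to exclude itself, the CRC is
taken to cover the count plus the data, and the image is built in a memory
buffer rather than in a real file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF).
 * Assumption: the real tool may use a different 16 bit CRC. */
uint16_t crc16(const unsigned char *p, size_t len)
{
    uint16_t crc = 0xFFFF;
    while (len--) {
        crc ^= (uint16_t)(*p++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Assumed layout: [int n][n random ints][uint16_t crc of all before it]. */
size_t image_size(int n)
{
    return sizeof(int) * ((size_t)n + 1) + sizeof(uint16_t);
}

/* Fill buf (at least image_size(n) bytes) with one check file image. */
void build_image(unsigned char *buf, int n)
{
    assert(n > 0);                      /* 0 length files are not allowed */
    size_t body = sizeof(int) * ((size_t)n + 1);
    memcpy(buf, &n, sizeof n);          /* count goes in the first int */
    for (int i = 1; i <= n; i++) {
        int r = rand();
        memcpy(buf + (size_t)i * sizeof(int), &r, sizeof r);
    }
    uint16_t crc = crc16(buf, body);
    memcpy(buf + body, &crc, sizeof crc);   /* CRC appended at the end */
}

/* Return 1 if the count and the CRC both match, 0 otherwise. */
int image_ok(const unsigned char *buf, size_t len)
{
    int n;
    uint16_t stored;
    if (len < sizeof n + sizeof stored)
        return 0;
    memcpy(&n, buf, sizeof n);
    if (n <= 0 || image_size(n) != len)
        return 0;                       /* wrong number of ints */
    memcpy(&stored, buf + len - sizeof stored, sizeof stored);
    return crc16(buf, len - sizeof stored) == stored;
}
```

For example, after build_image(buf, 10) the call
image_ok(buf, image_size(10)) returns 1, and it returns 0 once a single
bit of the data is flipped or the image is truncated.
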

The program exits if the number of corrupt files is greater
than a user specified parameter (set by using the -e cmd line flag).

If the number of corrupt files does not exceed this parameter, the corrupt
files are repaired and operation resumes as explained below.

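A minimal sketch of that decision, assuming the -e value is an inclusive
limit (which is what "-e 1 allows at most 1 corrupt file" and the use of
"-e 0" elsewhere in this file suggest). The enum and function names are
hypothetical and are not taken from the "checkfs" source.

```c
#include <assert.h>

enum scan_action { SCAN_HALT, SCAN_REPAIR_AND_CONTINUE };

/* Decide what to do after the startup scan.
 * max_errors is the value given with the -e flag (assumed inclusive). */
enum scan_action decide(int corrupt_files, int max_errors)
{
    assert(corrupt_files >= 0 && max_errors >= 0);
    return corrupt_files > max_errors ? SCAN_HALT
                                      : SCAN_REPAIR_AND_CONTINUE;
}
```

With -e 0 a single corrupt file halts the test; with -e 1 one corrupt
file is repaired and the test goes on, while two halt it.
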

The idea behind allowing a user specified number of corrupt files is as
follows:

If you are testing for "formatting" reliability of a fs, and for
the data reliability of "other" files present on the fs, use -e 1.
"Other" files are defined as sister files on the fs that are not being
written to by the "checkfs" test program.

As mentioned, in this case you would set -e 1, or allow at most 1 file
to be corrupt each time after a power fail. This would be the file
that was probably being written to when power failed (and whose CRC was
not updated to reflect the new data being written). You would check file
systems like ext2 etc. with such a configuration.
(You have no hope that these file systems guarantee that either your
new data or your old data is present in the file if power failed during
the write. Such a guarantee is called "roll back and recover".)

With JFFS2 I tested for such "roll back and recover" file data reliability
by setting -e 0 and making sure that all writes to the file being
updated are done in a *single* write().

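To illustrate what "a single write()" means here, the sketch below hands a
complete, already assembled file image to the kernel in one call instead of
writing the count, the data and the CRC separately. It is only a sketch
under assumptions: the function name is made up, O_CREAT is there only so
the example works on its own (the real check files already exist), and the
image is assumed to be assembled in memory beforehand.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite the file at path, from offset 0, with exactly one write().
 * Returns 0 on success, -1 on any error or short write. A short write
 * is reported rather than retried, because a retry would turn the
 * update into two writes. */
int rewrite_in_one_call(const char *path, const void *image, size_t len)
{
    assert(image != NULL && len > 0);
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, image, len);  /* the only write() of the update */
    int rc = (n >= 0 && (size_t)n == len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Note that the file is deliberately not truncated first, since that would
be a second, separate change to the file.
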

This is how I found that JFFS2 does NOT (yet) support this functionality.
(There was a great debate over whether this was a bug, a missing feature,
or even an issue at all. See the mtd archives for more details.)

In other words, JFFS2 will partially update a file on FLASH even before
the write() command has completed, thus leaving part old data and part new
data in your file if power failed in the middle of a write().

This is bad functionality if you are updating a binary structure or a
CRC protected file (as in our case).


If All Files Check Out OK:

On the startup scan, if there are no more errors than specified by the
-e flag, an "ok to power me down" message is sent via the specified com port.

The actual format of this message will depend on the format expected
by the power cycling box that will receive this message. One may customise
the actual message that goes out in the "do_pwr_dn()" routine in "comm.c".

This routine is called with an open file descriptor to the com port that
this message needs to go out over and the count of the current power
cycle (in case your power cycling box can display/log this count).

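A routine of this shape might look roughly like the sketch below. It is a
hypothetical stand-in, not the code in "comm.c": the function name and the
message text "PWRDN <count>" are invented for the example, and your power
cycling box will almost certainly expect something else. It also does not
configure the serial port (baud rate etc.), which is assumed to be done
elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Send an "ok to power me down" message with the power cycle count.
 * fd is an already opened descriptor for the com port.
 * Returns 0 if the whole message was written, -1 otherwise. */
int send_power_down(int fd, int cycle_count)
{
    char msg[32];
    assert(fd >= 0);
    /* Made-up message format; replace with what your box expects. */
    int len = snprintf(msg, sizeof msg, "PWRDN %d\r\n", cycle_count);
    if (len < 0 || (size_t)len >= sizeof msg)
        return -1;
    const char *p = msg;
    size_t left = (size_t)len;
    while (left > 0) {                  /* writes to a port may be short */
        ssize_t n = write(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}
```

Calling send_power_down(fd, 42) puts the bytes "PWRDN 42" followed by
CR LF on the port.
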

After this message has been sent out, the checkfs program goes into
a while(1) loop of writing new data (with CRC) into all the
"check files" in the dir, one file at a time.

Its life comes to a sudden end when power is asynchronously pulled from
under its feet (by your external power cycling box).

It comes back to life when power is restored and the system boots and
checkfs is called from the rc.local script file.

The cycle then repeats till a problem is detected, at which point
the "ok to power me down" message is not sent and the cycle stops,
waiting for the user to examine the system.