247 lines
		
	
	
		
			8.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			247 lines
		
	
	
		
			8.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Table of contents
 | |
| =================
 | |
| 
 | |
| Last updated: 20 December 2005
 | |
| 
 | |
| Contents
 | |
| ========
 | |
| 
 | |
| - Introduction
 | |
| - Devices not appearing
 | |
| - Finding patch that caused a bug
 | |
| -- Finding using git-bisect
 | |
| -- Finding it the old way
 | |
| - Fixing the bug
 | |
| 
 | |
| Introduction
 | |
| ============
 | |
| 
 | |
| Always try the latest kernel from kernel.org and build from source. If you are
 | |
| not confident in doing that please report the bug to your distribution vendor
 | |
| instead of to a kernel developer.
 | |
| 
 | |
| Finding bugs is not always easy. Have a go though. If you can't find it don't
 | |
| give up. Report as much as you have found to the relevant maintainer. See
 | |
| MAINTAINERS for who that is for the subsystem you have worked on.
 | |
| 
 | |
| Before you submit a bug report read REPORTING-BUGS.
 | |
| 
 | |
| Devices not appearing
 | |
| =====================
 | |
| 
 | |
| Often this is caused by udev. Check that first before blaming it on the
 | |
| kernel.
 | |
| 
 | |
| Finding patch that caused a bug
 | |
| ===============================
 | |
| 
 | |
| 
 | |
| 
 | |
| Finding using git-bisect
 | |
| ------------------------
 | |
| 
 | |
| Using the provided tools with git makes finding bugs easy provided the bug is
 | |
| reproducible.
 | |
| 
 | |
| Steps to do it:
 | |
| - start using git for the kernel source
 | |
| - read the man page for git-bisect
 | |
| - have fun
 | |
| 
 | |
| Finding it the old way
 | |
| ----------------------
 | |
| 
 | |
| [Sat Mar  2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]
 | |
| 
 | |
| This is how to track down a bug if you know nothing about kernel hacking.
 | |
| It's a brute force approach but it works pretty well.
 | |
| 
 | |
| You need:
 | |
| 
 | |
|         . A reproducible bug - it has to happen predictably (sorry)
 | |
|         . All the kernel tar files from a revision that worked to the
 | |
|           revision that doesn't
 | |
| 
 | |
| You will then do:
 | |
| 
 | |
|         . Rebuild a revision that you believe works, install, and verify that.
 | |
|         . Do a binary search over the kernels to figure out which one
 | |
|           introduced the bug.  I.e., suppose 1.3.28 didn't have the bug, but
 | |
|           you know that 1.3.69 does.  Pick a kernel in the middle and build
 | |
|           that, like 1.3.50.  Build & test; if it works, pick the mid point
 | |
|           between .50 and .69, else the mid point between .28 and .50.
 | |
|         . You'll narrow it down to the kernel that introduced the bug.  You
 | |
|           can probably do better than this but it gets tricky.
 | |
| 
 | |
|         . Narrow it down to a subdirectory
 | |
| 
 | |
|           - Copy kernel that works into "test".  Let's say that 3.62 works,
 | |
|             but 3.63 doesn't.  So you diff -r those two kernels and come
 | |
|             up with a list of directories that changed.  For each of those
 | |
|             directories:
 | |
| 
 | |
|                 Copy the non-working directory next to the working directory
 | |
|                 as "dir.63".
 | |
|                 One directory at time, try moving the working directory to
 | |
|                 "dir.62" and mv dir.63 dir"time, try
 | |
| 
 | |
|                         mv dir dir.62
 | |
|                         mv dir.63 dir
 | |
|                         find dir -name '*.[oa]' -print | xargs rm -f
 | |
| 
 | |
|                 And then rebuild and retest.  Assuming that all related
 | |
|                 changes were contained in the sub directory, this should
 | |
|                 isolate the change to a directory.
 | |
| 
 | |
|                 Problems: changes in header files may have occurred; I've
 | |
|                 found in my case that they were self explanatory - you may
 | |
|                 or may not want to give up when that happens.
 | |
| 
 | |
|         . Narrow it down to a file
 | |
| 
 | |
|           - You can apply the same technique to each file in the directory,
 | |
|             hoping that the changes in that file are self contained.
 | |
| 
 | |
|         . Narrow it down to a routine
 | |
| 
 | |
|           - You can take the old file and the new file and manually create
 | |
|             a merged file that has
 | |
| 
 | |
|                 #ifdef VER62
 | |
|                 routine()
 | |
|                 {
 | |
|                         ...
 | |
|                 }
 | |
|                 #else
 | |
|                 routine()
 | |
|                 {
 | |
|                         ...
 | |
|                 }
 | |
|                 #endif
 | |
| 
 | |
|             And then walk through that file, one routine at a time and
 | |
|             prefix it with
 | |
| 
 | |
|                 #define VER62
 | |
|                 /* both routines here */
 | |
|                 #undef VER62
 | |
| 
 | |
|             Then recompile, retest, move the ifdefs until you find the one
 | |
|             that makes the difference.
 | |
| 
 | |
| Finally, you take all the info that you have, kernel revisions, bug
 | |
| description, the extent to which you have narrowed it down, and pass
 | |
| that off to whomever you believe is the maintainer of that section.
 | |
| A post to linux.dev.kernel isn't such a bad idea if you've done some
 | |
| work to narrow it down.
 | |
| 
 | |
| If you get it down to a routine, you'll probably get a fix in 24 hours.
 | |
| 
 | |
| My apologies to Linus and the other kernel hackers for describing this
 | |
| brute force approach, it's hardly what a kernel hacker would do.  However,
 | |
| it does work and it lets non-hackers help fix bugs.  And it is cool
 | |
| because Linux snapshots will let you do this - something that you can't
 | |
| do with vendor supplied releases.
 | |
| 
 | |
| Fixing the bug
 | |
| ==============
 | |
| 
 | |
| Nobody is going to tell you how to fix bugs. Seriously. You need to work it
 | |
| out. But below are some hints on how to use the tools.
 | |
| 
 | |
| To debug a kernel, use objdump and look for the hex offset from the crash
 | |
| output to find the valid line of code/assembler. Without debug symbols, you
 | |
| will see the assembler code for the routine shown, but if your kernel has
 | |
| debug symbols the C code will also be available. (Debug symbols can be enabled
 | |
| in the kernel hacking menu of the menu configuration.) For example:
 | |
| 
 | |
|     objdump -r -S -l --disassemble net/dccp/ipv4.o
 | |
| 
 | |
| NB.: you need to be at the top level of the kernel tree for this to pick up
 | |
| your C files.
 | |
| 
 | |
| If you don't have access to the code you can also debug on some crash dumps
 | |
| e.g. crash dump output as shown by Dave Miller.
 | |
| 
 | |
| >    EIP is at ip_queue_xmit+0x14/0x4c0
 | |
| >     ...
 | |
| >    Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
 | |
| >    00 00 55 57  56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
 | |
| >    <8b> 83 3c 01 00 00 89 44  24 14 8b 45 28 85 c0 89 44 24 18 0f 85
 | |
| >
 | |
| >    Put the bytes into a "foo.s" file like this:
 | |
| >
 | |
| >           .text
 | |
| >           .globl foo
 | |
| >    foo:
 | |
| >           .byte  .... /* bytes from Code: part of OOPS dump */
 | |
| >
 | |
| >    Compile it with "gcc -c -o foo.o foo.s" then look at the output of
 | |
| >    "objdump --disassemble foo.o".
 | |
| >
 | |
| >    Output:
 | |
| >
 | |
| >    ip_queue_xmit:
 | |
| >        push       %ebp
 | |
| >        push       %edi
 | |
| >        push       %esi
 | |
| >        push       %ebx
 | |
| >        sub        $0xbc, %esp
 | |
| >        mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb)
 | |
| >        mov        0x8(%ebp), %ebx         ! %ebx = skb->sk
 | |
| >        mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt
 | |
| 
 | |
| In addition, you can use GDB to figure out the exact file and line
 | |
| number of the OOPS from the vmlinux file. If you have
 | |
| CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the
 | |
| OOPS:
 | |
| 
 | |
|  EIP:    0060:[<c021e50e>]    Not tainted VLI
 | |
| 
 | |
| And use GDB to translate that to human-readable form:
 | |
| 
 | |
|   gdb vmlinux
 | |
|   (gdb) l *0xc021e50e
 | |
| 
 | |
| If you don't have CONFIG_DEBUG_INFO enabled, you use the function
 | |
| offset from the OOPS:
 | |
| 
 | |
|  EIP is at vt_ioctl+0xda8/0x1482
 | |
| 
 | |
| And recompile the kernel with CONFIG_DEBUG_INFO enabled:
 | |
| 
 | |
|   make vmlinux
 | |
|   gdb vmlinux
 | |
|   (gdb) p vt_ioctl
 | |
|   (gdb) l *(0x<address of vt_ioctl> + 0xda8)
 | |
| or, as one command
 | |
|   (gdb) l *(vt_ioctl + 0xda8)
 | |
| 
 | |
| If you have a call trace, such as :-
 | |
| >Call Trace:
 | |
| > [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
 | |
| > [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
 | |
| > [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
 | |
| > ...
 | |
| this shows the problem in the :jbd: module. You can load that module in gdb
 | |
| and list the relevant code.
 | |
|   gdb fs/jbd/jbd.ko
 | |
|   (gdb) p log_wait_commit
 | |
|   (gdb) l *(0x<address> + 0xa3)
 | |
| or
 | |
|   (gdb) l *(log_wait_commit + 0xa3)
 | |
| 
 | |
| 
 | |
| Another very useful option of the Kernel Hacking section in menuconfig is
 | |
| Debug memory allocations. This will help you see whether data has been
 | |
| initialised and not set before use etc. To see the values that get assigned
 | |
| with this look at mm/slab.c and search for POISON_INUSE. When using this an
 | |
| Oops will often show the poisoned data instead of zero which is the default.
 | |
| 
 | |
| Once you have worked out a fix please submit it upstream. After all open
 | |
| source is about sharing what you do and don't you want to be recognised for
 | |
| your genius?
 | |
| 
 | |
| Please do read Documentation/SubmittingPatches though to help your code get
 | |
| accepted.
 | 
