Finding the Faulting Code Line from an Oops: Stack Error Information


Finding the faulting code line from Oops information
Category: Linux Related, 2011-04-18 16:15

(1) Start from the point where the Oops crashed, and first find the source line where the bad pointer access happened.

a) When rebuilding the kernel, enable Kernel hacking -> Kernel debugging and Compile the kernel with debug info, so that the kernel image contains debugging information.

b) Then find the "PC is at" line in the Oops output, here "PC is at free_block+0x78/0x168":

    Unable to handle kernel paging request at virtual address 000c0604    // invalid pointer address
    pgd = 40004000
    [000c0604] *pgd=00000000
    Internal error: Oops: 817 [#1]
    Modules linked in:
    CPU: 0    Not tainted  (2.6.27.18 #221)
    PC is at free_block+0x78/0x168    // address of the current instruction
    LR is at release_console_sem+0x19c/0x1b8    // function return address

Looking up free_block in System.map gives the address 0x40097ac0; adding the offset 0x78 gives 0x40097B38.

c) In the kernel root directory, run arm-wrs-linux-gnueabi-armv6jel_vfp-uclibc_small-gdb vmlinux to get the faulting line:

    [root@kqyang-hikvision linux-2.6.27_svn_quyong]# arm-wrs-linux-gnueabi-armv6jel_vfp-uclibc_small-gdb vmlinux
    GNU gdb (Wind River Linux Sourcery G++ 4.3-85) 6.8.50.20080821-cvs
    Copyright (C) 2008 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "--host=i686-pc-linux-gnu --target=arm-wrs-linux-gnueabi".
    For bug reporting instructions, please see:
    (gdb) l *0x40097B38
    0x40097b38 is in free_block (include/linux/list.h:93).
    88       * the prev/next entries already!
    89       */
    90      #include
    91      static inline void __list_del(struct list_head * prev, struct list_head * next)
    92      {
    93              next->prev = prev;
    94              prev->next = next;
    95      }
    96
    97      /*
    (gdb)

Original article: "Linux kernel Oops information", by XINU
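The address arithmetic in step b) can be sketched in C. This is only an illustration and not part of the original article: parse_oops_location and absolute_pc are hypothetical helper names, and the base address of the symbol still has to be looked up in System.map by hand.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split "symbol+0xOFF/0xSIZE" (the text printed after
 * "PC is at") into symbol name, offset and function size.
 * Returns 0 on success, -1 if the text does not have that shape. */
static int parse_oops_location(const char *text, char *symbol, size_t symbol_len,
                               unsigned long *offset, unsigned long *size)
{
    char name[64];

    /* %lx accepts an optional 0x prefix, just like strtoul with base 16. */
    if (sscanf(text, "%63[^+]+%lx/%lx", name, offset, size) != 3)
        return -1;
    if (strlen(name) >= symbol_len)
        return -1;
    strcpy(symbol, name);
    return 0;
}

/* Absolute address = symbol base (from System.map) + offset (from the Oops). */
static unsigned long absolute_pc(unsigned long symbol_base, unsigned long offset)
{
    return symbol_base + offset;
}
```

For the log above, parsing "free_block+0x78/0x168" yields the offset 0x78, and absolute_pc(0x40097ac0, 0x78) returns 0x40097b38, the value passed to "l *" in gdb.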

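An Oops can be regarded as a kernel-level (privileged) segmentation fault. When an ordinary user-level application accesses memory illegally (invalid address, no access permission, and so on) or executes an illegal instruction, it receives the segfault signal (SIGSEGV). The usual result is a core dump, although the application may also catch the signal and handle it itself. When the kernel itself faults, it prints Oops information instead.

How the kernel prints Oops information:

1. do_page_fault() (arch/i386/mm/fault.c): if the kernel performs an illegal access, this function prints EIP, PDE and related information, for example:

    Unable to handle kernel paging request at virtual address f899b670
     printing eip:
    c01de48c
    *pde = 00737067

It then calls die("Oops", regs, error_code). If the system is still alive at that point (at least two conditions must hold: 1. the fault happened in process context; 2. panic_on_oops is not set), the kernel kills the current process; otherwise it panics and the machine hangs.

2. die() (arch/i386/kernel/traps.c): this function starts by printing:

    Oops: 0002 [#1]

Here 0002 is the error code and #1 is the number of Oopses that have occurred so far.

    error_code:
     * bit 0: 0 means no page found, 1 means protection fault
     * bit 1: 0 means read, 1 means write
     * bit 2: 0 means kernel, 1 means user-mode
     * bit 3: 0 means data, 1 means instruction

Next, show_registers(regs) is called. It prints the registers, the current process, the stack, the instruction bytes and other information to help with diagnosis.

When such a fault occurs, the Linux kernel prints the Oops information, including the current register state, the stack contents and the complete call trace, to help us locate the error.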
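The error_code bits listed above can be decoded mechanically. The following is a minimal sketch in C, assuming only the four bit meanings given in the list; struct fault_info and decode_error_code are illustrative names, not kernel APIs.

```c
#include <stdbool.h>

/* Decoded view of the page-fault error code, one field per bit
 * described in the list above. Names are illustrative only. */
struct fault_info {
    bool protection_fault; /* bit 0: 0 = no page found, 1 = protection fault */
    bool write;            /* bit 1: 0 = read, 1 = write */
    bool user_mode;        /* bit 2: 0 = kernel, 1 = user mode */
    bool instruction;      /* bit 3: 0 = data, 1 = instruction fetch */
};

static struct fault_info decode_error_code(unsigned long error_code)
{
    struct fault_info info;

    info.protection_fault = (error_code & 0x1) != 0;
    info.write            = (error_code & 0x2) != 0;
    info.user_mode        = (error_code & 0x4) != 0;
    info.instruction      = (error_code & 0x8) != 0;
    return info;
}
```

Decoding the value 0002 from "Oops: 0002" gives: no page found, write access, kernel mode, data access. That matches a kernel write through an unmapped pointer.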

The following example demonstrates a NULL pointer dereference.

    01 #include <linux/init.h>
    02 #include <linux/module.h>
    03
    04 static int __init hello_init(void)
    05 {
    06         int *p = 0;
    07
    08         *p = 1;
    09         return 0;
    10 }
    11
    12 static void __exit hello_exit(void)
    13 {
    14         return;
    15 }
    16
    17 module_init(hello_init);
    18 module_exit(hello_exit);
    19
    20 MODULE_LICENSE("GPL");

From the code above it is easy to see that the faulty statement is on line 08. When we compile it into a *.ko module and load it into the kernel with insmod, the Oops arrives as expected:

    [  100.243737] BUG: unable to handle kernel NULL pointer dereference at (null)
    [  100.244985] IP: hello_init+0x5/0x11 [hello]
    [  100.262266] *pde = 00000000
    [  100.288395] Oops: 0002 [#1] SMP
    [  100.305468] last sysfs file: /sys/devices/virtual/sound/timer/uevent
    [  100.325955] Modules linked in: hello(+) vmblock vsock vmmemctl vmhgfs acpiphp snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev psmouse serio_raw fbcon tileblit font bitblit softcursor snd parport_pc soundcore snd_page_alloc vmci i2c_piix4 vga16fb vgastate intel_agp agpgart shpchp lp parport floppy pcnet32 mii mptspi mptscsih mptbase scsi_transport_spi vmxnet
    [  100.472178]
    [  100.494931] Pid: 1586, comm: insmod Not tainted (2.6.32-21-generic #32-Ubuntu) VMware Virtual Platform
    [  100.540018] EIP: 0060: EFLAGS: 00010246 CPU: 0
    [  100.562844] EIP is at hello_init+0x5/0x11 [hello]
    [  100.584351] EAX: 00000000 EBX: fffffffc ECX: f82cf040 EDX: 00000001
    [  100.609358] ESI: f82cf040 EDI: 00000000 EBP: f1b9ff5c ESP: f1b9ff5c
    [  100.631467]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    [  100.657664] Process insmod (pid: 1586, ti=f1b9e000 task=f137b340 task.ti=f1b9e000)
    [  100.706083] Stack:
    [  100.731783]  f1b9ff88 c0101131 f82cf040 c076d240 fffffffc f82cf040 0072cff4 f82d2000
    [  100.759324]  fffffffc f82cf040 0072cff4 f1b9ffac c0182340 f19638f8 f137b340 f19638c0
    [  100.811396]  00000004 09cc9018 09cc9018 00020000 f1b9e000 c01033ec 09cc9018 00015324
    [  100.891922] Call Trace:
    [  100.916257]  ? do_one_initcall+0x31/0x190
    [  100.943670]  ? hello_init+0x0/0x11 [hello]
    [  100.970905]  ? sys_init_module+0xb0/0x210
    [  100.995542]  ? syscall_call+0x7/0xb
    [  101.024087] Code: 05 00 00 00 00 01 00 00 00 5d c3 00 00 00 00 00 00 00 00 00 00
    [  101.079592] EIP: hello_init+0x5/0x11 [hello] SS:ESP 0068:f1b9ff5c
    [  101.134682] CR2: 0000000000000000
    [  101.158929] ---[ end trace e294b69a66d752cb ]---

The Oops describes the type of bug and points to its location, namely "IP: hello_init+0x5/0x11 [hello]". At this point the objdump tool helps with the analysis, since it can disassemble the object file. Run:

    objdump -S hello.o

The output is a mix of C source and assembly:

    01 hello.o:     file format elf32-i386
    02
    03
    04 Disassembly of section .init.text:
    05
    06 00000000 <hello_init>:
    07 #include <linux/init.h>
    08 #include <linux/module.h>
    09
    10 static int __init hello_init(void)
    11 {
    12    0:   55                      push   %ebp
    13         int *p = 0;
    14
    15         *p = 1;
    16
    17         return 0;
    18 }
    19    1:   31 c0                   xor    %eax,%eax
    20 #include <linux/init.h>
    21 #include <linux/module.h>
    22
    23 static int __init hello_init(void)
    24 {
    25    3:   89 e5                   mov    %esp,%ebp
    26         int *p = 0;
    27
    28         *p = 1;
    29    5:   c7 05 00 00 00 00 01    movl   $0x1,0x0
    30    c:   00 00 00
    31
    32         return 0;
    33 }
    34    f:   5d                      pop    %ebp
    35   10:   c3                      ret
    36
    37 Disassembly of section .exit.text:
    38
    39 00000000 <hello_exit>:
    40
    41 static void __exit hello_exit(void)
    42 {
    43    0:   55                      push   %ebp
    44    1:   89 e5                   mov    %esp,%ebp
    45    3:   e8 fc ff ff ff          call   4 <hello_exit+0x4>
    46         return;
    47 }
    48    8:   5d                      pop    %ebp
    49    9:   c3                      ret

Comparing this with the Oops, the assembly at the faulting location hello_init+0x5 is clearly:

    29    5:   c7 05 00 00 00 00 01    movl   $0x1,0x0

This instruction stores the value 1 at address 0, which is of course illegal. The corresponding source line is:

    28         *p = 1;

With the help of the Oops information the problem is found quickly. This example did not hang the machine, so the complete error message can be read with the dmesg command. In many cases, however, the machine does hang, and the messages scroll across several screens.
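To see why the bytes at offset 0x5 mean "store 1 at address 0", the instruction can be decoded by hand. The sketch below assumes 32-bit x86 and handles only this one encoding (opcode c7, ModRM byte 05, then a 32-bit absolute address and a 32-bit immediate, both little-endian); decode_movl_imm_to_abs is a hypothetical name and this is in no way a general disassembler.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the 10-byte form "movl $imm32, addr32" used in the listing:
 *   c7 05 <4 address bytes> <4 immediate bytes>
 * Returns 0 on success, -1 if the bytes are not this encoding. */
static int decode_movl_imm_to_abs(const uint8_t *code, size_t len,
                                  uint32_t *address, uint32_t *immediate)
{
    if (len < 10 || code[0] != 0xc7 || code[1] != 0x05)
        return -1;
    *address = read_le32(code + 2);   /* destination address */
    *immediate = read_le32(code + 6); /* value being stored */
    return 0;
}
```

Feeding it the ten bytes c7 05 00 00 00 00 01 00 00 00 from lines 29 and 30 of the listing gives address 0 and immediate 1, which is the assembly form of *p = 1 with p equal to 0.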

In that case the kernel dump tool kdump can be used to save the memory and CPU register contents at the time of the Oops to a file, which can then be analyzed with gdb.

In the lyrics of the Spears song "Oops I Did It Again", the word is likewise used as an understatement, sometimes with a hint of apology.

    scheduling while atomic: insmod/826/0x00000002
    Call Trace:
    [ef12f700] [c00081e0] show_stack+0x3c/0x194 (unreliable)
    [ef12f730] [c0019b2c] __schedule_bug+0x64/0x78
    [ef12f750] [c0350f50] schedule+0x324/0x34c
    [ef12f7a0] [c03515c0] schedule_timeout+0x68/0xe4
    [ef12f7e0] [c027938c] fsl_elbc_run_command+0x138/0x1c0
    [ef12f820] [c0275820] nand_do_read_ops+0x130/0x3dc
    [ef12f880] [c0275ebc] nand_read+0xac/0xe0
    [ef12f8b0] [c0262d98] part_read+0x5c/0xe4
    [ef12f8c0] [c017bcac] jffs2_flash_read+0x68/0x254
    [ef12f8f0] [c0170550] jffs2_read_dnode+0x60/0x304
    [ef12f940] [c017088c] jffs2_read_inode_range+0x98/0x180
    [ef12f970] [c016e610] jffs2_do_readpage_nolock+0x94/0x1ac
    [ef12f990] [c016ee04] jffs2_write_begin+0x2b0/0x330
    [ef12fa10] [c005144c] generic_file_buffered_write+0x11c/0x8d0
    [ef12fab0] [c0051e48] __generic_file_aio_write_nolock+0x248/0x500
    [ef12fb20] [c0052168] generic_file_aio_write+0x68/0x10c
    [ef12fb50] [c007ca80] do_sync_write+0xc4/0x138
    [ef12fc10] [f107c0dc] oops_log+0xdc/0x1e8 [oopslog]
    [ef12fe70] [f3087058] oops_log_init+0x58/0xa0 [oopslog]
    [ef12fe80] [c00477bc] sys_init_module+0x130/0x17dc
    [ef12ff40] [c00104b0] ret_from_syscall+0x0/0x38
    --- Exception: c01 at 0xff29658
        LR = 0x10031300

2.2 Oops

When a program running in kernel mode hits an exceptional condition, such as a data exception caused by dereferencing an invalid pointer or an instruction fetch exception caused by an out-of-bounds array access, the exception handling mechanism catches it and prints key system information to the serial port. Under normal circumstances the Oops message is also recorded in the system log.

When an Oops occurs, the process is in kernel mode. It may well be accessing critical system resources and holding some locks. When the process exits abnormally because of the Oops, it cannot release the resources it has acquired, so other processes that need those resources hang, which affects the normal operation of the system. In this situation the system is usually unstable and may well crash.

2.3 Panic

When an Oops occurs in interrupt context, or in process 0 or process 1, the system hangs completely, because an interrupt service routine cannot recover after an exception. This situation is called a kernel panic. In addition, when the system has the panic flag set, every Oops leads to a kernel panic, whether it occurs in interrupt context or in process context. After a panic in an interrupt service routine the system no longer schedules, so syslogd no longer runs. In that case the Oops message is only printed to the serial port and is not recorded in the system log.

Kernel panic debugging example:

    [  242.788019] bluesleep_outgoing_data: tx was sleeping
    [  244.012224] *host_wake is 1
    [  245.234647] Disable_key_during_touch=0
    [  245.237802] huqiao_button-code=139,state =1
    [  245.414640] Disable_key_during_touch=0
    [  245.417542] huqiao_button-code=139,state =0
    [  245.821424] *host_wake is 0
    [  245.823708] bluesleep_hostwake_isr: Iwaking up.
    [  245.823713]
    [  245.830155] bluesleep_hostwake_task: bluesleep_hostwake_task is called
    [  245.838356] Unable to handle kernel NULL pointer dereference at virtual address 00000008
    [  245.845678] pgd = c0004000
    [  245.848188] [00000008] *pgd=00000000
    [  245.851751] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [  245.857122] Modules linked in:
    [  245.860080] CPU: 0    Tainted: G        W    (3.4.0-perf-svn874 #1)
    [  245.866444] PC is at sco_connect_cfm+0x380/0x4e8
    [  245.871106] LR is at 0xd880
    [  245.873800] pc : lr : psr: 40000013
    [  245.873805] sp : dbe55e78  ip : 00000000  fp : d7d95c00
    [  245.885246] r10: d8643998  r9 : d8e5b80d  r8 : d8643830
    [  245.890529] r7 : dbe54000  r6 : d9e5b600  r5 : cae27c80  r4 : d8643800
    [  245.896968] r3 : 00000008  r2 : 00000000  r1 : d7d96016  r0 : 00000000
    [  245.903552] Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
    [  245.910772] Control: 10c5787d  Table: 5a47406a  DAC: 00000015
    [  245.916576]
    [  245.916579] PC: 0xc0744640:
    [  245.920751] 4640  e3310000 1afffffa f57ff04f e320f004 e5973004 e2433001 e5873004 e5973000
    [  245.928910] 4660  ea000042 e59f0190 e300332a e19030b3 e3130004 0a000004 e2800fc6 e59f1198

As the log shows, a kernel panic prints stack information like the above. The line "PC is at sco_connect_cfm+0x380/0x4e8" tells us that the problem occurred in the function sco_connect_cfm. Normally the LR (link register) tells us who called the faulting function; here sco_connect_cfm is called by hci_proto_connect_cfm.

The line "Unable to handle kernel NULL pointer dereference at virtual address 00000008" tells us that the function dereferenced an invalid address. Linux reserves the top 1 GB of the virtual address space (0xC0000000 to 0xFFFFFFFF) for the kernel, called "kernel space", and gives the lower 3 GB (0x00000000 to 0xBFFFFFFF) to individual processes, called "user space". Here the kernel illegally used an address in user space, hence the fault.

Kernel panics are generally hard to reproduce, so I decided to simulate this one with code of my own in the kernel. The original function is:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
            register struct hci_proto *hp;

            hp = hci_proto[HCI_PROTO_L2CAP];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            hp = hci_proto[HCI_PROTO_SCO];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            if (conn->connect_cfm_cb)
                    conn->connect_cfm_cb(conn, status);
    }

When I changed the function to:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
            register struct hci_proto *hp;

            hp = hci_proto[HCI_PROTO_L2CAP];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            conn = NULL - 21;       /* simulate this phenomenon */

            hp = hci_proto[HCI_PROTO_SCO];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            if (conn->connect_cfm_cb)
                    conn->connect_cfm_cb(conn, status);
    }

the phenomenon reproduced exactly. From the definition of struct hci_conn we can work out that the address of hcon->type is then 00000008. So in the original code, by the time sco_connect_cfm was called, the conn pointer passed in had been changed to NULL - 21. Yet the earlier call hp->connect_cfm(conn, status) ran without any problem, and passing conn into it did not change the pointer. This puzzled me: why did the address suddenly become invalid? After searching online, I found that the cause may be a hardware problem in which one address suffered a transient error. With that, the cause was found and the analysis of this bug was finished.
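The two facts used in this analysis, the 3G/1G address split and the way a bad base pointer plus a member offset produces the faulting address, can be sketched in C. Everything here is an illustration under stated assumptions: struct demo_conn is a made-up layout and NOT the real struct hci_conn, and is_kernel_address and member_fault_address are hypothetical helpers for 32-bit addresses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Start of kernel space under the 3G/1G split described in the text. */
#define KERNEL_SPACE_START 0xC0000000u

static bool is_kernel_address(uint32_t addr)
{
    return addr >= KERNEL_SPACE_START;
}

/* Made-up layout for illustration only; not the real struct hci_conn. */
struct demo_conn {
    uint32_t first;  /* offset 0 */
    uint32_t second; /* offset 4 */
    uint32_t type;   /* offset 8 */
};

/* Address touched when a member at member_offset is accessed through a
 * base pointer whose numeric value is base (32-bit wraparound applies). */
static uint32_t member_fault_address(uint32_t base, size_t member_offset)
{
    return base + (uint32_t)member_offset;
}
```

With a NULL base, reading the member at offset 8 touches address 0x00000008, which is_kernel_address reports as lying outside kernel space. Taking the article's figures at face value, a base of NULL - 21 and a fault at 00000008 would imply that type sits 29 bytes into the real structure, since member_fault_address((uint32_t)-21, 29) also wraps around to 8.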
