Kunpeng Technical Troubleshooting Guide

Issue 02

Date 2021-09-01

HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Contents

1 Overview
2 Kernel Problems
2.1 Analysis Methods
2.1.1 Kdump Mechanism
2.1.2 Configuring the Kdump Service
2.1.3 Configuring Kernel Parameters
2.1.4 Enabling KASAN
2.2 Analysis Tool
2.2.1 Crash
2.3 System Crash
2.3.1 Fault Locating
2.3.2 Case 1
2.3.3 Case 2
2.4 System Suspension
2.4.1 Fault Locating
2.4.2 Cases
3 Application Problems
3.1 Analysis Methods
3.2 C/C++ Application Problems
3.2.1 Introduction to C/C++
3.2.2 Analysis Tool
3.2.2.1 GDB
3.2.3 Troubleshooting
3.2.3.1 C/C++ Process Ends Unexpectedly
3.2.3.2 C/C++ Process Suspends
3.3 Fortran Application Problems
3.3.1 Introduction to Fortran
3.3.2 Analysis Tool
3.3.2.1 GDB
3.3.3 Troubleshooting
3.3.3.1 Fortran Data Deviation Occurs
3.3.3.2 Fortran Process Ends Unexpectedly
3.4 Java Application Problems
3.4.1 Introduction to Java
3.4.2 Java Fault Locating Tools
3.4.2.1 jps
3.4.2.2 jinfo
3.4.2.3 jstack
3.4.2.4 jmap
3.4.2.5 jstat
3.4.3 Troubleshooting Java Problems
3.4.3.1 Heap Memory Leak
3.4.3.2 Java Process Hangs
3.4.3.3 Java Process Freezes
3.5 Go Application Problems
3.5.1 Introduction to Go
3.5.2 Go Fault Locating Tools
3.5.2.1 Delve
3.5.3 Troubleshooting Go Problems
3.5.3.1 Go Program Stops Abnormally
4 Auxiliary Analysis of the Kunpeng DevKit
4.1 Overview
4.2 Cases
4.2.1 Creating a Task
4.2.2 Analyzing the Result
A ARM64 General-Purpose Register
B Change History

1 Overview

About This Document

Problems may occur in any module throughout the lifecycle of software, and they may affect the running of the module or even the entire software. Every software engineer should learn how to analyze and locate software problems. However, this has become an increasingly difficult job as the capacity and functionality of software increase.

To help developers and users make better use of the Kunpeng platform, this document describes methods and cases of troubleshooting software problems based on the experience of Kunpeng software development and maintenance.

Software problems include kernel problems and application problems. The analysis methods and fault locating tools for application problems vary according to the development language. Therefore, this document covers two main categories: kernel problems and application problems. Application problems are described by development language. Each part introduces common fault locating tools, analysis methods, and cases of specific software problems. In addition, the diagnosis and debugging functions included in the Kunpeng DevKit can help developers analyze problems such as memory leaks and out of memory (OOM) and resolve them quickly.

Software Problem Analysis

Figure 1-1 shows the general guideline for analyzing software problems. This document summarizes specific troubleshooting methods for each problem while observing this general guideline.

Figure 1-1 Analysis guideline for software problems

Common Analysis Methods

The following are common methods for analyzing software problems:

● Query software logs and information recorded in the /proc directory.

● Collect information by adding printing information to the code.

● Use a debugger such as GDB or kdb for step-by-step debugging.

● Use dump mechanisms such as LKCD or kdump to dump the information generated when software crashes for subsequent locating.
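As an illustration of the first method, the following minimal Python sketch reads the status file that Linux exposes for each process under /proc and turns it into a dictionary. The function names are hypothetical and chosen only for this example; the sketch assumes the usual "Key:<tab>value" layout of /proc/<pid>/status and returns an empty result on systems without /proc.

```python
from pathlib import Path


def parse_status(text: str) -> dict:
    """Parse the 'Key:<tab>value' lines of a /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            fields[key.strip()] = value.strip()
    return fields


def read_process_status(pid: str = "self") -> dict:
    """Read /proc/<pid>/status; return an empty dict where /proc is unavailable."""
    path = Path("/proc") / pid / "status"
    if not path.exists():
        return {}
    return parse_status(path.read_text())


if __name__ == "__main__":
    status = read_process_status()
    # Name and State are standard fields of /proc/<pid>/status on Linux.
    print(status.get("Name"), status.get("State"))
```

Keeping the parsing separate from the file access makes it possible to run the same parser on status text that was saved from a faulty machine.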

2 Kernel Problems

2.1 Analysis Methods

2.2 Analysis Tool

2.3 System Crash

2.4 System Suspension

2.1 Analysis Methods

2.1.1 Kdump Mechanism

Linux kernel problems are difficult to locate and rectify because they may cause the entire system to break down and may not be reproducible. Also, due to the particularity of kernel debugging, common debugging methods such as printing, query, and step-by-step debugging are often unsuitable for locating kernel problems. Kernel crash dump mechanisms collect onsite information when a problem occurs for subsequent analysis and are widely used to locate memory problems. Kernel crash dump mechanisms include LKCD, Diskdump, Netdump, Mkdump, and kdump. As a mainstream kernel crash dump mechanism, kdump is recommended for locating kernel problems in this document.

Kdump was accepted into the ARM kernel mainline in Linux 2.6.38. It is implemented through two kernels. In addition to the system kernel, kdump starts a separate capture kernel to capture system kernel information for the dump. Figure 2-1 shows the process.

Figure 2-1 Kdump mechanism

1. The system kernel starts normally, including hardware self-check and bootloader loading, and reserves memory space for the capture kernel.

2. The capture kernel is loaded into the reserved memory space.

3. The system crashes, triggering a panic, and the capture kernel starts up. (The capture kernel is started through kexec, which skips the hardware self-check so that it starts quickly.)

4. The capture kernel collects the system kernel information by using the /proc/vmcore memory image file.

5. The system kernel information is compressed to generate a dump file, which is written to a drive.

2.1.2 Configuring the Kdump Service

If a kernel problem occurs, you cannot use debugging tools such as GDB to directly debug the failed program. Instead, you need to generate a vmcore file and then debug the vmcore file to locate problems.

Kdump dumps the memory contents and running parameters when the system crashes or deadlocks. A vmcore file is generated only when kdump is configured. The following uses CentOS as an example to describe how to configure kdump. Package names and configuration paths vary by distribution: kdump-tools, update-grub, and kdump-config in the steps below come from Debian-based systems, so on CentOS use the equivalent kexec-tools package, the /etc/default/grub file, and the kdump service.

Step 1 Install kdump and its dependencies.

yum install kdump-tools crash kexec-tools makedumpfile -y

Step 2 Configure kdump.

Edit the /etc/default/grub.d/kdump-tools.cfg file.

vim /etc/default/grub.d/kdump-tools.cfg

Modify the crashkernel parameter in the GRUB_CMDLINE_LINUX_DEFAULT variable (for example, set crashkernel to 1G-:512M), save the modification, and exit the file.

Step 3 Update grub parameters.

update-grub

Step 4 Confirm kdump is enabled.

1. Restart the server and check the kdump configuration.

kdump-config show

2. Run the following commands to verify kdump. The Linux kernel breaks down after the command execution, and kdump files are generated in the default directory /var/crash. If a vmcore file is generated, kdump has been enabled.

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

----End
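The crashkernel value 1G-:512M used in Step 2 is in range form: on a system with at least 1 GB of RAM, 512 MB is reserved for the capture kernel. The following Python sketch shows how such a value can be interpreted. It assumes the range syntax described in the Linux kernel kdump documentation (start-[end]:size, comma-separated, with an optional @offset suffix), and the function names are hypothetical helpers for this illustration, not part of any kdump tool.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a size such as '512M' or '1G' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def crashkernel_reservation(value: str, ram_bytes: int) -> int:
    """Return the bytes reserved by a crashkernel= value for a given RAM size.

    Handles the plain form ('512M') and the range form
    ('start-[end]:size[,start-[end]:size...]'); an '@offset' suffix is ignored.
    """
    value = value.split("@", 1)[0]
    if ":" not in value:
        return parse_size(value)
    for entry in value.split(","):
        ram_range, size = entry.split(":", 1)
        start, _, end = ram_range.partition("-")
        lower = parse_size(start)
        upper = parse_size(end) if end else None  # open-ended range
        if ram_bytes >= lower and (upper is None or ram_bytes < upper):
            return parse_size(size)
    return 0  # no range matched, so nothing is reserved


if __name__ == "__main__":
    four_gib = 4 << 30
    print(crashkernel_reservation("1G-:512M", four_gib) // (1 << 20), "MiB")
```

A result of 0 for the RAM size of the target server would mean that no memory is reserved, in which case the capture kernel cannot be loaded and no vmcore file is generated.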

2.1.3 Configuring Kernel Parameters

If some deadlocks occur in the kernel, the system does not restart but stops responding. In this case, a kernel panic does not occur. You need to create conditions to make the kernel panic so that a vmcore file can be generated.

You can force a kernel panic as follows:

● Trigger a kernel panic when a hung task is detected.

echo 1 > /proc/sys/kernel/hung_task_panic

● Set the time after which a blocked task is treated as hung, so that the panic is triggered n seconds after the task stops making progress (60s is used as an example in this command).

echo 60 > /proc/sys/kernel/hung_task_timeout_secs

● Trigger a kernel panic when a soft lockup is detected.

echo 1 > /proc/sys/kernel/softlockup_panic

● Trigger a kernel panic when an out of memory (OOM) error occurs.

echo 1 > /proc/sys/vm/panic_on_oom

● Trigger a kernel panic when a kernel warning is generated.

echo 1 > /proc/sys/kernel/panic_on_warn

The configurations modified by running the preceding commands take effect temporarily and become invalid after a system restart. To enable automatic configuration upon restart, do as follows:

Step 1 Edit the /etc/sysctl.conf file.

vim /etc/sysctl.conf

Enter the following information:

kernel.hung_task_panic=1
kernel.hung_task_timeout_secs=60
kernel.softlockup_panic=1
vm.panic_on_oom=1
kernel.panic_on_warn=1

Step 2 Make the configuration take effect.

sysctl -p

----End
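The temporary echo commands and the persistent sysctl.conf entries in this section address the same kernel parameters: a sysctl key maps to a file under /proc/sys by replacing each dot with a slash. The following Python sketch makes that relationship explicit and renders the configuration lines from a single table. The names PANIC_SETTINGS, proc_path, and render_sysctl_conf are hypothetical and used only for this illustration.

```python
# Values taken from the section above; the dictionary name is illustrative.
PANIC_SETTINGS = {
    "kernel.hung_task_panic": 1,
    "kernel.hung_task_timeout_secs": 60,
    "kernel.softlockup_panic": 1,
    "vm.panic_on_oom": 1,
    "kernel.panic_on_warn": 1,
}


def proc_path(key: str) -> str:
    """Map a sysctl key to the /proc/sys file that holds its current value."""
    return "/proc/sys/" + key.replace(".", "/")


def render_sysctl_conf(settings: dict) -> str:
    """Render key=value lines suitable for appending to a sysctl configuration file."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


if __name__ == "__main__":
    for key in PANIC_SETTINGS:
        print(key, "->", proc_path(key))
    print(render_sysctl_conf(PANIC_SETTINGS), end="")
```

Generating both forms from one table helps keep the runtime settings and the persistent file from drifting apart.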

2.1.4 Enabling KASAN

Kernel Address Sanitizer (KASAN) is a dynamic memory error detector. Use this tool to detect memory overwrites and the use of released memory. Enabling KASAN helps quickly locate the problem if a memory problem occurs. However, KASAN adds runtime overhead and changes timing, so some problems may become difficult to reproduce after it is enabled.

The kernel integrates KASAN. Enable the KASAN option and recompile the kernel so that related functions can be used. This section uses CentOS 7.6 as an example to illustrate the detailed process.

Step 1 Download the RPM package kernel-alt-xxx.src.rpm.

Download address: http://vault.centos.org/7.6.1810/os/Source/SPackages/

Step 2 Install the dependencies.

yum install -y xmlto asciidoc newt-devel pciutils-devel
sudo yum install rpm-build redhat-rpm-config asciidoc hmaccalc perl-ExtUtils-Embed pesign xmlto
sudo yum install audit-libs-devel binutils-devel elfutils-devel elfutils-libelf-devel
sudo yum install ncurses-devel newt-devel numactl-devel pciutils-devel python-devel zlib-devel

Step 3 Ensure that the mockbuild user and the group to which the mockbuild user belongs are valid.

groupadd mockbuild
useradd mockbuild -g mockbuild

Step 4 Install the RPM package.

rpm -ivh kernel-alt-xxx.src.rpm

After the installation is complete, the RPM build project is automatically deployed in /root/rpmbuild/SPECS and /root/rpmbuild/SOURCES.

Step 5 Prepare the source tree with rpmbuild.

cd /root/rpmbuild/SPECS
rpmbuild -bp --target=$(uname -m) kernel-alt.spec

Step 6 Configure the .config file to enable KASAN.

cd /root/rpmbuild/BUILD/kernel-alt-xxx/linux-xxx.aarch64
make menuconfig

In the menu, select: Kernel hacking ---> Memory Debugging ---> KASan: runtime memory debugger

NOTE

In the first command, replace the path following cd with the actual one.

Step 7 Compile the kernel.

cd /root/rpmbuild/BUILD/kernel-alt-xxx/linux-xxx.aarch64
make -j
make modules_install
make install

NOTE

In the first command, replace the path following cd with the actual one.

Step 8 Restart the system.

reboot

----End
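Before starting the long kernel build in Step 7, it can be worth confirming that the menuconfig selection was actually written to the .config file. The following Python sketch checks whether an option is set to y. It assumes the standard .config text format, in which a disabled option appears as a "# CONFIG_... is not set" comment, and that the KASAN switch is named CONFIG_KASAN. The function name is hypothetical, and the path to .config must be supplied by the caller.

```python
import sys


def config_enabled(config_text: str, option: str) -> bool:
    """Return True if a kernel .config sets the given option to 'y'."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    # Usage: python check_kasan.py <path to the kernel .config file>
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8") as handle:
            enabled = config_enabled(handle.read(), "CONFIG_KASAN")
        print("KASAN enabled" if enabled else "KASAN not enabled")
```

The check compares whole lines, so an option with a longer name that merely starts with CONFIG_KASAN does not count as a match.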

2.2 Analysis Tool

2.2.1 Crash

Introduction

Crash is a memory dump analysis tool developed and maintained by Dave Anderson. It is widely used to locate kernel problems.

Installing Crash

In CentOS, use Yum to install Crash-related components.

yum install crash*

Using Crash

● Enable Crash debugging.

crash {dump file} {debugging version kernel}

● Crash debugging command: (crash) command *args

NOTE

(crash) is displayed when you enter the debugging mode. command indicates the debugging command to be executed, and *args indicates the parameters required by some debugging commands.

Common commands are as follows:

Command    Description

bt         Prints the function call stack information.

log        Prints the system message buffer. Example: log | tail -n30

ps         Displays the process status. > indicates an active process. Example: ps | grep RU

dis        Disassembles a specified function or address. Example: dis -l [func | addr]

whatis     Searches for data or type information. Example: whatis [struct | union | typedef | symbol]

sym        Translates between a virtual address and its symbol.

2.3 System Crash

2.3.1 Fault Locating

A system crash occurs when a server breaks down frequently and cannot run properly. Figure 2-2 shows how to locate the fault.

Figure 2-2 Fault locating for a system crash

Step 1 Configure kdump so that the system can collect crash information for subsequent debugging.

Step 2 Enable KASAN to add memory debugging information to the vmcore file.

Step 3 Run the program. If the problem does not recur, disable KASAN and run the program again to reproduce the problem.

Step 4 If the problem recurs, restart the server and confirm that the vmcore file is generated.

Step 5 Use Crash to debug the vmcore file and locate the fault.

Step 6 Analyze the cause. Modify the code and perform a verification.

Step 7 If the verification is successful, the code is integrated and no further action is required.

Step 8 If the verification fails, add debugging information, recompile the code, and run the code to reproduce the problem.

----End

2.3.2 Case 1

Symptom

A server breaks down occasionally when some software is running on it.

Fault Locating

Step 1 Configure kdump and enable KASAN to reproduce the problem.

Step 2 Disable KASAN to reproduce the problem if the problem does not recur for a long time.

Step 3 Confirm that a vmcore file is generated. Use Crash to debug the vmcore file. As shown in the following figure, it is confirmed that a blocked task exists.

Step 4 View the stack information. The kubelet process fails to acquire a lock. As a result, a deadlock occurs.

Step 5 Check the log in the preceding figure. The __netlink_dump_start function invokes the mutex_lock function to acquire the lock.

Step 6 View the __netlink_dump_start kernel source code.

The cb_mutex in the nlk structure is used as the input parameter of the mutex_lock function.

Step 7 Check the offset of cb_mutex in the structure variable nlk.

The offset of cb_mutex in the structure variable nlk is 920.

Step 8 Disassemble the __netlink_dump_start function.

The assembly operation ldr x0, [x0, #920] is displayed. According to the preceding analysis, this instruction reads the member variable cb_mutex at offset 920 from the initial address of nlk. Therefore, the initial address of nlk is stored in the x0 register before this instruction is executed. However, the value of the x0 register is still unknown. In the previous assembly operation mov x19, x0, you can see that the register content is copied to x19. Because x19 is a callee-saved register in the ARM64 procedure call standard, its value is saved on the stack when the subfunction mutex_lock is called.

Step 9 Disassemble the mutex_lock function. According to the assembly operation str x19, [sp,#16], x19 is saved to sp+16 in the mutex_lock function.

Step 10 Check the stack frame address of mutex_lock. The value is ffff00089a9efae0, as shown in the following figure. The saved x19 value, which is the address of the nlk structure variable, is therefore stored at ffff00089a9efae0 + 0x10 = ffff00089a9efaf0.

Step 11 Obtain the memory location of the nlk structure variable.

crash> rd 0xffff00089a9efaf0
ffff00089a9efaf0: ffffa05fc828f800

Step 12 Obtain the position of the lock when mutex_lock occurs.

crash> struct netlink_sock.cb_mutex ffffa05fc828f800 -x
cb_mutex = 0xffff000008e6e7b0 <rtnl_mutex>

Step 13 Check the mutex lock structure.

crash> struct mutex 0xffff000008e6e7b0 -x

The output shows that the lock holder is 0xffffa03b9dd80000. (The last three bits of the owner field are flag bits rather than part of the lock holder address. Therefore, the last three bits are set to 0.)

Step 14 View the ID of the process to which the lock holder belongs.

Step 15 View which other processes are requesting the lock.

Step 16 Check the related process code to determine the cause of the deadlock. Modify the code to avoid the deadlock, then recompile and run the program. The problem no longer occurs. No further action is required.

----End
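The address arithmetic in Steps 7 to 13 can be double-checked with a few lines of Python. The sketch below uses the values quoted in this case: the stack frame address, the 0x10 save offset, the nlk address, and the cb_mutex offset of 920. It also assumes that the three flag bits of a mutex owner field correspond to the mask 0x07. The raw owner value passed in the usage example is hypothetical, because the unmasked field is not quoted in the text, and the function names are illustrative only.

```python
MUTEX_FLAGS = 0x07  # assumption: the low three bits of the owner field are flags


def saved_register_slot(frame_address: int, offset: int) -> int:
    """Address at which a callee-saved register was stored in a stack frame."""
    return frame_address + offset


def member_address(struct_address: int, member_offset: int) -> int:
    """Address of a structure member, given the structure address and offset."""
    return struct_address + member_offset


def owner_task(owner_field: int) -> int:
    """Clear the flag bits of a mutex owner field to obtain the task address."""
    return owner_field & ~MUTEX_FLAGS


if __name__ == "__main__":
    # Step 10: x19 was stored at sp+16 of the mutex_lock stack frame.
    print(hex(saved_register_slot(0xFFFF00089A9EFAE0, 0x10)))
    # Steps 7 and 11: address of cb_mutex inside the nlk structure.
    print(hex(member_address(0xFFFFA05FC828F800, 920)))
    # Step 13: hypothetical raw owner value with one flag bit set.
    print(hex(owner_task(0xFFFFA03B9DD80001)))
```

Doing the calculation in a script rather than by hand reduces the risk of a transposed hexadecimal digit sending the analysis to the wrong address.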

2.3.3 Case 2

Symptom

A server breaks down occasionally when some software is running on it.

Fault Locating

Step 1 Configure kdump and enable KASAN to reproduce the problem.

Step 2 Confirm that a vmcore file is generated. Use Crash to debug the vmcore file.

The addresses 0xffff20000a3c05c0 and ffff20000a3c0580 conflict.

Step 3 View the symbol of 0xffff20000a3c05c0, which is xxx.

Step 4 Analyze the service logic and eliminate the cause of the symbol conflict. Modify the code, then recompile and run the program. The problem does not recur. After the fix is confirmed and integrated, no further action is required.

----End

2.4 System Suspension

2.4.1 Fault Locating

If a system cannot be operated but does not restart, the system is suspended. Figure 2-3 shows how to locate and rectify the fault.

Figure 2-3 Fault locating for system suspension

Step 1 Configure kdump so that the system can collect crash information for subsequent debugging.

Step 2 Configure kernel parameters so that the suspension problem can cause the system to break down.

Step 3 Reproduce the problem and confirm that a vmcore file is generated.

Step 4 Use Crash to debug the vmcore file and locate the fault.

Step 5 Analyze the cause and modify the code. Then perform a verification.

Step 6 If the verification is successful, the code is integrated and no further action is required.

Step 7 If the verification fails, add debugging information, recompile the code, and run the code to reproduce the problem.

----End

2.4.2 Cases

Symptom

A server stops responding when some software is running on it.

Fault Locating

Step 1 Configure kdump and kernel parameters.

Step 2 Reproduce the problem and confirm that a vmcore file is generated. Use Crash for debugging. As shown in the following figure, there is a null pointer.

Step 3 Check the error stack. The fault is probably in the inet_sock_destruct function.

Step 4 Disassemble the inet_sock_destruct function.

Step 5 View the source code of the inet_sock_destruct function.

According to the assembly and source code analysis, the x2 register holds a null pointer address, which means that the __skb_dequeue is null. However, the __skb_dequeue must be non-null when it is passed to the kfree_skb function. This indicates that other threads are also operating on the __skb_dequeue. After the current thread determines that the __skb_dequeue is not empty, other threads make the __skb_dequeue empty. As a result, an error is reported in subsequent operations of the current thread.

Step 6 Find out which other threads perform operations on the __skb_dequeue. You need to add debugging information to the kernel to save the stack information about each __skb_dequeue operation. As shown in the following figure, modify the code to save the stack information when an operation is performed on the __skb_dequeue.

Step 7 Recompile the kernel, run the program, generate a new vmcore file, use Crash for debugging, and locate the fault based on the preceding locating method.

Step 8 Analyze the call logic of the stack: tcp_done > inet_csk_destroy_sock > sk_stream_kill_queues > kfree_skb > skb_release_all

Step 9 Run the program again, generate another vmcore file, use Crash for debugging, and locate the fault based on the preceding method.

Step 10 Analyze the two vmcore files you generated previously. It is found that two threads are operating on the __skb_dequeue at the same time. The following figure shows the calling relationship.


When multiple cores run different code paths, the __skb_dequeue operations may not be synchronized. As a result, the queue is released repeatedly and the system suspends.

Step 11 To ensure operation consistency and avoid repeated release, add a memory barrier before the sk_free operation. Recompile and run the kernel. The problem is solved.

void __sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;

        if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc)) {
                smp_rmb(); // Add a memory barrier for verification.
                __sk_free(sk);
        }
}

Step 12 Check the kernel community. It is found that the community has added the memory barrier to the refcount_sub_and_test function in version 5.1. You can update the kernel version to fix this problem.


----End


3 Application Problems

3.1 Analysis Methods

3.2 C/C++ Application Problems

3.3 Fortran Application Problems

3.4 Java Application Problems

3.5 Go Application Problems

3.1 Analysis Methods

Coredump Mechanism

Coredump is a memory information capture mechanism based on asynchronous signals. It provides a user-mode memory image for analysis and is a common debugging method. Figure 3-1 shows the process.


Figure 3-1 Coredump work process

1. Some severe errors (such as invalid memory address access and invalid hardware instructions) occur during application execution. As a result, the kernel generates crash signals.

2. The signal processing function do_coredump checks whether the coredump condition is met, for example, whether the dumpable attribute of the current process is enabled.

3. If the conditions are met, create a core file and invoke the core_dump function to collect the onsite information about the current process.

4. Save the information in the core file in ELF format.

3.2 C/C++ Application Problems

3.2.1 Introduction to C/C++

C is a high-level programming language completed by D.M. Ritchie in early 1973. C is compact, flexible, convenient, and closer to assembly than other high-level languages, making it a mainstream choice in fields such as embedded systems and operating systems. C++ is a high-level programming language invented by Bjarne Stroustrup in 1983. It involves a series of improvements based on the C language.


For example, the compiler is stricter, inline functions are introduced, and object-oriented methods are supported. Therefore, C++ is widely used in gaming, network software, and other fields.

3.2.2 Analysis Tool

3.2.2.1 GDB

Introduction

GDB is a powerful user-mode program debugging tool for UNIX and UNIX-like environments released by GNU. It is a mainstream tool for C/C++ program debugging.

Installing GDB

In CentOS, use Yum to install GDB.

yum install gdb

Usage

● Debug an application.
gdb program

NOTE

program indicates the executable file to be debugged.

● Debug the core file of an application.
gdb program core

NOTE

core indicates the file generated after a core dump occurs due to illegal program execution.

● Use GDB to debug a running program.
gdb program $PID

NOTE

PID indicates the process ID of the program to be debugged.

● GDB debugging command
(gdb) command *args

NOTE

(gdb) is the prompt displayed in a GDB debugging session. command indicates the debugging command to be executed (some commands are abbreviated). *args indicates the parameters required by some debugging commands.

Common commands are as follows:

Command    Description
r          Runs the program until a breakpoint. The program stops running at the breakpoint and waits for the user to enter the next command.
c          Resumes the program running until the next breakpoint.
n          Executes the program one line at a time. A function call is executed as a single step without stepping into the function body.
s          Executes the program one line at a time and steps into the function if a function is called.
until      Runs the program until it exits the loop.
finish     Runs the program until the current function returns.
call       Calls a function that is visible in the program and passes parameters.
l          Views the source code. list n shows the lines around line n (10 lines by default). list func shows the source code of the function.
b n        Adds a breakpoint at line n.
b func     Sets a breakpoint at the beginning of the func() function.
clear n    Clears the breakpoint in line n.

3.2.3 Troubleshooting

3.2.3.1 C/C++ Process Ends Unexpectedly

Fault Locating

The C/C++ process is not executed according to the design requirements and ends unexpectedly. Figure 3-2 shows how to locate and rectify the fault.


Figure 3-2 Fault locating for the abnormal end of a C/C++ process

Step 1 Confirm that the C/C++ program ends unexpectedly.

Step 2 Modify the compilation script or enable the debugging macro to recompile a program that can be debugged.

Step 3 Enable the core dump function so that a core dump file can be generated when an exception occurs.


Step 4 Run the program. When the program stops abnormally, check whether a core dump file is generated in the specified path.

Step 5 Use GDB to debug the core dump file and locate the cause.

Step 6 Modify and recompile the code. Then perform a verification.
● If the problem is solved, integrate the modification into the code.
● If the problem persists, generate a core dump file again to locate the fault. If the locating information is insufficient, add the locating information to the code and compile and run the program again.

----End

Cases

Symptom

Some software exits abnormally on a server.

Fault Locating

Step 1 Run the top command to check whether the related process is ended. If yes, it is an abnormal exit.

Step 2 Add the -g compilation option to the makefile and recompile the code.

Step 3 Enable core dump and set a path for storing the generated core dump file.
ulimit -c unlimited
echo "/home/core.%e.%p.%t" > /proc/sys/kernel/core_pattern

Step 4 Run the program. After the program stops abnormally, the core.dsa_sign_multi.xxx.xxx file is generated in the specified path.

Step 5 Use GDB to debug the core dump file. The debugging window is displayed, as shown in the following figure.
gdb dsa_sign_multi core.dsa_sign_multi.xxx.xxx

Step 6 Run the info threads command to view the thread information of the process.


Step 7 Run the thread command to switch the thread ID to be queried and run the bt command to view the thread stack.
thread $ID
bt

Step 8 Trace the stack information. It is found that the problem is caused by the calling of the OpenSSL library. However, the OpenSSL used by the system cannot be debugged. As a result, the stack information of OpenSSL cannot be viewed.

Step 9 Recompile a debuggable version of OpenSSL and link the program. Run the program again, generate a new core dump file, debug the file, and view related stack information.


Step 10 Check the parameters of the stack that reports the error.
info locals
f xx

Step 11 Check the source code implementation.


MD_Update() is the macro alias of EVP_DigestUpdate().

Step 12 Find the implementation of the EVP_DigestUpdate() function.

The value of the third parameter of the function must be an unsigned integer.

The third parameter is (MD_DIGEST_LENGTH/2 - k), and k is calculated based on (st_idx + MD_DIGEST_LENGTH/2). MD_DIGEST_LENGTH is a constant, and st_idx is a variable.

Step 13 Check the code implementation. It is found that st_idx is copied from the global variable state_index. Then, state_index is accumulated and converted to prevent the value from exceeding the value of STATE_SIZE.


Step 14 In a normal process, st_idx does not exceed STATE_SIZE. However, in a multi-thread environment, there is a low probability that state_index is used by other threads before being converted. As a result, st_idx exceeds STATE_SIZE, and the result of (MD_DIGEST_LENGTH/2 - k) is negative. After the result is passed to EVP_DigestUpdate(), it is interpreted as an unsigned number, that is, as a very large length. As a result, an out-of-bound access error occurs, generating a core dump.

Step 15 Lock the static variable state_index, recompile the code, and verify that the problem does not occur. Then incorporate the modification into the code. The problem is solved.

----End


3.2.3.2 C/C++ Process Suspends

Fault Locating

If the C/C++ process is not executed as expected and the process is still running in the background, the process is suspended. This is usually caused by a deadlock or an infinite loop in the process. Figure 3-3 shows how to locate and rectify the fault.

Figure 3-3 Fault locating for C/C++ process suspension

Step 1 Run the top command to check whether the related process is running. It is found that the process is suspended.


Step 2 Recompile a program version that can be debugged. Run the new version to reproduce the problem. Debug the new version using GDB.

Step 3 Analyze the stack information and service logic to find out the cause.

Step 4 Modify and recompile the code. Then perform a verification.
● If the problem is solved, integrate the modification into the code.
● If the problem persists, add the location information, and recompile and run the program.

----End

Cases

Symptom

A server stops responding when some software is running on it.

Fault Locating

Step 1 Run the top command to check whether the process exists. If the process exists, the process is suspended.

Step 2 Recompile a program that can be debugged, run the new program, reproduce the problem, and query the process ID.

Step 3 Enter the debugging mode of the program.
gdb attach 2573

NOTE

2573 is the process ID of the program.

Step 4 Enable logging and export the stack information to a file.
set logging file core_info.log
set logging on
thread apply all bt


Step 5 Check the thread status.
info threads

Most threads are in the __lll_lock_wait state. The threads are mainly waiting on two lock variables: resource1 and resource2.

Step 6 Run the print command to view the information about the two lock variables. __owner indicates the thread ID of the lock holder.

As shown in the preceding figure, resource1 is held by thread 3889, and resource2 is held by thread 3892.

Step 7 Find the logs of the threads whose IDs are 3889 and 3892 from the collected log file.

According to the status information of the two threads, thread 3889 holds the lock resource1 but requests resource2; thread 3892 holds the lock resource2 but requests resource1. As a result, a deadlock occurs, and all other threads that depend on the two locks are in the waiting state.


Step 8 According to the preceding stack information, threadOne and threadTwo each wait for a lock held by the other, causing a deadlock. Check the service code logic, modify the code, and recompile and run the code. After confirming that the problem does not recur, integrate the modification into the code. The problem is resolved.

----End

3.3 Fortran Application Problems

3.3.1 Introduction to Fortran

Fortran is a high-level language proposed by John Backus in 1954 and formally used in 1956. As the first high-level language officially promoted, Fortran is still widely used in the field of numerical computing and has become one of the main languages in related fields.

3.3.2 Analysis Tool

3.3.2.1 GDB

Similar to applications developed using C/C++, applications developed using Fortran also support GDB debugging. For details about how to use GDB, see 3.2.2.1 GDB.

3.3.3 Troubleshooting

3.3.3.1 Fortran Data Deviation Occurs

Fault Locating

The program is running properly, but the output is different from the expected result. Figure 3-4 shows how to locate and rectify the fault.


Figure 3-4 Fault locating for Fortran data deviation

Step 1 Confirm that data deviation occurs based on the command output.

Step 2 Streamline the calculation process where data deviation occurs.

Step 3 Analyze the possible causes, add the print information, and analyze the print result.

Step 4 Confirm the cause and modify the code. Then perform a verification.

Step 5 If the problem persists, check other possible causes.

----End

Cases

Symptom


Some data in the software running result is inaccurate.

Fault Locating

1. Confirm that the calculation result of the rad_clr7.3 table is inaccurate.

2. Confirm that the error data range is [855,856;1190,1191]. Print key variables in the range. See the following figure.

3. Compare the key variable values with the output.


4. Check the calculation process. It is found that the tau array contains a large amount of abnormal data. It is possible that the data is dirty.

5. Check the calling position of the tau array. It is found that the array is not initialized. Initialize the array.


6. Modify the code and verify that the data deviation problem is solved.

3.3.3.2 Fortran Process Ends Unexpectedly

Fault Locating

The Fortran process is not executed according to the design requirements and ends unexpectedly. Figure 3-5 shows how to locate and rectify the fault.

Figure 3-5 Fault locating for the abnormal end of a Fortran process

Step 1 Confirm that the Fortran process ends unexpectedly.

Step 2 Modify the compilation script or enable the debugging macro to recompile a program that can be debugged.


Step 3 Enable the core dump function so that a core dump file can be generated when an exception occurs.

Step 4 Run the program. When the program stops abnormally, confirm that a core dump file is generated in the specified path.

Step 5 Use GDB to debug the core dump file and locate the cause.

Step 6 Modify the code. Compile the code again to verify the modification.

Step 7 If the problem is solved, integrate the modification into the code.

Step 8 If the problem persists, run the program again and generate a core dump file for fault locating.

NOTE

If there is no location information, add the location information to the code, and recompile and run the code.

----End

Cases

Symptom

A Fortran program ends abnormally.

Fault Locating

1. Run the top command to confirm that the program has exited.

2. Compile a program version that can be debugged, enable core dump, and then run the new program again.

3. Use GDB to debug the core dump file and locate the code where the exception occurs.


4. The core dump file records the memory address qtt accessed by grb2_inq. Normally, grb2_inq reads the grb2 file memory based on input parameters. If the operation is successful, the memory address qtt is returned. It is possible that the read of grb2_inq fails, leaving qtt as a wild pointer.

5. The GDB debugging information shows that the memory address of qtt is not allocated. As a result, the access fails and a core dump occurs. (Note: In the C language, NULL is returned when the memory application fails. For the Fortran interface, a wild pointer is returned. Therefore, a core dump may or may not occur the next time the program accesses the memory.)

3.4 Java Application Problems

3.4.1 Introduction to Java

Java is a high-level programming language launched by Sun Microsystems in May 1995. Java can run on multiple platforms, such as Windows, macOS, and UNIX-like OSs. Due to its cross-platform and beginner-friendly features, Java is widely used in the development of desktop application systems, embedded systems, as well as the e-commerce, enterprise-level, interactive, multimedia, distributed, and web application systems.

3.4.2 Java Fault Locating Tools

3.4.2.1 jps

Introduction

The jps command provided by Java displays the IDs and brief information of all Java processes.

Installation Mode

Use OpenJDK to install this tool. For Bisheng JDK, it is a built-in tool.

Usage

● If a Java environment is configured, you can directly run the jps command.


● If no Java environment is configured, run the ps -ef | grep java command to find the Java process. Then, switch to the JDK bin directory, and run the ./jps command.

Syntax:
jps -l

Common parameters are as follows:

Parameter    Description
-v           Displays the JVM parameters set in the command line.
-l           Outputs the complete package name, application main class name, and complete path name of the JAR package.
-q           Displays only the Java process IDs.

3.4.2.2 jinfo

Introduction

The jinfo tool can be used to view the extended parameters of a running Java application, including Java system attributes and JVM command line parameters. In addition, jinfo can be used to dynamically modify some parameters of a running JVM.

Installation Mode

Use OpenJDK to install this tool. For Bisheng JDK, it is a built-in tool.

Usage

● If a Java environment is configured, you can directly run the jinfo command.

● If no Java environment is configured, run the ps -ef | grep java command to find the Java process. Then, switch to the JDK bin directory and run the ./jinfo command.

Syntax:
jinfo -flags $pid

Common parameters are as follows:

Parameter           Description
-flags              Displays the JVM command line flags.
-sysprops           Displays the Java system attributes.
-flag name=value    Sets the corresponding parameter.


3.4.2.3 jstack

Introduction

jstack is used to view the thread stack information of a Java process.

Installation Mode

Use OpenJDK to install this tool. For Bisheng JDK, it is a built-in tool.

Usage

● If a Java environment is configured, you can directly run the jstack command.
● If no Java environment is configured, run the ps -ef | grep java command to find the Java process. Then, switch to the JDK bin directory and run the ./jstack command.

Syntax:
jstack -l $pid > jstack.log

NOTE

This command is used to save jstack logs to the local jstack.log file.

Common parameters are as follows:

Parameter    Description
-F           Forcibly prints the stack information when jstack does not respond.
-l           Adds the lock information in addition to the common stack information.
-m           Displays all stack information of the Java and native C/C++ frames in mixed mode.

3.4.2.4 jmap

Introduction

You can run the jmap command to view object statistics in the Java heap.

Installation Mode

Use OpenJDK to install this tool. For Bisheng JDK, it is a built-in tool.

Usage

● If a Java environment is configured, you can directly run the jmap command.
● If no Java environment is configured, run the ps -ef | grep java command to find the Java process. Then, switch to the JDK bin directory and run the ./jmap command.

Syntax:
jmap -histo:live $pid

Common parameters are as follows:

Parameter        Description
-heap            Displays Java heap details.
-histo[:live]    Displays a histogram of objects in the heap. If :live is specified, only live objects are counted.
-dump            Generates a heap dump snapshot.
-finalizerinfo   Displays information about objects awaiting finalization.
-clstats         Displays the class loader information.

3.4.2.5 jstat

Introduction

You can run the jstat command to view the usage of each part of the heap memory and the number of loaded classes.

Installation Mode

Use OpenJDK to install this tool. For Bisheng JDK, it is a built-in tool.

Usage

● If a Java environment is configured, you can directly run the jstat command.
● If no Java environment is configured, run the ps -ef | grep java command to find the Java process. Then, switch to the JDK bin directory and run the ./jstat command.

Syntax:
jstat -gcutil $pid 1000

NOTE

1000 indicates that statistics are output every 1000 ms.

Common parameters are as follows:

Parameter           Description
-gc                 Displays garbage collection statistics (value).
-gcutil             Displays garbage collection statistics (percentage).
-class              Displays class loading statistics.
-compiler           Displays compilation statistics.
-gccapacity         Displays the capacity and space usage of three generation objects in the VM memory.
-gcnew              Displays information about young-generation objects.
-gcnewcapacity      Displays information about young-generation objects and their space usage.
-gcold              Displays information about old-generation objects.
-gcoldcapacity      Displays information about old-generation objects and their space usage.
-printcompilation   Displays statistics about the methods compiled by the current VM.

3.4.3 Troubleshooting Java Problems

3.4.3.1 Heap Memory Leak

Fault Locating

The Java program breaks down or responds slowly. The message "OutOfMemoryError: Java heap space" exists in the log file. Figure 3-6 shows how to locate and rectify the fault.


Figure 3-6 Fault locating for heap memory leak

Step 1 Confirm the out of memory (OOM) problem by checking the error logs, memory usage, and GC activities.

Step 2 Increase the memory parameter value to check whether memory leak occurs. If the problem does not recur after the memory parameter value is increased, the problem is not caused by memory leak.

Step 3 Export the heap file using jmap and analyze the file based on the object call stack.

1. Check whether the space usage of large-memory objects is proper.

2. Export heap data before and after service invoking. Compare the object memory usage and analyze whether the memory usage increase is proper.

Step 4 If the code is incorrect, modify and recompile the code. Then, perform a test and verification.

Step 5 If it is not a memory leak, check whether it is a JVM fault.

----End


Cases

Symptom:

A Java process runs slowly and the request delay is abnormally high. The "OutOfMemoryError: Java heap space" exception is recorded in logs.

Fault Locating

Step 1 Run the jstat command to check GC activities. The command output shows that full GC occurs frequently and the memory usage of the old generation (column O) is close to 100%.

Step 2 Increase the maximum heap memory and run the service again. After a period of time, check the GC activity. Full GC still occurs frequently.

Step 3 Use jmap to capture the memory. It is found that a large number of string objects exist.

Step 4 Use jmap to export the heap memory file and use the jvisualvm tool to load the file for further analysis. Analyze possible causes based on the code.


Step 5 According to the code logic analysis, the Map object in this case periodically adds data but never clears it, so the map keeps growing. This is the cause of the fault.

Step 6 Modify the code, compile it, and install the patch in the environment. The problem is solved.

----End

3.4.3.2 Java Process Hangs

Fault Locating

The Java program does not meet the service execution requirements and is hung. Figure 3-7 shows how to locate and rectify the fault.


Figure 3-7 Fault locating of the Java process hanging problem

Step 1 Locate the fault based on the error information in the service logs.

Step 2 Compare and analyze the running of x86 and Kunpeng.

1. Check whether the hanging also occurs on x86 under the same conditions.

2. Increase the memory parameter and check whether the problem recurs.

Check whether the problem is caused by the code logic based on the preceding two steps.

Step 3 Compare JVM differences, such as the default parameters. Check whether there are differences and analyze the impact of the differences on the program.

Step 4 Adjust the differences to ensure that the Kunpeng platform is consistent with the x86 platform, and check whether the problem is resolved.

----End

Cases

Symptom:

After some software is ported to the Kunpeng server, the available memory is used up during the pressure test for three hours. The main service process on the Kunpeng server is suspended. However, no exception occurs during the pressure test for eight hours on an x86 server.

Fault Locating

Step 1 Analyze the memory monitoring data. According to the data, the CPU usage of the Kunpeng is stable, and the memory usage approaches the threshold after a period of time.

Step 2 Compare the Kunpeng environment and the x86 environment. The comparisonshows that the two environments have the same standard hardware configuration.

The system runs stably on x86, and no memory leak occurs. The same code is used in the Kunpeng and x86 architectures. Therefore, it can be concluded that the memory insufficiency is not caused by memory leaks.

Step 3 Configure the maximum memory on the Kunpeng. After the test, no hanging occurs. The possibility of memory leak is low.

Step 4 Compare JVM running parameters and find the differences. Pay attention to memory-related parameters, including GC parameters and space allocation parameters.
● Similarities: The startup parameters and GC configurations are the same.
● Differences: The default parameter values of the thread stack are different. The default thread stack size on Kunpeng is 2048 KB, and that on x86 is 1024 KB. Based on the memory usage, 450 more threads can be started on Kunpeng, and an extra 450 MB memory is occupied. Based on the proportion, the actual used memory will exceed 95% of the maximum memory, and a hang caused by OOM may occur.

Step 5 Change the thread stack size to 1024 KB. After a long-term stability test, the problem does not recur.

----End

3.4.3.3 Java Process Freezes

Fault Locating

The server does not respond to requests and the Java process is not stopped. The CPU usage remains unchanged, and the garbage collector stops working. Figure 3-8 shows how to locate and rectify the fault.


Figure 3-8 Fault locating of the Java process freezing problem

Step 1 Run the top command to check whether the Java process exists and whether the CPU usage of related processes is stable when the service stops responding.

Step 2 Run the jstat command to check the number of young GC times. If the number does not increase when there are service requests, the garbage collector has stopped working.

Step 3 Run the jstack command to print the Java process stack information and verify that most service threads are in the BLOCKED state and are frozen.

Step 4 View thread logs and analyze the cause.

Step 5 Modify the code. Run the code again to verify the modification.

Step 6 Incorporate the modification if the problem has been solved.

Step 7 Add the debugging information and locate the fault again if the fault persists.

----End


Cases

Symptom:

When customer services are tested on both x86 and Kunpeng servers, the Kunpeng node freezes irregularly and does not respond to the heartbeat query of the active node. As a result, the Kunpeng node is removed from the cluster.

Fault Locating

Step 1 Run the top command. The command output shows that the Java process exists and the CPU usage of related processes is stable.

Step 2 Run the jstat command. The command output shows that the number of young GC times does not increase. It is confirmed that the garbage collector has stopped working.

Step 3 Run the jstack command to print the stack information of the Java process. The command output shows that most service threads are in the BLOCKED state.

Step 4 Run the pstack command to print the call stack information of the Presto process after the node is disconnected. The related thread keeps running the SpinPause function, which is invoked by the GC mechanism. Normally the GC completes, but the Presto process on this node never leaves the SpinPause function. This confirms that the problem is caused by the JDK.

Step 5 Replace the JDK with Bisheng JDK. The freeze no longer occurs, and the process suspension problem is resolved.

----End

3.5 Go Application Problems

3.5.1 Introduction to Go

Go is a high-level programming language that Robert Griesemer, Rob Pike, and Ken Thompson began developing at the end of 2007 and that was open sourced in November 2009. It is widely used in server-side development, in-memory databases, and cloud platforms because of its simplicity, speed, security, and support for parallelism.

3.5.2 Go Fault Locating Tools

3.5.2.1 Delve

Introduction

Delve is a debugger for Go programs. It is simple to use and has a complete feature set.

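Installation Mode

go get -u github.com/go-delve/delve/cmd/dlv

On Go 1.16 or later, binaries are installed with go install instead:

go install github.com/go-delve/delve/cmd/dlv@latest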

Usage

● Initialize the module to generate a go.mod file.

go mod init Module name

● Debug the application.

dlv debug File name

NOTE

This command is used to enter the debugging mode.

● Run a dlv debugging command.

(dlv) command *args

NOTE

The (dlv) prompt is displayed after you enter the dlv debugging session. command indicates the debugging command to be executed (some commands are abbreviated). *args indicates the parameters required by some debugging commands.

Common commands are as follows:

Command   Description
h         Displays the usage.
b         Sets a breakpoint.
c         Runs until a breakpoint is hit or the program ends.
disass    Disassembles the code.
n         Goes to the next line.
r         Restarts the process.
bt        Displays the stack trace.
s         Steps through the program one statement at a time.
p         Evaluates an expression.
ls        Displays the source code.
q         Exits the debugger.
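
A small program is enough to try these commands. The following sketch is for illustration only; the file name main.go and the function sumTo are hypothetical. The trailing comment shows one possible session that uses only the commands listed above.

```go
package main

import "fmt"

// sumTo returns 1 + 2 + ... + n. It is a convenient breakpoint target.
func sumTo(n int) int {
	total := 0
	for i := 1; i <= n; i++ {
		total += i
	}
	return total
}

func main() {
	fmt.Println(sumTo(10))
}

// A possible session after "dlv debug main.go":
//   (dlv) b main.sumTo   set a breakpoint at the function
//   (dlv) c              run until the breakpoint is hit
//   (dlv) n              execute the next source line
//   (dlv) p total        evaluate the variable total
//   (dlv) bt             display the stack trace
//   (dlv) q              exit the debugger
```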

3.5.3 Troubleshooting Go Problems

3.5.3.1 Go Program Stops Abnormally

Fault Locating

The Go process does not run as designed and ends abnormally. An error similar to "unexpected fault address and signal SIGSEGV: segmentation violation code=0x32XXXX" is reported. Figure 3-9 shows how to locate and rectify the fault.

Figure 3-9 Fault locating of the abnormal Go program stop

Step 1 Confirm that the Go program ends abnormally.

Step 2 Run the go mod init {Project folder} command to generate the go.mod file.

Step 3 Use the dlv debugging tool to debug the Go program and locate the cause.

Step 4 Modify the code. Compile the code again to verify the modification.

Step 5 Incorporate the code if the verification is successful. The problem is solved.

Step 6 If the problem persists, add location information (for example, log output that records where execution has reached) to the code. Recompile the code and locate the problem again.

----End
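
Step 6 does not say how to add location information. One common approach in Go, shown here as a sketch and not as the method this guide prescribes, is to log the file and line of each checkpoint. The helper name where is hypothetical; runtime.Caller and log.Lshortfile are part of the standard library.

```go
package main

import (
	"fmt"
	"log"
	"path/filepath"
	"runtime"
)

// where returns "file:line" of its caller, or "unknown" if the
// runtime cannot provide the information.
func where() string {
	_, file, line, ok := runtime.Caller(1)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

func main() {
	// Lshortfile makes the standard logger prefix each message with file:line.
	log.SetFlags(log.LstdFlags | log.Lshortfile)
	log.Println("before the suspicious call")

	// where can be used when the location is needed as a value.
	fmt.Println("checkpoint at", where())
}
```

The last checkpoint printed before the crash narrows down the code region to inspect with dlv.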

Cases

Symptom:

When software is developed based on the Go language, the test ends abnormally after the software is ported to Kunpeng for testing.

Fault Locating

Step 1 Check the error log, as shown in the following figure.

The "unexpected fault address" error is reported, indicating that the program exits abnormally.

Step 2 Generate the go.mod file. The subsequent dlv debugging depends on the go.mod file in the project folder.

go mod init main

NOTE

In the preceding command, main indicates the project folder.

Step 3 Enter the debugging mode. Set breakpoints and perform step-by-step debugging.

dlv debug test.go

NOTE

In the preceding command, test.go is the source code file.

Step 4 Confirm that a segment of machine code exists in the assembleJump function, as shown in the following figure.

Step 5 Analyze the code. The code comes from the monkey patch project of the open source community, which is mainly used to replace functions.

For example, monkey.patch(A, B) processes functions A and B. When function A is called, function B is actually executed. That is, A() → B().

Step 6 Trace the machine code. It is generated from the following two x86 assembly statements:

mov rdx, main.b.f;
jmp [rdx];

The mov instruction loads the address of main.b.f into the RDX register. The jmp instruction then jumps to the address stored at the memory location that RDX points to.

Step 7 Use the Kunpeng assembly instructions to reconstruct the related assembly statements and translate them into machine code.

mov x10, main.b.f;
ldr x11, [x10];
br x11;

Step 8 Split the mov operation into four instructions. On Kunpeng, a mov instruction can load only 16 bits of an immediate at a time, whereas on x86 a single mov can load 64 bits, and main.b.f is a 64-bit immediate.

Step 9 Recompile and verify the code.

Step 10 Incorporate the code. The problem is solved.

----End
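
Steps 7 and 8 can be illustrated in code. The sketch below is not the patch used in this case. It is a hypothetical Go helper (jumpStub) that builds AArch64 machine code for the sequence in Step 7. It loads a 64-bit address into x10 in four 16-bit pieces (one MOVZ followed by three MOVK instructions), followed by ldr x11, [x10] and br x11. The encodings follow the A64 instruction set, and the sample address is made up.

```go
package main

import "fmt"

// movWide encodes MOVZ (keep=false) or MOVK (keep=true) for a 64-bit
// register: imm16 is written to bits [16*hw+15 : 16*hw] of register rd.
func movWide(keep bool, rd uint32, imm16 uint16, hw uint32) uint32 {
	op := uint32(0xD2800000) // MOVZ Xd, #imm16, LSL #(16*hw)
	if keep {
		op = 0xF2800000 // MOVK Xd, #imm16, LSL #(16*hw)
	}
	return op | hw<<21 | uint32(imm16)<<5 | rd
}

// jumpStub returns six instructions: four moves that build addr in x10,
// then ldr x11, [x10] and br x11.
func jumpStub(addr uint64) []uint32 {
	const x10, x11 = 10, 11
	code := make([]uint32, 0, 6)
	for hw := uint32(0); hw < 4; hw++ {
		part := uint16(addr >> (16 * hw))
		// The first move zeroes the rest of x10; the others keep it.
		code = append(code, movWide(hw > 0, x10, part, hw))
	}
	code = append(code, 0xF9400000|x10<<5|x11) // ldr x11, [x10]
	code = append(code, 0xD61F0000|x11<<5)     // br x11
	return code
}

func main() {
	// Hypothetical 64-bit address standing in for main.b.f.
	for _, ins := range jumpStub(0x0000FFFF12345678) {
		fmt.Printf("%08x\n", ins)
	}
}
```

The first instruction is a MOVZ so that the remaining bits of x10 are cleared. Each MOVK then fills one more 16-bit field without disturbing the others, which is why four instructions are needed for a 64-bit value.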

4 Auxiliary Analysis of the Kunpeng DevKit

4.1 Overview

4.2 Cases

4.1 Overview

Diagnosis and Debugging is a sub-tool of the Kunpeng Hyper Tuner.

Diagnosis and Debugging is a performance analysis tool for Kunpeng-powered servers. It provides memory leak diagnosis (including memory release failure and abnormal memory release), memory usage analysis and display, and out of memory (OOM) diagnosis. Use this tool to identify memory usage issues in source code and improve application reliability.

Diagnosis and Debugging provides the following features:

● Memory leak diagnosis: Analyzes the memory leaks (including memory release failure and abnormal memory release) of applications to obtain the specific leak information and associate the call stacks and source code.

● Memory usage analysis: Traces the memory consumption of the system layer, application layer (invoking memory application functions), allocator layer, and memory map layer in real time when applications are running.

● OOM diagnosis: Analyzes the process memory status, system memory status, and call stack information when OOM occurs.

Tool official website: https://www.hikunpeng.com/en/developer/devkit/hyper-tuner

4.2 Cases

4.2.1 Creating a Task

Create a task by configuring the test parameters such as the application path, diagnosis content, and diagnosis duration, as shown in the following figure. Click OK to start the analysis. For details about the tool, visit https://support.huaweicloud.com/intl/en-us/hypertuner-feature/kunpengfeat_06_0083.html.

4.2.2 Analyzing the Result

After the task is complete, the analysis result page is displayed, as shown in the following figure.

The Call Tree tab page lists the functions where memory leakage occurs. Take the main function as an example. You can view the memory leak information about the main function and the called functions, including the size and number of memory leaks of the main function.

On the Source Code tab page, you can accurately locate the code where a memory leak occurs to facilitate code modification. In this example code, a memory leak occurs in the main function. The malloc function allocates a piece of memory but does not release it, as shown in the following figure.

To solve the leak problem of the main function, perform the following steps:

Step 1 Add the following code to the appropriate position to release the memory:

free(p);
p = NULL;

Step 2 Recompile and run the code, and use the tool to scan the code again.

Step 3 Confirm that the memory leak problem does not exist in the main function. Repeat the preceding steps to modify the functions that may have memory leaks.

----End

A ARM64 General-Purpose Register

The ARM64 provides 31 general-purpose registers. Table A-1 lists the names and functions of the registers.

Table A-1 Introduction to ARM64 general-purpose registers

Register    Description

x0 – x7     Parameter registers, used to pass subprogram parameters. A subprogram does not need to preserve them. Parameters that do not fit in these registers are stored on the caller's stack and passed to the called function through the stack. The x0 register is also used as the return value register.

x8          Indirect result register. A subprogram does not need to preserve it. It is used to pass the address of an indirect result. For example, when a function returns a large structure, x8 holds the structure address.

x9 – x15    Temporary registers, which a subprogram does not need to preserve.

x16 – x17   Intra-procedure-call registers. They are usually used by linker-generated code such as PLT stubs for dynamic linking. They are also called IP0 and IP1.

x18         Platform register. It is reserved for the platform, and its use varies according to the operating system.

x19 – x28   Callee-saved registers, which a subprogram must save before using and restore before returning.

x29         Frame pointer (FP) register. It is used to link stack frames and must be saved when being used.

x30         Link register (LR), which is used to save the return address of subprograms.

B Change History

Date Description

2021-09-01 This issue is the second official release.

● Integrated the contents related to application problems and placed them in "3 Application Problems".

● Added sections 3.3 Fortran Application Problems, 4 Auxiliary Analysis of the Kunpeng DevKit, and A ARM64 General-Purpose Register.

2021-06-18 This issue is the first official release.
