What is jenkinshangWithJstack.sh and how does it help?

Symptoms

I’ve run the script jenkinshangWithJstack.sh and shared the data with CloudBees Support.
What is this and how does it help us understand what’s causing my master’s CPU to spike?

Diagnosis/Treatment

The script jenkinshangWithJstack.sh uses seven common tools top, vmstat, netstat, nfsstat, nfsiostat, iostat, and jstack to capture diagnostic information.
Run with the default values of a 60 second duration and an interval of 5 seconds, each tool will be invoked 12 times evenly spread over one minute. Simply taking one data point would not provide a complete picture of what’s causing the CPU to spike. Increasing the duration of the test to 5 minutes or more is often desirable.

If you were to extract the bundle which you share with Cloudbees Support, there is a directory
for the output of each tool.

But how exactly can you use this?

One example of how this data can be correlated to gain a better understanding of what is causing high CPU values is with top and jstack.

First, let investigate the topOutput folder. Each file in this folder captures the output of running the top command.
The %CPU value is the ninth column in each file.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  4323 jenkins+  20  0   17.9g    15g    17m S  335.3 64.0  36:35.87 /etc/alternatives/java -D...

Let’s find when the master’s CPU peaked during our test. To do this, we must filter out all other processes
so we can compare the Jenkins java process over the course of our test. We know the process id as we used
this when we invoked the script. Combining the process id and the process owner (typically jenkins), we
can use grep to see only the processes we are interested in:

  grep -H "4323 jenkins" *.txt

Or to make comparison easier to do, we can remove the arguments to the java command.

  grep "4323 jenkins" *.txt | sed -e 's/java .*//'
  topOutput.20180925112257.txt: 4323 jenkins+  20   0 17.9g 15g  17m S  98.4 64.0  36:33.79 /etc/alternatives/
  topOutput.20180925112318.txt: 4323 jenkins+  20   0 17.9g 15g  17m S 240.5 64.0  36:34.31 /etc/alternatives/
  topOutput.20180925112339.txt: 4323 jenkins+  20   0 17.9g 15g  17m S  96.5 64.0  36:34.84 /etc/alternatives/
  topOutput.20180925112359.txt: 4323 jenkins+  20   0 17.9g 15g  17m S 335.3 64.0  36:35.35 /etc/alternatives/
  topOutput.20180925112420.txt: 4323 jenkins+  20   0 17.9g 15g  17m S 110.4 64.0  36:35.87 /etc/alternatives/

Taking the above output as an example, we see the %CPU spiked to it’s highest point with the collection
timestamped at 20180925112359. So what was happening at that time?

The next step is to look at the sub processes running at that time. Inside the topdashHOutput folder, open the corresponding
topdashHOutput file which has the same timestamp in it’s name. The naming convention is “topdashHOutput.PID.timestamp.txt”.
In our example, this is topdashHOutput.4323.20180925112359.txt. The topdashHOutput files lists the sub processes ordered by the highest CPU.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  20379 jenkins+    20   0 17.9g  15g  17m R 89.2 64.0 248:38.86 java

At this point, we can correlate the sub process with PID 20379 to the data collected from jstack. Open the file jstack.4323.20180925112359.txt located in the jstack folder.

To make this correlation, we only need to convert the PID value 20379 to hex, which is 0x4f98. Now we find the
process who’s nid value is 0x4f98 in the jstack file.

  "Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f2558caa800 nid=0x4f98 runnable

So, in this example, we see that the CPU was spiking due to garbage collection! And that’s how easy it is to get a
quick insight into what was contributing to high CPU levels.

Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.