Troubleshooting slow builds

Issue

I have a build that is running slow in Jenkins, and I’d like to know some ways to troubleshoot the root cause.

Environment

Jenkins

Troubleshooting Process

There isn’t one simple way to troubleshoot a slow build; the methods and tools you use will depend on the type of build you are running and on the execution environment. Here are some things to start with:

1. First, try to isolate whether the build is slow on some build agents and fast on others by reviewing the history of previous builds; a command-line sketch for comparing recent build durations across agents follows below. Also consider whether the build started to become slow recently or has been slow for a long time, and check for source code changes around the time it started to slow down. You have likely already done these steps before finding this article, so let’s move on to more troubleshooting methods.
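If you want to compare durations from the command line, here is a minimal sketch (not part of the original procedure) that uses the Jenkins JSON API. It assumes a freestyle job named myJob (Pipeline builds do not expose builtOn), that curl and jq are installed, and that JENKINS_URL, USER, and API_TOKEN are placeholders set to your own values; duration is reported in milliseconds:

# List the last 50 builds with their number, duration (ms), and the agent they ran on.
# myJob, USER, and API_TOKEN are placeholders for your own job and credentials.
curl -sg -u "$USER:$API_TOKEN" \
  "$JENKINS_URL/job/myJob/api/json?tree=builds[number,duration,builtOn]{0,50}" |
  jq -r '.builds[] | [.number, .duration, .builtOn] | @tsv'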

2. While the build is running, collect data on the machine or container that is executing it:

a) If the build runs on a traditional virtual machine or bare-metal machine, run:

mkdir slowBuildData
cd slowBuildData
while true; do
  top -w 300 -bc -n 1 > topOutput.$(date +%Y%m%d%H%M%S).txt
  vmstat -t > vmstat.$(date +%Y%m%d%H%M%S).txt
  sleep 5
done

(Depending on the version of top installed on your machine, you may need to remove or adjust any options it does not support.)
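Before starting the loop, it can help to confirm once that your top (for example procps-ng top) accepts these options:

# Print the installed top version and run the command once as a sanity check
top -v
top -w 300 -bc -n 1 | head -5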

b) If the build is running inside Kubernetes:

export POD_ID=podThatIsRunningTheSlowBuild
mkdir slowBuildData
cd slowBuildData
while true; do
  (kubectl exec $POD_ID -- top -w 300 -bc -n 1 ) > topOutput.$(date +%Y%m%d%H%M%S).txt
  (kubectl exec $POD_ID -- vmstat -t ) > vmstat.$(date +%Y%m%d%H%M%S).txt
  sleep 5
done
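If the agent pod runs in a non-default namespace or contains more than one container, a hedged variant of the same snapshot command passes them explicitly; NAMESPACE and CONTAINER are placeholders for your own values:

# Same top snapshot, but with an explicit namespace and container
(kubectl exec -n "$NAMESPACE" -c "$CONTAINER" "$POD_ID" -- top -w 300 -bc -n 1) > topOutput.$(date +%Y%m%d%H%M%S).txt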

Once the build is complete, press Ctrl-C to stop collecting data and cd back to the parent directory; the analysis commands below assume the collected files are under slowBuildData/.

3. Now we can use that data to look for possible causes of the slow build. First, check the top 10 user-space CPU utilization samples by running:

grep '%Cpu' slowBuildData/topOutput.* | awk '{print $2 " " $3 " " $4 " " $5}' | sort -n | tail -10
5.0 us, 1.7 sy,
5.1 us, 0.0 sy,
5.1 us, 0.0 sy,
5.8 us, 2.5 sy,
6.1 us, 1.7 sy,
6.5 us, 1.6 sy,
6.7 us, 0.8 sy,
6.8 us, 1.7 sy,
7.6 us, 0.0 sy,
10.0 us, 0.0 sy,

In this case, we can see that user CPU peaked at only 10%, so this does not appear to be the root cause of the slow build.
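Beyond the top 10 samples, a hedged variation of the same command can report the average and maximum user CPU across every sample collected:

# Average and peak user CPU across all top snapshots in slowBuildData/
grep '%Cpu' slowBuildData/topOutput.* | awk '{ sum += $2; if ($2 > max) max = $2 } END { if (NR) printf "samples=%d avg_us=%.1f max_us=%.1f\n", NR, sum/NR, max }'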

4. Next, check sys CPU (CPU time spent inside the kernel); it also barely peaked over 11%:

grep '%Cpu' slowBuildData/topOutput.* | awk '{print $4 " " $5 " " $2 " " $3}' | sort -n | tail -10
4.2 sy, 1.7 us,
4.2 sy, 2.5 us,
4.9 sy, 0.0 us,
5.0 sy, 3.3 us,
7.6 sy, 0.8 us,
8.3 sy, 1.7 us,
9.0 sy, 2.5 us,
9.8 sy, 0.8 us,
10.3 sy, 3.4 us,
11.6 sy, 0.8 us,

In this example we collected over 1326 samples; the output above shows the top 10 user CPU and sys CPU values, and both peaked at around 10-11%, so this job does not appear to be CPU bound at all.
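To see how many samples your own run collected, you can simply count the snapshot files, e.g.:

ls slowBuildData/topOutput.* | wc -l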

5. Next, look for I/O wait. The top 10 ‘cpu wa’ (CPU waiting for I/O) entries from vmstat are all zero:

head -2 slowBuildData/vmstat.20190201154336.txt | tail -1 | awk '{print $16}';  (for f in slowBuildData/vmstat*; do tail -1 $f; done) | awk '{print $16}' | sort -n | tail -10
wa
0
0
0
0
0
0
0
0
0
0

This is surprisingly low I/O wait, so check the top output for I/O wait as well; it also shows almost none:

grep '%Cpu' slowBuildData/topOutput.* | awk -F',' '{ print $5}' | sort -n | tail -10
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.8 wa
  0.8 wa

Hence, I see no evidence of I/O wait.
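If you do suspect storage rather than CPU, you could also capture per-device statistics inside the same collection loop. This is a hedged addition, assuming the sysstat package (which provides iostat) is installed on the machine:

# Extended per-device I/O statistics, written next to the top/vmstat snapshots
iostat -x > iostat.$(date +%Y%m%d%H%M%S).txt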

6. Checking whether we were low on RAM during the build, I see no evidence of this at all:

grep 'MiB Mem' slowBuildData/topOutput.* | awk -F',' '{print $2}' | sort -n | head -10
   3795.5 free
   3796.1 free
   3796.6 free
   3796.7 free
   3796.9 free
   3797.2 free
   3797.4 free
   3797.8 free
   3798.0 free
   3798.0 free

For a fuller analysis of Linux RAM usage we would also need to look at the buff/cache memory, which I did not go into here since there was plenty of free memory at this point; for more details, see https://www.linuxatemyram.com/ .
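For reference, the buff/cache figure is on the same ‘MiB Mem’ line of the top output, so a hedged variation of the command above can show the lowest free values together with buff/cache:

grep 'MiB Mem' slowBuildData/topOutput.* | awk -F',' '{print $2 "," $4}' | sort -n | head -10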

7. At this point I do not see any CPU, I/O, or memory bottlenecks for this build, so the next thing to look at is the build log itself: review the elapsed time data for that build and look for large gaps.

To do this, take a backup of your JENKINS_HOME, install the Timestamper plugin, and restart your Jenkins instance. Then go to the build in question and review its log with the Timestamper display set to elapsed time:

[Screenshot: Timestamper display option set to Elapsed Time]

Then scan the build log for large gaps between consecutive timestamps and focus on the parts that took a long time:

[Screenshot: reviewing the build log with elapsed timestamps to find large gaps]
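If you prefer to scan for gaps from the command line, the sketch below is one way to flag them, assuming you have saved a copy of the console log to build.log with an HH:MM:SS timestamp at the start of every line (for example copied from the Timestamper view); the 60-second threshold is only an example:

# Flag lines that start more than 60 seconds after the previous timestamped line
awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
  t = substr($1,1,2)*3600 + substr($1,4,2)*60 + substr($1,7,2)
  if (seen && t - prev > 60) print "gap of " (t - prev) "s before: " $0
  prev = t; seen = 1
}' build.log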

Summary

In the steps above, we checked user and sys CPU, I/O, and memory for this build, and then analyzed the build log with the Timestamper plugin. We saw that one command in the build ran for two hours, so we worked with the team that maintains that code and found the root cause of the slow build.

This is not an exhaustive guide, but it has hopefully given you a start on possible ways to troubleshoot slow builds in Jenkins.
