Troubleshoot Slow Builds

Symptoms

  • Building times have increased since a particular build number
  • The builds are running slower than they used to

Diagnostic/Treatment

The following recommendations and techniques help to troubleshoot this issue and identify the root cause. They are based on experience with CloudBees customers.

Preconditions

  • The Jenkinsfile has not suffered any modification
  • The source code has not changed around the time the build started to become slow.
  • The build process is slow across all the agents. Review the build history to find out if the build process is becoming slow on some agents and fast on others.

Environment

Troubleshooting Process

There isn’t one simple way to troubleshoot a slow build, and the methods and tools you use will depend on what type of build you are running and the execution environment.

Pipeline Durability Settings

Refer to Suggested Best Practices And Tips for Durability Settings.

Stages

As starting point the Pipeline: Stage View helps you to identify the most time-consuming stage and how they have chnaged across the build numbers.

Agents

Executor starvation

If you see a little black clock icon in the build queue as shown below, it is an indication that your job is sitting in the queue unnecessarily (refer to Executor Starvation)

Performance

  • Firstly, validate that the computer capabilities of the agent (especially the CPU) have not suffered from any modification. Note that in CloudBees CI on Modern the Pod spec of the agent can be constrained by its limits impacting negatively in the building performance times if it is not correctly configured.

  • While the build is running, collect data on the machine/container that is running that build by running:

a) If on a traditional virtual machine or bare metal machine, run:

mkdir slowBuildData
cd slowBuildData
while true; do
  top -w 300 -bc -n 1 > topOutput.$(date +%Y%m%d%H%M%S).txt
  vmstat -t > vmstat.$(date +%Y%m%d%H%M%S).txt
  sleep 5
done

(depending on the version of top installed on your machine, you may need to modify options that are not supported in your version of top)

b) If the build is running inside of Kubernetes:

export POD_ID=podThatIsRunningTheSlowBuild
mkdir slowBuildData
cd slowBuildData
while true; do
  (kubectl exec $POD_ID -- top -w 300 -bc -n 1 ) > topOutput.$(date +%Y%m%d%H%M%S).txt
  (kubectl exec $POD_ID -- vmstat -t ) > vmstat.$(date +%Y%m%d%H%M%S).txt
  sleep 5
done

Once the build is complete, press ctrl-c to stop collecting data.

  • Now we can make use of that data to try to find possible reasons for a slow build. We will first check what the top 10 CPU utilizations were in the user process space by running:
grep '%Cpu' logs/topOutput.* | awk '{print $2 " " $3 " " $4 " " $5}' | sort -n | tail -10
5.0 us, 1.7 sy,
5.1 us, 0.0 sy,
5.1 us, 0.0 sy,
5.8 us, 2.5 sy,
6.1 us, 1.7 sy,
6.5 us, 1.6 sy,
6.7 us, 0.8 sy,
6.8 us, 1.7 sy,
7.6 us, 0.0 sy,
10.0 us, 0.0 sy,

In this case, we can see that we only got a peak of 10 % user CPU at the most, so this does not appear to be a root cause of the slow build.

  • Next we are going to check for Sys CPU (CPU time spent inside of the kernel), and it also barely peaked over 11 % sys CPU:
grep '%Cpu' logs/topOutput.* | awk '{print $4 " " $5 " " $2 " " $3}' | sort -n | tail -10
4.2 sy, 1.7 us,
4.2 sy, 2.5 us,
4.9 sy, 0.0 us,
5.0 sy, 3.3 us,
7.6 sy, 0.8 us,
8.3 sy, 1.7 us,
9.0 sy, 2.5 us,
9.8 sy, 0.8 us,
10.3 sy, 3.4 us,
11.6 sy, 0.8 us,

In my example, we took over 1326 samples, and the above shows the top 10 user CPU and sys CPU, and both of them peaked around 10-11%, so it seems this job is not CPU bound at all.

  • Next looking for IO wait, and we see that the top 10 ‘cpu wa’ (cpu waiting for IO) entries are all zero:
head -2 logs/vmstat.20190201154336.txt | tail -1 | awk '{print $16}';  (for f in logs/vmstat*; do tail -1 $f; done) | awk '{print $16}' | sort | tail -10
wa
0
0
0
0
0
0
0
0
0
0

This is surprisingly low IO wait, so I’m checking for IO wait in the top output, and I also see there is almost no IO wait:

grep '%Cpu' logs/topOutput.* | awk -F',' '{ print $5}' | sort -n | tail -10
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.0 wa
  0.8 wa
  0.8 wa

Hence I see no evidence of IO wait.

  • Checking if we were low on RAM during the build, I see no evidence of this at all:
grep 'MiB Mem' logs/topOutput.* | awk -F',' '{print $2}' | sort -n | head -10
   3795.5 free
   3796.1 free
   3796.6 free
   3796.7 free
   3796.9 free
   3797.2 free
   3797.4 free
   3797.8 free
   3798.0 free
   3798.0 free

For more analysis of Linux RAM usage, we need to look at the buff/cache ram as well, which I did not go into since there was enough free at this point. For more details please see https://www.linuxatemyram.com/ .

Steps

A couple of options here:

Pipeline Steps

The Pipeline Steps option in the left hand side of a build menu brings a list with times per step.

Timestamper

The Timestamper plugin elapsed time mode helps you to identify the most time-consuming steps.

timestamper elapsed time

Then look through the build log to look for when there are large gaps in the build log timings, and focus on the parts that took a long time:

timestamper build log review

Addtionally, you can sort the build steps by their times by importing the console output as *.CSV into a spreadsheet

  • Use delimiter to split the Elapse Time from the EVENT (e.g.])
Elapse TimeEVENT
00:00:16.158mha:////4KP/4Pf+epvfkRScTh/Od/EhPv9bp+DJvtREVK8n7tRzAAAAm +LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+
00:00:00.15> git rev-parse –is-inside-work-tree # timeout=10
00:00:00Setting origin to ssh://git@example.com:7999/my-repos/my-projects.git
00:01:52.76> Task :compileKotlin
  • After that, sort the table by the Elapse Time:
Elapse TimeEVENT
00:01:52.76> Task :compileKotlin
00:01:22.82[8mha:////4K5xCkOsmh9+gnIwabjnTGF4g/TnT1Q8ptTnVrhxTZdEAAAAqB+LCAAAAAAAAP9tjTEOwjAUQ3+LOrAycohUYoABMXWNsnCC0ISQNvq/
00:01:08.57+ npm i

Alternatively, you could calculate the elapsed time from timestamp as follows

  • Clean up the timestamp to get only the time for the event (T1)
  • Duplicate the time column and copy its value into a new column T2 but shifting up the first time record, in other words adding the starting time for the next event.
T1 (this)T2 (next)T2-T1TIMESTAMPEVENT
18:56:19.35818:56:35.50800:00:16.152021-02-15T18:56:19.358Z8mha:////4KP/4Pf+epvfkRScTh/Od/EhPv9bp+DJvtREVK8n7tRzAAAAm +LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+
18:56:35.50818:56:35.66200:00:00.152021-02-15T18:56:35.508Z> git rev-parse –is-inside-work-tree # timeout=10
18:56:35.5118:56:35.66200:00:002021-02-15T18:56:35.662ZSetting origin to ssh://git@example.com:7999/my-repos/my-projects.git
19:03:43.09419:05:35.85300:01:52.762021-02-15T19:03:43.094Z> Task :compileKotlin

Have more questions?

0 Comments

Please sign in to leave a comment.