The following information should help guide you towards setting up NFS for usage with CloudBees Jenkins Enterprise and CloudBees Jenkins Operations Center when enabling High Availability.
This guide assumes that you are using a RHEL based system or variant. If you are not, then please use the content here as a framework for how your OS should be configured using its own processes and tooling.
1) Use NFS v4.1
CloudBees engineering has validated NFS v4.1, and it is the recommended NFS version for all Jenkins environments. File storage vendors have reported to CloudBees customers that there are known performance issues with v4.0. NFS v3 is known to be performant, but it is considered insecure in most environments.
Minimal installations of RHEL do not provide NFS out of the box, so make sure to install the following package as root:
yum -y install nfs-utils
This package will pull in all the needed dependencies.
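Once a client has mounted the share (step 8 below), you can confirm that v4.1 was actually negotiated. A quick check from the client, assuming the mount is active:
nfsstat -m
Look for vers=4.1 in the flags of the mount in question.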
2) Disable Security Access Controls
On initial setup, it’s best to remove anything that will cause complications. You can always add these restrictions back as you’re prototyping out your infrastructure.
Disable the following components:
- SELinux
- IPTables/FirewallD
SELinux can be switched over to “permissive” mode by editing /etc/sysconfig/selinux and setting the SELINUX variable to:
SELINUX=permissive
This change takes effect at the next boot. A restart of the operating system is only required if you wish to disable SELinux entirely by setting:
SELINUX=disabled
If you can’t afford to restart the OS, you can run setenforce permissive to switch modes immediately. Red Hat generally encourages leaving SELinux in a permissive state rather than disabling it, in case you are interested in enabling it again in the future. The logging data it collects in permissive mode greatly helps with creating security policies.
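To confirm the current SELinux state before moving on:
getenforce
sestatus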
To disable the firewall, run the following commands on RHEL6.
chkconfig iptables off
service iptables stop
To disable the firewall on a new distro such as RHEL7, run the following commands:
systemctl disable firewalld
systemctl stop firewalld
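You can verify that the firewall is actually off before continuing:
# RHEL6
service iptables status
# RHEL7
systemctl status firewalld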
3) Configure NFS to use static ports
By default, NFS generates “dynamic” ports with the rpcbind daemon. This can result in sporadic behavior when using a corporate network that filters ports. It’s best to lock these ports down so that you can better predict behavior and provide your networking team with a definite list of exclusions to prevent downtime due to network security policy.
Edit the /etc/sysconfig/nfs file and set the following variables to ports of your choosing (an example snippet follows this list):
- MOUNTD_PORT=port
- STATD_PORT=port
- LOCKD_TCPPORT=port
- LOCKD_UDPPORT=port
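A minimal sketch of what that section of /etc/sysconfig/nfs could look like; the port numbers below are only example values (the ones commonly shown in Red Hat documentation), and any unused ports on your network will work:
MOUNTD_PORT=892
STATD_PORT=662
LOCKD_TCPPORT=32803
LOCKD_UDPPORT=32769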
The following list should be documented internally and raised to your network/firewall team to ensure that these ports are not being blocked on your networks.
Port List:
- 2049 - TCP/UDP (nfsd)
- 111 - TCP/UDP (rpcbind/portmapper)
- MOUNTD_PORT - TCP/UDP
- STATD_PORT - TCP/UDP
- LOCKD_TCPPORT - TCP
- LOCKD_UDPPORT - UDP
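Once the NFS services have been restarted (step 7), you can confirm which ports the services are actually registered on:
rpcinfo -p
The output should show mountd, status, and nlockmgr bound to the static ports you chose above.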
4) Configure NFS to use more concurrent processes
By default, the NFS server only starts 8 server threads, which limits how many requests it can handle concurrently. Larger systems may require more, so raising this value to 16 is preferable. You can make this change in the /etc/sysconfig/nfs file by setting the RPCNFSDCOUNT variable. By default it is commented out.
RPCNFSDCOUNT=16
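After restarting the NFS service (step 7), you can verify that the new thread count took effect:
cat /proc/fs/nfsd/threads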
5) Configure NFS to use more resources
Guides on tuning system memory should never be treated as a magic bullet solution; however, we recommend that you at least start with this template and make adjustments as needed moving forward.
Add the following lines to your /etc/sysctl.conf file for anything running on RHEL6 or lower. If you are using a newer distro, then add this information to a new file called /etc/sysctl.d/30-nfs.conf:
sunrpc.tcp_slot_table_entries = 128
sunrpc.tcp_max_slot_table_entries = 128
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.ip_local_port_range = 1024 65000
fs.inode-max = 128000
fs.file-max = 64000
fs.nfs.nlm_tcpport = LOCKD_TCPPORT #from step 3
fs.nfs.nlm_udpport = LOCKD_UDPPORT #from step 3
The critical parameters you should tune are the sunrpc values. By default, distributions such as Red Hat intentionally lower the tcp_slot_table_entries parameter to 2, which is not ideal for NFS servers and should therefore be raised.
On RHEL6 based systems or lower, you can apply these changes by running sysctl -p after adding the information to /etc/sysctl.conf. Newer distros such as RHEL7 invoke the command a little differently: sysctl --system.
Note that sunrpc values will not work if the kernel has not loaded sunrpc before the sysctl values are applied at boot time. You can check to see if the values are applying correctly by rebooting the OS and running the following command:
sysctl sunrpc.tcp_max_slot_table_entries
You can force the OS to load sunrpc by adding a modprobe configuration. This should only be necessary if nfs-utils is not automatically loading the module in an adequate amount of time during boot.
/etc/modprobe.d/sunrpc.conf:
options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128
6) Configure NFS server exports
Below is an example export. The 10.0.0.0/24 network is just an example; for security purposes you could restrict the IP range to just the inbound clients.
/mnt/jenkins_home 10.0.0.0/24(rw,async,no_root_squash)
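After editing /etc/exports, reload the export table and verify the result (this assumes the NFS service from step 7 is already running):
exportfs -ra
exportfs -v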
7) Start NFS
Ensure that the rpcbind and nfs daemons are enabled at boot time and are switched on:
RHEL6:
chkconfig rpcbind on
chkconfig nfs on
service rpcbind start
service nfs start
RHEL7:
systemctl enable rpcbind
systemctl enable nfs
systemctl start rpcbind
systemctl start nfs
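You can confirm that the share is being exported:
showmount -e localhost
From a client machine, replace localhost with the server's address (e.g. the example 10.0.0.200 used below).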
8) Configure NFS client mount point
Below are some mount options to start with. Note that most guides would tell you to place these in /etc/fstab, but we’d prefer that you use AutoFS instead.
AutoFS Method:
AutoFS is recommended because the AutoFS daemon will attempt to recover an NFS mount point that would otherwise go down for good if fstab were used.
Minimal installations of RHEL will not include AutoFS, so install it with the following command on your client server:
yum -y install autofs
Once AutoFS is installed, you’ll need to modify a couple of files in order to start using it.
Define a mount point in /etc/auto.master.d/nfs.autofs:
/nfs /etc/auto.nfs --timeout 60 --ghost
Define a mapping in /etc/auto.nfs:
jenkinsHome -fstype=nfs,rw,bg,hard,intr,rsize=32768,wsize=32768,vers=4.1,proto=tcp,timeo=600,retrans=2,noatime,nodiratime,async 10.0.0.200:/mnt/jenkins_home
Create the new mountable folder:
mkdir -pv /nfs
chmod 775 /nfs
Ensure that the new files have their permissions set to 664. Executable permissions will cause the automount to not work correctly.
chmod 664 /etc/auto.nfs
chmod 664 /etc/auto.master.d/nfs.autofs
Finally, configure AutoFS to start at boot time and then start the daemon:
RHEL6:
chkconfig autofs on
service autofs start
RHEL7:
systemctl enable autofs
systemctl start autofs
When AutoFS kicks in, it will automatically mount your NFS share at /nfs/jenkinsHome. If you prefer a different location, adjust the path accordingly in the config files.
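Since the --ghost option makes the mount point directory visible before it is mounted, you can trigger and verify the automount simply by accessing the path:
ls /nfs/jenkinsHome
mount | grep jenkinsHome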
Older (risky) FSTab Method:
The FSTab method is not advised because it has no recovery mechanism. For instance, if an NFS mount were to suddenly go down, fstab has no way of recovering it and manual intervention would be required.
Edit the /etc/fstab file and add the following line to the bottom of the file:
10.0.0.200:/mnt/jenkins_home /mnt/nfs_jenkins_home nfs _netdev,rw,bg,hard,intr,rsize=32768,wsize=32768,vers=4.1,proto=tcp,timeo=600,retrans=2,noatime,nodiratime,async 0 0
Note that the _netdev param is essential. It prevents the OS from trying to mount the volume before the network interfaces have a chance to negotiate a connection.
After editing the fstab, always double-check your entry with the mount command before rebooting the OS; a bad entry can force RHEL into recovery mode at boot.
mount -a -v
9) Configure Jenkins to use the NFS mount point
You can configure Jenkins to use a different home directory. To do this, edit the /etc/sysconfig/jenkins file and change the JENKINS_HOME variable:
JENKINS_HOME="/nfs/jenkinsHome"
Save the file and restart Jenkins.
service jenkins stop
service jenkins start
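Before relying on the new home directory, it’s worth confirming that the Jenkins service user can write to the mount. A quick check, assuming the service user is named jenkins (the RHEL package default):
sudo -u jenkins touch /nfs/jenkinsHome/.write-test && rm /nfs/jenkinsHome/.write-test
If this fails, check the ownership and permissions of the exported directory on the NFS server (for example, chown it to a UID/GID matching the jenkins user).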
10) General Troubleshooting:
Check whether I/O operations are causing the Jenkins master slowness
The top command on Unix reports the time the CPU spends waiting for I/O completion:
wa, IO-wait : time waiting for I/O completion
In the example below, we can see that the CPU spent only 0.3% of its time waiting for I/O completion.
top - 11:12:06 up 1:11, 1 user, load average: 0.03, 0.02, 0.05
Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 501732 total, 491288 used, 10444 free, 4364 buffers
KiB Swap: 0 total, 0 used, 0 free. 42332 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1054 jenkins 20 0 4189920 332444 6960 S 0.7 66.3 1:01.26 java
1712 vagrant 20 0 107700 1680 684 S 0.3 0.3 0:00.20 sshd
1 root 20 0 33640 2280 792 S 0.0 0.5 0:00.74 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
Determine whether the disk Jenkins is using might be causing the performance issues
At this point, we can use iostat to understand whether or not the mount we are using is causing the slowness.
$ iostat -x
Linux 3.13.0-77-generic (vagrant-ubuntu-trusty-64) 02/14/2017 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.24 0.00 0.20 0.03 0.00 98.54
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.05 1.80 9.00 1.24 292.35 23.47 61.72 0.00 0.34 0.35 0.26 0.23 0.23
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
hda 0.05 1.50 9.00 1.24 192.35 22.45 51.72 0.00 0.56 0.38 0.56 0.53 0.27
We need to look at:
* r_await: The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
* w_await: The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
* %util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
Debugging information for RPC and NFS
Lots of things can go wrong, especially behind a complex network with rules that may be outside of your control. You can begin to make sense of what might be going wrong by enabling verbose debugging and checking the kernel system logs on both the client and the server.
Set the following two parameters to help generate more verbose debugging information for RPC and NFS:
sysctl -w sunrpc.rpc_debug=2048
sysctl -w sunrpc.nfs_debug=1
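The debug output is written to the kernel log, so watch dmesg (or /var/log/messages) while reproducing the problem, and set both values back to 0 when you are done:
dmesg | tail
sysctl -w sunrpc.rpc_debug=0
sysctl -w sunrpc.nfs_debug=0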
Network Monitor
Utilizing a network monitor can also help explain odd transport issues such as packet loss or ports being blocked. RHEL based systems have a simple tool called tcpdump that you can use to watch traffic.
This tool is very verbose, so it’s a good idea to restrict the amount of data it reports back by filtering out what you don’t need. On the client and server, run the command like this to monitor all the traffic on port 2049:
tcpdump -w /tmp/dump port 2049
Once you dump the data, you can use your favorite text editor to read it, or use the tcpdump tool to further filter it. The following example reads the dump file, filters the data based on an example source IP address, and then writes the results to a new file:
tcpdump -r /tmp/dump -w /tmp/smaller tcp and src 10.0.0.200
Other tools exist, such as nfstrace, but you’ll need to compile it yourself.
Services are running
Verify that both the nfs and rpcbind services are running.
If only the nfs service is running, you may face errors like java.nio.file.FileSystemException: /some/file: Device or resource busy, as explained in this article.
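A quick way to check both services:
# RHEL6
service rpcbind status
service nfs status
# RHEL7
systemctl status rpcbind nfs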