Using a Git reference repository

Issue

  • How do I create a Git reference repository?
  • How do I configure Git/GitHub SCM to use a reference repository?

Environment

  • CloudBees Jenkins Enterprise - Managed Master (CJEMM)
  • CloudBees Jenkins Platform - Client Master (CJPCM)
  • CloudBees Jenkins Team (CJT)
  • Jenkins LTS
  • Git plugin

Resolution

The Jenkins Git Plugin can use a reference repository as a cache to reduce remote data transfer and local disk use.
Reference repositories are defined per project via the Additional Behavior named Advanced clone behaviours.
Reference repositories are created and maintained manually. The Git plugin does not maintain the reference repositories.

From a user perspective, it takes two steps to configure a reference repository:

  • Create the reference repository
  • Configure your job so that it points to the reference repository

A third, recurring step keeps the reference repository useful:

  • Periodically update the reference repository with the latest content from the original repository

What is a reference repository?

A reference repository is a local bare repository whose objects a clone borrows instead of copying them from the remote repository. Cloning with a reference repository is much faster because the clone records a pointer to the reference repository's object store rather than downloading those objects again. This has multiple advantages:

  • it reduces the network I/O and reduces load on the remote Git server by not transferring content which is already in the reference repository
  • it saves local disk space by creating pointers from the project repository to the reference repository instead of creating a new copy
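The effect can be tried outside Jenkins with plain git. The following is a minimal sketch, assuming a scratch directory in which a local repository named upstream stands in for the remote; those names are illustrative and not part of the plugin.

```shell
#!/bin/sh
set -eu

# Assumption: "upstream" is a local stand-in for the remote repository, so
# the sketch runs in a scratch directory without network access.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"

# The reference repository: a local bare copy of the remote.
git clone --quiet --mirror upstream reference.git

# Clone the project, borrowing objects from the reference repository.
# --no-local makes git use its regular transport, as it would over a network.
git clone --quiet --no-local --reference "$work/reference.git" upstream project

# Report the objects stored in the clone itself and the alternate it uses.
git -C project count-objects -v
```

The "alternate:" line in the report names the object store that the clone borrows from.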

Where to create a reference repository?

The reference repository must be available on the machine that performs the clone. It is most often needed on build agents, although it can also be useful on the master. Pipeline users benefit from reference repositories on the master during Branch Indexing (GitHub Branch Source plugin, Bitbucket Branch Source plugin and Gitea plugin) and when loading Pipeline Shared Libraries (though shared libraries should be kept small enough that a reference repository does not matter).

If the reference repository is not available when cloning, the process falls back to cloning from the remote repository.
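How the plugin implements this fallback is not shown here. Plain git offers a comparable behaviour through the --reference-if-able option of git clone, which the sketch below exercises; the upstream and missing.git names are assumptions made for the demonstration.

```shell
#!/bin/sh
set -eu

# Assumption: "upstream" is a local stand-in for the remote repository and
# "missing.git" is a reference path that deliberately does not exist.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"

# --reference-if-able warns and continues when the reference is unusable;
# plain --reference would abort the clone instead.
git clone --quiet --reference-if-able "$work/missing.git" upstream project

# Without a usable reference, no alternates file is written.
if [ -e project/.git/objects/info/alternates ]; then
    fallback=no
else
    fallback=yes
fi
echo "fell back to a full clone: $fallback"
```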

How to create a reference repository?

Create a bare git clone of your remote repository using the --mirror option:

git clone --mirror git@github.com:my-user/my-repository.git

This creates a bare repository that contains all refs (branches and tags) of the remote repository. Have a look at the documentation of git-clone for more details.
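To see what a mirror clone contains, the sketch below creates one in a scratch directory and inspects it. It assumes a local repository named upstream in place of git@github.com:my-user/my-repository.git and a directory named gitcache in place of a real cache location.

```shell
#!/bin/sh
set -eu

# Assumptions: "upstream" stands in for the remote repository and "gitcache"
# stands in for the directory that holds reference repositories.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"
git -C upstream tag v1.0

mkdir gitcache
git clone --quiet --mirror upstream gitcache/my-repository.git

# A mirror clone is bare: it has no working tree.
is_bare=$(git -C gitcache/my-repository.git rev-parse --is-bare-repository)
echo "bare: $is_bare"

# List every ref that was copied (branches and tags).
git -C gitcache/my-repository.git for-each-ref --format='%(refname)'
```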

Advanced configuration for Git submodules

Some users have found that they can further improve performance by caching the history of multiple remote repositories in a single reference repository. For example, if a team uses a git repository that contains multiple submodules, they can create one reference repository that combines the parent repository and all of its submodule repositories.

Such a multi-repository reference repository can be created with:

git clone --bare git@github.com:my-user/my-repository.git
cd my-repository.git
git remote add submodule1 git@github.com:my-user/submodule1-repository.git
git remote add submodule2 git@github.com:my-user/submodule2-repository.git
# Adding a remote only registers it; fetch to download the submodule history
git fetch --all

Note about --bare and --mirror

  • --mirror is a better argument choice when there is a single remote repository because it configures the refspec to copy as much from the single remote repository as it can.
  • --bare is the better argument choice where there are multiple remote repositories in the single reference repository because the repository is no longer a “mirror” of the remote repository.

Using --mirror in a multi-repository reference repository can cause some of its content to toggle. When new content is fetched from the original mirrored repository, it updates references in all remotes. When new content is then fetched from the other remote repositories, those updates alter some of the references that the first fetch had just updated. As a result, some references flip back and forth between the state of the original mirrored repository and the state of the other remotes.
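The difference between the two options shows up in the fetch refspec that each one configures. The sketch below compares them on scratch clones; the repository names are assumptions made for the demonstration.

```shell
#!/bin/sh
set -eu

# Assumption: "upstream" is a local stand-in for the remote repository.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"

git clone --quiet --mirror upstream mirror.git
git clone --quiet --bare upstream bare.git

# --mirror maps every remote ref onto the same local ref name.
mirror_refspec=$(git -C mirror.git config --get remote.origin.fetch)
echo "mirror: $mirror_refspec"

# "|| true" keeps set -e from aborting if the key is not configured.
bare_refspec=$(git -C bare.git config --get remote.origin.fetch || true)
echo "bare:   ${bare_refspec:-<none>}"
```

Because the mirror refspec covers every ref name, including those under refs/remotes/, a fetch from the mirrored remote can rewrite refs that belong to the other remotes, which is the toggling described above.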

How to configure a job to use the reference repository in Jenkins?

Configure the Git SCM and add the Additional Behavior of type Advanced clone behaviours. Then specify the location of the reference repository in the field “Path of the reference repo to use during clone”.

In Pipeline, the “Pipeline syntax” link on each project (job) page includes the checkout command and will generate the correct syntax for the checkout options you select. A sample subset of a checkout command might be:

checkout([$class: 'GitSCM',
    extensions: [[$class: 'CloneOption', reference: '/var/lib/gitcache/my-repository.git']],
    [...]
])

Note: if you are configuring a job for which the repository has already been cloned, you will need to remove the workspace to force a new clone on the next build. The new clone will point to the reference repository. To check that the clone performed by the build points to the reference repository, check that the workspace contains the file .git/objects/info/alternates and that this file contains the location of the reference repository.
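The alternates check can be scripted. The sketch below is one possible form: check_reference is a hypothetical helper name, and the demonstration runs against scratch clones rather than a real Jenkins workspace.

```shell
#!/bin/sh
set -eu

# check_reference WORKSPACE: print the alternates recorded in a workspace
# clone, or fail if none are recorded. The function name is hypothetical.
check_reference() {
    alternates_file="$1/.git/objects/info/alternates"
    if [ ! -s "$alternates_file" ]; then
        echo "no reference repository recorded in $1" >&2
        return 1
    fi
    cat "$alternates_file"
}

# Demonstration on scratch stand-ins for the remote and two workspaces.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"
git clone --quiet --mirror upstream reference.git
git clone --quiet --reference "$work/reference.git" upstream with-ref
git clone --quiet upstream without-ref

check_reference "$work/with-ref"
check_reference "$work/without-ref" || echo "without-ref is a standalone clone"
```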

Maintain your reference repository

Update the reference repository from time to time by running the following command inside it:

git fetch --all --prune
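When several reference repositories live in one directory, the update can be looped. This is a minimal sketch, assuming all reference repositories sit directly under one cache directory and end in .git; update_references is a hypothetical helper name, and the demonstration uses scratch stand-ins.

```shell
#!/bin/sh
set -eu

# update_references CACHE_DIR: fetch every bare repository found directly
# under CACHE_DIR. The function name and directory layout are assumptions.
update_references() {
    for repo in "$1"/*.git; do
        [ -d "$repo" ] || continue
        git -C "$repo" fetch --quiet --all --prune
    done
}

# Demonstration: "upstream" stands in for the remote, "gitcache" for the cache.
work=$(mktemp -d)
cd "$work"
git init --quiet upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"
mkdir gitcache
git clone --quiet --mirror upstream gitcache/my-repository.git

# New upstream work appears after the mirror was created.
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "second commit"
update_references "$work/gitcache"

upstream_head=$(git -C upstream rev-parse HEAD)
mirror_head=$(git -C gitcache/my-repository.git rev-parse HEAD)
echo "upstream: $upstream_head"
echo "mirror:   $mirror_head"
```

A script along these lines could be scheduled on each machine that holds reference repositories, for example from cron or a periodic Jenkins job.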

4 Comments

  • Rafael Rezende

    This is a great feature!

    One thing is not clear to me though: does the update happen only when the fetch command is explicitly called? Or does the plugin always force the local clone to get the latest updates from the remote on every build?

  • Allan Burdajewicz

    Rafael,

    The current versions of the Git and Git client plugin do not update the reference repository automatically. The reference repository must be updated manually. This is what the additional step "Periodically update the reference repository with the latest content from the original repository" is about.

    Regards,

  • Nathan Neulinger

    I used this functionality a few years ago, and it indeed made a huge improvement in space and performance. However, I ran into one issue, and I'd be interested in knowing whether it has been resolved.

    When I updated the reference repository, it seemed to periodically result in broken or dangling references in the clones that referenced it. It may have been in response to a gc or repack that was triggered in the reference copy.

    It's been a while since I used this, but I'm looking to use it again. Do you know whether this was likely caused by something I was doing wrong in my update process, or should I take any special action in how the reference repository is used in the jobs?

  • Gautier Seidel

    Hi,

    Cloning using a reference works fine. However, the cloned repository is not standalone, the reference is still needed.
    As a result, when the cloned repository is used within a docker container where the reference is not available, git complains like:

    error: Could not read 26fa13ae7fbe6eb887dfbdf182a6218d790203cd
    fatal: Failed to traverse parents of commit 82d549106c0b85d3e1e4444df018ba3f7cbb3d0a

    From git manual, the 'dissociate' option should do it: https://git-scm.com/docs/git-clone#git-clone---dissociate

    How to pass the 'dissociate' option to GitSCM?

    thanks,

    gautier
