GitSwarm-EE 2017.1-1 Documentation


Elasticsearch integration

Note: This feature was introduced in GitSwarm 2016.1.


Elasticsearch is a flexible, scalable and powerful search service.

If you want to keep GitSwarm's search fast when dealing with huge amounts of data, you should consider enabling Elasticsearch.

GitSwarm leverages the search capabilities of Elasticsearch and enables it when searching in:

Once data is added to the database or a repository, the search indexes are updated automatically. Elasticsearch can be installed on the same machine as GitSwarm or on a separate server.

Requirements

These are the minimum requirements needed for Elasticsearch to work:

Install Elasticsearch

Elasticsearch is not included in the package installations. You will have to install it yourself whether you are using the package installation or installed GitSwarm from source. Providing detailed information on installing Elasticsearch is out of the scope of this document.

You can follow the steps described on the official Elasticsearch website or use the packages that are available for your OS.

Enable Elasticsearch

In order to enable Elasticsearch you need to have access to the server that GitSwarm is hosted on, and an administrator account on your GitSwarm instance. Go to Admin > Settings and find the "Elasticsearch" section.

The following Elasticsearch settings are available:

| Parameter | Description |
| --- | --- |
| Elasticsearch indexing | Enables/disables Elasticsearch indexing. You may want to enable indexing but disable search in order to give the index time to be fully completed, for example. |
| Search with Elasticsearch enabled | Enables/disables using Elasticsearch in search. |
| Host | The TCP/IP host to use for connecting to Elasticsearch. Use a comma-separated list to support clustering (e.g., "host1, host2"). |
| Port | The TCP port that Elasticsearch listens to. The default value is 9200. |
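Before saving the Host and Port values, it can help to confirm that every host in the list answers on the configured port. The following is a minimal sketch, not part of GitSwarm: `HOSTS`, `PORT`, and `split_hosts` are names made up for this example, and it only prints the `curl` commands (against Elasticsearch's `_cluster/health` endpoint) rather than running them.

```shell
#!/usr/bin/env bash
# HOSTS and PORT are hypothetical stand-ins for the values entered in
# Admin > Settings; they are not read by GitSwarm.
HOSTS=${HOSTS:-"host1, host2"}
PORT=${PORT:-9200}

# Split a comma-separated host list into one host per line,
# removing the spaces around each entry.
split_hosts() {
  echo "$1" | tr ',' '\n' | sed 's/^ *//; s/ *$//'
}

# Print (do not run) one health-check command per host.
for host in $(split_hosts "$HOSTS"); do
  echo "curl --silent http://$host:$PORT/_cluster/health"
done
```

Run the printed commands by hand. A host that does not answer is worth fixing before you enable indexing.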

Add GitSwarm's data to the Elasticsearch index

Configure Elasticsearch's host and port in Admin > Settings. Then create empty indexes using one of the following commands:

```
# Package installations
sudo gitswarm-rake gitswarm:elastic:create_empty_index

# Source installations
bundle exec rake gitswarm:elastic:create_empty_index RAILS_ENV=production
```

Then enable Elasticsearch indexing and run indexing tasks. It might take a while depending on how big your Git repositories are (see Indexing large repositories).


To index all your repositories:

```
# Package installations
sudo gitswarm-rake gitswarm:elastic:index_repositories

# Source installations
bundle exec rake gitswarm:elastic:index_repositories RAILS_ENV=production
```

If you want to run several tasks in parallel (probably in separate terminal windows) you can provide the ID_FROM and ID_TO parameters:

```
ID_FROM=1001 ID_TO=2000 sudo gitswarm-rake gitswarm:elastic:index_repositories
```

Both parameters are optional. Keep in mind that this task will skip repositories (and certain commits) that have already been indexed. It stores the last commit SHA of every indexed repository in the database. As an example, if you have 3,000 repositories and you want to run three separate indexing tasks, you might run:

```
ID_TO=1000 sudo gitswarm-rake gitswarm:elastic:index_repositories
ID_FROM=1001 ID_TO=2000 sudo gitswarm-rake gitswarm:elastic:index_repositories
ID_FROM=2001 sudo gitswarm-rake gitswarm:elastic:index_repositories
```
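For other repository counts, the ranges can be computed rather than written by hand. The sketch below assumes that both bounds are inclusive and that IDs start at 1, as the example above suggests. `print_index_ranges` is a hypothetical helper, not a GitSwarm command, and it only prints the commands so that you can start each one in its own terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one indexing command per ID range.
#   $1 = highest ID to cover (assumed known), $2 = number of parallel tasks
print_index_ranges() {
  local total=$1 chunks=$2
  local size=$(( (total + chunks - 1) / chunks ))   # ceiling division
  local from=1 to
  while [ "$from" -le "$total" ]; do
    to=$(( from + size - 1 ))
    if [ "$to" -ge "$total" ]; then
      # Last range: leave ID_TO unset so that higher IDs are still covered.
      echo "ID_FROM=$from sudo gitswarm-rake gitswarm:elastic:index_repositories"
    else
      echo "ID_FROM=$from ID_TO=$to sudo gitswarm-rake gitswarm:elastic:index_repositories"
    fi
    from=$(( to + 1 ))
  done
}

print_index_ranges 3000 3
```

With `3000 3` this prints three commands covering the same ranges as the example above, except that the first one states `ID_FROM=1` explicitly.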

If you need to update any outdated indexes, you can use the UPDATE_INDEX parameter:

```
UPDATE_INDEX=true ID_TO=1000 sudo gitswarm-rake gitswarm:elastic:index_repositories
```

Keep in mind that this scans all repositories in the given range to make sure that the last commit of each one is already indexed.

To index all wikis:

```
# Package installations
sudo gitswarm-rake gitswarm:elastic:index_wikis

# Source installations
bundle exec rake gitswarm:elastic:index_wikis RAILS_ENV=production
```

The wiki indexer also supports the ID_FROM and ID_TO parameters if you want to limit indexing to a subset of projects.

To index all database entities:

```
# Package installations
sudo gitswarm-rake gitswarm:elastic:index_database

# Source installations
bundle exec rake gitswarm:elastic:index_database RAILS_ENV=production
```

Or everything at once (database records, repositories, wikis):

```
# Package installations
sudo gitswarm-rake gitswarm:elastic:index

# Source installations
bundle exec rake gitswarm:elastic:index RAILS_ENV=production
```

Disable Elasticsearch

Disabling the Elasticsearch integration is as easy as unchecking Search with Elasticsearch enabled and Elasticsearch indexing in Admin > Settings.

Special recommendations

Here are some tips for using Elasticsearch with GitSwarm more efficiently.

Indexing large repositories

Indexing large Git repositories can take a while. To speed up the process, you can temporarily disable auto-refreshing. In our experience, this reduces indexing time by about 20%.

  1. Disable refreshing:

    curl --request PUT localhost:9200/_settings --data '{
        "index" : {
            "refresh_interval" : "-1"
        } }'
  2. (optional) You may want to disable replication and enable it after indexing:

    curl --request PUT localhost:9200/_settings --data '{
        "index" : {
            "number_of_replicas" : 0
        } }'
  3. Create the indexes

  4. (optional) If you disabled replication in step 2, enable it after the indexing is done and set it to its default value, which is 1:

    curl --request PUT localhost:9200/_settings --data '{
        "index" : {
            "number_of_replicas" : 1
        } }'
  5. Enable refreshing again (after indexing):

    curl --request PUT localhost:9200/_settings --data '{
        "index" : {
            "refresh_interval" : "1s"
        } }'
  6. A force merge should be called after enabling the refreshing above:

    curl --request POST 'http://localhost:9200/_forcemerge?max_num_segments=5'
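The six steps above can be collected into one script. This is a sketch under stated assumptions rather than a supported tool: `ES_URL`, `DRY_RUN`, `settings_body`, `run`, and `put_setting` are names invented here, and by default the script only prints the `curl` commands from the steps above instead of sending them.

```shell
#!/usr/bin/env bash
# ES_URL and DRY_RUN are hypothetical variables used only by this sketch.
ES_URL=${ES_URL:-localhost:9200}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands without running them

# Build the JSON body for a single index-level setting.
# $2 must already be valid JSON, so string values need their own quotes.
settings_body() {
  printf '{ "index" : { "%s" : %s } }' "$1" "$2"
}

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  [ "$DRY_RUN" = "1" ] || "$@"
}

put_setting() {
  run curl --request PUT "$ES_URL/_settings" --data "$(settings_body "$1" "$2")"
}

put_setting refresh_interval '"-1"'   # step 1: disable refreshing
put_setting number_of_replicas 0      # step 2 (optional): disable replication
# Step 3: run the indexing Rake tasks here and wait for them to finish.
put_setting number_of_replicas 1      # step 4: only if step 2 was applied
put_setting refresh_interval '"1s"'   # step 5: enable refreshing again
run curl --request POST "http://$ES_URL/_forcemerge?max_num_segments=5"   # step 6
```

Setting `DRY_RUN=0` sends the requests. Review the printed commands against your Elasticsearch host and port first.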

To minimize downtime of the search feature we recommend the following:

  1. Configure Elasticsearch in Admin > Settings, but do not enable it yet. Only set the host and port.

  2. Create empty indexes:

    # Package installations
    sudo gitswarm-rake gitswarm:elastic:create_empty_index
    
    # Source installations
    bundle exec rake gitswarm:elastic:create_empty_index RAILS_ENV=production
  3. Index all repositories using the gitswarm:elastic:index_repositories Rake task (see above). You'll probably want to do this in parallel.

  4. Enable Elasticsearch indexing.

  5. Run the indexers for the database, wikis, and repositories (the latter with the UPDATE_INDEX=1 parameter). Running the repository indexer a second time ensures that everything is indexed, because some commits may have been pushed while the initial indexing was running. The repository indexer skips repositories and commits that are already indexed, so this run will be much shorter than the first.
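Step 5 corresponds to the following commands on a package installation. This is a sketch that reuses the Rake tasks documented earlier on this page. `second_pass_commands` is a hypothetical helper that only prints the commands, so nothing is run until you copy them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the step 5 commands for a package installation.
second_pass_commands() {
  local task="sudo gitswarm-rake gitswarm:elastic"
  echo "$task:index_database"
  echo "$task:index_wikis"
  # UPDATE_INDEX makes the repository indexer scan all repositories to
  # check that the last commit of each one is already indexed.
  echo "UPDATE_INDEX=1 $task:index_repositories"
}

second_pass_commands
```

For a source installation, replace the `sudo gitswarm-rake` prefix with `bundle exec rake` and append `RAILS_ENV=production`, as in the earlier examples.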