ElasticSearch troubleshooting guide

Symptoms

  • Analytics dashboards show an error
  • No data appears in Analytics

Diagnosis/Treatment

Compatibility

Check the Elasticsearch (ES) logs and see what kind of errors appear in them. If you see parse errors, the ES cluster could be broken, or you could be using an unsupported version of ES. Have a look at the Analytics documentation to check which versions are supported. The best way to verify is to ask ES for its health information.
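A quick way to verify the version is to query the ES root endpoint; a minimal sketch, assuming the same ES_USR/ES_PASSWD/DOMAIN variables used by the health-check commands later in this guide:

```shell
# Ask ES which version it is running; the "number" field holds the version
curl -s -u "$ES_USR:$ES_PASSWD" "http://$DOMAIN/?pretty" > root.json
# e.g.   "number" : "1.7.5"
grep '"number"' root.json
```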

ES logs this exception when you try to use an ES version newer than 1.7.x:

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [metrics-20170419] has been exceeded
	at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:593) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:418) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:334) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:266) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:311) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]

In Kibana inside CJOC you can see this exception if you try to connect to an ES version newer than 1.7.x:

Error: Unknown error while connecting to Elasticsearch
Error: Authorization Exception
    at respond (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:85289:15)
    at checkRespForFailure (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:85257:7)
    at http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:83895:7
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:21035:76
    at Scope.$eval (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:22022:28)
    at Scope.$digest (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:21834:31)
    at Scope.$apply (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:22126:24)

Recommended Cluster Size

We recommend using an ES cluster with at least three nodes; this gives you fault tolerance for up to two node crashes. Each node should have about 16-32GB of RAM and 50-200GB of disk, depending on your environment size. If you have more than 10 masters or more than 10,000 jobs, you will need a larger ES environment to support your load.

ES Scale Horizontally

Cluster Health

After the first configuration of Analytics, did you restart your CJOC? A restart is needed to create the indices and dashboards.
If you did, you can execute the following commands to obtain baseline health information about the ES cluster; they will give you general information about the cluster status.

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_HOST:ES_PORT"
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cluster/health?pretty" > health.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/nodes?v&h=h,i,n,l,u,m,hc,hp,hm,rc,rp,rm,d,fm,qcm,rcm" > nodes.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/indices?v" > indices.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/shards?v" > shards.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/os?pretty" > stats_os.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/os,process?pretty" > stats_os_process.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/process?pretty" > stats_process.json

health.json gives you the status of the cluster, the shards, the indices, pending tasks, and more; for more info take a look at Check Cluster Health

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2551,
  "active_shards" : 7053,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 11736,
  "number_of_in_flight_fetch" : 0
}
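If you only need the headline numbers, the cluster color and the unassigned-shard count can be extracted from the saved file with sed; a minimal sketch, assuming health.json was produced by the curl commands above:

```shell
# pull the overall color and the unassigned-shard count out of health.json
sed -n 's/.*"status" *: *"\([a-z]*\)".*/status: \1/p' health.json
sed -n 's/.*"unassigned_shards" *: *\([0-9]*\).*/unassigned shards: \1/p' health.json
```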

nodes.txt gives you the ID of each node, its IP, name, free memory, free disk space, and more; for more info see the ES cat API

h            i          n              l    u m     hc hp     hm      rc rp     rm       d      fm qcm 
6701a5b2dca8 172.17.0.3 Nico Minoru 2.93 4.4d m 10.3gb 74 13.9gb  29.4gb 57 29.9gb  52.6gb 444.5mb  0b 
559932af40d5 172.17.0.3 Fearmaster  1.93   5d * 10.5gb 75 13.9gb  29.6gb 67 29.9gb  52.6gb 304.6mb  0b 
4054511a6f8f 172.17.0.3 Ricadonna   0.18   5d m  5.8gb 41 13.8gb 113.6gb 23  120gb 262.8gb 334.2mb  0b 

shards.txt lists every shard and its state, which lets you spot the ones that are UNASSIGNED

index                   shard prirep state          docs    store ip         node        
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Ricadonna   
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Fearmaster  
builds-20170311         8     r      UNASSIGNED        0     144b 172.17.0.3 Fearmaster  

stats_os.json, stats_os_process.json, and stats_process.json give you general stats about the cluster and its nodes; for more info see Nodes Stats

ES Basic Concepts
ES Node
Bootstrap Checks
ES Cluster APIs

Unassigned Shards

If the Cluster Health information shows unassigned shards and you do not have a node that is restarting, you have to assign all shards in order to bring your cluster back to "green" status. If a node is restarting, you should instead wait until the node is up and running and the pending-task count returned by the health check stabilizes.
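While waiting for a restarted node, the pending-task count can be sampled periodically until it settles; a minimal sketch, assuming the same ES_USR/ES_PASSWD/DOMAIN variables as above:

```shell
# sample the pending-task count three times, a few seconds apart
for i in 1 2 3; do
    curl -s -u "$ES_USR:$ES_PASSWD" "http://$DOMAIN/_cluster/health" \
        | sed -n 's/.*"number_of_pending_tasks" *: *\([0-9]*\).*/\1/p'
    sleep 5
done
```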

This script is designed to assign shards on an ES cluster with three nodes. You have to set the environment variables ES_USR (user to access ES), ES_PASSWD (password), and DOMAIN (URL to access ES).

#fix ES shards
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export NODE_NAMES=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/nodes?h=h" |awk '{printf $1" "}')
export NODES=(${NODE_NAMES//:/ })
export NUM_UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" | grep -c UNASSIGNED)
export NUM_PER_NODE=$(( $NUM_UNASSIGNED_SHARDS / 3 )) 
export UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" |grep UNASSIGNED | awk '{print $1"#"$2 }')
export N=0

for i in $UNASSIGNED_SHARDS
do
    INDICE=$(echo $i| cut -d "#" -f 1)
    SHARD=$(echo $i| cut -d "#" -f 2)
    if [ $N -le $NUM_PER_NODE ]; then
        NODE="${NODES[0]}"
    fi
    if [ $N -gt $NUM_PER_NODE ] && [ $N -le $(( 2 * $NUM_PER_NODE )) ] ; then
        NODE="${NODES[1]}"
    fi
    if [ $N -gt $(( 2 * $NUM_PER_NODE )) ]; then
        NODE="${NODES[2]}"
    fi
    echo "fixing $INDICE $SHARD"    
    curl -XPOST -u $ES_USR:$ES_PASSWD "$DOMAIN/_cluster/reroute" -d "{\"commands\" : [ {\"allocate\" : {\"index\" : \"$INDICE\",\"shard\" : $SHARD, \"node\" : \"$NODE\", \"allow_primary\" : true }}]}" > fix_shard_out_${N}.log 
    sleep 2s
    N=$(( $N + 1 ))
done
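Once the loop finishes, it is worth re-checking the cluster health to confirm the shards were assigned; a small sketch reusing the same variables:

```shell
# confirm the reroute worked: unassigned_shards should drop to 0 and status back to green
curl -u "$ES_USR:$ES_PASSWD" "$DOMAIN/_cluster/health?pretty" \
    | grep -E '"status"|"unassigned_shards"'
```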

Make a Snapshot of indices

Sometimes, before performing an operation on the cluster, we need to take a snapshot of its data. To do that we can create a new snapshot repository and then create a new snapshot in it.
This script lists the snapshot repositories available:

#get all backup repositories
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/_all?pretty"

This script creates a new snapshot repository named backup; the repository will store its data in the /usr/share/elasticsearch/snapshot/elasticsearch/backup folder.

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"

curl -XPUT -u $ES_USR:$ES_PASSWD  "$DOMAIN/_snapshot/$REPO?pretty" -d '{ "type": "fs", "settings": { "compress": "true", "location": "/usr/share/elasticsearch/snapshot/elasticsearch/backup"}}'

This script creates a new snapshot named snapshot_1 in the backup repository:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"
export SNAPSHOT="snapshot_1"

curl -u $ES_USR:$ES_PASSWD -XPUT "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"

In some cases this process can take more than one hour, so you have to check the status of your snapshot with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"
export SNAPSHOT="snapshot_1"

#status of the current snapshot
#when the snapshot has finished it will return this
#{
#"snapshots": []
#}
#
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"
#status of snapshot_1: check the "state" field until it is SUCCESS or FAILED
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"

Error creating a snapshot: "a snapshot is already running"

If you execute your snapshot command and see this error, it is because another snapshot is still running. You can either wait until that snapshot finishes, or cancel it.

{
    "error":"RemoteTransportException[[Smuggler][inet[/172.17.0.2:9300]][cluster/snapshot/create]]; nested: ConcurrentSnapshotExecutionException[[backup:snapshot_1] a snapshot is already running]; ",
    "status":503
}

Check which repository is currently making a snapshot with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="backup"

curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"

Once you have the name of the snapshot, you can cancel it with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="cloudbees-analytics"
export SNAPSHOT_NAME="SET_THE_NAME_OF_SNAPSHOT_HERE"

#cancel the snapshot
curl -XDELETE -u $ES_USR:$ES_PASSWD -m 30 "$DOMAIN/_snapshot/$REPO/$SNAPSHOT_NAME?pretty"
#check the status again
curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_snapshot/$REPO/_status?pretty"

Restore a Snapshot of indices

When a disaster happens we have to restore data from the snapshots we have. In order to do that, first get the list of snapshots available to restore with these commands:

#obtain ES list of snapshots on cloudbees-analytics repository
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export REPO="cloudbees-analytics"

curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_snapshot/$REPO/_all?pretty" > $REPO-snapshots.json
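If jq is installed (it is also used by the delete-snapshots script below), the snapshot names in the saved file can be listed directly; a small sketch:

```shell
# list the snapshot names contained in the repository listing saved above
jq -r '.snapshots[].snapshot' "$REPO-snapshots.json"
```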

Now take a look at the cloudbees-analytics-snapshots.json file and check which snapshots and indices you want to restore. Once you have done that, edit the following script, adding a line restore "SNAPSHOT_NAME" "INDEX_NAME" for each index you want to restore. The script will create a file for each snapshot with the results of the restore operation.

#set the username 
export ES_USR="YOUR_USERNAME"
#set the password
export ES_PASSWD="YOUR_PASSWORD"
export ES_URL="URL_OF_ES"

export ES_CREDS="$ES_USR:$ES_PASSWD"
#name of the snapshot repository
export ES_REPO="cloudbees-analytics"

restore() {
    local ES_SNAPSHOT=$1
    local index=$2
    local FILE=restore-${ES_REPO}-${ES_SNAPSHOT}.json
    echo "Restoring $ES_REPO - $ES_SNAPSHOT - $index"
    echo "Close $index" >> $FILE
    curl -XPOST -u $ES_CREDS "$ES_URL/$index/_close?pretty" >> $FILE
    echo "Restore $ES_REPO - $ES_SNAPSHOT - $index" >> $FILE
    curl -XPOST -u $ES_CREDS "$ES_URL/_snapshot/$ES_REPO/$ES_SNAPSHOT/_restore?wait_for_completion=true&pretty" -d"{ \"indices\" : \"$index\", \"ignore_unavailable\": \"true\", \"include_global_state\": \"true\" }" >> $FILE
    EXIT=1
    while [ $EXIT -eq 1 ]; do
        echo " wait_for_completion $ES_REPO/$ES_SNAPSHOT"
        sleep 5s
        EXIT=$(curl -XGET -u $ES_CREDS "$ES_URL/_snapshot/$ES_REPO/$ES_SNAPSHOT/_status" | grep -c "IN_PROGRESS")
    done
}

#set the SNAPSHOT_NAME and INDEX_NAME you want to restore
restore "SNAPSHOT_NAME" "INDEX_NAME"

Delete snapshots

If we want to keep only a fixed number of snapshots in a repository, we can use this script. It requires jq, a JSON parser, to be installed. The script lists all snapshots, keeps only the last 30, and deletes the rest.

#set the username 
export ES_USR="YOUR_USERNAME"
#set the password
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"
#number of snapshot to keep
export LIMIT=30
export REPO="cloudbees-analytics"

export ES_SNAPSHOTS=$(curl -u $ES_USR:$ES_PASSWD -s -XGET "$DOMAIN/_snapshot/$REPO/_all" | jq -r ".snapshots[:-${LIMIT}][].snapshot")

# Loop over the results and delete each snapshot
for SNAPSHOT in $ES_SNAPSHOTS
do
 echo "Deleting snapshot: $SNAPSHOT"
 curl -u $ES_USR:$ES_PASSWD -s -XDELETE "$DOMAIN/_snapshot/$REPO/$SNAPSHOT?pretty"
done

Get pending tasks on the ES cluster

Sometimes when you execute the health commands and check the pending tasks, you may see that there are too many, or that some index is stuck in an initializing state. To get the details of those tasks, retrieve the cluster's pending tasks with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"

curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/pending_tasks?pretty" > pending_tasks.json

Delete index

If you detect a problem with an index and it is not possible to fix it, you probably need to delete the index and try to restore it from a snapshot. To delete the index you can use these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"
export ES_INDEX="INDEX_NAME"

curl -XDELETE -u $ES_USR:$ES_PASSWD "$DOMAIN/$ES_INDEX?pretty"

Get Username, Password, and ES URL on CJE

If you are on CJE you can get the username, password, and ES URL with these commands:

export ES_PASSWD=$(awk '/elasticsearch_password/ {print $3}' .dna/secrets)
export DOMAIN=$(awk '/domain_name/ {print $3}' .dna/project.config)
export ES_USR="admin"
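A quick sanity check that the extraction worked; this just verifies the variables are non-empty before you use them:

```shell
# verify the variables were populated from the .dna files
[ -n "$ES_USR" ] && [ -n "$ES_PASSWD" ] && [ -n "$DOMAIN" ] \
    && echo "ES credentials loaded" \
    || echo "check .dna/secrets and .dna/project.config"
```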