Elasticsearch troubleshooting guide

Symptoms

  • Any error message related to Analytics
  • Analytics does not show any data

Diagnosis/Treatment

Test the connection from Operations Center

Go to Manage Jenkins -> Configure Analytics and click the Test Connection button.

rd-cjp-analytics-test-connection.png

Note that if you added the Elasticsearch hostname to No Proxy Host in Operations Center, a restart is needed for the change to take effect.

Compatibility

Check the ES logs for errors. If you see parse errors, the ES cluster could be broken or you could be using an unsupported version of ES; check the Analytics documentation for the supported versions. The best way to verify this is to ask ES for its health information.

This exception is shown by ES when you try to use an ES version later than 1.7.x:

java.lang.IllegalArgumentException: Limit of total fields [1000] in index [metrics-20170419] has been exceeded
	at org.elasticsearch.index.mapper.MapperService.checkTotalFieldsLimit(MapperService.java:593) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:418) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:334) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:266) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:311) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]

In Kibana within CJOC you could see this exception if you try to connect to an ES version later than 1.7.x:

Error: Unknown error while connecting to Elasticsearch
Error: Authorization Exception
    at respond (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:85289:15)
    at checkRespForFailure (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:85257:7)
    at http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:83895:7
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at wrappedErrback (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:20902:78)
    at http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:21035:76
    at Scope.$eval (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:22022:28)
    at Scope.$digest (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:21834:31)
    at Scope.$apply (http://0.0.0.0:9292/plugin/operations-center-analytics-viewer/index.js?_b=7562:22126:24)

Operations Center accessing the Internet through a Proxy

This is by far the most common issue. It happens when a proxy is set up in Operations Center under Manage Jenkins -> Manage Plugins [Advanced tab]. In this case, you must add the Elasticsearch hostname to the No Proxy Host section, e.g. elasticsearch.jenkins.example.com. Note that a restart is needed each time you modify the No Proxy Host section for Analytics to pick up the changes.

Instead of using No Proxy Host you can also use the Java argument -Dhttp.nonProxyHosts, e.g. -Dhttp.nonProxyHosts=elasticsearch.jenkins.example.com. As with No Proxy Host, Operations Center must be restarted after the Java argument is added for Analytics to pick up the change.
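To see in advance whether a hostname would be covered by a given -Dhttp.nonProxyHosts value, the matching can be approximated in shell. This is a minimal sketch, assuming the usual format of that property (patterns separated by '|', with '*' as a wildcard). The function name matches_non_proxy_hosts is hypothetical, and the JVM's own matching may differ in details such as case handling.

```shell
# Return 0 if HOST matches one of the '|'-separated patterns in LIST.
# Approximation of -Dhttp.nonProxyHosts matching, not the JVM's own code.
matches_non_proxy_hosts() {
    host=$1
    list=$2
    old_ifs=$IFS
    IFS='|'
    set -f                      # keep '*' from expanding against local files
    for pattern in $list; do
        case $host in
            $pattern) IFS=$old_ifs; set +f; return 0 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return 1
}

# Example:
#   matches_non_proxy_hosts elasticsearch.jenkins.example.com "localhost|*.example.com" \
#       && echo "proxy is bypassed for this host"
```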

To test the connectivity between Elasticsearch and Operations Center you can use:

  • The Test Connection button under Manage Jenkins -> Configure Analytics
  • Execute the script below under Manage Jenkins -> Script Console
import jenkins.plugins.asynchttpclient.AHC
import com.ning.http.client.AsyncHttpClient
import com.ning.http.client.ListenableFuture
import com.ning.http.client.Response


AsyncHttpClient ahc = AHC.instance()
ListenableFuture<Response> response = ahc.prepareGet("http://<ELASTICSEARCH_HOSTNAME>:9200/").execute()
println(response.get().status.statusCode + " " + response.get().status.statusText)
println("---")
println(response.get().getResponseBody())

If everything is fine you should get an HTTP 200 response like the example below:

200 OK
---
{
  "status" : 200,
  "name" : "Eros",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.6",
    "build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
    "build_timestamp" : "2016-11-18T15:21:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
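
The version reported in this response can also be checked from a shell, which ties in with the Compatibility section above. The following is a rough sketch: the function names are hypothetical, the JSON is assumed to be pretty-printed with one field per line as in the example, and the 1.7 limit is simply the one stated in this guide. The Analytics documentation remains the authoritative source for supported versions.

```shell
# Extract the "number" field from the JSON returned by GET / on Elasticsearch.
# Reads the JSON on stdin; assumes one field per line (pretty-printed).
es_version_from_json() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if the version (e.g. 1.7.6) is not later than 1.7.x.
# The 1.7 limit is taken from this guide, not from the ES project.
es_version_not_newer_than_1_7() {
    major=$(echo "$1" | cut -d . -f 1)
    minor=$(echo "$1" | cut -d . -f 2)
    [ "$major" -lt 1 ] 2>/dev/null && return 0
    [ "$major" -eq 1 ] 2>/dev/null && [ "$minor" -le 7 ] 2>/dev/null
}

# Example (needs network access to ES):
#   version=$(curl -s "http://<ELASTICSEARCH_HOSTNAME>:9200/" | es_version_from_json)
#   es_version_not_newer_than_1_7 "$version" || echo "ES $version is later than 1.7.x"
```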

Restart Operations Center after the initial configuration

After the first configuration of Analytics, a restart is needed to create the index and dashboards.

Recommended cluster size

We recommend using an ES cluster with at least three nodes, which gives you fault tolerance if up to two nodes crash. These nodes should have about 16-32GB of RAM and 50-200GB of disk, depending on the size of your environment. If you have more than 10 Masters or more than 10000 jobs you will need a larger ES environment to support your load.

ES Scale Horizontally

Elasticsearch cluster Health

Did you restart your CJOC after the first configuration of Analytics? This is needed to create the index and dashboards.
If you did, you can obtain basic information about the health of the ES cluster by executing these commands, which give you general information about the cluster status:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_HOST:ES_PORT"
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cluster/health?pretty" > health.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/nodes?v&h=h,i,n,l,u,m,hc,hp,hm,rc,rp,rm,d,fm,qcm,rcm" > nodes.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/indices?v" > indices.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_cat/shards?v" > shards.txt
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/os?pretty" > stats_os.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/os,process?pretty" > stats_os_process.json
curl -u $ES_USR:$ES_PASSWD "http://$DOMAIN/_nodes/stats/process?pretty" > stats_process.json

health.json gives you the status of the cluster, the shard status, the index status, the pending tasks, and so on. For more info take a look at Check Cluster Health.

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2551,
  "active_shards" : 7053,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 11736,
  "number_of_in_flight_fetch" : 0
}
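
To turn health.json into a quick verdict, the status field can be read and mapped to its meaning. This is a minimal sketch with hypothetical function names, assuming the pretty-printed layout shown above (one field per line). A proper JSON parser would be more robust if one is available.

```shell
# Print the value of the "status" field (green, yellow or red) of a
# pretty-printed _cluster/health response stored in the file $1.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' "$1" | head -n 1
}

# Print a one-line explanation of the cluster status.
describe_health() {
    case $(health_status "$1") in
        green)  echo "OK: all primary and replica shards are assigned" ;;
        yellow) echo "WARNING: all primary shards are assigned, some replicas are not" ;;
        red)    echo "CRITICAL: at least one primary shard is not assigned" ;;
        *)      echo "UNKNOWN: could not read the status from $1" ;;
    esac
}

# Example:
#   describe_health health.json
```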

nodes.txt gives you the ID of each node, its IP, name, free memory, free disk space, and so on. For more info see ES cat API.

h            i          n           l    u    m hc     hp hm     rc      rp rm     d       fm      qcm
6701a5b2dca8 172.17.0.3 Nico Minoru 2.93 4.4d m 10.3gb 74 13.9gb 29.4gb  57 29.9gb 52.6gb  444.5mb 0b
559932af40d5 172.17.0.3 Fearmaster  1.93 5d   * 10.5gb 75 13.9gb 29.6gb  67 29.9gb 52.6gb  304.6mb 0b
4054511a6f8f 172.17.0.3 Ricadonna   0.18 5d   m 5.8gb  41 13.8gb 113.6gb 23 120gb  262.8gb 334.2mb 0b

shards.txt lists every shard with its state, so you can see which shards are unassigned

index                   shard prirep state          docs    store ip         node        
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Ricadonna   
builds-20170311         8     r      STARTED           0     144b 172.17.0.3 Fearmaster  
builds-20170311         8     r      UNASSIGNED        0     144b 172.17.0.3 Fearmaster  
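
When shards.txt is long, a per-index count of unassigned shards is easier to read than the raw listing. The sketch below assumes the column order shown above (index, shard, prirep, state); the function name unassigned_by_index is hypothetical.

```shell
# Print "<index> <count>" for every index that has UNASSIGNED shards,
# reading the output of _cat/shards?v from the file $1.
unassigned_by_index() {
    awk '$4 == "UNASSIGNED" { count[$1]++ }
         END { for (idx in count) print idx, count[idx] }' "$1" | sort
}

# Example:
#   unassigned_by_index shards.txt
```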

stats_os.json, stats_os_process.json, stats_process.json give you general stats of the cluster and nodes, for more info see Nodes Stats

ES Basic Concepts
ES Node
Bootstrap Checks
ES Cluster APIs

Unassigned Shards

If your Cluster Health information shows unassigned shards and no node is restarting, you have to assign all shards to bring your cluster to status "green". If a node is restarting, you should wait until the node is up and running and the number of pending tasks returned by the health check stabilizes.

This script is designed to assign shards on an ES cluster with 3 nodes. You have to set the environment variables ES_USR (user to access ES), ES_PASSWD (password), and DOMAIN (URL to access ES).

# Fix ES shards: distribute the unassigned shards across the 3 nodes of the cluster
export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="ES_URL"
export NODE_NAMES=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/nodes?h=h" |awk '{printf $1" "}')
export NODES=(${NODE_NAMES//:/ })
export NUM_UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" | grep -c UNASSIGNED)
export NUM_PER_NODE=$(( $NUM_UNASSIGNED_SHARDS / 3 )) 
export UNASSIGNED_SHARDS=$(curl -u $ES_USR:$ES_PASSWD "$DOMAIN/_cat/shards?v" |grep UNASSIGNED | awk '{print $1"#"$2 }')
export N=0

for i in $UNASSIGNED_SHARDS
do
    INDICE=$(echo $i| cut -d "#" -f 1)
    SHARD=$(echo $i| cut -d "#" -f 2)
    if [ $N -le $NUM_PER_NODE ]; then
        NODE="${NODES[0]}"
    fi
    if [ $N -gt $NUM_PER_NODE ] && [ $N -le $(( 2 * $NUM_PER_NODE )) ] ; then
        NODE="${NODES[1]}"
    fi
    if [ $N -gt $(( 2 * $NUM_PER_NODE )) ]; then
        NODE="${NODES[2]}"
    fi
    echo "fixing $INDICE $SHARD"
    # Note: allow_primary forces the allocation of a primary shard, which might result in data loss
    curl -XPOST -u $ES_USR:$ES_PASSWD "$DOMAIN/_cluster/reroute" -d "{\"commands\" : [ {\"allocate\" : {\"index\" : \"$INDICE\",\"shard\" : $SHARD, \"node\" : \"$NODE\", \"allow_primary\" : true }}]}" > fix_shard_out_${N}.log
    sleep 2s
    N=$(( $N + 1 ))
done
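
Before running the script above, it can help to preview which node each shard would go to. The sketch below is a dry run that only prints a plan and sends nothing to ES. It uses a round-robin distribution, which differs from the block-wise split of the script above and works for any number of nodes. The function name plan_round_robin is hypothetical, and node names are assumed to contain no spaces.

```shell
# Read "index#shard" items on stdin (the format of UNASSIGNED_SHARDS above)
# and print "<index> <shard> <node>" lines, rotating through the node names
# given as arguments. Nothing is sent to ES.
plan_round_robin() {
    [ $# -gt 0 ] || { echo "usage: plan_round_robin NODE..." >&2; return 1; }
    awk -v nodes="$*" '
        BEGIN { n = split(nodes, node, " ") }
        {
            split($1, part, "#")
            print part[1], part[2], node[(NR - 1) % n + 1]
        }'
}

# Example:
#   echo "$UNASSIGNED_SHARDS" | tr ' ' '\n' | plan_round_robin "${NODES[@]}"
```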

Get pending tasks on the ES cluster

Sometimes when you run the health commands you see that there are too many pending tasks, or that some index is stuck in initializing status. To find out what is happening, you can retrieve the details of the pending tasks of the cluster with these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"

curl -u $ES_USR:$ES_PASSWD -XGET "$DOMAIN/_cluster/pending_tasks?pretty" > pending_tasks.json
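
With thousands of pending tasks, a summary per priority is a useful first look at pending_tasks.json. This is a rough sketch: the function name is hypothetical, it relies on the "priority" field that ES includes for each pending task, and it uses grep -o, which is available in GNU and BSD grep.

```shell
# Print "<PRIORITY> <count>" for the tasks listed in a
# _cluster/pending_tasks response stored in the file $1.
pending_by_priority() {
    grep -o '"priority"[[:space:]]*:[[:space:]]*"[A-Z_]*"' "$1" |
        sed 's/.*"\([A-Z_]*\)"$/\1/' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Example:
#   pending_by_priority pending_tasks.json
```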

Delete index

If you detect a problem with an index and it is not possible to fix it, you probably need to delete the index and try to restore it from a snapshot. To delete the index you can use these commands:

export ES_USR="YOUR_USERNAME"
export ES_PASSWD="YOUR_PASSWORD"
export DOMAIN="URL_OF_ES"
export ES_INDEX="INDEX_NAME"

curl -XDELETE -u $ES_USR:$ES_PASSWD "$DOMAIN/$ES_INDEX?pretty"
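
Because the delete index API also accepts _all, wildcards, and comma-separated lists, a typo or an empty ES_INDEX can remove far more than intended. A small guard can be run before the curl call. This is only a sketch, and the function name is_safe_index_name is hypothetical.

```shell
# Succeed only if $1 looks like the name of a single index:
# not empty, not _all, and without wildcards, commas or slashes.
is_safe_index_name() {
    case $1 in
        ''|_all|*'*'*|*,*|*/*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example:
#   if is_safe_index_name "$ES_INDEX"; then
#       curl -XDELETE -u $ES_USR:$ES_PASSWD "$DOMAIN/$ES_INDEX?pretty"
#   else
#       echo "refusing to delete '$ES_INDEX'" >&2
#   fi
```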

NOTE: Get Username, password, and ES URL on CJE

If you are on CJE you can get the username, password, and ES URL with these commands:

export ES_PASSWD=$(awk '/elasticsearch_password/ {print $3}' .dna/secrets)
export DOMAIN=$(awk '/domain_name/ {print $3}' .dna/project.config)
export ES_USR="admin"