Cassandra - Cannot proceed on repair because a neighbor is dead: session failed/Endpoint Died

Overview

Many of you would have faced these error during nodetool repair -

Repair session 2a703be0-7722-11e5-ac24-c5a3c7c6e3cf for range (5995916511884239066,6033116056473427945] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/<node IP>) is dead: session failed.

Repair session 28c63560-7722-11e5-ac24-c5a3c7c6e3cf for range (-135025182178679597,-28481663575762586] failed with error java.io.IOException: Endpoint /<IP of node> died

and By looking at this error we get scared and immediately execute "nodetool status" command to verify the health of cluster. And everything seems good there still we goto that node to verify node is actually not dead :)

Well i did faced this issue and did ran here and there to verify my nodes are healthy and they were.. then why did i receive this error ONLY during repair.

After digging deeper i identified few facts:

1. We had very large partitions (30 - 50GB) (nodetool cfstats <keyspace>.<CF>)

2.In compare to that we use default Heap configuration i.e. 8GB HEAP which doesn't seems to be enough for this big partitions.

3. Noticed whenever CPU goes high it was the JVM which was eating up the CPU (used JCONSOLE,jmc.exe to verify the same)

4. Noticed we had very high read latency - (nodetool cfstats <keyspace>.<CF>)

5. I have noticed whenever repair executes (mostly when it executes at same time on multiple nodes) It initiate COMPACTION on such a large partitions which actually needed more HEAP than what we have allocated resulting FailureDetector.instance.isAlive(endpoint) return a failure.

Reference:

https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/repair/RepairSession.java

// Checking all nodes are live

for (InetAddress endpoint : endpoints)

{

if (!FailureDetector.instance.isAlive(endpoint))

{

String message = String.format("Cannot proceed on repair because a neighbor (%s) is dead: session failed", endpoint);

differencingDone.signalAll();

logger.error("[repair #{}] {}", getId(), message);

throw new IOException(message);

}

6. We noticed for read/write heavy loads cluster chances are the default tenuring threshold flag in GC setting is very low (in our case it was set to default i.e. 1 [JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"]) causing premature promotions. This could be very likely a symptom of our problem. We decided to try increasing the tenuringthreshold to a large value to prevent premature promotions so that ParNew collects most garbage in young generation (jstat).

The tenuring threshold can also be noted in gc.log file.

~ $ cat /var/log/cassandra/gc.log.9 | grep age | tail -10

- age 1: 129160 bytes, 129160 total

- age 1: 127672 bytes, 127672 total

- age 1: 128600 bytes, 128600 total

- age 1: 128784 bytes, 128784 total

- age 1: 1984 bytes, 1984 total

- age 1: 137176 bytes, 137176 total

- age 1: 128592 bytes, 128592 total

- age 1: 128504 bytes, 128504 total

- age 1: 127744 bytes, 127744 total

- age 1: 128256 bytes, 128256 total

7. We also noticed the GC was being initiated very frequently causing very frequent application pause.

~ $ cat /var/log/cassandra/gc.log.9 | grep application

Conclusion:

We understood that its is being caused by overloaded heap and made some change as mentioned below in order to resolve this issue:

Increased MAX_HEAP_SIZE from 8GB to 16GB and accordingly increased HEAP_NEWSIZE to 4GB (MAX_HEAP_SIZE/4), we also did increased the MaxTenuringThreshold from 1 to 5 (though to increase it upto 20 on performance observation)

MAX_HEAP_SIZE="16G"

HEAP_NEWSIZE="4G”

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"

The above changes will have following benefits:

This will increase the amount of work ParNew has to do, but it prevents objects from getting promoted quickly.
Increased size of ParEDEN will also reduce Minor GC frequency.
ParNew will collects most garbage in young generation resulting less Major GC.
This will also increase the Survivor size (HEAP_NEWSIZE/8) resulting more reusable objects.

Result:

We did noticed tremendous improvements:

Application pause has been decreased from 80times in a minute to 6times in a minute.
Noticed the objects are living longer will result in less major GC:

- age 1: 356112 bytes, 356112 total- age 2: 225128 bytes, 581240 total- age 3: 59760 bytes, 641000 total- age 4: 61544 bytes, 702544 total- age 5: 71984 bytes, 774528 total

CMS activity after changes dropped on floor.
Non Heap memory usage dropped drastically.
Heap utilization stabilized and started begin under 50% consistently.

Above all our repair started completing successfully without any heap/cpu spike... Huff what a relief... :)

Hope this help!!

Cheers!!!