Apr 29, 2020

vRLI Cluster unresponsive as / partition full on 1 node due to multiple .hints files

Recently we've seen a situation where the root partition was full on a vRLI appliance.

This was part of a vRLI 3 node cluster.

When this issue occurs, the Cassandra service hangs, and the problem then starts affecting the other nodes in the cluster as well.

cassandra.log shows the service becoming unresponsive because the root partition has run out of space:


INFO  [HANDSHAKE-XXXXXXX] 2020-03-04 10:47:57,384 OutboundTcpConnection.java:560 - Handshaking version with XXXXXXX
INFO  [RequestResponseStage-3] 2020-03-04 10:47:57,400 Gossiper.java:1019 - InetAddress /ZZZZZZZ is now UP
INFO  [GossipStage:1] 2020-03-04 10:47:58,379 StorageService.java:2292 - Node /ZZZZZZZ state jump to NORMAL
ERROR [HintsWriteExecutor:1] 2020-03-04 10:48:24,194 CassandraDaemon.java:228 - Exception in thread Thread[HintsWriteExecutor:1,5,main]
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
        at org.apache.cassandra.hints.HintsWriteExecutor.flushInternal(HintsWriteExecutor.java:232) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.hints.HintsWriteExecutor.flush(HintsWriteExecutor.java:203) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.hints.HintsWriteExecutor.lambda$flush$1(HintsWriteExecutor.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
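
To check whether a node is hitting the same error, the log can be scanned for ERROR entries followed by "No space left on device". The following is a minimal sketch that assumes the log line layout shown in the excerpt above; the function name find_no_space_errors is hypothetical and not part of vRLI or Cassandra.

```python
import re
from typing import Dict, Iterable, List

# Header of an ERROR entry, assuming the layout seen in the excerpt above:
# ERROR [<thread>] <yyyy-mm-dd hh:mm:ss,SSS> ...
ERROR_HEADER = re.compile(
    r"^ERROR \[(?P<thread>[^\]]+)\] "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)
NO_SPACE = "No space left on device"


def find_no_space_errors(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return thread and timestamp of each ERROR entry that reports a full disk."""
    hits = []
    current = None  # the ERROR entry whose stack trace is being read
    for line in lines:
        match = ERROR_HEADER.match(line)
        if match:
            current = {"thread": match.group("thread"),
                       "timestamp": match.group("ts")}
        elif line.startswith(("INFO", "WARN", "DEBUG", "TRACE")):
            current = None  # a new non-error entry ends the previous trace
        if NO_SPACE in line and current is not None:
            hits.append(current)
            current = None  # report each ERROR entry at most once
    return hits
```

It can be fed an open file object, for example `find_no_space_errors(open(path_to_cassandra_log))`, where path_to_cassandra_log is wherever cassandra.log lives on the appliance.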

The root partition was filled by a .hprof file along with multiple .hints and .crc32 files created in the /usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints directory.
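
To see how much space each kind of file takes, the directory contents can be totalled by file extension. This is only a sketch; summarize_by_suffix is a hypothetical helper name, and the directory to inspect is passed in by the caller.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Tuple


def summarize_by_suffix(directory: str) -> Dict[str, Tuple[int, int]]:
    """Return {suffix: (file_count, total_bytes)} for files directly in directory."""
    totals = defaultdict(lambda: [0, 0])
    for entry in Path(directory).iterdir():
        if entry.is_file():
            bucket = totals[entry.suffix or "(none)"]
            bucket[0] += 1                      # number of files with this suffix
            bucket[1] += entry.stat().st_size   # bytes used by this suffix
    return {suffix: (count, size) for suffix, (count, size) in totals.items()}
```

Because the hints path above contains a wildcard, it can be expanded with `glob.glob(...)` first and each match passed to the function. A large total for `.hints` and `.crc32` points to the situation described here.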

Background on hints

Hints are one of three ways to support consistency in the system. When a replica node is unavailable, the coordinator stores the pending mutations in temporary hint files so that they can be replayed once the replica is available again.

For details, see https://cassandra.apache.org/doc/latest/operating/hints.html

Ideally, in all vRLI deployments, hints are configured to be deleted after the default 3 hours. In some environments, however, this does not work and the hint files appear to stay there forever.

Repair, which runs automatically, is an additional way to support consistency in the system.

Manual deletion of the hint files is the solution in this situation.
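
The manual cleanup could be scripted roughly as follows. This is a sketch under assumptions, not a documented vRLI procedure: remove_hint_files is a hypothetical helper, it only touches files ending in .hints or .crc32 in the directory it is given, and it assumes the service on that node has been stopped first. Deleting hints discards the writes they hold, so those writes then depend on repair to reach the replica.

```python
from pathlib import Path
from typing import List

# Extensions of the files seen piling up in the hints directory.
HINT_SUFFIXES = {".hints", ".crc32"}


def remove_hint_files(hints_dir: str, dry_run: bool = True) -> List[Path]:
    """Delete hint files in hints_dir; with dry_run=True only list them."""
    selected = []
    for entry in sorted(Path(hints_dir).iterdir()):
        if entry.is_file() and entry.suffix in HINT_SUFFIXES:
            if not dry_run:
                entry.unlink()  # permanently removes the file
            selected.append(entry)
    return selected
```

Running it once with the default `dry_run=True` and reviewing the returned list before calling it again with `dry_run=False` keeps the deletion deliberate.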

This is a bug and will be addressed in upcoming releases of vRLI.

arunnukula
