Mesg read: Connection reset by peer

From wiki.zmanda.com

This article is a part of the Troubleshooting collection.

Problem

The backup starts normally, but after some time, the following message shows up in the amdump log on the server (long lines folded):

driver: result time 553.824 from dumper0: FAILED 01-00002 \
        [mesg read: Connection reset by peer]
dumper: kill index command

On the client side, the sendbackup log contains:

sendbackup: time 301.700: index tee cannot write [Broken pipe]
sendbackup: time 301.700: pid 15145 finish time Tue Mar 21 15:39:18 2006
sendbackup: time 301.712: 124: strange(?): \
        sendbackup: index tee cannot write [Broken pipe]

The error usually occurs after the same elapsed time on every run; in the example above, after about 300 seconds.
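To check whether the failures really cluster around one elapsed time, the timestamps can be pulled out of the client logs. The following is a minimal sketch, assuming the log lines have the `sendbackup: time <seconds>: ...` format shown above; the function name `broken_pipe_times` is made up for this example.

```shell
#!/bin/sh
# Print the elapsed time (in seconds) of every "Broken pipe" line read
# from standard input, one value per line.
# Assumption: lines look like "sendbackup: time 301.700: ... [Broken pipe]".
broken_pipe_times() {
    awk '/^sendbackup: time [0-9.]+: .*Broken pipe/ {
        sub(/:$/, "", $3)   # strip the trailing colon from the time field
        print $3
    }'
}
```

Feed it one or more sendbackup logs on standard input (for example with `cat` over the client's debug files). If the printed values are nearly identical from run to run, an idle timeout is the likely cause.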

Explanation

The cause of this is usually a firewall between the server and client (or on one of them) that times out idle TCP connections.

The "mesg" channel is used to transfer the error output of the backup program. When there are no errors, the only thing that is transferred is the summary at the end (for gnutar: "Total size: 123456789 bytes").

The Amanda server notices that the TCP connection for the mesg channel has been broken. It then cleans up the other associated streams: it kills the server part of the index channel and closes both the index channel and the data channel.

The client has nothing to send on the mesg channel, so it does not notice that the connection has been closed. As soon as it tries to write to the index channel or the data channel, however, it gets a "Broken pipe" error.

The same firewall behavior also shows up in SSH sessions between the same machines: an idle session ends abruptly at the next activity.

Solution

TCP can periodically send dummy packets (TCP keepalives) over an idle connection. To keep the firewall from timing out the connection, send these packets more often, at an interval shorter than the firewall's idle timeout. The default interval is usually 7200 seconds.

For Linux do:

echo 180 > /proc/sys/net/ipv4/tcp_keepalive_time

(The unit is expressed in seconds.)
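Writing to /proc takes effect immediately but is lost at the next reboot. The following is a minimal sketch of making the setting persistent, assuming the system loads /etc/sysctl.conf at boot; the function name `persist_keepalive` and its optional file argument are invented for this example.

```shell
#!/bin/sh
# Append the keepalive setting to a sysctl configuration file unless a
# line for that key is already present.
# Usage: persist_keepalive <seconds> [conf-file]
# Assumption: the default file /etc/sysctl.conf is read at boot.
persist_keepalive() {
    seconds=$1
    conf=${2:-/etc/sysctl.conf}
    if grep -q '^[[:space:]]*net\.ipv4\.tcp_keepalive_time' "$conf" 2>/dev/null; then
        echo "already set in $conf; edit that line by hand" >&2
        return 1
    fi
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$seconds" >> "$conf"
}
```

Run as root, `persist_keepalive 180` adds the line once; `sysctl -p` then loads the file without a reboot.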

For FreeBSD do:

sysctl net.inet.tcp.keepidle=180000

(The unit is expressed in milliseconds.)

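For Solaris the equivalent setting is:

ndd -set /dev/tcp tcp_keepalive_interval 180000

(The unit is expressed in milliseconds.)

With a value of 180 seconds (180000 milliseconds), a keepalive packet is sent whenever a connection has been idle for 180 seconds, which stays below the roughly 300-second cutoff seen in the example above.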
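Because Linux takes the value in seconds while FreeBSD and Solaris take it in milliseconds, it is easy to get the value wrong by a factor of 1000. The following is a small sketch that prints the matching command for a given idle time in seconds; the function name `keepalive_command` is hypothetical, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Print the keepalive command for one platform, converting the idle time
# from seconds to the unit that platform expects.
# Usage: keepalive_command <linux|freebsd|solaris> <idle-seconds>
keepalive_command() {
    os=$1
    seconds=$2
    case $os in
        linux)   echo "echo $seconds > /proc/sys/net/ipv4/tcp_keepalive_time" ;;
        freebsd) echo "sysctl net.inet.tcp.keepidle=$((seconds * 1000))" ;;
        solaris) echo "ndd -set /dev/tcp tcp_keepalive_interval $((seconds * 1000))" ;;
        *)       echo "unknown platform: $os" >&2; return 1 ;;
    esac
}
```

For example, `keepalive_command freebsd 120` prints the FreeBSD command with the value 120000.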

Alternatively, increase the idle-connection timeout on the firewall itself.