Mesg read: Connection reset by peer

From wiki.zmanda.com

Issue

The backup starts normally, but after some time, the following message shows up in the amdump log on the server (long lines folded):

driver: result time 553.824 from dumper0: FAILED 01-00002 \
        [mesg read: Connection reset by peer]
dumper: kill index command

On the client side, sendbackup log contains:

sendbackup: time 301.700: index tee cannot write [Broken pipe]
sendbackup: time 301.700: pid 15145 finish time Tue Mar 21 15:39:18 2006
sendbackup: time 301.712: 124: strange(?): \
        sendbackup: index tee cannot write [Broken pipe]

The error usually occurs after the same elapsed time on every run; in the example above, after about 300 seconds.
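To check whether the failures really cluster around one elapsed time, the times can be pulled out of the client's sendbackup log. The following is a minimal sketch, assuming the log lines look like the unfolded example above; the function name broken_pipe_times is hypothetical, and the location of the log files depends on the installation.

```shell
#!/bin/sh
# Sketch: print the elapsed time (in seconds) of every
# "index tee cannot write [Broken pipe]" failure found in the sendbackup
# log files given as arguments (or on standard input when none are given).
# broken_pipe_times is a hypothetical helper name, not part of Amanda.
broken_pipe_times() {
    awk '/^sendbackup: time [0-9.]+: index tee cannot write \[Broken pipe\]/ {
        t = $3
        sub(/:$/, "", t)    # strip the trailing colon from the time field
        print t
    }' "$@"
}
```

Piping the output through sort and uniq -c gives a quick count per elapsed time. If most failures land near one value, that value is a good first guess for the firewall's idle timeout.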

Diagnosis

The cause of this is usually a firewall between the server and client (or on one of them) that times out idle TCP connections.

The "mesg" channel is used to transfer the error output of the backup program. When there are no errors, the only thing that is transferred is the summary at the end (for gnutar: "Total size: 123456789 bytes").

The Amanda server notices that something breaks the TCP connection for the mesg channel. Then Amanda begins to clean up the other associated streams: it kills the server part of the index channel and closes the index channel and the data channel.

The client does not need to send anything on the mesg channel, and is unaware that that connection is closed. But as soon as it wants to write to the index channel, or data channel, it will get an error about the broken pipe.

Solution

TCP can periodically send dummy (keepalive) packets over an idle connection. Increase the frequency of these packets so that the firewall does not time out the connection.

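For Linux, run as root:

echo 90 > /proc/sys/net/ipv4/tcp_keepalive_time

This makes the kernel send the first keepalive packet after 90 seconds of idle time. (The default is usually 7200 seconds.) Choose a value below the firewall's idle timeout; the setting is lost at reboot unless it is also made persistent. FreeBSD does not provide this /proc file; the corresponding setting there is the net.inet.tcp.keepidle sysctl, which is given in milliseconds.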
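A value written to /proc does not survive a reboot. The following is a minimal sketch for Linux of how the current value could be inspected and the matching sysctl configuration line generated. The helper name keepalive_conf_line and the file name under /etc/sysctl.d/ are assumptions for illustration; some distributions use /etc/sysctl.conf instead.

```shell
#!/bin/sh
# Sketch: inspect the Linux TCP keepalive idle time and print the line
# that would make a new value persistent.
KEEPALIVE_SECS=90   # value from the example above; keep it below the firewall timeout

# Print the sysctl configuration line for a given number of seconds.
# (keepalive_conf_line is a hypothetical helper name.)
keepalive_conf_line() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
}

# Show the value currently in effect, if this kernel exposes it.
if [ -r /proc/sys/net/ipv4/tcp_keepalive_time ]; then
    echo "current: $(cat /proc/sys/net/ipv4/tcp_keepalive_time)"
fi

# As root, the value could then be applied and persisted roughly like this
# (the file name below is an assumption, not a required path):
#   sysctl -w net.ipv4.tcp_keepalive_time=90
#   keepalive_conf_line 90 > /etc/sysctl.d/90-amanda-keepalive.conf
keepalive_conf_line "$KEEPALIVE_SECS"
```

The privileged commands are left as comments so that the sketch can be run safely by an unprivileged user, in which case it only reports the current value and prints the suggested line.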

Another possibility is to increase the timeout in the firewall.


See more amdump issues.