A few months ago, I ran into a problem when copying large files between two tiers of a network in one of our datacenters. I was doing a hardware upgrade on some Hyper-V hosts in a DMZ and was copying the images to a backup server while I swapped out hardware components in the Hyper-V hosts.
I was using robocopy to move the files between the servers, but ran into problems. The error manifested itself as "The specified network name is no longer available." Robocopy reports it as "ERROR 64 (0x00000040)."
So, basically, the copy would get aborted and, since I had set the /R switch, robocopy would attempt to retry. Eventually, I was able to make it through the copies, but not without many hours of trials and tribulations.
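To get a feel for how often the copies were failing, it helps to count the error tokens in the robocopy log. Here is a minimal Python sketch, assuming the log contains the "ERROR 64 (0x00000040)" token quoted above; the function name and the sample log excerpt (paths, timestamps, retry text) are made up for illustration and are not from my actual logs.

```python
import re

# Matches the error token robocopy prints, e.g. "ERROR 64 (0x00000040)".
ERROR_RE = re.compile(r"ERROR (\d+) \((0x[0-9A-Fa-f]{8})\)")


def count_errors(log_text):
    """Return {error_code: occurrences} for every error token found."""
    counts = {}
    for match in ERROR_RE.finditer(log_text):
        code = int(match.group(1))
        # Sanity check: the decimal and hex forms should agree (64 == 0x40).
        if code != int(match.group(2), 16):
            continue
        counts[code] = counts.get(code, 0) + 1
    return counts


# Hypothetical log excerpt; server name, paths, and timestamps are invented.
SAMPLE_LOG = """\
2009/03/01 10:15:02 ERROR 64 (0x00000040) Copying File \\\\backup01\\images\\host1.vhd
The specified network name is no longer available.
Waiting 30 seconds... Retrying...
2009/03/01 10:41:37 ERROR 64 (0x00000040) Copying File \\\\backup01\\images\\host1.vhd
The specified network name is no longer available.
"""

print(count_errors(SAMPLE_LOG))  # {64: 2}
```

Pointing the same function at a real log file (read with `open(path).read()`) gives a quick count of how many times each error code showed up during a run.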
It's worth noting that none of the hosts on peer networks would experience this error--only hosts going between tiers, which meant they had to traverse a firewall. While the source of the problem (the firewall) may seem obvious, finding the exact problem proved to be more troublesome.
Windows Server 2008 has a number of new features, including the Next Generation TCP/IP stack with the much-lauded (and probably equally cursed) Receive Side Scaling (RSS) and window scaling, as well as SMBv2.
In a very controlled series of tests, we were able to narrow down the source of the problem. The steps we took were:
1. Disable RSS.
2. Enable RFC 1323 timestamps.
3. Set windowautotuninglevel=restricted.
4. Set windowautotuninglevel=highlyrestricted.
5. Set windowautotuninglevel=disabled.
6. Set congestionprovider=disabled.
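The steps above map onto `netsh interface tcp set global` commands. The sketch below only builds the command strings in the order we tried them; it does not run anything. The parameter names and values are assumptions based on my recollection of netsh on Windows Server 2008 (in particular, `autotuninglevel` for the window auto-tuning level and `none` as the value that turns off the congestion provider), so check `netsh interface tcp set global /?` on your own build before using them. The helper names are mine.

```python
# Each tuple is (netsh global TCP parameter, value), in the order of the steps.
# Assumption: these names/values match `netsh interface tcp set global` on
# Windows Server 2008; verify them on the target machine before running.
STEPS = [
    ("rss", "disabled"),
    ("timestamps", "enabled"),
    ("autotuninglevel", "restricted"),
    ("autotuninglevel", "highlyrestricted"),
    ("autotuninglevel", "disabled"),
    ("congestionprovider", "none"),
]


def netsh_command(parameter, value):
    """Format one netsh command line for a global TCP setting."""
    return f"netsh interface tcp set global {parameter}={value}"


def build_test_plan(steps=STEPS):
    """Return the command lines in the order the steps were tried."""
    return [netsh_command(parameter, value) for parameter, value in steps]


if __name__ == "__main__":
    for number, command in enumerate(build_test_plan(), start=1):
        print(f"{number}. {command}")
```

Changing one setting at a time, and re-running the same file copy after each change, is what made it possible to tie the failures to the auto-tuning level rather than to RSS or timestamps.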
During this whole time, I was capturing data with Wireshark on both the client and server machines. Something that I thought was interesting was that Wireshark was reporting the SMBv2 packets as "malformed." While Wireshark is definitely a great program for diagnosing network problems, it's not without fault. I thought that the malformed packets might be an indication that Wireshark didn't know enough about the SMBv2 protocol to interpret it correctly.
In the end, we determined that traffic was passing fine when we set the windowautotuninglevel to highlyrestricted. In our captures, this setting kept the TCP window from scaling beyond 64k.
[Figure: SYN from the source showing a window size of 8192]
[Figure: successful transfer; notice that the window size only scaled to 64k]
However, with window autotuning set to anything except highlyrestricted or disabled, problems occurred randomly during large file transfers.
[Figure: SYN showing a window size of 8192 and a scaling factor of 8]
[Figure: window size scaled up to 1889024]
[Figure: result--failed transfer]
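The window sizes in these captures follow from how TCP window scaling works: the 16-bit window field in the TCP header is shifted left by the scale value negotiated in the SYN, and the window field in the SYN itself is never scaled (which is why both SYNs show 8192). Here is a small sketch of the arithmetic, assuming the "scaling factor of 8" in the capture is a shift count of 8 (a multiplier of 256); that reading fits the data, since 1889024 is an exact multiple of 256. The function name is mine.

```python
MAX_WINDOW_FIELD = 0xFFFF  # the TCP header's window field is 16 bits wide
MAX_SHIFT = 14             # largest shift count the window scale option allows


def scaled_window(window_field, shift_count):
    """Effective receive window in bytes for a non-SYN segment."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift_count


# Largest window possible without scaling (the "64k" ceiling).
print(scaled_window(0xFFFF, 0))  # 65535

# The failing transfer: a window field of 7379 with a shift count of 8.
print(scaled_window(7379, 8))    # 1889024

# Largest window a shift count of 8 can advertise.
print(scaled_window(0xFFFF, 8))  # 16776960
```

So any device in the path that ignores or mishandles the scale option sees a window field of 7379 rather than the 1889024 bytes the receiver is actually advertising.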
I decided to try protocols besides SMBv2. I disabled SMBv2 on both systems via the following commands (rebooting afterwards):
Client:
sc config lanmanworkstation depend= bowser/mrxsmb10/nsi
sc config mrxsmb20 start= disabled
Server:
reg add HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters /v Smb2 /t REG_DWORD /d 0 /f
Unfortunately, with window autotuning left at its default level, I still had problems transferring data between network tiers.
Armed with this information, the network guy and I started investigating all of the settings on the firewall. We already knew that the port for SMB was open, since I could transfer data via 445. Drilling down into the CIFS configuration in the "SmartDefense" console, I noticed that the "CIFS Strict Compliance" checkbox was checked. The network guy disabled it and re-pushed the policies. Our transfers went more smoothly, but were still not 100% successful.
At that point, we started trying transfers via other protocols, such as FTP. I installed the Windows 2008 FTP server and, using the new Windows 2008 FTP client, was able to successfully send data with TCP windows scaled over 3MB. So, we determined that the firewall was capable of passing some scaled packets.
Like so many things, what makes this issue frustrating and difficult to troubleshoot is its inconsistent, intermittent behavior. Sometimes I could successfully transfer files over 100MB via SMB. Other times, transfers would fail after 20 or 30MB. This may be related to the buffers on either the firewall or the server; we weren't able to determine a set of conditions that would predictably produce failures with files under 50MB. But it would happen 100% of the time with our larger test files. Infuriating to troubleshoot.
We disabled SmartDefense altogether, reset all network settings on the servers back to default, and all of the problems vanished.
Check Point is at R62 in this particular datacenter. I don't know if it's a problem with any other versions, but disabling SmartDefense was the only way to get reliable transfers via SMB to happen between hosts separated by the firewall.