In those rare cases where you don’t have the perfect environment, with a proper test installation and all the necessary resources behind it, you might have to troubleshoot in a live (or dead) production environment.
I have a few tricks that I use and would like to share. None of them are hidden or highly advanced, but they are very effective.
Remember, packets don’t lie. They are the source of truth on the network; logs, counters, and utilization bars can be buggy. It is always a good idea to take a trace.
There are two ways of taking traces on NetScaler: the CLI or the GUI.
nstcpdump.sh is useful for getting a live trace from NetScaler. I usually use it to figure out the source IP of the client that is connecting. When troubleshooting in production I often see a lot of NAT going on, so being able to pinpoint the IP that you’re interested in is crucial.
If the data is unencrypted, nstcpdump.sh could be enough, but if it is encrypted you can only see the very initial communication.
See https://support.citrix.com/article/CTX118185 for more information on how to use it.
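As a rough sketch of how I would hunt for the client IP: nstcpdump.sh is run from the NetScaler shell and takes tcpdump-style filter expressions. The addresses and port below are placeholders I made up for illustration, so swap in your own VIP and client IP.

```shell
# Run from the NetScaler shell (type "shell" at the CLI prompt first).

# Watch everything arriving at the VIP and read off the source addresses.
# 203.0.113.10 is a placeholder VIP, 443 a placeholder port.
nstcpdump.sh dst host 203.0.113.10 and port 443

# Once you have a suspect, narrow the live trace down to that client.
# 192.0.2.50 is a placeholder client IP.
nstcpdump.sh host 192.0.2.50 and port 443
```

Stop the live trace with Ctrl+C. On a busy box, always use a filter, otherwise the output scrolls past faster than you can read it.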
The GUI has more options, some more useful than others. If you just select Start without changing any of them, you’ll end up with a very limited capture that’s only useful if you’re debugging at the TCP level.
Packet size: Change the packet size from 164 to 0 to capture the entire packet.
Filter: I usually put in the source IP, “CONNECTION.IP.EQ(SRCIP)”.
Options: I enable “Trace filtered connection peer traffic”, which then catches the source IP’s backend connection, so I don’t need to know which backend server the source is connecting to.
I also enable “Capture SSL Master Keys”, which generates a secondary file that Wireshark needs to decrypt the packets. I love this option, and I would like to thank the SE who requested it for the product (he is now writing this article instead of being an SE). Work is just much simpler now.
Initially there was the “SSL PLAIN” option, but it did not capture the handshake of the SSL session, so you could only see that it failed, not why (very sad face). Capturing the SSL master keys can fail too: you need to catch the initial handshake with the client, and I believe this is due to SSL session reuse. If you don’t want to disable session reuse, disable the server, service, and LB/VS, start the trace, then enable the LB/VS, service, and server again, so that you catch that initial handshake.
See https://support.citrix.com/article/CTX128655 for more information
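For those who prefer the CLI, the GUI settings above map roughly onto a single start nstrace command. This is a sketch, assuming a firmware build that supports the SSL key capture option; 192.0.2.50 is a placeholder for the client IP, and the lines starting with # are my annotations, not something to type in.

```shell
# Run at the NetScaler CLI prompt (not the shell).
# -size 0              capture the entire packet
# -filter              only trace connections involving the client IP (placeholder)
# -link ENABLED        also trace the peer (backend) side of the filtered connections
# -capsslkeys ENABLED  write the SSL master keys to a separate .sslkeys file
start nstrace -size 0 -filter "CONNECTION.IP.EQ(192.0.2.50)" -link ENABLED -capsslkeys ENABLED

# Reproduce the problem from the client, then stop the trace.
stop nstrace
```

The trace files should end up in a timestamped folder under /var/nstrace, from where you can copy them off the box.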
Wireshark is your friend, and someone has made a portable edition, which is faster to get running. It will serve the purpose of analyzing your trace.
Load the .cap file, then load the .sslkeys file under Edit > Preferences > Protocols > SSL (listed as TLS in Wireshark 3.0 and later): in “(Pre)-Master-Secret log filename”, select the .sslkeys file.
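If you would rather stay on the command line, tshark (the terminal version that ships with Wireshark) can take the same key file as a preference override. A minimal sketch, assuming Wireshark 3.0 or later and made-up file names:

```shell
# nstrace1.cap and nstrace1.sslkeys are placeholder file names.
# -r  read packets from the trace file
# -o  override a preference; on versions before 3.0 the name is ssl.keylog_file
# -Y  display filter, here showing only the decrypted HTTP traffic
tshark -r nstrace1.cap -o tls.keylog_file:nstrace1.sslkeys -Y http
```

If nothing shows up as HTTP, the handshake was most likely not captured and the keys cannot be matched to the session.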
Don’t know Wireshark display filters? Don’t like them? Don’t worry, there is a very nice search option.
Press the search icon around number 1 on my screenshot.
In the selection box (number 2), choose “Packet details”, and select “String” in the last selection box (number 3).
Now you’re searching inside the packets with a string, instead of having to know how to write Wireshark filters, which can be intense.
When troubleshooting authentication such as Kerberos, NTLM, OAuth, or SAML, reproduce the error in a lab and drop the IP filter. There are “other” connections going on that you want to look at as well, and creating a proper filter in production with lots of data is hard. Without a filter in production you end up with a huge file, which sucks.
Remember to enable the “Debug” option in the local syslog settings: https://support.citrix.com/article/CTX222945
This will generate a lot of good output in ns.log for you to analyze.
Trying out configuration
Based on the new knowledge from the tracing, it’s time to alter the configuration: in production, without tickets, without approval 😉 To avoid changing it for everyone, and only change it for the client that you are testing with, you can create a new LB/CS with the same IP and port (yes!!! It rocks) with a listen policy (https://support.citrix.com/article/CTX211473). That way you avoid getting a new IP, doing firewall changes, routing changes, etc. With this new entry point in NetScaler, which only affects the clients specified in the listen policy, you can play around with the configuration: creating a new service/server, applying other policies, and so on.
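A minimal sketch of what such a test entry point could look like on the CLI. All names and addresses are placeholders I picked for illustration (203.0.113.10 is the existing VIP, 192.0.2.50 the test client, 198.51.100.20 the test backend), and I use plain HTTP on port 80 to keep it short; an SSL vserver would also need its certificate bound. Lines starting with # are annotations.

```shell
# Run at the NetScaler CLI prompt.
# New vserver on the SAME IP and port as production, but only listening
# for the test client. A lower -listenPriority number wins over a higher one.
add lb vserver lb_vs_debug HTTP 203.0.113.10 80 -listenPolicy "CLIENT.IP.SRC.EQ(192.0.2.50)" -listenPriority 10

# A separate service to experiment with, bound only to the debug vserver.
add service svc_debug 198.51.100.20 HTTP 80
bind lb vserver lb_vs_debug svc_debug
```

Everyone else keeps hitting the original vserver, so when you are done you can simply remove the debug vserver and service again.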
These are the primary tools that I use to figure out what’s going on, and oh my lord have I heard the sentence “NetScaler is the problem” many times, when in reality the application/server owners have no clue what their application/server is doing. But please, hold no grudge against them. Help them see what’s causing the issue, and remember: packets don’t lie 🙂
Please share your favorite tool for debugging when faced with a so-called “NetScaler problem”.