HTCondor pilot/job timeouts

At present we use HTCondor's default limits for job and pilot timeouts. These could be raised if we see jobs timing out frequently, or pilots failing to match jobs for certain types of workload. I'm leaving this thread open so site admins and other community members can share their views.

@esilvaju @litmaath do you think one of your CEs was affected by this (the case where pilots were not matched with jobs)? For now the re-install seems to have done the trick, but maybe @esilvaju you could investigate further so we avoid running into the issue again?

@mayank I’ve increased the timeout parameters in /etc/condor/config.d/99-condor-ce.conf:

SEC_TCP_SESSION_TIMEOUT=60 #The length of time in seconds until the timeout on individual network operations when establishing a UDP security session via TCP. The default value is 20 seconds. Scalability issues with a large pool would be the only basis for a change from the default value.

SEC_TCP_SESSION_DEADLINE=360 #An integer representing the total length of time in seconds until giving up when establishing a security session. Whereas SEC_TCP_SESSION_TIMEOUT specifies the timeout for individual blocking operations (connect, read, write), this setting specifies the total time across all operations, including non-blocking operations that have little cost other than holding open the socket. The default value is 120 seconds. The intention of this setting is to avoid waiting for hours for a response in the rare event that the other side freezes up and the socket remains in a connected state. This problem has been observed in some types of operating system crashes.

SEC_DEFAULT_AUTHENTICATION_TIMEOUT=60 #The length of time in seconds that HTCondor should attempt authenticating network connections before giving up. The default is 20 seconds. Like other security settings, the portion of the configuration variable name, DEFAULT, may be replaced by a different access level to specify the timeout to use for different types of commands, for example SEC_CLIENT_AUTHENTICATION_TIMEOUT.

The default values were three times lower (20, 120 and 20 seconds respectively).
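For anyone who wants to double-check the factor against their own config, here is a minimal Python sketch. It assumes the defaults are the ones quoted in the parameter descriptions above (20, 120 and 20 seconds), and it treats anything after a `#` on a line as annotation, which is only how the values were written out in this thread. The names `DEFAULTS`, `SNIPPET`, `parse_settings` and `ratios` are made up for illustration and are not part of HTCondor.

```python
# Defaults as quoted in the parameter descriptions in this thread (seconds).
# These are taken from the thread, not read from a live HTCondor installation.
DEFAULTS = {
    "SEC_TCP_SESSION_TIMEOUT": 20,
    "SEC_TCP_SESSION_DEADLINE": 120,
    "SEC_DEFAULT_AUTHENTICATION_TIMEOUT": 20,
}

# The three values posted above, without the descriptive text.
SNIPPET = """
SEC_TCP_SESSION_TIMEOUT=60
SEC_TCP_SESSION_DEADLINE=360
SEC_DEFAULT_AUTHENTICATION_TIMEOUT=60
"""


def parse_settings(text):
    """Parse NAME=VALUE lines into a dict of integers.

    Text after a '#' is dropped; blank lines and lines without '=' are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = int(value.strip())
    return settings


def ratios(settings, defaults=DEFAULTS):
    """Return configured/default for every setting that has a known default."""
    return {
        name: settings[name] / defaults[name]
        for name in settings
        if name in defaults
    }


if __name__ == "__main__":
    for name, factor in ratios(parse_settings(SNIPPET)).items():
        print(f"{name}: {factor:g}x the default")
```

To run it against a real file, replace `SNIPPET` with the contents of your own config file. This only compares numbers; it does not tell you what the running daemons actually picked up.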

Hi guys, keep in mind that HTCondor has no idea that the jobs are pilots. If a pilot cannot match a user payload, that could be due to, for example, a network issue between the WN and the central task queue, but from the HTCondor perspective the job simply exited after having run for some time.

Increasing some timeouts on the HTCondor side may well help job submissions, though I did not find a clear indication in the logs that submissions failed because HTCondor concluded some timeout had been reached. Let’s see…