Customers have used a shared drive (such as an NFS mount) to store all logs generated by Flow. Over time, Flow may stop working (hanging or timing out) whenever it accesses the logs on the shared drive, for example when running a step.
Flow creates a new directory in the workspace for each job that runs; the directory name in the workspace matches the job name. Inside that directory are multiple log files, one created for each step within the job. As jobs run, the number of directories on the shared drive can eventually exceed the maximum that the operating system can index efficiently. Once the system can no longer index the files in the directory, Flow may become unresponsive or wait indefinitely when accessing the log directory.
An easy way to determine whether this is the issue is to look at the drive through your operating system: run a dir (Windows) or ls (Linux/UNIX) on the mounted drive. If the operating system is slow to respond or hangs, then the trouble accessing the workspace files is most likely the cause of the hang in Flow, and the underlying problem is a networking or file system issue, not a Flow issue.
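The check above can be scripted so it cannot itself hang your session. This is a minimal sketch using the GNU coreutils `timeout` command; the `WORKSPACE` path is a hypothetical placeholder, so substitute your actual mount point:

```shell
#!/bin/sh
# Hypothetical workspace path -- replace with your actual shared-drive mount.
WORKSPACE="${WORKSPACE:-/mnt/flow/workspace}"

# List the workspace with a hard time limit. A healthy mount returns almost
# immediately; a mount that the OS can no longer index will hang until the
# timeout fires, which makes the problem easy to confirm.
if timeout 30 ls "$WORKSPACE" > /dev/null 2>&1; then
    echo "workspace listing is responsive"
else
    echo "workspace listing is slow, hung, or unreachable"
fi
```

If the second message appears, investigate the mount and the directory count before looking at Flow itself.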
The best practice is to periodically clean up the shared drive where you store your Flow log files. You can do this through Flow itself by creating a cleanup script and running it on a schedule, so that it runs daily, weekly, or monthly to clear old log directories out of your workspace. Removing job directories in the workspace that are older than two months is a good starting point. Adjust the retention period based on your engineering process and/or the number of jobs created each month. (For example, if you create 100K jobs per month, you will want to delete roughly 100K jobs per month to keep your workspace from growing unmanageably.)
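A cleanup script along these lines could serve as the scheduled step. This is a sketch, not a definitive implementation: the `WORKSPACE` path is a hypothetical placeholder, and the 60-day window corresponds to the two-month starting point suggested above.

```shell
#!/bin/sh
# Hypothetical workspace path -- substitute the shared drive where Flow
# writes its per-job log directories.
WORKSPACE="${WORKSPACE:-/mnt/flow/workspace}"

# Guard against a missing or unmounted path so find does not error out.
if [ -d "$WORKSPACE" ]; then
    # Remove job directories not modified in ~two months (60 days).
    # -mindepth 1 protects the workspace directory itself, and
    # -maxdepth 1 keeps find from descending into each job's log files.
    find "$WORKSPACE" -mindepth 1 -maxdepth 1 -type d -mtime +60 \
        -exec rm -rf {} +
fi
```

Run it first with `-exec ls -d {} +` in place of `rm -rf` to preview which job directories would be deleted, then schedule the destructive version once the list looks right.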