Random Unavailability of Dedicated FusionAuth Instance from GKE (Impacts Site Availability)
-
Hello FusionAuth Support and Community,
I'm facing a critical issue with a dedicated FusionAuth instance and would greatly appreciate your expertise. Here's the situation:
Problem Description
Sporadic Unavailability & Downtime: Our dedicated FusionAuth instance becomes randomly unreachable, specifically from within our Google Kubernetes Engine (GKE) cluster. When this happens, the authenticated portion of our site is unavailable. The outages are infrequent: twice this week on consecutive days, and once before about a month ago.
Accessibility Contrast: Intriguingly, the instance remains accessible from our personal computers during these unavailability periods.
Timeout from Pods: When attempting a curl request from a pod within the GKE cluster, we consistently get a "connect ETIMEDOUT" error for the FusionAuth instance's API endpoint.
Resolves Itself: The issue mysteriously resolves itself within approximately 30 minutes.
Server Logs
The following server logs accompany the timeout:
```
preplan-api-7465b86756-dwgnw ClientResponse {
preplan-api-7465b86756-dwgnw   exception: FetchError: request to https://[obfuscated-instance-url]/api/identity-provider/login failed, reason: connect ETIMEDOUT [obfuscated-ip]:443 ... [Stack Trace] }
```
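When it next happens, I plan to run a quick probe like this from one of the affected pods to see whether the failure is in DNS resolution or in the TCP connect itself (a rough sketch in Node/TypeScript; the hostname below is just a placeholder for our obfuscated instance URL):

```typescript
// probe.ts -- run from an affected pod to separate DNS resolution from the TCP connect.
// The hostname below is a placeholder for our obfuscated FusionAuth instance URL.
import { lookup } from "node:dns/promises";
import { Socket } from "node:net";

const host = "fusionauth.example.com"; // placeholder
const port = 443;

async function probe(): Promise<void> {
  // Step 1: DNS resolution. If this hangs or throws, the problem is name resolution (e.g. kube-dns).
  const { address } = await lookup(host);
  console.log(`DNS resolved ${host} -> ${address}`);

  // Step 2: Raw TCP connect to the resolved IP. If this times out, packets are being
  // dropped between the pod and the instance (egress firewall, NAT, or a server-side allow list).
  await new Promise<void>((resolve, reject) => {
    const socket = new Socket();
    socket.setTimeout(10_000);
    socket.once("connect", () => {
      console.log(`TCP connect to ${address}:${port} succeeded`);
      socket.destroy();
      resolve();
    });
    socket.once("timeout", () => {
      socket.destroy();
      reject(new Error(`TCP connect to ${address}:${port} timed out`));
    });
    socket.once("error", reject);
    socket.connect(port, address);
  });
}

probe().catch((err) => console.error("probe failed:", err));
```

If DNS resolves but the raw connect times out, that would point at packets being dropped somewhere between the pod and the instance rather than at name resolution.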
Troubleshooting Steps (So Far)
Verified Instance Status: The dedicated instance shows no signs of being down when accessed outside the GKE cluster.
General Connectivity: Our pods otherwise have normal internet connectivity (they can curl google.com without issue).
Whitelisting: We have whitelisted our NGINX Load Balancer IP address in our FusionAuth instance settings.
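One more thing I want to verify is which source IP the FusionAuth side actually sees for traffic coming from our pods, since egress from the cluster may not use the NGINX load balancer IP at all. A minimal check I can run from inside a pod (a sketch; api.ipify.org is just an example echo service, and it assumes Node 18+ with the global fetch):

```typescript
// egress-ip.ts -- print the public IP that our pod traffic appears to come from.
// Compare it against the IPs allowed in the FusionAuth instance settings.
async function main(): Promise<void> {
  const res = await fetch("https://api.ipify.org"); // example echo service
  const egressIp = await res.text();
  console.log(`Egress IP seen by external services: ${egressIp}`);
}

main().catch((err) => console.error("egress check failed:", err));
```

If that IP differs from the whitelisted load balancer IP, or changes when nodes are added or replaced, it would line up with the selective, intermittent blocking we are seeing.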
Environment Details
FusionAuth Version: 1.47.1
GKE Setup: GKE running a node pool of 4 nodes, with our API deployed as 10 replicas. There are no other issues with the cluster, and the site is otherwise available.
Request for Guidance
I would sincerely appreciate the community's help in figuring out:
Potential Root Causes: What could explain this temporary, selective unavailability of FusionAuth only from within our GKE cluster?
Network Configuration Issues: Are there specific firewall rules, routing, or DNS settings within GKE to examine?
Troubleshooting Techniques: Any recommended strategies to further diagnose this connectivity problem?
Thank you in advance for your insights and assistance!
-
@jacob-0 Sorry to hear you are having issues. Thank you for the detailed post explaining it. Unfortunately, random unavailability can be very difficult to troubleshoot.
Based on your explanation, it seems as though the instance is available from outside the GKE cluster. Could this be an issue where one of the pods goes down and is restarted, and the internal networking doesn't recognize the change? I don't quite see how it would still work from the outside in that case, but is there any evidence of pods restarting around the downtime?
-
@mark-robustelli Thank you for the quick response. I appreciate you taking the time to consider the issue.
You raise a valid point about the possibility of pod issues. To clarify, here's additional context and observations:
Relevant Pods: The specific pods attempting to communicate with FusionAuth remain consistently up and healthy throughout the periods where FusionAuth becomes unreachable.
External Connectivity: Successful communication with external services like Google and Gravatar demonstrates that broader network connectivity from our pods is unaffected.
Dedicated Service: FusionAuth is a separate, dedicated service; the issue is that our GKE cluster sporadically loses the ability to reach it.
Given this additional information, I'm now leaning towards these potential areas for investigation:
GKE Egress Rules: I have carefully examined firewall rules and configurations within GKE that might selectively block traffic to FusionAuth. A misconfiguration is possible, but it seems an unlikely explanation for an intermittent failure, and the outages don't follow a deploy or anything else related to GKE availability.
FusionAuth Ingress Rules: I have double-checked the FusionAuth server's settings to ensure there aren't any firewall or IP-based restrictions accidentally blocking connections originating from our GKE cluster. During one outage I did notice this and added an allowed IP at the time; it didn't immediately solve the problem, but the issue did resolve within the next few minutes.
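To have better data for the next incident, I'm also considering running a small in-cluster canary along these lines, logging timestamped results so failures can be correlated with node or NAT events (a sketch; the status URL and interval are assumptions, and it assumes Node 18+ for fetch and AbortSignal.timeout):

```typescript
// canary.ts -- periodically probe the FusionAuth instance from inside the cluster
// and log timestamped results, so outages can be correlated with cluster events.
const target = "https://fusionauth.example.com/api/status"; // placeholder URL
const intervalMs = 60_000;

async function check(): Promise<void> {
  const startedAt = new Date().toISOString();
  try {
    const res = await fetch(target, { signal: AbortSignal.timeout(10_000) });
    console.log(`${startedAt} OK status=${res.status}`);
  } catch (err) {
    console.error(`${startedAt} FAIL ${(err as Error).message}`);
  }
}

setInterval(check, intervalMs);
void check();
```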
Next Steps:
Would you have any additional guidance or specific areas I should focus on? Any insights on potential pitfalls in GKE's network setup or FusionAuth's configuration that might cause this behavior would be greatly appreciated.
The intermittent nature of the problem is what makes this so difficult to solve.
Thanks again!