Pod Phase - Vacating. #6951

Closed
timothysc opened this issue Apr 17, 2015 · 23 comments
Labels
area/api (Indicates an issue on api area), priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done), sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery)

Comments

@timothysc
Member

Currently, if you have a large replication controller running and stop it, Kubernetes will report that no pods are running, when in fact pods are still running on the cluster.

This occurs because the entries are deleted from etcd immediately, and the kubelets only catch up later. Perhaps it makes more sense to update the phase to "vacating" and only delete once the final event has been sent.

@dchen1107 , @wojtek-t , @fgrzadkowski

@derekwaynecarr
Member

+1, this affects namespace deletion. I noticed that the Kubelet delay sometimes caused events to be created in the namespace after it should have been removed.


@smarterclayton
Contributor

Won't graceful termination resolve this? Pods will stay present until the grace period expires or the node terminates the process (or observes the process exit).

@timothysc
Member Author

@smarterclayton - eventually the system cleans up; it's just a matter of accuracy in reporting. Unless you're referring to another issue?

@smarterclayton
Contributor

When pods are deleted, they will not disappear immediately:

#5085
#6804

So delete will set deletionTimestamp, which is the time after which any processes may be hard killed (SIGKILL), and the time at which the pod will be automatically deleted (releasing the name for reuse). The kubelet can then SIGTERM the pod's containers. Once the processes are observed dead, the kubelet can either mark the pod status or simply delete the pod outright. So the observed delay should be the grace period minus early completion, and then the pod will disappear. If the processes have not finalized, the kubelet may ballistically decommission them.
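
To make the timeline concrete, here is a minimal sketch (not actual kubelet code) of the escalation described above: SIGTERM at deletion, wait out the grace period, then SIGKILL anything still running. The package, function, and parameter names are illustrative placeholders.

// Illustrative sketch only; this is not part of the kubelet.
package termination

import (
	"syscall"
	"time"
)

// terminateGracefully asks a process to exit with SIGTERM, waits up to the
// grace period for it to go away, and escalates to SIGKILL if it is still
// present when the grace period expires.
func terminateGracefully(pid int, grace time.Duration) error {
	// Step 1: politely ask the process to exit.
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		return err
	}

	// Step 2: poll for exit until the grace period expires.
	// Signal 0 checks for existence without delivering a signal.
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		if err := syscall.Kill(pid, 0); err != nil {
			return nil // process is gone; no hard kill needed
		}
		time.Sleep(100 * time.Millisecond)
	}

	// Step 3: grace period expired, escalate to SIGKILL.
	return syscall.Kill(pid, syscall.SIGKILL)
}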


bgrant0607 added the area/api, priority/awaiting-more-evidence, and sig/api-machinery labels on Apr 17, 2015
@derekwaynecarr
Member

I think it is covered. The issue I am encountering is that because we run with NamespaceAutoProvision in Kubernetes by default, you get into a state where finalization thinks all resources are deleted (because they were), and then moments later the Kubelet creates a bunch of events noting the true death of the pod. This in turn causes the namespace to be provisioned again (with just the death rattle of events from a past namespace's life). This is not a problem in OpenShift, where we run with the NamespaceExists plugin. If graceful termination gets us to a state where the death events are dispatched before the pod is purged completely, that would be good.


@bgrant0607
Member

Agree w/ @smarterclayton.

See also #1535 and #1899.

I am hesitant to add a new phase. What would it be used for?

Some applications receive the termination notification (SIGTERM/prestop -- #6804), start to fail readiness checks, but continue responding to requests until they receive SIGKILL.

Others immediately stop responding to new requests and try to drain any in-flight storage/db writes.

The pod may or may not be replaced by a controller. Some applications may want to be replaced as soon as it's decided that they should be killed, while others require the existing instance to be fully terminated prior to replacement.

I'm not sure much could be inferred from a Vacating phase in general. Looking at whether deletionTimestamp is set would be generically applicable across all object kinds.

One thing it could perhaps be used for is letting the Kubelet determine that it had already notified the containers in the pod. There would still be instances of duplicate notifications, though.
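
To illustrate the deletionTimestamp check mentioned above, here is a minimal sketch. The ObjectMeta struct below is a stand-in, not the real API type; the point is only that a nil/non-nil pointer check applies generically to any kind.

// Illustrative sketch; not real client or API code.
package objectstate

import "time"

// ObjectMeta is a stand-in for the common metadata every kind carries.
type ObjectMeta struct {
	Name              string
	DeletionTimestamp *time.Time // set when a graceful delete has been requested
}

// Terminating works for any kind that carries ObjectMeta; no per-kind
// phase enum is required to answer "has deletion been requested?".
func Terminating(meta ObjectMeta) bool {
	return meta.DeletionTimestamp != nil
}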

@timothysc
Member Author

@bgrant0607 In the case of graceful shutdown of applications, a large grace period may be applied in the worst case to allow cleanup for applications that have chosen to use signal escalation. A vacating|STOPPING phase (#6804) would give the operator the ability to determine where the application is in its lifecycle, as well as give developers insight into system behavior.

Right now it's all about accurately reporting the true state of the cluster to the user, because today kubectl reports an empty cluster when in fact there are delays in the system that go beyond the scope of signal escalation alone.

@smarterclayton
Contributor

The presence of "deletionTimestamp" is an observable metric that implies a phase. Are you saying we should impose both?


@bgrant0607
Member

Yes, deletionTimestamp is intended to provide the indication, in addition to the container's response reflected in the Ready condition.

Adding a Vacating phase would be a breaking API change. It would mean that every client/component that checked Running would then also need to consider Vacating to be Running:

switch pod.Status.Phase {
case api.PodPending:
    // not active yet
    ...
case api.PodRunning:
// need new case api.PodVacating:
    // pod active
    ...
case api.PodSucceeded:
    fallthrough
case api.PodFailed:
    // terminated
    ...
case api.PodUnknown:
    // WTF!
    ...
}

This is also discussed in #1899 (comment) and elsewhere. Enums aren't extensible. Every addition is a breaking API change. People will try to infer properties from the phases, using switch statements like the one above. I relented on Succeeded/Failed, but probably shouldn't have. At some point, it will just become too painful to add new phases, and then we'll be left with an unprincipled distinction between phases and non-phases.

I'm trying to keep the set of phases to the bare minimum and then explicitly specify other common properties for clients as needed. I can elaborate on this more in the api conventions doc:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/api-conventions.md#spec-and-status

I'm happy to add additional properties if that helps. The conditions array is designed to be extensible in that way. PodStatus also contains a Message field, which could be used:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/v1beta3/types.go#L802

We should also really add a Reason field.

ContainerStateTerminated already contains both Reason and Message:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/v1beta3/types.go#L674

We could add an additional field to ContainerStateRunning:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/v1beta3/types.go#L668
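
For illustration only, a rough sketch of the shape being suggested here: a machine-readable Reason next to the existing Message on PodStatus, mirroring what ContainerStateTerminated already carries. These are not the actual v1beta3 type definitions.

// Sketch of the discussed shape; not the actual v1beta3 types.
package api

type PodPhase string

type PodStatus struct {
	Phase   PodPhase
	Message string // human-readable detail (already present)
	Reason  string // proposed: brief, machine-readable, CamelCase reason
	// Conditions, ContainerStatuses, etc. elided for brevity.
}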

Comment on #6979 if you have feedback about the structure of ContainerStatuses.

cc @vishh @dchen1107 @thockin

@davidopp
Member

I don't have a philosophical objection to adding more phases, but I do think we should try to avoid redundant representations of the same information within the server. IIUC, deletionTimestamp != 0 would be redundant with podPhase == Vacating, so I don't think we need the latter. I do think the client will want to synthesize some nice human-readable set of states larger than the set we are providing on the server, and perhaps we want to put that logic in a client library, but I don't think we need to make these server-side states.

[For reference, in Borg, tasks have a state that indicates they are "vacating" and also a timestamp for when the task should be kill -9'd (this timestamp is only nonzero when the task has been requested to die and is in the grace period phase). It's confusing. Machines have a similar situation (a state that indicates they're shutting down, and a timestamp for when they will be removed that is only nonzero when they are shutting down). It's also confusing.]
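
The client-side synthesis described above might look something like the following sketch, where the display state is derived from phase plus deletionTimestamp rather than a new server-side phase. The types and the "Terminating" label are illustrative, not an existing client library API.

// Illustrative client-side sketch; not an existing library.
package display

import "time"

type Pod struct {
	Phase             string     // Pending, Running, Succeeded, Failed, Unknown
	DeletionTimestamp *time.Time // non-nil once a graceful delete has been requested
}

// DisplayState synthesizes a human-friendly state on the client: a pod that
// is still Pending or Running but has a deletion timestamp is shown as
// "Terminating", without adding a new server-side phase.
func DisplayState(p Pod) string {
	if p.DeletionTimestamp != nil && (p.Phase == "Pending" || p.Phase == "Running") {
		return "Terminating"
	}
	return p.Phase
}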

@bgrant0607
Member

Honestly, since Phase is such a honeypot, I'm tempted to delete it from the API and just create more conditions.

@bgrant0607
Member

I agree w/ @davidopp re. redundancy.

We definitely should also enhance the presentation in kubectl and any UIs.

@bgrant0607
Member

Proposal:

  1. Add LastTransitionTime, Reason, and Message to PodCondition (similar to NodeCondition)
  2. Add new PodCondition types: Initialized, Instantiated (implies fully physically instantiated, not necessarily functional), Terminated, more as necessary
  3. Add Conditions to whatever other objects currently have Phase but not Conditions
  4. Add deletion reason to metadata
  5. Eliminate *.Phase and PodStatus.Message from the v1 API

Success could be recorded in the deletion reason. Any other reason is failure.

We could add a Vacating (aka Terminating, ShuttingDown, Lame, Evicted) PodConditionType, with the specific meaning that notice has been given. NotifiedOfTermination seems too long. I considered updating the Reason and Message of the Instantiated condition, but that doesn't seem right if there's no change in the condition status.
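
For concreteness, a rough sketch of what the proposed PodCondition shape (items 1 and 2 above) might look like, modelled on NodeCondition. Field and constant names beyond those named in the proposal are illustrative.

// Sketch of the proposed shape; not the actual API types.
package api

import "time"

type PodConditionType string

const (
	PodInitialized  PodConditionType = "Initialized"
	PodInstantiated PodConditionType = "Instantiated" // fully physically instantiated, not necessarily functional
	PodTerminated   PodConditionType = "Terminated"
	PodVacating     PodConditionType = "Vacating" // notice of termination has been given
)

type ConditionStatus string // "True", "False", or "Unknown"

type PodCondition struct {
	Type               PodConditionType
	Status             ConditionStatus
	LastTransitionTime time.Time
	Reason             string // brief, machine-readable explanation
	Message            string // human-readable detail
}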

@tmrts
Contributor

tmrts commented Apr 18, 2015

+1 for Vacating PodConditionType. Indicators like this would be helpful when Kubernetes has a Job Controller to manage workloads like batch processing and cron jobs.

@davidopp
Member

Add new PodCondition types: Initialized, Instantiated (implies fully physically instantiated, not necessarily functional), Terminated, more as necessary

To avoid confusion, Conditions should be orthogonal. The ones you suggested -- Instantiated, Initialized, and Terminated -- don't seem to be orthogonal (in fact I would expect a pod to transition from FFF to TFF to TTF to TTT). What you've described sounds more like a state machine than a set of orthogonal conditions, and I think Conditions should only be used for the latter.

Just adding a Vacating PodConditionType might be OK, but it still has the issue I mentioned earlier, where IIUC it is redundant with having a nonzero DeletionTimestamp.

@bgrant0607
Member

@davidopp I intended the conditions to be independent properties. For instance, termination wouldn't cause a pod to become uninitialized. The intent is to eliminate the need for switch statements to infer these properties.

@timothysc
Member Author

@bgrant0607 would it be possible to create a state diagram for pod states, with edges, for the proposed changes? I'm certain this would clear up the confusion.

@timothysc
Member Author

I'm waiting patiently for the Timestamp work to land, and will check then.

@bgrant0607
Member

Created #7856 re. replacement of phase with condition.

@lavalamp
Member

lavalamp commented May 6, 2015

I'm late to the party, but let me +1 the idea that we should have a general graceful deletion semantic and NOT use pod phase to represent this information.

@smarterclayton
Contributor

As an implementation note, we may want to update the containers' grace period during the first delete (when we set the TTL) to the effective grace period. The timestamp is a valid marker but does not make it easy to calculate the desired grace period for termination (which should be calculated on the kubelet from the time it first observes the deletion, not as an absolute clock value).


@davidopp
Member

davidopp commented May 7, 2015

When the Kubelet sees the pod deleted from etcd, it sets a local timer equal to the grace period, so it should always apply the right grace period.
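
A minimal sketch of the timing point in the last two comments: the kubelet derives the hard-kill deadline from the moment it observes the deletion plus the grace period, rather than from an absolute wall-clock value set elsewhere. Names are illustrative, not kubelet internals.

// Illustrative sketch; not kubelet internals.
package kubelet

import "time"

type podKillPlan struct {
	observedAt  time.Time     // when this kubelet first observed the deletion
	gracePeriod time.Duration // effective grace period handed to the kubelet
}

func newPodKillPlan(grace time.Duration) podKillPlan {
	return podKillPlan{observedAt: time.Now(), gracePeriod: grace}
}

// hardKillDue reports whether the grace period, measured from local
// observation time, has elapsed and SIGKILL may be sent.
func (p podKillPlan) hardKillDue(now time.Time) bool {
	return now.Sub(p.observedAt) >= p.gracePeriod
}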

@timothysc
Member Author

I'm going to close this one and join the herd on #7856.
