Closed
Description
We have a PodStatus type, which is a string, with a few values defined. People will start programming to those strings. We should define a state machine that governs what causes state changes. We did this internally and it was a great help in reasoning about what happens.
If nobody balks, I can port over the internal ideas into a design doc for discussion.
Activity
dchen1107 commented on Sep 3, 2014
Yes, I totally support this. When I implemented restart policy support in #1147 and thought about other features, such as garbage collection, events, etc., I couldn't help wishing we had a state machine defined, and I feel that several states are missing from our PodStatus, for example an Error / Failure state. I even started to code a state machine based on our internal design for restart. :-)
smarterclayton commented on Sep 3, 2014
+1
lavalamp commented on Sep 3, 2014
@thockin There was some discussion about this when I added the binding object.
The summary is, we will have only the three states we have now. Additional states will come in the form of clarification of those main three states, in another field (PodStateReason or some such).
We're trying to avoid a monolithic state machine like the one we have internally. This approach will make it easier to have states that not all components need to understand.
@bgrant0607
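To make that shape concrete, here is a rough sketch of a coarse state plus a clarifying reason field; the type and field names are illustrative, not the actual API:

```go
// Illustrative sketch only, not the actual API types: a small, fixed set
// of coarse states plus an open-ended reason that clarifies them.
type PodState string

const (
	PodWaiting    PodState = "Waiting"
	PodRunning    PodState = "Running"
	PodTerminated PodState = "Terminated"
)

type PodStateInfo struct {
	State PodState
	// Reason ("PodStateReason or some such") clarifies the coarse state
	// without growing the set of states every component must understand.
	Reason string
}
```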
bgrant0607 commented on Sep 3, 2014
The discussion @lavalamp mentioned is here: #557 (comment)
The reasons why a pod might not be running need to be extensible, as must the reasons why a pod was terminated.
thockin commented on Sep 4, 2014
Hmm, I'm not very satisfied with just those 3 states.
Internally (on the node side) we defined a fairly simple state machine in which each state very clearly defined what assumptions can and can not be made about an entity. We found it very useful when writing and reading code and docs, once you understood the states.
ACCEPTED: The request to create the entity has been accepted, and a placeholder for it was created. The entity does not yet have presence in actual machine state.
PREPARING: Machine state is actively being changed to instantiate the entity.
ACTIVE: The entity is ready to be used.
SUCCEEDED: The entity has completed. No extra effort to remove machine state is being made.
FAILED: The entity has failed. No extra effort to remove machine state is being made.
CLEANING: Machine state is actively being changed to remove the entity.
DEFUNCT: An error was encountered while CLEANING. No extra effort to remove machine state is being made.
CLEANED: The entity no longer has any presence in actual machine state.
We could probably flatten ACCEPTED and PREPARING. These states should be enough to make decisions about workflows, to tell users unambiguously what is happening, and to allow code to manage sub-objects. We use this state machine for the equivalent of pods and for the pieces that make them up (e.g. volumes, containers, etc).
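For concreteness, here is a rough sketch of how such a state set and its legal transitions could be written down in Go; the transition table is my reading of the descriptions above, not something taken from the internal code:

```go
// Illustrative sketch of the state machine described above; not the
// internal implementation.
type EntityState string

const (
	StateAccepted  EntityState = "ACCEPTED"
	StatePreparing EntityState = "PREPARING"
	StateActive    EntityState = "ACTIVE"
	StateSucceeded EntityState = "SUCCEEDED"
	StateFailed    EntityState = "FAILED"
	StateCleaning  EntityState = "CLEANING"
	StateDefunct   EntityState = "DEFUNCT"
	StateCleaned   EntityState = "CLEANED"
)

// validTransitions encodes which state changes are legal, so code can
// reason about what may happen next from any given state. CLEANED and
// DEFUNCT are terminal. This table is an assumption based on the prose.
var validTransitions = map[EntityState][]EntityState{
	StateAccepted:  {StatePreparing},
	StatePreparing: {StateActive, StateFailed},
	StateActive:    {StateSucceeded, StateFailed},
	StateSucceeded: {StateCleaning},
	StateFailed:    {StateCleaning},
	StateCleaning:  {StateCleaned, StateDefunct},
}

func canTransition(from, to EntityState) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```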
lavalamp commented on Sep 4, 2014
So in the k8s model, we'd have
state/reason:
waiting/ACCEPTED
waiting/PREPARING
running/ACTIVE
terminated/SUCCEEDED
terminated/FAILED
terminated/CLEANING
terminated/DEFUNCT
terminated/CLEANED
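The same mapping expressed as data, as a sketch (all names are illustrative):

```go
// Sketch: each fine-grained reason rolls up into exactly one of the three
// coarse k8s states, following the list above.
var coarseStateForReason = map[string]string{
	"ACCEPTED":  "waiting",
	"PREPARING": "waiting",
	"ACTIVE":    "running",
	"SUCCEEDED": "terminated",
	"FAILED":    "terminated",
	"CLEANING":  "terminated",
	"DEFUNCT":   "terminated",
	"CLEANED":   "terminated",
}
```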
@bgrant0607 how are we supposed to model containers that are being restarted? Substate under running?
thockin commented on Sep 4, 2014
Well, my point was that the distinct states provide useful information that is obscured by lumping them all together. I'd call restarting (e.g. after a crash) "ACTIVE".
bgrant0607 commented on Sep 4, 2014
Is this just internal Kubelet state, or would it be exposed in Kubelet and/or apiserver APIs?
I'm skeptical of the utility of such fine-grain states in a flat state space, and I think it's a big mistake to expose such a thing in the API. The probability that we'll need to add new states in the future as we add functionality and components is 100% and the likelihood that it would be feasible to extend the set of states is 0%. Every piece of code that looked at the state field would almost certainly need to be changed. The problem is that states within a flat state space have no inherent, inferable semantics, and that resulting code that looks at such states is brittle.
Is a new state a new reason for not-yet-running, a new sort-of-running, or a new termination reason? What about restarting? Different forms of suspended? Waiting for activation by an external workflow, by an auto-sizer filling in resource values, or by a gang scheduler scheduling another pod? How about LOST? Does deliberately stopped count as FAILED? What is the state of a durable pod while the machine is down or unreachable? Is ACCEPTED admitted by the apiserver, or by the Kubelet? What state is the pod in during execution of PreStart, PreStop, or PostStop lifecycle hooks? While failing liveness and/or readiness probes?
Simple states have value to humans who want a high-level understanding of what's happening. The states are of questionable value to our components, however.
I advocate that additional useful information should be conveyed by other fields, such as some kind of one-word reason field and some kind of more verbose message (e.g., "image foo/bar download from registry xyz.io failed"), other fields specific to the activity (e.g., assigned host in the case of scheduling), the time the state was last reported/collected, etc.
I'm fine with renaming RUNNING to ACTIVE.
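As a sketch of that shape (a hypothetical struct; these are not actual or proposed field names):

```go
import "time"

// Hypothetical sketch of "coarse state plus open-ended detail"; not a
// proposed API.
type PodStatusDetail struct {
	State        string    // one of a small, fixed set: "waiting", "running", "terminated"
	Reason       string    // one-word, machine-oriented cause, e.g. "ImagePullFailed"; open-ended
	Message      string    // verbose, human-oriented detail, e.g. "image foo/bar download from registry xyz.io failed"
	Host         string    // example of an activity-specific field (assigned host after scheduling)
	LastReported time.Time // when this state was last reported/collected
}
```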
thockin commented on Sep 4, 2014
I completely agree that a fine-grained state machine is a terrible API. The state machine described here is intentionally coarse: each state defines what you can assume about the object. It does not carry reasons or details for state changes, just very blanket statements about what is safe to assume and what is not.
Good questions, though many of them apply equally to what we call PodStatus today. I'll answer these with the caveat that the states described here were node-centric, so they may need to be adjusted for cluster-centric use.
To the first part, none of the above are represented (they are all "reasons", not states).
I'd say "lost" (we expected a pod to be present but it was not) is FAILED.
Stopped is interesting - at the node level we call it FAILED, I think, but that could be distinct. We only recently split ENDED into SUCCEEDED and FAILED, and maybe we shouldn't have.
If the machine disappears for longer than $timeout, pods become FAILED.
ACCEPTED means the apiserver has ACKed the transaction.
PreStart is new (we did not have events yet) and I would say PREPARING (because the main application has not started).
PreStop is ACTIVE (the main application is in control).
PostStop is "ENDED" (either SUCCEEDED or FAILED) because the main application can not be assumed to be alive.
Failing probes is still ACTIVE (the main application is in control).
With simple semantics, you can know whether, for example, it is reasonable to wait for cleanup to complete or whether it is in trouble (historically there are a lot of things that can go wrong during cleanup).
"Image download failed" - will it be retried or not? Relying on strings to carry information like that in a consistent way is horrible and fragile.
I'm willing to invest more time in this, thinking and discussing how cluster semantics change this model (which I literally cut and pasted from internal docs on the 'let), but only if we really think there's a modicum of utility here. I find the current 3 states to be overly broad, vaguely defined, and almost useless.
This is not about names :) [If anything, RUNNING is better, since it follows the ING and ED distinction, but some of the objects we describe with this state machine are not actually "running" so much as "present". Hence ACTIVE.]
bgrant0607 commented on Sep 4, 2014
@thockin I'm supportive of adding more fine-grained reasons, but they shouldn't be considered an exhaustive, finite enumerated list. We WILL need to add more fine distinctions in the future. I'm also fine with keeping the state internally and not exposing it in the APIs.
I could easily imagine inserting a stage before ACCEPTED, for instance for pods pending an external admission control check.
Treating deliberate stops as failures would be misinterpreted by many tools/systems. And, as described in #137, there isn't a bounded number of reasons why a pod might be deliberately stopped.
My point about the lifecycle hooks was that there will be components that want to know about them, and that the Kubelet itself will need to keep track of whether they have been executed or not. Ditto with failing probes, unresponsive hosts, and several other examples. So, how does one decide what to elevate to states and what not to elevate as states? I suggest we don't elevate fine distinctions to top-level status, but remain open-ended and flexible about reporting the finer distinctions.
image load failed: Not intended to be an API. I intended it to be the cause of a failure (not retried at that level). Maybe it was a bad example since it could relate to retry logic, as discussed in #941 and #1088.
thockin commented on Sep 6, 2014
IMO, reasons should be orthogonal to state, if we expose state this way.
I'd argue against it - the new state does not offer any semantic value to a user, does it? If it has not been ACCEPTED into the system, should it even be queryable for state?
Yeah, I could see STOPPED as a peer of SUCCEEDED and FAILED.
I think I agree, if I understand. These states are not fine distinctions; they simply define coarse semantic differences which can be used to understand what can and can not be assumed. They are intentionally vague about details, so as to leave any implementation lots of room. If, internally, kubelet or apiserver wanted to track finer states, that is perfectly valid, as long as those internal states each map to exactly one external state.
For example, during task setup internally, we have a number of states as we install packages and start the process. But those all report externally as PREPARING. Ditto for teardown and SUCCEEDED/FAILED.
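A sketch of that constraint, with made-up internal sub-state names (nothing here is from the internal code):

```go
// Sketch: kubelet-internal sub-states (made-up names) each map to exactly
// one externally reported state, so internal detail can grow without
// changing what clients see.
var externalStateFor = map[string]string{
	// setup sub-states
	"installing-packages": "PREPARING",
	"starting-process":    "PREPARING",
	// teardown sub-states
	"stopping-process": "CLEANING",
	"removing-volumes": "CLEANING",
}

func reportedState(internal string) string {
	if ext, ok := externalStateFor[internal]; ok {
		return ext
	}
	return internal // states with no finer internal breakdown report themselves
}
```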
bgrant0607 commented on Sep 8, 2014
@thockin "reasons should be orthogonal to state": You mean that the clients shouldn't also need to look at state in order to interpret reason? I agree with that.
State before ACCEPTED: Accepted by apiserver or by kubelet? An example of where it would be useful to expose a pre-accepted state would be in order to await action by a budget-checking service. Anyway, I don't want to rathole on this. It was just one example.
STOPPED: STOPPED, SUCCEEDED, and FAILED are all flavors of TERMINATED -- no longer active. We should capture and report more detailed termination reasons, exit codes, etc.
While I can see why Kubelet might want to keep track of them internally, I can't imagine cases where CLEANING, DEFUNCT, and CLEANED are going to be significant to API clients or even to cluster-level management components. This is why I consider them "fine distinctions". I didn't say that reasons/subStates/stateDetails could have no semantics, only that not every component would need to understand them, since they'd have the option of looking at the coarse states instead.
ACCEPTED and PREPARING are similar -- there's also another way a client could distinguish these states, which is by whether the host info (host, hostIP, etc.) is initialized or not.
With respect to what a client can assume: The apiserver will necessarily have stale information, and the client will have even-more-stale information. If the client thought, for example, that the state were PREPARING, does that mean that the pod is not yet running? Of course not.
I propose we keep high-level states, waiting (or pending), active (was running), and terminated, and push anything else into Reason.
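Under that proposal, a client would branch only on the coarse state and treat Reason as informational, so new reasons can appear without breaking existing code; a minimal sketch:

```go
// Sketch of client-side handling under the three-state proposal: new
// reasons can appear without this switch needing to change.
func describe(state, reason string) string {
	switch state {
	case "waiting":
		return "not yet running: " + reason
	case "active":
		return "running"
	case "terminated":
		return "no longer running: " + reason
	default:
		return "unknown state " + state + ": " + reason
	}
}
```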
What I currently see in type Status seems pretty reasonable: