AWS: We should run the master in an autoscaling group of size 1 #11934

This will provide automatic relaunch in case of failure.
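As a rough illustration of the proposal, here is a minimal boto3 sketch that wraps a single master in an autoscaling group of min/max size 1. All names (the launch configuration, zone, grace period) are hypothetical placeholders, not kube-up's actual configuration:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Hypothetical resource names; the real deployment scripts may differ.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="kubernetes-master-group",
    LaunchConfigurationName="kubernetes-master-lc",  # assumed to already exist
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    AvailabilityZones=["us-east-1a"],
    # Plain EC2 health checks only catch a dead VM; noticing a live VM
    # with a dead apiserver would need an ELB health check or an
    # external monitor (see the question below).
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)
```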
How does AWS handle mounting persistent disks to instances in an autoscaling group? Also, what about health checks? You also want to relaunch the VM when it is running but the apiserver is down.

Are the files/configuration that need to survive termination in a known location? We could create an EBS volume and mount it on the master instance. Alternatively, I think the same idea would work, but the volume would need to be the boot volume.

@jboelter We put all of the config that needs to survive on an EBS volume that is mounted to the master when it is initially created (not the boot volume, but a second disk that has the essential info placed on it). @roberthbailey We can mount a blank disk or a snapshot of a disk, but I don't think there is any way for the ASG to know to remount the disk that was used previously. For this to come back up with the correct data, we could run a script when the instance starts; it would make some AWS API calls to try to find an existing EBS volume for the master and remount it. @justinsb might have a better solution in mind though :)
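A minimal sketch of the kind of boot-time script described above, assuming the master volume carries a hypothetical Role=kubernetes-master-pd tag (the tag name and device path are illustrative, not what any deployment script actually uses):

```python
import urllib.request

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Ask the EC2 instance metadata service who we are.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id"
).read().decode()

# Find the detached volume tagged as this cluster's master disk.
volumes = ec2.describe_volumes(
    Filters=[
        {"Name": "tag:Role", "Values": ["kubernetes-master-pd"]},  # hypothetical tag
        {"Name": "status", "Values": ["available"]},
    ]
)["Volumes"]

if len(volumes) != 1:
    # Zero matches means there is nothing to reattach; several matches
    # means the tag is ambiguous (the duplicate-tag risk raised below).
    raise RuntimeError("expected exactly one master volume, found %d" % len(volumes))

ec2.attach_volume(
    VolumeId=volumes[0]["VolumeId"],
    InstanceId=instance_id,
    Device="/dev/xvdb",  # illustrative device name
)
```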
@iterion Perfect -- the ASG has an associated LaunchConfiguration that specifies the details, so we should be able to reference a known volume id created beforehand. This assumes there are no race conditions with the volume still being in use after termination while the new instance is created. Edit: it appears that the AutoScaling EBS type doesn't allow a volume id (which would only make sense for an ASG of size 1), so mounting with an init script may be the way to go. We should still be able to use a well-known volume id though.

@jboelter Interesting; I can't find where to specify the volume id when creating a launch configuration either, though perhaps I'm looking in the wrong place. It looks as if you can only specify a snapshot id. FYI, I'm looking here: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-launchconfig-blockdev-mapping.html#cfn-as-launchconfig-blockdev-mapping-ebs

@iterion Yeah, I just noticed the same and edited my note above as you posted.
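For reference, a minimal sketch of the limitation being discussed: the EBS mapping on a launch configuration accepts a snapshot id or a blank volume size, but has no field for an existing volume id (all names below are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="kubernetes-master-lc",
    ImageId="ami-12345678",    # placeholder AMI
    InstanceType="m3.medium",
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/xvdb",
            "Ebs": {
                # A snapshot id (or just a size, for a blank disk) works;
                # there is nowhere to reference an existing volume id.
                "SnapshotId": "snap-12345678",
                "VolumeSize": 20,
                "VolumeType": "gp2",
                "DeleteOnTermination": False,
            },
        }
    ],
)
```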
Bummer. Perhaps we could tag the ASG or launch configuration with the volume id that was used? Alternatively, we could tag the EBS volume with something that identifies it as the master disk for that cluster. We do run the risk of having multiple disks with the same tags, though.

I'm going to make an attempt at this. I am planning to tag the volume and then mount it as part of instance boot.
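A sketch of the tagging half of that approach: create the master volume once, tagged so a boot-time script like the one above can find it (tag keys and values are again hypothetical):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # must match the ASG's zone
    Size=20,
    VolumeType="gp2",
    TagSpecifications=[
        {
            "ResourceType": "volume",
            "Tags": [
                {"Key": "Role", "Value": "kubernetes-master-pd"},
                # Scoping the tag with a cluster name reduces (but does not
                # eliminate) the risk of two volumes carrying the same tags.
                {"Key": "Cluster", "Value": "my-cluster"},
            ],
        }
    ],
)
print(volume["VolumeId"])
```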
Rather than have a separate process or script that discovers the volume, tries to mount it, and then starts our processes, I am experimenting with using the kubelet for this.

Advantages:

Shortcomings:
+1. On AWS we should bring the master up in an ASG for self-healing (in conjunction with the master using an EIP; I can't seem to find the issue # at the moment), or configure multiple masters (preferably still behind an ASG!).
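A sketch of the EIP half of that suggestion: a replacement master re-associates a well-known elastic IP with itself at boot, so the apiserver keeps a stable address across relaunches (the allocation id below is a placeholder):

```python
import urllib.request

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id"
).read().decode()

# Re-point the cluster's well-known elastic IP at this instance.
ec2.associate_address(
    AllocationId="eipalloc-12345678",  # placeholder allocation id
    InstanceId=instance_id,
    AllowReassociation=True,  # take it over from a terminated predecessor
)
```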
Good news: I have this working on a branch. Bad news: the diff is pretty substantial. I am cherry-picking smaller PRs across so that the remaining changes become palatable!

Do you have a list of the things that need to be restored, other than etcd?

Is it plausible to split etcd out into its own autoscaling group?
This is implemented in kops. As kube-up is in maintenance mode, it won't be implemented there. @Zilman it's plausible, but then the etcd ASG becomes the challenging one!

@justinsb: Well, if you have an etcd ASG of size 3, it seems to me that you don't really need to persist anything, as at least one etcd instance is guaranteed to stay up.

A three-node etcd cluster can't operate with fewer than two nodes running, and if you lose more than one node's data you're restoring from backups. Also, without a runtime reconfiguration, your etcd members are fixed.
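The quorum arithmetic behind that: a cluster of n etcd members needs floor(n/2) + 1 of them up to make progress, so 3 members tolerate exactly one failure. A quick check:

```python
def quorum(n: int) -> int:
    # etcd (Raft) requires a majority of members to make progress.
    return n // 2 + 1

for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {n - quorum(n)} failure(s)")
# 1 members: quorum 1, tolerates 0 failure(s)
# 3 members: quorum 2, tolerates 1 failure(s)
# 5 members: quorum 3, tolerates 2 failure(s)
```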