
AWS: We should run the master in an autoscaling group of size 1 #11934

Closed
justinsb opened this issue Jul 28, 2015 · 16 comments
Assignees
justinsb
Labels
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
  • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@justinsb
Member

This will provide automatic relaunch in case of failure.
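
For illustration, a rough boto3 sketch of the shape of this; all names, IDs, and the region below are placeholders, not anything that exists today:

```python
# Sketch only: a launch configuration plus an ASG pinned to
# min = max = 1, so a failed master is automatically replaced.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="k8s-master-lc",   # placeholder name
    ImageId="ami-00000000",                    # placeholder master AMI
    InstanceType="m3.medium",
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="k8s-master-asg",
    LaunchConfigurationName="k8s-master-lc",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    # Pin to a single AZ so a replacement instance can reattach the
    # master's EBS volume (EBS volumes are AZ-local).
    AvailabilityZones=["us-east-1a"],
)
```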

@roberthbailey
Contributor

How does AWS handle mounting persistent disks to instances in an autoscaling group? Also, what about health checks (you also want to re-launch the VM if the VM is running but the apiserver is down)?
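
For the second half, one option might be an ELB health check against the apiserver, so the ASG replaces an instance whose VM is up but whose apiserver is down. A rough boto3 sketch; the ELB and ASG names are made up, and this assumes a classic ELB already fronts the master:

```python
# Sketch only: make the ASG fail instances on apiserver health,
# not just EC2 instance status checks.
import boto3

elb = boto3.client("elb", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

elb.configure_health_check(
    LoadBalancerName="k8s-master-elb",   # placeholder, assumed to exist
    HealthCheck={
        "Target": "HTTPS:443/healthz",   # apiserver health endpoint
        "Interval": 10,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)

autoscaling.attach_load_balancers(
    AutoScalingGroupName="k8s-master-asg",
    LoadBalancerNames=["k8s-master-elb"],
)

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="k8s-master-asg",
    HealthCheckType="ELB",          # replace on failed ELB health checks
    HealthCheckGracePeriod=300,     # give the master time to boot first
)
```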

@roberthbailey added the area/platform/aws and sig/cluster-lifecycle labels on Jul 28, 2015
@jboelter

Are the files/configuration that need to survive termination in a known location?

We could create an EBS volume and mount it in the master instance. Alternatively, I think the same idea would work, but it would need to be the boot volume.

@iterion
Contributor

iterion commented Jul 29, 2015

@jboelter We put all of the config that needs to survive on an EBS volume that is mounted to the master when it is initially created (not the boot volume, but a second disk that has the essential info placed on it).

@roberthbailey We can mount a blank disk or a snapshot of a disk, but I don't think there is any way for the ASG to know to remount the disk that was used previously.

For this to come back up with the correct data we could run a script when the instance starts. That script would make some AWS API calls to try to find an existing EBS volume for the master and remount it. @justinsb might have some better solution in mind though :)
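
Roughly something like this (a sketch only; the tag keys are invented for illustration):

```python
# Rough sketch of a boot-time script: find the cluster's tagged master
# volume in this instance's AZ and reattach it.
import urllib.request
import boto3

def metadata(path):
    # EC2 instance metadata service
    url = "http://169.254.169.254/latest/meta-data/" + path
    return urllib.request.urlopen(url).read().decode()

instance_id = metadata("instance-id")
az = metadata("placement/availability-zone")   # e.g. "us-east-1a"

ec2 = boto3.client("ec2", region_name=az[:-1])

volumes = ec2.describe_volumes(Filters=[
    {"Name": "tag:KubernetesCluster", "Values": ["my-cluster"]},  # invented tags
    {"Name": "tag:k8s.io/role", "Values": ["master"]},
    {"Name": "availability-zone", "Values": [az]},
])["Volumes"]

assert len(volumes) == 1, "expected exactly one tagged master volume"

ec2.attach_volume(
    VolumeId=volumes[0]["VolumeId"],
    InstanceId=instance_id,
    Device="/dev/xvdf",
)
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volumes[0]["VolumeId"]])
# ...then mount the filesystem and start the master components.
```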

@jboelter

@iterion perfect -- the ASG has an associated LaunchConfiguration that specifies the details. We should be able to reference a known volume id created beforehand. This assumes there are no race conditions where the volume is still in use after termination while the new instance is being created.

Edit: It appears that the AutoScaling EBS type doesn't allow for a volume id (which would only make sense for an ASG of size 1) -- mounting with an init script may be the way to go. We should still be able to use a well-known volume id, though.

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-launchconfig-blockdev-template.html

@iterion
Contributor

iterion commented Jul 29, 2015

@jboelter Interesting, I can't find where to specify the volume id when creating a launch configuration; perhaps I'm looking in the wrong place. It looks as if you can specify a BlockDeviceMapping, and on that mapping there is a way to configure an EBS volume, but it only lets you specify a snapshot id.

FYI - I'm looking here: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-launchconfig-blockdev-mapping.html#cfn-as-launchconfig-blockdev-mapping-ebs
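
Sketched in boto3 terms for illustration (placeholder names): the Ebs entry accepts a SnapshotId, but there is no key at all for an existing volume id.

```python
# Sketch: the EBS entry in a launch configuration's block device
# mapping takes a snapshot id plus size/type, not a volume id --
# which is why an init script is needed to reattach a known volume.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="k8s-master-lc",     # placeholder
    ImageId="ami-00000000",                      # placeholder
    InstanceType="m3.medium",
    BlockDeviceMappings=[{
        "DeviceName": "/dev/xvdf",
        "Ebs": {
            "SnapshotId": "snap-00000000",       # snapshots: yes...
            "VolumeSize": 20,
            "DeleteOnTermination": False,
            # "VolumeId": ...                    # ...but no such key exists
        },
    }],
)
```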

@jboelter

@iterion yeah, just noticed the same and edited my note above as you posted

@iterion
Contributor

iterion commented Jul 29, 2015

Bummer. Perhaps we could tag the ASG or launch configuration with the volume id that was used? Alternatively, we could tag the EBS volume with something that identifies it as the master disk for that cluster. We run the risk of having multiple disks with the same tags, though.

@justinsb
Member Author

I'm going to make an attempt at this.

I am planning on using the approach of tagging the volume and then trying to mount it as part of instance boot.
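
For illustration, the tagging side might look roughly like this (placeholder tag keys, matching the lookup sketch earlier in the thread):

```python
# Sketch: create the persistent master volume once, and tag it so
# the boot-time lookup script can find it. All values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # must match the ASG's AZ
    Size=20,
    VolumeType="gp2",
)

ec2.create_tags(
    Resources=[volume["VolumeId"]],
    Tags=[
        {"Key": "KubernetesCluster", "Value": "my-cluster"},
        {"Key": "k8s.io/role", "Value": "master"},
    ],
)
```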

@mbforbes added the priority/awaiting-more-evidence and sig/autoscaling labels on Aug 16, 2015
@justinsb
Member Author

Rather than have a separate process or script that discovers the volume, mounts it, and then starts our processes, I am experimenting with using the kubelet for this:
justinsb@334ad49

Advantages:

  • we could easily have hot-failover machines (i.e. run an auto-scaling group with multiple machines). Mounting a volume is a simple way to do leader election on many clouds/environments (see the sketch after the shortcomings list below).

Shortcomings:

  • this requires passing an explicit volume ID in, but I hope that in the future we will be able to specify volumes using something like k8s selectors & labels (Specify PersistentVolumeClaimSource by Selector, not Name #9712).
  • this requires a volume per process. This may not be a bad thing: better isolation, and volumes are pretty cheap (on AWS & GCE at least). We could implement volumes on volumes (a subdirectory on a volume, which k8s could copy/move around).
  • because of the above, there is no guarantee that we will launch everything on the same machine in a multi-machine environment. This may require some tweaks particularly during bootstrapping, and we would prefer minimal latency to etcd.
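
To make the leader-election point in the advantages list concrete, here is a sketch of volume-attach-as-lock. The VolumeInUse error code is what EC2 returns when a volume is already attached; the rest of the names are hypothetical:

```python
# Sketch: EC2 only allows one attachment per volume, so whichever
# instance attaches first "wins". The loser gets VolumeInUse and can
# keep retrying as a hot standby.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def try_become_master(volume_id, instance_id):
    try:
        ec2.attach_volume(VolumeId=volume_id,
                          InstanceId=instance_id,
                          Device="/dev/xvdf")
        return True                       # we hold the volume: we lead
    except ClientError as e:
        if e.response["Error"]["Code"] == "VolumeInUse":
            return False                  # another instance is master
        raise
```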

@pikeas

pikeas commented Nov 12, 2015

+1. On AWS the cluster should come up with an ASG in front of the master for self-healing (in conjunction with the master using an EIP; I can't seem to find the issue # at the moment), or be configured with multiple masters (preferably still behind an ASG!).

@justinsb
Member Author

Good news: I have this working on a branch. Bad news: the diff is pretty substantial. I am cherry-picking smaller PRs across so that the remaining changes become palatable!

@justinsb self-assigned this on Nov 12, 2015
@jwerak

jwerak commented May 12, 2016

Do you have a list of things that need to be restored, other than etcd?

@namliz

namliz commented Aug 18, 2016

Is it plausible to split out etcd into its own autoscaling group?
If so, you could just scale masters and the etcd cluster independently and there's no need to persist anything.

@justinsb
Member Author

This is implemented in kops. As kube-up is in maintenance mode, it won't be implemented there.

@Zilman it's plausible, but then the etcd ASG becomes the challenging one!

@namliz

namliz commented Aug 18, 2016

@justinsb: well, if you have an etcd ASG of size 3, it seems to me that you don't really need to persist anything, as at least one etcd instance is guaranteed to stay up.

@erutherford

A 3-node etcd cluster can't operate with fewer than 2 nodes running. If you lose more than one node's data, you're restoring from backups.

Also, without a runtime reconfiguration, your etcd member list is fixed.
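
For reference, the quorum arithmetic behind that:

```python
# etcd needs a majority of members: quorum(n) = n // 2 + 1.
# For n = 3 that is 2, so losing 2 of 3 members stalls the cluster
# even though 1 instance is still up.
def quorum(n):
    return n // 2 + 1

assert quorum(3) == 2
assert quorum(5) == 3
```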
