Updates to Performance and Scalability in Kubernetes 1.3 -- 2,000 node 60,000 pod clusters

July 7, 2016, 12:00 pm

We are proud to announce that with the release of version 1.3, Kubernetes now supports 2000-node clusters with even better end-to-end pod startup time. The latency of our API calls is within our one-second Service Level Objective (SLO), and most of them are even an order of magnitude better than that. It is possible to run larger deployments than a 2,000 node cluster, but performance may be degraded and it may not meet our strict SLO.

In this blog post we discuss the detailed performance results from Kubernetes 1.3 and what changes we made from version 1.2 to achieve these results. We also describe Kubemark, a performance testing tool that we’ve integrated into our continuous testing framework to detect performance and scalability regressions.

Evaluation Methodology

We have described our test scenarios in a previous blog post. The biggest change since the 1.2 release is that in our API responsiveness tests we now create and use multiple namespaces. In particular for the 2000-node/60000 pod cluster tests we create 8 namespaces. The change was done because we believe that users of such very large clusters are likely to use many namespaces, certainly at least 8 in the cluster in total.
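
To put the test configuration in perspective, here is a small worked calculation using only the figures quoted above. The even spread of pods across namespaces is an assumption made for illustration; the post does not say how pods are distributed.

```python
# Figures quoted in the post for the largest test cluster.
NODES = 2000
PODS = 60000
NAMESPACES = 8

# Average pod density per node.
pods_per_node = PODS // NODES

# Pods per namespace, ASSUMING an even spread (the post does not say
# how pods are distributed across the 8 namespaces).
pods_per_namespace = PODS // NAMESPACES

print(f"{pods_per_node} pods per node on average")
print(f"{pods_per_namespace} pods per namespace if spread evenly")
```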

Metrics from Kubernetes 1.3

So, what is the performance of Kubernetes version 1.3? The following graph shows the end-to-end pod startup latency with a 2000 and 1000 node cluster. For comparison we show the same metric from Kubernetes 1.2 with a 1000-node cluster.

The next graphs show API response latency for a v1.3 2000-node cluster.

How did we achieve these improvements?

The biggest change that we made for scalability in Kubernetes 1.3 was adding an efficient Protocol Buffer-based serialization format to the API as an alternative to JSON. It is primarily intended for communication between Kubernetes control plane components, but all API server clients can use this format. All Kubernetes control plane components now use it for their communication, but the system continues to support JSON for backward compatibility.
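
As a rough sketch of what "an alternative to JSON" means for a client, the snippet below builds an HTTP Accept header that asks for Protocol Buffers first and falls back to JSON. The media type string is the one we understand the API server to use for Protocol Buffers; the helper names `accept_header` and `pick_decoder` are hypothetical and exist only for this illustration.

```python
# Media types: the protobuf one is, to our understanding, what the
# Kubernetes API server advertises; treat it as an assumption here.
PROTOBUF = "application/vnd.kubernetes.protobuf"
JSON = "application/json"


def accept_header(prefer_protobuf: bool = True) -> str:
    """Build an Accept header value (hypothetical helper).

    Listing JSON second keeps the client working against servers
    that only speak JSON.
    """
    if prefer_protobuf:
        return f"{PROTOBUF}, {JSON}"
    return JSON


def pick_decoder(content_type: str) -> str:
    """Choose a decoder name from a response Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == PROTOBUF:
        return "protobuf"
    if media_type == JSON:
        return "json"
    raise ValueError(f"unsupported content type: {content_type}")


print(accept_header())
print(pick_decoder("application/json; charset=utf-8"))
```

Because the server still supports JSON, a client written this way keeps working whichever format the server chooses to return.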

We didn’t change the format in which we store cluster state in etcd to Protocol Buffers yet, as we’re still working on the upgrade mechanism. But we’re very close to having this ready, and we expect to switch the storage format to Protocol Buffers in Kubernetes 1.4. Our experiments show that this should reduce pod startup end-to-end latency by another 30%.

How do we test Kubernetes at scale?

Spawning clusters with 2000 nodes is expensive and time-consuming. While we need to do this at least once for each release to collect real-world performance and scalability data, we also need a lighter-weight mechanism that can allow us to quickly evaluate our ideas for different performance improvements, and that we can run continuously to detect performance regressions. To address this need we created a tool called “Kubemark.”

What is “Kubemark”?

Kubemark is a performance testing tool which allows users to run experiments on emulated clusters. We use it for measuring performance in large clusters.

A Kubemark cluster consists of two parts: a real master node running the normal master components, and a set of “hollow” nodes. The prefix “hollow” means an implementation/instantiation of a component with some “moving parts” mocked out. The best example is hollow-kubelet, which pretends to be an ordinary Kubelet, but doesn’t start any containers or mount any volumes. It just claims it does, so from master components’ perspective it behaves like a real Kubelet.

Since we want a Kubemark cluster to be as similar to a real cluster as possible, we use the real Kubelet code with an injected fake Docker client. Similarly, hollow-proxy (the KubeProxy equivalent) reuses the real KubeProxy code with an injected no-op Proxier interface (to avoid mutating iptables).

Thanks to those changes:

many hollow-nodes can run on a single machine, because they do not modify the environment in which they are running;
without real containers running and without the need for a container runtime (e.g. Docker), we can run up to 14 hollow-nodes on a 1-core machine;
yet hollow-nodes generate roughly the same load on the API server as their “whole” counterparts, so they provide a realistic load for performance testing [the only fundamental difference is that we are not simulating any errors that can happen in reality (e.g. failing containers); adding support for this is a potential extension to the framework in the future].
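
The injection idea behind hollow components can be sketched in a few lines. This is not Kubemark's code: the `Kubelet`, `ContainerRuntime` and `FakeRuntime` names below are hypothetical stand-ins that only illustrate how the same node logic can run against a fake runtime that claims success without starting anything.

```python
from typing import Dict, List, Protocol, Tuple


class ContainerRuntime(Protocol):
    """The interface the (hypothetical) kubelet logic depends on."""

    def start_container(self, pod: str, image: str) -> str: ...


class FakeRuntime:
    """Pretends to start containers; only records what was requested."""

    def __init__(self) -> None:
        self.started: List[Tuple[str, str]] = []

    def start_container(self, pod: str, image: str) -> str:
        # No process is launched and no volume is mounted.
        self.started.append((pod, image))
        return "Running"


class Kubelet:
    """Stand-in for the real node logic, unaware of which runtime it got."""

    def __init__(self, runtime: ContainerRuntime) -> None:
        self.runtime = runtime
        self.pod_status: Dict[str, str] = {}

    def sync_pod(self, pod: str, image: str) -> str:
        # The status reported upstream comes from the injected runtime.
        self.pod_status[pod] = self.runtime.start_container(pod, image)
        return self.pod_status[pod]


hollow_kubelet = Kubelet(FakeRuntime())
status = hollow_kubelet.sync_pod("web-1", "nginx")
print(status)
```

From the caller's point of view the pod is reported as running, which is the behaviour the master components observe from a hollow node.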

How do we set up Kubemark clusters?

To create a Kubemark cluster we use the power that Kubernetes itself gives us - we run Kubemark clusters on Kubernetes. Let’s describe this in detail.

In order to create an N-node Kubemark cluster, we:

create a regular Kubernetes cluster where we can run N hollow-nodes [e.g. to create a 2000-node Kubemark cluster, we create a regular Kubernetes cluster with 22 8-core nodes]
create a dedicated VM, where we start all master components for our Kubemark cluster (etcd, apiserver, controllers, scheduler, …).
schedule N “hollow-node” pods on the base Kubernetes cluster. Those hollow-nodes are configured to talk to the Kubemark API server running on the dedicated VM
finally, we create addon pods (currently just Heapster) by scheduling them on the base cluster and configuring them to talk to the Kubemark API server
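
The sizing in the first step can be checked with a little arithmetic. The sketch below combines the "up to 14 hollow-nodes on a 1-core machine" figure with the 22 8-core nodes quoted above; it assumes that hollow-node density scales linearly with core count, which the post does not state explicitly.

```python
import math

# Upper bound quoted in the post for a 1-core machine.
HOLLOW_NODES_PER_CORE = 14


def max_hollow_nodes(machines: int, cores_per_machine: int) -> int:
    """Upper bound on hollow-nodes, ASSUMING linear scaling with cores."""
    return machines * cores_per_machine * HOLLOW_NODES_PER_CORE


def min_machines(target_hollow_nodes: int, cores_per_machine: int) -> int:
    """Fewest machines that could host the target at maximum density."""
    per_machine = cores_per_machine * HOLLOW_NODES_PER_CORE
    return math.ceil(target_hollow_nodes / per_machine)


capacity = max_hollow_nodes(machines=22, cores_per_machine=8)
needed = min_machines(target_hollow_nodes=2000, cores_per_machine=8)
print(f"22 x 8-core nodes can host up to {capacity} hollow-nodes")
print(f"at maximum density, 2000 hollow-nodes need at least {needed} machines")
```

Under these assumptions the 22-node base cluster has room for more than 2000 hollow-nodes, leaving some headroom beyond the theoretical minimum.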

Once this is done, you have a usable Kubemark cluster that you can run your (performance) tests on. We have scripts for doing all of this on Google Compute Engine (GCE). For more details, take a look at our guide.

One thing worth mentioning here is that while running Kubemark, underneath we’re also testing Kubernetes correctness. Obviously your Kubemark cluster will not work correctly if the base Kubernetes cluster under it doesn’t work.

Performance measured in real clusters vs Kubemark

Crucially, the performance of Kubemark clusters is mostly similar to the performance of real clusters. For the pod startup end-to-end latency, as shown in the graph below, the difference is negligible:

For the API-responsiveness, the differences are higher, though generally less than 2x. However, trends are exactly the same: an improvement/regression in a real cluster is visible as a similar percentage drop/increase in metrics in Kubemark.

Conclusion

We continue to improve the performance and scalability of Kubernetes. In this blog post we
showed that the 1.3 release scales to 2000 nodes while meeting our responsiveness SLOs
explained the major change we made to improve scalability from the 1.2 release, and
described Kubemark, our emulation framework that allows us to quickly evaluate the performance impact of code changes, both when experimenting with performance improvement ideas and to detect regressions as part of our continuous testing infrastructure.

Please join our community and help us build the future of Kubernetes! If you’re particularly interested in scalability, participate by:

chatting with us on our Slack channel
joining the scalability Special Interest Group, which meets every Thursday at 9 AM Pacific Time on this SIG-Scale Hangout

For more information about the Kubernetes project, visit kubernetes.io and follow us on Twitter @Kubernetesio.

-- Wojciech Tyczynski, Software Engineer, Google

↧

Five Days of Kubernetes 1.3

July 11, 2016, 10:52 am

Last week we released Kubernetes 1.3, two years from the day when the first Kubernetes commit was pushed to GitHub. Now, 30,000+ commits later from over 800 contributors, this 1.3 release is jam-packed with updates driven by feedback from users.

While many new improvements and features have been added in the latest release, we’ll be highlighting several that stand out. Follow along and read these in-depth posts on what’s new and how we continue to make Kubernetes the best way to manage containers at scale.

Day 1
* Minikube: easily run Kubernetes locally
* rktnetes: brings rkt container engine to Kubernetes
Day 2
* Autoscaling in Kubernetes
* Partner post: Kubernetes in Rancher, the further evolution
Day 3
* Deploying thousand instances of Cassandra using Pet Set
* Partner post: Stateful Applications in Containers, by Diamanti
Day 4
* Cross Cluster Services
* Partner post: Citrix and NetScaler CPX
Day 5
* Dashboard - Full Featured Web Interface for Kubernetes
* Partner post: Steering an Automation Platform at Wercker with Kubernetes
Bonus
* Updates to Performance and Scalability

Connect

We’d love to hear from you and see you participate in this growing community:

Get involved with the Kubernetes project on GitHub
Post questions (or answer questions) on Stack Overflow
Connect with the community on Slack
Follow us on Twitter @Kubernetesio for latest updates

↧

Minikube: easily run Kubernetes locally

July 11, 2016, 10:53 am

Editor's note: This is the first post in a series of in-depth articles on what's new in Kubernetes 1.3

While Kubernetes is one of the best tools for managing containerized applications available today, and has been production-ready for over a year, Kubernetes has been missing a great local development platform.

For the past several months, several of us from the Kubernetes community have been working to fix this in the Minikube repository on GitHub. Our goal is to build an easy-to-use, high-fidelity Kubernetes distribution that can be run locally on Mac, Linux and Windows workstations and laptops with a single command.

Thanks to lots of help from members of the community, we're proud to announce the official release of Minikube. This release comes with support for Kubernetes 1.3, new commands to make interacting with your local cluster easier and experimental drivers for xhyve (on Mac OSX) and KVM (on Linux).

Using Minikube

Minikube ships as a standalone Go binary, so installing it is as simple as downloading Minikube and putting it on your path. Minikube currently requires that you have VirtualBox installed, which you can download here.

(This is for Mac; for Linux, replace “minikube-darwin-amd64” with “minikube-linux-amd64”.)

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

To start a Kubernetes cluster in Minikube, use the `minikube start` command:

$ minikube start

Starting local Kubernetes cluster...

Kubernetes is available at https://192.168.99.100:443

Kubectl is now configured to use the cluster

At this point, you have a running single-node Kubernetes cluster on your laptop! Minikube also configures `kubectl` for you, so you're also ready to run containers with no changes.

Minikube creates a Host-Only network interface that routes to your node. To interact with running pods or services, you should send traffic over this address. To find out this address, you can use the `minikube ip` command:

$ minikube ip

Minikube also comes with the Kubernetes Dashboard. To open this up in your browser, you can use the built-in `minikube dashboard` command:

$ minikube dashboard

In general, Minikube supports everything you would expect from a Kubernetes cluster. You can use `kubectl exec` to get a bash shell inside a pod in your cluster. You can use the `kubectl port-forward` and `kubectl proxy` commands to forward traffic from localhost to a pod or the API server.

Since Minikube is running locally instead of on a cloud provider, certain provider-specific features like LoadBalancers and PersistentVolumes will not work out-of-the-box. However, you can use NodePort Services in place of LoadBalancers, and HostPath PersistentVolumes for storage.

Architecture

Minikube is built on top of Docker's libmachine, and leverages the driver model to create, manage and interact with locally-run virtual machines.

RedSpread was kind enough to donate their localkube codebase to the Minikube repo, which we use to spin up a single-process Kubernetes cluster inside a VM. Localkube bundles etcd, DNS, the Kubelet and all the Kubernetes master components into a single Go binary, and runs them all via separate goroutines.

Upcoming Features

Minikube has been a lot of fun to work on so far, and we're always looking to improve Minikube to make the Kubernetes development experience better. If you have any ideas for features, don't hesitate to let us know in the issue tracker.

Here's a list of some of the things we're hoping to add to Minikube soon:

Native hypervisor support for OSX and Windows

We're planning to remove the dependency on VirtualBox, and integrate with the native hypervisors included in OSX and Windows (Hypervisor.framework and Hyper-V, respectively).

Improved support for Kubernetes features

We're planning to increase the range of supported Kubernetes features, to include things like Ingress.

Configurable versions of Kubernetes

Today Minikube only supports Kubernetes 1.3. We're planning to add support for user-configurable versions of Kubernetes, to make it easier to match on your laptop what you have running in production.

Community

We'd love to hear feedback on Minikube. To join the community:

Post issues or feature requests on GitHub
Join us in the #minikube channel on Slack

Please give Minikube a try, and let us know how it goes!

--Dan Lorenc, Software Engineer, Google

↧

rktnetes brings rkt container engine to Kubernetes

July 11, 2016, 12:34 pm

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.3

As part of Kubernetes 1.3, we’re happy to report that our work to bring interchangeable container engines to Kubernetes is bearing early fruit. What we affectionately call “rktnetes” is included in the version 1.3 Kubernetes release, and is ready for development use. rktnetes integrates support for CoreOS rkt into Kubernetes as the container runtime on cluster nodes, and is now part of the mainline Kubernetes source code. Today it’s easier than ever for developers and ops pros with container portability in mind to try out running Kubernetes with a different container engine.

"We find CoreOS’s rkt a compelling container engine in Kubernetes because of how rkt is composed with the underlying systemd,” said Mark Petrovic, senior MTS and architect at Xoom, a PayPal service. “The rkt runtime assumes only the responsibility it needs to, then delegates to other system services where appropriate. This separation of concerns is important to us.”

What’s rktnetes?

rktnetes is the nickname given to the code that enables Kubernetes nodes to execute application containers with the rkt container engine, rather than with Docker. This change adds new abilities to Kubernetes, for instance running containers under flexible levels of isolation. rkt explores an alternative approach to container runtime architecture, aimed to reflect the Unix philosophy of cleanly separated, modular tools. Work done to support rktnetes also opens up future possibilities for Kubernetes, such as multiple container image format support, and the integration of other container runtimes tailored for specific use cases or platforms.

Why does Kubernetes need rktnetes?

rktnetes is about more than just rkt. It’s also about refining and exercising Kubernetes interfaces, and paving the way for other modular runtimes in the future. While the Docker container engine is well known, and is currently the default Kubernetes container runtime, a number of benefits derive from pluggable container environments. Some clusters may call for very specific container engine implementations, for example, and ensuring the Kubernetes design is flexible enough to support alternate runtimes, starting with rkt, helps keep the interfaces between components clean and simple.

Separation of concerns: Decomposing the monolithic container daemon

The current container runtime used by Kubernetes imposes a number of design decisions. Experimenting with other container execution architectures is worthwhile in such a rapidly evolving space. Today, when Kubernetes sends a request to a node to start running a pod, it communicates through the kubelet on each node with the default container runtime’s central daemon, responsible for managing all of the node’s containers.

rkt does not implement a monolithic container management daemon. (It is worth noting that the default container runtime is in the midst of refactoring its original monolithic architecture.) The rkt design has from day one tried to apply the principle of modularity to the fullest, including reusing well-tested system components, rather than reimplementing them.

The task of building container images is abstracted away from the container runtime core in rkt, and implemented by an independent utility. The same approach is taken to ongoing container lifecycle management. A single binary, rkt, configures the environment and prepares container images for execution, then sets the container application and its isolation environment running. At this point, the rkt program has done its “one job”, and the container isolator takes over.

The API for querying container engine and pod state, used by Kubernetes to track cluster work on each node, is implemented in a separate service, isolating coordination and orchestration features from the core container runtime. While the API service does not fully implement all the API features of the current default container engine, it already helps isolate containers from failures and upgrades in the core runtime, and provides the read-only parts of the expected API for querying container metadata.

Modular container isolation levels

With rkt managing container execution, Kubernetes can take advantage of the CoreOS container engine’s modular stage1 isolation mechanism. The typical container runs under rkt in a software-isolated environment constructed from Linux kernel namespaces, cgroups, and other facilities. Containers isolated in this common way nevertheless share a single kernel with all the other containers on a system, making for lightweight isolation of running apps.

However, rkt features pluggable isolation environments, referred to as stage1s, to change how containers are executed and isolated. For example, the rkt fly stage1 runs containers in the host namespaces (PID, mount, network, etc), granting containers greater power on the host system. Fly is used for containerizing lower-level system and network software, like the kubelet itself. At the other end of the isolation spectrum, the KVM stage1 runs standard app containers as individual virtual machines, each above its own Linux kernel, managed by the KVM hypervisor. This isolation level can be useful for high security and multi-tenant cluster workloads.

Currently, rktnetes can use the KVM stage1 to execute all containers on a node with VM isolation by setting the kubelet’s --rkt-stage1-image option. Experimental work exists to choose the stage1 isolation regime on a per-pod basis with a Kubernetes annotation declaring the pod’s appropriate stage1. KVM containers and standard Linux containers can be mixed together in the same cluster.

How rkt works with Kubernetes

Kubernetes today talks to the default container engine over an API provided by the Docker daemon. rktnetes communicates with rkt a little bit differently. First, there is a distinction between how Kubernetes changes the state of a node’s containers – how it starts and stops pods, or reschedules them for failover or scaling – and how the orchestrator queries pod metadata for regular, read-only bookkeeping. Two different facilities implement these two different cases.

Managing microservice lifecycles

The kubelet on each cluster node communicates with rkt to prepare containers and their environments into pods, and with systemd, the Linux service management framework, to invoke and manage the pod processes. Pods are then managed as systemd services, and the kubelet sends systemd commands over D-Bus to manipulate them. Lifecycle management, such as restarting failed pods and killing completed processes, is handled by systemd, at the kubelet’s behest.

The API service for reading pod data

A discrete rkt api-service implements the pod introspection mechanisms expected by Kubernetes. While each node’s kubelet uses systemd to start, stop, and restart pods as services, it contacts the API service to read container runtime metadata. This includes basic orchestration information such as the number of pods running on the node, the names and networks of those pods, and the details of pod configuration, resource limits and storage volumes (think of the information shown by the kubectl describe subcommand).

Pod logs, having been written to journal files, are made available for kubectl logs and other forensic subcommands by the API service as well, which reads from log files to provide pod log data to the kubelet for answering control plane requests.

This dual interface to the container environment is an area of very active development, and plans are for the API service to expand to provide methods for the pod manipulation commands. The underlying mechanism will continue to keep separation of concerns in mind, but will hide more of this from the kubelet. The methods the kubelet uses to control the rktnetes container engine will grow less different from the default container runtime interface over time.

Try rktnetes

So what can you do with rktnetes today? Currently, rktnetes passes all of the applicable Kubernetes “end-to-end” (aka “e2e”) tests, provides standard metrics to cAdvisor, manages networks using CNI, handles per-container/pod logs, and automatically garbage collects old containers and images. Kubernetes running on rkt already provides more than the basics of a modular, flexible container runtime for Kubernetes clusters, and it is already a functional part of our development environment at CoreOS.

Developers and early adopters can follow the known issues in the rktnetes notes to get an idea of the wrinkles and bumps test-drivers can expect to encounter. This list groups the high-level pieces required to bring rktnetes to feature parity with the existing container runtime and API. We hope you’ll try out rktnetes in your Kubernetes clusters, too.

Use rkt with Kubernetes Today

The introductory guide Running Kubernetes on rkt walks through the steps to spin up a rktnetes cluster, from kubelet --container-runtime=rkt to networking and starting pods. This intro also sketches the configuration you’ll need to start a cluster on GCE with the Kubernetes kube-up.sh script.

Recent work aims to make rktnetes cluster creation much easier, too. While not yet merged, an in-progress pull request creates a single rktnetes configuration toggle to select rkt as the container engine when deploying a Kubernetes cluster with the coreos-kubernetes configuration tools. You can also check out the rktnetes workshop project, which launches a single-node rktnetes cluster on just about any developer workstation with one vagrant up command.

We’re excited to see the experiments the wider Kubernetes and CoreOS communities devise to put rktnetes to the test, and we welcome your input – and pull requests!

--Yifan Gu and Josh Wood, rktnetes Team, CoreOS. Twitter @CoreOSLinux.

↧

Autoscaling in Kubernetes

July 12, 2016, 9:59 am

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.3

Customers using Kubernetes respond to end user requests quickly and ship software faster than ever before. But what happens when you build a service that is even more popular than you planned for, and run out of compute? In Kubernetes 1.3, we are proud to announce that we have a solution: autoscaling. On Google Compute Engine (GCE) and Google Container Engine (GKE) (and coming soon on AWS), Kubernetes will automatically scale up your cluster as soon as you need it, and scale it back down to save you money when you don’t.

Benefits of Autoscaling

To understand better where autoscaling would provide the most value, let’s start with an example. Imagine you have a 24/7 production service with a load that varies over time: it is very busy during the day in the US, and relatively quiet at night. Ideally, we would want the number of nodes in the cluster and the number of pods in the deployment to dynamically adjust to the load to meet end user demand. The new Cluster Autoscaling feature together with Horizontal Pod Autoscaler can handle this for you automatically.

Setting Up Autoscaling on GCE

Before we begin, we need to have an active GCE project with Google Cloud Monitoring, Google Cloud Logging and Stackdriver enabled. For more information on project creation, please read our Getting Started Guide. We also need to download a recent version of the Kubernetes project (version v1.3.0 or later).

First, we set up a cluster with Cluster Autoscaler turned on. The number of nodes in the cluster will start at 2, and autoscale up to a maximum of 5. To implement this, we’ll export the following environment variables:

export NUM_NODES=2

export KUBE_AUTOSCALER_MIN_NODES=2

export KUBE_AUTOSCALER_MAX_NODES=5

export KUBE_ENABLE_CLUSTER_AUTOSCALER=true

and start the cluster by running:

./cluster/kube-up.sh

The kube-up.sh script creates a cluster together with the Cluster Autoscaler add-on. The autoscaler will try to add new nodes to the cluster if there are pending pods which could be scheduled on a new node.

Let’s look at our cluster; it should have two nodes:

$ kubectl get nodes

NAME STATUS AGE

kubernetes-master Ready,SchedulingDisabled 2m

kubernetes-minion-group-de5q Ready 2m

kubernetes-minion-group-yhdx Ready 1m

Run & expose php-apache server

To demonstrate autoscaling we will use a custom Docker image based on the php-apache server. The image can be found here. It defines an index.php page which performs some CPU-intensive computations.

First, we’ll start a deployment running the image and expose it as a service:

$ kubectl run php-apache \
  --image=gcr.io/google_containers/hpa-example \
  --requests=cpu=500m,memory=500M --expose --port=80

service "php-apache" created
deployment "php-apache" created

Now, we will wait some time and verify that both the deployment and the service were correctly created and are running:

$ kubectl get deployment

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE

php-apache 1 1 1 1 49s

$ kubectl get pods
NAME READY STATUS RESTARTS AGE

php-apache-2046965998-z65jn 1/1 Running 0 30s

We may now check that php-apache server works correctly by calling wget with the service's address:

$ kubectl run -i --tty service-test --image=busybox /bin/sh
Hit enter for command prompt
$ wget -q -O- http://php-apache.default.svc.cluster.local

OK!

Starting Horizontal Pod Autoscaler

Now that the deployment is running, we will create a Horizontal Pod Autoscaler for it. To create it, we will use the kubectl autoscale command, which looks like this:

$ kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

This defines a Horizontal Pod Autoscaler that maintains between 1 and 10 replicas of the Pods controlled by the php-apache deployment we created in the first step of these instructions. Roughly speaking, the horizontal autoscaler will increase and decrease the number of replicas (via the deployment) so as to maintain an average CPU utilization across all Pods of 50% (since each pod requests 500 milli-cores through kubectl run, this means an average CPU usage of 250 milli-cores). See here for more details on the algorithm.
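
To make the "roughly speaking" concrete, here is a minimal sketch of the proportional rule as we understand it: scale the replica count by the ratio of observed to target utilization, round up, and clamp to the configured bounds. The function name `desired_replicas` and the 10% tolerance band are assumptions for illustration, not the controller's actual code; the linked algorithm description is authoritative.

```python
def desired_replicas(
    current_replicas: int,
    current_util_pct: int,
    target_util_pct: int,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,  # ASSUMED dead band around the target
) -> int:
    """Sketch of a proportional autoscaling rule (hypothetical helper)."""
    ratio = current_util_pct / target_util_pct
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        # Integer ceiling of current_replicas * current / target.
        scaled = current_replicas * current_util_pct
        desired = -(-scaled // target_util_pct)
    # Never go outside the --min / --max bounds.
    return max(min_replicas, min(max_replicas, desired))


# Target usage per pod: 50% of the 500 milli-core request.
target_millicores = 500 * 50 // 100
print(target_millicores)

# One replica observed at 310% of its request, target 50%.
print(desired_replicas(1, 310, 50, min_replicas=1, max_replicas=10))

# Seven idle replicas shrink back to the minimum.
print(desired_replicas(7, 0, 50, min_replicas=1, max_replicas=10))
```

With a single replica running at 310% of its CPU request, this rule asks for 7 replicas, and with no load it falls back to the minimum of 1.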

We may check the current status of autoscaler by running:

$ kubectl get hpa

NAME REFERENCE TARGET CURRENT MINPODS MAXPODS AGE

php-apache Deployment/php-apache/scale 50% 0% 1 10 14s

Please note that the current CPU consumption is 0% as we are not sending any requests to the server (the CURRENT column shows the average across all the pods controlled by the corresponding deployment).

Raising the Load

Now, we will see how our autoscalers (Cluster Autoscaler and Horizontal Pod Autoscaler) react to the increased load on the server. We will start two infinite loops of queries to our server (please run them in different terminals):

$ kubectl run -i --tty load-generator --image=busybox /bin/sh
Hit enter for command prompt
$ while true; do wget -q -O- http://php-apache.default.svc.cluster.local; done

We need to wait a moment (about one minute) for stats to propagate. Afterwards, we will examine the status of the Horizontal Pod Autoscaler:

$ kubectl get hpa

NAME REFERENCE TARGET CURRENT MINPODS MAXPODS AGE

php-apache Deployment/php-apache/scale 50% 310% 1 10 2m

$ kubectl get deployment php-apache

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE

php-apache 7 7 7 3 4m

Horizontal Pod Autoscaler has increased the number of pods in our deployment to 7. Let’s now check whether all the pods are running:

$ kubectl get pods

NAME READY STATUS RESTARTS AGE

php-apache-2046965998-3ewo6 0/1 Pending 0 1m

php-apache-2046965998-8m03k 1/1 Running 0 1m

php-apache-2046965998-ddpgp 1/1 Running 0 5m

php-apache-2046965998-lrik6 1/1 Running 0 1m

php-apache-2046965998-nj465 0/1 Pending 0 1m

php-apache-2046965998-tmwg1 1/1 Running 0 1m

php-apache-2046965998-xkbw1 0/1 Pending 0 1m

As we can see, some pods are pending. Let’s describe one of the pending pods to find the reason for the pending state:

$ kubectl describe pod php-apache-2046965998-3ewo6

Name: php-apache-2046965998-3ewo6

Namespace: default

...

Events:

FirstSeen From SubobjectPath Type Reason Message

1m {default-scheduler } Warning FailedScheduling pod (php-apache-2046965998-3ewo6) failed to fit in any node

fit failure on node (kubernetes-minion-group-yhdx): Insufficient CPU

fit failure on node (kubernetes-minion-group-de5q): Insufficient CPU

1m {cluster-autoscaler } Normal TriggeredScaleUp pod triggered scale-up, mig: kubernetes-minion-group, sizes (current/new): 2/3

The pod is pending because there was not enough free CPU in the system for it. We see there’s a TriggeredScaleUp event connected with the pod. It means that the pod triggered a reaction from Cluster Autoscaler and a new node will be added to the cluster. Now we’ll wait for the reaction (about 3 minutes) and list all nodes:

$ kubectl get nodes

NAME STATUS AGE

kubernetes-master Ready,SchedulingDisabled 9m

kubernetes-minion-group-6z5i Ready 43s

kubernetes-minion-group-de5q Ready 9m

kubernetes-minion-group-yhdx Ready 9m

As we can see, a new node, kubernetes-minion-group-6z5i, was added by Cluster Autoscaler. Let’s verify that all pods are now running:

$ kubectl get pods

NAME READY STATUS RESTARTS AGE

php-apache-2046965998-3ewo6 1/1 Running 0 3m

php-apache-2046965998-8m03k 1/1 Running 0 3m

php-apache-2046965998-ddpgp 1/1 Running 0 7m

php-apache-2046965998-lrik6 1/1 Running 0 3m

php-apache-2046965998-nj465 1/1 Running 0 3m

php-apache-2046965998-tmwg1 1/1 Running 0 3m

php-apache-2046965998-xkbw1 1/1 Running 0 3m

After the node addition all php-apache pods are running!
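
The scale-up we just watched can be approximated with a simple first-fit packing of the pending pods' CPU requests onto empty nodes, capped by the configured maximum cluster size. This is a deliberately simplified sketch: the real Cluster Autoscaler reasons about schedulability in more detail, the function name `nodes_to_add` is hypothetical, and the 2000 milli-cores of allocatable CPU per node is an assumed figure, not something stated in this post.

```python
from typing import List


def nodes_to_add(
    pending_requests_m: List[int],
    node_allocatable_m: int,
    current_nodes: int,
    max_nodes: int,
) -> int:
    """First-fit estimate of new nodes needed for pending pods."""
    free: List[int] = []  # CPU left on each hypothetical new node
    for request in sorted(pending_requests_m, reverse=True):
        if request > node_allocatable_m:
            continue  # would not fit even on an empty node
        for i, remaining in enumerate(free):
            if remaining >= request:
                free[i] -= request
                break
        else:
            # No existing new node has room: open another one.
            free.append(node_allocatable_m - request)
    # Respect the upper bound on cluster size.
    return min(len(free), max(0, max_nodes - current_nodes))


# Three pending pods of 500m each, two nodes today, at most five.
extra = nodes_to_add([500, 500, 500], node_allocatable_m=2000,
                     current_nodes=2, max_nodes=5)
print(extra)
```

Under that assumed node size, the three pending 500 milli-core pods fit on a single extra node, which matches the 2/3 size change reported in the TriggeredScaleUp event.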

Stop load

We will finish our example by stopping the user load. We’ll terminate both infinite while loops sending requests to the server and verify the resulting state:

$ kubectl get hpa

NAME REFERENCE TARGET CURRENT MINPODS MAXPODS AGE

php-apache Deployment/php-apache/scale 50% 0% 1 10 16m

$ kubectl get deployment php-apache

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE

php-apache 1 1 1 1 14m

As we see, in the presented case CPU utilization dropped to 0, and the number of replicas dropped to 1.

After the pods are deleted, most of the cluster resources are unused. Scaling the cluster down may take more time than scaling up because Cluster Autoscaler makes sure that the node is really not needed, so that short periods of inactivity (due to pod upgrades, etc.) won’t trigger node deletion (see cluster autoscaler doc). After approximately 10-12 minutes you can verify that the number of nodes in the cluster dropped:

$ kubectl get nodes

NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   37m
kubernetes-minion-group-de5q   Ready                      36m
kubernetes-minion-group-yhdx   Ready                      36m

The number of nodes in our cluster is now two again as node kubernetes-minion-group-6z5i was removed by Cluster Autoscaler.
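The conservative scale-down behavior described above can be sketched as a simple rule: a node becomes a removal candidate only after it has been continuously unneeded for a whole waiting window. The 10-minute window, the function name `nodes_to_remove`, and the data layout are assumptions made for this sketch; they are not taken from the Cluster Autoscaler source.

```python
def nodes_to_remove(unneeded_since, now_minutes, unneeded_window_minutes=10):
    """Return nodes that have been continuously unneeded for the whole window.

    unneeded_since maps node name -> minute at which the node became unneeded,
    or None if the node is currently needed.
    """
    return sorted(
        name for name, since in unneeded_since.items()
        if since is not None and now_minutes - since >= unneeded_window_minutes
    )


# Hypothetical snapshot: one node idle for 11 minutes, one busy, one idle for 4.
state = {
    "kubernetes-minion-group-6z5i": 20,
    "kubernetes-minion-group-de5q": None,
    "kubernetes-minion-group-yhdx": 27,
}
print(nodes_to_remove(state, now_minutes=31))
# prints ['kubernetes-minion-group-6z5i']
```

A node that was idle only briefly stays in the cluster, which is why short gaps such as a pod upgrade do not cause node churn.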

Other use cases

As we have shown, it is very easy to dynamically adjust the number of pods to the load using a combination of Horizontal Pod Autoscaler and Cluster Autoscaler.

However, Cluster Autoscaler alone can also be quite helpful whenever there are irregularities in the cluster load. For example, clusters used for development or continuous integration tests may be needed less on weekends or at night. Batch processing clusters may have periods when all jobs are finished and new ones will only start in a couple of hours. Having machines that do nothing is a waste of money.

In all of these cases Cluster Autoscaler can reduce the number of unused nodes and deliver significant savings, because you only pay for the nodes that you actually need to run your pods. It also makes sure that you always have enough compute power to run your tasks.

-- Jerzy Szczepkowski and Marcin Wielgus, Software Engineers, Google

↧

Kubernetes in Rancher: the further evolution

July 12, 2016, 12:12 pm

Editor’s note: today's guest post is from Alena Prokharchyk, Principal Software Engineer at Rancher Labs, who’ll share how they are incorporating new Kubernetes features into their platform.

Kubernetes was the first external orchestration platform supported by Rancher, and since its release it has become one of the most widely used among our users and continues to grow rapidly in adoption. As Kubernetes has evolved, so has Rancher in terms of adopting new Kubernetes features. We started by supporting Kubernetes version 1.1, then switched to 1.2 as soon as it was released, and now we’re working on supporting the exciting new features in 1.3. I’d like to walk you through the features that we’ve been adding support for during each of these stages.

Rancher and Kubernetes 1.2

Kubernetes 1.2 introduced an enhanced Ingress object to simplify allowing inbound connections to reach the cluster services: here’s an excellent blog post about ingress policies. The Ingress resource allows users to define host name routing rules and TLS config for the Load Balancer in a user-friendly way. It then has to be backed by an Ingress controller that configures a corresponding cloud provider’s Load Balancer with the Ingress rules. Since Rancher already included a software-defined Load Balancer based on HAProxy, we already supported all of the configuration requirements of the Ingress resource, and didn’t have to make any changes on the Rancher side to adopt Ingress. What we had to do was write an Ingress controller that would listen to Kubernetes Ingress-specific events, configure the Rancher Load Balancer accordingly, and propagate the Load Balancer public entry point back to Kubernetes:

[Figure: Rancher Ingress controller workflow]

Now, the Ingress controller gets deployed as a part of our Rancher Kubernetes system stack, and is managed by Rancher. Rancher monitors the Ingress controller’s health, and recreates it in case of any failures. In addition to standard Ingress features, Rancher also lets you horizontally scale the Load Balancer supporting the Ingress service by specifying scale via Ingress annotations. For example:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    scale: "2"
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: nginx-service
          servicePort: 80

As a result of the above, 2 instances of Rancher Load Balancer will get started on separate hosts, and Ingress will get updated with 2 public IP addresses:

kubectl get ingress

NAME      RULE          BACKEND            ADDRESS
scalelb   -                                104.154.107.202, 104.154.107.203 // host IP addresses where Rancher LB instances are deployed
          foo.bar.com
          /foo          nginx-service:80

More details on Rancher Ingress Controller implementation for Kubernetes can be found here:

Rancher and Kubernetes 1.3

We’re very excited about the Kubernetes 1.3 release, and all the new features that are included with it. There are two that we are especially interested in: Stateful Apps and Cluster Federation.

Kubernetes Stateful Apps

Stateful Apps is a new Kubernetes resource that represents a set of pods in a stateful application. It is an alternative to using Replication Controllers, which are best suited to running stateless apps. This feature is especially useful for apps that rely on quorum with leader election (such as MongoDB, Zookeeper, etcd) and decentralized quorum (Cassandra). Stateful Apps creates and maintains a set of pods, each of which has a stable network identity. In order to provide the network identity, it must be possible to have a resolvable DNS name for the pod that is tied to the pod identity, as per the Kubernetes design doc:

# service mongo pointing to pods created by PetSet mdb, with identities mdb-1, mdb-2, mdb-3

dig mongodb.namespace.svc.cluster.local +short A

172.130.16.50

dig mdb-1.mongodb.namespace.svc.cluster.local +short A

# IP of pod created for mdb-1

dig mdb-2.mongodb.namespace.svc.cluster.local +short A

# IP of pod created for mdb-2

dig mdb-3.mongodb.namespace.svc.cluster.local +short A

# IP of pod created for mdb-3

The above is implemented via an annotation on pods, which is surfaced to endpoints, and finally surfaced as DNS on the service that exposes those pods. Currently Rancher simplifies DNS configuration by leveraging Rancher DNS as a drop-in replacement for SkyDNS. Rancher DNS is fast, stable, and scalable: every host in the cluster runs a DNS server. Kubernetes services get programmed into Rancher DNS, and are resolved either to the service’s cluster IP from the 10.43.x.x address space, or to the set of Pod IP addresses for a headless service. To make PetSet work with Kubernetes via Rancher, we’ll have to add support for Pod Identities to the Rancher DNS configuration. We’re working on this now and should have it supported in one of the upcoming Rancher releases.
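The two resolution modes mentioned above (cluster IP for a regular service, Pod IPs for a headless service) can be sketched in a few lines. This is only an illustration of the rule, not Rancher DNS code; the `resolve` function, the dictionary layout, and every address are hypothetical.

```python
def resolve(service):
    """Return the A-record answers for a service.

    A regular service resolves to its single cluster IP; a headless service
    (no cluster IP) resolves to the IPs of the pods behind it.
    """
    if service.get("cluster_ip"):
        return [service["cluster_ip"]]
    return list(service["pod_ips"])


# Hypothetical services; the addresses are made up for the example.
regular = {"cluster_ip": "10.43.0.12", "pod_ips": ["10.42.1.5", "10.42.2.7"]}
headless = {"cluster_ip": None, "pod_ips": ["10.42.1.5", "10.42.2.7"]}

print(resolve(regular))   # prints ['10.43.0.12']
print(resolve(headless))  # prints ['10.42.1.5', '10.42.2.7']
```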

Cluster Federation

Cluster Federation is the control plane for federating multiple clusters in Kubernetes. It offers improved application availability by spreading applications across multiple clusters (image courtesy of Kubernetes):

[Figure: Cluster Federation control plane spanning three Kubernetes clusters]

Each Kubernetes cluster exposes an API endpoint and gets registered to Cluster Federation as a part of a Federation object. Then, using the Cluster Federation API, you can create federated services. Those objects are composed of multiple equivalent underlying Kubernetes resources. Assuming that the 3 clusters in the picture above belong to the same Federation object, each Service created via Cluster Federation will get an equivalent service created in each of the clusters. Besides that, a Cluster Federation service will get a publicly resolvable DNS name that resolves to the Kubernetes services’ public IP addresses (the DNS record gets programmed to one of the public DNS providers below):

[Figure: supported public DNS providers]

To support Cluster Federation via Kubernetes in Rancher, certain changes need to be made. Today each Kubernetes cluster is represented as a Rancher environment. In each Kubernetes environment, we create a full Kubernetes system stack composed of several services: Kubernetes API server, Scheduler, Ingress controller, persistent etcd, Controller manager, Kubelet and Proxy (the last two run on every host). To set up Cluster Federation, we will create one extra environment where the Cluster Federation stack is going to run:

[Figure: Rancher environment running the Cluster Federation stack]

Then every underlying Kubernetes cluster, each represented by a Rancher environment, should be registered to a specific Cluster Federation. Potentially, each cluster can be auto-discovered by the Rancher Cluster Federation environment via a label on the Kubernetes cluster that carries the federation name. We’re still finalizing our design, but we’re very excited by this feature, and see a lot of use cases it can solve. Cluster Federation doc references:

Kubernetes cluster federation design doc
Kubernetes blog post on multi zone clusters
Kubernetes federated services design doc

Plans for Kubernetes 1.4

When we launched Kubernetes support in Rancher we decided to maintain our own distribution of Kubernetes in order to support Rancher’s native networking. We were aware that by having our own distribution, we’d need to update it every time there were changes made to Kubernetes, but we felt it was necessary to support the use cases we were working on for users. As part of our work for 1.4 we looked at our networking approach again, and re-analyzed the initial need for our own fork of Kubernetes. Other than the networking integration, all of the work we’ve done with Kubernetes has been developed as a Kubernetes plugin:

Rancher as a CloudProvider (to support Load Balancers).
Rancher as a CredentialProvider (to support Rancher private registries).
Rancher Ingress controller to back up Kubernetes ingress resource.

So we’ve decided to eliminate the need for a Rancher Kubernetes distribution, and try to upstream all our changes to the Kubernetes repo. To do that, we will rework our networking integration and support Rancher networking as a CNI plugin for Kubernetes. More details will be shared as soon as the feature design is finalized, but expect it to come in the next 2-3 months. We will also continue investing in Rancher’s core capabilities integrated with Kubernetes, including, but not limited to:

Access rights management via Rancher environment that represents Kubernetes cluster
Credential management and easy web-based access to standard kubectl cli
Load Balancing support
Rancher internal DNS support
Catalog support for Kubernetes templates
Enhanced UI to represent even more Kubernetes objects like: Deployment, Ingress, Daemonset.

All of that is to make the Kubernetes experience even more powerful and intuitive. We’re so excited by all of the progress in the Kubernetes community, and thrilled to be participating. Kubernetes 1.3 is an incredibly significant release, and you’ll be able to upgrade to it very soon within Rancher.

-- Alena Prokharchyk, Principal Software Engineer, Rancher Labs. Twitter @lemonjet & GitHub alena1108

↧

Thousand Instances of Cassandra using Kubernetes Pet Set

July 13, 2016, 9:41 am

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.3

Running The Greek Pet Monster Races

For the Kubernetes 1.3 launch, we wanted to put the new Pet Set through its paces. By testing a thousand instances of Cassandra, we could make sure that Kubernetes 1.3 was production ready. Read on for how we adapted Cassandra to Kubernetes, and had our largest deployment ever.

It’s fairly straightforward to use containers with basic stateful applications today. Using a persistent volume, you can mount a disk in a pod, and ensure that your data lasts beyond the life of your pod. However, with deployments of distributed stateful applications, things can become more tricky. With Kubernetes 1.3, the new Pet Set component makes everything much easier. To test this new feature out at scale, we decided to host the Greek Pet Monster Races! We raced Centaurs and other Ancient Greek Monsters over hundreds of thousands of races across multiple availability zones.

As many of you know, Kubernetes is from the Ancient Greek: κυβερνήτης. This means helmsman, pilot, steersman, or ship master. In order to keep track of race results, we needed a data store, and we chose Cassandra. Κασσάνδρα, Cassandra, was the daughter of King Priam and Queen Hecuba of Troy. With multiple references to the ancient Greek language, we thought it would be appropriate to race ancient Greek monsters.

From there the story kinda goes sideways because Cassandra was actually the Pets as well. Read on and we will explain.

One of the exciting new features in Kubernetes 1.3 is Pet Set. Kubernetes offers several mechanisms for organizing the deployment of containers, including Replication Controllers and Daemon Sets. Pet Set is a new feature that delivers the capability to deploy containers, as Pets, inside of Kubernetes. Pet Sets provide a guarantee of identity for various aspects of the pet / pod deployment: DNS name, consistent storage, and ordered pod indexing. Previously, components like Deployments and Replication Controllers could only deploy an application with a weak, uncoupled identity. A weak identity is great for managing applications such as microservices, where service discovery is important, the application is stateless, and the naming of individual pods does not matter. Many software applications do require strong identity, including many different types of distributed stateful systems. Cassandra is a great example of a distributed application that requires consistent network identity and stable storage.

Pet Sets provide the following capabilities:

A stable hostname, available to others in DNS. The hostname is the Pet Set name followed by a number that starts at zero, for example cassandra-0.

An ordinal index of Pets. 0, 1, 2, 3, etc.

Stable storage linked to the ordinal and hostname of the Pet.

Peer discovery is available via DNS. With Cassandra the names of the peers are known before the Pets are created.

Startup and Teardown ordering. Which numbered Pet is going to be created next is known, and which Pet will be destroyed upon reducing the Pet Set size. This feature is useful for such admin tasks as draining data from a Pet, when reducing the size of a cluster.

If your application has one or more of these requirements, then it may be a candidate for Pet Set.

A relevant analogy is that a Pet Set is composed of pet dogs. If you have a white, a brown, and a black dog and the brown dog runs away, you can replace it with another brown dog and no one will notice. If over time you keep replacing your dogs with only white dogs, then someone will notice. Pet Set allows your application to maintain the unique identity, or hair color, of your Pets.

Example workloads for Pet Set:

Clustered software like Cassandra, Zookeeper, etcd, or Elastic that requires stable membership.
Databases like MySQL or PostgreSQL that require a single instance attached to a persistent volume at any time.

Only use Pet Set if your application requires some or all of these properties. Managing pods as stateless replicas is vastly easier.

So back to our races!

As we have mentioned, Cassandra was a perfect candidate to deploy via a Pet Set. A Pet Set is much like a Replication Controller with a few new bells and whistles. Here's an example YAML manifest:

# Headless service to provide DNS lookup
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
spec:
  clusterIP: None
  ports:
  - port: 9042
  selector:
    app: cassandra-data
---
# new API name
apiVersion: "apps/v1alpha1"
kind: PetSet
metadata:
spec:
  serviceName: cassandra
  # replicas are the same as used by Replication Controllers
  # except pets are deployed in order 0, 1, 2, 3, etc
  replicas: 5
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
      labels:
        app: cassandra-data
    spec:
      # just as other component in Kubernetes one
      # or more containers are deployed
      containers:
      - name: cassandra
        image: "cassandra-debian:v1.1"
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
        - containerPort: 7199
        - containerPort: 9042
        resources:
          limits:
            cpu: "4"
            memory: 11Gi
          requests:
            cpu: "4"
            memory: 11Gi
        securityContext:
          privileged: true
        env:
        - name: MAX_HEAP_SIZE
          value: 8192M
        - name: HEAP_NEWSIZE
          value: 2048M
        # this is relying on guaranteed network identity of Pet Sets, we
        # will know the name of the Pets / Pod before they are created
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local,cassandra-1.cassandra.default.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "OneKDemo"
        - name: CASSANDRA_DC
          value: "DC1-Data"
        - name: CASSANDRA_RACK
          value: "OneKDemo-Rack1-Data"
        - name: CASSANDRA_AUTO_BOOTSTRAP
          value: "false"
        # this variable is used by the read-probe looking
        # for the IP Address in a `nodetool status` command
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the pet volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above. Storage can be automatically
  # created for the Pets depending on the cloud environment.
  volumeClaimTemplates:
  - metadata:
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 380Gi

You may notice that these containers are on the rather large side; it is not unusual to run Cassandra in production with 8 CPUs and 16GB of RAM. There are two key new features that you will notice above: dynamic volume provisioning, and of course Pet Set. The above manifest will create 5 Cassandra Pets / Pods starting with the number 0: cassandra-data-0, cassandra-data-1, etc.
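Because Pet names are predictable, a value such as CASSANDRA_SEEDS can be computed before any Pet exists. The following is a minimal sketch of that naming scheme; the helper names `pet_names` and `pet_fqdn` are hypothetical, and the Pet Set name "cassandra" is an assumption chosen so that the result matches the CASSANDRA_SEEDS value in the manifest.

```python
def pet_names(pet_set_name, replicas):
    """Pets are named <pet set name>-<ordinal>, with ordinals starting at 0."""
    return [f"{pet_set_name}-{i}" for i in range(replicas)]


def pet_fqdn(pet_name, service, namespace="default", domain="cluster.local"):
    """DNS name of one Pet behind its governing headless service."""
    return f"{pet_name}.{service}.{namespace}.svc.{domain}"


# Assumed Pet Set name "cassandra", governed by the headless service "cassandra".
names = pet_names("cassandra", 5)
seeds = ",".join(pet_fqdn(name, "cassandra") for name in names[:2])
print(names)
print(seeds)
```

Using the first two Pets as seeds mirrors the manifest, where the seed list is written out by hand.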

In order to generate data for the races, we used another Kubernetes feature called Jobs. Simple Python code was written to generate the random speed of the monster for every second of the race. Then that data, position information, winners, other data points, and metrics were stored in Cassandra. To visualize the data, we used JHipster to generate an AngularJS UI with Java services, and then used D3 for graphing.

An example of one of the Jobs:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
spec:
  parallelism: 2
  completions: 4
  template:
    metadata:
      labels:
    spec:
      containers:
      - name: pet-race-giants
        image: py3numpy-job:v1.0
        command: ["pet-race-job", "--length=100", "--pet=Giants", "--scale=3"]
        resources:
          limits:
            cpu: "2"
          requests:
            cpu: "2"
      restartPolicy: Never
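The interplay of parallelism and completions in the Job above can be pictured with a small simulation. It is a simplification that assumes every pod succeeds and that all pods started together also finish together; the function name `run_job` is hypothetical.

```python
def run_job(completions, parallelism):
    """Return how many pods are started in each wave until enough succeed.

    Simplifying assumptions: no pod fails, and each wave finishes as a unit.
    """
    waves = []
    remaining = completions
    while remaining > 0:
        started = min(parallelism, remaining)
        waves.append(started)
        remaining -= started
    return waves


# Values from the Job manifest: 4 completions, at most 2 pods at a time.
print(run_job(completions=4, parallelism=2))  # prints [2, 2]
```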

Since we are talking about Monsters, we had to go big. We deployed 1,009 minion nodes to Google Compute Engine (GCE), spread across 4 zones, running a custom version of the Kubernetes 1.3 beta. We ran this demo on beta code since the demo was being set up before the 1.3 release date. For the minion nodes we chose the GCE n1-standard-8 machine type, a VM with 8 virtual CPUs and 30GB of memory. This allows a single instance of Cassandra to run on each node, which is recommended for disk I/O.

Then the pets were deployed! One thousand of them, in two different Cassandra Data Centers. Cassandra’s distributed architecture is specifically tailored for multiple-data center deployment. Often multiple Cassandra data centers are deployed inside the same physical or virtual data center, in order to separate workloads. Data is replicated across all data centers, but workloads can be different between data centers and thus application tuning can be different. Data centers named ‘DC1-Analytics’ and ‘DC1-Data’ were deployed with 500 pets each. The race data was created by the Python Batch Jobs connected to DC1-Data, and the JHipster UI was connected to DC1-Analytics.

Here are the final numbers:

8,072 Cores. The master used 24, minion nodes used the rest
1,009 IP addresses
1,009 routes setup by Kubernetes on Google Cloud Platform
100,510 GB persistent disk used by the Minions and the Master
380,020 GB SSD persistent disk. 20 GB for the master and 380 GB per Cassandra Pet.
1,000 deployed instances of Cassandra

Yes we deployed 1,000 pets, but one really did not want to join the party! Technically with the Cassandra setup, we could have lost 333 nodes without service or data loss.
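Two of the totals above can be reproduced with simple arithmetic. The sketch below uses only figures quoted in this post (1,009 n1-standard-8 machines, 1,000 Pets, the 380Gi storage request from the manifest, and 20 GB for the master) and, as a simplifying assumption, treats Gi and GB as the same unit.

```python
NODES = 1009
CORES_PER_NODE = 8      # n1-standard-8 has 8 virtual CPUs
PETS = 1000
SSD_PER_PET_GB = 380    # storage request in the manifest (380Gi)
SSD_MASTER_GB = 20

total_cores = NODES * CORES_PER_NODE
total_ssd_gb = PETS * SSD_PER_PET_GB + SSD_MASTER_GB

print(total_cores)   # prints 8072
print(total_ssd_gb)  # prints 380020
```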

Limitations with Pet Sets in 1.3 Release

Pet Set is an alpha resource not available in any Kubernetes release prior to 1.3.
The storage for a given pet must either be provisioned by a dynamic storage provisioner based on the requested storage class, or pre-provisioned by an admin.
Deleting the Pet Set will not delete any Pets or Pet storage. You will need to delete your Pets, and possibly their storage, by hand.
All Pet Sets currently require a "governing service", or a Service responsible for the network identity of the pets. The user is responsible for this Service.
Updating an existing Pet Set is currently a manual process. You either need to deploy a new Pet Set with the new image version or orphan Pets one by one and update their image, which will join them back to the cluster.

Resources and References

The source code for the demo is available on GitHub: (Pet Set examples will be merged into the Kubernetes Cassandra Examples).
More information about Jobs
Documentation for Pet Set
Image credits: Cassandra image and Cyclops image

-- Chris Love, Senior DevOps Open Source Consultant for Datapipe. Twitter @chrislovecnm

↧

Stateful Applications in Containers!? Kubernetes 1.3 Says “Yes!”

July 13, 2016, 12:59 pm

Editor's note: today’s guest post is from Mark Balch, VP of Products at Diamanti, who’ll share more about the contributions they’ve made to Kubernetes.

Congratulations to the Kubernetes community on another value-packed release. A focus on stateful applications and federated clusters are two reasons why I’m so excited about 1.3. Kubernetes support for stateful apps such as Cassandra, Kafka, and MongoDB is critical. Important services rely on databases, key value stores, message queues, and more. Additionally, relying on one data center or container cluster simply won’t work as apps grow to serve millions of users around the world. Cluster federation allows users to deploy apps across multiple clusters and data centers for scale and resiliency.

You may have heard me say before that containers are the next great application platform. Diamanti is accelerating container adoption for stateful apps in production - where performance and ease of deployment really matter.

Apps Need More Than Cattle

Beyond stateless containers like web servers (so-called “cattle” because they are interchangeable), users are increasingly deploying stateful workloads with containers to benefit from “build once, run anywhere” and to improve bare metal efficiency/utilization. These “pets” (so-called because each requires special handling) bring new requirements including longer life cycle, configuration dependencies, stateful failover, and performance sensitivity. Container orchestration must address these needs to successfully deploy and scale apps.

Enter Pet Set, a new object in Kubernetes 1.3 for improved stateful application support. Pet Set sequences through the startup phase of each database replica (for example), ensuring orderly master/slave configuration. Pet Set also simplifies service discovery by leveraging ubiquitous DNS SRV records, a well-recognized and long-understood mechanism.

Diamanti’s FlexVolume contribution to Kubernetes enables stateful workloads by providing persistent volumes with low-latency storage and guaranteed performance, including enforced quality-of-service from container to media.

A Federalist

Users who are planning for application availability must contend with issues of failover and scale across geography. Cross-cluster federated services allow containerized apps to easily deploy across multiple clusters. Federated services tackle challenges such as managing multiple container clusters and coordinating service deployment and discovery across federated clusters.

Like a strictly centralized model, federation provides a common app deployment interface. With each cluster retaining autonomy, however, federation adds flexibility to manage clusters locally during network outages and other events. Cross-cluster federated services also apply consistent service naming and adoption across container clusters, simplifying DNS resolution.

It’s easy to imagine powerful multi-cluster use cases with cross-cluster federated services in future releases. An example is scheduling containers based on governance, security, and performance requirements. Diamanti’s scheduler extension was developed with this concept in mind. Our first implementation makes the Kubernetes scheduler aware of network and storage resources local to each cluster node. Similar concepts can be applied in the future to broader placement controls with cross-cluster federated services.

Get Involved

With interest growing in stateful apps, work has already started to further enhance Kubernetes storage. The Storage Special Interest Group is discussing proposals to support local storage resources. Diamanti is looking forward to extending FlexVolume to include richer APIs that enable local storage and storage services including data protection, replication, and reduction. We’re also working on proposals for improved app placement, migration, and failover across container clusters through Kubernetes cross-cluster federated services.

Join the conversation and contribute! Here are some places to get started:

Product Management group
Kubernetes Storage SIG
Kubernetes Cluster Federation SIG

-- Mark Balch, VP Products, Diamanti. Twitter @markbalch

↧

Cross Cluster Services - Achieving Higher Availability for your Kubernetes Applications

July 14, 2016, 10:13 am

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.3

As Kubernetes users scale their production deployments we’ve heard a clear desire to deploy services across zone, region, cluster and cloud boundaries. Services that span clusters provide geographic distribution, enable hybrid and multi-cloud scenarios and improve the level of high availability beyond single cluster multi-zone deployments. Customers who want their services to span one or more (possibly remote) clusters, need them to be reachable in a consistent manner from both within and outside their clusters.

In Kubernetes 1.3, our goal was to minimize the friction points and reduce the management/operational overhead associated with deploying a service with geographic distribution to multiple clusters. This post explains how to do this.

Note: Though the examples used here leverage Google Container Engine (GKE) to provision Kubernetes clusters, they work anywhere you want to deploy Kubernetes.

Let’s get started. The first step is to create Kubernetes clusters in 4 Google Cloud Platform (GCP) regions using GKE, one in each of the following zones:

asia-east1-b
europe-west1-b
us-east1-b
us-central1-b

Let’s run the following commands to build the clusters:

gcloud container clusters create gce-asia-east1 \
  --scopes cloud-platform \
  --zone asia-east1-b

gcloud container clusters create gce-europe-west1 \
  --scopes cloud-platform \
  --zone=europe-west1-b

gcloud container clusters create gce-us-east1 \
  --scopes cloud-platform \
  --zone=us-east1-b

gcloud container clusters create gce-us-central1 \
  --scopes cloud-platform \
  --zone=us-central1-b

Let’s verify the clusters are created:

gcloud container clusters list

NAME              ZONE            MASTER_VERSION  MASTER_IP        NUM_NODES  STATUS
gce-asia-east1    asia-east1-b    1.2.4           104.XXX.XXX.XXX  3          RUNNING
gce-europe-west1  europe-west1-b  1.2.4           130.XXX.XX.XX    3          RUNNING
gce-us-central1   us-central1-b   1.2.4           104.XXX.XXX.XX   3          RUNNING
gce-us-east1      us-east1-b      1.2.4           104.XXX.XX.XXX   3          RUNNING

The next step is to bootstrap the clusters and deploy the federation control plane on one of the clusters that has been provisioned. If you’d like to follow along, refer to Kelsey Hightower’s tutorial which walks through the steps involved.

Federated Services

Federated Services are directed to the Federation API endpoint and specify the desired properties of your service.

Once created, the Federated Service automatically:

creates matching Kubernetes Services in every cluster underlying your cluster federation,

monitors the health of those service "shards" (and the clusters in which they reside), and

manages a set of DNS records in a public DNS provider (like Google Cloud DNS, or AWS Route 53), thus ensuring that clients of your federated service can seamlessly locate an appropriate healthy service endpoint at all times, even in the event of cluster, availability zone or regional outages.

Clients inside your federated Kubernetes clusters (i.e. Pods) will automatically find the local shard of the federated service in their cluster if it exists and is healthy, or the closest healthy shard in a different cluster if it does not.
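The lookup rule for in-cluster clients (prefer the local shard, otherwise fall back to the closest healthy one) can be sketched as follows. This is an illustration of the rule as described, not the federation's actual code; `pick_shard`, the `hops` helper, and the distance table are hypothetical.

```python
def pick_shard(client_cluster, shards, distance):
    """Choose which cluster's shard a client Pod should use.

    shards maps cluster name -> True if that cluster's shard is healthy.
    distance(a, b) is a hypothetical closeness score (lower is closer).
    """
    if shards.get(client_cluster):
        return client_cluster
    healthy = [name for name, ok in shards.items() if ok]
    if not healthy:
        return None
    return min(healthy, key=lambda name: distance(client_cluster, name))


# Made-up distances between the example clusters.
HOPS = {
    ("gce-us-east1", "gce-us-central1"): 1,
    ("gce-us-east1", "gce-europe-west1"): 2,
    ("gce-us-east1", "gce-asia-east1"): 3,
}


def hops(a, b):
    return HOPS.get((a, b), HOPS.get((b, a), 99))


shards = {
    "gce-us-east1": False,  # local shard is unhealthy
    "gce-us-central1": True,
    "gce-europe-west1": True,
    "gce-asia-east1": True,
}
print(pick_shard("gce-us-east1", shards, hops))  # prints gce-us-central1
```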

Federations of Kubernetes Clusters can include clusters running in different cloud providers (e.g. GCP, AWS), and on-premise (e.g. on OpenStack). All you need to do is create your clusters in the appropriate cloud providers and/or locations, and register each cluster's API endpoint and credentials with your Federation API Server.

In our example, we have clusters created in 4 regions along with a federated control plane API deployed in one of our clusters, that we’ll be using to provision our service. See diagram below for visual representation.

Creating a Federated Service

Let’s list out all the clusters in our federation:

kubectl --context=federation-cluster get clusters

NAME               STATUS    VERSION   AGE
gce-asia-east1     Ready               1m
gce-europe-west1   Ready               57s
gce-us-central1    Ready               47s
gce-us-east1       Ready               34s

Let’s create a federated service object:

kubectl --context=federation-cluster create -f services/nginx.yaml

The '--context=federation-cluster' flag tells kubectl to submit the request to the Federation API endpoint, with the appropriate credentials. The federated service will automatically create and maintain matching Kubernetes services in all of the clusters underlying your federation.

You can verify this by checking in each of the underlying clusters, for example:

kubectl --context=gce-asia-east1a get svc nginx
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx 10.63.250.98 104.199.136.89 80/TCP 9m

The above assumes that you have a context named 'gce-asia-east1a' configured in your client for your cluster in that zone. The name and namespace of the underlying services will automatically match those of the federated service that you created above.

The status of your Federated Service will automatically reflect the real-time status of the underlying Kubernetes services, for example:

kubectl --context=federation-cluster describe services nginx

Name:                   nginx
Namespace:              default
Labels:                 run=nginx
Selector:               run=nginx
Type:                   LoadBalancer
IP:
LoadBalancer Ingress:   104.XXX.XX.XXX, 104.XXX.XX.XXX, 104.XXX.XX.XXX, 104.XXX.XXX.XX
Port:                   http    80/TCP
Endpoints:              <none>
Session Affinity:       None
No events.

The 'LoadBalancer Ingress' addresses of your federated service corresponds with the 'LoadBalancer Ingress' addresses of all of the underlying Kubernetes services. For inter-cluster and inter-cloud-provider networking between service shards to work correctly, your services need to have an externally visible IP address. Service Type: Loadbalancer is typically used here.

Note also that we have not yet provisioned any backend Pods to receive the network traffic directed to these addresses (i.e. 'Service Endpoints'), so the federated service does not yet consider these to be healthy service shards, and has accordingly not yet added their addresses to the DNS records for this federated service.

Adding Backend Pods

To render the underlying service shards healthy, we need to add backend Pods behind them. This is currently done directly against the API endpoints of the underlying clusters (although in future the Federation server will be able to do all this for you with a single command, to save you the trouble). For example, to create backend Pods in our underlying clusters:

for CLUSTER in asia-east1-a europe-west1-a us-east1-a us-central1-a
do
  kubectl --context=$CLUSTER run nginx --image=nginx:1.11.1-alpine --port=80
done

Verifying Public DNS Records

Once the Pods have successfully started and begun listening for connections, Kubernetes in each cluster (via automatic health checks) will report them as healthy endpoints of the service in that cluster. The cluster federation will in turn consider each of these service 'shards' to be healthy, and place them in serving by automatically configuring corresponding public DNS records. You can use your preferred interface to your configured DNS provider to verify this. For example, if your Federation is configured to use Google Cloud DNS, and a managed DNS domain 'example.com':

$ gcloud dns managed-zones describe example-dot-com

creationTime: '2016-06-26T18:18:39.229Z'
description: Example domain for Kubernetes Cluster Federation
dnsName: example.com.
id: '3229332181334243121'
kind: dns#managedZone
name: example-dot-com
nameServers:
- ns-cloud-a1.googledomains.com.
- ns-cloud-a2.googledomains.com.
- ns-cloud-a3.googledomains.com.
- ns-cloud-a4.googledomains.com.

$ gcloud dns record-sets list --zone example-dot-com

NAME                                                           TYPE   TTL    DATA
example.com.                                                   NS     21600  ns-cloud-e1.googledomains.com., ns-cloud-e2.googledomains.com.
example.com.                                                   SOA    21600  ns-cloud-e1.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 1209600 300
nginx.mynamespace.myfederation.svc.example.com.                A      180    104.XXX.XXX.XXX, 130.XXX.XX.XXX, 104.XXX.XX.XXX, 104.XXX.XXX.XX
nginx.mynamespace.myfederation.svc.us-central1-a.example.com.  A      180    104.XXX.XXX.XXX
nginx.mynamespace.myfederation.svc.us-central1.example.com.
nginx.mynamespace.myfederation.svc.us-central1.example.com.    A      180    104.XXX.XXX.XXX, 104.XXX.XXX.XXX, 104.XXX.XXX.XXX
nginx.mynamespace.myfederation.svc.asia-east1-a.example.com.   A      180    130.XXX.XX.XXX
nginx.mynamespace.myfederation.svc.asia-east1.example.com.
nginx.mynamespace.myfederation.svc.asia-east1.example.com.     A      180    130.XXX.XX.XXX, 130.XXX.XX.XXX
nginx.mynamespace.myfederation.svc.europe-west1.example.com.   CNAME  180    nginx.mynamespace.myfederation.svc.example.com.
... etc.

Note: If your Federation is configured to use AWS Route53, you can use one of the equivalent AWS tools, for example:

$ aws route53 list-hosted-zones

and

$ aws route53 list-resource-record-sets --hosted-zone-id Z3ECL0L9QLOVBX

Whatever DNS provider you use, any DNS query tool (for example 'dig' or 'nslookup') will of course also allow you to see the records created by the Federation for you.

Discovering a Federated Service from Pods Inside your Federated Clusters

By default, Kubernetes clusters come preconfigured with a cluster-local DNS server ('KubeDNS'), as well as an intelligently constructed DNS search path which together ensure that DNS queries like "myservice", "myservice.mynamespace", "bobsservice.othernamespace" etc issued by your software running inside Pods are automatically expanded and resolved correctly to the appropriate service IP of services running in the local cluster.
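
As a rough illustration of this expansion, the Python sketch below lists the candidate names a resolver would try for a short name. It assumes the conventional Pod search path (`<namespace>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`) and the default cluster domain `cluster.local`; the function name is hypothetical, and a real resolver also applies the `ndots` option, which this sketch ignores.

```python
def candidate_names(query, pod_namespace, cluster_domain="cluster.local"):
    """Return the fully qualified names tried for `query`, in order."""
    # Assumed search path of a Pod running in `pod_namespace`.
    search_path = [
        f"{pod_namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    # Each search domain is appended in turn; the first name that exists wins.
    return [f"{query}.{suffix}" for suffix in search_path]


if __name__ == "__main__":
    for name in candidate_names("myservice", "mynamespace"):
        print(name)
```

Under these assumptions, "myservice" matches on the first candidate, while "bobsservice.othernamespace" only matches on the second, once the `svc.cluster.local` suffix is appended.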

With the introduction of Federated Services and Cross-Cluster Service Discovery, this concept is extended to cover Kubernetes services running in any other cluster across your Cluster Federation, globally. To take advantage of this extended range, you use a slightly different DNS name (e.g. myservice.mynamespace.myfederation) to resolve federated services. Using a different DNS name also prevents your existing applications from accidentally traversing cross-zone or cross-region networks, and incurring perhaps unwanted network charges or latency, without you explicitly opting in to this behavior.

So, using our NGINX example service above, and the federated service DNS name form just described, let's consider an example: A Pod in a cluster in the us-central1-a availability zone needs to contact our NGINX service. Rather than use the service's traditional cluster-local DNS name ("nginx.mynamespace", which is automatically expanded to "nginx.mynamespace.svc.cluster.local") it can now use the service's Federated DNS name, which is "nginx.mynamespace.myfederation". This will be automatically expanded and resolved to the closest healthy shard of our NGINX service, wherever in the world that may be. If a healthy shard exists in the local cluster, that service's cluster-local (typically 10.x.y.z) IP address will be returned (by the cluster-local KubeDNS). This is exactly equivalent to non-federated service resolution.

If the service does not exist in the local cluster (or it exists but has no healthy backend pods), the DNS query is automatically expanded to "nginx.mynamespace.myfederation.svc.us-central1-a.example.com". Behind the scenes, this finds the external IP of one of the shards closest to the Pod's availability zone. This expansion is performed automatically by KubeDNS, which returns the associated CNAME record. This results in a traversal of the hierarchy of DNS records in the above example, and ends up at one of the external IPs of the Federated Service in the local us-central1 region.
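
To make the traversal concrete, here is a minimal Python sketch that follows CNAME records until it reaches a set of A records. The record table is hypothetical (documentation-range IP addresses, and an assumed state in which the us-central1-a zone has no healthy shard), and the `resolve` function is an illustration of the lookup chain, not the KubeDNS implementation.

```python
def resolve(name, records, max_hops=8):
    """Follow CNAME records from `name` until an A record set is reached."""
    for _ in range(max_hops):
        record_type, data = records[name]
        if record_type == "A":
            return data
        name = data  # CNAME: continue the lookup with the target name
    raise RuntimeError("CNAME chain too long")


# Hypothetical records: the zonal name has no healthy shard of its own, so it
# points at the regional name, which holds the healthy shards' external IPs.
RECORDS = {
    "nginx.mynamespace.myfederation.svc.us-central1-a.example.com.":
        ("CNAME", "nginx.mynamespace.myfederation.svc.us-central1.example.com."),
    "nginx.mynamespace.myfederation.svc.us-central1.example.com.":
        ("A", ["203.0.113.10", "203.0.113.11"]),
    "nginx.mynamespace.myfederation.svc.example.com.":
        ("A", ["203.0.113.10", "203.0.113.11", "198.51.100.20"]),
}

if __name__ == "__main__":
    zonal = "nginx.mynamespace.myfederation.svc.us-central1-a.example.com."
    print(resolve(zonal, RECORDS))
```

With this table, a lookup of the zonal name ends at the two regional addresses. If the region had no healthy shards either, its record would in turn point at the global name.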

It is also possible to target service shards in availability zones and regions other than the ones local to a Pod by specifying the appropriate DNS names explicitly, and not relying on automatic DNS expansion. For example, "nginx.mynamespace.myfederation.svc.europe-west1.example.com" will resolve to all of the currently healthy service shards in Europe, even if the Pod issuing the lookup is located in the U.S., and irrespective of whether or not there are healthy shards of the service in the U.S. This is useful for remote monitoring and other similar applications.

Discovering a Federated Service from Other Clients Outside your Federated Clusters

For external clients, the automatic DNS expansion described above is not possible. External clients need to specify one of the fully qualified DNS names of the federated service, be that a zonal, regional or global name. For convenience, it is often a good idea to manually configure additional static CNAME records for your service, for example:

eu.nginx.acme.com   CNAME   nginx.mynamespace.myfederation.svc.europe-west1.example.com.
us.nginx.acme.com   CNAME   nginx.mynamespace.myfederation.svc.us-central1.example.com.
nginx.acme.com      CNAME   nginx.mynamespace.myfederation.svc.example.com.

That way your clients can always use the short form on the left, and always be automatically routed to the closest healthy shard on their home continent. All of the required failover is handled for you automatically by Kubernetes Cluster Federation.

Handling Failures of Backend Pods and Whole Clusters

Standard Kubernetes service cluster IPs already ensure that non-responsive individual Pod endpoints are automatically taken out of service with low latency. The Kubernetes cluster federation system automatically monitors the health of clusters and the endpoints behind all of the shards of your federated service, taking shards in and out of service as required. Due to the latency inherent in DNS caching (the cache timeout, or TTL, for federated service DNS records is set to 3 minutes by default, but can be adjusted), it may take up to that long for all clients to completely fail over to an alternative cluster in the case of catastrophic failure. However, given the number of discrete IP addresses which can be returned for each regional service endpoint (see e.g. us-central1 above, which has three alternatives), many clients will fail over automatically to one of the alternative IPs in less time than that, given appropriate configuration.
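
One way a client might take advantage of the multiple addresses is to try each in turn with a short timeout. The following Python sketch shows the idea; the function name, the `connector` parameter (added only so the logic can be exercised without a network), and the 2-second timeout are assumptions for illustration, not part of Kubernetes.

```python
import socket


def connect_first_healthy(addresses, port, timeout=2.0,
                          connector=socket.create_connection):
    """Try each address in order and return the first successful connection."""
    last_error = None
    for address in addresses:
        try:
            return connector((address, port), timeout)
        except OSError as error:
            last_error = error  # remember the failure and try the next address
    raise ConnectionError(f"no address reachable: {last_error}")
```

A client written this way would move on to the next address after one timeout, rather than waiting for the cached DNS record to expire.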

Community

We'd love to hear feedback on Kubernetes Cross Cluster Services. To join the community:

Post issues or feature requests on GitHub

Join us in the #federation channel on Slack

Participate in the Cluster Federation SIG

Please give Cross Cluster Services a try, and let us know how it goes!

-- Quinton Hoole, Engineering Lead, Google and Allan Naim, Product Manager, Google

↧

Citrix + Kubernetes = A Home Run

July 14, 2016, 12:32 pm

Editor’s note: today’s guest post is by Mikko Disini, a Director of Product Management at Citrix Systems, sharing their collaboration experience on a Kubernetes integration.

Technical collaboration is like sports. If you work together as a team, you can go down the homestretch and pull through for a win. That’s our experience with the Google Cloud Platform team.

Recently, we approached Google Cloud Platform (GCP) to collaborate on behalf of Citrix customers and the broader enterprise market looking to migrate workloads. This migration required including the NetScaler Docker load balancer, CPX, into Kubernetes nodes and resolving any issues with getting traffic into the CPX proxies.

Why NetScaler and Kubernetes?

Citrix customers want the same Layer 4 to Layer 7 capabilities from NetScaler that they have on-prem as they move to the cloud and begin deploying their container and microservices architectures with Kubernetes
Kubernetes provides a proven infrastructure for running containers and VMs with automated workload delivery
NetScaler CPX provides Layer 4 to Layer 7 services and highly efficient telemetry data to a logging and analytics platform, NetScaler Management and Analytics System

I wish all our experiences working together with a technical partner were as good as working with GCP. We had a list of issues to resolve to enable our use cases, and were able to collaborate swiftly on a solution. To resolve these, the GCP team offered in-depth technical assistance, working with Citrix so that NetScaler CPX can spin up and take over as a client-side proxy running on each host.

Next, NetScaler CPX needed to be inserted in the data path of the GCP ingress load balancer so that NetScaler CPX can spread traffic to front end web servers. The NetScaler team made modifications so that NetScaler CPX listens to API server events and configures itself to create a VIP, IP table rules and server rules to take ingress traffic and load balance across front end applications. The Google Cloud Platform team provided feedback and assistance to verify the modifications made to overcome the technical hurdles. Done!

The NetScaler CPX use case is supported in Kubernetes 1.3. Citrix customers and the broader enterprise market will have the opportunity to leverage NetScaler with Kubernetes, thereby lowering the friction to move workloads to the cloud.

You can learn more about NetScaler CPX here.

-- Mikko Disini, Director of Product Management - NetScaler, Citrix Systems

↧

Dashboard - Full Featured Web Interface for Kubernetes

July 15, 2016, 10:39 am

Editor’s note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.3

Kubernetes Dashboard is a project that aims to bring a general purpose monitoring and operational web interface to the Kubernetes world. Three months ago we released the first production ready version, and since then the dashboard has made massive improvements. In a single UI, you’re able to perform the majority of possible interactions with your Kubernetes clusters without ever leaving your browser. This blog post breaks down new features introduced in the latest release and outlines the roadmap for the future.

Full-Featured Dashboard

Thanks to a large number of contributions from the community and project members, we were able to deliver many new features for the Kubernetes 1.3 release. We have been carefully listening to all the great feedback we have received from our users (see the summary infographics) and addressed the highest priority requests and pain points.

The Dashboard UI now handles all workload resources. This means that no matter what workload type you run, it is visible in the web interface and you can make operational changes to it. For example, you can modify your stateful MySQL installation with Pet Sets, do a rolling update of your web server with Deployments or install cluster monitoring with DaemonSets.

Home screen that shows all workloads running in a cluster.

In addition to viewing resources, you can create, edit, update, and delete them. This feature enables many use cases. For example, you can kill a failed Pod, do a rolling update on a Deployment, or just organize your resources. You can also export and import YAML configuration files of your cloud apps and store them in a version control system.

YAML resource editor and exporter.

The release includes a beta view of cluster nodes for administration and operational use cases. The UI lists all nodes in the cluster to allow for overview analysis and quick screening for problematic nodes. The details view shows all information about the node and links to pods running on it.

Node view that lists its details and Pods running on it.

There are also many smaller-scope new features that we shipped with the release, namely: support for namespaced resources, internationalization, performance improvements, and many bug fixes (find out more in the release notes). All these improvements result in a better and simpler user experience of the product.

Future Work

The team has ambitious plans for the future spanning across multiple use cases. We are also open to all feature requests, which you can post on our issue tracker.

Here is a list of our focus areas for the following months:

Handle more Kubernetes resources - To show all resources that a cluster user may potentially interact with. Once done, Dashboard can act as a complete replacement for CLI.
Monitoring and troubleshooting - To add resource usage statistics/graphs to the objects shown in Dashboard. This focus area will allow for actionable debugging and troubleshooting of cloud applications.
Security, auth and logging in - Make Dashboard accessible from networks external to a Cluster and work with custom authentication systems.

Connect With Us

We would love to talk with you and hear your feedback!

Email us at the SIG-UI mailing list
Chat with us on the Kubernetes Slack #SIG-UI channel
Join our meetings: 4PM CEST. See the SIG-UI calendar for details.

-- Piotr Bryk, Software Engineer, Google

↧

Steering an Automation Platform at Wercker with Kubernetes

July 15, 2016, 12:37 pm

Editor’s note: today’s guest post is by Andy Smith, the CTO of Wercker, sharing how Kubernetes helps them save time and speed up development.

At Wercker we run millions of containers that execute our users’ CI/CD jobs. The vast majority of them are ephemeral and only last as long as builds, tests and deploys take to run. The rest are ephemeral too (aren't we all?), but tend to last a bit longer and run our infrastructure. As we are running many containers across many nodes, we were in need of a highly scalable scheduler that would make our lives easier, and as such, decided to implement Kubernetes.

Wercker is a container-centric automation platform that helps developers build, test and deploy their applications. We support any number of pipelines, ranging from building code, testing API-contracts between microservices, to pushing containers to registries, and deploying to schedulers. All of these pipeline jobs run inside Docker containers and each artifact can be a Docker container.

And of course we use Wercker to build Wercker, and deploy itself onto Kubernetes!

Overview

Because we are a platform for running multi-service cloud-native code, we've made many design decisions around isolation. On the base level we use CoreOS and cloud-init to bootstrap a cluster of heterogeneous nodes, which I have named Patricians and Peasants, as well as controller nodes that don't have a cool name and are just called Controllers. Maybe we should switch to Constables.

Patrician nodes are where the bulk of our infrastructure runs. These nodes have appropriate network interfaces to communicate with our backend services as well as be routable by various load balancers. This is where our logging is aggregated and sent off to logging services, and where we run our many microservices for reporting and processing the results of job runs and for handling API calls.

On the other end of the spectrum are the Peasant nodes where the public jobs are run. Public jobs consist of worker pods reading from a job queue and dynamically generating new runner pods to handle execution of the job. The job itself is an incarnation of our open source CLI tool, the same one you can run on your laptop with Docker installed. These nodes have very limited access to the rest of the infrastructure and the containers the jobs themselves run in are even further isolated.

Controllers are controllers; I bet ours look exactly the same as yours.

Dynamic Pods

Our heaviest use of the Kubernetes API is definitely our system of creating dynamic pods to serve as the runtime environment for our actual job execution. After pulling job descriptions from the queue we define a new pod containing all the relevant environment for checking out code, managing a cache, executing a job and uploading artifacts. We launch the pod, monitor its progress, and destroy it when the job is done.
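
As a rough idea of what defining such a pod can look like, the Python sketch below builds a Pod manifest for one job. All names here (`runner_pod`, the `runner-<id>` naming scheme, the labels, the image and the command) are hypothetical and are not taken from Wercker's code; only the general shape of a `v1` Pod manifest is standard.

```python
import json


def runner_pod(job_id, image, command, env=None):
    """Build a Pod manifest (as a dict) for a single job run."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"runner-{job_id}",
            "labels": {"role": "runner", "job": str(job_id)},
        },
        "spec": {
            # The job runs once; a failed container is not restarted in place.
            "restartPolicy": "Never",
            "containers": [{
                "name": "job",
                "image": image,
                "command": list(command),
                # Environment values must be strings in a Pod manifest.
                "env": [{"name": key, "value": str(value)}
                        for key, value in sorted((env or {}).items())],
            }],
        },
    }


if __name__ == "__main__":
    pod = runner_pod(42, "example/runner:latest", ["run-job"], {"JOB_ID": 42})
    print(json.dumps(pod, indent=2))
```

The printed JSON could be submitted with `kubectl create -f -` or through an API client; monitoring the pod and deleting it when the job finishes would be separate calls.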

Ingresses

In order to provide a backend for HTTP API calls and allow self-registration of handlers we make use of the Ingress system in Kubernetes. It wasn't the clearest thing to set up, but reading through enough of the nginx example eventually got us to a good spot where it is easy to connect services to the frontend.

Upcoming Features in 1.3

While we generally treat all of our pods and containers as ephemeral and expect rapid restarts on failures, we are looking forward to Pet Sets and Init Containers as ways to optimize some of our processes. We are also pleased with official support for Minikube coming along as it improves our local testing and development.

Conclusion

Kubernetes saves us the non-trivial task of managing many, many containers across many nodes. It provides a robust API and tooling for introspecting these containers, and it includes much built-in support for logging, metrics, monitoring and debugging. Service discovery and networking alone save us so much time and speed development immensely.

Cheers to you Kubernetes, keep up the good work :)

-- Andy Smith, CTO, Wercker

↧

Update on Kubernetes for Windows Server Containers

July 17, 2016, 11:11 pm

Today's post is written by Jitendra Bhurat, Product Manager at Apprenda, and Cesar Wong, Principal Software Engineer at Red Hat, describing the progress made to bring Kubernetes to Windows Server.

Large organizations have significant investments and long-term roadmaps for both Linux and Windows Server. With Microsoft adopting Docker in Windows Server 2016, organizations are looking for a simplified means of orchestrating containers in both their Linux and Windows environments.

As the adoption and contribution curve of Kubernetes is unmatched, there has been a lot of interest in making Kubernetes compatible with the Microsoft ecosystem. At Apprenda, we have been building .NET distributed systems for the better part of a decade and kicked off the porting of Kubernetes to Windows in June.

Our current goal is to develop a minimally viable proof of concept (POC) of Kubernetes on Windows Server 2016 so that the community can learn about the pitfalls that will need to be overcome for future production environments. As the community drives this project to its first milestone, a number of lessons have been learned, work is ongoing, and we are currently investigating a number of networking options on Windows Server.

GETTING STARTED

In June, Apprenda and Kismatic (acquired by Apprenda) formed the Kubernetes Windows SIG team and, along with Red Hat, formed a partnership to create a Windows version of the Kubernetes Kubelet and Kube-Proxy, the two key components for operating Kubernetes on Windows Server. For a minimum viable POC, the initial work was divided into the following logical areas to help facilitate project management:

Focus Area	Architectural Notes	Status	Part of POC
Container Runtime	Expanding Kubernetes container runtimes to support Windows Server 2016 Docker containers	Work for POC complete	Yes
cAdvisor	Resource usage and performance characteristics in Kubernetes for Docker containers. Other architectural features in Kubernetes can run without this component.	Research will start after POC milestone	No
Pod Architecture	Fundamental unit of container bundling in Kubernetes. Current area of research ongoing on ultimate architecture in Windows Server.	Work for POC close to complete and will be finished when networking work is completed.	Yes
Networking and Kube-Proxy	Networking and communications among components (e.g. services to pod, pod to pod, etc.).	Still active area of research for POC. For some parts of Kubernetes there are no direct parallels in Windows Server - e.g. IPTables. Currently investigating Open vSwitch for container networking.	Yes
OOM Score	Frees up memory by sacrificing processes when all else fails. There is no direct comparison for OOM in Windows Server.	Research will start after POC milestone	No

Cesar Wong from Red Hat offered to contribute to the Container Runtime and Pod Architecture areas, and Apprenda has been working on Kubelet Integration and Kube-Proxy.

It is important to mention that for Windows Server environments, the Kubernetes control plane (API server, schedulers, etcd, etc) would continue to run on Linux and there would not be a Windows-only version of Kubernetes. Given the vast majority of organizations are running both Linux and Windows Server instances, this requirement is not a technical roadblock to adoption. For example, vSphere has a similar requirement.

Container Runtime

The current Kubelet implementation for Linux relies on an infrastructure container per pod to hold on to an IP address while other containers may be killed/restarted. This requires that containers be able to share their network stack (--net=container:id). In Docker for Windows, it is not possible to share the network stack across containers. It is also not possible to set a container’s DNS servers. In the container runtime POC for Windows, we created a new container runtime that removed the requirement of an infrastructure container. It also uses the IP of the first container in the pod as the pod’s IP address. With some limitations, however, it was possible to stand up pods with Windows-specific containers.

It is worth noting that a refactor of the container runtime in the Kubelet is under way and the POC code will need to be updated to reflect this new runtime architecture.

Pod Architecture

Pods in Kubernetes can include multiple containers, all sharing the same network, PID and volume namespace. Since Windows containers cannot share namespaces, multiple containers inside a pod are more isolated from each other. Furthermore, each container gets its own IP address, while the pod will only expose a single IP address for all containers.

At least initially, this means that Windows pods in Kubernetes should be limited to a single container. This also reflects the fact that containers in Windows are more monolithic than their Linux counterparts, bundling more of the underlying operating system pieces, such as a service manager and other processes.

Kubelet Integration

The goal here was to have Kubelet running on Windows Server Container and be capable of accepting requests/commands from the Kubernetes API Server running on Linux. The Kubelet code base is, unfortunately, closely coupled with the Linux OS. For example, even for trivial things like finding the hostname, the Kubelet code assumes Linux as the operating system. Thanks to the versatility of Kubernetes, many of the OS dependencies (like hostname) were fixable using command line flags to provide a default value.

Thus far, we have been able to decouple parts of the Kubelet from its dependencies, which allows it to run on Windows with the proper abstraction layers. While the current status of the Kubelet is good enough for a POC, there is more work that needs to be done to get it into a state of general availability. For one, instead of using flags and environment variables, it would be best to have code changes upstream. There are also a number of bugs that need fixing. For example, we encountered a Golang bug where moving a directory fails in Windows and had to provide an alternate implementation.

Networking / Kube-Proxy

Organizations using containers need the ability to deploy multiple containers and pods on a single host, have them share an IP address and be able to easily talk to other containers and pods on that same host. To accomplish this design goal, we conducted a lot of research on Windows Container Networking, as this held the keys to both the Kubernetes pod and service level networking.

We looked in-depth at the different networking modes supported by Windows Containers and found L2 Bridge Networking mode to be the most appropriate in our case, as it allowed inter-container communication across different hosts. Working closely with the Microsoft networking team, we were able to identify and resolve issues we were having setting up the L2 Bridge networking mode in our environment using the TP5 release of Windows Server 2016.

As we dug deep into networking, two options were clear for the Kube-Proxy implementation:

Implement Kube-Proxy natively on Windows
Run a Linux version of the Kube-Proxy on a Hyper-V VM using the same bridge as the other containers running on Windows, with the Kube-Proxy forwarding requests to the other containers and the Windows host forwarding requests to the proxy

For the POC, we decided to run Kube-Proxy on a Hyper-V Linux virtual machine and configure L2 Bridge mode networking with a private subnet. This implementation would enable us to forward traffic from the Kube-Proxy to containers running on Windows. Unfortunately, it did not work as it did in theory.

On further investigation, with the help of Microsoft, we determined that for this to work, we would have to configure the L2 Bridge networking mode with an externally accessible gateway and on the same subnet as the container host. Such a requirement goes against the networking isolation boundary that the pod currently enjoys because each container on that host and other hosts can communicate with each other and all the other container hosts. This means any container can talk to any other container regardless of pod membership.

We are currently looking at using Open vSwitch (OVS) to configure overlay networking to overcome the issue described above. Cloudbase, which is also a member of the Kubernetes Windows SIG, is actively involved in this effort. Their team has successfully implemented OVS in Windows Server and their work is promising for this effort. The community is also currently engaged with the Microsoft lead on Windows Server networking to find an alternative in case this route does not pan out.

As we continue to make progress on the POC, we welcome ideas from the community to help us advance this vision. You can connect with us in the following ways:

Chat with us on the Kubernetes Slack: #sig-windows
Contribute on the Kubernetes Windows SIG Google Group
Join our meetings: biweekly on Tuesdays at 10AM PT

--Jitendra Bhurat, Product Manager at Apprenda. Container Runtime and Pod Architecture sections contributed by Cesar Wong, Principal Software Engineer at Red Hat

↧

Bringing End-to-End Kubernetes Testing to Azure (Part 2)

July 18, 2016, 10:34 am

Editor’s Note: Today’s guest post is Part II from a series by Travis Newhouse, Chief Architect at AppFormix, writing about their contributions to Kubernetes.

Historically, Kubernetes testing has been hosted by Google, running e2e tests on Google Compute Engine (GCE) and Google Container Engine (GKE). In fact, the gating checks for the submit-queue are a subset of tests executed on these test platforms. Federated testing aims to expand test coverage by enabling organizations to host test jobs for a variety of platforms and contribute test results to benefit the Kubernetes project. Members of the Kubernetes test team at Google and SIG-Testing have created a Kubernetes test history dashboard that publishes the results from all federated test jobs (including those hosted by Google).

In this blog post, we describe extending the e2e test jobs for Azure, and show how to contribute a federated test to the Kubernetes project.

END-TO-END INTEGRATION TESTS FOR AZURE

After successfully implementing “development distro” scripts to automate deployment of Kubernetes on Azure, our next goal was to run e2e integration tests and share the results with the Kubernetes community.

We automated our workflow for executing e2e tests of Kubernetes on Azure by defining a nightly job in our private Jenkins server. Figure 2 shows the workflow that uses kube-up.sh to deploy Kubernetes on Ubuntu virtual machines running in Azure, then executes the e2e tests. On completion of the tests, the job uploads the test results and logs to a Google Cloud Storage directory, in a format that can be processed by the scripts that produce the test history dashboard. Our Jenkins job uses the hack/jenkins/e2e-runner.sh and hack/jenkins/upload-to-gcs.sh scripts to produce the results in the correct format.

Figure 2 - Nightly test job workflow

HOW TO CONTRIBUTE AN E2E TEST

Throughout our work to create the Azure e2e test job, we have collaborated with members of SIG-Testing to find a way to publish the results to the Kubernetes community. The results of this collaboration are documentation and a streamlined process to contribute results from a federated test job. The process of contributing e2e test results can be summarized in four steps.

Create a Google Cloud Storage bucket in which to publish the results.
Define an automated job to run the e2e tests. By setting a few environment variables, hack/jenkins/e2e-runner.sh deploys Kubernetes binaries and executes the tests.
Upload the results using hack/jenkins/upload-to-gcs.sh.
Incorporate the results into the test history dashboard by submitting a pull-request with modifications to a few files in kubernetes/test-infra.

The federated tests documentation describes these steps in more detail. The scripts to run e2e tests and upload results simplify the work to contribute a new federated test job. The specific steps to set up an automated test job and an appropriate environment in which to deploy Kubernetes are left to the reader’s preferences. For organizations using Jenkins, the jenkins-job-builder configurations for GCE and GKE tests may provide helpful examples.

RETROSPECTIVE

The e2e tests on Azure have been running for several weeks now. During this period, we have found two issues in Kubernetes. Weixu Zhuang immediately published fixes that have been merged into the Kubernetes master branch.

The first issue occurred when we tried to bring up a Kubernetes cluster on Azure with SaltStack on Ubuntu VMs. A commit (07d7cfd3) modified the OpenVPN certificate generation script to use a variable that was only initialized by scripts in cluster/ubuntu. Strict checking for the existence of parameters in the certificate generation script caused other platforms that use the script to fail (e.g. our changes to support Azure). We submitted a pull-request that fixed the issue by initializing the variable with a default value, which makes the certificate generation scripts more robust across all platform types.

The second pull-request cleaned up an unused import in the Daemonset unit test file. The import statement broke the unit tests with golang 1.4. Our nightly Jenkins job helped us find this error and we promptly pushed a fix for it.

CONCLUSION AND FUTURE WORK

The addition of a nightly e2e test job for Kubernetes on Azure has helped to define the process to contribute a federated test to the Kubernetes project. During the course of the work, we also saw the immediate benefit of expanding test coverage to more platforms when our Azure test job identified compatibility issues.

We want to thank Aaron Crickenberger, Erick Fejta, Joe Finney, and Ryan Hutchinson for their help to incorporate the results of our Azure e2e tests into the Kubernetes test history. If you’d like to get involved with testing to create stable, high-quality releases of Kubernetes, join us in the Kubernetes Testing SIG (sig-testing).

--Travis Newhouse, Chief Architect at AppFormix

↧

A Very Happy Birthday Kubernetes

July 21, 2016, 10:45 am

Last year at OSCON, I got to reconnect with a bunch of friends and see what they have been working on. That turned out to be the Kubernetes 1.0 launch event. Even that day, it was clear the project was supported by a broad community -- a group that showed an ambitious vision for distributed computing.

Today, on the first anniversary of the Kubernetes 1.0 launch, it’s amazing to see what a community of dedicated individuals can do. Kubernauts have collectively put in 237 person years of coding effort since launch to bring forward our most recent release, 1.3. However, the community is much more than simply coding effort. It is made up of people -- individuals that have given their expertise and energy to make this project flourish. With more than 830 diverse contributors, from independents to the largest companies in the world, it’s their work that makes Kubernetes stand out. Here are stories from a couple of early contributors reflecting back on the project:

Justin Santa Barbara, independent Kubernetes contributor
Clayton Coleman, contributor and architect on Kubernetes on OpenShift at Red Hat

The community is also more than online GitHub and Slack conversation; year one saw the launch of KubeCon, the Kubernetes user conference, which started as a grassroots effort that brought together 1,000 individuals between two events in San Francisco and London. The advocacy continues with users globally. There are more than 130 Meetup groups that mention Kubernetes, many of which are helping celebrate Kubernetes’ birthday. To join the celebration, participate at one of the 20 #k8sbday parties worldwide: Austin, Bangalore, Beijing, Boston, Cape Town, Charlotte, Cologne, Geneva, Karlsruhe, Kisumu, Montreal, Portland, Raleigh, Research Triangle, San Francisco, Seattle, Singapore, SF Bay Area, or Washington DC.

The Kubernetes community continues to work to make our project more welcoming and open to our kollaborators. This spring, Kubernetes and KubeCon moved to the Cloud Native Computing Foundation (CNCF), a Linux Foundation Project, to accelerate the collaborative vision outlined only a year ago at OSCON. Here’s to lifting a glass to another great year.

-- Sarah Novotny, Kubernetes Community Wonk

↧

Happy Birthday Kubernetes. Oh, the places you’ll go!

July 21, 2016, 10:46 am

Editor’s note: Today’s guest post is from an independent Kubernetes contributor, Justin Santa Barbara, sharing his reflections on the growth of the project from inception to its future.

Dear K8s,

It’s hard to believe you’re only one - you’ve grown up so fast. On the occasion of your first birthday, I thought I would write a little note about why I was so excited when you were born, why I feel fortunate to be part of the group that is raising you, and why I’m eager to watch you continue to grow up!

--Justin

You started with an excellent foundation - good declarative functionality, built around a solid API with a well defined schema and the machinery so that we could evolve going forwards. And sure enough, over your first year you grew so fast: autoscaling, HTTP load-balancing support (Ingress), support for persistent workloads including clustered databases (PetSets). You’ve made friends with more clouds (welcome Azure & OpenStack to the family), and even started to span zones and clusters (Federation). And these are just some of the most visible changes - there’s so much happening inside that brain of yours!

I think it’s wonderful you’ve remained so open in all that you do - you seem to write down everything on GitHub - for better or worse. I think we’ve all learned a lot about that on the way, like the perils of having engineers make scaling statements that are then weighed against claims made without quite the same framework of precision and rigor. But I’m proud that you chose not to lower your standards, but rose to the challenge and just ran faster instead - it might not be the most realistic approach, but it is the only way to move mountains!

And yet, somehow, you’ve managed to avoid a lot of the common dead-ends that other open source software has fallen into, particularly as those projects get bigger and the developers end up working on them more than they use them directly. How did you do that? There’s a probably-apocryphal story of an employee at IBM who makes a huge mistake and is summoned to meet with the big boss, expecting to be fired, only to be told “We just spent several million dollars training you. Why would we want to fire you?”. Despite all the investment Google is pouring into you (along with Red Hat and others), I sometimes wonder if the mistakes we are avoiding could be worth even more. There is a very open development process, yet there’s also an “oracle” that will sometimes course-correct by telling us what happens two years down the road if we make a particular design decision. This is a parent you should probably listen to!

And so although you’re only a year old, you really have an old soul. I’m just one of the many people raising you, but it’s a wonderful learning experience for me to be able to work with the people that have built these incredible systems and have all this domain knowledge. Yet because we started from scratch (rather than taking the existing Borg code) we’re at the same level and can still have genuine discussions about how to raise you. Well, at least as close to the same level as we could ever be, but it’s to their credit that they are all far too nice ever to mention it!

If I had to pick just two of the wise decisions those brilliant people made:

Labels & selectors give us declarative “pointers”, so we can say “why” we want things, rather than listing the things directly. It’s the secret to how you can scale to great heights; not by naming each step, but saying “a thousand more steps just like that first one”.
Controllers are state-synchronizers: we specify the goals, and your controllers will indefatigably work to bring the system to that state. They work through that strongly-typed API foundation, and are used throughout the code, so Kubernetes is more of a set of a hundred small programs than one big one. It’s not enough to scale to thousands of nodes technically; the project also has to scale to thousands of developers and features; and controllers help us get there.
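
The two ideas above can be sketched together in a few lines of Python. This is a toy model and not Kubernetes code: the `Pod` class and the `matches` and `reconcile` functions are hypothetical names invented for illustration, and only simple equality-based selectors are modelled. The point is that the controller never names individual pods; it is handed a selector and a desired count, and works out the difference itself.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Pod:
    name: str
    labels: dict = field(default_factory=dict)


def matches(selector: dict, labels: dict) -> bool:
    """Equality-based selector: every selector key/value must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


_ids = count(1)  # source of unique suffixes for pods created by the toy controller


def reconcile(pods: list, selector: dict, desired: int) -> list:
    """One pass of a toy controller: make the number of selected pods equal `desired`."""
    selected = [p for p in pods if matches(selector, p.labels)]
    others = [p for p in pods if not matches(selector, p.labels)]
    # Too few: create more pods carrying the selector's labels.
    while len(selected) < desired:
        selected.append(Pod(name=f"pod-{next(_ids)}", labels=dict(selector)))
    # Too many: drop the surplus (a no-op when the count is already right).
    del selected[desired:]
    return others + selected


cluster = [Pod("db-0", {"app": "db"}), Pod("web-a", {"app": "web", "tier": "frontend"})]
cluster = reconcile(cluster, {"app": "web"}, desired=3)
web = [p for p in cluster if matches({"app": "web"}, p.labels)]
print(len(web), len(cluster))
```

Running `reconcile` again with the same arguments changes nothing, which is the property that lets a controller be re-run indefinitely against a drifting system.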

And so on we will go! We’ll be replacing those controllers and building on more, and the API-foundation lets us build anything we can express in that way - with most things just a label or annotation away! But your thoughts will not be defined by language: with third party resources you can express anything you choose. Now we can build Kubernetes without building in Kubernetes, creating things that feel as much a part of Kubernetes as anything else. Many of the recent additions, like ingress, DNS integration, autoscaling and network policies were done or could be done in this way. Eventually it will be hard to imagine you before these things, but tomorrow’s standard functionality can start today, with no obstacles or gatekeeper, maybe even for an audience of one.

So I’m looking forward to seeing more and more growth happen further and further from the core of Kubernetes. We had to work our way through those phases; starting with things that needed to happen in the kernel of Kubernetes - like replacing replication controllers with deployments. Now we’re starting to build things that don’t require core changes. But we’re still talking about infrastructure separately from applications. It’s what comes next that gets really interesting: when we start building applications that rely on the Kubernetes APIs. We’ve always had the Cassandra example that uses the Kubernetes API to self-assemble, but we haven’t really even started to explore this more widely yet. In the same way that the S3 APIs changed how we build things that remember, I think the k8s APIs are going to change how we build things that think.

So I’m looking forward to your second birthday: I can try to predict what you’ll look like then, but I know you’ll surpass even the most audacious things I can imagine. Oh, the places you’ll go!

-- Justin Santa Barbara, Independent Kubernetes Contributor

↧

The Bet on Kubernetes, a Red Hat Perspective

July 21, 2016, 10:46 am

Editor’s note: Today’s guest post is from a Kubernetes contributor Clayton Coleman, Architect on OpenShift at Red Hat, sharing their adoption of the project from its beginnings.

Two years ago, Red Hat made a big bet on Kubernetes. We bet on a simple idea: that an open source community is the best place to build the future of application orchestration, and that only an open source community could successfully integrate the diverse range of capabilities necessary to succeed. As a Red Hatter, that idea is not far-fetched - we’ve seen it successfully applied in many communities, but we’ve also seen it fail, especially when a broad reach is not supported by solid foundations. On the one year anniversary of Kubernetes 1.0, two years after the first open-source commit to the Kubernetes project, it’s worth asking the question:

Was Kubernetes the right bet?

The success of software is measured by the successes of its users - whether that software enables new opportunities or efficiencies for them. In that regard, Kubernetes has succeeded beyond our wildest dreams. We know of hundreds of real production deployments of Kubernetes, in the enterprise through Red Hat’s multi-tenant enabled OpenShift distribution, on Google Container Engine (GKE), in heavily customized versions run by some of the world's largest software companies, and through the education, entertainment, startup, and do-it-yourself communities. Those deployers report improved time to delivery, standardized application lifecycles, improved resource utilization, and more resilient and robust applications. And that’s just from customers or contributors to the community - I would not be surprised if there were now thousands of installations of Kubernetes managing tens of thousands of real applications out in the wild.

I believe that reach to be a validation of the vision underlying Kubernetes: to build a platform for all applications by providing tools for each of the core patterns in distributed computing. Those patterns:

simple replicated web software
distributed load balancing and service discovery
immutable images run in containers
co-location of related software into pods
simplified consumption of network attached storage
flexible and powerful resource scheduling
running batch and scheduled jobs alongside service workloads
managing and maintaining clustered software like databases and message queues

Together, these patterns allow developers and operators to move to the next scale of abstraction, just like they have enabled Google and others in the tech ecosystem to scale to datacenter computers and beyond. From Kubernetes 1.0 to 1.3 we have continually improved the power and flexibility of the platform while ALSO improving performance, scalability, reliability, and usability. The explosion of integrations and tools that run on top of Kubernetes further validates core architectural decisions to be composable, to expose open and flexible APIs, and to deliberately limit the core platform and encourage extension.

Today Kubernetes has one of the largest and most vibrant communities in the open source ecosystem, with almost a thousand contributors, one of the highest human-generated commit rates of any single-repository project on GitHub, over a thousand projects based around Kubernetes, and correspondingly active Stack Overflow and Slack channels. Red Hat is proud to be part of this ecosystem as the largest contributor to Kubernetes after Google, and every day more companies and individuals join us. The idea of Kubernetes found fertile ground, and you, the community, provided the excitement and commitment that made it grow.

So, did we bet correctly? For all the reasons above, and hundreds more: Yes.

What’s next?

Happy as we are with the success of Kubernetes, this is no time to rest! While there are many more features and improvements we want to build into Kubernetes, I think there is a general consensus that we want to focus on the only long term goal that matters - a healthy, successful, and thriving technical community around Kubernetes. As John F. Kennedy probably said:

> Ask not what your community can do for you, but what you can do for your community

In a recent post to the kubernetes-dev list, Brian Grant laid out a great set of near term goals - goals that help grow the community, refine how we execute, and enable future expansion. In each of the Kubernetes Special Interest Groups we are trying to build sustainable teams that can execute across companies and communities, and we are actively working to ensure each of these SIGs is able to contribute, coordinate, and deliver across a diverse range of interests under one vision for the project.

Of special interest to us is the story of extension - how the core of Kubernetes can become the beating heart of the datacenter operating system, and enable even more patterns for application management to build on top of Kubernetes, not just into it. Work done in the 1.2 and 1.3 releases around third party APIs, API discovery, flexible scheduler policy, external authorization and authentication (beyond those built into Kubernetes) is just the start. When someone has a need, we want them to easily find a solution, and we also want it to be easy for others to consume and contribute to that solution. Likewise, the best way to prove ideas is to prototype them against real needs and to iterate against real problems, which should be easy and natural.

By Kubernetes’ second birthday, I hope to reflect back on a long year of refinement, user success, and community participation. It has been a privilege and an honor to contribute to Kubernetes, and it still feels like we are just getting started. Thank you, and I hope you come along for the ride!

-- Clayton Coleman, Contributor and Architect on Kubernetes and OpenShift at Red Hat. Follow him on Twitter and GitHub: @smarterclayton

↧

Why OpenStack's embrace of Kubernetes is great for both communities

July 25, 2016, 9:14 pm

Today, Mirantis, the leading contributor to OpenStack, announced that it will re-write its private cloud platform to use Kubernetes as its underlying orchestration engine. We think this is a great step forward for both the OpenStack and Kubernetes communities. With Kubernetes under the hood, OpenStack users will benefit from the tremendous efficiency, manageability and resiliency that Kubernetes brings to the table, while positioning their applications to use more cloud-native patterns. The Kubernetes community, meanwhile, can feel confident in their choice of orchestration framework, while gaining the ability to manage both container- and VM-based applications from a single platform.

The Path to Cloud Native

Google spent over ten years developing, applying and refining the principles of cloud native computing. A cloud-native application is:

Container-packaged. Applications are composed of hermetically sealed, reusable units across diverse environments;
Dynamically scheduled, for increased infrastructure efficiency and decreased operational overhead; and
Microservices-based. Loosely coupled components significantly increase the overall agility, resilience and maintainability of applications.

These principles have enabled us to build the largest, most efficient, most powerful cloud infrastructure in the world, which anyone can access via Google Cloud Platform. They are the same principles responsible for the recent surge in popularity of Linux containers. Two years ago, we open-sourced Kubernetes to spur adoption of containers and scalable, microservices-based applications, and the recently released Kubernetes version 1.3 introduces a number of features to bridge enterprise and cloud native workloads. We expect that adoption of cloud-native principles will drive the same benefits within the OpenStack community, as well as smoothing the path between OpenStack and the public cloud providers that embrace them.

Making OpenStack better

We hear from enterprise customers that they want to move towards cloud-native infrastructure and application patterns. Thus, it is hardly surprising that OpenStack would also move in this direction [1], with large OpenStack users such as eBay and GoDaddy adopting Kubernetes as a key component of their stacks. Kubernetes and cloud-native patterns will improve OpenStack lifecycle management by enabling rolling updates, versioning, and canary deployments of new components and features. In addition, OpenStack users will benefit from self-healing infrastructure, making OpenStack easier to manage and more resilient to the failure of core services and individual compute nodes. Finally, OpenStack users will realize the developer and resource efficiencies that come with a container-based infrastructure.

OpenStack is a great tool for Kubernetes users

Conversely, incorporating Kubernetes into OpenStack will give Kubernetes users access to a robust framework for deploying and managing applications built on virtual machines. As users move to the cloud-native model, they will be faced with the challenge of managing hybrid application architectures that contain some mix of virtual machines and Linux containers. The combination of Kubernetes and OpenStack means that they can do so on the same platform using a common set of tools.

We are excited by the ever-increasing momentum of the cloud-native movement as embodied by Kubernetes and related projects, and look forward to working with Mirantis, its partner Intel, and others within the OpenStack community to bring the benefits of cloud-native to their applications and infrastructure.

--Martin Buhr, Product Manager, Strategic Initiatives, Google

[1] Check out the announcement of Kubernetes-OpenStack Special Interest Group here, and a great talk about OpenStack on Kubernetes by CoreOS CEO Alex Polvi at the most recent OpenStack summit here.

↧

Challenges of a Remotely Managed, On-Premises, Bare-Metal Kubernetes Cluster

August 2, 2016, 12:28 pm

Today's post is written by Bich Le, chief architect at Platform9, describing how their engineering team overcame challenges in remotely managing bare-metal Kubernetes clusters.

Introduction

The recently announced Platform9 Managed Kubernetes (PMK) is an on-premises enterprise Kubernetes solution with an unusual twist: while clusters run on a user’s internal hardware, their provisioning, monitoring, troubleshooting and overall life cycle are managed remotely from the Platform9 SaaS application. While users love the intuitive experience and ease of use of this deployment model, this approach poses interesting technical challenges. In this article, we will first describe the motivation and deployment architecture of PMK, and then present an overview of the technical challenges we faced and how our engineering team addressed them.

Multi-OS bootstrap model

Like its predecessor, Managed OpenStack, PMK aims to make it as easy as possible for an enterprise customer to deploy and operate a “private cloud”, which, in the current context, means one or more Kubernetes clusters. To accommodate customers who standardize on a specific Linux distro, our installation process uses a “bare OS” or “bring your own OS” model, which means that an administrator deploys PMK to existing Linux nodes by installing a simple RPM or Deb package on their favorite OS (Ubuntu-14, CentOS-7, or RHEL-7). The package, which the administrator downloads from their Platform9 SaaS portal, starts an agent which is preconfigured with all the information and credentials needed to securely connect to and register itself with the customer’s Platform9 SaaS controller running on the WAN.

Node management

The first challenge was configuring Kubernetes nodes in the absence of a bare-metal cloud API and SSH access into nodes. We solved it using the node pool concept and configuration management techniques. Every node running the agent automatically shows up in the SaaS portal, which allows the user to authorize the node for use with Kubernetes. A newly authorized node automatically enters a node pool, indicating that it is available but not used in any clusters. Independently, the administrator can create one or more Kubernetes clusters, which start out empty. At any later time, he or she can request one or more nodes to be attached to any cluster. PMK fulfills the request by transferring the specified number of nodes from the pool to the cluster. When a node is authorized, its agent becomes a configuration management agent, polling for instructions from a CM server running in the SaaS application and capable of downloading and configuring software.
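
The node pool bookkeeping described above can be illustrated with a small Python sketch. It is only a model of the concept and not PMK's implementation: the `NodePoolManager` class, its method names and the node names are hypothetical, and the real system drives these transitions through its agents and CM server.

```python
class NodePoolManager:
    """Toy model: authorized nodes wait in a pool until attached to a cluster."""

    def __init__(self):
        self.pool = []       # authorized nodes not used by any cluster
        self.clusters = {}   # cluster name -> list of attached node names

    def authorize(self, node):
        """A newly authorized node automatically enters the pool."""
        self.pool.append(node)

    def create_cluster(self, name):
        """Clusters start out empty."""
        self.clusters[name] = []

    def attach(self, cluster, count):
        """Transfer `count` nodes from the pool to the named cluster."""
        if count > len(self.pool):
            raise ValueError(f"only {len(self.pool)} node(s) available in the pool")
        moved, self.pool = self.pool[:count], self.pool[count:]
        self.clusters[cluster].extend(moved)
        return moved

    def detach(self, cluster, node):
        """Return a node from a cluster to the pool."""
        self.clusters[cluster].remove(node)
        self.pool.append(node)


pmk = NodePoolManager()
for host in ("node-a", "node-b", "node-c"):
    pmk.authorize(host)
pmk.create_cluster("clus100")
pmk.create_cluster("clus101")
pmk.attach("clus100", 3)
print(pmk.clusters, pmk.pool)
```

Note that the administrator asks for a number of nodes rather than naming them, which is what lets the service choose which pooled machines to use.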

Cluster creation and node attach/detach operations are exposed to administrators via a REST API, a CLI utility named qb, and the SaaS-based Web UI. The following screenshot shows the Web UI displaying one 3-node cluster named clus100, one empty cluster clus101, and the three nodes.

Cluster initialization

The first time one or more nodes are attached to a cluster, PMK configures the nodes to form a complete Kubernetes cluster. Currently, PMK automatically decides the number and placement of Master and Worker nodes. In the future, PMK will give administrators an “advanced mode” option allowing them to override and customize those decisions. Through the CM server, PMK then sends to each node a configuration and a set of scripts to initialize each node according to the configuration. This includes installing or upgrading Docker to the required version, starting 2 Docker daemons (bootstrap and main), creating the etcd K/V store, establishing the flannel network layer, starting kubelet, and running the Kubernetes components appropriate for the node’s role (master vs. worker). The following diagram shows the component layout of a fully formed cluster.

Containerized kubelet?

Another hurdle we encountered resulted from our original decision to run kubelet in a container, as recommended by the Multi-node Docker Deployment Guide. We discovered that this approach introduces complexities that led to many difficult-to-troubleshoot bugs that were sensitive to the combined versions of Kubernetes, Docker, and the node OS. One example is kubelet’s need to mount directories containing secrets into containers to support the Service Accounts mechanism. It turns out that doing this from inside of a container is tricky, and requires a complex sequence of steps that turned out to be fragile. After fixing a continuing stream of issues, we finally decided to run kubelet as a native program on the host OS, resulting in significantly better stability.

Overcoming networking hurdles

The Beta release of PMK currently uses flannel with a UDP back-end for the network layer. In a Kubernetes cluster, many infrastructure services need to communicate across nodes using a variety of ports (443, 4001, etc.) and protocols (TCP and UDP). Often, customer nodes intentionally or unintentionally block some or all of the traffic, or run existing services that conflict with the required ports, resulting in non-obvious failures. To address this, we try to detect configuration problems early and inform the administrator immediately. PMK runs a “preflight” check on all nodes participating in a cluster before installing the Kubernetes software. This means running small test programs on each node to verify that (1) the required ports are available for binding and listening; and (2) nodes can connect to each other using all required ports and protocols. These checks run in parallel and take less than a couple of seconds to complete before cluster initialization.
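
To make the two preflight checks concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the idea and not PMK's actual test programs: the function names `can_bind`, `can_connect` and `preflight` are hypothetical, the connectivity probe covers TCP only, and the demo at the end runs purely on the loopback interface with a port chosen by the OS.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def can_bind(port, proto="tcp", host="0.0.0.0"):
    """Check (1): is the port free for binding on this node?"""
    kind = socket.SOCK_STREAM if proto == "tcp" else socket.SOCK_DGRAM
    with socket.socket(socket.AF_INET, kind) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def can_connect(host, port, timeout=1.0):
    """Check (2), TCP only: can this node reach a peer on the given port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(ports, host="0.0.0.0"):
    """Run the bind check for every (port, proto) pair in parallel."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: can_bind(p[0], p[1], host), ports))
    return dict(zip(ports, results))


# Demo on loopback: occupy an ephemeral port, then probe it both ways.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
busy_port = listener.getsockname()[1]
report = preflight([(busy_port, "tcp")], host="127.0.0.1")
reachable = can_connect("127.0.0.1", busy_port)
listener.close()
print(report, reachable)
```

In the demo the occupied port is reported as unavailable for binding, which is the kind of conflict with an existing service that the preflight step is meant to surface before installation.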

Monitoring

One of the values of a SaaS-managed private cloud is constant monitoring and early detection of problems by the SaaS team. Issues that can be addressed without intervention by the customer are handled automatically, while others trigger proactive communication with the customer via UI alerts, email, or real-time channels. Kubernetes monitoring is a huge topic worthy of its own blog post, so we’ll just briefly touch upon it. We broadly classify the problem into layers: (1) hardware & OS, (2) Kubernetes core (e.g. API server, controllers and kubelets), (3) add-ons (e.g. SkyDNS & ServiceLoadbalancer) and (4) applications. We are currently focused on layers 1-3. A major source of issues we’ve seen is add-on failures. If either DNS or the ServiceLoadbalancer reverse HTTP proxy (soon to be upgraded to an Ingress Controller) fails, application services will start failing. One way we detect such failures is by monitoring the components using the Kubernetes API itself, which is proxied into the SaaS controller, allowing the PMK support team to monitor any cluster resource. To detect service failure, one metric we pay attention to is pod restarts. A high restart count indicates that a service is continually failing.
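
As a rough illustration of the restart-count signal, the sketch below sums the `restartCount` field that the Kubernetes API reports under `status.containerStatuses` for each pod and flags pods above a threshold. The helper names, the sample pod names and the threshold of 5 are assumptions made for this example; the pod dictionaries stand in for objects that a real monitor would fetch from the API server.

```python
def restart_count(pod: dict) -> int:
    """Total restarts across all containers of one pod object."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return sum(status.get("restartCount", 0) for status in statuses)


def flapping_pods(pods: list, threshold: int = 5) -> list:
    """Names of pods whose total restart count meets or exceeds the threshold."""
    return sorted(
        pod["metadata"]["name"] for pod in pods if restart_count(pod) >= threshold
    )


# Hypothetical sample data shaped like entries of a pod list response.
sample = [
    {"metadata": {"name": "dns-addon"},
     "status": {"containerStatuses": [{"restartCount": 9}, {"restartCount": 3}]}},
    {"metadata": {"name": "web-1"},
     "status": {"containerStatuses": [{"restartCount": 0}]}},
    {"metadata": {"name": "pending-pod"}, "status": {}},
]
print(flapping_pods(sample))
```

A pod that has not started any containers yet has no `containerStatuses`, so the helper treats it as having zero restarts rather than failing.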

Future topics

We faced complex challenges in other areas that deserve their own posts: (1) Authentication and authorization with Keystone, the identity manager used by Platform9 products; (2) Software upgrades, i.e. how to make them brief and non-disruptive to applications; and (3) Integration with customers’ external load balancers (in the absence of good automation APIs).

Conclusion

Platform9 Managed Kubernetes uses a SaaS-managed model to try to hide the complexity of deploying, operating and maintaining bare-metal Kubernetes clusters in customers’ data centers. These requirements led to the development of a unique cluster deployment and management architecture, which in turn led to unique technical challenges. This article gave an overview of some of those challenges and how we solved them. For more information on the motivation behind PMK, feel free to view Madhura Maskasky's blog post.

--Bich Le, Chief Architect, Platform9

↧

Create a Couchbase cluster using Kubernetes

August 15, 2016, 3:19 pm

Editor’s note: today’s guest post is by Arun Gupta, Vice President Developer Relations at Couchbase, showing how to set up a Couchbase cluster with Kubernetes.

Couchbase Server is an open source, distributed NoSQL document-oriented database. It exposes a fast key-value store with managed cache for submillisecond data operations, purpose-built indexers for fast queries and a query engine for executing SQL queries. For mobile and Internet of Things (IoT) environments, Couchbase Lite runs native on-device and manages sync to Couchbase Server.

Couchbase Server 4.5 was recently announced, bringing many new features, including production certified support for Docker. Couchbase is supported on a wide variety of orchestration frameworks for Docker containers, such as Kubernetes, Docker Swarm and Mesos. For full details, visit this page.

This blog post will explain how to create a Couchbase cluster using Kubernetes. This setup is tested using Kubernetes 1.3.3, Amazon Web Services, and Couchbase 4.5 Enterprise Edition.

Like all good things, this post is standing on the shoulders of giants. The design pattern used in this blog was defined in a Friday afternoon hack with @saturnism. A working version of the configuration files was contributed by @r_schmiddy.

Couchbase Cluster

A cluster of Couchbase Servers is typically deployed on commodity servers. Couchbase Server has a peer-to-peer topology where all the nodes are equal and communicate with each other on demand. There is no concept of master nodes, slave nodes, config nodes, name nodes, head nodes, etc., and all the software loaded on each node is identical. This allows nodes to be added or removed without considering their “type”. This model works particularly well with cloud infrastructure in general. For Kubernetes, this means that we can use the exact same container image for all Couchbase nodes.

A typical Couchbase cluster creation process looks like:

Start Couchbase: Start n Couchbase servers
Create cluster: Pick any server, and add all other servers to it to create the cluster
Rebalance cluster: Rebalance the cluster so that data is distributed across the cluster

In order to automate using Kubernetes, the cluster creation is split into a “master” and “worker” Replication Controller (RC).

The master RC has only one replica and is also published as a Service. This provides a single reference point to start the cluster creation. By default services are visible only from inside the cluster. This service is also exposed as a load balancer. This allows the Couchbase Web Console to be accessible from outside the cluster.

The worker RC uses the exact same image as the master RC. This keeps the cluster homogeneous, which makes it easy to scale.

Configuration files used in this blog are available here. Let’s create the Kubernetes resources to create the Couchbase cluster.

Create Couchbase “master” Replication Controller

Couchbase master RC can be created using the following configuration file:

apiVersion: v1
kind: ReplicationController
metadata:
  name: couchbase-master-rc
spec:
  replicas: 1
  selector:
    app: couchbase-master-pod
  template:
    metadata:
      labels:
        app: couchbase-master-pod
    spec:
      containers:
      - name: couchbase-master
        image: arungupta/couchbase:k8s
        env:
          - name: TYPE
            value: MASTER
        ports:
        - containerPort: 8091
---
apiVersion: v1
kind: Service
metadata:
  name: couchbase-master-service
  labels:
    app: couchbase-master-service
spec:
  ports:
    - port: 8091
  selector:
    app: couchbase-master-pod
  type: LoadBalancer

This configuration file creates a couchbase-master-rc Replication Controller. This RC has one replica of the pod created using the arungupta/couchbase:k8s image. This image is created using the Dockerfile here. This Dockerfile uses a configuration script to configure the base Couchbase Docker image. First, the script uses the Couchbase REST API to set up the memory quota, the index, data, and query services, and the security credentials, and to load a sample data bucket. Then, it invokes the appropriate Couchbase CLI commands to either add the Couchbase node to the cluster, or add the node and rebalance the cluster. The behavior is controlled by three environment variables:

TYPE: Defines whether the joining pod is worker or master
AUTO_REBALANCE: Defines whether the cluster needs to be rebalanced
COUCHBASE_MASTER: Name of the master service

For this first configuration file, the TYPE environment variable is set to MASTER and so no additional configuration is done on the Couchbase image.
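How these three variables might drive the configuration script can be sketched as follows. This is a hypothetical reconstruction for illustration, not the script shipped in the image: the function name `node_action` and the action labels it prints are assumptions, and the actual Couchbase CLI calls are deliberately left out.

```shell
# Hypothetical sketch: choose what a starting pod should do, based on the
# TYPE, AUTO_REBALANCE and COUCHBASE_MASTER environment variables.
# It only prints the chosen action; no Couchbase command is executed.
node_action() {
  case "${TYPE:-}" in
    MASTER)
      # Master: base configuration only, nothing to join.
      echo "configure-only"
      ;;
    WORKER)
      # A worker needs to know which service fronts the master.
      if [ -z "${COUCHBASE_MASTER:-}" ]; then
        echo "error: COUCHBASE_MASTER must name the master service" >&2
        return 1
      fi
      if [ "${AUTO_REBALANCE:-false}" = "true" ]; then
        echo "add-and-rebalance via ${COUCHBASE_MASTER}:8091"
      else
        echo "add-only via ${COUCHBASE_MASTER}:8091"
      fi
      ;;
    *)
      echo "error: TYPE must be MASTER or WORKER" >&2
      return 1
      ;;
  esac
}
```

With TYPE=MASTER the sketch reports that only the base configuration applies, which matches the behavior described for the first configuration file.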

Let’s create and verify the artifacts.

Create Couchbase master RC:

kubectl create -f cluster-master.yml
replicationcontroller "couchbase-master-rc" created
service "couchbase-master-service" created

List all the services:

kubectl get svc
NAME                       CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
couchbase-master-service   10.0.57.201                 8091/TCP   30s
kubernetes                 10.0.0.1      <none>        443/TCP    5h

The output shows that couchbase-master-service has been created.

Get all the pods:

kubectl get po
NAME                        READY     STATUS    RESTARTS   AGE
couchbase-master-rc-97mu5   1/1       Running   0          1m

A pod is created using the Docker image specified in the configuration file.

Check the RC:

kubectl get rc
NAME                  DESIRED   CURRENT   AGE
couchbase-master-rc   1         1         1m

It shows that the desired and current number of pods in the RC match.
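The same comparison can be scripted. The sketch below assumes the four-column layout (NAME, DESIRED, CURRENT, AGE) shown in this post; other kubectl versions may print additional columns. `rc_settled` is a hypothetical helper name.

```shell
# Read `kubectl get rc` output on stdin and succeed only if DESIRED equals
# CURRENT for every Replication Controller listed.
# Assumes the columns are NAME DESIRED CURRENT AGE, as in this post.
rc_settled() {
  awk 'NR > 1 && $2 != $3 { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Possible usage against a live cluster:
#   kubectl get rc | rc_settled && echo "all RCs have their desired pod count"
```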

Describe the service:

kubectl describe svc couchbase-master-service
Name:                   couchbase-master-service
Namespace:              default
Labels:                 app=couchbase-master-service
Selector:               app=couchbase-master-pod
Type:                   LoadBalancer
IP:                     10.0.57.201
LoadBalancer Ingress:   a94f1f286590c11e68e100283628cd6c-1110696566.us-west-2.elb.amazonaws.com
Port:                   <unset> 8091/TCP
NodePort:               <unset> 30019/TCP
Endpoints:              10.244.2.3:8091
Session Affinity:       None
Events:
  FirstSeen  LastSeen  Count  From                   SubobjectPath  Type    Reason                Message
  ---------  --------  -----  ----                   -------------  ----    ------                -------
  2m         2m        1      {service-controller }                 Normal  CreatingLoadBalancer  Creating load balancer
  2m         2m        1      {service-controller }                 Normal  CreatedLoadBalancer   Created load balancer

Among other details, the address shown next to LoadBalancer Ingress is relevant for us. This address is used to access the Couchbase Web Console.
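The address can also be picked up from a script. Because a newly created load balancer may take a few minutes before it accepts requests, the sketch below pairs a small retry helper with a URL builder. `retry` and `console_url` are hypothetical helper names; the kubectl query in the comments assumes the load balancer reports a hostname (as it does on AWS in this post) rather than an IP address.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Build the Couchbase Web Console URL from the load balancer address.
console_url() {
  echo "http://$1:8091"
}

# Possible usage against a live cluster (roughly 3 minutes of retries):
#   host=$(kubectl get svc couchbase-master-service \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
#   retry 36 5 curl -sf -o /dev/null "$(console_url "$host")"
```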

Wait about 3 minutes for the load balancer to be ready to receive requests. The Couchbase Web Console is accessible at <ip>:8091 and looks like:

The image used in the configuration file is configured with the username Administrator and the password password. Enter these credentials to see the console:

Click on Server Nodes to see how many Couchbase nodes are part of the cluster. As expected, it shows only one node:

Click on Data Buckets to see a sample bucket that was created as part of the image:

This shows the travel-sample bucket is created and has 31,591 JSON documents.

Create Couchbase “worker” Replication Controller

Now, let’s create a worker replication controller. It can be created using the following configuration file:

apiVersion: v1
kind: ReplicationController
metadata:
 name: couchbase-worker-rc
spec:
 replicas: 1
 selector:
   app: couchbase-worker-pod
 template:
   metadata:
     labels:
       app: couchbase-worker-pod
   spec:
     containers:
     - name: couchbase-worker
       image: arungupta/couchbase:k8s
       env:
         - name: TYPE
           value: "WORKER"
         - name: COUCHBASE_MASTER
           value: "couchbase-master-service"
         - name: AUTO_REBALANCE
           value: "false"
       ports:
       - containerPort: 8091

This RC also creates a single replica of Couchbase using the same arungupta/couchbase:k8s image. The key differences here are:

TYPE environment variable is set to WORKER. This causes a worker Couchbase node to be added to the cluster.

COUCHBASE_MASTER environment variable is passed the value of couchbase-master-service. This uses the service discovery mechanism built into Kubernetes so that pods in the worker and the master can communicate.

AUTO_REBALANCE environment variable is set to false. This ensures that the node is only added to the cluster but the cluster itself is not rebalanced. Rebalancing is required to redistribute data across multiple nodes of the cluster. This is the recommended way, as multiple nodes can be added first and then the cluster can be manually rebalanced using the Web Console.
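On the service discovery point: Kubernetes exposes a Service to pods through cluster DNS (when the DNS add-on is running) and through environment variables injected into pods created after the Service exists. The post does not say which of the two the image relies on, so the sketch below simply derives both forms from the Service name. `service_host_var` and `service_fqdn` are hypothetical helper names, and `cluster.local` is assumed as the cluster domain.

```shell
# Name of the environment variable Kubernetes injects for a Service's
# cluster IP: the Service name upper-cased, dashes turned into
# underscores, followed by _SERVICE_HOST.
service_host_var() {
  printf '%s_SERVICE_HOST\n' \
    "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/-/_/g')"
}

# Fully qualified DNS name of a Service.
# Usage: service_fqdn NAME [NAMESPACE] [CLUSTER_DOMAIN]
# Defaults (assumptions): namespace "default", domain "cluster.local".
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}
```

Within the same namespace, the short name couchbase-master-service normally resolves through cluster DNS as well, which is consistent with the value passed in COUCHBASE_MASTER.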

Let’s create a worker:

kubectl create -f cluster-worker.yml
replicationcontroller "couchbase-worker-rc" created

Check the RC:

kubectl get rc
NAME                  DESIRED   CURRENT   AGE
couchbase-master-rc   1         1         6m
couchbase-worker-rc   1         1         22s

A new couchbase-worker-rc is created, and its desired and current number of instances match.

Get all pods:

kubectl get po
NAME                        READY     STATUS    RESTARTS   AGE
couchbase-master-rc-97mu5   1/1       Running   0          6m
couchbase-worker-rc-4ik02   1/1       Running   0          46s

An additional pod is now created. Each pod’s name is prefixed with the corresponding RC’s name. For example, a worker pod is prefixed with couchbase-worker-rc.

The Couchbase Web Console is updated to show that a new Couchbase node has been added. This is evident from the red circle with the number 1 on the Pending Rebalance tab.

Clicking on the tab shows the IP address of the node that needs to be rebalanced:

Scale Couchbase cluster

Now, let’s scale the Couchbase cluster by increasing the number of replicas for the worker RC:

kubectl scale rc couchbase-worker-rc --replicas=3
replicationcontroller "couchbase-worker-rc" scaled

The updated state of the RC shows that 3 worker pods have been created:

kubectl get rc
NAME                  DESIRED   CURRENT   AGE
couchbase-master-rc   1         1         8m
couchbase-worker-rc   3         3         2m

This can be verified again by getting the list of pods:

kubectl get po
NAME                        READY     STATUS    RESTARTS   AGE
couchbase-master-rc-97mu5   1/1       Running   0          8m
couchbase-worker-rc-4ik02   1/1       Running   0          2m
couchbase-worker-rc-jfykx   1/1       Running   0          53s
couchbase-worker-rc-v8vdw   1/1       Running   0          53s

The Pending Rebalance tab of the Couchbase Web Console shows that 3 servers have now been added to the cluster and need to be rebalanced.
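The result of the scaling step can also be checked from a script by counting the running worker pods. The sketch below assumes the `kubectl get po` column layout shown in this post (NAME, READY, STATUS, RESTARTS, AGE); `count_running` is a hypothetical helper name.

```shell
# Read `kubectl get po` output on stdin and print how many pods whose name
# starts with the given prefix are in the Running state.
# Assumes STATUS is the third column, as in the output shown in this post.
count_running() {
  awk -v prefix="$1" '
    NR > 1 && index($1, prefix) == 1 && $3 == "Running" { n++ }
    END { print n + 0 }
  '
}

# Possible usage against a live cluster:
#   kubectl get po | count_running couchbase-worker-rc-
```

Applied to the pod list shown earlier in this section, the prefix couchbase-worker-rc- yields 3.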

Rebalance Couchbase Cluster

Finally, click the Rebalance button to rebalance the cluster. A message window showing the current state of the rebalance is displayed:

Once all the nodes are rebalanced, the Couchbase cluster is ready to serve your requests:

In addition to creating a cluster, Couchbase Server supports a range of high availability and disaster recovery (HA/DR) strategies. Most HA/DR strategies rely on a multi-pronged approach of maximizing availability, increasing redundancy within and across data centers, and performing regular backups.

Now that your Couchbase cluster is ready, you can run your first sample application.

For further information check out the Couchbase Developer Portal and Forums, or see questions on Stack Overflow.

--Arun Gupta, Vice President Developer Relations at Couchbase

Download Kubernetes

Get involved with the Kubernetes project on GitHub

Post questions (or answer questions) on Stack Overflow

Connect with the community on Slack

Follow @Kubernetesio on Twitter for latest updates
