
Fixing network on a very old instance of coreos on digitalocean

Symptoms: You create a new droplet from an old backup/snapshot (old CoreOS). Upon reboot, the network fails, and you can only access the droplet via the console

Solution: Add the following lines to /usr/share/oem/grub.cfg on the droplet:

set oem_id="digitalocean"
set linux_append="coreos.autologin=tty1"


Details here:

Setting up a cluster on Digital Ocean

The following are my trial-and-error logs. They need a major revamp/re-org/edit to be useful.

Errors and Solutions

1. Request timeout on the newly added extra node (after reboot, or during first boot)

2015/06/01 05:58:55 etcdserver: publish error: etcdserver: request timed out


  1. Restart etcd2 on each of the existing nodes
  2. Clean up the new node:
rm -rf /var/lib/etcd2/member
systemctl daemon-reload
systemctl restart etcd2
journalctl -r -u etcd2 | tail -100



To be able to access a node's cluster status from the node itself, you MUST specify ONE of the following two choices in your /run/systemd/system/etcd2.service.d/20-cloudinit.conf:
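The two lines themselves did not survive on this page. Based on standard etcd2 drop-in configuration, they were presumably variants of ETCD_LISTEN_CLIENT_URLS along these lines (the exact values here are an assumption, not from the original):

```ini
[Service]
# Choice 1 (safer): listen on loopback and the private interface only
Environment="ETCD_LISTEN_CLIENT_URLS=http://localhost:2379,http://10.130.126.xx:2379"
# Choice 2 (dangerous): listen on ALL interfaces, including the public IP
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
```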


The second choice/line is dangerous, as it listens on ALL interfaces, including the public IP. Therefore, switch to the first line after you have tested that it is working. Then you will be able to get info from the node itself:

core@mynode:~$ etcdctl cluster-health
cluster is healthy
member a538c1d07c91af51 is healthy
member be7b405835a4c8b1 is healthy
member f596b0c1e1bd345c is healthy
member f923c95736ec9f15 is healthy
member ff6b99200d183463 is healthy
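The health output can also be checked from a script; a minimal sketch, using the sample output above as stand-in input (in practice you would pipe `etcdctl cluster-health` directly):

```shell
# Stand-in for `etcdctl cluster-health` output (sample captured above):
health='cluster is healthy
member a538c1d07c91af51 is healthy
member be7b405835a4c8b1 is healthy
member f596b0c1e1bd345c is healthy'

# Count healthy vs unhealthy members; a non-zero unhealthy count needs attention.
healthy=$(printf '%s\n' "$health" | grep -c '^member .* is healthy$')
unhealthy=$(printf '%s\n' "$health" | grep -c '^member .* is unhealthy$' || true)
echo "healthy=$healthy unhealthy=$unhealthy"
```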

If you don't, and you only specified this:


Then your node is visible only to OTHER members in the cluster, i.e., if you type the following on the node itself

etcdctl cluster-health

You will get the error message:

core@mynode:~$ etcdctl cluster-health
Error:  cannot sync with the cluster using endpoints,

Dealing with an unhealthy node

Hypothesis: If a node becomes unhealthy due to any of the following reasons, it is PERMANENTLY DEAD. You CANNOT revive it; you MUST 1) remove the unhealthy node from the cluster, and 2) add that node back into the cluster under a new memberID (by running etcdctl member remove/add on another HEALTHY cluster node):

  • deleted /var/lib/etcd2/*

Exceptions: under the following situations, an unhealthy node can still be revived:

  • made changes to /run/systemd/system/etcd2.service.d/20-cloudinit.conf
  • systemctl restart etcd2
  • successful restart and running of etcd2
Re-initializing a DEAD node

When you are reincarnating a node (one that was previously part of the cluster under another memberID), apart from removing its old memberID from the cluster, you must also remove the old member data from the node's etcd2 data directory, /var/lib/etcd2/, via

rm -rf /var/lib/etcd2/*

Otherwise you will not be able to start etcd2 via systemctl after you have done a member add, and you will get errors like

Error:  cannot sync with the cluster using endpoints,
dial tcp connection refused

Let's run through the typical flow:

Suppose the reincarnated node has the following info (with its old etcd name):

On ANY healthy node

On any node that is active/healthy in your etcd cluster.

1. Remove the node's old etcd memberID from the cluster. Note that etcdctl member remove takes the member ID (as shown by etcdctl member list), e.g.,

etcdctl member remove ${DEADNODE_MEMBER_ID}

2. Add the same node back into the cluster, which will assign it a new cluster node memberID, e.g.,

etcdctl member add ${DEADNODE} http://${DEADNODE_IP}:2380

Copy the INFO output by the above etcdctl member add command, e.g.,

Added member named baka with ID b28aac5fdc579bb to cluster
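For reference, etcd 2.x's member add also prints ready-made environment lines after that message; a sketch of pulling them out (the values below are hypothetical, mirroring the example above):

```shell
# Hypothetical `etcdctl member add` output, per the example above:
member_add_output='Added member named baka with ID b28aac5fdc579bb to cluster

ETCD_NAME="baka"
ETCD_INITIAL_CLUSTER="baka=http://10.130.126.xx:2380,xiaoqiao=http://10.130.127.xx:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"'

# Keep only the ETCD_* environment lines; each becomes an
# Environment="..." entry in 20-cloudinit.conf on the new node.
env_lines=$(printf '%s\n' "$member_add_output" | grep '^ETCD_')
printf '%s\n' "$env_lines"
```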

On the reincarnated node baka

1. Copy the relevant INFO (from the member add output in step 2 on the healthy node above) into /run/systemd/system/etcd2.service.d/20-cloudinit.conf
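The exact file contents did not survive on this page; based on the etcdctl member add output format, the drop-in presumably ends up looking something like this (names and IPs hypothetical):

```ini
[Service]
Environment="ETCD_NAME=baka"
Environment="ETCD_INITIAL_CLUSTER=baka=http://10.130.126.xx:2380,xiaoqiao=http://10.130.127.xx:2380"
Environment="ETCD_INITIAL_CLUSTER_STATE=existing"
```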


NOTE: There should only be one pair (outermost) of quotes in this line:

Environment="ETCD_INITIAL_CLUSTER=baka ...

For example, the following is WRONG:

Environment="ETCD_INITIAL_CLUSTER="baka ...

2. Remove the old etcd2 member data:

sudo rm -rf /var/lib/etcd2/member

3. Reload systemd and restart etcd2:

systemctl daemon-reload
systemctl restart etcd2

4. Wait for a few minutes for the new node to settle down. Check with

journalctl -r -u etcd2 | more

On ANY node in the cluster

Go to any node and check that the reincarnated node is up and running

etcdctl cluster-health
fleetctl list-machines
  • A node's fleetd machine name is invariant (does not change), while its etcd member name changes after you delete/re-add the node to the cluster
  • You MUST have N=3 nodes up in order for etcd2 to work
    • N=3 is the MINIMUM number for it to work when you generate the discovery token (which assumes 3 by default)
    • However, if you have N=3 but activated 4 nodes, the 4th node will automatically downgrade into a proxy, which forwards traffic between itself and the real 3-node cluster
  • Always generate a new token if ANY ONE of the following conditions applies (TO BE VALIDATED):
    1. ALL the machines in your current cluster (using your current token) are destroyed
    2. you manually stopped the ONE and LAST etcd2 process in the cluster via systemctl stop etcd2
    3. you destroyed one droplet and the total number of nodes in the cluster is now less than 3
  • etcd 2.0 is now in the stable CoreOS channel
  • fleet configuration is not the problem; it is the etcd2 configuration that is causing the problem
    • Restart fleet with
      sudo systemctl restart fleet
      if fleetctl is giving you problems like
core@g00 ~ $ fleetctl list-machines
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd
  • If you encounter errors like "2015/06/01 03:18:16 tocommit(3165157) is out of range [lastIndex(0)]" on a newly added member node, one solution is as follows:
    1. restart etcd2 on each of the existing cluster members
    2. remove NODENAME.default/member from the new node
    3. run /usr/bin/etcd2
  • Removing the member directory data locally will re-init the cluster to a stable state:
    • For /usr/bin/etcd2 manual tests, remove NODENAME.etcd/member
    • For systemctl processes, remove /var/lib/etcd2/member
  • When the manually tested etcd2 is working, copying over the NODENAME.etcd/member directory to /var/lib/etcd2/member, changing permissions via
    chown -R etcd:etcd /var/lib/etcd2
    and restarting etcd2 could help
  • During manual etcd2 testing, if you encounter errors like "etcdserver: send message to unknown receiver f7781b4bca87cf0e", delete the NODENAME.etcd/member directory and restart /usr/bin/etcd2


Problems with cloud-config

To debug problems with cloud-config itself, examine the coreos-cloudinit logs:

journalctl _EXE=/usr/bin/coreos-cloudinit

Examining the etcd2 logs will help you debug problems with starting etcd2 via systemctl:

journalctl -r -u etcd2
  • -r: list entries in reverse chronological order
  • -u: restrict output to the etcd2 unit

Adding a new NODE to a Full Cluster

  • After the initial 3 nodes are up and running in the cluster, any additional new nodes (4th, 5th, etc.) become proxies by default
  • To make the new/extra node a full member of the cluster (instead of becoming a proxy), you need to manually add the new node to the cluster by running the following command on any *existing/running* cluster/proxy node that is already aware of the cluster:
etcdctl member add initial_random_name_will_eventually_be_replaced_with_hash http://IPADDRESS_OF_NEW_NODE:2380

Then, on the NEW node, modify /run/systemd/system/etcd2.service.d/20-cloudinit.conf as follows:





then restart the etcd2 system service on the NEW node:

systemctl daemon-reload
rm -rf /var/lib/etcd2/proxy
systemctl restart etcd2

To test the above manually, shut down the existing etcd2 service, then run the following:

sudo bash
systemctl stop etcd2
export ETCD_NAME="g00"
export ETCD_INITIAL_CLUSTER="xiaoqiao=http://10.130.127.xx:2380,daqiao=http://10.130.204.xx:2380,diaochan=http://10.130.126.xx:2380,g00=http://10.130.126.xx:2380"
export ETCD_ADVERTISE_CLIENT_URLS=http://10.130.126.xx:2379
export ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.130.126.xx:2380
export ETCD_LISTEN_PEER_URLS=http://10.130.126.xx:2380
/usr/bin/etcd2   # then launch etcd2 manually with the settings above


#cloud-config
coreos:
  update:
    reboot-strategy: off
    #reboot-strategy: best-effort
  etcd2:
    # generate new token from
    # MUST regenerate token if all machines in a cluster have died

    # ADVERTISE REQUIRES http://localhost:2379 to be PRESENT!!!!
    advertise-client-urls: http://$private_ipv4:2379,http://localhost:2379
    # multi-region and multi-cloud deployments need to use $public_ipv4
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-peer-urls: http://$private_ipv4:2380
  fleet:
    public-ip: $private_ipv4
    metadata: region=singapore
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start


You should be able to start everything manually on your cluster node, via the following steps:

1. Make sure etcd2 is running, and that setting/getting a key/value works

sudo bash
systemctl stop etcd2 # stop etcd2 daemon
/usr/bin/etcd2 # start etcd2 manually
etcdctl set stupidkey shit # setting a key to a value
etcdctl get stupidkey # retrieving the key, you should get 'shit'

2. list machines with fleet

  • MUST restart fleet
  • You should be able to see your machines in the cluster, e.g.:
# systemctl restart fleet
# fleetctl list-machines
34838551...	region=singapore

3. If the above 2 steps work, it means that the problems you encountered are due to misconfiguration AND/OR an expired token, and you should be able to fix them by modifying the etcd2 config files in /run/systemd/system/etcd2.service.d

My experiments give me the following list of working and not-working configurations. This configuration is tested to work on a 1-node cluster only. TODO: test on a multi-node cluster.


# WORKING etcd 2.0.1 WITH NO CONFIG ------------------
# etcdctl works WITH BLANK etcd2 configuration on local cluster node

# WORKING ---------------------------------





# TEST ------------------------------------

# NOT WORKING -----------------------------



After editing, you must run

systemctl daemon-reload

VERY IMPORTANT: every time you modify your 20-cloudinit.conf (the same applies when creating a new cluster via cloud-config), you MUST run systemctl daemon-reload before restarting etcd2; otherwise you will encounter problems when restarting etcd2, like this:
xiaoqiao core # systemctl restart etcd2
xiaoqiao core # systemctl status -l etcd2
● etcd2.service - etcd2
   Loaded: loaded (/usr/lib64/systemd/system/etcd2.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/etcd2.service.d
   Active: activating (auto-restart) (Result: exit-code) since Mon 2015-05-18 02:38:59 UTC; 5s ago
  Process: 1637 ExecStart=/usr/bin/etcd2 (code=exited, status=1/FAILURE)
 Main PID: 1637 (code=exited, status=1/FAILURE)
   CGroup: /system.slice/etcd2.service

May 18 02:38:59 xiaoqiao systemd[1]: etcd2.service: main process exited, code=exited, status=1/FAILURE
May 18 02:38:59 xiaoqiao systemd[1]: Unit etcd2.service entered failed state.
May 18 02:38:59 xiaoqiao systemd[1]: etcd2.service failed.

Check that etcd2 is running

1. Log into one of the cluster nodes, then

sudo bash

2. check etcd2 status

systemctl status -l etcd2

3. Test your etcd2 by writing something to it (on one of the cluster nodes)

etcdctl set shit bar

If you cannot write, then something is wrong

4. If something is wrong, modify the config settings in /run/systemd/system/etcd2.service.d/20-cloudinit.conf then

systemctl restart etcd2


If etcd2 is running properly, fleet will list your machines, e.g.,

localhost core # fleetctl list-machines
34838551...	region=singapore

otherwise it will complain

fleetctl list-machines
Error retrieving list of active machines: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}

If etcd2 is running on a port that fleet does NOT know, you will get errors like so:

localhost core # systemctl status -l fleet
● fleet.service - fleet daemon
   Loaded: loaded (/usr/lib64/systemd/system/fleet.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/fleet.service.d
   Active: active (running) since Mon 2015-05-18 02:45:55 UTC; 27s ago
 Main PID: 1800 (fleetd)
   CGroup: /system.slice/fleet.service
           └─1800 /usr/bin/fleetd

May 18 02:46:03 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:4001/: dial tcp connection refused
May 18 02:46:04 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
May 18 02:46:08 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:4001/: dial tcp connection refused
May 18 02:46:09 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
May 18 02:46:09 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:4001/: dial tcp connection refused
May 18 02:46:10 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
May 18 02:46:18 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:4001/: dial tcp connection refused
May 18 02:46:19 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
May 18 02:46:19 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:4001/: dial tcp connection refused
May 18 02:46:20 xiaoqiao fleetd[1800]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
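In that situation, pointing fleet at the correct etcd endpoint via a drop-in may help; a sketch using fleet's standard FLEET_ETCD_SERVERS variable (the path mirrors the etcd2 drop-in used elsewhere on this page):

```ini
# /run/systemd/system/fleet.service.d/20-cloudinit.conf
[Service]
Environment="FLEET_ETCD_SERVERS=http://localhost:2379"
```

Follow this with sudo systemctl daemon-reload and sudo systemctl restart fleet.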

Fixing mix of public / private IP addresses

If you initialized some of your nodes with metadata and cloud-config, while using DO's defaults for others, you may get a mix of public and private IP addresses in your fleetctl list-machines output, e.g.,

2fc73348...	region=singapore
7f389c85...	-
892c144a...	region=singapore
b84743c0...	-
fcd1b454...	region=singapore


On those machines with proper initialization, do the following:

sudo systemctl cat fleet.service

to examine the init file.

On those machines without proper initialization,

  1. Create a file /run/systemd/system/fleet.service.d/20-cloudinit.conf as follows:
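The file contents did not survive on this page; a plausible sketch using fleet's standard environment variables (the IPs below match the placeholders described next):

```ini
[Service]
Environment="FLEET_PUBLIC_IP=ppp.ppp.ppp.ppp"
Environment="FLEET_METADATA=region=singapore,public_ip=PPP.PPP.PPP.PPP"
```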


ppp.ppp.ppp.ppp is your desired private IP; PPP.PPP.PPP.PPP is an optional public IP you can put into the metadata. Then restart fleet on that machine:

sudo systemctl daemon-reload
sudo systemctl restart fleet

Disabling auto reboots on Digital Ocean

Several of my instances rebooted by themselves at weird times of the day, interrupting my services. I had no choice but to disable that:

Two methods

1. When creating a new instance, input the cloud-config by specifying user-data in your VPS setup:

    reboot-strategy: off

2. On an already running instance, modify /etc/coreos/update.conf by adding the following line:

REBOOT_STRATEGY=off
Note: the default is REBOOT_STRATEGY=best-effort

Your GROUP=XXXX might be different from mine; keep that GROUP=XXXX line as it is

Thereafter, you may need to restart these 2 services (UNVERIFIED):

sudo systemctl restart update-engine
sudo systemctl restart locksmithd

Automatic reboots should now be disabled, but automatic updates will still happen. You can check the update status via this command:

journalctl -f -u update-engine