Thursday, September 15, 2016

DC/OS install prereq run and ssh timeouts

We were running the DC/OS installer as per this doc: https://dcos.io/docs/1.7/administration/installing/custom/cli/
The problem appeared at the following step:

4. Run a preflight script to validate that your cluster is installable.
$ sudo bash dcos_generate_config.sh --preflight

We were repeatedly hitting this error on the bootstrap node:

run_command_chain_async executed

====> OUTPUT FOR run_preflight
====> XX.XX.XX.XX:22 FAILED
     CODE:
255
     TASK:
/usr/bin/ssh -oConnectTimeout=10 -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oBatchMode=yes -oPasswordAuthentication=no -p22 -i /genconf/ssh_key -tt mesos@XX.XX.XX.XX echo INSTALL PREREQUISITES
 /opt/dcos_install_tmp
     STDERR:
          ssh: connect to host XX.XX.XX.XX port 22: Operation timed out
     STDOUT:

However, the same command worked without any issues when run directly from a shell session on the same node. After a while, we realized that the installer actually runs this command from within a Docker container that it launches during the install. For that to work, the bootstrap node has to act as a router between the Docker container subnet and the actual DC/OS master nodes, which means IP forwarding must be enabled on it. Our bootstrap node had this disabled:


net.ipv4.ip_forward = 0

Setting the value above to 1 fixed the issue and the install proceeded. This is a known issue in Docker: https://github.com/docker/docker/issues/490#issuecomment-19487335
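
For reference, here is a minimal sketch of how the setting can be checked and changed on a typical Linux host. The helper name ip_forward_status is made up for this example, and it assumes the live value is exposed at /proc/sys/net/ipv4/ip_forward and that persistent settings live in /etc/sysctl.conf (some distributions use files under /etc/sysctl.d/ instead).

```shell
# Hypothetical helper: report whether IPv4 forwarding is on.
# The optional argument is the file to read; it defaults to the
# kernel's live setting under /proc.
ip_forward_status() {
    local f="${1:-/proc/sys/net/ipv4/ip_forward}"
    if [ "$(cat "$f" 2>/dev/null)" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

ip_forward_status

# To enable forwarding until the next reboot (requires root):
#   sudo sysctl -w net.ipv4.ip_forward=1
# To persist it (assumed location), add this line to /etc/sysctl.conf:
#   net.ipv4.ip_forward = 1
# and reload with:
#   sudo sysctl -p
```

The sysctl commands are left as comments so that the snippet only reads the current state and does not change anything on the host by itself.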
