Troubleshooting (AEN 4.1.2)

Overview

This is a troubleshooting guide for an Anaconda Enterprise Notebooks deployment.

Normal Operation

Server

Anaconda Enterprise Notebooks Server is installed in /opt/wakari/wakari-server.

You can get the status of the server processes with:

# service wakari-server status
wk-server                        RUNNING    pid 20758, uptime 5 days, 0:30:23
worker                           RUNNING    pid 20757, uptime 5 days, 0:30:23

or:

root@server # ps -Hu wakari
  PID TTY          TIME CMD
20756 ?        00:02:26 .supervisord
20757 ?        00:05:58   mtq-worker
20758 ?        00:00:08   wk-server
20765 ?        00:02:00     wk-server
20766 ?        00:01:55     wk-server
20767 ?        00:02:20     wk-server
20770 ?        00:02:02     wk-server

supervisord details

description    Manages wakari-worker and multiple processes of wk-server
user           wakari
configuration  /opt/wakari/wakari-server/etc/supervisord.conf
log            /opt/wakari/wakari-server/var/log/supervisord.log
control        service wakari-server
ports          none

wk-server details

description    Handles user interaction and passes jobs on to the wakari gateway. Access to it is managed by nginx.
user           wakari
command        /opt/wakari/wakari-server/bin/wk-server
configuration  /opt/wakari/wakari-server/etc/wakari/
control        service wakari-server
logs           /opt/wakari/wakari-server/var/log/wakari/server.log
ports          5000 (only on localhost)

wakari-worker details

description    Asynchronously executes tasks from wk-server
user           wakari
logs           /opt/wakari/wakari-server/var/log/wakari/worker.log
control        service wakari-server

nginx details

description    Serves static files and proxies all other requests to the wk-server process running on port 5000.
user           nginx
configuration  /etc/nginx/nginx.conf
               /opt/wakari/wakari-server/etc/conf.d/www.enterprise.conf
logs           /var/log/nginx/woc.log
               /var/log/nginx/woc-error.log
control        service nginx status
port           80

Nginx runs at least two processes:

  • a master process running as the root user
  • worker processes running as the nginx user

Gateway

Anaconda Enterprise Notebooks Gateway is installed in /opt/wakari/wakari-gateway.

You can get the status of the gateway processes with:

# service wakari-gateway status
wk-gateway                       RUNNING    pid 1137, uptime 5 days, 1:59:28

or:

root@gateway # ps -Hu wakari
  PID TTY          TIME CMD
 1136 ?        00:01:59 .supervisord
 1137 ?        00:00:02   wk-gateway

supervisord details

description    Manages the wk-gateway process.
user           wakari
configuration  /opt/wakari/wakari-gateway/etc/supervisord.conf
log            /opt/wakari/wakari-gateway/var/log/supervisord.log
control        service wakari-gateway
ports          none

wakari-gateway details

description    Passes requests from Anaconda Enterprise Notebooks Server to the Compute Nodes.
user           wakari
configuration  /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
logs           /opt/wakari/wakari-gateway/var/log/wakari/gateway.application.log
               /opt/wakari/wakari-gateway/var/log/wakari/gateway.log
working dir    / (root)
port           8089 (webcache)

Compute Node

Anaconda Enterprise Notebooks Compute is installed in /opt/wakari/wakari-compute.

You can get the status of the compute node processes with:

# service wakari-compute status
wk-compute                       RUNNING    pid 22050, uptime 3 days, 1:03:19

or:

root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:02:01 .supervisord
 1152 ?        00:00:01   wk-compute

wk-compute will load each of these configuration files, in order:

  • /etc/wakari/config.json
  • /etc/wakari/compute-launcher-config.json
  • ./compute-launcher-config.json
  • Config file specified by -c option

If an option is specified in multiple files, the last one encountered takes precedence.
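
Because the last file in the load order wins, it can help to find out which file is the last one to mention a given option. The following is a minimal sketch, not part of the product: the function name last_file_defining is hypothetical, and it relies on a plain text match of the quoted key name rather than real JSON parsing, so a nested key with the same name would also match.

```shell
# Print the last file, in the order given, that mentions the JSON key.
# Usage: last_file_defining KEY FILE...
# Exit status is non-zero when no file mentions the key.
last_file_defining() {
    lfd_key=$1
    shift
    lfd_last=""
    for lfd_file in "$@"; do
        # Files that do not exist are simply skipped in this sketch.
        [ -f "$lfd_file" ] || continue
        # -F: match the quoted key literally, not as a regular expression.
        if grep -qF -- "\"$lfd_key\"" "$lfd_file"; then
            lfd_last=$lfd_file
        fi
    done
    [ -n "$lfd_last" ] && printf '%s\n' "$lfd_last"
}
```

For example, last_file_defining projectRoot /etc/wakari/config.json /etc/wakari/compute-launcher-config.json ./compute-launcher-config.json prints the file whose projectRoot value would take effect, provided the files are passed in the load order listed above.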

supervisord details

description    Manages the wk-compute process.
user           wakari
configuration  /opt/wakari/wakari-compute/etc/supervisord.conf
log            /opt/wakari/wakari-compute/var/log/supervisord.log
control        service wakari-compute
working dir    /opt/wakari/wakari-compute/etc
ports          none

wk-compute details

description    Launches compute processes
user           wakari
configuration  /opt/wakari/wakari-compute/etc/wakari/wk-compute-launcher-config.json
               /opt/wakari/wakari-compute/etc/wakari/scripts/config.json
logs           /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.application.log
               /opt/wakari/wakari-compute/var/log/wakari/compute-launcher.log
working dir    / (root)
control        service wakari-compute
port           5002 (rfe)

Projects and Permissions

Projects live in the projectRoot folder on the compute node (by default, /projects). The project directory is created the first time the project is started; the start-project script clones it from /opt/wakari/wakari-compute/lib/node_modules/wakari-compute-launcher/skeleton.

Project directory permissions are as follows:

owner: rwx, user who created the project
group: rwx, owner's group
other: --x, to allow access to the Public folder
ACL:   rwx for any other team members

Files and subdirectories within the project directory have the same permissions as the project directory, except:

  1. The public folder and everything in it are world readable.
  2. Any files hardlinked into the root anaconda environment (/opt/wakari/anaconda) remain owned by the root or wakari users.

Project file and directory permissions are maintained by the start-project script. All files and directories in the project will have their permissions set when the project is started, except for files owned by root or the AEN_SRVC_ACCT user (usually wakari or aen_admin). Files owned by root or the AEN_SRVC_ACCT user do not have their permissions changed, in order to avoid changing the permissions of the linked files in /opt/wakari/anaconda.

CAUTION: DO NOT start a project as the AEN_SRVC_ACCT user (usually wakari or aen_admin). The permissions system will not correctly manage project files owned by this user.
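
The owner/group/other layout listed above corresponds to mode 771 on the project directory. The sketch below checks only those mode bits; it assumes GNU stat (for the -c option), the function name check_project_mode is hypothetical, and any special bits such as setgid would appear as a fourth octal digit and be reported as a mismatch. ACL entries for team members are not inspected here; getfacl can be used for that.

```shell
# Report whether a project directory has the rwxrwx--x (771) mode bits.
# Usage: check_project_mode DIR
check_project_mode() {
    # GNU stat: %a prints the permission bits in octal, %U the owner name.
    cpm_mode=$(stat -c '%a' "$1") || return 2
    cpm_owner=$(stat -c '%U' "$1")
    if [ "$cpm_mode" = "771" ]; then
        printf 'OK %s owner=%s mode=%s\n' "$1" "$cpm_owner" "$cpm_mode"
    else
        printf 'MISMATCH %s owner=%s mode=%s (expected 771)\n' "$1" "$cpm_owner" "$cpm_mode"
        return 1
    fi
}
```

The owner printed in the report can be compared by eye against the user who created the project.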

General Troubleshooting Steps

Ensure that the Anaconda Enterprise Notebooks services are set to start at boot

(on all 3 components: Server, Gateway, and Compute nodes)

chkconfig --list | grep wakari

If they are missing, you can try adding them with:

chkconfig --add [wakari-server|wakari-gateway|wakari-compute]

Then services can be started safely with the restart command as follows:

service wakari-server restart
service wakari-gateway restart
service wakari-compute restart

These commands need to be executed on the appropriate nodes.
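
The boot-time check can be scripted so that it names the missing services directly. This is a sketch under stated assumptions: the function name missing_boot_services is hypothetical, it reads chkconfig --list output on standard input, and like the grep above it only checks that a service is listed, not which runlevels are on.

```shell
# Read `chkconfig --list` output on stdin and print each named service
# that has no entry. Exit status is non-zero if any service is missing.
# Usage: chkconfig --list | missing_boot_services SERVICE...
missing_boot_services() {
    mbs_listing=$(cat)
    mbs_status=0
    for mbs_svc in "$@"; do
        # The service name is the first whitespace-delimited field.
        if ! printf '%s\n' "$mbs_listing" |
            awk -v s="$mbs_svc" '$1 == s { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$mbs_svc"
            mbs_status=1
        fi
    done
    return $mbs_status
}
```

On a machine that hosts only the Server component, chkconfig --list | missing_boot_services wakari-server is sufficient; pass all three service names on a single-machine install.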

Ensure that all services are running

(see Normal Operation, above).

# service wakari-server status
wk-server                        RUNNING    pid 20758, uptime 5 days, 0:30:23
worker                           RUNNING    pid 20757, uptime 5 days, 0:30:23

root@server # service nginx status
nginx (pid  26303) is running...

# service wakari-gateway status
wk-gateway                       RUNNING    pid 1137, uptime 5 days, 1:59:28

# service wakari-compute status
wk-compute                       RUNNING    pid 22050, uptime 3 days, 1:03:19

If any of the processes are missing, restart them using the commands above.
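
To spot a stopped process without reading the status output line by line, the state column can be filtered. A minimal sketch, assuming the output of the wakari-* services keeps the three-column shape shown above (name, state, details); the function name not_running is hypothetical, and the different format printed by service nginx status is not handled.

```shell
# Read `service wakari-... status` output on stdin and print every
# process whose state column is not RUNNING.
# Exit status is non-zero if at least one such process was found.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 " is " $2; bad = 1 }
         END { exit bad + 0 }'
}
```

For example, service wakari-server status | not_running prints nothing when both wk-server and worker are running.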

Check for Extraneous Processes

Use ps -Hu wakari to get a complete list of the processes running under the wakari user account.

root@server # ps -Hu wakari
  PID TTY          TIME CMD
20756 ?        00:02:26 .supervisord
20757 ?        00:05:58   mtq-worker
20758 ?        00:00:08   wk-server
20765 ?        00:02:00     wk-server
20766 ?        00:01:55     wk-server
20767 ?        00:02:20     wk-server
20770 ?        00:02:02     wk-server

root@server # ps -f -C nginx
UID        PID  PPID  C STIME TTY          TIME CMD
root     26303     1  0 12:18 ?        00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    26305 26303  0 12:18 ?        00:00:00 nginx: worker process

root@gateway # ps -Hu wakari
  PID TTY          TIME CMD
 1136 ?        00:01:59 .supervisord
 1137 ?        00:00:02   wk-gateway

root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:02:01 .supervisord
 1152 ?        00:00:01   wk-compute

What’s normal:

  • The wk-server, wk-gateway, and wk-compute processes should have the PIDs reported by supervisorctl.
  • The nginx master process should have the PID reported by service nginx status.
  • If you have installed more than one Anaconda Enterprise Notebooks component on a single machine, the processes from all of the installed components will show up on that machine.
  • On the Compute node, any Anaconda Enterprise Notebooks applications currently being run by users will be present. For example:
root@compute # ps -Hu wakari
  PID TTY          TIME CMD
 1150 ?        00:00:00 .supervisord
 1152 ?        00:00:00   wk-compute
 1340 ?        00:00:00 bash
 1341 ?        00:00:00   notebookwrapper

If extra wk-server, wk-gateway, wk-compute, or supervisord processes are present, use the kill command to remove them. Then restart the services using service SERVICE_NAME restart as described above.
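
In the listings above, each installed component shows exactly one .supervisord process. Under that assumption, a quick way to spot a leftover daemon is to list the supervisord PIDs and compare their number with the number of components installed on the machine. The function name supervisord_pids is hypothetical.

```shell
# Read `ps -Hu wakari` output on stdin and print the PID of every
# supervisord process (the command column is the last field).
supervisord_pids() {
    awk '$NF == ".supervisord" || $NF == "supervisord" { print $1 }'
}
```

For example, ps -Hu wakari | supervisord_pids should print a single PID on a machine that hosts only one component.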

Check connectivity between the servers

Server to Gateways

On the Server, navigate to Admin/Data Centers. For each data center in the list, check connectivity from the server to that gateway (in this example, the gateway is http://gateway.example.com:8089):

root@server # curl --connect-timeout 5 http://gateway.example.com:8089 > /dev/null

Gateways to Compute Nodes

On the Server, navigate to Admin/Enterprise Resources. For each compute resource in the list, open it and check the contents of the URL field to ensure that it begins with either “http” or “https”. Check connectivity to that URL from the corresponding Gateway. For example, if the URL is http://compute.example.com:5002:

root@gateway # curl --connect-timeout 5 http://compute.example.com:5002 > /dev/null

Gateways to Server

This path is used by the gateway configuration command wk-gateway-configure. First, ensure that the gateway is linked to the correct server in the configuration file and that the full server URL is specified. Then check connectivity to the server.

root@gateway # grep WAKARI_SERVER /opt/wakari/wakari-gateway/etc/wakari/wk-gateway-config.json
  "WAKARI_SERVER": "http://wakari.example.com",

root@gateway # curl --connect-timeout 5 http://wakari.example.com > /dev/null
root@gateway # curl --connect-timeout 5 http://error.example.com > /dev/null
curl: (7) Failed to connect to error.example.com port 80: Connection refused
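
When several URLs have to be probed, the curl checks above can be wrapped so that each one yields a one-line verdict. This is a sketch, not a supported tool: the function name check_url is hypothetical, and because curl is called without --fail, any HTTP response (including an error page) counts as reachable.

```shell
# Probe a URL with the same connect timeout as the commands above and
# print a one-line verdict. Only connection-level failures (refused,
# timeout, DNS) are reported as FAIL.
# Usage: check_url URL
check_url() {
    if curl --silent --output /dev/null --connect-timeout 5 "$1"; then
        printf 'OK %s\n' "$1"
    else
        cu_rc=$?   # exit status of the curl command in the condition
        printf 'FAIL %s (curl exit code %s)\n' "$1" "$cu_rc"
        return 1
    fi
}
```

For example, run check_url http://gateway.example.com:8089 on the Server and check_url http://compute.example.com:5002 on the Gateway. Exit code 7 corresponds to the "Failed to connect" message shown above.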

If a connection fails, check the following items:

  • Ensure that Gateways (Data Centers) and Compute nodes (Enterprise Resources) are correctly configured on the server.
  • Verify that processes are listening on the configured ports:
root@server # netstat -plt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 *:http                      *:*                         LISTEN      26409/nginx
tcp        0      0 *:ssh                       *:*                         LISTEN      986/sshd
tcp        0      0 localhost:smtp              *:*                         LISTEN      1063/master
tcp        0      0 *:commplex-main             *:*                         LISTEN      26192/python
tcp        0      0 localhost:27017             *:*                         LISTEN      29261/mongod
tcp        0      0 *:ssh                       *:*                         LISTEN      986/sshd
tcp        0      0 localhost:smtp              *:*                         LISTEN      1063/master
  • Check firewall settings/logs on both hosts to ensure that packets are not being blocked or discarded.
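
The service names that netstat prints can be hard to map back to the port numbers given in this guide (80 for nginx, 5000 for wk-server, 8089 for the gateway, 5002 for wk-compute). A minimal sketch that checks for one numeric port, assuming netstat is run with -n so that addresses are numeric; the function name is_listening is hypothetical.

```shell
# Read `netstat -lnt` output on stdin and succeed if some socket is
# listening on the given TCP port. With -n the local address is
# numeric, for example 0.0.0.0:80 or :::80.
# Usage: netstat -lnt | is_listening PORT
is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is whatever follows the last colon of the local address.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }'
}
```

For example, netstat -lnt | is_listening 8089 && echo listening can be run on the Gateway.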

Check Configuration File Syntax

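Use these commands to verify that each configuration file contains valid JSON. json.tool accepts only one input file per invocation and treats a second file argument as the output file, so each file is checked in its own call:

root@server  # for f in /opt/wakari/wakari-server/etc/wakari/*.json; do python -m json.tool "$f"; done
root@gateway # for f in /opt/wakari/wakari-gateway/etc/wakari/*.json; do python -m json.tool "$f"; done
root@compute # for f in /opt/wakari/wakari-compute/etc/wakari/*.json; do python -m json.tool "$f"; done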

If the file is correct, the contents will be displayed. If there is a syntax error in the file, the message No JSON object could be decoded will be displayed instead. Edit the configuration file, ensuring correct JSON syntax.
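
When a directory holds more than one configuration file, a per-file verdict is easier to read than the full contents of each file. The sketch below is one way to get that; the function name validate_json_dir is hypothetical, and falling back to python3 when no python executable is on the PATH is an assumption, not something the product requires.

```shell
# Validate every *.json file in a directory, one file per invocation,
# and print OK or INVALID for each without dumping its contents.
# Usage: validate_json_dir DIR
validate_json_dir() {
    # Assumption: use python3 if no `python` executable is available.
    vjd_py=$(command -v python || command -v python3) || return 2
    vjd_status=0
    for vjd_file in "$1"/*.json; do
        # An unmatched glob stays literal; skip it.
        [ -f "$vjd_file" ] || continue
        if "$vjd_py" -m json.tool "$vjd_file" >/dev/null 2>&1; then
            printf 'OK %s\n' "$vjd_file"
        else
            printf 'INVALID %s\n' "$vjd_file"
            vjd_status=1
        fi
    done
    return $vjd_status
}
```

For example, validate_json_dir /opt/wakari/wakari-compute/etc/wakari on the Compute node. Run json.tool directly on any file reported as INVALID to see the parser's error message.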

Check file ownership

Verify that all files in /opt/wakari/anaconda belong to user/group wakari:

root@server # find /opt/wakari/anaconda \! -user wakari -print
root@server # find /opt/wakari/anaconda \! -group wakari -print

If any files are listed in the output, fix their ownership:

chown -R wakari:wakari /opt/wakari/anaconda
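
The two find commands can be combined into a single pass, which is convenient for reviewing exactly what a recursive chown would change before running it. A minimal sketch; the function name list_foreign_files is hypothetical.

```shell
# List every path under DIR that is not owned by USER or is not in GROUP.
# Usage: list_foreign_files DIR USER GROUP
list_foreign_files() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}
```

For example, list_foreign_files /opt/wakari/anaconda wakari wakari prints nothing when ownership is already correct.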

Verify that POSIX ACLs are enabled

The acl option must be enabled on the filesystem containing the project root directory.

First, determine the project root directory. If a custom projectRoot is configured, you can determine it with:

root@compute # grep projectRoot /opt/wakari/wakari-compute/etc/wakari/config.json

If not, the project root is /projects.

Either the mount options or default options listed by tune2fs should indicate the acl option is enabled.

root@compute # fs=`df /projects | tail -1 | cut -d " " -f 1`
root@compute # mount | grep $fs
/dev/vda on / type ext4 (rw)
root@compute # tune2fs -l $fs | grep options
Default mount options:    user_xattr acl
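
In the example above the mount output shows only (rw), but the tune2fs default options include acl, so ACLs are enabled. The sketch below makes that test explicit for either kind of option list; the function name has_acl_option is hypothetical. It matches the whole word acl, so an option such as noacl does not count.

```shell
# Succeed if the word "acl" appears in a list of mount options.
# Accepts comma-separated lists as printed by mount, for example
# "(rw,acl)", and space-separated lists as printed by tune2fs.
# Usage: has_acl_option "OPTIONS"
has_acl_option() {
    # Put each option on its own line, then look for an exact match.
    printf '%s\n' "$1" | tr ', ()' '\n\n\n\n' | grep -qx 'acl'
}
```

For example, has_acl_option "$(tune2fs -l $fs | grep 'Default mount options' | cut -d: -f2)" && echo "acl enabled" reuses the fs variable set above.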

Clear Browser Cookies

When the Anaconda Enterprise Notebooks configuration changes, or the software is upgraded, cookies remaining in the browser can cause issues. Clearing cookies and logging in again can help to resolve problems.

Specific Problems

Problem:   Browser indicates “too many redirects”
Cause:     Cookies are out of date
Solution:  Clear your browser’s cookies and cache, then try again.

Problem:   supervisorctl error: “unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file”
Cause:     “supervisord” is not running on the Server
Solution:  Ensure that supervisord is included in the crontab, as described above. Then start supervisord manually.

Problem:   Data Center Not Found message when deleting a project
Cause:     Datacenter has already been removed
Solution:  As root, run /opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project>

Problem:   Forgotten administrator password
Solution:  Use ssh to log in to the server as root, and run the command /opt/wakari/wakari-server/bin/wk-server-admin add-user wakari --admin -p <new password> -e <your email>. You can then log in to Anaconda Enterprise Notebooks as the wakari user with the new password you chose.

Logs

The locations of the Anaconda Enterprise Notebooks log files for each process and application are shown in the tables above.

The Anaconda Enterprise Notebooks installers write their logs to /tmp/wakari_{server,gateway,compute}.log.

If log files grow too large they can be deleted. To set the logs to be more or less verbose, the Jupyter Notebook system has a setting ‘Application.log_level’. Setting ‘Application.log_level’ to ‘ERROR’ will make the logs less verbose than the default but still fairly informative.

Killed supervisord and “Error: This socket is closed.”

When the supervisor daemon, supervisord, is killed, information sent to standard output (stdout) and standard error (stderr) is held in a pipe, which eventually fills up. After that, any attempt to start an app fails with the error message “This socket is closed.”

To prevent this problem, always shut down and restart the processes cleanly and do not shut down or kill supervisord without first shutting down wk-compute and other processes that use it.

To recover from this problem, shut down the process “wk-compute” with sudo kill -9. Then restart the supervisord and wk-compute processes:

sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start

Service Error 502: Can not connect to the application manager

When a gateway node shows this error it means that a compute resource is not responding.

This error occurs when the wk-compute process has been shut down. To recover from this problem, restart the supervisord and wk-compute processes:

sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start

“502 Communication Error” on Amazon Web Services

If you see a page showing “502 Communication Error: This gateway could not communicate with the Wakari server” and the IP address of the Wakari server, configure the AEN gateway to use the DNS hostname of the server. On Amazon Web Services (AWS) this will be the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.

Invalid usernames

The first character of a username must be a letter [a-z] or a digit [0-9].

Each other character in a username may be a letter [a-z], a digit [0-9], a period [.], an underscore [_], or a hyphen [-].

The POSIX standard specifies that these characters are the portable filename character set, and that portable usernames have the same character set.

An Anaconda Enterprise Notebooks username should be at least 3 characters and no more than 25 characters.
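
The username rules above can be checked before an account is created. A minimal sketch, assuming that [a-z] means lowercase letters only, as written, and that the name is a single-line string; the function name valid_username is hypothetical.

```shell
# Succeed if NAME satisfies the rules above: 3 to 25 characters, first
# character a lowercase letter or digit, the rest from [a-z0-9._-].
# Usage: valid_username NAME
valid_username() {
    # LC_ALL=C keeps the a-z range from matching uppercase letters in
    # some locales; -x requires the whole line to match.
    printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z0-9][a-z0-9._-]{2,24}'
}
```

For example, valid_username "j.smith-2" succeeds, while valid_username ".hidden" and valid_username "ab" both fail.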