Troubleshooting

This guide describes how to diagnose and resolve issues that may occur with your AEN installation.

General troubleshooting steps

  1. Clear browser cookies. When you change the AEN configuration or upgrade AEN, cookies remaining in the browser can cause issues. Clearing cookies and logging in again can help to resolve problems.
  2. Make sure NGINX and MongoDB are running.
  3. Make sure that AEN services are set to start at boot, on all nodes.
  4. Make sure that services are running as expected. If any services are not running or are missing, restart them.
  5. Check for and remove extraneous processes.
  6. Check the connectivity between nodes.
  7. Check the configuration file syntax.
  8. Check file ownership.
  9. Verify that POSIX ACLs are enabled.

Browser error: too many redirects

Cause

Browser cookies are out of date.

Solution

  1. Log out.
  2. Clear the browser’s cookies.
  3. Clear the browser cache.
  4. Log in.

Browser error: too many redirects when starting project apps

The browser shows “Too many redirects” when a user tries to start an application.

Cause

The project’s Compute Resource is invalid or was deleted.

Exception: exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'

This exception appears on the Admin > Exceptions page when a project does not have a Compute Resource assigned.

Cause

The project’s Compute Resource is invalid or was deleted.

Error: unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file

This is a supervisorctl error.

Cause

supervisord is not running on the Server.

Solution

Ensure that supervisord is included in the crontab. Then restart supervisord manually.
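
Before restarting anything, it can help to confirm this cause by checking whether the socket file named in the error exists and whether a crontab entry mentions supervisord. The following is a minimal sketch, not part of AEN: the function names are hypothetical, the socket path is copied from the error message above, and it assumes the entry lives in the crontab of the user who runs the check.

```shell
# Hypothetical helpers; not shipped with AEN.

# Default socket path, copied from the supervisorctl error message.
SUPERVISOR_SOCK=/opt/wakari/wakari-server/etc/supervisor.sock

# Succeeds if the given path (default: SUPERVISOR_SOCK) is a Unix socket.
supervisor_socket_present() {
    sock=${1:-$SUPERVISOR_SOCK}
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
    else
        echo "socket missing: $sock"
        return 1
    fi
}

# Succeeds if the current user's crontab mentions supervisord.
# Assumption: the entry is in this user's crontab, not a system-wide file.
crontab_mentions_supervisord() {
    crontab -l 2>/dev/null | grep -q supervisord
}

# Usage:
#   supervisor_socket_present
#   crontab_mentions_supervisord || echo "no supervisord entry in crontab"
```

A missing socket together with a missing crontab entry is consistent with the cause described above.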

Error: “Data Center Not Found” when deleting a project

Cause

The data center has been removed.

Solution

As root, run:

/opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project>

Forgotten administrator password

  1. Use ssh to log into the server as root.

  2. Run:

    /opt/wakari/wakari-server/bin/wk-server-admin reset-password -u SOME_USER -p SOME_PASSWORD
    

    NOTE: Replace SOME_USER with the administrator username and SOME_PASSWORD with the new password.

  3. Log into AEN as the administrator user with the new password.

Alternatively, add an administrator user:

  1. Use ssh to log into the server as root.

  2. Run:

    /opt/wakari/wakari-server/bin/wk-server-admin add-user SOME_USER --admin -p SOME_PASSWORD -e YOUR_EMAIL
    

    NOTE: Replace SOME_USER with the username, replace SOME_PASSWORD with the password, and replace YOUR_EMAIL with your email address.

  3. Log into AEN as the administrator user with the new password.

Log files being deleted

Log files are being deleted.

NOTE: Locations of AEN log files for each process and application are shown in the node sections in Concepts.

Cause

AEN installers write their logs to /tmp/wakari_{server,gateway,compute}.log. If the log files grow too large, they might be deleted.

Solution

Jupyter Notebook controls how verbose its logs are with the Application.log_level setting.

To make the logs less verbose than the default, but still informative, set Application.log_level to ERROR.
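
One way to apply this setting is through a Jupyter configuration file. The sketch below appends the override to a file you name. The helper name is hypothetical, and which configuration file your AEN notebooks actually read is an assumption to confirm for your installation.

```shell
# Hypothetical helper: append a log-level override to a Jupyter config file.
# $1 = path to the config file (assumed location; verify for your setup)
# $2 = log level name, defaults to ERROR
set_jupyter_log_level() {
    config_file=$1
    level=${2:-ERROR}
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$config_file")"
    # "c" is the config object Jupyter provides when it loads the file.
    printf "c.Application.log_level = '%s'\n" "$level" >> "$config_file"
}

# Usage:
#   set_jupyter_log_level "$HOME/.jupyter/jupyter_notebook_config.py" ERROR
```

Because the helper appends, a later call overrides an earlier one when Jupyter evaluates the file.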

Error: This socket is closed

You receive the “This socket is closed” error message when you try to start an application.

Cause

When the supervisord process is killed, output sent to standard output (stdout) and standard error (stderr) is held in a pipe that eventually fills up.

Once the pipe is full, any attempt to start an application causes the “This socket is closed” error.

Solution

To prevent this issue:

  • Follow the instructions in Managing services to stop and restart processes.
  • Do not stop or kill supervisord without first stopping wk-compute and any other processes that use it.

To resolve the “This socket is closed” error:

  1. Stop wk-compute by running sudo kill -9 on its process ID.

  2. Restart the supervisord and wk-compute processes:

    sudo /etc/init.d/wakari-compute stop
    sudo /etc/init.d/wakari-compute start
    

Service error 502: Cannot connect to the application manager

The gateway node displays “Service Error 502: Can not connect to the application manager.”

Cause

A compute node is not responding because the wk-compute process has stopped.

Solution

Stop and then restart the supervisord and wk-compute processes:

sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start

502 communication error on Amazon Web Services (AWS)

You receive the “502 Communication Error: This gateway could not communicate with the Wakari server” error message.

Cause

An AEN gateway cannot communicate with the Wakari server on AWS. There may be an issue with the IP address of the Wakari server.

Solution

Configure your AEN gateway to use the DNS hostname of the server. On AWS this is the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.

Invalid username

Cause

The username does not follow one or more of these rules:

  • Must be at least 3 characters and no more than 25 characters.
  • The first character must be a letter (A-Z) or a digit (0-9).
  • Other characters can be a letter, digit, period (.), underscore (_) or hyphen (-).

NOTE: These characters make up the POSIX portable filename character set, which POSIX also specifies for portable usernames.
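
The username rules can be sketched as a small shell check, which may be useful for validating names before creating accounts in bulk. This is an illustration rather than the validation AEN itself performs: the function name is hypothetical, and it assumes that "letter" covers both upper and lower case.

```shell
# Hypothetical helper: succeed only if the argument satisfies the rules above.
# Assumption: "letter" covers both upper and lower case.
# Ranges such as A-Z are interpreted in the current locale; run under
# LC_ALL=C for strict ASCII matching.
is_valid_username() {
    name=$1
    # Length must be between 3 and 25 characters.
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 25 ] || return 1
    # First pattern: the name starts with something other than a letter or digit.
    # Second pattern: the name contains a character outside the allowed set.
    case $name in
        [!A-Za-z0-9]*) return 1 ;;
        *[!A-Za-z0-9._-]*) return 1 ;;
    esac
    return 0
}

# Usage:
#   is_valid_username "jane.doe" && echo "ok" || echo "invalid"
```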

Solution

Follow the above rules for usernames.

Notebook Error: Cannot download notebook as PDF via LaTeX

Cause

LaTeX is not properly installed.

CentOS 6 solution

  1. Install TeX Live from the TUG site, following the installation steps described there. The installation may take some time.

  2. Add the installation to the PATH in the file /etc/profile.d/latex.sh. Add the following, replacing the year and architecture as needed:

    PATH=/usr/local/texlive/2017/bin/x86_64-linux:$PATH
    
  3. Restart the compute node.

CentOS 7 solution

  1. Install the missing packages by running:

    yum install texlive texlive-xetex texlive-xetexconfig texlive-xetex-def texlive-adjustbox texlive-upquote texlive-ulem
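
After either installation, it can help to confirm that the TeX binaries are visible on the PATH of the compute node. The sketch below assumes that PDF export calls xelatex, which is the default LaTeX engine in nbconvert. The helper name is hypothetical.

```shell
# Hypothetical helper: report whether a command is found on the current PATH.
on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Usage (run in a new login shell so that files in /etc/profile.d are sourced):
#   on_path xelatex
```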
    

Unresponsive wk-server thread without error messages

Cause

Two things can cause the wk-server thread to freeze without error messages:

  • LDAP freezing
  • MongoDB freezing

If LDAP or MongoDB is configured with a long timeout, Gunicorn can time out first and kill the LDAP or MongoDB process, which then dies without logging a timeout error.

Solution

  1. Check for frozen LDAP or MongoDB server processes.
  2. Consider setting the Gunicorn timeout to more than 30 seconds.

Unresponsive wk-gateway thread without error messages

Cause

If TLS is configured with a passphrase-protected private key, wk-gateway freezes without any error messages.

Solution

Update the TLS configuration so that it does not use a passphrase-protected private key.

Error starting projects

The project’s status page shows “There was an error starting this project”.

Cause

Lack of disk space on compute nodes prevents projects from starting.

Solution

  1. Verify that the project node meets the system requirements.

  2. Check if there is enough free space on the compute node’s partition where /projects lives:

    df -h /projects
    
  3. Free up some disk space to meet the system requirements.

  4. Restart the project.
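
The free-space check in step 2 can also be scripted, for example to run it on several compute nodes. The following is a minimal sketch: the function names are hypothetical, and the 10 GiB threshold in the usage comment is a placeholder, not the documented AEN requirement.

```shell
# Print the free space, in KiB, of the filesystem holding the given path.
# df -P forces the POSIX one-line-per-filesystem format; column 4 is "Available".
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the path has at least the given number of KiB free.
# Fails if df reports nothing for the path (for example, it does not exist).
has_free_space() {
    avail=$(free_kib "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Usage: warn when /projects has less than 10 GiB (10485760 KiB) free.
# The 10 GiB figure is a placeholder, not the documented AEN requirement.
#   has_free_space /projects 10485760 || echo "low disk space on /projects"
```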

Changes in .condarc file are ignored

Changes applied to .condarc are ignored by conda.

Cause

Conda builds its configuration by merging several files, so a setting in one file can be overridden by the same setting in another.

Solution

Check that you are applying the changes to the correct file.

To show the merged state that conda is currently using:

conda config --show

To show all config files that conda is currently reading:

conda config --show-sources