===============
Troubleshooting
===============

This guide describes how to diagnose and resolve issues that may
occur with your AEN installation.

.. contents::
   :local:
   :depth: 1


General troubleshooting steps
=============================

#. Clear browser cookies. When you change the AEN configuration
   or upgrade AEN, cookies remaining in the browser can cause
   issues. Clearing cookies and logging in again can help to
   resolve problems.

#. :doc:`Make sure NGINX and MongoDB are running <sys-mgmt/verify-nginx-mongodb>`.

#. Make sure that AEN services are :ref:`set to start at boot
   <verify-services-start-at-boot>`, on all nodes.

#. :doc:`Make sure that services are running
   <sys-mgmt/manage-services>` as expected. If any services are
   not running or are missing, :ref:`restart them
   <restart-services>`.

#. :ref:`Check for and remove extraneous processes
   <identify-extra-services>`.

#. :doc:`Check the connectivity between nodes
   <sys-mgmt/check-node-connections>`.

#. :ref:`Check the configuration file syntax
   <check-config-syntax>`.

#. :doc:`Check file ownership <user-mgmt/manage-permissions>`.

#. :doc:`Verify that POSIX ACLs are enabled
   <user-mgmt/manage-permissions>`.


Browser error: too many redirects
=================================

Cause
-----

Browser cookies are out of date.

Solution
--------

#. Log out.
#. Clear the browser's cookies.
#. Clear the browser cache.
#. Log in.


Error: unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file
=========================================================================

This is a supervisorctl error.

Cause
-----

supervisord is not running on the server.

Solution
--------

Ensure that supervisord is included in the crontab. Then restart
supervisord manually.
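
The following is a minimal sketch of the crontab check, not an
official AEN tool. It assumes the supervisord entry lives in the
crontab of the account that runs the script, and
``has_supervisord_entry`` is a hypothetical helper name:

.. code-block:: bash

   # Hypothetical helper: reads crontab text on stdin and succeeds if
   # any non-comment line mentions supervisord.
   has_supervisord_entry() {
       grep -v '^[[:space:]]*#' | grep -q 'supervisord'
   }

   # Report whether the current account's crontab mentions supervisord.
   if crontab -l 2>/dev/null | has_supervisord_entry; then
       echo "supervisord is listed in this crontab"
   else
       echo "supervisord is NOT listed in this crontab"
   fi

To inspect the crontab of a different account, such as root, feed
the output of ``sudo crontab -l`` to the same helper.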


Error: "Data Center Not Found" when deleting a project
======================================================

Cause
-----

The data center has been removed.

Solution
--------

As root, run::

  /opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project>



Forgotten administrator password
================================

#. Use ssh to log into the server as root.

#. Run::

     /opt/wakari/wakari-server/bin/wk-server-admin reset-password -u SOME_USER -p SOME_PASSWORD

   NOTE: Replace SOME_USER with the administrator username and SOME_PASSWORD with the new password.

#. Log into AEN as the administrator user with the new password.


Alternatively, you can add a new administrator user:

#. Use ssh to log into the server as root.

#. Run::

     /opt/wakari/wakari-server/bin/wk-server-admin add-user SOME_USER --admin -p SOME_PASSWORD -e YOUR_EMAIL

   NOTE: Replace SOME_USER with the username, replace SOME_PASSWORD with the password, and replace YOUR_EMAIL with your email address.

#. Log into AEN as the new administrator user with the password you set.


Log files being deleted
=======================

Log files are being deleted.

NOTE: Locations of AEN log files for each process and application
are shown in the node sections in :doc:`concepts`.


Cause
-----

AEN installers write their logs to
``/tmp/wakari_{server,gateway,compute}.log``. If the log files
grow too large, they might be deleted.

Solution
--------

Reduce the volume of log output. Jupyter Notebook controls log
verbosity with the `Application.log_level
<http://jupyter-notebook.readthedocs.io/en/latest/config.html>`_
setting.

To make the logs less verbose than the default, but still
informative, set ``Application.log_level`` to ``ERROR``.
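
As a hedged sketch, the setting can be written to a Jupyter
configuration file from the shell. The default file location below
is the standard Jupyter one; whether the notebooks launched by your
AEN installation read that file is an assumption you should verify.
``set_jupyter_log_level`` is a hypothetical helper name:

.. code-block:: bash

   # Hypothetical helper: append a log_level setting to a Jupyter
   # config file. $1 is the level, $2 an optional config file path.
   set_jupyter_log_level() {
       level="$1"
       config_file="${2:-${JUPYTER_CONFIG_DIR:-$HOME/.jupyter}/jupyter_notebook_config.py}"
       mkdir -p "$(dirname "$config_file")"
       # Refuse to add a second, conflicting setting.
       if grep -q 'Application\.log_level' "$config_file" 2>/dev/null; then
           echo "log_level is already set in $config_file" >&2
           return 1
       fi
       echo "c.Application.log_level = '$level'" >> "$config_file"
   }

For example, ``set_jupyter_log_level ERROR`` appends
``c.Application.log_level = 'ERROR'`` to the default configuration
file of the current user.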


Error: This socket is closed
============================

You receive the "This socket is closed" error message when you
try to start an application.

Cause
-----

When the supervisord process is killed, output sent to the
standard output (``stdout``) and standard error (``stderr``)
streams is held in a pipe that eventually fills up.

Once the pipe is full, any attempt to start an application causes
the "This socket is closed" error.


Solution
--------

To prevent this issue:

* Follow the instructions in :doc:`sys-mgmt/manage-services` to
  stop and restart processes.

* Do not stop or kill supervisord without first stopping
  wk-compute and any other processes that use it.

To resolve the "This socket is closed" error:

#. Stop wk-compute by running ``sudo kill -9 <PID>``, where
   ``<PID>`` is the process ID of the wk-compute process.

#. Restart the supervisord and wk-compute processes:

   .. code-block:: bash

      sudo /etc/init.d/wakari-compute stop
      sudo /etc/init.d/wakari-compute start


Service error 502: Cannot connect to the application manager
============================================================

The gateway node displays "Service Error 502: Can not connect
to the application manager."

Cause
-----

A compute node is not responding because the wk-compute process
has stopped.


Solution
--------

Stop and then restart the supervisord and wk-compute processes:

.. code-block:: bash

   sudo /etc/init.d/wakari-compute stop
   sudo /etc/init.d/wakari-compute start


502 communication error on Amazon Web Services (AWS)
====================================================

You receive the "502 Communication Error: This gateway could not
communicate with the Wakari server" error message.

Cause
-----

An AEN gateway cannot communicate with the Wakari server on
AWS. There may be an issue with the IP address of
the Wakari server.

Solution
--------

Configure your AEN gateway to use the DNS hostname of the server.
On AWS this is the DNS hostname of the Amazon Elastic Compute
Cloud (EC2) instance.


Invalid username
================

Cause
-----

The username violates one or more of these rules:

* Must be at least 3 characters and no more than 25 characters.

* The first character must be a letter (A-Z) or a digit (0-9).

* Other characters can be a letter, digit, period (.),
  underscore (_) or hyphen (-).

The `POSIX standard <http://serverfault.com/a/578264/117528>`_
specifies that these characters are the portable filename
character set, and that portable usernames have the same
character set.

Solution
--------

Follow the above rules for usernames.
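
The rules can be checked before creating a user with a small shell
function. This is an illustrative sketch, not the validation code
AEN itself uses. ``is_valid_username`` is a hypothetical name, and
the sketch assumes that both uppercase and lowercase letters are
accepted:

.. code-block:: bash

   # Hypothetical helper: succeeds if $1 satisfies the username rules.
   is_valid_username() {
       name="$1"
       # At least 3 and no more than 25 characters.
       [ "${#name}" -ge 3 ] && [ "${#name}" -le 25 ] || return 1
       case "$name" in
           # The first character must be a letter or a digit.
           [!A-Za-z0-9]*) return 1 ;;
           # Every character must be a letter, digit, period,
           # underscore or hyphen.
           *[!A-Za-z0-9._-]*) return 1 ;;
       esac
       return 0
   }

For example, ``is_valid_username jane.doe`` succeeds, while
``is_valid_username _jane`` and ``is_valid_username ab`` fail.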


Notebook Error: Cannot download notebook as PDF via LaTeX
=========================================================

Cause
-----

LaTeX is not properly installed.

CentOS/6 Solution
-----------------

#. Install TeX Live by following the quick install steps on the
   `TUG site <https://www.tug.org/texlive/quickinstall.html>`_.
   The installation may take some time.

#. Add the TeX Live binaries to the ``PATH`` by adding the
   following line to the file ``/etc/profile.d/latex.sh``,
   replacing the year and architecture as needed:

   .. code-block:: bash

      PATH=/usr/local/texlive/2017/bin/x86_64-linux:$PATH

#. Restart the compute node.

CentOS/7 Solution
-----------------

#. Install the missing packages by running:

   .. code-block:: bash

      yum install texlive texlive-xetex texlive-xetexconfig texlive-xetex-def texlive-adjustbox texlive-upquote texlive-ulem
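
On either CentOS release, a quick way to confirm the result is to
check that the ``xelatex`` command, which Jupyter's PDF export
relies on, can be found on the ``PATH``. The sketch below assumes
you run it in a fresh login shell on the compute node;
``have_command`` is a hypothetical helper name:

.. code-block:: bash

   # Hypothetical helper: succeeds if the named command is on the PATH.
   have_command() {
       command -v "$1" >/dev/null 2>&1
   }

   if have_command xelatex; then
       echo "xelatex found at: $(command -v xelatex)"
   else
       echo "xelatex not found on PATH; check the LaTeX installation"
   fi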


Unresponsive ``wk-server`` thread without error messages
========================================================

Cause
-----

Two things can cause the ``wk-server`` thread to freeze without error messages:

* LDAP freezing
* MongoDB freezing

If LDAP or MongoDB is configured with a long timeout, Gunicorn can time out first and kill the
LDAP or MongoDB process. The process then dies without logging a timeout error.

Solution
--------

#. Check for frozen LDAP or MongoDB server processes.

#. Consider raising the Gunicorn timeout above 30 seconds, which is the Gunicorn default.
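
One way to spot a frozen backend is to run a lightweight probe
under a time limit. The sketch below assumes the GNU coreutils
``timeout`` command is available; ``responds_within`` is a
hypothetical helper name, and the probe command is up to you:

.. code-block:: bash

   # Hypothetical helper: run a probe command and fail if it errors
   # or does not finish within the given number of seconds.
   # $1 is the time limit in seconds; the rest is the probe command.
   responds_within() {
       seconds="$1"
       shift
       timeout "$seconds" "$@" >/dev/null 2>&1
   }

For example, assuming the legacy ``mongo`` shell is installed and
MongoDB listens on its default local port,
``responds_within 5 mongo --quiet --eval 'db.runCommand({ping: 1})'``
fails when the database does not answer within 5 seconds. A similar
probe for LDAP could wrap an ``ldapsearch`` query against your
directory server.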

Unresponsive ``wk-gateway`` thread without error messages
=========================================================

Cause
-----

If TLS is configured with a passphrase-protected private key,
``wk-gateway`` will freeze without any error messages.

Solution
--------

Update the TLS configuration so that it does not use a
passphrase-protected private key.
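
The following sketch shows one way to detect a protected key and
write an unprotected copy with OpenSSL. It assumes a PEM-encoded
RSA key. ``key_is_encrypted`` and ``strip_key_passphrase`` are
hypothetical helper names, and the file paths you pass in are your
own:

.. code-block:: bash

   # Hypothetical helper: succeeds if the PEM file in $1 is marked as
   # encrypted. Both common PEM layouts include the word ENCRYPTED.
   key_is_encrypted() {
       grep -q 'ENCRYPTED' "$1"
   }

   # Hypothetical helper: write an unprotected copy of the key in $1
   # to $2. OpenSSL prompts for the current passphrase. The umask
   # keeps the new file readable by its owner only.
   strip_key_passphrase() {
       ( umask 077 && openssl rsa -in "$1" -out "$2" )
   }

Because the resulting key file is no longer protected by a
passphrase, restrict its file permissions and point the TLS
configuration at the new file.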
