Operations & deployment info

Amy Glen edited this page Oct 2, 2024 · 18 revisions

Scope

This is a draft operations page for the ARAX system. It is not complete. We are working on filling it with instructions and procedures.

SOPs

How to check what version of RTX-KG2 is deployed via the ARAX web browser UI

  1. Point your web browser at the ARAX web UI for the RTX-KG2 service endpoint, for example kg2.test.transltr.io.
  2. On the Queries page (selected using the "Queries" navigation link in the "Input" section of the navigation bar on the left-hand side of the window), click on the {JSON} tab.
  3. In the text-box below "JSON input", enter the following TRAPI query graph verbatim:
{"nodes": {"n00": {"ids": ["RTX:KG2c"]}}, "edges": {}}
  4. Click the blue "Post to ARAX" button.
  5. In the navigation bar on the left-hand side of the window, under the "Output" section, click on the "Results" navigation link.
  6. In the "Expansion Results" section, you should see a single "result", whose title should indicate the RTX-KG2 release version, like "Result 1 :: RTX-KG2.10.0c".
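
The same query can in principle be sent from the command line. The sketch below is a minimal example, assuming the endpoint accepts the query graph wrapped in the standard TRAPI "message" envelope. The function names are hypothetical, and the /query URL is not given on this page, so the caller has to supply it. Whether the response exposes the release version the same way the UI does is also an assumption.

```shell
# Print a TRAPI request body that wraps the query graph from step 3
# in the standard "message" envelope.
kg2_version_request_body() {
    printf '%s\n' '{"message": {"query_graph": {"nodes": {"n00": {"ids": ["RTX:KG2c"]}}, "edges": {}}}}'
}

# POST the body to a TRAPI /query URL supplied by the caller.
# The URL is an assumption: this page does not state it.
post_kg2_version_query() {
    kg2_version_request_body |
        curl -sS -X POST -H 'Content-Type: application/json' --data @- "$1"
}
```

Calling post_kg2_version_query with the endpoint's query URL prints the raw JSON response, which can then be searched for the RTX-KG2 release string.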

How to check if any services in the docker container have crashed

  1. SSH into the ARAX server: ssh <user>@<arax server name>
  2. Enter the Docker container: sudo docker exec -ti rtx1 bash
  3. List which services are running: service --status-all. This should return a list similar to the following:
 [ + ]  RTX_Complete
 [ + ]  RTX_OpenAPI_beta
 [ + ]  RTX_OpenAPI_devED
 [ + ]  RTX_OpenAPI_devLM
 [ - ]  RTX_OpenAPI_dili
 [ + ]  RTX_OpenAPI_kg2
 [ - ]  RTX_OpenAPI_legacy
 [ - ]  RTX_OpenAPI_mvp
 [ - ]  RTX_OpenAPI_production
 [ + ]  RTX_OpenAPI_test
 [ + ]  apache-htcacheclean
 [ + ]  apache2
 [ - ]  apparmor
 [ - ]  bootmisc.sh
 [ - ]  checkfs.sh
 [ - ]  checkroot-bootclean.sh
 [ - ]  checkroot.sh
 [ - ]  cron
 [ - ]  dbus
 [ - ]  hostname.sh
 [ ? ]  hwclock.sh
 [ - ]  killprocs
 [ - ]  mountall-bootclean.sh
 [ - ]  mountall.sh
 [ - ]  mountdevsubfs.sh
 [ - ]  mountkernfs.sh
 [ - ]  mountnfs-bootclean.sh
 [ - ]  mountnfs.sh
 [ + ]  mysql
 [ + ]  neo4j
 [ ? ]  networking
 [ - ]  nginx
 [ ? ]  ondemand
 [ - ]  procps
 [ - ]  rc.local
 [ - ]  rsync
 [ - ]  sendsigs
 [ - ]  umountfs
 [ - ]  umountnfs.sh
 [ - ]  umountroot
 [ - ]  unattended-upgrades
 [ - ]  urandom
 [ - ]  x11-common
  4. The services that need to be running for production are apache2, mysql, apache-htcacheclean, RTX_Complete, and RTX_OpenAPI_production.
  5. In this example RTX_OpenAPI_production is not running. To start it, run service RTX_OpenAPI_production start. If all goes well, this prints the following:
 * Starting system RTX_OpenAPI_production daemon                         [ OK ]
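
The check in steps 3 and 4 can be automated by filtering the status listing. This is a minimal sketch, assuming the listing has the " [ + ]  name" layout shown above; the function and variable names are hypothetical.

```shell
# Services that the steps above say must be running for production.
REQUIRED_SERVICES="apache2 mysql apache-htcacheclean RTX_Complete RTX_OpenAPI_production"

# Read a `service --status-all` listing on stdin and print every required
# service that is not marked as running ([ + ]).
missing_services() {
    status_listing=$(cat)
    for svc in $REQUIRED_SERVICES; do
        # A running service appears as " [ + ]  <name>".
        if ! printf '%s\n' "$status_listing" |
            grep -Eq "^[[:space:]]*\[ \+ \][[:space:]]+${svc}\$"; then
            printf '%s\n' "$svc"
        fi
    done
}

# Usage inside the rtx1 container:
#   service --status-all | missing_services
```

With the listing shown above as input, the only name printed is RTX_OpenAPI_production, which is the service that step 5 restarts.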

But what if the whole container has gone down?

  1. Check the list of containers: sudo docker ps -a

  2. (a) If the container rtx1 is running but not responding, restart it with sudo docker restart rtx1

    (b) Otherwise, if it is stopped, start it with sudo docker start rtx1

  3. Enter the Docker container: sudo docker exec -ti rtx1 bash

  4. Start all of the commonly used services:

service apache2 start
service apache-htcacheclean start
service mysql start
service RTX_Complete start
service RTX_OpenAPI_production start
service RTX_OpenAPI_kg2 start
service RTX_OpenAPI_beta start
service RTX_OpenAPI_kg2beta start
service RTX_OpenAPI_test start
service RTX_OpenAPI_devED start
service RTX_OpenAPI_devLM start
  5. Wait a few seconds and double-check that the service is running at arax.ncats.io

Important: Please do not start ARAX or RTX-KG2 by running the init script /etc/init.d/RTX_OpenAPI_<DEVAREA> directly. Instead, always use the service command to start ARAX or RTX-KG2. Running the init script directly causes problems such as RTX issue 2350.
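
The choice between restarting and starting the container (steps 1 and 2 above) can be sketched as a small lookup on the container state. This is an illustration only: the function name is hypothetical, and it assumes the state is read with docker inspect rather than by eye from docker ps -a.

```shell
# Map a container state (as reported in .State.Status) to the recovery
# command suggested in the steps above.
recovery_command() {
    case "$1" in
        running)        echo "sudo docker restart rtx1" ;;  # only if it is not responding
        exited|created) echo "sudo docker start rtx1" ;;    # container is stopped
        *)              echo "inspect manually: state '$1'" ;;
    esac
}

# Usage on the host:
#   state=$(sudo docker inspect -f '{{.State.Status}}' rtx1)
#   recovery_command "$state"
```

The function only prints the suggested command, so an operator can review it before running anything.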

What to do when NCATS restarts the arax.ncats.io instance

  1. establish a remote terminal session on the instance: ssh <user>@arax.ncats.io. You need to know your Linux username on arax.ncats.io, which may differ from the one you use on your home institution's systems or on your dev system. The rest of the steps below assume you are running commands in the bash shell in the host OS on arax.ncats.io.
  2. start the rtx1 Docker container: sudo docker start rtx1
  3. start mysql inside the container: sudo docker exec rtx1 service mysql start
  4. start the "autocomplete" service inside the container: sudo docker exec -it rtx1 service RTX_Complete start
  5. start the RTX-KG2 API service inside the container: sudo docker exec rtx1 service RTX_OpenAPI_kg2 start
  6. (for any other KG2 API endpoints like kg2NewFmt, do the same as above but substitute the other endpoint name, e.g., kg2NewFmt instead of kg2)
  7. start the production ARAX API inside the container: sudo docker exec rtx1 service RTX_OpenAPI_production start
  8. (for any other ARAX API endpoints like "beta" or "devED", do the same as above but substituting the other endpoint name instead of "production")
  • devED
  • test
  • beta
  • devLM
  • NewFmt
  9. start apache2 inside the container: sudo docker exec rtx1 service apache2 start
  10. point your browser at https://arax.ncats.io and run a test query. Also test out the autocompleter.
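
The start-up order above can be captured in a small host-side script. The following is a sketch, not a tested deployment tool: the function names and the RUN switch are hypothetical, and by default it only prints the commands so they can be reviewed first.

```shell
# Services in the order given in the steps above. Other endpoints
# (beta, devED, test, ...) would be added before apache2.
SERVICES_IN_ORDER="mysql RTX_Complete RTX_OpenAPI_kg2 RTX_OpenAPI_production apache2"

# Print the command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

bring_up_rtx1() {
    run sudo docker start rtx1 || return 1
    for svc in $SERVICES_IN_ORDER; do
        # Always go through `service`, never the init script directly.
        run sudo docker exec rtx1 service "$svc" start || return 1
    done
}
```

Running bring_up_rtx1 with RUN unset lists the six commands in order; setting RUN=1 executes them and stops at the first failure.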

How to fix arax.ncats.io when it's hanging

Log into the arax.ncats.io instance:

Enter the rtx1 Docker container:

sudo docker exec -ti rtx1 bash

Kill all Python processes (this stops all RTX services, since they all run on Python):

killall python3

Then to restart, run:

service RTX_OpenAPI_production start
service RTX_OpenAPI_devED start
service RTX_OpenAPI_kg2 start
service RTX_Complete start
service RTX_OpenAPI_test start
service RTX_OpenAPI_beta start
service RTX_OpenAPI_devLM start
service RTX_OpenAPI_kg2NewFmt start
service RTX_OpenAPI_NewFmt start

Note that the last two services are only relevant during the interim period where we are transitioning between TRAPI versions, and thus have separate ARAX and RTX-KG2 endpoints for the previous TRAPI version (1.1) and the new TRAPI version (1.2).

How to deploy code changes on arax.ncats.io

Deploying changes to the endpoint /foo on arax.ncats.io (e.g., /kg2beta) which is running branch currentbranch (e.g., master) involves (approximately) the following steps:

ssh arax.ncats.io
sudo docker exec -it rtx1 bash
su - rt
cd /mnt/data/orangeboard/foo/RTX
git status

Check that the only modifications to tracked files are in openapi.yaml, and then run:

git pull origin currentbranch
exit
service RTX_OpenAPI_foo restart
tail -f /tmp/RTX_OpenAPI_foo.elog
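
The "only openapi.yaml is modified" check can be scripted against the machine-readable output of git status. This is a minimal sketch with a hypothetical function name. It matches on the file name only, because the location of openapi.yaml inside the repository is not stated here.

```shell
# Read `git status --porcelain` output on stdin. Succeed only if every
# change to a tracked file is to a file named openapi.yaml.
# Untracked files ("??") are ignored.
only_openapi_modified() {
    while IFS= read -r line; do
        case "$line" in
            '?? '*|'') continue ;;
        esac
        path=${line#???}   # drop the two status columns and the space
        case "${path##*/}" in
            openapi.yaml) ;;
            *) echo "unexpected change: $path" >&2; return 1 ;;
        esac
    done
    return 0
}

# Usage, as user rt in /mnt/data/orangeboard/foo/RTX:
#   git status --porcelain | only_openapi_modified && git pull origin currentbranch
```

Chaining with && means the pull runs only when the working tree is in the expected state.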

If you need to switch the branch that the endpoint /foo is on, say from currentbranch to otherbranch, the above steps would instead look something like this:

git pull origin currentbranch
git checkout otherbranch
git pull origin otherbranch
exit
service RTX_OpenAPI_foo restart
tail -f /tmp/RTX_OpenAPI_foo.elog

Rolling out a new KG2 version

This process essentially consists of building a new KG2c and other downstream databases off of this new KG2 version, organizing the necessary build artifacts on arax.ncats.io, uploading them to ITRB's SFTP server, and making any necessary code changes to ensure ARAX is compatible with the new KG2 version.

See this GitHub issue template for steps to roll out a new KG2 version. You can create a new issue from this template at: https://github.com/RTXteam/RTX/issues/new?template=kg2rollout.md.

Nginx

On arax.ncats.io, we use Nginx as a TLS endpoint that terminates HTTPS and proxies the requests, as unencrypted HTTP, to port 8080 on the host OS. We currently set the number of worker_connections to 10000.

ARAX response database

For all ARAX services (both ITRB-deployed services and those running on our team's development instance, which are not ITRB-deployed), we use a central database server to store records of ARAX queries and pointers to their result JSON in an S3 bucket. The database server runs on the on-demand EC2 instance arax-responses.rtx.ai in the us-east-1 AWS region, inside the Docker container 4394c6724a54 on that instance. Within the container, as root, run service mysql status to check the status. Note that the service is named mysql: service mysqld status is not recognized, as the first command in the transcript below shows.

root@4394c6724a54:/# service mysqld status
mysqld: unrecognized service
root@4394c6724a54:/# service mysql status
 * /usr/bin/mysqladmin  Ver 8.0.34-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Server version		8.0.34-0ubuntu0.20.04.1
Protocol version	10
Connection		Localhost via UNIX socket
UNIX socket		/var/run/mysqld/mysqld.sock
Uptime:			380 days 21 hours 41 min 7 sec

Threads: 27  Questions: 424280834  Slow queries: 820  Opens: 695  Flush tables: 3  Open tables: 465  Queries per second avg: 12.892

If you see these errors in the ARAX log:

2024-09-09T19:15:05.622014 ERROR: Unable to store response record in MySQL

and if they are recurring and reported by multiple users, MySQL may be down. You can check:

ssh <user>@arax-responses.rtx.ai
sudo docker exec 4394c6724a54 service mysql status

If MySQL is indeed down or not accepting connections, the recommended fix is to restart MySQL on arax-responses.rtx.ai:

ssh <user>@arax-responses.rtx.ai
sudo docker exec 4394c6724a54 service mysql restart
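
To judge whether the storage errors are recurring, it can help to count them in a log. This is a minimal sketch with a hypothetical function name. Which log file contains these messages is an assumption; the example path follows the /tmp/RTX_OpenAPI_foo.elog pattern used elsewhere on this page.

```shell
# Count "Unable to store response record in MySQL" errors in a log
# read from stdin. Prints 0 when there are none.
count_mysql_store_errors() {
    # grep -c exits non-zero on zero matches, so mask that status.
    grep -c 'ERROR: Unable to store response record in MySQL' || true
}

# Usage (log path is an assumption):
#   count_mysql_store_errors < /tmp/RTX_OpenAPI_production.elog
```

A count that keeps growing between checks suggests the database is down rather than a one-off failure.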

Jaeger/OpenTelemetry

Translator services are required to gather web request telemetry data (on the client side if the client is a Translator service, and on the server side if the server is a Translator service) via OpenTelemetry and to deposit those telemetry data into a Jaeger data collector. ITRB-deployed ARAX, RTX-KG2, and Plover services transmit their OpenTelemetry data to an ITRB-provided Jaeger service, jaeger-otel-agent.sri. ARAX and RTX-KG2 services on arax.ncats.io (our development instance) and our development Plover instances (when running) instead send their OpenTelemetry data to a Jaeger service on the EC2 instance jaeger.rtx.ai. Therefore, the jaeger.rtx.ai instance should be kept running at all times.