Basic diagnostic procedures and maintenance guidelines

Guidelines

Basic system resource checks

  1. After logging in, run the w command in a terminal for a quick check. Its output includes a few basic pieces of information, such as:

  • uptime,

  • the number of logged in users,

  • (most importantly) the average system load over the last 1, 5 and 15 minutes. A very high load, e.g. over 100, shows that processes are demanding far more CPU time than the system can provide, which usually means that one or more of them are not working normally.
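The load check can also be scripted. The sketch below reads the same three load averages that w prints and relates them to the number of CPU cores. The function names and the threshold of 1.0 per core are assumptions made for illustration (a common rule of thumb, not a value from this guide). On Linux the load figure also counts processes waiting on disk I/O, so a high value does not always point at the CPU alone.

```python
import os


def load_per_core(load, cores):
    """Return a load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_overloaded(load, cores, threshold=1.0):
    """True when the per-core load exceeds the threshold (assumed rule of thumb)."""
    return load_per_core(load, cores) > threshold


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    try:
        averages = os.getloadavg()  # the 1, 5 and 15 minute values shown by w
    except OSError:
        averages = ()  # load average not available on this platform
    for label, value in zip(("1 min", "5 min", "15 min"), averages):
        state = "HIGH" if is_overloaded(value, cores) else "ok"
        print(f"{label}: load {value:.2f}, per core {load_per_core(value, cores):.2f} ({state})")
```

A load that stays high across all three intervals is usually more significant than a short spike in the 1-minute value.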

Sample output of the w command:

  2. You can check various system resources by using these commands:

  • free -h → Displays the total, used and free amounts of physical and swap memory, as well as the buffers and cache used by the kernel, in human-readable units. When the system runs out of memory, the kernel invokes the OOM Killer ("Out Of Memory Killer"), a kernel feature that kills processes according to their oom_score. In other words, the OOM Killer terminates the process that uses the most memory and is the least important to the system. Consequently, the first processes to be killed are usually user applications with the most memory allocated.
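To see which processes the OOM Killer would be most likely to pick, you can read the same kernel data directly. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names are hypothetical. It computes the share of memory still available from /proc/meminfo and lists the processes with the highest oom_score.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # most fields are in kB
    return values


def available_percent(meminfo):
    """Share of physical memory still available, or None if it cannot be computed."""
    total = meminfo.get("MemTotal")
    available = meminfo.get("MemAvailable")
    if not total or available is None:
        return None
    return 100.0 * available / total


def oom_scores(proc_root="/proc"):
    """Return (oom_score, pid, command name) tuples, highest score first."""
    scores = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as handle:
                score = int(handle.read())
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                name = handle.read().strip()
        except (OSError, ValueError):
            continue  # the process exited or cannot be read
        scores.append((score, int(entry), name))
    return sorted(scores, reverse=True)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        percent = available_percent(parse_meminfo(handle.read()))
    if percent is not None:
        print(f"available memory: {percent:.1f}%")
    for score, pid, name in oom_scores()[:5]:
        print(f"oom_score {score:5d}  pid {pid:7d}  {name}")
```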

Sample output of the free -h command:

A more detailed view and more precise data are available from other commands:

  • dstat -vn → Shows much more data: not only memory usage, but also CPU, disk and network bandwidth usage. It also lets you track what is happening with resources in close to real time.

Sample output of the dstat -vn command:

  • df -i → Before checking the disk space used, it is good practice to check the number of free inodes, as all of them might be in use even if the disk still has free space.
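The inode check can be automated with the standard library. This is a sketch only: the function names are hypothetical, and the result corresponds roughly to the IUse% column of df -i. Some filesystems do not report a fixed inode count, in which case the function returns None.

```python
import os


def inode_usage_percent(total, free):
    """Percentage of inodes in use, or None when no inode count is reported."""
    if total <= 0:
        return None
    return 100.0 * (total - free) / total


def inode_usage_for(path):
    """Inode usage of the filesystem that contains path."""
    stats = os.statvfs(path)  # f_files = total inodes, f_ffree = free inodes
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if __name__ == "__main__":
    usage = inode_usage_for("/")
    if usage is None:
        print("/: inode count not reported")
    else:
        print(f"/: {usage:.1f}% of inodes in use")
```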

Sample output of the df -i command:

  • df -h → Shows the total, used and free disk space on all mounted partitions/drives, which can help when determining the cause of a slowdown. Keeping enough free space on a disk is crucial to keep the system running smoothly, e.g. a full root partition "/" can destabilize the whole operating system.
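A free-space check is easy to script as well. The sketch below uses shutil.disk_usage; the function names and the 90% limit are assumptions chosen for illustration. The percentage can differ slightly from the Use% column of df, which takes blocks reserved for root into account.

```python
import shutil


def used_percent(total, used):
    """Percentage of disk space in use (0.0 for an empty or unknown total)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def nearly_full(path, limit=90.0):
    """True when the filesystem containing path is at or above the limit."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_percent(usage.total, usage.used) >= limit


if __name__ == "__main__":
    state = "nearly full" if nearly_full("/") else "ok"
    print(f"/: {state}")
```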

Sample output of the df -h command:

  • iotop -aoP → A tool that shows disk read, disk write, swap-in and IO% figures for each process, together with the associated command and the user running it. With these switches, it lists only processes (-P) that are actually doing I/O (-o) and shows accumulated totals instead of current bandwidth (-a). It shows which processes perform the most disk operations; for instance, too many writes to a disk might slow down the whole system, as other disk operations are queued behind the demanding process.

Sample output of the iotop -aoP command, sorted by IO% (the percentage of time a process spends waiting on I/O; lower is better, and 100% is the maximum):

  • top/htop → Real-time monitoring tools that focus on various aspects of the system. They display many useful parameters for each process: PID, priority and nice level, the exact command (htop can also show child processes as a tree), the user who owns the process, the CPU time it has consumed, % CPU and memory used, and more, along with the overall system load. They are a convenient way to find out which process demands the most CPU or memory and who started it. With the -u USER switch, you can list only the processes owned by a specific user.

Sample output of the top command:

  • cat /proc/mdstat → /proc/mdstat is a file maintained by the kernel which contains real-time information about software RAID arrays and their devices. For a detailed view of an individual array, use: mdadm --detail /dev/md0 (instead of md0, use your device name, as listed in /proc/mdstat).
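When several arrays are present, a degraded one is easy to miss. The sketch below scans /proc/mdstat for member status strings such as [UU], where an underscore marks a missing or failed member (for example [U_]). The function name is hypothetical and the parsing assumes the usual layout of the file, with the status string on a line following the array name.

```python
import os
import re

ARRAY_LINE = re.compile(r"^(md\S*)\s*:")   # e.g. "md0 : active raid1 ..."
MEMBER_STATUS = re.compile(r"\[([U_]+)\]")  # e.g. "[UU]" or "[U_]"


def degraded_arrays(mdstat_text):
    """Return {array name: status string} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            current = match.group(1)
            continue
        status = MEMBER_STATUS.search(line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


if __name__ == "__main__" and os.path.exists("/proc/mdstat"):
    with open("/proc/mdstat") as handle:
        problems = degraded_arrays(handle.read())
    for name, status in problems.items():
        print(f"{name} is degraded: [{status}]")
    if not problems:
        print("no degraded arrays found")
```

For any array that is reported, mdadm --detail on the corresponding device gives the full picture.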

Sample output of the cat /proc/mdstat command:

Sample output of a detailed view using mdadm --detail /dev/md0:

  • smartctl -a /dev/sda → Outputs a lot of detailed information about the drive; one of the most important parts is the table of SMART attributes, which shows the recorded values for each individual attribute.

Sample output of the smartctl -a /dev/sda command:

  • drbdadm status → Shows the status of DRBD (if DRBD is configured and in use).

Sample output of the drbdadm status command showing it functioning correctly:

  • dmesg -xe → Prints the kernel message buffer, with the facility and level of each message decoded (-x) and timestamps shown in a human-readable format (-e). It helps to pinpoint, for example, malfunctioning device drivers or the devices themselves. The output is typically long, so it can be piped to less or more, which let you scroll the text in a terminal window.

Sample output of the dmesg -xe | more command:
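Because the buffer is long, it can help to keep only the more serious messages. The sketch below filters text produced with the -x switch, assuming each line starts with a "facility:level :" prefix (for example "kern  :err   : ..."). The function names and the chosen set of levels are assumptions for illustration; continuation lines without a prefix may be missed or matched by accident.

```python
import subprocess

# Level names as printed by dmesg -x; this selection is an assumption.
SERIOUS_LEVELS = frozenset({"emerg", "alert", "crit", "err", "warn"})


def serious_messages(dmesg_text, levels=SERIOUS_LEVELS):
    """Return the lines whose decoded level is one of the given levels."""
    selected = []
    for line in dmesg_text.splitlines():
        parts = line.split(":", 2)  # facility, level, rest of the message
        if len(parts) == 3 and parts[1].strip() in levels:
            selected.append(line)
    return selected


def read_dmesg():
    """Run dmesg -x and return its output (may require root privileges)."""
    result = subprocess.run(["dmesg", "-x"], capture_output=True, text=True, check=False)
    return result.stdout
```

Calling serious_messages(read_dmesg()) on the affected machine then returns only the selected lines.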

Further investigation

  1. Redis → If an application such as XTM Workbench is running slowly, make sure that Redis is working properly and that no performance-related errors are listed in its logs. The Redis log is usually located at /var/log/redis/redis.log.
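Reading a long Redis log by eye is slow, so the warning lines can be picked out with a short script. The sketch below assumes the common Redis log layout, in which a timestamp is followed by a one-character level marker (. debug, - verbose, * notice, # warning). The function name is hypothetical, and not every # line indicates a fault, since some Redis versions also log startup messages at that level.

```python
import re

# Timestamp ending in hh:mm:ss.mmm, then the level marker, then the message.
LOG_LINE = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3} ([.\-*#]) (.*)$")


def warning_messages(log_text):
    """Return the message text of every line logged at warning level (#)."""
    messages = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group(1) == "#":
            messages.append(match.group(2))
    return messages


if __name__ == "__main__":
    try:
        with open("/var/log/redis/redis.log", errors="replace") as handle:
            for message in warning_messages(handle.read()):
                print(message)
    except OSError as error:
        print(f"cannot read the Redis log: {error}")
```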

Sample output of the cat /var/log/redis/redis.log command showing it functioning correctly:

  2. Garbage Collector (GC) → Check the garbage collector logs as well. If, for example, garbage collection takes up 50 seconds of every minute, or the garbage collector writes many log entries every second, too little memory has probably been allocated to the application. Note that if there is a major problem with the garbage collector, the logs will show Full GC entries. You can check this using cat catalina.out | grep 'GC'.
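To get a quick count instead of scrolling through the grep output, the log can be summarised with a few lines of code. This is a sketch under assumptions: GC log formats differ between Java versions, so the pattern below accepts both "Full GC" (older format) and "Pause Full" (unified logging in newer JVMs), and the function name is hypothetical. The path to catalina.out is passed on the command line.

```python
import re
import sys

FULL_GC = re.compile(r"Full GC|Pause Full")


def gc_summary(log_text):
    """Count the lines that mention GC and those that report a full collection."""
    gc_lines = [line for line in log_text.splitlines() if "GC" in line]
    full_lines = [line for line in gc_lines if FULL_GC.search(line)]
    return {"gc_lines": len(gc_lines), "full_gc_lines": len(full_lines)}


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        print(gc_summary(handle.read()))
```

A count of full collections that keeps growing between two runs is a stronger signal than a single occurrence.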

Sample output of the cat catalina.out | grep 'GC' command:

  3. PostgreSQL → Log into your PostgreSQL instance using psql and then use \c postgres to switch to the postgres database. Your settings might differ, so, if necessary, use your own credentials and the appropriate switches (psql -U username -W -d postgres).

Finally, run this query:

SELECT pid,
       usename,
       pg_blocking_pids(pid) AS "blocked_by",
       query AS "blocked_query"
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

If no queries are blocked, you should see output similar to this:

 pid | usename | blocked_by | blocked_query
-----+---------+------------+---------------
(0 rows)
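If this check has to be repeated regularly, the query can be run non-interactively through psql and its output parsed. The sketch below only builds the command line and parses tab-separated output; the function names are hypothetical, and authentication is assumed to be handled by the environment (for example a password file), because the -W prompt does not suit scripted use. Queries that themselves contain tabs or line breaks would need more careful parsing.

```python
BLOCKED_QUERY = (
    "SELECT pid, usename, pg_blocking_pids(pid), query "
    "FROM pg_stat_activity "
    "WHERE cardinality(pg_blocking_pids(pid)) > 0;"
)


def psql_command(user, database="postgres"):
    """Build a psql call that prints unaligned, tab-separated rows only."""
    return ["psql", "-U", user, "-d", database, "-A", "-t", "-F", "\t", "-c", BLOCKED_QUERY]


def parse_blocked(output):
    """Turn the tab-separated psql output into a list of dictionaries."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        pid, usename, blocked_by, query = line.split("\t", 3)
        blockers = [int(item) for item in blocked_by.strip("{}").split(",") if item]
        rows.append({"pid": int(pid), "usename": usename, "blocked_by": blockers, "query": query})
    return rows
```

The command list can be passed to subprocess.run with capture_output=True and text=True, and the resulting stdout handed to parse_blocked. An empty list corresponds to the "(0 rows)" result above.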
