Provided by: watchdog_5.16-1_amd64

NAME

     watchdog - a software watchdog daemon

SYNOPSIS

     watchdog    [-F|--foreground]    [-f|--force]    [-c    filename|--config-file   filename]
     [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

     The Linux kernel can reset the system if serious  problems  are  detected.   This  can  be
     implemented  via  special watchdog hardware, or via a slightly less reliable software-only
     watchdog inside the kernel. Either way, there needs to be a daemon that tells  the  kernel
     the system is working fine. If the daemon stops doing that, the system is reset.

     watchdog is such a daemon. It opens /dev/watchdog, and keeps writing to it often enough to
     keep the kernel from resetting, at least once per minute. Each write delays the reboot
     time by another minute. After a minute of inactivity the watchdog hardware will cause the
     reset. In the case of the software watchdog the ability to reboot will depend on the state
     of the machine and its interrupts.

     The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is
     closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT  option
     enabled.
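
     The keep-alive protocol described above can be sketched in a few lines of shell. This is
     only an illustration, not how the watchdog daemon itself is implemented. The function name
     pet_watchdog and its three parameters are made up for this example. The final write of
     the character V relies on the kernel's "magic close" convention, which not every driver
     supports and which has no effect when CONFIG_WATCHDOG_NOWAYOUT is enabled.

```shell
# Sketch only: pet_watchdog DEVICE COUNT INTERVAL
#   DEVICE    path of the watchdog device, e.g. /dev/watchdog
#   COUNT     number of keep-alive writes to perform
#   INTERVAL  seconds to sleep between writes (keep it well below the timeout)
pet_watchdog() {
    dev=$1 count=$2 interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            # Any write to the open device postpones the reset.
            printf '.' >&3
            i=$((i + 1))
            sleep "$interval"
        done
        # "Magic close": a final 'V' asks the driver to disarm the timer
        # when the device is closed (ignored with CONFIG_WATCHDOG_NOWAYOUT).
        printf 'V' >&3
    } 3>"$dev"    # the device stays open on descriptor 3 for the whole loop
}
```

     For example, pet_watchdog /dev/watchdog 30 10 would keep the system alive for roughly
     five minutes and then try to close the device cleanly. Do not run this on a machine you
     cannot afford to have reset, and not while the watchdog daemon already holds the device.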

TESTS

     The watchdog daemon does several tests to check the system status:

     •  Is the process table full?

     •  Is there enough free memory?

     •  Is there enough allocatable memory?

     •  Are some files accessible?

     •  Have some files changed within a given interval?

     •  Is the average work load too high?

     •  Has a file table overflow occurred?

     •  Is a process still running? The process is specified by a pid file.

     •  Do some IP addresses answer to ping?

     •  Do network interfaces receive traffic?

     •  Is the temperature too high? (Temperature data not always available.)

     •  Execute a user defined command to do arbitrary tests.

     •  Execute  one or more test/repair commands found in /etc/watchdog.d.  These commands are
        called with the argument test or repair.

     If any of these checks fails, watchdog will cause a shutdown. Should any of these tests,
     except the user-defined binary, last longer than one minute, the machine will be rebooted
     too.

OPTIONS

     Available command line options are the following:

     -v, --verbose
            Set verbose mode. Only implemented if compiled with SYSLOG feature. This mode will
            log several pieces of information to LOG_DAEMON with priority LOG_DEBUG. This is
            useful if you want to see exactly what happened until the watchdog rebooted the
            system. Currently it logs the temperature (if available), the load average, the
            change date of the files it checks and how often it went to sleep. You can use this
            twice to enable some more verbose debug messages for testing.

     -s, --sync
            Try to synchronize the filesystem every time the process is awake. Note that the
            system is rebooted if for any reason the synchronizing lasts longer than a minute.

     -b, --softboot
            Soft-boot the system if an error occurs during the main loop, e.g. if a given file
            is not accessible via the stat(2) call. Note that this does not apply to the
            opening of /dev/watchdog and /proc/loadavg, which are opened before the main loop
            starts. This is now implemented by disabling the error re-try timer.

     -F, --foreground
            Run in foreground mode, useful for running under systemd (for example).

     -f, --force
            Force the usage of the interval given or the maximal load average given in the
            config file. Without this option these values are sanity checked.

     -c config-file, --config-file config-file
            Use config-file as the configuration file instead of the default
            /etc/watchdog.conf.

     -q, --no-action
            Do not reboot or halt the machine. This is for testing purposes. All checks are
            executed and the results are logged as usual, but no action is taken. Also, your
            hardware card or the kernel software watchdog driver is not enabled. NOTE: This
            still allows 'repair' actions to run, but the daemon itself will not attempt a
            reboot.

     -X num, --loop-exit num
            Run for 'num' loops then exit as if SIGTERM was received. Intended for test/debug
            (e.g. using valgrind for checking memory access). If the daemon exits on a loop
            counter and you have the CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
            device-driver then an unplanned reboot will follow - be warned!

FUNCTION

     After watchdog starts, it puts itself into the background and then tries all checks
     specified in its configuration file in turn. Between each two tests it will write to the
     kernel device to prevent a reset. After finishing all tests watchdog goes to sleep for
     some time. The kernel driver expects a write to the watchdog device every minute.
     Otherwise the system will be reset. watchdog will sleep for a configurable interval that
     defaults to 1 second to make sure it triggers the device early enough.

     Under high system load watchdog might be swapped out of memory and may fail to be swapped
     back in in time. Under these circumstances the Linux kernel will reset the machine. To
     avoid unnecessary reboots, make sure the variable realtime is set to yes in the
     configuration file watchdog.conf. This adds real time support to watchdog: it will lock
     itself into memory and there should be no problem even under the highest of loads.

     On a system running out of memory the kernel will try to free enough memory by killing
     processes. The watchdog daemon itself is exempted from this so-called out-of-memory killer.

     You can also specify a maximal allowed load average. Once this load average is reached the
     system is rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15
     minutes. The default is to disable this test. Be careful not to set this parameter
     too low. To set a value less than the predefined minimal value of 2, you have to use the
     -f option.
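
     As an illustration of the load test, the following shell sketch compares the three
     averages from a line in /proc/loadavg format against optional limits. The function name
     load_exceeded, its argument order, and the use of 0 to mean "test disabled" are
     assumptions made for this example. The daemon's own comparison may differ in detail.

```shell
# Sketch only: load_exceeded LOADAVG_LINE MAX1 [MAX5 [MAX15]]
# Returns 0 (true) if any enabled limit is exceeded, 1 otherwise.
# A limit of 0, or a missing limit, disables that comparison in this sketch.
load_exceeded() {
    printf '%s\n' "$1" | awk -v m1="$2" -v m5="${3:-0}" -v m15="${4:-0}" '
        {
            # Fields 1-3 hold the 1, 5 and 15 minute load averages.
            if ((m1  > 0 && $1 > m1) ||
                (m5  > 0 && $2 > m5) ||
                (m15 > 0 && $3 > m15))
                exit 0
            exit 1
        }'
}
```

     A typical call would be load_exceeded "$(cat /proc/loadavg)" 24 18 12, where the three
     limits are arbitrary example values.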

     You  can  also  specify  a  minimal amount of virtual memory you want to have available as
     free. As soon as more virtual memory is used action is taken by watchdog.  Note,  however,
     that watchdog does not distinguish between different types of memory usage. It just checks
     for free virtual memory.

     If you have a machine with temperature sensor(s)  you  can  specify  the  maximal  allowed
     temperature. Once this temperature is reached on any sensor the system is powered off. The
     default value is 90 C. Typically the temperature information is provided  by  the  sensors
     package  as  files  in  the  virtual  filesystem  /sys/device  and can be found using, for
     example, the command

         find /sys -name 'temp*input' -print

     These files hold the temperature in milli-Celsius. You can have multiple sensors  used  in
     the  config  file. For example to change to 75C maximum and to check two virtual files for
     the system temperature you might have this:

         max-temperature = 75
         temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
         temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input

     The watchdog will issue warnings once the temperature reaches 90%, 95% and 98% of the
     configured maximum temperature.
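
     To make the units concrete: the sensor files report milli-Celsius while max-temperature
     is given in degrees Celsius, so a limit of 75 corresponds to a reading of 75000, with
     warning levels at 67500, 71250 and 73500. The sketch below classifies a reading in that
     way. The function name temp_level and its output labels are invented for this example,
     and whether the daemon compares with "greater than" or "greater than or equal" is an
     assumption.

```shell
# Sketch only: temp_level READING_MILLI_C MAX_C
# Prints ok, warn90, warn95, warn98 or max.
temp_level() {
    reading=$1                      # e.g. the contents of a temp*_input file
    max_milli=$(( $2 * 1000 ))      # convert degrees Celsius to milli-Celsius
    if [ "$reading" -ge "$max_milli" ]; then
        echo max
    elif [ "$reading" -ge $(( max_milli * 98 / 100 )) ]; then
        echo warn98
    elif [ "$reading" -ge $(( max_milli * 95 / 100 )) ]; then
        echo warn95
    elif [ "$reading" -ge $(( max_milli * 90 / 100 )) ]; then
        echo warn90
    else
        echo ok
    fi
}
```

     For instance, temp_level "$(cat /sys/class/hwmon/hwmon0/device/temp1_input)" 75 would
     classify the first sensor from the configuration example above.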

     When using file mode watchdog will try to stat(2) the given files. Errors returned by stat
     will not cause a reboot. For a reboot the stat call has to last at least the re-try  time-
     out  value  (default  1 minute).  This may happen if the file is located on an NFS mounted
     filesystem. If your system relies on an NFS mounted filesystem you might try this  option.
     However, in such a case the sync option may not work if the NFS server is not answering.

     watchdog  can  read  the  pid from a pid file and see whether the process still exists. If
     not, action is taken by watchdog.  So you can for instance restart the  server  from  your
     repair-binary.
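
     A comparable check can be written in shell, for example for use inside a test or repair
     script. This is a sketch, not the daemon's own code. The function name pid_alive is
     hypothetical, and kill -0 is used here as one common way to probe for a process. Note
     that kill -0 also fails when the process exists but belongs to another user and the
     caller lacks permission to signal it.

```shell
# Sketch only: pid_alive PIDFILE
# Returns 0 if PIDFILE names a process that can be signalled, 1 otherwise.
pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    # Take the first line and strip any whitespace around the number.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;    # empty or not a plain number
    esac
    # Signal 0 performs the existence and permission check without
    # actually delivering a signal.
    kill -0 "$pid" 2>/dev/null
}
```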

     watchdog  will  try  periodically to fork itself to see whether the process table is full.
     This process will leave a zombie process until watchdog wakes up  again  and  catches  it;
     this is harmless, don't worry about it.

     In ping mode watchdog tries to ping the given IPv4 addresses. An address does not have
     to belong to a single machine. It is possible to ping a broadcast address instead to see
     if at least one machine in a subnet is still alive.

     Do  not  use this broadcast ping unless your MIS person a) knows about it and b) has given
     you explicit permission to use it!

     watchdog will send out three ping packets and wait up to <interval> seconds for the reply
     with <interval> being the time it goes to sleep between two times triggering the watchdog
     device. Thus an unreachable network will not cause a hard reset but a soft reboot.

     You can also test passively  for  an  unreachable  network  by  just  monitoring  a  given
     interface for traffic. If no traffic arrives the network is considered unreachable causing
     a soft reboot or action from the repair binary.

     watchdog can run an external command for user-defined tests. A return code not equal to 0
     means an error occurred and watchdog should react. If the external command is killed by an
     uncaught signal this is considered an error by watchdog too.  The command may take  longer
     than  the  time  slice  defined  for  the  kernel device without a problem. However, error
     messages are generated into the syslog facility. If you have enabled softboot on error the
     machine  will  be  rebooted  if  the  binary doesn't exit in half the time watchdog sleeps
     between two tries triggering the kernel device.

     If you specify a repair binary it will be started instead of shutting down the system.  If
     this binary is not able to fix the problem watchdog will still cause a reboot afterwards.

     If  the  machine  is  halted  an email is sent to notify a human that the machine is going
     down. Starting with version 4.4 watchdog will also notify  the  human  in  charge  if  the
     machine is rebooted.

     The re-try timer applies to most errors, except reset/reboot calls and over-temperature.
     It allows a given error source to recover, and most tests are treated in this way.
     Exceptions are the file handle test, load averages, and system memory. If set to the
     minimum time of 1 second it will still allow a single re-try at any polling interval of
     the system.

SOFT REBOOT

     A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error that is
     found. Since there might be no more processes available, watchdog does it all by itself.
     That means:

     1.  Kill all processes with SIGTERM.

     2.  After a short pause kill all remaining processes with SIGKILL.

     3.  Record a shutdown entry in wtmp.

     4.  Save the random seed from /dev/urandom. If the device is nonexistent or there is no
         filename for saving, this step is skipped.

     5.  Turn off accounting.

     6.  Turn off quota and swap.

     7.  Unmount all partitions.

     8.  Finally reboot.

CHECK BINARY

     If the return code of the check binary is not zero watchdog will assume an error and
     reboot the system. Be careful with this if you are using the real-time properties of
     watchdog since watchdog will wait for the return of this binary before proceeding. An exit
     code smaller than 245 is interpreted as a system error code (see errno.h for details).
     Values of 245 or larger are special to watchdog:

     255    (based  on  -1  as unsigned 8-bit number) Reboot the system. This is not exactly an
            error message but a command to watchdog.  If the return code is this  the  watchdog
            will not try to run a shutdown script instead.

     254    Reset the system. This is not exactly an error message but a command to watchdog.
            If the return code is this the watchdog will attempt to hard-reset the machine
            without attempting any sort of orderly stopping of processes, unmounting of file
            systems, etc.

     253    Maximum load average exceeded.

     252    The temperature inside is too high.

     251    /proc/loadavg contains no (or not enough) data.

     250    The given file was not changed in the given interval.

     249    /proc/meminfo contains invalid data.

     248    Child process was killed by a signal.

     247    Child process did not return in time.

     246    Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).

     245    Reserved for an unknown result, for example a slow background test  that  is  still
            running so neither a success nor an error.
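
     When reading logs or writing test scripts it can help to translate these codes back into
     text. The helper below is a sketch that simply restates the table above. The function
     name describe_check_code is made up for this example and is not part of watchdog.

```shell
# Sketch only: describe_check_code CODE
# Prints a short description of a check binary exit code.
describe_check_code() {
    case $1 in
        0)   echo "success" ;;
        255) echo "reboot the system" ;;
        254) echo "reset the system" ;;
        253) echo "maximum load average exceeded" ;;
        252) echo "temperature too high" ;;
        251) echo "/proc/loadavg contains no (or not enough) data" ;;
        250) echo "file was not changed in the given interval" ;;
        249) echo "/proc/meminfo contains invalid data" ;;
        248) echo "child process was killed by a signal" ;;
        247) echo "child process did not return in time" ;;
        246) echo "free for personal watchdog-specific use" ;;
        245) echo "unknown result" ;;
        *)   echo "system error code $1 (see errno.h)" ;;
    esac
}
```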

REPAIR BINARY

     The  repair binary is started with one parameter: the error number that caused watchdog to
     initiate the boot process. After trying to repair the system the binary should exit with 0
     if the system was successfully repaired and thus there is no need to boot anymore. A
     return value not equal to 0 tells watchdog to reboot. The return code of the repair binary
     should  be  the error number of the error causing watchdog to reboot. Be careful with this
     if you are using the real-time properties since watchdog will wait for the return of  this
     binary before proceeding.

     The  configuration  file parameter repair-maximum controls the number of successive repair
     attempts that report 0 (i.e. success) but fail to clear  the  tested  fault.  If  this  is
     exceeded  then a reboot takes place. If set to zero then a reboot can always be blocked by
     the repair program reporting success.
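
     The rule can be stated as a small predicate. This is a reading of the paragraph above,
     not code taken from watchdog. The function name repair_limit_exceeded is hypothetical,
     and its second argument stands for the number of successive repairs that reported
     success without clearing the fault.

```shell
# Sketch only: repair_limit_exceeded REPAIR_MAXIMUM FAILED_REPAIRS
# Returns 0 (true) when a reboot would follow under the rule described above.
repair_limit_exceeded() {
    max=$1 attempts=$2
    # A maximum of zero means the repair program can always block the reboot.
    [ "$max" -gt 0 ] && [ "$attempts" -gt "$max" ]
}
```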

TEST DIRECTORY

     Executables placed in the test directory are discovered by watchdog  on  startup  and  are
     automatically  executed.   They  are  bounded  time-wise  by the test-timeout directive in
     watchdog.conf.

     These executables are called with either "test" as the first argument (if a test is being
     performed) or "repair" as the first argument (if a repair for a previously-failed "test"
     operation is being performed).

     As with test binaries and repair binaries, the expected exit code for a successful test or
     repair operation is always zero.

     If  an executable's test operation fails, the same executable is automatically called with
     the "repair" argument as well as the return code of the previously-failed test operation.

     For example, if the following execution returns 42:

         /etc/watchdog.d/my-test test

     The watchdog daemon will attempt to repair the problem by calling:

         /etc/watchdog.d/my-test repair 42

     This enables administrators and application developers  to  make  intelligent  test/repair
     commands.   If the "repair" operation is not required (or is not likely to succeed), it is
     important that the author of the command return a non-zero value so the machine will still
     reboot as expected.
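
     A test/repair command following this calling convention might be structured as below.
     Everything specific in the sketch is hypothetical: the function name my_test_main, the
     marker file in SERVICE_FLAG that stands in for the thing being monitored, and the code 42
     reused from the example above. A real script would end with a line such as
     my_test_main "$@"; exit $? and should avoid the reserved codes 245 to 255 unless it
     means them.

```shell
# Sketch only: body of a hypothetical /etc/watchdog.d/my-test script.
# SERVICE_FLAG is a made-up marker file; its presence means "healthy".
SERVICE_FLAG=${SERVICE_FLAG:-/tmp/my-service.ok}

my_test_main() {
    case $1 in
        test)
            # Healthy: report success. Otherwise report our own error code.
            [ -e "$SERVICE_FLAG" ] && return 0
            return 42
            ;;
        repair)
            # $2 is the exit code of the failed "test" run.
            if [ "$2" = 42 ]; then
                # The one fault this script knows how to fix.
                : > "$SERVICE_FLAG" && return 0
            fi
            # Unknown fault or failed fix: non-zero so the reboot still happens.
            return "${2:-1}"
            ;;
        *)
            echo "usage: my-test test|repair [code]" >&2
            return 2
            ;;
    esac
}
```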

     Note  that the watchdog daemon may interpret and act upon any of the reserved return codes
     noted in the Check Binary section prior to calling a given command in "repair" mode.

     As for the repair binary, the configuration parameter  repair-maximum  also  controls  the
     number  of successive repair attempts that report success (return 0) but fail to clear the
     fault.

BUGS

     None known so far.

AUTHORS

     The original code is an example written by Alan Cox, the author of the kernel driver. All
     additions were written by Michael Meskes. Johnie Ingram had the idea of testing the load
     average. He also took over the Debian specific work. Dave Cinege brought up some hardware
     watchdog issues and helped test this stuff.

FILES

     /dev/watchdog
            The watchdog device.

     /var/run/watchdog.pid
            The pid file of the running watchdog.