Check running status of a background process launched from same bash script

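I have to write a bash script that launches a process in the background according to the command-line argument passed, and returns whether it was able to launch the program successfully.

Here is pseudo-code of what I am trying to achieve: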

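if [ "$1" = "PROG_1" ] ; then
    ./launchProg1 &
    if [ isLaunchSuccess ] ; then    # pseudo-code placeholder for the check being asked about
        echo "Success"
    else
        echo "failed"
        exit 1
    fi
elif [ "$1" = "PROG_2" ] ; then
    ./launchProg2 &
    if [ isLaunchSuccess ] ; then    # pseudo-code placeholder for the check being asked about
        echo "Success"
    else
        echo "failed"
        exit 1
    fi
fi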

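The script cannot wait or sleep, since it will be called by another mission-critical C++ program and needs high throughput (with respect to the number of processes started per second); moreover, the running time of the processes is unknown. The script neither needs to capture any input/output nor waits for the launched process to complete.

I have tried the following, without success: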

#Method 1
if [ "$1" = "KP1" ] ; then
    echo "The Arguement is KP1"
    ./kp 'this is text' &
    if [ $? = "0" ] ; then
        echo "Success"
    else
        echo "failed"
        exit 1
    fi
elif [ "$1" = "KP2" ] ; then
    echo "The Arguement is KP2"
    ./NoSuchCommand 'this is text' &
    if [ $? = "0" ] ; then
        echo "Success"
    else
        echo "failed"
        exit 1
    fi
#Method 2
elif [ "$1" = "CD5" ] ; then
    echo "The Arguement is CD5"
    cd "doesNotExist" &
    PROC_ID=$!
    echo "PID is $PROC_ID"
    if kill -0 "$PROC_ID" ; then
        echo "Success"
    else
        echo "failed"
        exit 1
    fi
#Method 3
elif [ "$1" = "CD6" ] ; then
    echo "The Arguement is CD6"
    cd .. &
    PROC_ID=$!
    echo "PID is $PROC_ID"
    ps -eo pid | grep "$PROC_ID" && { echo "Success"; exit 0; }
    ps -eo pid | grep "$PROC_ID" || { echo "failed" ; exit 1; }
else
    echo "Unknown Argument"
    exit 1
fi

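Running the script gives unreliable output. Methods 1 and 2 always return Success, while method 3 returns failed when the process finishes execution before the check is made.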
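The behaviour of methods 1 and 2 can be reproduced with a small sketch. In bash, a command terminated by & is started asynchronously and the launch itself returns status 0, so $? read right after it says nothing about the program; the real exit status only becomes available from wait. The helper name launch_status_demo is hypothetical, and ./NoSuchCommand is assumed not to exist.

```shell
#!/bin/bash
# Sketch only: shows that "$?" after "cmd &" is the status of the launch,
# not of the command. launch_status_demo is a hypothetical helper name.
launch_status_demo() {
    "$@" 2>/dev/null &       # start the given command in the background
    local launch=$?          # always 0: the shell only forked a child
    local pid=$!
    wait "$pid"              # blocks; used here only to reveal the real status
    local real=$?
    echo "launch=$launch real=$real"
}

launch_status_demo ./NoSuchCommand 'this is text'   # prints "launch=0 real=127"
launch_status_demo true                             # prints "launch=0 real=0"
```

The wait call is there purely for illustration; it is exactly what the script in the question is not allowed to do.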

This was tested on GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu) and GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu). Sample output:

[scripts]$ ./processStarted3.sh KP1
The Arguement is KP1
Success
[scripts]$ ./processStarted3.sh KP2
The Arguement is KP2
Success
./processStarted3.sh: line 13: ./NoSuchCommand: No such file or directory
[scripts]$ ./processStarted3.sh CD6
The Arguement is CD6
PID is 25050
failed

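I cannot use process names, as suggested in similar questions, because one process may be executed several times and the other suggestions cannot be applied.

I have not tried screen and tmux, since getting permission to install them on production servers will not be easy (but I will do so if that is the only option left).

UPDATE @霍蒂: ./kp is a program that exists, and launching it returns Success. ./NoSuchCommand does not exist. Still, as you can see from the (edited) output, the script incorrectly returns Success.

It does not matter when the process finishes execution or whether the program terminates abnormally. Programs launched through the script are not tracked in any way (hence we do not store the pid in any table, nor is there any need to use daemontools).

@Etan Reisner: An example of a program that fails to launch would be ./NoSuchCommand, which does not exist. Or maybe a corrupt program that fails to start.

@Vorsprung: Calling a script that launches a program in the background does not take a lot of time (and is manageable as per our expectations). But sleep 1 will accumulate over time and cause problems.

The #Method 3 above works fine, except for processes that terminate before the ps -eo pid | grep "$PROC_ID" && { echo "Success"; exit 0; } check can be performed.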


Here is an example that shows whether a process was started successfully or not.

#!/bin/bash
$1 &    # executes the program given as an argument in the background
pid=$!  # stores the id of the launched process in pid
count=$(ps -A | grep $pid | wc -l)  # check whether the process is still running
if [[ $count -eq 0 ]]  # if the process has already terminated there are two cases: it ran and stopped successfully, or it terminated abnormally
then
        if wait $pid; then  # checks whether the process executed successfully or not
                echo "success"
        else                    # process terminated abnormally
                echo "failed (returned $?)"
        fi
else
        echo "success"  # process is still running
fi

# Note: The above script only reports whether the process started successfully or not. If the process starts successfully and later terminates abnormally, this script will not provide a correct result.


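The accepted answer does not work as advertised.

The count in this check will always be at least 1, because "grep $pid" will find both the process with $pid (if it exists) and the grep itself.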

count=$(ps -A | grep $pid | wc -l)
if [[ $count -eq 0 ]]
then
    : ### We can never get here
else
    echo "success"  # process is still running
fi

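Changing the above to check for a count of 1, or excluding the grep from the count, should make the original work.

Here is an alternative (maybe simpler) implementation of the original example.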

#!/bin/bash
$1 & # executes a program in background which is provided as an argument
pid=$! # stores executed process id in pid

# check whether process is still running
# The "[^[]" excludes the grep from finding itself in the ps output
if ps | grep "$pid[^[]" >/dev/null
then
    echo "success (running)"  # process is still running
else
    # If the process is already terminated, then there are 2 cases:
    # 1) the process executed and stopped successfully
    # 2) it terminated abnormally

    if wait $pid # check if process executed successfully or not
    then
        echo "success (ran)"
    else
        echo "failed (returned $?)" # process terminated abnormally
    fi
fi

# Note: The above script will detect whether a process started successfully or not. If the process is running when we check but later terminates abnormally, this script will not detect that.
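As a variation on the ps | grep check, liveness can be tested without parsing ps output by using kill -0, which sends no signal and only reports whether the PID exists and may be signalled. This is a minimal sketch; the helper name is_alive is hypothetical, and it shares the limitation discussed in this thread: a child that has been forked but has not yet failed still counts as alive.

```shell
#!/bin/bash
# Sketch only: is_alive is a hypothetical helper, not part of the answers above.
# kill -0 delivers no signal; it succeeds if the PID exists and may be signalled.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

sleep 5 &                     # a child that is certainly still running
pid=$!
if is_alive "$pid"; then
    echo "success (running)"
else
    echo "not running"
fi

kill "$pid"                   # clean up the demo child
wait "$pid" 2>/dev/null       # reap it so the PID is really gone
is_alive "$pid" || echo "gone after wait"
```

Unlike the grep pattern, this cannot match a different process whose PID merely contains the same digits.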


Using jobs

Put the following into a bash script and execute it.

#!/bin/bash

{ sleep 1 ; echo sleep1 ; } &
sleep 0
jobs
wait

echo nosleep &
sleep 0
jobs
wait

echo exit1
false &
sleep 0
jobs
wait

notexisting &
sleep 0
jobs
wait

./existingbutnotexecutable &
sleep 0
jobs
wait

Output

$ ./testrun.sh
[1]+  Running                 { sleep 1; echo sleep1; } &
sleep1
nosleep
[1]+  Done                    echo nosleep
exit1
[1]+  Exit 1                  false
./testrun.sh: line 19: notexisting: command not found
[1]+  Exit 127                notexisting
./testrun.sh: line 24: ./existingbutnotexecutable: Permission denied
[1]+  Exit 126                ./existingbutnotexecutable

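From the output of jobs we can distinguish:

  • a background job that is still running
  • a job that has finished running
  • a job that ran and exited with a nonzero exit status
  • a job that could not run because the command was not found
  • and a job that could not run because it is not executable.

There may be more cases, but I did not do more research.

The wait is just there to make sure that there is never more than one background job at a time.

The sleep 0 is required; otherwise jobs reports the process as running even before the shell has been able to report the error about the command not being found. I tried an echo instead, but it does not seem to delay enough.

Removing the sleep gives this output.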

$ ./testrun.sh
[1]+  Running                 { sleep 1; echo sleep1; } &
sleep1
[1]+  Running                 echo nosleep &
nosleep
exit1
[1]+  Running                 false &
[1]+  Running                 notexisting &
./testrun.sh: line 19: notexisting: command not found
[1]+  Running                 ./existingbutnotexecutable &
./testrun.sh: line 24: ./existingbutnotexecutable: Permission denied

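Note that jobs always says "Running", and its line always appears before the result of the command, error or not.

Here is one possibility for taking action based on the output of jobs.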

#!/bin/bash

isrunsuccess() {
  case $(jobs) in
    *Running*)   echo ">>> running" ;;
    *Done*)      echo ">>> done" ;;
    *Exit\ 127*) echo ">>> not found" ;;
    *Exit\ 126*) echo ">>> not executable" ;;
    *Exit*)      echo ">>> done nonzero exitstatus" ;;
  esac
}

{ sleep 1 ; echo sleep1 ; } &
sleep 0
isrunsuccess
wait

echo nosleep &
sleep 0
isrunsuccess
wait

echo exit1
false &
sleep 0
isrunsuccess
wait

notexisting &
sleep 0
isrunsuccess
wait

./existingbutnotexecutable &
sleep 0
isrunsuccess
wait

Output

$ ./testrun.sh
>>> running
sleep1
nosleep
>>> done
exit1
>>> done nonzero exitstatus
./testrun.sh: line 29: notexisting: command not found
>>> not found
./testrun.sh: line 34: ./existingbutnotexecutable: Permission denied
>>> not executable

You can merge the cases into "ran" and "did not run":

isrunsuccess() {
  case $(jobs) in
    *Exit\ 127*|*Exit\ 126*) echo ">>> did not run" ;;
    *Running*|*Done*|*Exit*) echo ">>> still running or was running" ;;
  esac
}

Output

$ ./testrun.sh
>>> still running or was running
sleep1
nosleep
>>> still running or was running
exit1
>>> still running or was running
./testrun.sh: line 26: notexisting: command not found
>>> did not run
./testrun.sh: line 31: ./existingbutnotexecutable: Permission denied
>>> did not run

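Other ways to check the contents of a string in bash: How to tell if a string contains another string in a Unix shell script?

The bash documentation that specifies exit status 127 for not found and 126 for not executable: https://www.gnu.org/software/bash/manual/html_node/exit-status.html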
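The 126 and 127 statuses documented there can also be read directly from wait rather than from the text printed by jobs. The sketch below is only an illustration under assumptions: classify_launch is a hypothetical helper, and because wait blocks until the child exits it suits short-lived commands or after-the-fact diagnosis, not the no-wait requirement in the question. A program that itself exits with 126 or 127 would be indistinguishable from a launch failure.

```shell
#!/bin/bash
# Sketch only: classify_launch is a hypothetical helper name.
# It runs a command in the background, waits for it, and maps the exit
# status to the same categories as the jobs-based isrunsuccess function.
classify_launch() {
    "$@" 2>/dev/null &
    wait "$!"
    case $? in
        0)   echo "done" ;;
        126) echo "not executable" ;;
        127) echo "not found" ;;
        *)   echo "done nonzero exitstatus" ;;
    esac
}

classify_launch true
classify_launch false
classify_launch notexisting          # assumed not to be on PATH
```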


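Sorry, I missed the requirement that "the script cannot wait or sleep".

Launch the background program and get its PID. Wait a moment. Then check whether it is still running with kill -0.

The status of kill -0 is taken from $? and is used to decide whether the process is still running.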

#!/bin/bash

"./$1" &
pid=$!

sleep 1

kill -0 "$pid"
stat=$?
if [ $stat -eq 0 ] ; then
  echo "running as $pid"
  exit 0
else
  echo "$pid did not start"
  exit 1
fi

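Maybe if your super C++ program cannot wait one second, it also cannot expect to be able to launch loads of shell commands at a high rate per second.

Maybe you need to implement a queue here?

Sorry, I have more questions than answers.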