When I got back to the office this morning, I found that a game server process had died. My first thought was that the cloud host had gone down and been restarted (I had run into that twice before), so I immediately logged on to the server. I first checked the process log to determine when the process was killed, and then checked the kernel log /var/log/messages, where I found the following:
The log showed that the system had run out of memory, which triggered the OOM killer and killed the process. The rest of this post summarizes how the OOM killer works.
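The log check described above can be sketched as a small shell function. This is only a minimal sketch: the function name find_oom_events is made up for illustration, the search pattern assumes the kernel messages mention "oom-killer" or "Out of memory" (the exact wording varies between kernel versions), and the log path differs between distributions.

```shell
# Print kernel log lines that mention the OOM killer.
# $1: path to a kernel log file (assumed default: /var/log/messages).
find_oom_events() {
    log_file="${1:-/var/log/messages}"
    # -i: ignore case; -E: extended regex that matches either phrase.
    grep -iE 'oom-killer|out of memory' "$log_file"
}
```

Calling find_oom_events with no argument searches /var/log/messages; passing another path searches that file instead. The function returns a non-zero status when nothing matches, which makes it easy to use in an if statement.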
Linux allows programs to request more memory than the system actually has available. This feature is called overcommit. It is an optimization: not every program uses its memory immediately after requesting it, and by the time it does, the system may have reclaimed resources elsewhere. Unfortunately, if the system has no memory left when a program starts using overcommitted memory, the OOM killer steps in and kills a process. To protect an important process from the OOM killer, we can run echo -17 > /proc//oom_adj; the value -17 disables OOM killing for that process. (On kernels 2.6.36 and later, oom_adj is deprecated in favor of oom_score_adj, where -1000 has the same effect.)
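As a rough illustration of the protection step, the sketch below writes the "never kill" value for one process. The function name protect_from_oom and both of its arguments are hypothetical; the second argument (the proc mount point) exists only so the function can be tried against a scratch directory. Lowering these values on a real system typically requires root.

```shell
# Exempt one process from the OOM killer.
# $1: PID of the process to protect (hypothetical argument).
# $2: proc mount point, /proc by default (assumption, for dry runs).
protect_from_oom() {
    pid="$1"
    proc_root="${2:-/proc}"
    if [ -e "$proc_root/$pid/oom_score_adj" ]; then
        # Newer kernels: -1000 disables OOM killing for the process.
        printf '%s\n' -1000 > "$proc_root/$pid/oom_score_adj"
    else
        # Older kernels: -17 is the equivalent value for oom_adj.
        printf '%s\n' -17 > "$proc_root/$pid/oom_adj"
    fi
}
```

For example, protect_from_oom 1234 would protect the process with PID 1234, assuming such a process exists. The setting is lost when the process exits, so it has to be reapplied after every restart.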
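We can also change the OOM behavior of the entire system. Note that this does not prevent out-of-memory conditions: with vm.panic_on_oom=1 the kernel panics instead of picking a process to kill (the default is 0, which leaves the OOM killer enabled):
sysctl -w vm.panic_on_oom=1
sysctl -p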
The parameter /proc/sys/vm/overcommit_memory controls how the kernel responds to memory overcommit requests:
- overcommit_memory=0 (the default): heuristic mode. Modest overcommit is allowed, but requests that are obviously too large are refused.
- overcommit_memory=1: overcommit is always allowed.
- overcommit_memory=2: overcommit is prohibited. Total committed memory may not exceed swap plus a percentage of physical RAM set by vm.overcommit_ratio.
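To check which of the three modes above a machine is using, something like the following sketch could help. The function names describe_overcommit_mode and show_overcommit_mode are made up for illustration, and the descriptions are paraphrases of the list above rather than kernel output.

```shell
# Translate an overcommit_memory value into a short description.
describe_overcommit_mode() {
    case "$1" in
        0) echo "heuristic: refuse only obvious overcommits (default)" ;;
        1) echo "always: never refuse an allocation" ;;
        2) echo "strict: refuse allocations beyond the commit limit" ;;
        *) echo "unknown mode: $1"; return 1 ;;
    esac
}

# Describe the mode stored in a file.
# $1: file to read (default: /proc/sys/vm/overcommit_memory).
show_overcommit_mode() {
    mode_file="${1:-/proc/sys/vm/overcommit_memory}"
    [ -r "$mode_file" ] || { echo "cannot read $mode_file" >&2; return 1; }
    describe_overcommit_mode "$(cat "$mode_file")"
}
```

Running show_overcommit_mode with no argument reads the live setting. Under mode 2, the CommitLimit and Committed_AS lines in /proc/meminfo show the limit and the amount currently committed.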
The strategy Linux uses to select a victim has evolved continuously. As users, we can influence the OOM killer's decision by setting a few values.
Every process on Linux has an OOM weight in /proc//oom_adj, ranging from -17 to +15. The higher the value, the more likely the process is to be killed.
Ultimately, the OOM killer uses the value in /proc//oom_score to decide which process to kill. The kernel calculates this value from the process's memory consumption, CPU time (utime + stime), run time (uptime - start time), and oom_adj. The more memory a process consumes, the higher its score; the longer it has been running, the lower its score.
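To see which processes the kernel currently ranks as the most likely victims, the scores can be read straight from the proc filesystem. The sketch below is one possible way to do it; the function name top_oom_scores and its two arguments are hypothetical, and it assumes each process directory exposes oom_score and comm files, as recent kernels do.

```shell
# List the processes with the highest OOM scores as "score pid name".
# $1: number of rows to print (default 10).
# $2: proc mount point, /proc by default (assumption, for dry runs).
top_oom_scores() {
    count="${1:-10}"
    proc_root="${2:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Processes can exit mid-scan, so skip entries that vanished.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        # ${dir##*/} strips the path and leaves the PID.
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n "$count"
}
```

For example, top_oom_scores 5 prints the five highest-scoring processes. The process on the first line is the one the OOM killer would most likely pick if memory ran out at that moment.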
In short, the general strategy is: lose as little work as possible, free as much memory as possible, avoid hurting innocent processes that legitimately use a lot of memory, and kill as few processes as possible.