Docker Containers in Detail (Learning Docker, Part 8: Container Internals, PID)

逗爷 2023-08-14 20:59:32

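I. Introduction to PID namespaces

Linux manages process IDs (PIDs) through namespaces, so the same process can have a different PID in each PID namespace from which it is visible. Every PID namespace keeps its own PID allocation state, and getpid() returns the caller's PID in its own namespace, which generally differs from the number an ancestor namespace uses for the same process.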
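One way to see these per-namespace PIDs from user space, offered as a rough sketch: on kernels 4.1 and later, /proc/<pid>/status has an NSpid line that lists the process's PID in each PID namespace it belongs to, starting with the namespace of the /proc mount being read and then moving inward. The helper name nspid_list is made up for this illustration.

```shell
# nspid_list FILE: print the PIDs found on the NSpid line of a
# /proc/<pid>/status file, separated by spaces.
# (Helper name is illustrative, not a standard tool.)
nspid_list() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}

# Show the list for the current shell, if the file is readable here.
if [ -r "/proc/$$/status" ]; then
    nspid_list "/proc/$$/status"
fi
```

For a process that is not inside any nested namespace this prints a single number. For a process inside one nested namespace, inspected from the host, it prints two numbers: the host-side PID followed by the PID inside the namespace.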

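PID namespaces can be nested, forming parent-child relationships.
A parent namespace can see the processes of all of its descendant namespaces.
A child namespace cannot see processes in its parent namespace or in sibling namespaces.
PID namespaces can currently be nested up to 32 levels deep; the kernel defines this limit as MAX_PID_NS_LEVEL.

Every process on Linux has a corresponding /proc/PID directory that holds information about it. For a given PID namespace, /proc contains only the processes in that namespace and in all of its descendant namespaces.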

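Process IDs on Linux are handed out in increasing order starting from 1, are never shared by two live processes, and are recycled for reuse once a process is gone.
The process with PID 1 is the first user-space process started by the kernel, traditionally init; on systems that use systemd, it is systemd.
When a process's parent exits, the kernel makes init the new parent of that process.
When init exits, the whole system goes down with it.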
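The re-parenting rule can be observed without any namespaces. The following is a small sketch, assuming a Linux /proc; the helper ppid_of and the variable names are invented for the example. Note that on systems with a "child subreaper" (for instance a systemd user session) the new parent may be that subreaper rather than PID 1.

```shell
# ppid_of PID: print the parent PID recorded in /proc/PID/status.
ppid_of() {
    awk '$1 == "PPid:" { print $2 }' "/proc/$1/status"
}

# A short-lived parent shell starts `sleep` in the background, prints its
# own PID and the child's PID, then exits, leaving the child orphaned.
ids=$(sh -c 'sleep 30 >/dev/null 2>&1 & echo "$$ $!"')
old_parent=${ids% *}
orphan=${ids#* }

sleep 1   # give the kernel a moment to finish re-parenting
new_parent=$(ppid_of "$orphan")
echo "orphan $orphan: parent was $old_parent, is now $new_parent"

kill "$orphan"   # clean up the orphaned sleep
```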

II. Experiments

Environment: Ubuntu 18.04 (in my tests, CentOS 7 did not yet fully support fork).

1. A simple example

(1) Check the ID of the current PID namespace

readlink /proc/self/ns/pid

(2) Start a new PID namespace

Start new PID, UTS, and mount namespaces:

sudo unshare --uts --pid --mount --fork /bin/bash

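Output:

xundh@xundh-To-be-filled-by-O-E-M:~$ sudo unshare --uts --pid --mount --fork /bin/bash
[sudo] password for xundh:
root@xundh-To-be-filled-by-O-E-M:~#

Option notes:

--uts: create a new UTS namespace, so a new hostname can be set without touching the host's
--pid: create a new PID namespace
--mount: create a new mount namespace, which makes it easy to change mount information inside the new namespace
--fork: have unshare fork a new process and then replace that new process with bash, so that bash runs as the first process of the new PID namespace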
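To make the effect of --fork concrete, here is a hedged sketch that asks a shell started through unshare which PID it sees for itself. The --user --map-root-user options are an assumption added only so the sketch can run without sudo; as root they can be dropped. The helper name pid_seen_by_shell is hypothetical.

```shell
# pid_seen_by_shell [unshare options...]: start `sh` through unshare with a
# new PID namespace and print the PID that shell sees for itself.
# --user --map-root-user is an assumption made so this runs without sudo.
pid_seen_by_shell() {
    unshare --user --map-root-user --pid "$@" sh -c 'echo $$'
}

if unshare --user --map-root-user --pid --fork true 2>/dev/null; then
    # With --fork the shell is a child of unshare and is PID 1 in the new
    # namespace.
    echo "with --fork:    $(pid_seen_by_shell --fork)"
    # Without --fork the shell replaces unshare itself, which never moved
    # into the new namespace, so it keeps an ordinary PID.
    echo "without --fork: $(pid_seen_by_shell)"
else
    echo "creating namespaces is not permitted in this environment"
fi
```

Without --fork, the first child that the shell later starts would become PID 1 of the new namespace instead, which is rarely what you want for an interactive session.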

(3) Set the hostname and enter the container environment

hostname container001
exec bash

Output:

root@xundh-To-be-filled-by-O-E-M:~# hostname
xundh-To-be-filled-by-O-E-M
root@xundh-To-be-filled-by-O-E-M:~# hostname container001
root@xundh-To-be-filled-by-O-E-M:~# exec bash
root@container001:~#

(4) View the process relationships

pstree -pl

(5) Check which PID namespace a process belongs to

readlink /proc/2526/ns/pid
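The link target has the form pid:[inode], and two processes share a PID namespace exactly when those targets match. As a small sketch (both helper names are invented here), the comparison can be scripted like this; reading the link of a process owned by another user generally needs root.

```shell
# ns_inode LINK: extract the inode number from a namespace link target
# such as "pid:[4026531836]".
ns_inode() {
    printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)]$/\1/'
}

# same_pidns PID_A PID_B: succeed if both processes are in the same PID
# namespace; return 2 if either link cannot be read.
same_pidns() {
    _a=$(readlink "/proc/$1/ns/pid") || return 2
    _b=$(readlink "/proc/$2/ns/pid") || return 2
    [ "$_a" = "$_b" ]
}

ns_inode 'pid:[4026531836]'
```

A call such as `same_pidns $$ 2526` run as root on the host would then report whether process 2526 shares the calling shell's PID namespace.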

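2. Nested PID namespaces

After a call to unshare() or setns() for a PID namespace, the calling process itself stays in its original PID namespace and does not join the new one; only the children it creates afterwards are placed in the new namespace.
A process in a PID namespace may have a parent that lives outside it, in the parent namespace; such a process sees ppid = 0.
An ancestor namespace can see every process in its descendant namespaces and can send signals to them, but a process has a different PID in each namespace.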

(1) Note the ID of the outermost namespace

xundh@xundh-To-be-filled-by-O-E-M:~$ readlink /proc/$$/ns/pid
pid:[4026531836]

(2) Create a new PID namespace

xundh@xundh-To-be-filled-by-O-E-M:~$ sudo unshare --uts --pid --mount --fork --mount-proc /bin/bash
root@xundh-To-be-filled-by-O-E-M:~# hostname container001
root@xundh-To-be-filled-by-O-E-M:~# exec bash
root@container001:~# readlink /proc/$$/ns/pid
pid:[4026532324]
root@container001:~#

(3) Create another PID namespace inside it

root@container001:~# unshare --uts --pid --mount --fork --mount-proc /bin/bash
root@container001:~# hostname container002
root@container001:~# exec bash
root@container002:~# readlink /proc/$$/ns/pid
pid:[4026532327]

(4) Create one more PID namespace inside that

root@container002:~# unshare --uts --pid --mount --fork --mount-proc /bin/bash
root@container002:~# hostname container003
root@container002:~# exec bash
root@container003:~# readlink /proc/$$/ns/pid
pid:[4026532330]

Open another terminal window to view the namespace relationships:

pstree -pl

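3. "init" example

When a process's parent is killed, the process is adopted by the PID 1 process of its current namespace, not by the system-level init in the outermost namespace. When the PID 1 process of a namespace stops running, the kernel sends SIGKILL to every other process in that namespace and in its descendant namespaces, so all of them stop and the namespace and its descendants are destroyed.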

(1) Continuing from the example above, start 3 new bash shells in container003

root@container003:~# bash
root@container003:~# bash
root@container003:~# bash
root@container003:~# pstree
bash───bash───bash───bash───pstree
root@container003:~#

(2) Fork a new process

root@container003:~# unshare --fork nohup sleep 3600&
[1] 44
root@container003:~# nohup: ignoring input and appending output to 'nohup.out'

root@container003:~#

The child process runs in the background and sleeps for 1 hour.

The process tree now looks like this:

root@container003:~# pstree -p
bash(1)───bash(19)───bash(27)───bash(35)─┬─pstree(46)
                                         └─unshare(44)───sleep(45)

(3) Kill process 44

root@container003:~# kill 44
root@container003:~# pstree -p
bash(1)─┬─bash(19)───bash(27)───bash(35)───pstree(47)
        └─sleep(45)
[1] Terminated unshare --fork nohup sleep 3600
root@container003:~#

We can see that sleep(45), which used to sit under unshare(44), has been adopted by bash(1).
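The other half of the rule, that everything in a namespace is killed once its PID 1 exits, can be watched with a pipe. This is only a sketch: the helper name init_exit_demo is invented, and --user --map-root-user is an assumption added so it runs without sudo. The trick is that `cat` reaches end-of-file only after every process holding the write end of the pipe has gone, including the background sleep.

```shell
# init_exit_demo: PID 1 of a new namespace starts a long `sleep` in the
# background and exits at once. The kernel then kills the sleep, which
# closes the pipe and lets `cat` finish.
# --user --map-root-user is an assumption made so this runs without sudo.
init_exit_demo() {
    unshare --user --map-root-user --pid --fork \
        sh -c 'sleep 300 & exit 0' | cat
}

if unshare --user --map-root-user --pid --fork true 2>/dev/null; then
    start=$(date +%s)
    init_exit_demo
    echo "pipeline finished after $(( $(date +%s) - start )) s"
else
    echo "creating namespaces is not permitted in this environment"
fi
```

Run the same `sh -c 'sleep 300 & exit 0' | cat` pipeline without unshare and it blocks until the sleep finishes, because the orphaned sleep survives its parent.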

III. How PID namespaces are managed

1. The pid_namespace structure (include/linux/pid_namespace.h)

struct pid_namespace {
    struct kref kref;                     // reference count
    struct pidmap pidmap[PIDMAP_ENTRIES]; // PID allocation bitmap; a bit set to 1 means that PID is already allocated
    int last_pid;                         // last PID allocated; in principle the next one is last_pid + 1
    struct task_struct *child_reaper;     // process that takes over orphaned processes in this namespace
    struct kmem_cache *pid_cachep;        // slab cache used to allocate struct pid objects
    unsigned int level;                   // nesting depth of this PID namespace
    struct pid_namespace *parent;         // parent PID namespace
#ifdef CONFIG_PROC_FS
    struct vfsmount *proc_mnt;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
    struct bsd_acct_struct *bacct;
#endif
};

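Member notes:

pidmap records which PIDs have been allocated in this PID namespace. It is an array of bitmaps in which each bit says whether the PID at that offset has been handed out; initially only one element of the array is in use.

The pidmap structure (include/linux/pid_namespace.h):

struct pidmap {
    atomic_t nr_free; // number of bits still 0 in this bitmap, i.e. PIDs not yet allocated
    void *page;       // a contiguous block of memory; each bit (0 or 1) says whether the corresponding PID is allocated
};

By default the maximum PID is 32768, which is exactly what a single page can track: a Linux page is 4 KB by default, and 4 × 1024 × 8 bits = 32768 bits. Once the PID limit is raised above 32768, the remaining pidmap entries come into play; the array of pidmaps exists precisely to support PID limits larger than 32768.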
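The arithmetic is easy to check. The sketch below assumes a 4 KB page, as in the text, and uses an invented helper name, pidmap_pages, to round the PID limit up to whole one-page bitmaps; 4194304 is used as an example of a large limit.

```shell
page_size=4096                      # assumed 4 KB page, as in the text
bits_per_page=$((page_size * 8))    # one bit per PID

# pidmap_pages PID_MAX: number of one-page bitmaps needed to cover PID_MAX
# PIDs, rounded up. (Helper name is illustrative only.)
pidmap_pages() {
    echo $(( ($1 + bits_per_page - 1) / bits_per_page ))
}

echo "bits per page: $bits_per_page"
echo "pages for pid_max=32768:   $(pidmap_pages 32768)"
echo "pages for pid_max=4194304: $(pidmap_pages 4194304)"
```

On a running system the current limit can be read from /proc/sys/kernel/pid_max and passed to the same helper.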

child_reaper is the process that takes over orphans: when a parent process exits before its child, the child's parent is updated to child_reaper.

IV. Testing PID namespaces with a program

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

/* Stack for the clone()d child; stack size is 1 MB */
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

char* const container_args[] = {
    "/bin/bash",
    NULL
};

int container_main(void* arg)
{
    /* Print the child's PID; we can see that it is 1 */
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container", 9); /* set the hostname */
    execv(container_args[0], container_args);
    printf("Something's wrong!\n"); /* only reached if execv fails */
    return 1;
}

int main()
{
    printf("Parent - start a container!\n");
    /* Enable the PID namespace with CLONE_NEWPID.
       The stack grows downward, so pass the top of the buffer. */
    int container_pid = clone(container_main, container_stack + STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (container_pid == -1) {
        perror("clone"); /* usually means the program was not run as root */
        return 1;
    }
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

Runtime environment: CentOS 7

[root@localhost c4]# ls
a.out  main.c
[root@localhost c4]# ./a.out
Parent - start a container!
Container [    1] - inside the container!
[root@container c4]# echo $$
1
[root@container c4]#

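Here the child's PID is 1. On a traditional UNIX system, the process with PID 1 is init, the ancestor of all processes, and it enjoys special privileges (for example, it can shield itself from signals). It also watches over the state of every process: if a child becomes detached from its parent (the parent never called wait on it), init takes responsibility for reaping it and releasing its resources. So to isolate the process space, the first step is to create a process with PID 1; ideally, much as with chroot, the child's PID becomes 1 inside the container.

However, if we type commands such as ps or top in the child's shell, we can still see every process, which shows that the isolation is not complete. The reason is that ps, top, and similar tools read the /proc file system, and /proc is the same for the parent and the child, so these commands display the same information in both.

So we also need to isolate the file system.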
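The difference a private /proc makes can be previewed with unshare instead of the C program. This is a sketch only: the helper name count_visible_pids is invented, --user --map-root-user is an assumption added so it runs without sudo, and --mount-proc is the same unshare option used in the nesting experiment earlier.

```shell
# count_visible_pids [unshare options...]: run a shell in a new PID
# namespace and count the numeric (per-process) entries it sees in /proc.
# --user --map-root-user is an assumption made so this runs without sudo.
count_visible_pids() {
    unshare --user --map-root-user --pid --fork "$@" \
        sh -c 'ls -d /proc/[0-9]* | wc -l'
}

if unshare --user --map-root-user --pid --fork --mount-proc true 2>/dev/null; then
    # Inherited /proc: still describes the outer namespace's processes.
    echo "inherited /proc: $(count_visible_pids) process entries"
    # Fresh /proc mounted inside the new namespace: only its own processes.
    echo "fresh /proc:     $(count_visible_pids --mount-proc) process entries"
else
    echo "mounting a private /proc is not permitted in this environment"
fi
```

With the inherited /proc the count matches what the host sees, while the fresh mount shows only the handful of processes running inside the new namespace.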
