[Original] What is runC?
2022-1-11 11:19 7070

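What is runC

The OCI standard

A container runtime is the tool that manages and runs containers. Many container tools exist today, such as docker and rkt, but if each of them shipped its own incompatible runtime, the container ecosystem as a whole would suffer. Several container vendors therefore jointly defined standards for the container image format and the container runtime under the Open Container Initiative (OCI).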

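OCI bundle

An OCI bundle is a set of files that conforms to the OCI standard and contains all the data needed to run a container. The files live in a single directory, which contains the following two items:

  1. config.json: the configuration data for running the container
  2. the container's root filesystem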
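The two-item layout above can be checked mechanically. The following is a minimal sketch, not runC's own validation: checkBundle and makeDemoBundle are hypothetical helpers, and the only config.json field it relies on is root.path, which the OCI runtime spec resolves relative to the bundle directory when it is not absolute.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// bundleConfig decodes only the part of config.json this sketch needs.
type bundleConfig struct {
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// checkBundle (hypothetical helper) reports whether dir contains a config.json
// and the root filesystem directory that the config points to.
func checkBundle(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "config.json"))
	if err != nil {
		return fmt.Errorf("reading config.json: %w", err)
	}
	var cfg bundleConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return fmt.Errorf("parsing config.json: %w", err)
	}
	if cfg.Root.Path == "" {
		return errors.New("config.json has no root.path")
	}
	rootfs := cfg.Root.Path
	if !filepath.IsAbs(rootfs) {
		// Relative paths are resolved against the bundle directory.
		rootfs = filepath.Join(dir, rootfs)
	}
	info, err := os.Stat(rootfs)
	if err != nil {
		return fmt.Errorf("checking root filesystem: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("%s is not a directory", rootfs)
	}
	return nil
}

// makeDemoBundle writes a throwaway bundle so the sketch can run anywhere.
func makeDemoBundle() (string, error) {
	dir, err := os.MkdirTemp("", "bundle-demo")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(dir, "rootfs"), 0o755); err != nil {
		return "", err
	}
	cfg := []byte(`{"ociVersion":"1.0.2","root":{"path":"rootfs"}}`)
	return dir, os.WriteFile(filepath.Join(dir, "config.json"), cfg, 0o644)
}

func main() {
	dir, err := makeDemoBundle()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	fmt.Println("bundle check error:", checkBundle(dir))
}
```

A real bundle's config.json carries far more than root.path; the sketch only confirms that the two required pieces are present.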

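runC architecture

The main code logic of runC rests on libcontainer, which was one of the foundations of early Docker and was wrapped a second time to fit the OCI format.

Taking runc create as an example, the main operation behind it is:

  • startContainer: reads config.json, converts its contents into the in-memory data structures defined by the OCI standard, tries to create the container, and then performs a different operation depending on its arguments, such as Run, Start, or Restore.

The data structures for a container are shown below. An interface is defined that covers every operation a container needs: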

type BaseContainer interface {
    // Returns the ID of the container
    ID() string
 
    // Returns the current status of the container.
    Status() (Status, error)
 
    // State returns the current container's state information.
    State() (*State, error)
 
    // OCIState returns the current container's state information.
    OCIState() (*specs.State, error)
 
    // Returns the current config of the container.
    Config() configs.Config
 
    // Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
    //
    // Some of the returned PIDs may no longer refer to processes in the Container, unless
    // the Container state is PAUSED in which case every PID in the slice is valid.
    Processes() ([]int, error)
 
    // Returns statistics for the container.
    Stats() (*Stats, error)
 
    // Set resources of container as configured
    //
    // We can use this to change resources when containers are running.
    //
    Set(config configs.Config) error
 
    // Start a process inside the container. Returns error if process fails to
    // start. You can track process lifecycle with passed Process structure.
    Start(process *Process) (err error)
 
    // Run immediately starts the process inside the container.  Returns error if process
    // fails to start.  It does not block waiting for the exec fifo  after start returns but
    // opens the fifo after start returns.
    Run(process *Process) (err error)
 
    // Destroys the container, if its in a valid state, after killing any
    // remaining running processes.
    //
    // Any event registrations are removed before the container is destroyed.
    // No error is returned if the container is already destroyed.
    //
    // Running containers must first be stopped using Signal(..).
    // Paused containers must first be resumed using Resume(..).
    Destroy() error
 
    // Signal sends the provided signal code to the container's initial process.
    //
    // If all is specified the signal is sent to all processes in the container
    // including the initial process.
    Signal(s os.Signal, all bool) error
 
    // Exec signals the container to exec the users process at the end of the init.
    Exec() error
}

On Linux, this interface is embedded in a larger one that adds platform-specific methods:

// Container is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found.
type Container interface {
    BaseContainer
 
    // Methods below here are platform specific
 
    // Checkpoint checkpoints the running container's state to disk using the criu(8) utility.
    Checkpoint(criuOpts *CriuOpts) error
 
    // Restore restores the checkpointed container to a running state using the criu(8) utility.
    Restore(process *Process, criuOpts *CriuOpts) error
 
    // If the Container state is RUNNING or CREATED, sets the Container state to PAUSING and pauses
    // the execution of any user processes. Asynchronously, when the container finished being paused the
    // state is changed to PAUSED.
    // If the Container state is PAUSED, do nothing.
    Pause() error
 
    // If the Container state is PAUSED, resumes the execution of any user processes in the
    // Container before setting the Container state to RUNNING.
    // If the Container state is RUNNING, do nothing.
    Resume() error
 
    // NotifyOOM returns a read-only channel signaling when the container receives an OOM notification.
    NotifyOOM() (<-chan struct{}, error)
 
    // NotifyMemoryPressure returns a read-only channel signaling when the container reaches a given pressure level
    NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}
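
The wrapping relies on Go interface embedding. Below is a small sketch of the same idea with toy types: baseContainer, container, toyContainer and describe are illustrative names, not part of libcontainer, and the state strings are simplified. The Pause and Resume rules follow the comments in the interface above.

```go
package main

import "fmt"

// baseContainer stands in for BaseContainer: platform-independent operations.
type baseContainer interface {
	ID() string
	Status() string
}

// container stands in for the Linux Container interface: it embeds the base
// interface and adds platform-specific methods.
type container interface {
	baseContainer
	Pause() error
	Resume() error
}

// toyContainer is a hypothetical in-memory implementation.
type toyContainer struct {
	id     string
	status string
}

func (c *toyContainer) ID() string     { return c.id }
func (c *toyContainer) Status() string { return c.status }

// Pause is a no-op when already paused and only valid from running/created.
func (c *toyContainer) Pause() error {
	switch c.status {
	case "paused":
		return nil
	case "running", "created":
		c.status = "paused"
		return nil
	}
	return fmt.Errorf("cannot pause container in state %q", c.status)
}

// Resume is a no-op when already running and only valid from paused.
func (c *toyContainer) Resume() error {
	switch c.status {
	case "running":
		return nil
	case "paused":
		c.status = "running"
		return nil
	}
	return fmt.Errorf("cannot resume container in state %q", c.status)
}

// Compile-time check that *toyContainer satisfies the wider interface.
var _ container = (*toyContainer)(nil)

// describe only needs the base interface, yet accepts any container.
func describe(b baseContainer) string {
	return b.ID() + ": " + b.Status()
}

func main() {
	var c container = &toyContainer{id: "demo", status: "running"}
	fmt.Println(describe(c))
	fmt.Println(c.Pause(), describe(c))
}
```

Because container embeds baseContainer, any value that satisfies the wider interface can be passed wherever only the base operations are required.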

Another important interface is Factory:

type Factory interface {
    // Creates a new container with the given id and starts the initial process inside it.
    // id must be a string containing only letters, digits and underscores and must contain
    // between 1 and 1024 characters, inclusive.
    //
    // The id must not already be in use by an existing container. Containers created using
    // a factory with the same path (and filesystem) must have distinct ids.
    //
    // Returns the new container with a running process.
    //
    // On error, any partially created container parts are cleaned up (the operation is atomic).
    Create(id string, config *configs.Config) (Container, error)
 
    // Load takes an ID for an existing container and returns the container information
    // from the state.  This presents a read only view of the container.
    Load(id string) (Container, error)
 
    // StartInitialization is an internal API to libcontainer used during the reexec of the
    // container.
    StartInitialization() error
 
    // Type returns info string about factory type (e.g. lxc, libcontainer...)
    Type() string
}

It also has a Linux implementation:

// LinuxFactory implements the default factory interface for linux based systems.
type LinuxFactory struct {
    // Root directory for the factory to store state.
    Root string
 
    // InitPath is the path for calling the init responsibilities for spawning
    // a container.
    InitPath string
 
    // InitArgs are arguments for calling the init responsibilities for spawning
    // a container.
    InitArgs []string
 
    // CriuPath is the path to the criu binary used for checkpoint and restore of
    // containers.
    CriuPath string
 
    // New{u,g}idmapPath is the path to the binaries used for mapping with
    // rootless containers.
    NewuidmapPath string
    NewgidmapPath string
 
    // Validator provides validation to container configurations.
    Validator validate.Validator
 
    // NewIntelRdtManager returns an initialized Intel RDT manager for a single container.
    NewIntelRdtManager func(config *configs.Config, id string, path string) intelrdt.Manager
}

LinuxFactory's Create method essentially builds a linuxContainer, the concrete type behind the Linux Container interface described above:

type linuxContainer struct {
    id                   string
    root                 string
    config               *configs.Config
    cgroupManager        cgroups.Manager
    intelRdtManager      intelrdt.Manager
    initPath             string
    initArgs             []string
    initProcess          parentProcess
    initProcessStartTime uint64
    criuPath             string
    newuidmapPath        string
    newgidmapPath        string
    m                    sync.Mutex
    criuVersion          int
    state                containerState
    created              time.Time
    fifo                 *os.File
}

The call site in runC is createContainer:

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
    rootlessCg, err := shouldUseRootlessCgroupManager(context)
    if err != nil {
        return nil, err
    }
    config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
        CgroupName:       id,
        UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
        NoPivotRoot:      context.Bool("no-pivot"),
        NoNewKeyring:     context.Bool("no-new-keyring"),
        Spec:             spec,
        RootlessEUID:     os.Geteuid() != 0,
        RootlessCgroups:  rootlessCg,
    })
    if err != nil {
        return nil, err
    }
 
    factory, err := loadFactory(context)
    if err != nil {
        return nil, err
    }
    return factory.Create(id, config)
}

As shown, createContainer first builds the libcontainer config, then uses loadFactory to obtain a LinuxFactory, and finally calls factory.Create(id, config), which returns a linuxContainer. loadFactory is the key step: it ends by calling libcontainer.New() to return the LinuxFactory, and New sets InitPath, which matters a great deal later on:

// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
    if root != "" {
        if err := os.MkdirAll(root, 0o700); err != nil {
            return nil, err
        }
    }
    l := &LinuxFactory{
        Root:      root,
        InitPath:  "/proc/self/exe",
        InitArgs:  []string{os.Args[0], "init"},
        Validator: validate.New(),
        CriuPath:  "criu",
    }
 
    for _, opt := range options {
        if opt == nil {
            continue
        }
        if err := opt(l); err != nil {
            return nil, err
        }
    }
    return l, nil
}
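
New follows the functional-options pattern: defaults are set first, then each option function may override them or fail. A minimal sketch of the pattern, using a cut-down factory type; the names factory, option, withCriuPath and newFactory are assumptions made for illustration and are not libcontainer's API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// factory is a cut-down stand-in for LinuxFactory.
type factory struct {
	Root     string
	InitPath string
	InitArgs []string
	CriuPath string
}

// option mutates the factory and may reject invalid settings.
type option func(*factory) error

// withCriuPath (hypothetical) overrides the default criu binary path.
func withCriuPath(p string) option {
	return func(f *factory) error {
		if p == "" {
			return errors.New("criu path must not be empty")
		}
		f.CriuPath = p
		return nil
	}
}

// newFactory applies defaults first, then every non-nil option in order.
func newFactory(root string, opts ...option) (*factory, error) {
	f := &factory{
		Root:     root,
		InitPath: "/proc/self/exe",               // re-execute the current binary
		InitArgs: []string{os.Args[0], "init"}, // with "init" as its argument
		CriuPath: "criu",
	}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := newFactory("/run/demo", withCriuPath("/usr/sbin/criu"))
	fmt.Println(f.InitPath, f.CriuPath, err)
}
```

The benefit of this design is that callers only mention the settings they want to change, while the InitPath and InitArgs defaults stay in one place.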

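During LinuxFactory's Create, InitPath and InitArgs are passed on to the linuxContainer. Now that we know how a linuxContainer is created, let us return to startContainer. At its end, the function builds a runner struct and calls its run method with spec.Process as the argument; spec.Process is simply the process information from config.json.

In the run method, newProcess first builds a libcontainer.Process struct using config.json as the template; process-related settings such as rlimits and capabilities are filled in at this point. The method then performs one of three operations depending on the action: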

switch r.action {
case CT_ACT_CREATE:
    err = r.container.Start(process)
case CT_ACT_RESTORE:
    err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
    err = r.container.Run(process)
default:
    panic("Unknown action")
}

The Process struct is shown below; most of its contents come from config.json:

// Process specifies the configuration and IO for a process inside
// a container.
type Process struct {
    // The command to be run followed by any arguments.
    Args []string
 
    // Env specifies the environment variables for the process.
    Env []string
 
    // User will set the uid and gid of the executing process running inside the container
    // local to the container's user and group configuration.
    User string
 
    // AdditionalGroups specifies the gids that should be added to supplementary groups
    // in addition to those that the user belongs to.
    AdditionalGroups []string
 
    // Cwd will change the processes current working directory inside the container's rootfs.
    Cwd string
 
    // Stdin is a pointer to a reader which provides the standard input stream.
    Stdin io.Reader
 
    // Stdout is a pointer to a writer which receives the standard output stream.
    Stdout io.Writer
 
    // Stderr is a pointer to a writer which receives the standard error stream.
    Stderr io.Writer
 
    // ExtraFiles specifies additional open files to be inherited by the container
    ExtraFiles []*os.File
 
    // Initial sizings for the console
    ConsoleWidth  uint16
    ConsoleHeight uint16
 
    // Capabilities specify the capabilities to keep when executing the process inside the container
    // All capabilities not specified will be dropped from the processes capability mask
    Capabilities *configs.Capabilities
 
    // AppArmorProfile specifies the profile to apply to the process and is
    // changed at the time the process is execed
    AppArmorProfile string
 
    // Label specifies the label to apply to the process.  It is commonly used by selinux
    Label string
 
    // NoNewPrivileges controls whether processes can gain additional privileges.
    NoNewPrivileges *bool
 
    // Rlimits specifies the resource limits, such as max open files, to set in the container
    // If Rlimits are not set, the container will inherit rlimits from the parent process
    Rlimits []configs.Rlimit
 
    // ConsoleSocket provides the masterfd console.
    ConsoleSocket *os.File
 
    // Init specifies whether the process is the first process in the container.
    Init bool
 
    ops processOperations
 
    LogLevel string
 
    // SubCgroupPaths specifies sub-cgroups to run the process in.
    // Map keys are controller names, map values are paths (relative to
    // container's top-level cgroup).
    //
    // If empty, the default top-level container's cgroup is used.
    //
    // For cgroup v2, the only key allowed is "".
    SubCgroupPaths map[string]string
}
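
To make the link to config.json concrete, here is a hedged sketch of the kind of conversion newProcess performs. It is not runC's code: specProcess, toyProcess and newToyProcess are illustrative names, only a few fields of the process section (args, env, cwd, user.uid, user.gid) are handled, and rendering the user as a "uid:gid" string is an assumption about how the User field is filled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// specProcess mirrors a few fields of the "process" section of config.json.
type specProcess struct {
	Args []string `json:"args"`
	Env  []string `json:"env"`
	Cwd  string   `json:"cwd"`
	User struct {
		UID uint32 `json:"uid"`
		GID uint32 `json:"gid"`
	} `json:"user"`
}

// toyProcess is a cut-down stand-in for libcontainer.Process.
type toyProcess struct {
	Args []string
	Env  []string
	User string
	Cwd  string
	Init bool
}

// newToyProcess (hypothetical) decodes config.json and copies the process
// settings into the runtime-side struct.
func newToyProcess(configJSON []byte, isInit bool) (*toyProcess, error) {
	var cfg struct {
		Process *specProcess `json:"process"`
	}
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config.json: %w", err)
	}
	if cfg.Process == nil {
		return nil, errors.New("config.json has no process section")
	}
	p := cfg.Process
	return &toyProcess{
		Args: p.Args,
		Env:  p.Env,
		User: fmt.Sprintf("%d:%d", p.User.UID, p.User.GID), // assumed "uid:gid" form
		Cwd:  p.Cwd,
		Init: isInit,
	}, nil
}

// sampleConfig is made-up input for the demo.
const sampleConfig = `{
  "process": {
    "args": ["sh", "-c", "echo hello"],
    "env": ["PATH=/usr/bin:/bin"],
    "cwd": "/",
    "user": {"uid": 0, "gid": 0}
  }
}`

func main() {
	p, err := newToyProcess([]byte(sampleConfig), true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *p)
}
```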

The Start method:

func (c *linuxContainer) Start(process *Process) error {
    c.m.Lock()
    defer c.m.Unlock()
    if c.config.Cgroups.Resources.SkipDevices {
        return errors.New("can't start container with SkipDevices set")
    }
    if process.Init {
        if err := c.createExecFifo(); err != nil {
            return err
        }
    }
    if err := c.start(process); err != nil {
        if process.Init {
            c.deleteExecFifo()
        }
        return err
    }
    return nil
}

As shown, Start mainly creates a FIFO (it is used for blocking and comes up again later) and then calls the unexported start method.
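
Why a FIFO is suitable for blocking: opening a FIFO write-only does not return until some other process opens it for reading. The sketch below demonstrates that behaviour inside one process, with a goroutine standing in for the blocked side. waitOnFifo is a hypothetical helper, the file name, the 0o622 mode and the single "0" byte are illustrative choices, and the code assumes a Unix-like system where syscall.Mkfifo is available.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// waitOnFifo creates a FIFO, lets a goroutine block on opening it for
// writing, and then releases that goroutine by opening the read side.
// It returns whatever the writer sent.
func waitOnFifo() (string, error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "exec.fifo")
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		return "", err
	}

	writeErr := make(chan error, 1)
	go func() {
		// Blocked side: this open does not return until a reader shows up.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			writeErr <- err
			return
		}
		defer w.Close()
		_, err = w.Write([]byte("0"))
		writeErr <- err
	}()

	// Releasing side: opening for reading unblocks the writer.
	r, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer r.Close()

	data, err := io.ReadAll(r) // returns once the writer has closed its end
	if err != nil {
		return "", err
	}
	if err := <-writeErr; err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	got, err := waitOnFifo()
	fmt.Printf("received %q, err=%v\n", got, err)
}
```

In a real runtime the two sides would be separate processes, so the blocked side stays parked until another command opens the FIFO.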

func (c *linuxContainer) start(process *Process) (retErr error) {
    parent, err := c.newParentProcess(process)
    if err != nil {
        return fmt.Errorf("unable to create new parent process: %w", err)
    }
 
    logsDone := parent.forwardChildLogs()
    if logsDone != nil {
        defer func() {
            // Wait for log forwarder to finish. This depends on
            // runc init closing the _LIBCONTAINER_LOGPIPE log fd.
            err := <-logsDone
            if err != nil && retErr == nil {
                retErr = fmt.Errorf("unable to forward init logs: %w", err)
            }
        }()
    }
 
    if err := parent.start(); err != nil {
        return fmt.Errorf("unable to start container process: %w", err)
    }
 
    if process.Init {
        c.fifo.Close()
        if c.config.Hooks != nil {
            s, err := c.currentOCIState()
            if err != nil {
                return err
            }
 
            if err := c.config.Hooks[configs.Poststart].RunHooks(s); err != nil {
                if err := ignoreTerminateErrors(parent.terminate()); err != nil {
                    logrus.Warn(fmt.Errorf("error running poststart hook: %w", err))
                }
                return err
            }
        }
    }
    return nil
}

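The first step of this method, newParentProcess, returns an initProcess struct when the process is the container's init process. This struct implements the parentProcess interface and is built by linuxContainer's newInitProcess function.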

type initProcess struct {
    cmd             *exec.Cmd
    messageSockPair filePair
    logFilePair     filePair
    config          *initConfig
    manager         cgroups.Manager
    intelRdtManager intelrdt.Manager
    container       *linuxContainer
    fds             []string
    process         *Process
    bootstrapData   io.Reader
    sharePidns      bool
}

The interface is as follows:

type parentProcess interface {
    // pid returns the pid for the running process.
    pid() int
 
    // start starts the process execution.
    start() error
 
    // send a SIGKILL to the process and wait for the exit.
    terminate() error
 
    // wait waits on the process returning the process state.
    wait() (*os.ProcessState, error)
 
    // startTime returns the process start time.
    startTime() (uint64, error)
 
    signal(os.Signal) error
 
    externalDescriptors() []string
 
    setExternalDescriptors(fds []string)
 
    forwardChildLogs() chan error
}

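In newParentProcess, a socket pair and a pipe pair are created first, and the child ends of both are used to build a cmd template. The command it runs is exactly the path set earlier in InitPath ("/proc/self/exe") with the argument "init", which means runC executes itself with init as the argument. The socket and the pipe provide communication between the cmd process and its parent: they are placed in cmd.ExtraFiles, and the corresponding file descriptor numbers are put into environment variables. Next, the function checks whether the process is the container's init process. If it is not, newSetnsProcess is called and returns a setnsProcess struct, which also implements the parentProcess interface; newSetnsProcess is mainly used to create a new process inside an existing container.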
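
The fd-passing mechanism described here (a file in cmd.ExtraFiles plus an environment variable that tells the child which fd number it received) can be sketched as follows. This is an illustration under assumptions: the child is a plain sh rather than a re-executed runC, the variable name _DEMO_PIPE and the helper runChildWithPipe are made up, and sh must be on PATH.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// stdioFdCount is the number of fds (stdin, stdout, stderr) that precede
// ExtraFiles in the child process.
const stdioFdCount = 3

// runChildWithPipe starts a child that inherits the write end of a pipe via
// ExtraFiles and learns the fd number from an environment variable.
func runChildWithPipe() (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()

	// The child writes one line to the fd named in _DEMO_PIPE.
	cmd := exec.Command("sh", "-c", "echo ready >&$_DEMO_PIPE")
	cmd.ExtraFiles = append(cmd.ExtraFiles, w)
	// Entry i of ExtraFiles becomes fd 3+i in the child.
	childFd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(os.Environ(), "_DEMO_PIPE="+strconv.Itoa(childFd))

	err = cmd.Start()
	w.Close() // the parent's copy must go away so the reader can see EOF
	if err != nil {
		return "", err
	}

	line, readErr := bufio.NewReader(r).ReadString('\n')
	if err := cmd.Wait(); err != nil {
		return "", err
	}
	if readErr != nil {
		return "", readErr
	}
	return strings.TrimSpace(line), nil
}

func main() {
	msg, err := runChildWithPipe()
	fmt.Printf("child said %q, err=%v\n", msg, err)
}
```

Computing the fd number from the length of ExtraFiles, rather than hard-coding it, keeps the environment variable correct when more files are appended later.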

 

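For an init process, includeExecFifo() runs next: it opens the exec.fifo file created earlier and stores it in cmd.ExtraFiles and in an environment variable. Finally the most important function, newInitProcess, is called to create the initProcess struct: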

func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
    cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
    nsMaps := make(map[configs.NamespaceType]string)
    for _, ns := range c.config.Namespaces {
        if ns.Path != "" {
            nsMaps[ns.Type] = ns.Path
        }
    }
    _, sharePidns := nsMaps[configs.NEWPID]
    data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps, initStandard)
    if err != nil {
        return nil, err
    }
 
    if c.shouldSendMountSources() {
        // Elements on this slice will be paired with mounts (see StartInitialization() and
        // prepareRootfs()). This slice MUST have the same size as c.config.Mounts.
        mountFds := make([]int, len(c.config.Mounts))
        for i, m := range c.config.Mounts {
            if !m.IsBind() {
                // Non bind-mounts do not use an fd.
                mountFds[i] = -1
                continue
            }
 
            // The fd passed here will not be used: nsexec.c will overwrite it with dup3(). We just need
            // to allocate a fd so that we know the number to pass in the environment variable. The fd
            // must not be closed before cmd.Start(), so we reuse messageSockPair.child because the
            // lifecycle of that fd is already taken care of.
            cmd.ExtraFiles = append(cmd.ExtraFiles, messageSockPair.child)
            mountFds[i] = stdioFdCount + len(cmd.ExtraFiles) - 1
        }
 
        mountFdsJson, err := json.Marshal(mountFds)
        if err != nil {
            return nil, fmt.Errorf("Error creating _LIBCONTAINER_MOUNT_FDS: %w", err)
        }
 
        cmd.Env = append(cmd.Env,
            "_LIBCONTAINER_MOUNT_FDS="+string(mountFdsJson),
        )
    }
 
    init := &initProcess{
        cmd:             cmd,
        messageSockPair: messageSockPair,
        logFilePair:     logFilePair,
        manager:         c.cgroupManager,
        intelRdtManager: c.intelRdtManager,
        config:          c.newInitConfig(p),
        container:       c,
        process:         p,
        bootstrapData:   data,
        sharePidns:      sharePidns,
    }
    c.initProcess = init
    return init, nil
}

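This function first sets the _LIBCONTAINER_INITTYPE environment variable to the standard init type, then reads the namespaces requested in config.json (those to create and those to join by path) and packs that data into bootstrapData, and finally creates the initProcess struct. The shouldSendMountSources branch in the middle needs no special attention; it exists to support mounting certain directories. At this point the parentProcess is essentially fully set up.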
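
The namespace handling can be sketched as a split into two groups: namespaces without a path are to be newly created (they contribute a clone flag), while namespaces with a path are to be joined. The sketch below is a simplified model, not libcontainer's implementation; namespace, cloneFlags and splitNamespaces are illustrative names, the type strings follow the OCI config.json spelling, and the flag values are the Linux CLONE_NEW* constants.

```go
package main

import "fmt"

// namespace is a simplified model of one entry in config.json's
// linux.namespaces list.
type namespace struct {
	Type string // e.g. "pid", "network", "mount"
	Path string // non-empty means: join this existing namespace
}

// cloneFlags maps namespace types to the Linux CLONE_NEW* flag values.
var cloneFlags = map[string]uintptr{
	"mount":   0x00020000, // CLONE_NEWNS
	"uts":     0x04000000, // CLONE_NEWUTS
	"ipc":     0x08000000, // CLONE_NEWIPC
	"user":    0x10000000, // CLONE_NEWUSER
	"pid":     0x20000000, // CLONE_NEWPID
	"network": 0x40000000, // CLONE_NEWNET
}

// splitNamespaces returns the clone flags for namespaces that must be
// created, the paths of namespaces that must be joined, and whether the
// PID namespace is shared with an existing one.
func splitNamespaces(nss []namespace) (flags uintptr, joinPaths map[string]string, sharePidns bool) {
	joinPaths = make(map[string]string)
	for _, ns := range nss {
		if ns.Path != "" {
			joinPaths[ns.Type] = ns.Path
			continue
		}
		flags |= cloneFlags[ns.Type]
	}
	_, sharePidns = joinPaths["pid"]
	return flags, joinPaths, sharePidns
}

func main() {
	flags, paths, share := splitNamespaces([]namespace{
		{Type: "pid"},
		{Type: "mount"},
		{Type: "network", Path: "/proc/1234/ns/net"}, // made-up path
	})
	fmt.Printf("flags=%#x join=%v sharePidns=%v\n", flags, paths, share)
}
```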

 

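Back in the start method, the next call is parentProcess's start(), which here is the implementation on initProcess. It launches the /proc/self/exe process configured earlier with the argument init, applies the cgroup settings to that process, and then sends data to the child over the socket. The crucial point is that a runC init child process is started. The container being created may need new namespaces, and running runC init as a child makes it easy to switch namespaces with setns(). However, setns() cannot be used reliably from a multithreaded process, and the Go runtime is multithreaded, so the namespaces must be set up before the Go runtime starts. runC therefore uses cgo so that C code configures the namespaces before the Go runtime is up.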

 

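In the cgo code, the pipe is first obtained from an environment variable (set by the parent process earlier), and the configuration sent by the parent is read from it in netlink message format. The code then creates another socket pair so that it can communicate with the grandchild process. Next, in the manner of a state machine, it uses clone to create a process in the namespaces configured in config.json, after which the original child process exits with exit(0).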
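
For readers unfamiliar with netlink-style framing, the sketch below encodes and decodes attributes in the classic layout: a 16-bit length covering header plus payload, a 16-bit type, then the payload padded to a 4-byte boundary. It only illustrates the framing idea and is not runC's exact wire format; the attribute type numbers used in main, the little-endian byte order and the helper names are assumptions.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const attrHeaderLen = 4 // uint16 length + uint16 type

// align4 rounds n up to the next multiple of 4.
func align4(n int) int { return (n + 3) &^ 3 }

// attr is one decoded type/payload pair.
type attr struct {
	Type    uint16
	Payload []byte
}

// encodeAttr frames a payload; the length field excludes trailing padding.
func encodeAttr(typ uint16, payload []byte) ([]byte, error) {
	length := attrHeaderLen + len(payload)
	if length > 0xffff {
		return nil, errors.New("payload too large for a 16-bit length")
	}
	buf := make([]byte, align4(length))
	binary.LittleEndian.PutUint16(buf[0:2], uint16(length))
	binary.LittleEndian.PutUint16(buf[2:4], typ)
	copy(buf[attrHeaderLen:], payload)
	return buf, nil
}

// decodeAttrs walks a buffer of back-to-back attributes.
func decodeAttrs(buf []byte) ([]attr, error) {
	var out []attr
	for len(buf) > 0 {
		if len(buf) < attrHeaderLen {
			return nil, errors.New("truncated attribute header")
		}
		length := int(binary.LittleEndian.Uint16(buf[0:2]))
		if length < attrHeaderLen || length > len(buf) {
			return nil, errors.New("bad attribute length")
		}
		out = append(out, attr{
			Type:    binary.LittleEndian.Uint16(buf[2:4]),
			Payload: buf[attrHeaderLen:length],
		})
		next := align4(length) // skip padding up to the next attribute
		if next > len(buf) {
			next = len(buf)
		}
		buf = buf[next:]
	}
	return out, nil
}

func main() {
	// Type numbers 1 and 2 are placeholders chosen for the demo.
	a, _ := encodeAttr(1, []byte{0x00, 0x00, 0x02, 0x20})
	b, _ := encodeAttr(2, []byte("net:/proc/1234/ns/net"))
	attrs, err := decodeAttrs(append(a, b...))
	fmt.Println(len(attrs), err)
}
```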

 

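Back in create: after the init process has been launched, it is placed under cgroup limits, which also makes it harder for the child process to escape through cgroups during the steps that follow. The parent then sends bootstrapData to the init process. After that, create obtains the pid of the child that init created and receives over the pipe the fds opened by the child, which it saves. After a series of further settings, sendConfig sends the information about the process to run, as specified in config.json. What remains is container initialization and executing the process configured in config.json; for the details, see the Init function of linuxStandardInit in standard_init_linux.go. This concludes the rough walk-through of how a container starts.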
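
The final hand-off can be pictured as the parent writing a JSON document to its end of a channel and the child decoding it. The sketch below models that exchange with an os.Pipe inside a single process; initConfig here is a tiny made-up struct, and sendConfig, receiveConfig and roundTrip are illustrative helpers rather than runC's functions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// initConfig is a small stand-in for the configuration sent to init.
type initConfig struct {
	Args []string `json:"args"`
	Env  []string `json:"env"`
	Cwd  string   `json:"cwd"`
}

// sendConfig writes the config as one JSON document (parent side).
func sendConfig(w *os.File, cfg initConfig) error {
	return json.NewEncoder(w).Encode(cfg)
}

// receiveConfig reads one JSON document (child side).
func receiveConfig(r *os.File) (initConfig, error) {
	var cfg initConfig
	err := json.NewDecoder(r).Decode(&cfg)
	return cfg, err
}

// roundTrip plays both roles over a pipe: a goroutine sends, the caller
// receives.
func roundTrip(cfg initConfig) (initConfig, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return initConfig{}, err
	}
	defer r.Close()

	sendErr := make(chan error, 1)
	go func() {
		defer w.Close()
		sendErr <- sendConfig(w, cfg)
	}()

	got, err := receiveConfig(r)
	if err != nil {
		return initConfig{}, err
	}
	return got, <-sendErr
}

func main() {
	got, err := roundTrip(initConfig{
		Args: []string{"sh"},
		Env:  []string{"PATH=/usr/bin:/bin"},
		Cwd:  "/",
	})
	fmt.Printf("%+v err=%v\n", got, err)
}
```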

References:

https://segmentfault.com/a/1190000017576314#item-1

https://github.com/opencontainers/runc


Latest replies (1)

ScUpax0s, 2022-1-12 14:54: Support