[Original] Source Code Analysis of the Linux Kernel 5.13 Memory Management Module
2021-9-1 09:47 6400


This article is based on Linux kernel 5.13, the latest release at the time of writing.

 

The memory management module has always been one of the most important parts of the kernel. This article aims to walk through some of its core pieces and, combined with some of our experience in exploit development, deepen our understanding of the kernel.

 

Contents

Before We Begin

NUMA vs. UMA/SMP

Let's start from the very top.

 

Suppose we have three CPUs: C1, C2, and C3.

 

UMA/SMP: Uniform Memory Access

 

Put simply, C1, C2, and C3 act as a whole and share all physical memory. Each processor may have its own private high-speed cache.

 


 

NUMA: Non-Uniform Memory Access

 

Processors C1, C2, and C3 do not "share" memory uniformly.

 


 

Specifically, from CPU1's point of view, memory attached to CPU1's memory controller is considered local memory, while memory attached to CPU2 is considered CPU1's foreign or remote memory.

 

Remote memory access carries extra latency compared to local access, because it must traverse the interconnect (point-to-point links) to reach the remote memory controller. Since memory locations differ, the system experiences "non-uniform" memory access times.

Hierarchy

Memory in the Linux kernel is organized hierarchically, in the order node -> zone -> page.

 


 

As you can see, each CPU maintains its own node, and that node can be understood as the CPU's local memory (NUMA).

 

Each node is further divided into several zones.

 

In the kernel source, the nodes live in a global array:

//arch/x86/mm/numa.c
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
typedef struct pglist_data {
    /*
     * node_zones contains just the zones for THIS node. Not all of the
     * zones may be populated, but it is the full list. It is referenced by
     * this node's node_zonelists as well as other node's node_zonelists.
     */
    struct zone node_zones[MAX_NR_ZONES];
 
    /*
     * node_zonelists contains references to all zones in all nodes.
     * Generally the first zones will be references to this node's
     * node_zones.
     */
    struct zonelist node_zonelists[MAX_ZONELISTS];
 
    int nr_zones; /* number of populated zones in this node */
  ......

As shown, each pglist_data in the array contains that node's node_zones as well as references to them.

 

Each zone maintains several important structures:

  • watermark
    • Characterizes the page usage of the zone; when free pages fall below the watermarks, memory reclaim is started automatically.
  • spanned_pages
    • The number of page frames contained in this zone.
  • long lowmem_reserve[MAX_NR_ZONES]: an array whose main job is to reserve some low-zone memory, so that we do not trigger the OOM killer in a low zone while a higher zone still has plenty of reclaimable memory. It is another form of reserved memory.
  • zone_start_pfn: the starting physical page frame number of the zone; zone_start_pfn + spanned_pages gives the zone's ending page frame number.

    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
  • free_area: tracks how many free page frames remain available for allocation in the zone.

It is worth mentioning that zones themselves come in different types (by analogy with slabs), which can be viewed as follows:

root@ubuntu:~# cat /proc/zoneinfo |grep Node
Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal
Node 0, zone  Movable
Node 0, zone   Device

Next, let's talk about pages and page frames. Their relationship is like that of an egg (the page) to the basket that holds it (the page frame).

 

Generally, a page is 4 KB in size and is the smallest unit of physical memory management.
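As a quick illustration of the page-as-smallest-unit idea, the conversion between a physical address and its page frame number (pfn) can be sketched as follows. This is a toy model assuming 4 KB pages (PAGE_SHIFT = 12); the helper names are ours, not the kernel's:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)   /* 4096 bytes */

/* Toy helpers mirroring the kernel's paddr/pfn conversions:
 * the pfn is simply the physical address with the low 12 offset bits dropped. */
uint64_t paddr_to_pfn(uint64_t paddr) { return paddr >> PAGE_SHIFT; }
uint64_t pfn_to_paddr(uint64_t pfn)   { return pfn << PAGE_SHIFT; }
```

This is also why zone_start_pfn above is defined as zone_start_paddr >> PAGE_SHIFT.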

//mm_types.h
struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    union {
        struct {    /* Page cache and anonymous pages */
            struct list_head lru;
            /* See page-flags.h for PAGE_MAPPING_FLAGS */
            struct address_space *mapping;
            pgoff_t index;        /* Our offset within mapping. */
            /**
             * @private: Mapping-private opaque data.
             * Usually used for buffer_heads if PagePrivate.
             * Used for swp_entry_t if PageSwapCache.
             * Indicates order in the buddy system if PageBuddy.
             */
            unsigned long private;
        };
        struct {    /* page_pool used by netstack */
            /**
             * @dma_addr: might require a 64-bit value on
             * 32-bit architectures.
             */
            unsigned long dma_addr[2];
        };
        struct {    /* slab, slob and slub */
            union {
                struct list_head slab_list;
                struct {    /* Partial pages */
                    struct page *next;
#ifdef CONFIG_64BIT
                    int pages;    /* Nr of pages left */
                    int pobjects;    /* Approximate count */
#else
                    short int pages;
                    short int pobjects;
#endif
                };
            };
            struct kmem_cache *slab_cache; /* not slob */
            /* Double-word boundary */
            void *freelist;        /* first free object */
            union {
                void *s_mem;    /* slab: first object */
                unsigned long counters;        /* SLUB */
                struct {            /* SLUB */
                    unsigned inuse:16;
                    unsigned objects:15;
                    unsigned frozen:1;
                };
            };
        };
        struct {    /* Tail pages of compound page */
            unsigned long compound_head;    /* Bit zero is set */
 
            /* First tail page only */
            unsigned char compound_dtor;
            unsigned char compound_order;
            atomic_t compound_mapcount;
            unsigned int compound_nr; /* 1 << compound_order */
        };
        struct {    /* Second tail page of compound page */
            unsigned long _compound_pad_1;    /* compound_head */
            atomic_t hpage_pinned_refcount;
            /* For both global and memcg */
            struct list_head deferred_list;
        };
        struct {    /* Page table pages */
            unsigned long _pt_pad_1;    /* compound_head */
            pgtable_t pmd_huge_pte; /* protected by page->ptl */
            unsigned long _pt_pad_2;    /* mapping */
            union {
                struct mm_struct *pt_mm; /* x86 pgds only */
                atomic_t pt_frag_refcount; /* powerpc */
            };
#if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;
#else
            spinlock_t ptl;
#endif
        };
        struct {    /* ZONE_DEVICE pages */
            /** @pgmap: Points to the hosting device page map. */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
        };
 
        /** @rcu_head: You can use this to free a page by RCU. */
        struct rcu_head rcu_head;
    };
 
    union {        /* This union is 4 bytes in size. */
        atomic_t _mapcount;
 
        /*
         * If the page is neither PageSlab nor mappable to userspace,
         * the value stored here may help determine what this page
         * is used for.  See page-flags.h for a list of page types
         * which are currently stored here.
         */
        unsigned int page_type;
 
        unsigned int active;        /* SLAB */
        int units;            /* SLOB */
    };
 
    /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
    atomic_t _refcount;
 
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;            /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
 
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
} _struct_page_alignment;

Let's focus on a few of the important members:

  • flags: marks various attributes of the page frame.

    The layout of flags is as follows:

    (figure: layout of the page->flags field)

We mainly care about the flag bits that mark the page's state.

enum pageflags {
    PG_locked,        /* Page is locked. Don't touch. */
    PG_referenced,        /* the page was recently accessed */
    PG_uptodate,
    PG_dirty,            /* the page's data has been modified (dirty page) */
    PG_lru,                /* the page is on an LRU list */
    PG_active,
    PG_workingset,
    PG_waiters,       
    PG_error,
    PG_slab,            /* the page belongs to the slab allocator */
    PG_owner_priv_1,    /* Owner use. If pagecache, fs may use*/
    PG_arch_1,
    PG_reserved,
    PG_private,        /* If pagecache, has fs-private data */
    PG_private_2,        /* If pagecache, has fs aux data */
    PG_writeback,        /* the page is being written back */
    PG_head,        /* A head page */
    PG_mappedtodisk,    /* Has blocks allocated on-disk */
    PG_reclaim,        /* To be reclaimed asap */
    PG_swapbacked,        /* Page is backed by RAM/swap */
    PG_unevictable,        /* Page is "unevictable"  */
#ifdef CONFIG_MMU
    PG_mlocked,        /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
    PG_uncached,        /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
    PG_hwpoison,        /* hardware poisoned page. Don't touch */
#endif
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
    PG_young,
    PG_idle,
#endif
#ifdef CONFIG_64BIT
    PG_arch_2,
#endif
    __NR_PAGEFLAGS,
 
    /* Filesystems */
    PG_checked = PG_owner_priv_1,
 
    /* SwapBacked */
    PG_swapcache = PG_owner_priv_1,    /* the page is in the swap cache */
  /* Swap page: swp_entry_t in private */
 
    /* Two page bits are conscripted by FS-Cache to maintain local caching
     * state.  These bits are set on pages belonging to the netfs's inodes
     * when those inodes are being locally cached.
     */
    PG_fscache = PG_private_2,    /* page backed by cache */
 
    /* XEN */
    /* Pinned in Xen as a read-only pagetable page. */
    PG_pinned = PG_owner_priv_1,
    /* Pinned as part of domain save (see xen_mm_pin_all()). */
    PG_savepinned = PG_dirty,
    /* Has a grant mapping of another (foreign) domain's page. */
    PG_foreign = PG_owner_priv_1,
    /* Remapped by swiotlb-xen. */
    PG_xen_remapped = PG_owner_priv_1,
 
    /* SLOB */
    PG_slob_free = PG_private,
 
    /* Compound pages. Stored in first tail page's flags */
    PG_double_map = PG_workingset,
 
    /* non-lru isolated movable page */
    PG_isolated = PG_reclaim,
 
    /* Only valid for buddy pages. Used to track pages that are reported */
    PG_reported = PG_uptodate,
};
  • _mapcount: the number of times this page frame is mapped (i.e., referenced by page tables).

  • lru: page frames are kept on different lists according to how actively they are used; these lists drive page reclaim.

  • _refcount: the reference count.

    This field must not be used directly; it must be read and written atomically through the functions in include/linux/page_ref.h.

  • pgoff_t index: for a file mapping, the page's offset within the file, in units of page size.

  • mapping: we will cover only the two most common cases:

    • If the page is anonymous, page->mapping points to its anon_vma, and the PAGE_MAPPING_ANON bit is set to mark it as such.

    • If the page is not anonymous, i.e., it is associated with a file, then mapping points to the file inode's address space.

    • Depending on whether the page lies in a VM_MERGEABLE area and whether CONFIG_KSM is enabled, this pointer can point elsewhere still.

      See /include/linux/page-flags.h for details.

With this part understood, reading about how the file-system cache interacts with anonymous-page swapping will deepen the picture:

root@ubuntu:~# free
              total        used        free      shared  buff/cache   available
Mem:        4012836      207344     3317312        1128      488180     3499580
Swap:        998396           0      998396

Page Table Organization

This part focuses on the four-level page table organization under x86-64.

 

That is: PGD -> PUD -> PMD -> PTE

PGD: Page Global Directory
PUD: Page Upper Directory
PMD: Page Middle Directory
PTE: Page Table Entry
 

(figure: four-level page table layout)

A Page Table Walk

Given a virtual address (v_addr), the paging mechanism yields the corresponding physical address (p_addr). The walk from v_addr to p_addr proceeds as follows:

  1. Read the physical base address of the PML4T (page global directory) from the CR3 register.
  2. Combine it with the corresponding bits of v_addr to form the physical address of a page global directory entry.
  3. Read that entry (pgd_t) and extract the physical base address of the PUD.
  4. Add the corresponding v_addr bits to get the physical address of the PUD entry.
  5. Read the pud_t to obtain the base address of the PMD.
  6. Add the corresponding v_addr bits to locate the physical address of the PMD entry.
  7. The pmd_t yields the physical base of the page table; adding the corresponding v_addr bits gives the physical address of the page table entry (pte).
  8. Read the pte_t to obtain the base address of the actual physical page.
  9. The last part of v_addr is the offset within that page; adding it produces the final physical address.
  10. Access the data at that physical address.

Note that every process has its own PGD: a physical page containing an array of pgd_t. In other words, each process has its own set of page tables.

task_struct -> mm_struct -> pgd_t * pgd

On a context switch, it is the process page tables that are switched: the new process's pgd (page directory) is loaded into the CR3 register.
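The ten-step walk above is driven by slicing the virtual address into table indices. This can be sketched as follows: a minimal model of the x86-64 4-level scheme, where each level consumes 9 bits and the low 12 bits are the page offset (helper names are ours, not the kernel's):

```c
#include <stdint.h>

/* Split an x86-64 virtual address into the four 9-bit table indices
 * plus the 12-bit page offset used by the walk described above. */
unsigned pgd_index(uint64_t v)  { return (v >> 39) & 0x1ff; } /* bits 39..47 */
unsigned pud_index(uint64_t v)  { return (v >> 30) & 0x1ff; } /* bits 30..38 */
unsigned pmd_index(uint64_t v)  { return (v >> 21) & 0x1ff; } /* bits 21..29 */
unsigned pte_index(uint64_t v)  { return (v >> 12) & 0x1ff; } /* bits 12..20 */
unsigned page_offset(uint64_t v) { return v & 0xfff; }        /* bits  0..11 */
```

Each index selects one of 512 eight-byte entries in its table, which is why each table fits exactly in one 4 KB page.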

KPTI and Kernel Page Tables

If you are familiar with kernel pwn, you will know a mitigation called KPTI (Kernel Page-Table Isolation). When a challenge enables KPTI, you cannot simply land back in user mode.

 

The core of KPTI is that, with this option enabled, every process has two sets of page tables: kernel-mode page tables (accessible only in kernel mode) and user-mode page tables, living in separate address spaces.

 

Every syscall therefore involves switching between the user and kernel page tables (by switching CR3).

 


 

If we land back in user mode (iretq/sysretq) without properly switching/setting the CR3 register, the page tables will be wrong and we end up with a segmentation fault.

 

When bypassing it, we often use SWITCH_USER_CR3:

mov     rdi, cr3
or      rdi, 1000h
mov     cr3, rdi

to reset the CR3 register.

 

Alternatively, return through the swapgs_restore_regs_and_return_to_usermode path.

 

Knowing what the page tables look like under KPTI, it follows that without KPTI only the process page tables are constantly updated, while there is a single global set of kernel page tables shared by all processes. The kernel-address portion of each process's "process page tables" is a copy of the "kernel page tables". To reach the kernel page tables, use: init_mm.pgd

struct mm_struct init_mm = {
    .mm_rb        = RB_ROOT,
    .pgd        = swapper_pg_dir,
    .mm_users    = ATOMIC_INIT(2),
    .mm_count    = ATOMIC_INIT(1),
    .write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
    MMAP_LOCK_INITIALIZER(init_mm)
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .arg_lock    =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
    .mmlist        = LIST_HEAD_INIT(init_mm.mmlist),
    .user_ns    = &init_user_ns,
    .cpu_bitmap    = CPU_BITS_NONE,
    INIT_MM_CONTEXT(init_mm)
};

This swapper_pg_dir is essentially the base address of the kernel PGD.

/*
 * Initialized during boot, and readonly for initializing page tables
 * afterwards
 */
pgd_t swapper_pg_dir[PTRS_PER_PGD];

For how the kernel page tables are created, see:

 

https://richardweiyang-2.gitbook.io/kernel-exploring/00-evolution_of_kernel_pagetable

The TLB

TLB is short for translation lookaside buffer. It is essentially a small high-speed cache. As covered in computer architecture courses, caches come in <u>fully associative, set-associative, and direct-mapped</u> organizations.

 

Normally, translating a virtual address to a physical address requires a four-level page table walk.

 

The TLB provides a faster path for that same translation.

The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE (page table entry). Without a TLB, every data access would require two memory accesses: one to walk the page table for the physical address, and one to fetch the data.

 

Caches with different mapping schemes are organized differently, but the overall idea is the same: look the cache up by virtual address, and on a TLB hit the physical address is obtained directly.

 

The TLB holds recently used page table entries. Given a virtual address, the processor checks whether the TLB has a matching entry (a TLB hit), retrieves the frame number, and forms the physical address. If no entry is found (a TLB miss), the page number is used to index the process page tables. The walk then checks whether the page is in main memory; if not, a page fault is raised, after which the TLB is updated to include the new entry.

 

(figure: TLB mapping from virtual to physical page numbers)

 

As the figure shows, the TLB provides a mapping from v_addr[12:47] to p_addr[12:47]. (The low 12 bits are identical on both sides, so they need no translation.)

 

The ASID mainly serves to distinguish different processes.
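The hit/miss flow just described can be modeled with a toy direct-mapped TLB. This is purely illustrative (our own names and sizes; real TLBs are set-associative and tag entries with the ASID mentioned above):

```c
#include <stdint.h>

/* Toy direct-mapped TLB: 16 entries, indexed by the low bits of the
 * virtual page number (vpn). Each entry caches one vpn -> pfn mapping. */
#define TLB_ENTRIES 16

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a hit (and writes the cached pfn); 0 on a miss,
 * in which case the caller falls back to the full page-table walk. */
int tlb_lookup(uint64_t vpn, uint64_t *pfn) {
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) { *pfn = e->pfn; return 1; }
    return 0;
}

/* After a walk resolves vpn -> pfn, install the mapping in the TLB. */
void tlb_fill(uint64_t vpn, uint64_t pfn) {
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn = vpn; e->pfn = pfn; e->valid = 1;
}
```

Direct mapping means two vpns that share the same low index bits evict each other, which is exactly the conflict-miss behavior that set associativity mitigates.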

page cache

First, be clear on one point: the page cache is the main disk cache used by the Linux kernel.

page cache is the main disk cache used by the Linux kernel.

 

In the usual asynchronous case, writes to a file first go into the page cache; the pages become dirty pages, and kernel flusher threads (pdflush, historically) later write them back to disk. Conversely, reads are first placed in the page cache and then copied to user space. When the same file is read again and the data is already in the page cache, performance improves dramatically.

Inverted Page Tables (IPT)

An inverted page table, as the name suggests, stores information per physical page frame.

 

It exists to mitigate the memory consumed by multi-level page tables: inverted page table entries correspond one-to-one with physical page frames, rather than one entry per virtual page.

 

It therefore contains comparatively few entries (physical memory is generally much smaller than virtual memory), and it is indexed by page frame number rather than by virtual page number.

 

Although the IPT design saves a great deal of space, it makes translating a virtual address to a physical address much harder. When process n accesses virtual page p, the hardware can no longer use p as an index into the page table; instead, it must search the entire inverted table for the matching entry.

 

So, by comparison, the TLB is the better technique for speeding up translation.

Huge Pages

Huge pages are also known as large pages.

 

Normally a page is 4 KB, which creates a problem: when physical memory is large, the page tables themselves become very large and consume a lot of physical memory. Huge pages enlarge the page size; with bigger pages, fewer page table entries are needed and the page tables consume less memory.

 

The x86-64 four-level paging scheme supports 2 MB and 1 GB huge pages.

 

The main benefits are fewer page table entries, faster lookups, and a higher TLB hit rate.

 

Huge pages are enabled via the PSE (Page Size Extension) bit of CR4.

 

The drawback is that they must be pre-allocated; over-allocating wastes memory that other programs cannot then use.
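The page-table-entry savings are easy to quantify with a little toy arithmetic (not kernel code): mapping 1 GB with 4 KB pages needs 262144 leaf entries, while 2 MB huge pages need only 512.

```c
#include <stdint.h>

/* Round-up division: the number of pages (and hence leaf page-table
 * entries) needed to map `region` bytes using pages of `page_size` bytes. */
uint64_t entries_needed(uint64_t region, uint64_t page_size) {
    return (region + page_size - 1) / page_size;
}
```

With 8-byte entries, that is 2 MB of last-level page tables per gigabyte at 4 KB granularity versus 4 KB at 2 MB granularity, and a 2 MB huge page also needs just one TLB entry where 4 KB pages need 512.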

THP (Transparent Huge Pages)

THP (transparent huge pages) is an optimization over static huge pages: it allows huge pages to be allocated dynamically. THP reduces the overhead of huge-page support and lets applications choose their virtual page size flexibly as needed, rather than being forced onto 2 MB pages.

 

THP works by being able to split huge pages back into smaller 4 KB pages, which are then swapped out normally. To use huge pages effectively, however, the kernel must find physically contiguous memory regions that are large enough to satisfy the request and correctly aligned. For this, a khugepaged kernel thread was added: it periodically tries to replace smaller pages currently in use with huge-page allocations, maximizing THP usage. In user space, no application changes are required (hence "transparent"), though there are ways to optimize its use: applications that want huge pages can use posix_memalign() to ensure large allocations are aligned to huge-page (2 MB) boundaries. Also, THP is only enabled for anonymous memory regions.

 

The problem is that, because of its dynamic-allocation nature and the heavyweight memory locking involved, THP can well cause performance regressions.

Page Table Entry Flags

https://zhuanlan.zhihu.com/p/67053210

 

P (Present): 1 means the page is currently present in physical memory; if 0, the rest of the PTE is meaningless and a page fault is triggered directly. A PTE with P=0 also has no TLB entry, because the corresponding TLB entry was flushed the moment P went from 1 to 0.

 

G (Global): allows the kernel's TLB entries to survive a context switch without being flushed; this flag therefore also exists in the TLB entry.

 

A (Accessed): set to 1 by hardware when the page is accessed (read or written); the TLB only caches mappings for pages with A=1. Software can clear the bit, which flushes the corresponding TLB entry. This lets software count how often each page is accessed, as input for deciding, under memory pressure, whether the page should be reclaimed.

 

D (Dirty): only meaningful for file-backed pages, not for anonymous pages. Hardware sets it to 1 when the page is written, indicating that the page's contents are newer than the corresponding data on disk/flash. When memory runs low and the page is to be reclaimed, its contents must first be flushed to backing storage, after which software clears the bit.

 

R/W and U/S are permission-control bits:

 

R/W (Read/Write): 1 means the page is writable, 0 means read-only; writing to a read-only page triggers a page fault.

 

U/S (User/Supervisor): 0 means only the supervisor (e.g., the kernel in an OS) may access the page; 1 means user mode may access it as well.

 

PCD and PWT relate to cacheability:

 

PCD (Page Cache Disable): 1 means the page's contents must not be cached. If 0 (caching enabled), the CD bit in the CR0 register, the master switch, must also be 0.

 

PWT (Page Write Through): 1 means the cache lines for this page use write-through; otherwise write-back.

 

Under 64-bit:

  • CR3 gained PCID support: when the PCIDE bit of CR4 is 1, the low 12 bits of CR3 hold the PCID, overriding PCD and PWT there.
  • XD (eXecute Disable), which mainly controls execute permission.
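The flag bits above can be decoded mechanically. A small sketch (bit positions follow the x86-64 PTE format; the helper names are ours, not the kernel's):

```c
#include <stdint.h>

/* x86-64 page-table-entry flag bit positions. */
#define PTE_P   (1ULL << 0)   /* Present            */
#define PTE_RW  (1ULL << 1)   /* Read/Write         */
#define PTE_US  (1ULL << 2)   /* User/Supervisor    */
#define PTE_PWT (1ULL << 3)   /* Page Write Through */
#define PTE_PCD (1ULL << 4)   /* Page Cache Disable */
#define PTE_A   (1ULL << 5)   /* Accessed           */
#define PTE_D   (1ULL << 6)   /* Dirty              */
#define PTE_G   (1ULL << 8)   /* Global             */
#define PTE_XD  (1ULL << 63)  /* eXecute Disable    */

int pte_present(uint64_t pte)  { return !!(pte & PTE_P); }
int pte_writable(uint64_t pte) { return !!(pte & PTE_RW); }
int pte_user(uint64_t pte)     { return !!(pte & PTE_US); }
int pte_nx(uint64_t pte)       { return !!(pte & PTE_XD); }
```

A kernel data page, for instance, would typically read as present, writable, supervisor-only, and non-executable.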

The Buddy System

/*
 * Set up kernel memory allocators
 */
static void __init mm_init(void)
{
    ......
    mem_init();    /* buddy system initialization */
  ......
    kmem_cache_init(); /* slab initialization */
    ......
}

overview

The buddy system divides free memory into power-of-two blocks, repeatedly splitting until a block of the requested size is obtained.

#define MAX_ORDER 11
struct zone{
  ...
  /* free areas of different sizes */
    struct free_area    free_area[MAX_ORDER];
  ...
}

As you can see, each zone maintains MAX_ORDER free_areas, where MAX_ORDER bounds the largest power-of-two block size.

struct free_area {
    struct list_head    free_list[MIGRATE_TYPES];
    unsigned long        nr_free;
};

The corresponding MIGRATE_TYPES are:

enum migratetype {
    MIGRATE_UNMOVABLE,    /* unmovable pages */
    MIGRATE_MOVABLE,      /* movable pages */
    MIGRATE_RECLAIMABLE,  /* reclaimable pages */
    MIGRATE_PCPTYPES,    /* the number of types on the pcp lists */
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES, /* In rare cases the kernel needs a high-order
                           block and cannot sleep. If allocating from the list of a specific
                           mobility fails, such emergencies can be served from MIGRATE_HIGHATOMIC. */
#ifdef CONFIG_CMA
    MIGRATE_CMA,        /* the kernel's Contiguous Memory Allocator (CMA), which avoids reserving large blocks up front */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,    /* a special virtual area used to move physical pages across NUMA nodes;
                           on large systems it helps move pages close to the CPU that uses them most */
#endif
    MIGRATE_TYPES        /* just the number of migrate types, not a real type */
};
  • free_area[0] holds a list of blocks of 2^0 pages (i.e., one page each)
  • free_area[1] holds a list of blocks of 2^1 pages (i.e., two pages each)
  • ......
  • free_area[10] holds a list of blocks of 2^10 pages (i.e., 1024 pages each)
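This power-of-two bookkeeping means a request for n pages is served from free_area[order], where order is the smallest value with 2^order >= n. A sketch of that rounding (our helper, analogous in spirit to the kernel's get_order()):

```c
/* Smallest order such that a 2^order-page buddy block can hold
 * `pages` pages, i.e., how a request is rounded up to a block size. */
unsigned int order_for(unsigned long pages) {
    unsigned int order = 0;
    while ((1UL << order) < pages)
        order++;
    return order;
}
```

So a 3-page request consumes a 4-page (order-2) block; the rounding waste is the price of the buddy scheme's fast splitting and coalescing.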

Further, each free_list has its own properties:

 


 

The buddy system mainly involves the alloc_pages, alloc_page, and related families of functions; we start from the topmost interface.

alloc_pages(gfp_t gfp_mask, unsigned int order)

static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
    return alloc_pages_node(numa_node_id(), gfp_mask, order);
}

The function's parameters:

- rdi: GFP bitmasks, the attributes of the allocation. See the [appendix](#1).
- rsi: the order of the allocation.

Following the call chain:

```c
alloc_pages
  alloc_pages_node
      __alloc_pages_node(nid, gfp_mask, order) // nid is the node closest to the current CPU
          __alloc_pages(gfp_mask, order, nid, NULL) // the 'heart' of the zoned buddy allocator
```

__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,nodemask_t *nodemask)

This function is the core of buddy-system allocation.

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,nodemask_t *nodemask)
{
    struct page *page;
 
  // start with the watermark set to "low"
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
  // the effective gfp used to tag the attributes of this allocation
    gfp_t alloc_gfp;
 
  // holds the mostly-invariant allocation parameters passed between the
  // functions involved in allocation, including the alloc_pages* family:
  // the fixed context of this allocation.
  /*
  struct alloc_context
{
    struct zonelist *zonelist;
    nodemask_t *nodemask;
    struct zoneref *preferred_zoneref;
    int migratetype;
    enum zone_type highest_zoneidx;
    bool spread_dirty_pages;
};
  */
    struct alloc_context ac = { };
 
    // sanity-check the requested order
    if (unlikely(order >= MAX_ORDER)) {
        WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
        return NULL;
    }
 
  // masked by gfp_allowed_mask (set to GFP_BOOT_MASK during early boot)
    gfp &= gfp_allowed_mask;
  // adjust gfp according to the current task's flags (current->flags)
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;
 
  // prepare_alloc_pages fills in the struct alloc_context:
  /*
  ac->highest_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfp_migratetype(gfp_mask);
  */
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
            &alloc_gfp, &alloc_flags))
        return NULL;
 
  // avoid fragmentation
  //alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
    alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
 
    // first allocation attempt (fast path)
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;
 
    alloc_gfp = gfp;
    ac.spread_dirty_pages = false;
 
    /*
     * Restore the original nodemask if it was potentially replaced with
     * &cpuset_current_mems_allowed to optimize the fast-path attempt.
     */
    ac.nodemask = nodemask;
    // the first attempt failed; retry via the slow path
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
 
out:
    if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page &&
        unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
        __free_pages(page, order);
        page = NULL;
    }
 
    trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 
    return page;
}

get_page_from_freelist (fast path: allocating from the zone freelists)

The fast allocation path.

 

get_page_from_freelist tries to allocate pages; if it fails, __alloc_pages_slowpath takes over to handle the special cases.

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    struct zoneref *z;
    struct zone *zone;
    struct pglist_data *last_pgdat_dirty_limit = NULL;
    bool no_fallback;
 
retry:
  // scan the zonelist, looking for a zone with enough free pages
    no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
    z = ac->preferred_zoneref;
  // z is the preferred zone to try first, taken from the context
    for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,ac->nodemask) {
        struct page *page;
        unsigned long mark;
 
        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
 
    // keep the allocation within the dirty limit so dirty pages are spread
    // across zones and kswapd can balance them without writing from the LRU list
        if (ac->spread_dirty_pages) {
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                continue;
 
            if (!node_dirty_ok(zone->zone_pgdat)) {
                last_pgdat_dirty_limit = zone->zone_pgdat;
                continue;
            }
        }
 
 
        if (no_fallback && nr_online_nodes > 1 &&
            zone != ac->preferred_zoneref->zone)
    {
            int local_nid;
 
            /*
             * If moving to a remote node, retry but allow
             * fragmenting fallbacks. Locality is more important
             * than fragmentation avoidance.
             */
      // when falling back to a remote node, locality matters more than
      // avoiding fragmentation
            local_nid = zone_to_nid(ac->preferred_zoneref->zone);    // get the local node id
            if (zone_to_nid(zone) != local_nid) {    // not allocating from the local node
                alloc_flags &= ~ALLOC_NOFRAGMENT;    // allow fragmentation and retry
                goto retry;
            }
        }
 
    // check whether the watermark is met; if not, try node reclaim
        mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
        if (!zone_watermark_fast(zone, order, mark,
                       ac->highest_zoneidx, alloc_flags,
                       gfp_mask))
    {
            int ret;
 
        ......
            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;
 
            if (!node_reclaim_enabled() ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;
 
            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case NODE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                    ac->highest_zoneidx, alloc_flags))
                    goto try_this_zone;
 
                continue;
            }
        }
 
    // call rmqueue to perform the actual allocation
try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
    // allocation succeeded
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);
 
            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);
 
            return page;
        }
    // allocation failed
    else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /* Try again if zone has deferred pages */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
        }
    }
 
    /*
     * It's possible on a UMA machine to get through all zones that are
     * fragmented. If avoiding fragmentation, reset and try again.
     */
    if (no_fallback) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }
 
    return NULL;
}

rmqueue

/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    unsigned long flags;
    struct page *page;
    // for order-0 requests, allocate directly from the per-cpu lists
    if (likely(order == 0)) {
        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
                migratetype != MIGRATE_MOVABLE) {
            page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,migratetype, alloc_flags);
            goto out;
        }
    }
 
 
  // with __GFP_NOFAIL set, allocations of order > 1 are not allowed
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
  // take the zone lock
    spin_lock_irqsave(&zone->lock, flags);
 
    do {
        page = NULL;
        /*
         * order-0 request can reach here when the pcplist is skipped
         * due to non-CMA allocation context. HIGHATOMIC area is
         * reserved for high-order atomic allocation, so order-0
         * request should skip it.
         */
    // an order-0 request can reach here when the pcplist was skipped (non-CMA
    // context); HIGHATOMIC is reserved for high-order atomic allocations, so
    // order-0 requests must skip it
        if (order > 0 && alloc_flags & ALLOC_HARDER)
    {
      // allocate via __rmqueue_smallest with migrate type MIGRATE_HIGHATOMIC
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
    // otherwise, or if that attempt failed, allocate via __rmqueue
        if (!page)
            page = __rmqueue(zone, order, migratetype, alloc_flags);
    } while (page && check_new_pages(page, order));
 
    spin_unlock(&zone->lock);
 
    if (!page)
        goto failed;
  // update the zone's free-page accounting
    __mod_zone_freepage_state(zone, -(1 << order),get_pcppage_migratetype(page));
 
    __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
  // update NUMA statistics (hit/miss)
    zone_statistics(preferred_zone, zone);
  // restore interrupts
    local_irq_restore(flags);
 
out:
    /* Separate test+clear to avoid unnecessary atomics */
    if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
        clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
        wakeup_kswapd(zone, 0, 0, zone_idx(zone));
    }
 
    VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
    return page;
 
failed:
    local_irq_restore(flags);
    return NULL;
}

When a single page is requested (order = 0), it is taken directly from the per_cpu_list.

 

rmqueue_pcplist goes through the following steps:

  1. Disable interrupts and save the interrupt context
  2. Get the per_cpu_pages pointer for the target zone on the current CPU
  3. Get the page list for the requested migrate type from per_cpu_pages
  4. Call __rmqueue_pcplist to take the target page off that list
  5. On success, update the current zone's statistics
  6. Restore interrupts

__rmqueue_pcplist goes through the following steps:

/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
            unsigned int alloc_flags,
            struct per_cpu_pages *pcp,
            struct list_head *list)
{
    struct page *page;
 
    do {
    // check whether the current list is empty
        if (list_empty(list)) {
      // if it is empty, refill it from the buddy lists via rmqueue_bulk
            pcp->count += rmqueue_bulk(zone, 0,READ_ONCE(pcp->batch), list,migratetype, alloc_flags);
            if (unlikely(list_empty(list)))
                return NULL;
        }
        // take the first element of the list
        page = list_first_entry(list, struct page, lru);
    // remove it from the page's lru list
        list_del(&page->lru);
    // decrement the free count
        pcp->count--;
    } while (check_new_pcp(page));
 
    return page;
}

rmqueue_bulk goes through the following steps:

/*
 * Obtain a specified number of elements from the buddy allocator, all under
 * a single hold of the lock, for efficiency.  Add them to the supplied list.
 * Returns the number of new pages which were placed at *list.
 */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
            unsigned long count, struct list_head *list,
            int migratetype, unsigned int alloc_flags)
{
    int i, allocated = 0;
 
    spin_lock(&zone->lock);
  // loop `count` times, each time taking one block from the zone's free lists
    for (i = 0; i < count; ++i)
  {
    // take one block
        struct page *page = __rmqueue(zone, order, migratetype,alloc_flags);
        if (unlikely(page == NULL))
            break;
 
        if (unlikely(check_pcp_refill(page)))
            continue;
 
    // add the page to the supplied list via its lru node
        list_add_tail(&page->lru, list);
        allocated++;
    // if the page is in a CMA area, update the zone counters (NR_FREE_CMA_PAGES)
    /*
    atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
    */
        if (is_migrate_cma(get_pcppage_migratetype(page)))
            __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -(1 << order));
    }
 
  // i blocks of 2^order pages each were removed from the free lists
  // (pages that failed check_pcp_refill were dropped); adjust NR_FREE_PAGES by i << order
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
    spin_unlock(&zone->lock);
    return allocated;
}

__rmqueue goes through the following steps:

/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
                        unsigned int alloc_flags)
{
    struct page *page;
 
  // with CMA enabled, balance movable allocations between the regular and CMA
  // areas: when more than half of the zone's free memory is in the CMA area,
  // allocate from CMA
    if (IS_ENABLED(CONFIG_CMA))
  {
        if (alloc_flags & ALLOC_CMA &&
            zone_page_state(zone, NR_FREE_CMA_PAGES) > zone_page_state(zone, NR_FREE_PAGES) / 2)         {
            page = __rmqueue_cma_fallback(zone, order);
            if (page)
                goto out;
        }
    }
retry:
  // otherwise allocate directly via __rmqueue_smallest
    page = __rmqueue_smallest(zone, order, migratetype);
    if (unlikely(!page)) {
        if (alloc_flags & ALLOC_CMA)
            page = __rmqueue_cma_fallback(zone, order);
        if (!page && __rmqueue_fallback(zone, order, migratetype,
                                alloc_flags))
            goto retry;
    }
out:
    if (page)
        trace_mm_page_alloc_zone_locked(page, order, migratetype);
    return page;
}

__rmqueue_smallest goes through the following steps:

 

It mainly searches each order's freelist for a page of suitable size and migrate type.

/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;
 
    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
    //take a page from the head of the list for the requested migrate type.
        page = get_page_from_free_area(area, migratetype);
        if (!page)
            continue;
    //remove the page from the free list and update the zone
        del_page_from_free_list(page, zone, current_order);
 
        expand(zone, page, order, current_order, migratetype);
    //set the migrate type
        set_pcppage_migratetype(page, migratetype);
        return page;
    }
 
    return NULL;
}

expand goes through the following steps:

 

When current_order > order:

 

Assume high = 4 and low = 2 (these correspond to current_order and order).

while (high > low) {
        high--;
        size >>= 1;
        VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
 
        /*
         * Mark as guard pages (or page), that will allow to
         * merge back to allocator when buddy will be freed.
         * Corresponding page table entries will not be touched,
         * pages will stay not present in virtual address space
         */
        if (set_page_guard(zone, &page[size], high, migratetype))
            continue;
 
        add_to_free_list(&page[size], zone, high, migratetype);
        set_buddy_order(&page[size], high);
    }

The surplus pages are marked as guard pages and become inaccessible; the split-off halves are then placed on the free lists of the corresponding orders.
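The splitting arithmetic above can be modeled as a small self-contained program. This is a sketch under stated assumptions, not kernel code: `expand_sketch` and the `returned[]` bookkeeping are hypothetical names that only reproduce the order/size arithmetic of the loop, with the guard-page branch omitted.

```c
#include <assert.h>

/* Hypothetical sketch of expand(): split a block of order `high` down to
 * order `low`.  returned[i] counts how many pages are re-queued on the
 * free list of order i; the remaining 1 << low pages satisfy the request. */
static void expand_sketch(unsigned int low, unsigned int high,
                          unsigned long returned[])
{
    unsigned long size = 1UL << high;   /* pages in the block */

    while (high > low) {
        high--;
        size >>= 1;
        /* the upper half (&page[size] in the kernel) goes back to the
         * free list of order `high`; the lower half keeps being split */
        returned[high] += size;
    }
}
```

With high = 4 and low = 2 as in the example, 8 pages go back at order 3 and 4 pages at order 2, leaving an order-2 block of 4 pages for the caller.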

__alloc_pages_slowpath (slow-path allocation)

When the fast path fails to allocate, the slow path is taken.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                        struct alloc_context *ac)
{
    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
    struct page *page = NULL;
    unsigned int alloc_flags;
    unsigned long did_some_progress;
    enum compact_priority compact_priority;
    enum compact_result compact_result;
    int compaction_retries;
    int no_progress_loops;
    unsigned int cpuset_mems_cookie;
    int reserve_flags;
 
    //__GFP_ATOMIC (atomic request) conflicts with __GFP_DIRECT_RECLAIM (may reclaim directly); if both are set, drop the atomic flag
    if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) == (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
        gfp_mask &= ~__GFP_ATOMIC;
 
retry_cpuset:
    compaction_retries = 0;
    no_progress_loops = 0;
    compact_priority = DEF_COMPACT_PRIORITY;
    cpuset_mems_cookie = read_mems_allowed_begin();
 
    //the fast path used conservative alloc_flags; recompute them here with relaxed constraints.
    alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
    //recompute the starting zone for the allocation iteration.
    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->highest_zoneidx, ac->nodemask);
    if (!ac->preferred_zoneref->zone)
        goto nopage;
 
    //if ALLOC_KSWAPD is set, wake the kswapd threads
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);
 
    //retry the allocation with the adjusted parameters
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;
 
    /*
     * For costly allocations, try direct compaction first, as it's likely
     * that we have enough base pages and don't need to reclaim. For non-
     * movable high-order allocations, do that as well, as compaction will
     * try prevent permanent fragmentation by migrating from blocks of the
     * same migratetype.
     * Don't try this for allocations that are allowed to ignore
     * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
     */
    //attempt memory compaction when appropriate
    if (can_direct_reclaim &&
            (costly_order ||
               (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) && !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                        alloc_flags, ac,
                        INIT_COMPACT_PRIORITY,
                        &compact_result);
        if (page)
            goto got_pg;
 
    //if __GFP_NORETRY is set, this may include some THP page-fault allocations
        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
            if (compact_result == COMPACT_SKIPPED ||
                compact_result == COMPACT_DEFERRED)
                goto nopage;
 
            //synchronous compaction is too expensive; stick with asynchronous compaction
            compact_priority = INIT_COMPACT_PRIORITY;
        }
    }
 
retry:
    //make sure kswapd does not go back to sleep; wake it again
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);
 
    //distinguish requests that really need access to the full memory reserves from those that can tolerate being OOM-killed.
    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
 
    //if allocation is not restricted to the current cpuset, or reserve_flags is set, relax the allocation constraints and reset the preferred-zone iterator before retrying.
    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->highest_zoneidx, ac->nodemask);
    }
 
    /* Attempt with potentially adjusted zonelist and alloc_flags */
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;
 
    /* Caller is not willing to reclaim, we can't balance anything */
    if (!can_direct_reclaim)
        goto nopage;
 
    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;
 
    //try reclaiming first, then allocate
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                            &did_some_progress);
    if (page)
        goto got_pg;
 
    //try direct compaction, then allocate
    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                    compact_priority, &compact_result);
    if (page)
        goto got_pg;
 
    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
        goto nopage;
 
    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_RETRY_MAYFAIL
     */
    if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
        goto nopage;
 
    //should memory reclaim be retried?
    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                 did_some_progress > 0, &no_progress_loops))
        goto retry;
 
    //should compaction be retried?
    if (did_some_progress > 0 &&
            should_compact_retry(ac, order, alloc_flags,
                compact_result, &compact_priority,
                &compaction_retries))
        goto retry;
 
 
 
    //check for a possible cpuset race before triggering the OOM killer
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;
 
    //reclaim failed; invoke the OOM killer to kill some processes and reclaim their memory
    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
    if (page)
        goto got_pg;
 
    //avoid unbounded use of the no-watermark reserves
    if (tsk_is_oom_victim(current) &&
        (alloc_flags & ALLOC_OOM ||
         (gfp_mask & __GFP_NOMEMALLOC)))
        goto nopage;
 
    if (did_some_progress) {
        no_progress_loops = 0;
        goto retry;
    }
 
nopage:
 
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;
 
    //if __GFP_NOFAIL is set, keep retrying
    if (gfp_mask & __GFP_NOFAIL) {
        //if all NOFAIL requests are blocked, warn the user that NOWAIT should be used instead
        if (WARN_ON_ONCE(!can_direct_reclaim))
            goto fail;
 
        WARN_ON_ONCE(current->flags & PF_MEMALLOC);
        WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
        /*
         Help non-failing allocations by giving them access to the memory reserves, but do not use ALLOC_NO_WATERMARKS, since that could exhaust the entire reserves and make things worse
         */
        page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
        if (page)
            goto got_pg;
 
        cond_resched();
        goto retry;
    }
fail:
    warn_alloc(gfp_mask, ac->nodemask,
            "page allocation failure: order:%u", order);
got_pg:
    return page;
}

__free_pages

void __free_pages(struct page *page, unsigned int order)
{
  //check whether any process still uses the page frame, i.e. whether the reference count drops to 0
    if (put_page_testzero(page))
        free_the_page(page, order);
  //my understanding: analogous to the earlier set_page_guard step — when the allocated order exceeded the requested order, multiple pages were effectively handed out, so here each of them is freed in turn
    else if (!PageHead(page))
        while (order-- > 0)
            free_the_page(page + (1 << order), order);
}

In free_the_page:

static inline void free_the_page(struct page *page, unsigned int order)
{
  //order-0 pages come from the per-cpu (pcp) list
    if (order == 0)        /* Via pcp? */
        free_unref_page(page);
  //otherwise call __free_pages_ok
    else
        __free_pages_ok(page, order, FPI_NONE);
}
/*
 * Free a 0-order page
 */
void free_unref_page(struct page *page)
{
    unsigned long flags;
  //get the page frame number
    unsigned long pfn = page_to_pfn(page);
    //run pre-free checks
    if (!free_unref_page_prepare(page, pfn))
        return;
 
    local_irq_save(flags);
    free_unref_page_commit(page, pfn);
    local_irq_restore(flags);
}

free_unref_page_commit

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
    struct zone *zone = page_zone(page);
    struct per_cpu_pages *pcp;
    int migratetype;
    //get the migrate type of this page
    migratetype = get_pcppage_migratetype(page);
    __count_vm_event(PGFREE);
 
    //the per-cpu list only holds a few designated migrate types of pages
    if (migratetype >= MIGRATE_PCPTYPES) {
        if (unlikely(is_migrate_isolate(migratetype))) {
      //free it
            free_one_page(zone, page, pfn, 0, migratetype,
                      FPI_NONE);
            return;
        }
        migratetype = MIGRATE_MOVABLE;
    }
 
    pcp = &this_cpu_ptr(zone->pageset)->pcp;
 
  //insert the page at the head of the pcp->lists[migratetype] list
    list_add(&page->lru, &pcp->lists[migratetype]);
    pcp->count++;
 
  //if the pcp list holds more pages than its high watermark, return the surplus to the buddy system
    if (pcp->count >= READ_ONCE(pcp->high))
        free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
}
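The high-watermark logic at the end of free_unref_page_commit can be modeled in a few lines. This is a hedged sketch: `struct pcp_model` and `pcp_free_one` are hypothetical names standing in for per_cpu_pages and the list_add/free_pcppages_bulk pair, with the page lists reduced to a counter.

```c
#include <assert.h>

/* Minimal model of the pcp watermark: each order-0 free bumps a per-cpu
 * count; once count reaches `high`, up to `batch` pages are drained back
 * to the buddy system.  Returns the number of pages drained. */
struct pcp_model { int count, high, batch; };

static int pcp_free_one(struct pcp_model *pcp)
{
    pcp->count++;                 /* list_add(&page->lru, ...) */
    if (pcp->count >= pcp->high) {
        int drained = pcp->batch < pcp->count ? pcp->batch : pcp->count;
        pcp->count -= drained;    /* free_pcppages_bulk(zone, batch, pcp) */
        return drained;
    }
    return 0;
}
```

With high = 3 and batch = 2, the third free triggers a drain of two pages, leaving one page cached on the per-cpu list.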

free_one_page -->__free_one_page

static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype, fpi_t fpi_flags)
{
    struct capture_control *capc = task_capc(zone);
    unsigned long buddy_pfn;
    unsigned long combined_pfn;
    unsigned int max_order;
    struct page *buddy;
    bool to_tail;
    //compute the maximum order (capped at MAX_ORDER - 1)
    max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
 
    VM_BUG_ON(!zone_is_initialized(zone));
    VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
 
    VM_BUG_ON(migratetype == -1);
    if (likely(!is_migrate_isolate(migratetype)))
        __mod_zone_freepage_state(zone, 1 << order, migratetype);
 
    VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
    VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
continue_merging:
    // loop until order == max_order - 1
    // handles the merging
    while (order < max_order)
    {
        if (compaction_capture(capc, page, order, migratetype)) {
            __mod_zone_freepage_state(zone, -(1 << order),
                                migratetype);
            return;
        }
        //find the buddy page frame
        //buddy_pfn = page_pfn ^ (1 << order)
        buddy_pfn = __find_buddy_pfn(pfn, order);
        //get the corresponding struct page
        buddy = page + (buddy_pfn - pfn);
        //check that it is valid
        if (!pfn_valid_within(buddy_pfn))
            goto done_merging;
        /*
        Check whether the buddy page is free and mergeable.
        It must satisfy all of the following:
        1. it is in the buddy system
        2. it has the same order
        3. it is in the same zone
        */
        if (!page_is_buddy(page, buddy, order))
            goto done_merging;
        //if the buddy is free, or is a guard page, merge them and move up one order.
        if (page_is_guard(buddy))
            clear_page_guard(zone, buddy, order, migratetype);
        else
            del_page_from_free_list(buddy, zone, order);
 
        //merge the pages and compute the new pfn
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    if (order < MAX_ORDER - 1) {
        //prevent merging between pages on isolated and normal pageblocks
        if (unlikely(has_isolate_pageblock(zone))) {
            int buddy_mt;
 
            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);
            buddy_mt = get_pageblock_migratetype(buddy);
 
            if (migratetype != buddy_mt
                    && (is_migrate_isolate(migratetype) ||
                        is_migrate_isolate(buddy_mt)))
                goto done_merging;
        }
        max_order = order + 1;
        goto continue_merging;
    }
 
done_merging:
    //set the order and mark the page as belonging to the buddy system
    set_buddy_order(page, order);
 
    if (fpi_flags & FPI_TO_TAIL)
        to_tail = true;
    else if (is_shuffle_order(order))    //is_shuffle_order() returns false here
        to_tail = shuffle_pick_tail();
    else
        //if this page is not of the maximum order, check whether its buddy is free; if so, the buddy is probably about to be freed and the two will soon be merged.
        //in that case, prefer inserting the page at the tail of zone->free_area[order]'s list, delaying its reuse so that once the buddy is freed the two pages can merge.
        to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
    //insert at the tail
    if (to_tail)
        add_to_free_list_tail(page, zone, order, migratetype);
    else
    //insert at the head
        add_to_free_list(page, zone, order, migratetype);
 
    /* Notify page reporting subsystem of freed page */
    if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
        page_reporting_notify_free(order);
}
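The buddy-locating arithmetic used above (`__find_buddy_pfn` and the `combined_pfn` step) is pure bit manipulation and can be checked in isolation. The helper names below mirror the kernel's, but this is a standalone sketch, not the kernel implementation:

```c
/* A block's buddy at a given order differs from it only in bit `order`
 * of the pfn, and the merged block starts at the lower of the two pfns. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);            /* __find_buddy_pfn() */
}

static unsigned long combined_pfn(unsigned long pfn, unsigned int order)
{
    return pfn & find_buddy_pfn(pfn, order); /* buddy_pfn & pfn */
}
```

For example, the order-3 buddy of pfn 8 is pfn 0, and merging them yields an order-4 block starting at pfn 0; the order-2 buddy of pfn 12 is pfn 8, merging to a block at pfn 8.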

free_pcppages_bulk

static void free_pcppages_bulk(struct zone *zone, int count,
                    struct per_cpu_pages *pcp)
{
    int migratetype = 0;
    int batch_free = 0;
    int prefetch_nr = READ_ONCE(pcp->batch);
    bool isolated_pageblocks;
    struct page *page, *tmp;
    LIST_HEAD(head);
 
    //cap count at the number of pages currently on the pcp list
    count = min(pcp->count, count);
 
    while (count)
    {
        struct list_head *list;
 
        /*
         * Remove pages from lists in a round-robin fashion. A
         * batch_free count is maintained that is incremented when an
         * empty list is encountered.  This is so more pages are freed
         * off fuller lists instead of spinning excessively around empty
         * lists
         */
        //batch_free grows each time an empty list is hit; iterate over the migrate-type lists
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
            list = &pcp->lists[migratetype];
        } while (list_empty(list));
 
        //if only one list is non-empty
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;
 
        do {
            //take the element at the tail of the list
            page = list_last_entry(list, struct page, lru);
            /* must delete to avoid corrupting pcp list */
            list_del(&page->lru);
            pcp->count--;
 
            if (bulkfree_pcp_prepare(page))
                continue;
            // move it onto the local head list
            list_add_tail(&page->lru, &head);
 
            //prefetch the page's buddy
            if (prefetch_nr) {
                prefetch_buddy(page);
                prefetch_nr--;
            }
        } while (--count && --batch_free && !list_empty(list));
    }
 
    spin_lock(&zone->lock);
    isolated_pageblocks = has_isolate_pageblock(zone);
 
    /*
     * Use safe version since after __free_one_page(),
     * page->lru.next will not point to original list.
     */
    list_for_each_entry_safe(page, tmp, &head, lru) {
        int mt = get_pcppage_migratetype(page);
        //MIGRATE_ISOLATE pages must never end up on a pcp list
        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
        //can the migrate type be isolate here? note that has_isolate_pageblock() is a no-op unless memory isolation is configured.
        if (unlikely(isolated_pageblocks))
            mt = get_pageblock_migratetype(page);
        //call __free_one_page to return the page to the buddy system
        __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE);
        trace_mm_page_pcpu_drain(page, 0, mt);
    }
    spin_unlock(&zone->lock);
}

The SLAB/SLUB allocator

hNJNh6.png

Key structures

kmem_cache

/*
 * Slab cache management.
 */
struct kmem_cache {
    struct kmem_cache_cpu __percpu *cpu_slab;    //per-cpu cache
    /* Used for retrieving partial slabs, etc. */
    slab_flags_t flags;
    unsigned long min_partial;    //minimum number of slabs to keep on the node partial list
    unsigned int size;    /* The size of an object including metadata — the size actually consumed per object */
    unsigned int object_size;/* The size of an object without metadata */
    struct reciprocal_value reciprocal_size;
    unsigned int offset;    /* Free pointer offset — locates, inside an object, the pointer to the next free object */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    /* Number of per cpu partial objects to keep around */
    unsigned int cpu_partial;        //maximum number of objects kept on the per-cpu partial list; beyond that, slabs are moved to the ordinary kmem_cache_node partial list
#endif
    struct kmem_cache_order_objects oo;//number of pages per slab (high 16 bits) and number of objects per slab (low 16 bits)
 
    /* Allocation and freeing of slabs */
    struct kmem_cache_order_objects max;    //maximum allocation
    struct kmem_cache_order_objects min;    //minimum allocation
    gfp_t allocflags;    /* allocation request mask passed down to the buddy system */
    int refcount;        /* Refcount for slab cache destroy */
    void (*ctor)(void *);
    unsigned int inuse;        /* Offset to metadata */
    unsigned int align;        /* Alignment */
    unsigned int red_left_pad;    /* Left redzone padding size */
    const char *name;    /* name shown in the filesystem */
    struct list_head list;    /* list of all slab caches */
#ifdef CONFIG_SYSFS
    struct kobject kobj;    /* used by sysfs */
#endif
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    unsigned long random;
#endif
 
#ifdef CONFIG_NUMA
    /*
     * Defragmentation by allocating from a remote node.
     */
    unsigned int remote_node_defrag_ratio;
#endif
 
#ifdef CONFIG_SLAB_FREELIST_RANDOM
    unsigned int *random_seq;
#endif
 
#ifdef CONFIG_KASAN
    struct kasan_cache kasan_info;
#endif
 
    unsigned int useroffset;    /* Usercopy region offset */
    unsigned int usersize;        /* Usercopy region size */
 
    struct kmem_cache_node *node[MAX_NUMNODES];        //per-node slab data
};

kmem_cache_cpu

struct kmem_cache_cpu {
    void **freelist;    /* Pointer to next available object */
    unsigned long tid;    /* unique identifier of the owning CPU */
    struct page *page;    /* the slab currently being allocated from */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    struct page *partial;    /* per-cpu list of partially full slabs (slabs that still have free objects) */
#endif
#ifdef CONFIG_SLUB_STATS
    unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};

kmem_cache_node

struct kmem_cache_node {
    spinlock_t list_lock;        //spinlock
#ifdef CONFIG_SLAB
    ......
#endif
 
#ifdef CONFIG_SLUB
    unsigned long nr_partial;        //number of slabs on the partial list
    struct list_head partial;        //this node's partial list
#ifdef CONFIG_SLUB_DEBUG
    atomic_long_t nr_slabs;
    atomic_long_t total_objects;
    struct list_head full;
#endif
#endif
 
};

A clearer view of the three-level structure:

 

hNYRaR.png

The slab_hardened mitigation/hardening

static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    /*
     * When CONFIG_KASAN_SW/HW_TAGS is enabled, ptr_addr might be tagged.
     * Normally, this doesn't cause any issues, as both set_freepointer()
     * and get_freepointer() are called with a pointer with the same tag.
     * However, there are some issues with CONFIG_SLUB_DEBUG code. For
     * example, when __free_slub() iterates over objects in a cache, it
     * passes untagged pointers to check_object(). check_object() in turns
     * calls get_freepointer() with an untagged pointer, which causes the
     * freepointer to be restored incorrectly.
     */
    return (void *)((unsigned long)ptr ^ s->random ^
            swab((unsigned long)kasan_reset_tag((void *)ptr_addr)));
#else
    return ptr;
#endif
}

In some kernel CTF challenges, when CONFIG_SLAB_FREELIST_HARDENED is enabled, the freelist_ptr function is what decodes (and encodes) an object's obfuscated next pointer.
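Since XOR is its own inverse, the same transform both encodes and decodes the pointer, which is why exploits that leak `s->random` and the storage address can forge freelist entries. A user-space model of the transform (the name `freelist_ptr_model` is hypothetical; `swab` is modeled with the compiler's byte-swap builtin):

```c
#include <stdint.h>

/* Model of freelist_ptr() under CONFIG_SLAB_FREELIST_HARDENED: the stored
 * next pointer is  ptr ^ s->random ^ swab(address where the pointer lives).
 * Applying the function twice with the same random/address round-trips. */
static uint64_t swab64(uint64_t x) { return __builtin_bswap64(x); }

static uint64_t freelist_ptr_model(uint64_t ptr, uint64_t random,
                                   uint64_t ptr_addr)
{
    return ptr ^ random ^ swab64(ptr_addr);
}
```

Encoding a pointer and decoding it with the same `random` and `ptr_addr` returns the original value.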

kmem_cache_alloc

Quoting:

 

Creating a new slab is really just allocating pages of the appropriate order to hold a sufficient number of objects. Note how the order and the object count are determined — the two influence each other. Both are stored in the kmem_cache member kmem_cache_order_objects: the low 16 bits hold the object count and the high bits hold the order. The relation between them is simple: ((PAGE_SIZE << order) - reserved) / size.
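The packing described above can be sketched directly. This is a simplified model (`oo_make`/`oo_order`/`oo_objects` mirror the kernel helper names, but the bodies here assume reserved = 0 and a 4 KiB page):

```c
#define MODEL_PAGE_SIZE 4096UL
#define OO_SHIFT 16
#define OO_MASK  ((1UL << OO_SHIFT) - 1)

/* Pack order and object count the way kmem_cache_order_objects does:
 * order in the high bits, object count in the low 16 bits. */
static unsigned long oo_make(unsigned int order, unsigned long size)
{
    unsigned long objects = (MODEL_PAGE_SIZE << order) / size; /* reserved = 0 */
    return ((unsigned long)order << OO_SHIFT) | objects;
}

static unsigned int oo_order(unsigned long x)   { return x >> OO_SHIFT; }
static unsigned int oo_objects(unsigned long x) { return x & OO_MASK; }
```

For a 256-byte object at order 1, this yields 8192 / 256 = 32 objects per slab.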

kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
    slab_alloc(s, gfpflags, _RET_IP_, s->object_size)
      slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size)

ultimately reaching slab_alloc_node (the fast path)

static __always_inline void *slab_alloc_node(struct kmem_cache *s,
        gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
    void *object;
    struct kmem_cache_cpu *c;
    struct page *page;
    unsigned long tid;
    struct obj_cgroup *objcg = NULL;
    bool init = false;
 
    //pre-allocation hook
    s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
    if (!s)
        return NULL;
 
    //kfence_alloc is a no-op here and returns NULL
    object = kfence_alloc(s, orig_size, gfpflags);
    if (unlikely(object))
        goto out;
 
redo:
    /*
    1. preemption may be enabled
    2. read the kmem_cache_cpu data through the per-cpu pointer
    3. we may migrate between CPUs while reading, as long as we end up back on the original CPU
    4. tid and the kmem_cache_cpu must be consistent for one CPU; with CONFIG_PREEMPTION enabled they can diverge, hence the check
    */
    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPTION) && unlikely(tid != READ_ONCE(c->tid)));
 
    //memory barrier to enforce access ordering and keep compiler optimization from causing inconsistencies; cache coherency handles the rest.
    //at this point c is the current kmem_cache.cpu_slab
    barrier();
    object = c->freelist;    //    kmem_cache_cpu->freelist, i.e. the first entry of the freelist
    page = c->page;            //    kmem_cache_cpu->page
 
    //if the current CPU's freelist is empty, its active page is NULL, or the page's node does not match,
    //fall through to __slab_alloc, the slow path
    if (unlikely(!object || !page || !node_match(page, node))) {
        object = __slab_alloc(s, gfpflags, node, addr, c);
    }
    else {
        //get the address of the object after the current one.
        void *next_object = get_freepointer_safe(s, object);
 
        /*
        Atomically performs the following:
        if(s->cpu_slab->freelist == object && s->cpu_slab->tid == tid){
            s->cpu_slab->freelist = next_object; // pops the first obj off the freelist and hangs the next obj in its place
            s->cpu_slab->tid = next_tid(tid);
            return 1;
        }else{return 0;}
        */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                object, tid,
                next_object, next_tid(tid))))
        {
 
            note_cmpxchg_failure("slab_alloc", s, tid);
            goto redo;
        }
        //gcc data prefetch
        prefetch_freepointer(s, next_object);
        //record statistics
        stat(s, ALLOC_FASTPATH);
    }
    //wipe the freeptr inside the object
    maybe_wipe_obj_freeptr(s, object);
    init = slab_want_init_on_alloc(gfpflags, s);
 
out:
    slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
 
    return object;
}

get_freepointer_safe behaves as follows:

static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
    unsigned long freepointer_addr;
    void *p;
 
    if (!debug_pagealloc_enabled_static())
    //check KASAN first, then decode the pointer
        return get_freepointer(s, object);
 
    //return (void *)object;
    object = kasan_reset_tag(object);
    //the current object's address + s->offset gives the address, inside the object, of the pointer to the next freelist entry
    freepointer_addr = (unsigned long)object + s->offset;
    //safely read the next pointer from freepointer_addr into void *p
    copy_from_kernel_nofault(&p, (void **)freepointer_addr, sizeof(p));
    //with slab_hardened enabled, this decodes the object's next pointer
    return freelist_ptr(s, p, freepointer_addr);
}

__slab_alloc — the slow path

 

slab_alloc_node -> __slab_alloc -> ___slab_alloc

//if the current CPU's freelist is empty, its active page is NULL, or the page's node does not match,
//fall through to __slab_alloc, the slow path
if (unlikely(!object || !page || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
}
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
              unsigned long addr, struct kmem_cache_cpu *c)
{
    /*
    Preconditions for entering the slow path:
    1. the freelist is empty
    or
    2. a reschedule is required
    */
    void *freelist;
    struct page *page;
 
    stat(s, ALLOC_SLOWPATH);
 
    page = c->page;//a NULL kmem_cache_cpu->page means there is no usable slab
    if (!page) {
        //if the node is offline or has no normal memory, ignore the node constraint
        if (unlikely(node != NUMA_NO_NODE &&
                 !node_isset(node, slab_nodes)))
            node = NUMA_NO_NODE;
        goto new_slab;
    }
redo:
    //does page->node match node?
    if (unlikely(!node_match(page, node))) {
        //if not
        if (!node_isset(node, slab_nodes)) {
            node = NUMA_NO_NODE;
            goto redo;
        } else {
            //record the state as ALLOC_NODE_MISMATCH
            stat(s, ALLOC_NODE_MISMATCH);
            //detach the slab from the slab cache
            deactivate_slab(s, page, c->freelist, c);
            goto new_slab;
        }
    }
 
 
    //if the current page is PF_MEMALLOC, call deactivate_slab
    /*
    The current process has plenty of freeable memory: granting it a little emergency memory lets it return much more memory to the system.
    Subsystems other than memory management should not use this flag unless the allocation is guaranteed to free a larger amount of memory;
    if every subsystem abused it, the memory-management reserves could be exhausted.
    */
    if (unlikely(!pfmemalloc_match(page, gfpflags))) {
        deactivate_slab(s, page, c->freelist, c);
        goto new_slab;
    }
 
 
    //re-check the freelist, since a CPU migration or an interrupt may have left it non-empty
    freelist = c->freelist;
    if (freelist)
        goto load_freelist;
 
    //get (struct kmem_cache_cpu *c)->freelist
    freelist = get_freelist(s, page);
 
    if (!freelist) {
        c->page = NULL;
        stat(s, DEACTIVATE_BYPASS);
        goto new_slab;
    }
 
    stat(s, ALLOC_REFILL);
 
load_freelist:
    //c->page holds the pages whose objects are being allocated; it must be frozen to this CPU for allocation to work
    VM_BUG_ON(!c->page->frozen);
    //update the kmem_cache_cpu pointers
    c->freelist = get_freepointer(s, freelist);
    c->tid = next_tid(c->tid);
    return freelist;
 
new_slab:
    //check whether this kmem_cache_cpu has a partially full slab on its partial list, i.e. a page with only part of its space in use
    if (slub_percpu_partial(c)) {
        page = c->page = slub_percpu_partial(c);//if so, use a page from the partial list to serve objects
        slub_set_percpu_partial(c, page);        //update the partial list
        stat(s, CPU_PARTIAL_ALLOC);                //account it as a partial allocation
        goto redo;
    }
    /*
    new_slab_objects:
    1. first tries to allocate a page from the kmem_cache_node partial list
    2. otherwise calls new_slab, which goes down to the buddy system for pages
    */
    freelist = new_slab_objects(s, gfpflags, node, &c);
    if (unlikely(!freelist)) {
        slab_out_of_memory(s, gfpflags, node);
        return NULL;
    }
 
    page = c->page;
    if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
        goto load_freelist;
 
    /* Only entered in the debug case */
    if (kmem_cache_debug(s) &&
            !alloc_debug_processing(s, page, freelist, addr))
        goto new_slab;    /* Slab failed checks. Next slab needed */
 
    deactivate_slab(s, page, get_freepointer(s, freelist), c);
    return freelist;
}

deactivate_slab

 

This function returns a slab to its node:

  1. First count the objects on the cpu freelist.
  2. Unfreeze the page and release every per-cpu freelist entry back to page->freelist.
  3. Depending on the slab's state, move the page to the appropriate list and update its state: frozen means the slab belongs to a cpu_slub, unfrozen means it sits on a partial or full list. Decide from there whether to release the slab.

kmem_cache_free

void kmem_cache_free(struct kmem_cache *s, void *x)
{
  /*
  cache_from_obj:
  purpose: locate the kmem_cache the object belongs to
  steps:
  1. without CONFIG_SLAB_FREELIST_HARDENED && SLAB_CONSISTENCY_CHECKS, return the kmem_cache the caller passed in
  2. otherwise the caller's value is untrusted; go through:
      -> virt_to_cache(x)
          -> virt_to_head_page(obj=x)
              -> virt_to_head_page(obj), deriving the managing struct page from the object address
 
      having obtained the object's page, return page->slab_cache — the authoritative struct kmem_cache —
      and compare it with what the caller passed in.
  3. finally return the kmem_cache the object belongs to
  */
    s = cache_from_obj(s, x);
  //check that the object's kmem_cache was found
    if (!s)
        return;
  /*
  slab_free
      ->slab_free_freelist_hook, which encodes the pointer when hardening is enabled
      ->do_slab_free
  */
    slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
    trace_kmem_cache_free(_RET_IP_, x, s->name);
}

do_slab_free (fast path)

static __always_inline void do_slab_free(struct kmem_cache *s,
                struct page *page, void *head, void *tail,
                int cnt, unsigned long addr)
{
    void *tail_obj = tail ? : head;
    struct kmem_cache_cpu *c;
    unsigned long tid;
 
    memcg_slab_free_hook(s, &head, 1);
redo:
    //keep tid consistent
    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPTION) && unlikely(tid != READ_ONCE(c->tid)));
 
    //memory barrier to prevent problems caused by reordering
    barrier();
    //check whether the page owning the object being freed is the kmem_cache_cpu's page
    if (likely(page == c->page))
    {
        void **freelist = READ_ONCE(c->freelist);
        //tail_obj is the object being inserted; store the old freelist head into the object's pointer field
        // *(obj+offset) = freelist
        set_freepointer(s, tail_obj, freelist);
        /*
 
        Atomic operation; once verified:
        s->cpu_slab->freelist = head (the object being inserted);
        s->cpu_slab->tid = next_tid(tid);
 
        **********************************************
        Once this step completes, the freed obj has been inserted into the freelist
        **********************************************
 
        */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                freelist, tid,
                head, next_tid(tid)))) {
 
            note_cmpxchg_failure("slab_free", s, tid);
            goto redo;
        }
        stat(s, FREE_FASTPATH);
    }
    //otherwise take the slow path, __slab_free
    else
        __slab_free(s, page, head, tail_obj, cnt, addr);
 
}

__slab_free (slow path)

static void __slab_free(struct kmem_cache *s, struct page *page,
            void *head, void *tail, int cnt,
            unsigned long addr)
 
{
    void *prior;
    int was_frozen;
    struct page new;
    unsigned long counters;
    struct kmem_cache_node *n = NULL;
    unsigned long flags;
 
    stat(s, FREE_SLOWPATH);
 
    //returns false; not implemented here
    if (kfence_free(head))
        return;
 
    if (kmem_cache_debug(s) &&
        !free_debug_processing(s, page, head, tail, cnt, addr))
        return;
 
    do {
        //release the spinlock taken by free_debug_processing
        if (unlikely(n)) {
            spin_unlock_irqrestore(&n->list_lock, flags);
            n = NULL;
        }
 
        prior = page->freelist;
        counters = page->counters;
 
        //tail此时是待插入的obj,设置obj的freepointer
        set_freepointer(s, tail, prior);
 
        new.counters = counters;
        was_frozen = new.frozen;
        new.inuse -= cnt;    //根据释放了多少个obj更新inuse
 
        /*
        如果当前的page没有正在被使用的obj 或者 没有可以被使用的free obj
        并且
        不处于frozen状态,即不属于某一个CPU
        的slab cache。
        那么
        */
        if ((!new.inuse || !prior) && !was_frozen) {
            //如果当前kmem_cache存在cpu_slab的partial链表,且没有可以使用的空闲obj(freelist为空);则标记page被冻结(属于cpu slab)
            //并且后续准备放入cpu_slab->partial
            if (kmem_cache_has_cpu_partial(s) && !prior) {
                new.frozen = 1;
 
            } else { /* Needs to be taken off a list */
                //获取node,加锁
                n = get_node(s, page_to_nid(page));
                spin_lock_irqsave(&n->list_lock, flags);
 
            }
        }
    /*
    page->freelist = head
    page->counters = new.counters
    */
    } while (!cmpxchg_double_slab(s, page,
        prior, counters,
        head, new.counters,
        "__slab_free"));
 
    /*
     n为空的可能性较大,即当前释放的对象是slab中的最后一个对象
     的可能性较小。其他的可能情况为:
     1. slab已满,并且slab不属于某个CPU
     2. slab已经属于某个CPU
     3. 无论slab是否属于某个CPU,slab的freelist不为空,且inuse
     字段不为0
     */
    if (likely(!n)) {
        //如果page被冻结,那么只更新FREE_FROZEN信息
        //此时说明slab已经属于其他CPU的slab cache,而当前的cpu不是冻结slab的cpu
        if (likely(was_frozen)) {
            stat(s, FREE_FROZEN);
        }
        //对于刚刚更新的frozen操作,此时cpu与冻结的操作是一致的,将page添加到当前CPU的slab cache的partial链表中
        else if (new.frozen) {
            put_cpu_partial(s, page, 1);
            stat(s, CPU_PARTIAL_FREE);
        }
 
        return;
    }
 
    //如果释放当前obj之后,slab为空,并且partial中的半满page数量高于最小值,进行slab的直接释放
    if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
        goto slab_empty;
 
 
    //释放当前obj后,仍有obj在slab中,slab为半满
    if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
        //将obj从fulllist中删除,插入partial list尾部
        remove_full(s, n, page);
        add_partial(n, page, DEACTIVATE_TO_TAIL);
        stat(s, FREE_ADD_PARTIAL);
    }
    //解锁
    spin_unlock_irqrestore(&n->list_lock, flags);
    return;
 
slab_empty:
    if (prior) {
        //如果当前page存在可用的obj,那么说明当前page在partial中,所以在partial list中删除page
        remove_partial(n, page);
        stat(s, FREE_REMOVE_PARTIAL);
    } else {
        //从full list中删除
        remove_full(s, n, page);
    }
 
    spin_unlock_irqrestore(&n->list_lock, flags);
    stat(s, FREE_SLAB);
    /* 释放slab */
    discard_slab(s, page);
}
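The branch structure of `__slab_free` can be condensed into a small decision table. The sketch below is hypothetical (the enum and function are mine, not the kernel's) and only models the flags discussed above: whether the slab was frozen, whether it is now empty, whether it had any free object (`prior != NULL`), and the node's partial-list counts.

```c
#include <stdbool.h>

/* hypothetical summary of what __slab_free() does with the page */
enum slab_action {
    FREE_FROZEN_ONLY,  /* frozen by another CPU: just count the event */
    PUT_CPU_PARTIAL,   /* we froze it here: onto this CPU's partial list */
    DISCARD_SLAB,      /* empty, node has enough partials: free the page */
    ADD_TO_PARTIAL,    /* went full -> half-full: onto node partial list */
    KEEP,              /* stays where it is */
};

static enum slab_action slab_free_action(bool was_frozen, bool now_empty,
                                         bool had_free_obj,  /* prior != NULL */
                                         bool has_cpu_partial,
                                         unsigned long nr_partial,
                                         unsigned long min_partial)
{
    /* freeze for this CPU: the page was full (no free object) and the
     * cache keeps per-CPU partial lists */
    bool new_frozen = !was_frozen && has_cpu_partial && !had_free_obj;

    if (was_frozen)
        return FREE_FROZEN_ONLY;
    if (new_frozen)
        return PUT_CPU_PARTIAL;
    if (now_empty || !had_free_obj) {  /* cases where list_lock is taken */
        if (now_empty && nr_partial >= min_partial)
            return DISCARD_SLAB;
        if (!has_cpu_partial && !had_free_obj)
            return ADD_TO_PARTIAL;
    }
    return KEEP;
}
```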

discard_slab

 

Release path: discard_slab -> free_slab -> __free_slab -> __free_pages

static void __free_slab(struct kmem_cache *s, struct page *page)
{
    // get the page's order
    int order = compound_order(page);
    int pages = 1 << order;

    if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
        void *p;

        // sanity-check the slab padding
        slab_pad_check(s, page);
        // run check_object on every object
        for_each_object(p, s, page_address(page),
                        page->objects)
            check_object(s, page, p, SLUB_RED_INACTIVE);
    }
    // clear the page flags
    __ClearPageSlabPfmemalloc(page);
    __ClearPageSlab(page);

    page->slab_cache = NULL;

    // update the current task's memory-reclaim state
    if (current->reclaim_state)
        current->reclaim_state->reclaimed_slab += pages;
    // update system-wide accounting
    unaccount_slab_page(page, order, s);
    // hand the pages back to the buddy system
    __free_pages(page, order);
}

check_object and CONFIG_SLUB

SLUB DEBUG can detect problems such as out-of-bounds accesses and use-after-free.

 

How to enable it:

CONFIG_SLUB=y
 
CONFIG_SLUB_DEBUG=y
 
CONFIG_SLUB_DEBUG_ON=y

Recommended reading:

 

Linux内核slab内存的越界检查——SLUB_DEBUG

static int check_object(struct kmem_cache *s, struct page *page,
                    void *object, u8 val)
{
    u8 *p = object;
    u8 *endobject = object + s->object_size;

    // if red zones are enabled, check them
    if (s->flags & SLAB_RED_ZONE) {
        // check the left red zone (detects out-of-bounds writes to the left)
        if (!check_bytes_and_report(s, page, object, "Left Redzone",
            object - s->red_left_pad, val, s->red_left_pad))
            return 0;
        // check the right red zone (did this object overflow to the right?)
        if (!check_bytes_and_report(s, page, object, "Right Redzone",
            endobject, val, s->inuse - s->object_size))
            return 0;
    } else {
        if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
            // check the padding area
            check_bytes_and_report(s, page, p, "Alignment padding",
                endobject, POISON_INUSE,
                s->inuse - s->object_size);
        }
    }

    if (s->flags & SLAB_POISON) {
        if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
            // check whether the object still carries its freed poison
            // (is the object's last byte POISON_END, 0xa5?)
            (!check_bytes_and_report(s, page, p, "Poison", p,
                    POISON_FREE, s->object_size - 1) ||
             !check_bytes_and_report(s, page, p, "End Poison",
                p + s->object_size - 1, POISON_END, 1)))
            return 0;
        check_pad_bytes(s, page, p);
    }
    // If the free pointer overlaps the object, an active object may have
    // overwritten it with user data, so skip the freelist check.
    if (!freeptr_outside_object(s) && val == SLUB_RED_ACTIVE)
        return 1;

    // validate the freelist pointer
    if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
        object_err(s, page, p, "Freepointer corrupt");
        // if it is invalid, truncate the freelist to drop the remaining objs
        set_freepointer(s, p, NULL);
        return 0;
    }
    return 1;
}
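`check_bytes_and_report` essentially scans a byte range for a single expected poison value (POISON_FREE is 0x6b, POISON_END is 0xa5, POISON_INUSE is 0x5a). A hypothetical stripped-down version, not the kernel's implementation:

```c
#include <stddef.h>

/* hypothetical stand-in for check_bytes_and_report(): return 1 if every
 * byte in [start, start+len) equals `expected`, else 0 (the kernel would
 * additionally report the first mismatching offset) */
static int check_bytes(const unsigned char *start, size_t len,
                       unsigned char expected)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (start[i] != expected)
            return 0;   /* corruption detected */
    return 1;
}
```

A freed object whose body no longer reads as all 0x6b has been written to after free, which is exactly the use-after-free signature SLUB DEBUG reports.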

Processes

At run time, each user-space process's address space consists of segments, each with its own attributes (executable, readable, and so on), and the segments are not necessarily contiguous. The kernel's per-process vma structures maintain these run-time segments.

 

For every process's task_struct:

task_struct -> mm_struct -> struct vm_area_struct *mmap;

<u>Image credit: the LoyenWang WeChat account</u>

 

haZSr6.png

 

vm_area_struct

struct vm_area_struct {
  // start and end addresses of this vma in the process address space
    unsigned long vm_start;
    unsigned long vm_end;

    // doubly linked list of vmas
    struct vm_area_struct *vm_next, *vm_prev;
    // red-black tree node
    struct rb_node vm_rb;
    // largest free gap below a vma within the subtree rooted at this node
    unsigned long rb_subtree_gap;
    struct mm_struct *vm_mm;    // the mm_struct this vma belongs to
    pgprot_t vm_page_prot;        // vma access permissions
    unsigned long vm_flags;        // flag bits
    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     */
    struct {
        struct rb_node rb;
        unsigned long rb_subtree_last;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.    A MAP_SHARED vma
     * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
  // After a COW of one of its file pages, a MAP_PRIVATE vma can be in both
  // the i_mmap tree and an anon_vma list. A MAP_SHARED vma can only be in
  // the i_mmap tree. An anonymous MAP_PRIVATE, stack, or brk vma can only
  // be in an anon_vma list.
    struct list_head anon_vma_chain;
    struct anon_vma *anon_vma;
    const struct vm_operations_struct *vm_ops; // vtable pointer
    unsigned long vm_pgoff;        // file-mapping offset in units of pages
    struct file * vm_file;        // which file is mapped
    void * vm_private_data;        /* was vm_pte (shared mem) */
    ......
} __randomize_layout;

find_vma

// look up the vma covering addr
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;

    // first try the per-thread vma cache
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;
    // root node of the red-black tree
    rb_node = mm->mm_rb.rb_node;
    // walk the tree: find the first vma with vm_end > addr
    while (rb_node) {
        struct vm_area_struct *tmp;
        // current node
        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

        if (tmp->vm_end > addr) {
            vma = tmp;
            if (tmp->vm_start <= addr)
                break;
            rb_node = rb_node->rb_left;
        } else
            rb_node = rb_node->rb_right;
    }
    // on success, refresh the vma cache
    if (vma)
        vmacache_update(addr, vma);
    return vma;
}
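The search rule above can be restated over a sorted array instead of an rb-tree: return the first region whose end is above `addr`, and let the caller check the start. A hypothetical sketch (the `region` type and `find_region` helper are mine, standing in for vmas and the tree descent):

```c
#include <stddef.h>

struct region { unsigned long start, end; };   /* stand-in for a vma */

/* linear stand-in for the rb-tree descent in find_vma(): first region
 * whose end > addr, or NULL. Like find_vma, this may return a region
 * that starts ABOVE addr; the caller must verify start <= addr,
 * otherwise addr falls into a hole between regions. */
static const struct region *find_region(const struct region *r, size_t n,
                                        unsigned long addr)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (r[i].end > addr)
            return &r[i];
    return NULL;
}
```

This "may return the next vma" behavior is what the later `vm_start <= address` / `VM_GROWSDOWN` checks in do_user_addr_fault rely on.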

vmacache_find

struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
{
    // shift addr right to compute the cache slot index
    int idx = VMACACHE_HASH(addr);
    int i;
    // record a VMACACHE_FIND_CALLS event
    count_vm_vmacache_event(VMACACHE_FIND_CALLS);
    // is mm the current process's mm_struct?
    if (!vmacache_valid(mm))
        return NULL;
    // scan for the vma, starting at slot idx
    for (i = 0; i < VMACACHE_SIZE; i++) {
        struct vm_area_struct *vma = current->vmacache.vmas[idx];
        if (vma) {
#ifdef CONFIG_DEBUG_VM_VMACACHE
            if (WARN_ON_ONCE(vma->vm_mm != mm))
                break;
#endif
            // on a hit, record a VMACACHE_FIND_HITS event and return
            if (vma->vm_start <= addr && vma->vm_end > addr) {
                count_vm_vmacache_event(VMACACHE_FIND_HITS);
                return vma;
            }
        }
        // once idx reaches VMACACHE_SIZE, wrap around to slot 0
        if (++idx == VMACACHE_SIZE)
            idx = 0;
    }

    return NULL;
}
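The cache is indexed by low bits of the page number: in the kernel, VMACACHE_SIZE is 4 and VMACACHE_HASH(addr) is `(addr >> PAGE_SHIFT) & VMACACHE_MASK`. A sketch of the index and wrap-around probe order (the helper names are mine):

```c
#define PAGE_SHIFT    12
#define VMACACHE_SIZE 4
#define VMACACHE_MASK (VMACACHE_SIZE - 1)

/* slot chosen from the low bits of the page number */
static int vmacache_hash(unsigned long addr)
{
    return (addr >> PAGE_SHIFT) & VMACACHE_MASK;
}

/* the i-th slot probed by the loop in vmacache_find(); since
 * VMACACHE_SIZE is a power of two, "++idx == SIZE -> idx = 0"
 * is equivalent to masking */
static int vmacache_probe_slot(unsigned long addr, int i)
{
    return (vmacache_hash(addr) + i) & VMACACHE_MASK;
}
```

Starting the scan at a hash of the address (rather than always at slot 0) makes repeated lookups of the same address hit on the first probe.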

insert_vm_struct

int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
    struct vm_area_struct *prev;
    struct rb_node **rb_link, *rb_parent;

    // locate the insertion point (in both the list and the rb-tree)
    if (find_vma_links(mm, vma->vm_start, vma->vm_end,
               &prev, &rb_link, &rb_parent))
        return -ENOMEM;

    if ((vma->vm_flags & VM_ACCOUNT) &&
         security_vm_enough_memory_mm(mm, vma_pages(vma)))
        return -ENOMEM;
    // an anonymous vma needs its vm_pgoff set
    if (vma_is_anonymous(vma)) {
        BUG_ON(vma->anon_vma);
        vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
    }
    // link the vma into the rb-tree and the doubly linked list
    vma_link(mm, vma, prev, rb_link, rb_parent);
    return 0;
}

Page Fault

When does a page fault occur?

  • No matching PTE can be found in the page table

    • Invalid address (the vma lookup by addr finds nothing, so the address is invalid: a segmentation fault for a user process, or an oops/panic for the kernel)
    • Valid address, but the page is not resident in main memory
      • First access: demand paging brings the page in.
      • The page's present bit is 0: it has been swapped out and must be brought back in from backing store.
      • COW access-semantics conflict: e.g. the PTE is not writable but a write was attempted, triggering the COW machinery to write into a copied page
  • The PTE for the virtual address denies the access

/*
 * Page fault error code bits:
 *
 *   bit 0 ==     0: no page found    1: protection fault
 *   bit 1 ==     0: read access        1: write access
 *   bit 2 ==     0: kernel-mode access    1: user-mode access
 *   bit 3 ==                1: use of reserved bit detected
 *   bit 4 ==                1: fault was an instruction fetch
 *   bit 5 ==                1: protection keys block access
 *   bit 15 ==                1: SGX MMU page-fault
 */
enum x86_pf_error_code {
    X86_PF_PROT    =        1 << 0,
    X86_PF_WRITE    =        1 << 1,
    X86_PF_USER    =        1 << 2,
    X86_PF_RSVD    =        1 << 3,
    X86_PF_INSTR    =        1 << 4,
    X86_PF_PK    =        1 << 5,
    X86_PF_SGX    =        1 << 15,
};
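As a concrete reading of these bits: a user-mode write to a present but read-only page (the classic COW trigger) arrives with bits 0, 1, and 2 all set. A small decoder (hypothetical helper, with the flags redefined locally so the snippet is self-contained):

```c
#define X86_PF_PROT  (1 << 0)   /* page was present: protection fault */
#define X86_PF_WRITE (1 << 1)   /* write access */
#define X86_PF_USER  (1 << 2)   /* fault taken in user mode */

/* does this error code describe a user-mode write protection fault,
 * e.g. a write to a COW-protected page? */
static int is_user_cow_candidate(unsigned long error_code)
{
    unsigned long want = X86_PF_PROT | X86_PF_WRITE | X86_PF_USER;
    return (error_code & want) == want;
}
```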

handle_page_fault

In the 5.13 kernel, __do_page_fault has been removed (on x86) and replaced by handle_page_fault.

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
    ->handle_page_fault(regs, error_code, address)
static __always_inline void
trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
             unsigned long address)
{
    if (!trace_pagefault_enabled())
        return;
    // user_mode(regs) inspects the saved register state to decide whether
    // the fault came from user mode (e.g. a user-space out-of-bounds
    // access), then records the corresponding event:
    /*
    DEFINE_PAGE_FAULT_EVENT(page_fault_user)
    DEFINE_PAGE_FAULT_EVENT(page_fault_kernel)
    */
    if (user_mode(regs))
        trace_page_fault_user(address, regs, error_code);
    else
        trace_page_fault_kernel(address, regs, error_code);
}

static __always_inline void
handle_page_fault(struct pt_regs *regs, unsigned long error_code,
                  unsigned long address)
{
    // record the trace event
    trace_page_fault_entries(regs, error_code, address);
    // if this is a kmmio fault, hand it to kmmio_handler
    if (unlikely(kmmio_fault(regs, address)))
        return;

    // Decide whether the fault is in kernel or user space.
    // fault_in_kernel_space() checks whether address is above TASK_SIZE_MAX;
    // the vsyscall page, though above that boundary, is deliberately not
    // treated as kernel space.
    if (unlikely(fault_in_kernel_space(address))) {
        do_kern_addr_fault(regs, error_code, address);
    } else {
        do_user_addr_fault(regs, error_code, address);
        /*
         * User address page fault handling might have reenabled
         * interrupts. Fixing up all potential exit points of
         * do_user_addr_fault() and its leaf functions is just not
         * doable w/o creating an unholy mess or turning the code
         * upside down.
         */
        local_irq_disable();
    }
}

do_kern_addr_fault

static void
do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
           unsigned long address)
{
    // X86_PF_PK only applies to user-mode pages
    WARN_ON_ONCE(hw_error_code & X86_PF_PK);
    // check for the Pentium F00F erratum
    if (is_f00f_bug(regs, hw_error_code, address))
        return;
  /*
    Make sure the fault:
    1. is not on a PTE with reserved bits set
    2. was not caused by a user-mode access to kernel memory
    3. is not a page-level protection violation (the page is simply not
       present, i.e. X86_PF_PROT == 0)
    */
  // on 32-bit, check whether this is a vmalloc fault
    if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
        if (vmalloc_fault(address) >= 0)
            return;
    }
    /* a spurious fault caused by a stale TLB entry that was not updated? */
    if (spurious_kernel_fault(hw_error_code, address))
        return;
    /* has kprobes hooked the page fault? */
    if (kprobe_page_fault(regs, X86_TRAP_PF))
        return;
    // a fault caused by an illegal address access
    bad_area_nosemaphore(regs, hw_error_code, address);
}

vmalloc_fault

 

The kernel uses vmalloc to allocate memory that is contiguous in virtual memory but not necessarily contiguous in physical memory.

 

This function handles faults in the vmalloc or module mapping area. It is needed because there is a race window between the point where the vmalloc mapping code updates a PMD and the point where that update is synchronized to the other page tables in the system. In this window, another thread/CPU can map an area on the same PMD, find it already present, while it has not yet been synchronized to the rest of the system. As a result, vmalloc may return areas that are not mapped in every page table in the system, and accessing them raises otherwise-unhandled page faults.

 

The fix is essentially to copy the (global) page-table entries of the init process into the current process's page tables, keeping the kernel address space synchronized across all processes.

static noinline int vmalloc_fault(unsigned long address)
{
    unsigned long pgd_paddr;
    pmd_t *pmd_k;
    pte_t *pte_k;

    /* Make sure we are in the system's vmalloc area */
    if (!(address >= VMALLOC_START && address < VMALLOC_END))
        return -1;
    // Read the physical address of the current PGD (the process's kernel
    // page table), then copy the matching entry from the global kernel
    // page table into it, completing the synchronization.
    pgd_paddr = read_cr3_pa();
    pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);

    if (!pmd_k)
        return -1;
    // is it a huge page (PSE)?
    if (pmd_large(*pmd_k))
        return 0;

    // locate the PTE within the PMD
    pte_k = pte_offset_kernel(pmd_k, address);
    // if the PTE is not present in memory, this is a real fault
    if (!pte_present(*pte_k))
        return -1;

    return 0;
}

spurious_kernel_fault

 

This function handles spurious faults caused by TLB entries that have gone stale and were not updated in time.

 

Possible cause: the TLB entry carries fewer permissions than the corresponding page-table entry.

 

This can be triggered by:

 

1. A write access performed in ring 0.

 

2. An instruction fetch from an NX region.

static noinline int
spurious_kernel_fault(unsigned long error_code, unsigned long address)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    int ret;

    // only X86_PF_WRITE / X86_PF_INSTR (with X86_PF_PROT) can cause this
    if (error_code != (X86_PF_WRITE | X86_PF_PROT) &&
        error_code != (X86_PF_INSTR | X86_PF_PROT))
        return 0;

    // locate the entry in the kernel page table
    pgd = init_mm.pgd + pgd_index(address);
    // is the pgd present in memory?
    if (!pgd_present(*pgd))
        return 0;
    // get the p4d entry by offset
    p4d = p4d_offset(pgd, address);
    if (!p4d_present(*p4d))
        return 0;
    // a huge page?
    if (p4d_large(*p4d))
        // if the p4d has PSE set, it acts as the pte; check it directly
        return spurious_kernel_fault_check(error_code, (pte_t *) p4d);

    pud = pud_offset(p4d, address);
    if (!pud_present(*pud))
        return 0;
    if (pud_large(*pud))
        return spurious_kernel_fault_check(error_code, (pte_t *) pud);
    pmd = pmd_offset(pud, address);
    if (!pmd_present(*pmd))
        return 0;
    if (pmd_large(*pmd))
        return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
    // finally reach the pte
    pte = pte_offset_kernel(pmd, address);
    if (!pte_present(*pte))
        return 0;
    ret = spurious_kernel_fault_check(error_code, pte);
    if (!ret)
        return 0;

    // The pte allows the access yet the fault still happened: re-check at
    // the pmd level, which would indicate buggy page tables.
    ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
    WARN_ONCE(!ret, "PMD has incorrect permission bits\n");

    return ret;
}

bad_area_nosemaphore

 

bad_area_nosemaphore -> __bad_area_nosemaphore

static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
               unsigned long address, u32 pkey, int si_code)
{
    struct task_struct *tsk = current;

    // If the fault did not come from user mode (it happened in the kernel,
    // or in the vsyscall area), try kernelmode_fixup_or_oops: attempt a
    // fixup first; if that fails, panic and print an oops.
    if (!user_mode(regs)) {
        kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code);
        return;
    }
    // an implicit user access to kernel memory: just oops
    if ( !(error_code & X86_PF_USER) ) {
        /* Implicit user access to kernel memory -- just oops */
        page_fault_oops(regs, error_code, address);
        return;
    }

    /*
    User-mode accesses only result in SIGSEGV. Interrupts may be
    disabled at this point, so re-enable them.
    */
    local_irq_enable();
    /*
    the access came from user mode
    */
    if (is_prefetch(regs, error_code, address))
        return;

    if (is_errata100(regs, address))
        return;
    // disguise the error code to avoid leaking kernel information
    sanitize_error_code(address, &error_code);
    // fix up vdso faults separately
    if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
        return;
    if (likely(show_unhandled_signals))
    // print the error message
        show_signal_msg(regs, error_code, address, tsk);
    set_signal_archinfo(address, error_code);

    if (si_code == SEGV_PKUERR)
        force_sig_pkuerr((void __user *)address, pkey);
    else
        // deliver SIGSEGV
        force_sig_fault(SIGSEGV, si_code, (void __user *)address);

    local_irq_disable();
}

kernelmode_fixup_or_oops

static noinline void
kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
             unsigned long address, int signal, int si_code)
{
    // calling this function from user mode is itself a bug
    WARN_ON_ONCE(user_mode(regs));

    /*
    Try to fix the fault by searching the exception table, calling
    ex_fixup_handler(search_exception_tables(regs->ip))
    */
    if (fixup_exception(regs, X86_TRAP_PF, error_code, address))
    {
        /*
         Any interrupt that takes a fault gets the fixup. This makes
         the recursive-fault logic below apply only to faults taken
         from task context.
         */
        if (in_interrupt())
            return;

        /*
        In this case we need to make sure we are not recursively
        faulting through the emulate_vsyscall() logic.
         */
        if (current->thread.sig_on_uaccess_err && signal) {
            // sanitize_error_code sets error_code |= X86_PF_PROT,
            // disguising kernel-space accesses as protection faults
            // so no kernel page-table information leaks.
            sanitize_error_code(address, &error_code);

            set_signal_archinfo(address, error_code);
            force_sig_fault(signal, si_code, (void __user *)address);
        }
        return;
    }

    /*
     AMD erratum #91 manifests as a spurious page fault on a PREFETCH
     instruction.
     */
    if (is_prefetch(regs, error_code, address))
        return;
    // report the failure with an oops
    page_fault_oops(regs, error_code, address);
}

do_user_addr_fault

static inline
void do_user_addr_fault(struct pt_regs *regs,
            unsigned long error_code,
            unsigned long address)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    struct mm_struct *mm;
    vm_fault_t fault;
    unsigned int flags = FAULT_FLAG_DEFAULT;

    tsk = current;
    mm = tsk->mm;

    // This is kernel-mode code trying to execute from user memory.
    // Unless it is one of the AMD errata, oops directly.
    if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR)) {
        if (is_errata93(regs, address))
            return;
        page_fault_oops(regs, error_code, address);
        return;
    }
    // has kprobes hooked the fault?
    if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF)))
        return;

    /*
       Reserved bits must never be set in user-mode page-table entries;
       if one is set, the page tables are corrupt.
     */
    if (unlikely(error_code & X86_PF_RSVD))
        pgtable_bad(regs, error_code, address);
    /*
     With SMAP enabled, a kernel-space access to user space (without
     EFLAGS.AC set) goes straight to page_fault_oops.
     */
    if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
             !(error_code & X86_PF_USER) &&
             !(regs->flags & X86_EFLAGS_AC))) {
        page_fault_oops(regs, error_code, address);
        return;
    }

    /*
    If we are in an interrupt (so there is no user context), or we are
    running in a region where page faults are disabled, go straight to
    bad_area_nosemaphore.
    */
    if (unlikely(faulthandler_disabled() || !mm)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
    It is safe to enable interrupts once cr2 has been saved and the
    vmalloc fault has been handled.
    */
    if (user_mode(regs)) {
        local_irq_enable();
        flags |= FAULT_FLAG_USER;
    } else {
        if (regs->flags & X86_EFLAGS_IF)
            local_irq_enable();
    }
    // record the event
    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
    // update the fault flags
    if (error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;
    if (error_code & X86_PF_INSTR)
        flags |= FAULT_FLAG_INSTRUCTION;

#ifdef CONFIG_X86_64
    /*
    The vsyscall page has no vma, so emulate it as one before the
    vma lookup below.
    */
    if (is_vsyscall_vaddr(address)) {
        if (emulate_vsyscall(error_code, regs, address))
            return;
    }
#endif
    /*
    Kernel-mode accesses to user address space should only occur on
    well-defined single instructions listed in the exception tables.
    But a kernel fault taken outside a region that holds mmap_lock
    could deadlock.

    Only do the expensive exception-table search when we might actually
    deadlock, i.e. when we:
         1. failed to acquire mmap_lock, and
         2. the access did not originate in user space.
    */
    if (unlikely(!mmap_read_trylock(mm)))
    {
        if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
            /*
             * Fault from code in kernel from
             * which we do not expect faults.
             */
            bad_area_nosemaphore(regs, error_code, address);
            return;
        }
retry:
        mmap_read_lock(mm);
    } else {
        // When mmap_read_trylock succeeds we skip the might_sleep in
        // mmap_read_lock, so make up for it here.
        might_sleep();
    }
    // look up the vma and check the address's validity
    vma = find_vma(mm, address);
    if (unlikely(!vma)) {
        bad_area(regs, error_code, address);
        return;
    }
    if (likely(vma->vm_start <= address))
        goto good_area;

    // Out-of-range access: address < vma->vm_start.
    // If the vma does not grow downward (it is not a stack), the address
    // falls into a hole between vmas, so it is a bad area.
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
        bad_area(regs, error_code, address);
        return;
    }

    // Expand the user stack.
    // If the address lies just below the stack vma, try to grow the stack.
    // expand_stack() only adjusts the stack area's vm_area_struct; mapping
    // the newly extended pages to physical memory is left to the good_area
    // path.
    if (unlikely(expand_stack(vma, address))) {
        bad_area(regs, error_code, address);
        return;
    }

    /*
     At this point the vma we need has been set up.
     */
good_area:
    // Check whether the operation conflicts with the vma's permissions.
    // If error_code contains X86_PF_PK, access_error() returns without
    // further handling here, leaving the work to COW.
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }

    /*
     Handle the fault itself, then judge the outcome so the same fault
     does not repeat forever.
     */
    fault = handle_mm_fault(vma, address, flags, regs);
    if (fault_signal_pending(fault, regs)) {
        /*
         * Quick path to respond to signals.  The core mm code
         * has unlocked the mm for us if we get here.
         */
        if (!user_mode(regs))
            kernelmode_fixup_or_oops(regs, error_code, address,
                         SIGBUS, BUS_ADRERR);
        return;
    }

    /*
     * If we need to retry the mmap_lock has already been released,
     * and if there is a fatal signal pending there is no guarantee
     * that we made any progress. Handle this case first.
     */
    if (unlikely((fault & VM_FAULT_RETRY) &&
             (flags & FAULT_FLAG_ALLOW_RETRY))) {
        flags |= FAULT_FLAG_TRIED;
        goto retry;
    }
    // unlock
    mmap_read_unlock(mm);
    if (likely(!(fault & VM_FAULT_ERROR)))
        return;

    if (fatal_signal_pending(current) && !user_mode(regs)) {
        kernelmode_fixup_or_oops(regs, error_code, address, 0, 0);
        return;
    }
    // do we need to invoke the OOM machinery?
    if (fault & VM_FAULT_OOM) {
        /* Kernel mode? Handle exceptions or die: */
        if (!user_mode(regs)) {
            kernelmode_fixup_or_oops(regs, error_code, address,
                         SIGSEGV, SEGV_MAPERR);
            return;
        }
        // invoke the OOM killer
        pagefault_out_of_memory();
    }
    else {
        if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
                 VM_FAULT_HWPOISON_LARGE))
            do_sigbus(regs, error_code, address, fault);
        else if (fault & VM_FAULT_SIGSEGV)
            bad_area_nosemaphore(regs, error_code, address);
        else
            BUG();
    }
}

handle_mm_fault

handle_mm_fault -> __handle_mm_fault
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
        unsigned long address, unsigned int flags)
{
    struct vm_fault vmf = {
        .vma = vma,
        .address = address & PAGE_MASK,
        .flags = flags,
        .pgoff = linear_page_index(vma, address),
        .gfp_mask = __get_fault_gfp_mask(vma),
    };

    unsigned int dirty = flags & FAULT_FLAG_WRITE;


    struct mm_struct *mm = vma->vm_mm;
    pgd_t *pgd;
    p4d_t *p4d;
    vm_fault_t ret;
    // mm->pgd + pgd_index(address)
    // locate the entry in the pgd
    pgd = pgd_offset(mm, address);
    // Allocate a new p4d if needed (for 5-level paging:
    // PGD->P4D->PUD->PMD->PTE); with 4-level paging this returns the pgd.
    p4d = p4d_alloc(mm, pgd, address);
    if (!p4d)
        return VM_FAULT_OOM;
    // locate (or allocate) the pud
    vmf.pud = pud_alloc(mm, p4d, address);
    if (!vmf.pud)
        return VM_FAULT_OOM;

retry_pud:
    // If the PUD entry is empty and transparent huge pages are enabled,
    // try a PUD-sized huge fault (anonymous pages are not supported here).
    if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
        ret = create_huge_pud(&vmf);
        if (!(ret & VM_FAULT_FALLBACK))
            return ret;
    }
    else {
        pud_t orig_pud = *vmf.pud;

        barrier();
        // the pud maps a huge page (PSE) or is a devmap
        if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
            // A dirtying write to a non-writable huge pud triggers the
            // huge write-protect fault (anonymous pages not supported).
            if (dirty && !pud_write(orig_pud)) {
                ret = wp_huge_pud(&vmf, orig_pud);
                if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            } else {
                huge_pud_set_accessed(&vmf, orig_pud);
                return 0;
            }
        }
    }

    vmf.pmd = pmd_alloc(mm, vmf.pud, address);
    if (!vmf.pmd)
        return VM_FAULT_OOM;

    /* Huge pud page fault raced with pmd_alloc? */
    if (pud_trans_unstable(vmf.pud))
        goto retry_pud;

    // If the pmd is empty and the vma can host a THP, create one with
    // create_huge_pmd.
    if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
        ret = create_huge_pmd(&vmf);
        if (!(ret & VM_FAULT_FALLBACK))
            return ret;
    } else {
        pmd_t orig_pmd = *vmf.pmd;
        barrier();
        // has the pmd been swapped out of memory?
        if (unlikely(is_swap_pmd(orig_pmd))) {
            // bug if THP migration is supported but orig_pmd is not a
            // migration entry
            VM_BUG_ON(thp_migration_supported() &&
                      !is_pmd_migration_entry(orig_pmd));
            if (is_pmd_migration_entry(orig_pmd))
                pmd_migration_entry_wait(mm, vmf.pmd);
            return 0;
        }
        // the pmd has _PAGE_PSE set (a THP), or is a devmap
        if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
            if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
                return do_huge_pmd_numa_page(&vmf, orig_pmd);

            if (dirty && !pmd_write(orig_pmd)) {
                ret = wp_huge_pmd(&vmf, orig_pmd);
                if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            } else {
                huge_pmd_set_accessed(&vmf, orig_pmd);
                return 0;
            }
        }
    }
    // finally handle the fault at PTE granularity
    return handle_pte_fault(&vmf);
}

handle_pte_fault

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    pte_t entry;
    //如果对应的pmd不存在,则pte不存在
    if (unlikely(pmd_none(*vmf->pmd))) {
        vmf->pte = NULL;
    } else {
        /*
        static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
        {
            return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
        }
        pmd_devmap:通过检测_PAGE_DEVMAP判断是否是devmap,如果不是devmap,则:调用pmd_trans_unstable
        pmd_trans_unstable:
            1.如果kernel不支持THP,那么就是noop
            2.如果支持THP,那么调用pmd_none_or_trans_huge_or_clear_bad
                首先检测pmd是否为空 或者 是否可以转换成THP 或者 是否允许THP迁移,并且是否设置了_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
                    return 1
                否则进入pmd_clear_bad,调用pmd_ERROR->pr_err打印错误,最终调用 native_set_pmd(pmd, native_make_pmd(0))
        */
        if (pmd_devmap_trans_unstable(vmf->pmd))
            return 0;
        /*
         至此,一个普通的pmd建立起来了,而且此时它不能再从我们下面变成一个huge pmd,
         因为我们持有mmap_lock的读侧,而khugepaged以写侧拿走了它。
         所以现在运行pte_offset_map()是安全的。
         */
        vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
        vmf->orig_pte = *vmf->pte;
        barrier();
        if (pte_none(vmf->orig_pte)) {
            pte_unmap(vmf->pte);
            vmf->pte = NULL;
        }
    }
    // 如果pte为空,进行页表分配
    if (!vmf->pte) {
        if (vma_is_anonymous(vmf->vma))
            //处理匿名页
            return do_anonymous_page(vmf);
        else
            //处理文件映射的页,此时不是匿名页,调用do_fault调入页
            //这里可以看https://bbs.pediy.com/thread-264199.htm#msg_header_h3_5
            return do_fault(vmf);
    }
 
    //此时物理页已经存在
 
    //如果不在内存,swap进来
    if (!pte_present(vmf->orig_pte))
        return do_swap_page(vmf);
    //维持node平衡,进行页迁移
    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
        return do_numa_page(vmf);
 
    vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
    spin_lock(vmf->ptl);
    entry = vmf->orig_pte;
 
    //检测我们的pte是否发生变化,如果发生变化,更新TLB,然后解锁,返回
    if (unlikely(!pte_same(*vmf->pte, entry))) {
        update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
        goto unlock;
    }
 
    // This fault was triggered by a write
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            // The page is present but the pte is not writable while a write
            // was attempted: handle the COW fault in do_wp_page()
            return do_wp_page(vmf);
        // Mark the page dirty
        entry = pte_mkdirty(entry);
    }
    // Set the accessed bit (_PAGE_ACCESSED)
    entry = pte_mkyoung(entry);
 
    // ptep_set_access_flags() checks whether the pte changed and, if so,
    // updates the accessed/dirty bits in the page-table entry
    if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                  vmf->flags & FAULT_FLAG_WRITE)) {
        // The pte was updated: refresh the MMU cache
        update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
    } else {
        // The pte did not change
        if (vmf->flags & FAULT_FLAG_TRIED)
            goto unlock;
        // A write fault that changed nothing may be a spurious COW fault:
        // flush the stale TLB entry
        if (vmf->flags & FAULT_FLAG_WRITE)
            flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
    }
    }
unlock:
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    return 0;
}

More detailed notes on some of the calls made in this function can be found at:

https://bbs.pediy.com/thread-264199.htm

which mainly covers the corresponding COW handling.

References

Special thanks to povcfe, whose Linux memory management analysis taught me a great deal.

https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/
https://zhuanlan.zhihu.com/p/68465952
https://blog.csdn.net/jasonchen_gbd/article/details/79462014
https://blog.csdn.net/zhoutaopower/article/details/87090982
https://blog.csdn.net/zhoutaopower/article/details/88025712
https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html
https://zhuanlan.zhihu.com/p/137277724
https://segmentfault.com/a/1190000012269249
https://www.codenong.com/cs105984564/
https://rtoax.blog.csdn.net/article/details/108663898
https://qinglinmao8315.github.io/linux/2018/03/14/linux-page-cache.html
https://www.jianshu.com/p/8a86033dfcb0
https://blog.csdn.net/wh8_2011/article/details/53138377
https://zhuanlan.zhihu.com/p/258921453
https://blog.csdn.net/FreeeLinux/article/details/54754752
https://www.sohu.com/a/297831850_467784
https://www.cnblogs.com/adera/p/11718765.html
https://blog.csdn.net/zhuyong006/article/details/100737724
https://blog.csdn.net/wangquan1992/article/details/105036282/
https://blog.csdn.net/sykpour/article/details/24044641

Page Migration and Memory Compaction

