  Aug 15, 2020
      mm/page_counter: fix various data races at memsw · 6e4bd50f
      Qian Cai authored
      
      Since commit 3e32cb2e ("mm: memcontrol: lockless page counters"),
      memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
      concurrently, as reported by KCSAN:
      
       BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
      
       read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
        page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
        try_charge+0x131/0xd50 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x58/0x140
        __memcg_kmem_charge+0xcc/0x280
        __alloc_pages_nodemask+0x1e1/0x450
        alloc_pages_current+0xa6/0x120
        pte_alloc_one+0x17/0xd0
        __pte_alloc+0x3a/0x1f0
        copy_p4d_range+0xc36/0x1990
        copy_page_range+0x21d/0x360
        dup_mmap+0x5f5/0x7a0
        dup_mm+0xa2/0x240
        copy_process+0x1b3f/0x3460
        _do_fork+0xaa/0xa20
        __x64_sys_clone+0x13b/0x170
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
        page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
        try_charge+0x131/0xd50 mm/memcontrol.c:2405
        mem_cgroup_try_charge+0x159/0x460
        mem_cgroup_try_charge_delay+0x3d/0xa0
        wp_page_copy+0x14d/0x930
        do_wp_page+0x107/0x7b0
        __handle_mm_fault+0xce6/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
      
       write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
        page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
        try_charge+0x185/0xbf0 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
        __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
        __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
      
       read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
        page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
        try_charge+0x185/0xbf0 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
        __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
        __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
      
      Since the watermark could be compared against or set to a torn value due
      to the data race, which would change the code logic, fix it by adding a
      pair of READ_ONCE() and WRITE_ONCE() at those places.
      
      The "failcnt" counter is tolerant of some degree of inaccuracy and is only
      used to report stats, a data race will not be harmful, thus mark it as an
      intentional data race using the data_race() macro.
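
      A minimal sketch of the resulting pattern in page_counter_try_charge(),
      with field names as in struct page_counter and surrounding code elided:

        /* annotate both sides of the watermark access to avoid tearing: */
        if (new > READ_ONCE(c->watermark))
                WRITE_ONCE(c->watermark, new);

        /* failcnt is stats-only; the race is tolerated and annotated: */
        data_race(c->failcnt++);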
      
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Apr 02, 2020
      mm, memcg: prevent memory.min load/store tearing · c3d53200
      Chris Down authored
      
      This can be set concurrently with reads, which may cause the wrong value
      to be propagated.
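
      A minimal sketch of the annotation pattern, assuming the memcg->memory.min
      field targeted by this patch (variable names are illustrative):

        /* writer side (cgroup file handler), annotated against store tearing: */
        WRITE_ONCE(memcg->memory.min, new_min);

        /* reader side (reclaim path), annotated against load tearing: */
        unsigned long min = READ_ONCE(memcg->memory.min);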
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/e809b4e6b0c1626dac6945970de06409a180ee65.1584034301.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm, memcg: prevent memory.low load/store tearing · f86b810c
      Chris Down authored
      
      This can be set concurrently with reads, which may cause the wrong value
      to be propagated.
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/448206f44b0fa7be9dad2ca2601d2bcb2c0b7844.1584034301.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memcontrol: fix memory.low proportional distribution · 503970e4
      Johannes Weiner authored
      
      Patch series "mm: memcontrol: recursive memory.low protection", v3.
      
      The current memory.low (and memory.min) semantics require protection to be
      assigned to a cgroup in an uninterrupted chain from the top-level cgroup
      all the way to the leaf.
      
      In practice, we want to protect entire cgroup subtrees from each other
      (system management software vs.  workload), but we would like the VM to
      balance memory optimally *within* each subtree, without having to make
      explicit weight allocations among individual components.  The current
      semantics make that impossible.
      
      They also introduce unmanageable complexity into more advanced resource
      trees.  For example:
      
                host root
                `- system.slice
                   `- rpm upgrades
                   `- logging
                `- workload.slice
                   `- a container
                      `- system.slice
                      `- workload.slice
                         `- job A
                            `- component 1
                            `- component 2
                         `- job B
      
      From a host-level perspective, we would like to protect the outer
      workload.slice subtree as a whole from rpm upgrades, logging, etc.  But for
      that to be effective, right now we'd have to propagate it down through the
      container, the inner workload.slice, into the job cgroup and ultimately
      the component cgroups where memory is actually, physically allocated.
      This may cross several tree delegation points and namespace boundaries,
      which makes such a setup nearly impossible.
      
      CPU and IO on the other hand are already distributed recursively.  The
      user would simply configure allowances at the host level, and they would
      apply to the entire subtree without any downward propagation.
      
      To enable the above-mentioned usecases and bring memory in line with other
      resource controllers, this patch series extends memory.low/min such that
      settings apply recursively to the entire subtree.  Users can still assign
      explicit shares in subgroups, but if they don't, any ancestral protection
      will be distributed such that children compete freely amongst each other -
      as if no memory control were enabled inside the subtree - but enjoy
      protection from neighboring trees.
      
      In the above example, the user would then be able to configure shares of
      CPU, IO and memory at the host level to comprehensively protect and
      isolate the workload.slice as a whole from system.slice activity.
      
      Patch #1 fixes an existing bug that can give a cgroup tree more protection
      than it should receive as per ancestor configuration.
      
      Patch #2 simplifies and documents the existing code to make it easier to
      reason about the changes in the next patch.
      
      Patch #3 finally implements recursive memory protection semantics.
      
      Because of a risk of regressing legacy setups, the new semantics are
      hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
      
      More details in patch #3.
      
      This patch (of 3):
      
      When memory.low is overcommitted - i.e.  the children claim more
      protection than their shared ancestor grants them - the allowance is
      distributed in proportion to how much each sibling uses their own declared
      protection:
      
      	low_usage = min(memory.low, memory.current)
      	elow = parent_elow * (low_usage / siblings_low_usage)
      
      However, siblings_low_usage is not the sum of all low_usages.  It sums
      up the usages of *only those cgroups that are within their memory.low*.
      That means that low_usage can be *bigger* than siblings_low_usage, and
      consequently the total protection afforded to the children can be
      bigger than what the ancestor grants the subtree.
      
      Consider three groups where two are in excess of their protection:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 8G (only A3 contributes)
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
      
        (the 12.5G are capped to the explicit memory.low setting of 10G)
      
      With that, the sum of all awarded protection below A is 30G, when A
      only grants 10G for the entire subtree.
      
      What does this mean in practice? A1 and A2 would still be in excess of
      their 10G allowance and would be reclaimed, whereas A3 would not. As
      they eventually drop below their protection setting, they would be
      counted in siblings_low_usage again and the error would right itself.
      
      When reclaim was applied in a binary fashion (cgroup is reclaimed when
      it's above its protection, otherwise it's skipped) this would actually
      work out just fine. However, since 1bc63fb1 ("mm, memcg: make scan
      aggression always exclude protection"), reclaim pressure is scaled to
      how much a cgroup is above its protection. As a result this
      calculation error unduly skews pressure away from A1 and A2 toward the
      rest of the system.
      
      But why did we do it like this in the first place?
      
      The reasoning behind exempting groups in excess from
      siblings_low_usage was to go after them first during reclaim in an
      overcommitted subtree:
      
        A/memory.low = 2G, memory.current = 4G
        A/A1/memory.low = 3G, memory.current = 2G
        A/A2/memory.low = 1G, memory.current = 2G
      
        siblings_low_usage = 2G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
      
      While the children combined are overcommitting A and are technically
      both at fault, A2 is actively declaring unprotected memory and we
      would like to reclaim that first.
      
      However, while this sounds like a noble goal on the face of it, it
      doesn't make much difference in actual memory distribution: Because A
      is overcommitted, reclaim will not stop once A2 gets pushed back to
      within its allowance; we'll have to reclaim A1 either way. The end
      result is still that protection is distributed proportionally, with A1
      getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
      
      [ If A weren't overcommitted, it wouldn't make a difference since each
        cgroup would just get the protection it declares:
      
        A/memory.low = 2G, memory.current = 3G
        A/A1/memory.low = 1G, memory.current = 1G
        A/A2/memory.low = 1G, memory.current = 2G
      
        With the current calculation:
      
        siblings_low_usage = 1G (only A1 contributes)
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
      
        Including excess groups in siblings_low_usage:
      
        siblings_low_usage = 2G
        A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
        A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
      
      Simplify the calculation and fix the proportional reclaim bug by
      including excess cgroups in siblings_low_usage.
      
      After this patch, the effective memory.low distribution from the
      example above would be as follows:
      
        A/memory.low = 10G
        A/A1/memory.low = 10G, memory.current = 20G
        A/A2/memory.low = 10G, memory.current = 20G
        A/A3/memory.low = 10G, memory.current =  8G
        siblings_low_usage = 28G
      
        A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
        A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
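
      A sketch of the corrected calculation (hedged; variable names are
      illustrative, loosely following the kernel's effective-protection logic):

        /* every sibling now contributes its actually-used protection */
        low_usage = min(usage, low);
        elow = parent_elow * low_usage / siblings_low_usage;
        elow = min(elow, low);    /* never exceed the explicit setting */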
      
      Fixes: 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection")
      Fixes: 23067153 ("mm: memory.low hierarchical behavior")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Jun 08, 2018
      memcg: introduce memory.min · bf8d5d52
      Roman Gushchin authored

      The memory controller implements the memory.low best-effort memory
      protection mechanism, which works perfectly in many cases and allows
      protecting the working sets of important workloads from sudden reclaim.

      But its semantics have a significant limitation: it works only as long as
      there is a supply of reclaimable memory.  This makes it pretty useless
      against any sort of slow memory leak or memory usage increase.  This
      is especially true for swapless systems.  If swap is enabled, memory
      soft protection effectively postpones problems, allowing a leaking
      application to fill the entire swap area, which makes no sense.  The only
      effective way to guarantee memory protection in this case is to
      invoke the OOM killer.
      
      It's possible to handle this case in userspace by reacting to MEMCG_LOW
      events; but there is still a place for a fail-safe in-kernel mechanism
      to provide stronger guarantees.
      
      This patch introduces the memory.min interface for cgroup v2 memory
      controller.  It works very similarly to memory.low (sharing the same
      hierarchical behavior), except that it's not disabled if there is no
      more reclaimable memory in the system.
      
      If a cgroup is not populated, its memory.min is ignored, because otherwise
      even the OOM killer wouldn't be able to reclaim the protected memory,
      and the system could stall.
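
      A sketch of the resulting protection levels during reclaim (hedged;
      loosely following the mem_cgroup_protected() logic added by this patch):

        if (usage <= emin)
                return MEMCG_PROT_MIN;  /* hard: skip even with no other reclaimable memory */
        if (usage <= elow)
                return MEMCG_PROT_LOW;  /* best-effort: skip while other memory is reclaimable */
        return MEMCG_PROT_NONE;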
      
      [guro@fb.com: s/low/min/ in docs]
      Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
      Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memory.low hierarchical behavior · 23067153
      Roman Gushchin authored
      This patch aims to address an issue in current memory.low semantics,
      which makes it hard to use it in a hierarchy, where some leaf memory
      cgroups are more valuable than others.
      
      For example, there are memcgs A, A/B, A/C, A/D and A/E:
      
        A      A/memory.low = 2G, A/memory.current = 6G
       //\\
      BC  DE   B/memory.low = 3G  B/memory.current = 2G
               C/memory.low = 1G  C/memory.current = 2G
               D/memory.low = 0   D/memory.current = 2G
               E/memory.low = 10G E/memory.current = 0
      
      If we apply memory pressure, B, C and D are reclaimed at the same pace
      while A's usage exceeds 2G.  This is obviously wrong, as B's usage is
      fully below B's memory.low, and C has 1G of protection as well.  Also, A
      is pushed to a size below its 2G memory.low, which is also wrong.
      
      A simple bash script (provided below) can be used to reproduce
      the problem. Current results are:
        A:    1430097920
        A/B:  711929856
        A/C:  717426688
        A/D:  741376
        A/E:  0
      
      To address the issue, a concept of effective memory.low is introduced.
      Effective memory.low is always less than or equal to the original
      memory.low.  When there is no memory.low overcommitment (and also for
      top-level cgroups), these two values are equal.

      Otherwise it's a share of the parent's effective memory.low, calculated as
      the cgroup's memory.low usage divided by the sum of the siblings'
      memory.low usages (by memory.low usage I mean the amount of actually
      protected memory: memory.current if memory.current < memory.low, 0
      otherwise).  It's necessary to track the actual usage, because otherwise
      an empty cgroup with memory.low set (A/E in my example) would affect the
      actual memory distribution, which makes no sense.  To avoid traversing the
      cgroup tree twice, the page_counters code is reused.
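
      A sketch of that calculation (hedged; names are illustrative):

        /* "memory.low usage": only actually protected memory counts */
        low_usage = (usage < low) ? usage : 0;
        elow = min(low, parent_elow * low_usage / siblings_low_usage);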
      
      Calculating effective memory.low can be done in the reclaim path, as we
      conveniently traverse the cgroup tree from top to bottom there, checking
      memory.low on each level.  So it's a perfect place to calculate effective
      memory.low and save it for use by child cgroups.

      This also eliminates the need to traverse the cgroup tree from bottom to
      top each time to check whether the parent's guarantee is exceeded.
      
      Setting/resetting effective memory.low is intentionally racy, but it's
      fine and shouldn't lead to any significant differences in actual memory
      distribution.
      
      With this patch applied results are matching the expectations:
        A:    2147930112
        A/B:  1428721664
        A/C:  718393344
        A/D:  815104
        A/E:  0
      
      Test script:
        #!/bin/bash
      
        CGPATH="/sys/fs/cgroup"
      
        truncate /file1 --size 2G
        truncate /file2 --size 2G
        truncate /file3 --size 2G
        truncate /file4 --size 50G
      
        mkdir "${CGPATH}/A"
        echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
        mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
      
        echo 2G > "${CGPATH}/A/memory.low"
        echo 3G > "${CGPATH}/A/B/memory.low"
        echo 1G > "${CGPATH}/A/C/memory.low"
        echo 0 > "${CGPATH}/A/D/memory.low"
        echo 10G > "${CGPATH}/A/E/memory.low"
      
        echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
        echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
        echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
        echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
      
        echo "A:   " `cat "${CGPATH}/A/memory.current"`
        echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
        echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
        echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
        echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
      
        rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
        rmdir "${CGPATH}/A"
        rm /file1 /file2 /file3 /file4
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: rename page_counter's count/limit into usage/max · bbec2e15
      Roman Gushchin authored
      This patch renames struct page_counter fields:
        count -> usage
        limit -> max
      
      and the corresponding functions:
        page_counter_limit() -> page_counter_set_max()
        mem_cgroup_get_limit() -> mem_cgroup_get_max()
        mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
        memcg_update_kmem_limit() -> memcg_update_kmem_max()
        memcg_update_tcp_limit() -> memcg_update_tcp_max()
      
      The idea behind this renaming is to have a direct match between the
      memory cgroup knobs (low, high, max) and the page_counter API.

      This is a pure rename; the patch doesn't bring any functional change.
      
      Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Nov 02, 2017
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boilerplate text.
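
      For a C source file, the identifier is a single comment at the very top
      of the file, replacing the boilerplate, e.g.:

        // SPDX-License-Identifier: GPL-2.0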
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source.
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note", otherwise it was "GPL-2.0".  The results were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based in part on an older version of FOSSology, so
      the two are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifiers in the
      files he inspected.  For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to each file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types).  Finally Greg ran the script using the .csv files to
      generate the patches.
      
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Dec 11, 2014
      mm: memcontrol: remove obsolete kmemcg pinning tricks · 64f21993
      Johannes Weiner authored
      
      As charges now pin the css explicitly, there is no more need for kmemcg
      to acquire a proxy reference for outstanding pages during offlining, or to
      maintain state to identify such "dead" groups.
      
      This was the last user of the uncharge functions' return values, so remove
      them as well.
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_memparse(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable; a sketch of the
        speculative pattern follows below.
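
      A hedged sketch of the speculative pattern, simplified from
      page_counter_try_charge() (unwinding of already-charged ancestors elided):

        for (c = counter; c; c = c->parent) {
                long new = atomic_long_add_return(nr_pages, &c->count);
                if (new > c->limit) {
                        /* roll back this level; levels charged so far are
                         * cancelled on the failure path */
                        atomic_long_sub(nr_pages, &c->count);
                        return false;
                }
        }
        return true;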
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>