    mm/page_alloc: use ac->high_zoneidx for classzone_idx · 3334a45e
    Joonsoo Kim authored
    Patch series "integrate classzone_idx and high_zoneidx", v5.
    
    This patchset is a follow-up to the problem reported and discussed two
    years ago [1, 2].  The problem this patchset solves is related to the
    classzone_idx on NUMA systems.  It arises when lowmem reserve
    protection exists for some zones on a node that do not exist on
    other nodes.
    
    This problem was reported two years ago and, at that time, the solution
    received general agreement [2].  But it was not upstreamed.
    
    [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
    [2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com
    
    This patch (of 2):
    
    Currently, we use classzone_idx to calculate lowmem reserve protection
    for an allocation request.  This classzone_idx causes a problem on NUMA
    systems when the lowmem reserve protection exists for some zones on a node
    that do not exist on other nodes.
    
    Before further explanation, I should first clarify how to compute the
    classzone_idx and the high_zoneidx.
    
    - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
      represents the index of the highest zone the allocation can use
    
    - classzone_idx was supposed to be the index of the highest zone on the
      local node that the allocation can use, that is actually available in
      the system
    
    Consider the following example.  Node 0 has 4 populated zones,
    DMA/DMA32/NORMAL/MOVABLE.  Node 1 has 1 populated zone, NORMAL.  Some
    zones, such as MOVABLE, don't exist on node 1, and this makes the
    following difference.
    
    Assume that there is an allocation request whose gfp_zone(gfp_mask) is
    the MOVABLE zone.  Then, its high_zoneidx is 3.  If this allocation is
    initiated on node 0, its classzone_idx is 3 since the highest
    available/usable zone on the local node (node 0) is MOVABLE.  If this
    allocation is initiated on node 1, its classzone_idx is 2 since the
    highest available/usable zone on the local node (node 1) is NORMAL.
    
    You can see that the classzone_idx of the two allocation requests differs
    according to their starting node, even though their high_zoneidx is the
    same.
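
    The difference can be sketched with a small userspace model.  This is
    only an illustration, not the kernel code: the populated[] table and
    the helpers old_classzone_idx() and new_classzone_idx() are
    hypothetical names, and the old behaviour is approximated as "highest
    populated zone on the local node at or below high_zoneidx".

```c
#include <assert.h>
#include <stdbool.h>

/* Zone indices as used in the example: DMA=0, DMA32=1, NORMAL=2, MOVABLE=3. */
enum zone_idx { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

#define NR_NODES 2

/* Hypothetical population map taken from the two-node example. */
static const bool populated[NR_NODES][NR_ZONES] = {
	{ true,  true,  true, true  },	/* node 0: DMA/DMA32/NORMAL/MOVABLE */
	{ false, false, true, false },	/* node 1: NORMAL only */
};

/*
 * Simplified model of the old classzone_idx: the highest populated zone
 * on the local node that is not above high_zoneidx.  Returns -1 if the
 * node has no such zone (the real allocator would fall back to another
 * node; that case is not modelled here).
 */
static int old_classzone_idx(int node, int high_zoneidx)
{
	assert(node >= 0 && node < NR_NODES);
	assert(high_zoneidx >= 0 && high_zoneidx < NR_ZONES);

	for (int idx = high_zoneidx; idx >= 0; idx--)
		if (populated[node][idx])
			return idx;
	return -1;
}

/* With this patch, classzone_idx no longer depends on the starting node. */
static int new_classzone_idx(int node, int high_zoneidx)
{
	(void)node;	/* intentionally unused */
	return high_zoneidx;
}
```

    In this model, a MOVABLE request yields an old classzone_idx of 3 on
    node 0 and 2 on node 1, while the new value is 3 on both nodes.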
    
    Consider these two allocation requests further.  If they are satisfied
    locally, there is no problem.  However, if an allocation initiated on
    node 1 is satisfied remotely (in this example, at the NORMAL zone on
    node 0) due to memory shortage, a problem occurs.  The different
    classzone_idx values lead to different lowmem reserves and thus
    different min watermarks.  See the following example.
    
    root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
    Node 0, zone      DMA
      per-node stats
    ...
      pages free     3965
            min      5
            low      8
            high     11
            spanned  4095
            present  3998
            managed  3977
            protection: (0, 2961, 4928, 5440)
    ...
    Node 0, zone    DMA32
      pages free     757955
            min      1129
            low      1887
            high     2645
            spanned  1044480
            present  782303
            managed  758116
            protection: (0, 0, 1967, 2479)
    ...
    Node 0, zone   Normal
      pages free     459806
            min      750
            low      1253
            high     1756
            spanned  524288
            present  524288
            managed  503620
            protection: (0, 0, 0, 4096)
    ...
    Node 0, zone  Movable
      pages free     130759
            min      195
            low      326
            high     457
            spanned  1966079
            present  131072
            managed  131072
            protection: (0, 0, 0, 0)
    ...
    Node 1, zone      DMA
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 0, 1006, 1006)
    Node 1, zone    DMA32
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 0, 1006, 1006)
    Node 1, zone   Normal
      per-node stats
    ...
      pages free     233277
            min      383
            low      640
            high     897
            spanned  262144
            present  262144
            managed  257744
            protection: (0, 0, 0, 0)
    ...
    Node 1, zone  Movable
      pages free     0
            min      0
            low      0
            high     0
            spanned  262144
            present  0
            managed  0
            protection: (0, 0, 0, 0)
    
    - static min watermark for the NORMAL zone on node 0 is 750.
    
    - lowmem reserve for the request with classzone idx 3 at the NORMAL on
      node 0 is 4096.
    
    - lowmem reserve for the request with classzone idx 2 at the NORMAL on
      node 0 is 0.
    
    So, overall min watermark is:
    allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
    allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
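
    The arithmetic above can be reproduced with a minimal sketch.  The
    struct zone_model type and the effective_min() helper are hypothetical
    names used only for illustration; the numbers are copied from the
    "Node 0, zone Normal" entry of the zoneinfo dump.

```c
#include <assert.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Hypothetical, stripped-down view of one zone for illustration. */
struct zone_model {
	long min_wmark;			/* "min" in /proc/zoneinfo */
	long lowmem_reserve[NR_ZONES];	/* "protection" in /proc/zoneinfo */
};

/* Node 0, zone Normal, from the dump above. */
static const struct zone_model node0_normal = {
	.min_wmark = 750,
	.lowmem_reserve = { 0, 0, 0, 4096 },
};

/*
 * Overall min watermark that a request must clear in this zone:
 * the static min watermark plus the lowmem reserve selected by the
 * request's classzone_idx.
 */
static long effective_min(const struct zone_model *zone, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < NR_ZONES);
	return zone->min_wmark + zone->lowmem_reserve[classzone_idx];
}
```

    Calling effective_min(&node0_normal, 3) gives 4846 and
    effective_min(&node0_normal, 2) gives 750, matching the two lines
    above.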
    
    An allocation initiated on node 1 takes precedence over an allocation
    initiated on node 0 because the min watermark of the former is lower
    than that of the latter.  So, an allocation initiated on node 1 could
    succeed on node 0 when an allocation initiated on node 0 could not, and
    this could cause too many numa_miss allocations.  Performance could then
    be degraded.
    
    Recently, there was a regression report about this problem on the CMA
    patches, since those patches place CMA memory in ZONE_MOVABLE.  I checked
    that the problem disappears with this fix, which uses high_zoneidx for
    classzone_idx.
    
    http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
    
    
    
    Using high_zoneidx for classzone_idx is more consistent than the previous
    approach because the system's memory layout does not affect it.  With
    this patch, classzone_idx is 3 for both allocations in the above example,
    so they have the same min watermark.
    
    allocation initiated on node 0: 750 + 4096 = 4846
    allocation initiated on node 1: 750 + 4096 = 4846
    
    One could wonder whether there is a side effect: an allocation initiated
    on node 1 might face a higher bar when it is handled locally, since its
    classzone_idx could be higher than before.  This will not happen because
    a zone without managed pages does not contribute to lowmem_reserve at
    all.
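
    This can be checked against the zoneinfo dump with a simplified model
    of how the protection array is derived.  The sketch below is not the
    kernel implementation: compute_lowmem_reserve() is a hypothetical
    helper, and the ratios { 256, 256, 32, 0 } are an assumption (they are
    not stated in this message, but they reproduce the protection values
    shown in the dump).

```c
#include <assert.h>

#define NR_ZONES 4	/* DMA, DMA32, NORMAL, MOVABLE */

/* Assumed per-zone reserve ratios; not taken from this commit message. */
static const long reserve_ratio[NR_ZONES] = { 256, 256, 32, 0 };

/*
 * Simplified model: the reserve that zone 'zone' keeps against a request
 * with classzone_idx j is the sum of the managed pages of the zones above
 * it (up to and including j) on the same node, divided by the zone's
 * ratio.  A zone with no managed pages adds nothing to that sum.
 */
static void compute_lowmem_reserve(const long managed[NR_ZONES], int zone,
				   long reserve[NR_ZONES])
{
	long pages_above = 0;

	assert(zone >= 0 && zone < NR_ZONES);

	for (int j = 0; j < NR_ZONES; j++) {
		if (j <= zone) {
			reserve[j] = 0;
			continue;
		}
		pages_above += managed[j];
		reserve[j] = reserve_ratio[zone] > 0 ?
			     pages_above / reserve_ratio[zone] : 0;
	}
}
```

    With the managed page counts of node 1 (0, 0, 257744, 0), the NORMAL
    zone's reserve for classzone_idx 3 is 0, so a local allocation on node
    1 sees the same min watermark (383) with classzone_idx 2 or 3.  With
    the counts of node 0 (3977, 758116, 503620, 131072), the same entry is
    131072 / 32 = 4096.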
    
    Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
    
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>