- Jun 05, 2023
-
Greg Kroah-Hartman authored
Link: https://lore.kernel.org/r/20230601131939.051934720@linuxfoundation.org
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230601143327.479886832@linuxfoundation.org
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Takeshi Ogasawara <takeshi.ogasawara@futuring-girl.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Ron Economos <re@w6rz.net>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Markus Reichelt <lkt+2023@mareichelt.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Yanteng Si authored
commit 5d1ac59f upstream.

Picking the changes from:

  91d0b78c ("inet: Add IP_LOCAL_PORT_RANGE socket option")

Silencing these perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: loongson-kernel@lists.loongnix.cn
Link: https://lore.kernel.org/r/23aabc69956ac94fbf388b05c8be08a64e8c7ccc.1683712945.git.siyanteng@loongson.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Paul Blakey authored
commit 9b7c68b3 upstream.

Currently, offloaded conntrack entries (flows) can only be deleted after they are removed from offload, which is either by timeout, tcp state change or tc ct rule deletion. This can cause issues for users wishing to manually delete or flush existing entries.

Support deletion of offloaded conntrack entries.

Example usage:

  # Delete all offloaded (and non offloaded) conntrack entries
  # whose source address is 1.2.3.4
  $ conntrack -D -s 1.2.3.4

  # Delete all entries
  $ conntrack -F

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Cc: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Gautham R. Shenoy authored
commit 4badf2eb upstream.

Schedutil normally calls the adjust_perf callback for drivers that provide it and that have the fast_switch_possible flag set. However, when frequency invariance is disabled, schedutil falls back to invoking fast_switch instead, so there is a chance of a kernel crash if that function pointer is not set. To protect against this scenario, add a fast_switch callback to the amd_pstate driver.

Fixes: 1d215f03 ("cpufreq: amd-pstate: Add fast switch function for AMD P-State")
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
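For illustration, a minimal sketch of how a cpufreq driver wires up such a callback; the function body and the "example" names below are placeholders, not the amd_pstate code:

  #include <linux/cpufreq.h>

  /* Placeholder implementation: program the hardware for target_freq and
   * return the frequency that was actually set. */
  static unsigned int example_fast_switch(struct cpufreq_policy *policy,
                                          unsigned int target_freq)
  {
          return target_freq;
  }

  static struct cpufreq_driver example_driver = {
          .name        = "example",
          /* Providing ->fast_switch gives schedutil a valid callback to
           * fall back to when it cannot use ->adjust_perf. */
          .fast_switch = example_fast_switch,
  };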
-
Wyes Karny authored
commit 3bf8c630 upstream.

The driver should update policy->cur after updating the frequency. Currently amd_pstate doesn't update policy->cur when `adjust_perf` is used, which causes /proc/cpuinfo to show a wrong cpu frequency. Fix this by updating policy->cur with the correct frequency value in the adjust_perf function callback.

- Before the fix (setting min freq to 1.5 GHz):

  [root@amd]# cat /proc/cpuinfo | grep "cpu MHz" | sort | uniq --count
    1 cpu MHz : 1777.016
    1 cpu MHz : 1797.160
    1 cpu MHz : 1797.270
  189 cpu MHz : 400.000

- After the fix (setting min freq to 1.5 GHz):

  [root@amd]# cat /proc/cpuinfo | grep "cpu MHz" | sort | uniq --count
    1 cpu MHz : 1753.353
    1 cpu MHz : 1756.838
    1 cpu MHz : 1776.466
    1 cpu MHz : 1776.873
    1 cpu MHz : 1777.308
    1 cpu MHz : 1779.900
  183 cpu MHz : 1805.231
    1 cpu MHz : 1956.815
    1 cpu MHz : 2246.203
    1 cpu MHz : 2259.984

Fixes: 1d215f03 ("cpufreq: amd-pstate: Add fast switch function for AMD P-State")
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
[ rjw: Subject edits ]
Cc: 5.17+ <stable@vger.kernel.org> # 5.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Anuj Gupta authored
commit 46930b7c upstream.

Commit 8af870aa ("block: enable bio caching use for passthru IO") introduced a bio cache for passthru IO. When nr_vecs is greater than BIO_INLINE_VECS, the bio and bvecs are allocated from the mempool (instead of the percpu cache) and REQ_ALLOC_CACHE is cleared. This has the side effect of not freeing the bio/bvecs back into the mempool on completion.

This patch lets passthru IO fall back to allocation using bio_kmalloc when nr_vecs is greater than BIO_INLINE_VECS. The corresponding bio is freed during the call to blk_mq_map_bio_put on completion.

Cc: stable@vger.kernel.org # 6.1
Fixes: 8af870aa ("block: enable bio caching use for passthru IO")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230523111709.145676-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ido Schimmel authored
This reverts commit a71f3880.

Commit a71f3880 ("thermal/drivers/mellanox: Use generic thermal_zone_get_trip() function") was backported as a dependency of the fix in upstream commit 6d206b1e ("mlxsw: core_thermal: Fix fan speed in maximum cooling state"). However, it depends on changes in the thermal core that were only merged in v6.3. Without them, the mlxsw driver is unable to register its thermal zone:

  mlxsw_spectrum 0000:03:00.0: Failed to register thermal zone
  mlxsw_spectrum 0000:03:00.0: cannot register bus device
  mlxsw_spectrum: probe of 0000:03:00.0 failed with error -22

Fix this by reverting the commit and instead resolving the small conflict with the above mentioned fix.

Tested using the test case mentioned in the change log of the fix:

  # cat /sys/class/thermal/thermal_zone2/cdev0/type
  mlxsw_fan
  # echo 10 > /sys/class/thermal/thermal_zone2/cdev0/cur_state
  # cat /sys/class/hwmon/hwmon1/name
  mlxsw
  # cat /sys/class/hwmon/hwmon1/pwm1
  255

After setting the fan to its maximum cooling state (10), it operates at 100% duty cycle instead of being stuck at 0 RPM.

Fixes: a71f3880 ("thermal/drivers/mellanox: Use generic thermal_zone_get_trip() function")
Reported-by: Joe Botha <joe@atomic.ac>
Tested-by: Joe Botha <joe@atomic.ac>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ruihan Li authored
commit 000c2fa2 upstream.

Previously, channel open messages were always sent to monitors on the first ioctl() call for unbound HCI sockets, even if the command and arguments were completely invalid. This can leave an exploitable hole with the abuse of invalid ioctl calls.

This commit hardens the ioctl processing logic by first checking if the command is valid, and immediately returning with an ENOIOCTLCMD error code if it is not. This ensures that ioctl calls with invalid commands are free of side effects, and increases the difficulty of further exploitation by forcing exploitation to find a way to pass a valid command first.

Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Co-developed-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Dragos-Marian Panait <dragos.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
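As a generic illustration of this validate-first pattern (not the Bluetooth source; the command names and helper are hypothetical):

  static long example_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
  {
          switch (cmd) {
          case EXAMPLE_CMD_GET:           /* hypothetical valid commands */
          case EXAMPLE_CMD_SET:
                  break;
          default:
                  /* Unknown command: bail out before any side effects,
                   * e.g. before notifying monitors. */
                  return -ENOIOCTLCMD;
          }
          return example_do_ioctl(file, cmd, arg);  /* hypothetical helper */
  }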
-
Mario Limonciello authored
commit ca475186 upstream.

APUs before Raven didn't support s0ix. As we just relaxed some of the s0ix safety checks to improve power consumption on APUs that support it but are missing BIOS support, a new blind spot was introduced in which a user could "try" to run s0ix. Plug this hole so that if users try to run s0ix on anything older than Raven it will just skip suspend of the GPU.

Fixes: cf488dcd ("drm/amd: Allow s0ix without BIOS support")
Suggested-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Hariprasad Kelam authored
[ Upstream commit cb5edce2 ]

Upon physical link change, firmware reports to the kernel about the change along with details like speed, lmac_type_id, etc. The kernel derives lmac_type based on the lmac_type_id received from firmware. In a few scenarios firmware returns an invalid lmac_type_id, which results in the kernel panic below. This patch adds the missing validation of the lmac_type_id field.

  Internal error: Oops: 96000005 [#1] PREEMPT SMP
  [ 35.321595] Modules linked in:
  [ 35.328982] CPU: 0 PID: 31 Comm: kworker/0:1 Not tainted 5.4.210-g2e3169d8e1bc-dirty #17
  [ 35.337014] Hardware name: Marvell CN103XX board (DT)
  [ 35.344297] Workqueue: events work_for_cpu_fn
  [ 35.352730] pstate: 40400089 (nZcv daIf +PAN -UAO)
  [ 35.360267] pc : strncpy+0x10/0x30
  [ 35.366595] lr : cgx_link_change_handler+0x90/0x180

Fixes: 61071a87 ("octeontx2-af: Forward CGX link notifications to PFs")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Sai Krishna <saikrishnag@marvell.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Zhu Yanjun authored
[ Upstream commit b2b1ddc4 ]

In the function rxe_create_qp(), rxe_qp_from_init() is called to initialize the qp; internally, things like rxe_init_task are not set up until rxe_qp_init_req(). If an error occurs before this point then the unwind will call rxe_cleanup() and eventually rxe_qp_do_cleanup()/rxe_cleanup_task(), which will oops when trying to access the uninitialized spinlock. If rxe_init_task was not executed, rxe_cleanup_task will not be called.

Reported-by: <syzbot+cfcc1a3c85be15a40cba@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?id=fd85757b74b3eb59f904138486f755f71e090df8
Fixes: 8700e3e7 ("Soft RoCE driver")
Fixes: 2d4b21e0 ("IB/rxe: Prevent from completer to operate on non valid QP")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20230413101115.1366068-1-yanjun.zhu@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Johannes Berg authored
[ Upstream commit 457d7fb0 ]

If we do get multiple notifications from firmware, then we might have allocated 'notif', but don't free it. Fix that by checking for duplicates before allocation.

Fixes: 4da46a06 ("wifi: iwlwifi: mvm: Add support for wowlan info notification")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230418122405.116758321cc4.I8bdbcbb38c89ac637eaa20dda58fa9165b25893a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Haim Dreyfuss authored
[ Upstream commit 905d50dd ]

As part of version 2 we don't need to have wake_packet_bufsize and wake_packet_length. The first one is already calculated by the driver, the latter is sent as part of the wake packet notification.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20230413213309.3b53213b10d4.Ibf2f15aca614def2d262dd267d1aad65931b58f1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stable-dep-of: 457d7fb0 ("wifi: iwlwifi: mvm: fix potential memory leak")
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Eric Huang authored
[ Upstream commit d33fc8d0 ]

Use the primary channel index to determine which 5 MHz mask should be enabled. This mask is used to prevent noise from the channel edge from affecting the CCA threshold in wide bandwidths (>= 40 MHz).

Fixes: 1b00e923 ("rtw89: 8852c: add set channel of BB part")
Fixes: 6b069898 ("wifi: rtw89: 8852b: add chip_ops::set_channel")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Huang <echuang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230406072841.8308-1-pkshih@realtek.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
David Epping authored
[ Upstream commit 71460c9e ]

By default the VSC8501 and VSC8502 RGMII/GMII/MII RX_CLK output is disabled. To allow packet forwarding towards the MAC it needs to be enabled. For other PHYs supported by this driver the clock output is enabled by default.

Fixes: d3169863 ("net: phy: mscc: add support for VSC8502")
Signed-off-by: David Epping <david.epping@missinglinkelectronics.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Yunsheng Lin authored
[ Upstream commit 368d3cb4 ]

page_pool_ring_[un]lock() use in_softirq() to decide which spin lock variant to use. When they are called in a context where in_softirq() is false, spin_lock_bh() is called in page_pool_ring_lock() while spin_unlock() is called in page_pool_ring_unlock(); because spin_lock_bh() disabled softirqs in page_pool_ring_lock(), this leads to an inconsistent spin lock pair.

This patch fixes it by returning the in_softirq state from page_pool_producer_lock() and using it to decide which spin lock variant to use in page_pool_producer_unlock().

As pool->ring has both producer and consumer locks, rename the helpers to page_pool_producer_[un]lock() to reflect the actual usage. Also move them to page_pool.c as they are only used there, and remove the 'inline' as the compiler may have a better idea about whether to inline or not.

Fixes: 78862447 ("net: page_pool: Add bulk support for ptr_ring")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
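A condensed sketch of the locking pattern the fix describes, built on the ptr_ring producer lock; the helper names are illustrative rather than the exact page_pool functions:

  #include <linux/interrupt.h>
  #include <linux/ptr_ring.h>
  #include <linux/spinlock.h>

  /* Report whether we were already in softirq context so the caller can
   * hand the answer back to the unlock helper, keeping lock/unlock
   * variants consistent. */
  static bool example_producer_lock(struct ptr_ring *ring)
  {
          bool in_sirq = in_softirq();

          if (in_sirq)
                  spin_lock(&ring->producer_lock);
          else
                  spin_lock_bh(&ring->producer_lock);
          return in_sirq;
  }

  static void example_producer_unlock(struct ptr_ring *ring, bool in_sirq)
  {
          if (in_sirq)
                  spin_unlock(&ring->producer_lock);
          else
                  spin_unlock_bh(&ring->producer_lock);
  }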
-
Qingfang DENG authored
[ Upstream commit 542bcea4 ]

We use BH context only for synchronization, so we don't care if it's actually serving softirq or not. As a side note, in case of threaded NAPI, in_serving_softirq() will return false because it's in process context with BH off, making page_pool_recycle_in_cache() unreachable.

Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
Tested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 368d3cb4 ("page_pool: fix inconsistency for page_pool_ring_[un]lock()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Yan Zhao authored
[ Upstream commit 4752354a ]

Check that the physical PFN is valid before converting it to a struct page pointer to be returned to the caller of vfio_pin_pages().

vfio_pin_pages() pins user pages with contiguous IOVA. If the IOVA of a user page to be pinned belongs to a vma with VM_PFNMAP set, pin_user_pages_remote() will return -EFAULT without returning a struct page address for this PFN. This is because usually this kind of PFN (e.g. an MMIO PFN) has no valid struct page associated. Upon this error, vaddr_get_pfns() will obtain the physical PFN directly.

While previously vfio_pin_pages() returned PFN arrays directly to the caller, after commit 34a255e6 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()") PFNs are converted to "struct page *" unconditionally, and therefore the returned "struct page *" array may contain invalid struct page addresses.

Given that current in-tree users of vfio_pin_pages() only expect "struct page *" to be returned, check PFN validity and return -EINVAL to let the caller be aware of IOVAs to be pinned containing PFNs that cannot be returned in the "struct page *" array. That way, the caller will not consume the returned pointer (e.g. test PageReserved()) and errors like "supervisor read access in kernel mode" are avoided.

Fixes: 34a255e6 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()")
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230519065843.10653-1-yan.y.zhao@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
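A minimal sketch of the validity check being described (illustrative, not the vfio source):

  #include <linux/mm.h>

  /* Only hand back a struct page pointer when the PFN actually has one;
   * MMIO PFNs from VM_PFNMAP vmas do not, so report -EINVAL instead of
   * returning a bogus pointer the caller might dereference. */
  static int example_pfn_to_page(unsigned long pfn, struct page **page)
  {
          if (!pfn_valid(pfn))
                  return -EINVAL;
          *page = pfn_to_page(pfn);
          return 0;
  }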
-
Tian Lan authored
[ Upstream commit 3e94d54e ]

If multiple CPUs are sharing the same hardware queue, it can cause a leak in the active queue counter tracking when __blk_mq_tag_busy() is executed simultaneously.

Fixes: ee78ec10 ("blk-mq: blk_mq_tag_busy is no need to return a value")
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
John Fastabend authored
[ Upstream commit e5c6de5f ]

The read_skb() logic increments tcp->copied_seq, which is used, among other things, to calculate how many outstanding bytes can be read by the application. This results in application errors: if the application does an ioctl(FIONREAD) we return zero, because that value is calculated from copied_seq.

To fix this, move the tcp->copied_seq accounting into the recv handler so that it is updated when the recvmsg() hook is called and data is in fact copied into user buffers. This gives an accurate FIONREAD value as expected and improves ACK handling. Before, we were calling tcp_rcv_space_adjust(), which would update 'number of bytes copied to user in last RTT', which is wrong for programs returning SK_PASS. The bytes are only copied to the user when recvmsg is handled.

Doing the fix for recvmsg is straightforward, but fixing redirect and SK_DROP pkts is a bit trickier. Build a tcp_psock_eat() helper and call it from the skmsg handlers. This fixes another issue where a broken socket with a BPF program doing a resubmit could hang the receiver. This happened because although read_skb() consumed the skb through sock_drop(), it did not update copied_seq. Now if a single recv socket is redirecting to many sockets (for example for lb) the receiver sk will be hung even though we might expect it to continue. The hang comes from not updating the copied_seq numbers and the memory pressure resulting from that.

We have a slight layering problem of calling tcp_eat_skb even if it's not a TCP socket. To fix, we could refactor and create per-type receiver handlers. I decided this is more work than we want in the fix, and we already have some small tweaks depending on the caller that use the helper skb_bpf_strparser(). So we extend that a bit and always set the strparser bit when it is in use, and then we can gate the seq_copied updates on this.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
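For reference, a small standalone userspace demo of the ioctl(FIONREAD) query mentioned above; it uses a plain socketpair rather than a sockmap-managed TCP socket, so it only illustrates the API that applications rely on:

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>

  int main(void)
  {
          int sv[2], avail = 0;
          const char msg[] = "hello";

          if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
                  return 1;
          send(sv[0], msg, sizeof(msg), 0);
          /* FIONREAD reports how many bytes are queued for reading. */
          ioctl(sv[1], FIONREAD, &avail);
          printf("FIONREAD: %d bytes readable\n", avail);
          return 0;
  }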
-
John Fastabend authored
[ Upstream commit 6df7f764 ]

When the TCP stack has data ready to read, sk_data_ready() is called. Sockmap overwrites this with its own handler to call into the BPF verdict program. But the original TCP socket had sock_def_readable, which would additionally wake up any user space waiters with sk_wake_async().

Sockmap saved the callback when the socket was created, so call the saved data ready callback and then we can wake up any epoll() logic waiting on the read. Note we call on 'copied >= 0' to account for returning 0 when a FIN is received, because we need to wake up the user for this as well so they can do the recvmsg() -> 0 and detect the shutdown.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
John Fastabend authored
[ Upstream commit ea444185 ]

A common mechanism to put a TCP socket into the sockmap is to hook the BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program that can map the socket info to the correct BPF verdict parser. When the user adds the socket to the map, the psock is created and the new ops are assigned to ensure the verdict program will 'see' the sk_buffs as they arrive.

Part of this process hooks the sk_data_ready op with a BPF specific handler to wake up the BPF verdict program when data is ready to read. The logic is simple enough (posted here for easy reading):

  static void sk_psock_verdict_data_ready(struct sock *sk)
  {
          struct socket *sock = sk->sk_socket;

          if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
                  return;
          sock->ops->read_skb(sk, sk_psock_verdict_recv);
  }

The oversight here is that sk->sk_socket is not assigned until the application accept()s the new socket. However, it is entirely OK for the peer application to do a connect() followed immediately by sends. The socket on the receiver sits on the backlog queue of the listening socket until it is accepted and the data is queued up. If the peer never accepts the socket or is slow, it will eventually hit data limits and rate limit the session. But, importantly for BPF sockmap hooks, when this data is received the TCP stack does the sk_data_ready() call, yet the read_skb() for this data is never called because sk_socket is missing. The data sits on the sk_receive_queue.

Then, once the socket is accepted, if we never receive more data from the peer there will be no further sk_data_ready calls and all the data is still on the sk_receive_queue(). Then the user calls recvmsg after accept(), and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler. The handler checks for data in the sk_msg ingress queue, expecting that the BPF program has already run from the sk_data_ready hook and enqueued the data as needed. So we are stuck.

To fix, do an unlikely check in the recvmsg handler for data on the sk_receive_queue and, if it exists, wake up data_ready. We have the sock locked in both read_skb and recvmsg, so we should avoid having multiple runners.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
John Fastabend authored
[ Upstream commit 901546fd ]

The sockmap code is returning EAGAIN after a FIN packet is received and no more data is on the receive queue. Correct behavior is to return 0 to the user, and the user can then close the socket. The EAGAIN causes many apps to retry, which masks the problem. Eventually the socket is evicted from the sockmap because it is released from sockmap sock free handling. The issue creates a delay and can cause some errors on the application side.

To fix this, check on the sk_msg_recvmsg side if the length is zero and the FIN flag is set, and in that case return zero. A selftest will be added to check this condition.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
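A short userspace sketch of the behaviour the fix restores: once the peer has sent a FIN and all data has been drained, recv() returns 0 and the application treats that as end-of-stream (the helper below is a placeholder, not from the patch):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void drain_and_close(int fd)
  {
          char buf[4096];
          ssize_t n;

          while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
                  ;                                   /* consume pending data */
          if (n == 0)
                  puts("peer closed the connection"); /* FIN seen, not an error */
          else
                  perror("recv");                     /* a real error */
          close(fd);
  }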
-
John Fastabend authored
[ Upstream commit 405df89d ]

We noticed some rare sk_buffs were stepping past the queue when the system was under memory pressure. The general theory is to skip enqueueing sk_buffs when it is not necessary, which is the normal case with a system that is properly provisioned for the task: no memory pressure and enough cpu assigned.

But if we can't allocate memory due to an ENOMEM error when enqueueing the sk_buff into the sockmap receive queue, we push it onto a delayed workqueue to retry later. When a new sk_buff is received we then check if that queue is empty. However, there is a problem with simply checking the queue length. When a sk_buff is being processed from the ingress queue but is not yet on the sockmap msg receive queue, it is possible to also receive a sk_buff through the normal path. It will check the ingress queue, which is zero, and then skip ahead of the pkt being processed. Previously we used the sock lock from both contexts, which made the problem harder to hit, but not impossible.

To fix, instead of popping the skb from the queue entirely, we peek the skb from the queue and do the copy there. This ensures checks of the queue length are non-zero while the skb is being processed. Then, finally, when the entire skb has been copied to the user space queue or another socket, we pop it off the queue. This way the queue length check allows bypassing the queue only after the list has been completely processed.

To reproduce the issue we run the NGINX compliance test with sockmap running and observe some flakes in our testing that we attributed to this issue.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
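A simplified sketch of the peek-then-unlink pattern described above (illustrative only; the copy helper is hypothetical and the caller is assumed to already hold the queue lock):

  #include <linux/skbuff.h>

  static int example_drain_one(struct sk_buff_head *queue)
  {
          struct sk_buff *skb;
          int err;

          skb = skb_peek(queue);          /* leave it linked while we work */
          if (!skb)
                  return 0;

          err = example_copy_out(skb);    /* hypothetical copy-out helper */
          if (err)
                  return err;             /* skb stays queued for a retry */

          __skb_unlink(skb, queue);       /* fully consumed: unlink and free */
          consume_skb(skb);
          return 0;
  }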
-
John Fastabend authored
[ Upstream commit bce22552 ]

Now that the backlog manages the reschedule() logic correctly, we can drop the partial fix that reschedules from the recvmsg hook.

Rescheduling on the recvmsg hook was added to address a corner case where we still had data in the backlog state but had nothing to kick it and reschedule the backlog worker to run and finish copying data out of the state. This had a couple of limitations: first, it required user space to kick it, introducing an unnecessary EBUSY and retry; second, it only handled the ingress case and egress redirects would still be hung.

With the correct fix, pushing the reschedule logic down to where the enomem error occurs, we can drop this fix.

Fixes: bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
John Fastabend authored
[ Upstream commit 29173d07 ]

Sk_buffs are fed into sockmap verdict programs either from a strparser (when the user might want to decide how framing of the skb is done by attaching another parser program) or directly through tcp_read_sock. The tcp_read_sock path is the preferred method for performance when the BPF logic is a stream parser.

The flow for Cilium's common use case with a stream parser is:

  tcp_read_sock()
    sk_psock_verdict_recv
      ret = bpf_prog_run_pin_on_cpu()
      sk_psock_verdict_apply(sock, skb, ret)
        // if system is under memory pressure or app is slow we may
        // need to queue skb. Do this queuing through ingress_skb and
        // then kick timer to wake up handler
        skb_queue_tail(ingress_skb, skb)
        schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the ingress_skb list that holds our sk_buffs that could not be handled, but should be OK to run at some later point. However, it is possible that the workqueue doing this work still hits an error when sending the skb. When this happens, the skbuff is requeued on a temporary 'state' struct kept with the workqueue. This is necessary because it is possible to partially send an skbuff before hitting an error, and we need to know how and where to restart when the workqueue runs next.

Now for the trouble: we don't re-kick the workqueue. This can cause a stall where the skbuff we just cached on the state variable might never be sent. This happens when it is the last packet in a flow and no further packets come along that would cause the system to kick the workqueue from that side.

To fix, we could do a simple schedule_work(), but while under memory pressure it makes sense to back off some instead of continuing to retry repeatedly. So instead, convert schedule_work to schedule_delayed_work and add backoff logic to reschedule from the backlog queue on errors. It is not obvious what a good backoff is, so use '1'.

To test, we observed some flakes while running the NGINX compliance test with sockmap; we attributed these failed tests to this bug and the subsequent issue.

From on-list discussion: commit bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock") was intended to address a similar race, but had a couple of cases it missed. Most obviously, it only accounted for receiving traffic on the local socket, so if redirecting into another socket we could still get an sk_buff stuck there. Next, it missed the case where copied=0 in the recv() handler, in which case we wouldn't kick the scheduler. Also, it is sub-optimal to require userspace to kick the internal mechanisms of sockmap to wake it up and copy data to user. It results in an extra syscall and requires the app to actually handle the EAGAIN correctly.

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
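A compact sketch of the reschedule-with-backoff idea (the names are illustrative; only the '1' jiffy delay mirrors the value chosen in the patch):

  #include <linux/kernel.h>
  #include <linux/workqueue.h>

  struct example_backlog {
          struct delayed_work work;
  };

  static void example_backlog_fn(struct work_struct *w)
  {
          struct example_backlog *b =
                  container_of(w, struct example_backlog, work.work);

          /* example_send_queued() is a hypothetical sender helper. */
          if (example_send_queued(b) < 0)
                  /* Transient failure: retry a little later instead of
                   * relying on new traffic to kick the worker again. */
                  schedule_delayed_work(&b->work, 1);
  }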
-
John Fastabend authored
[ Upstream commit 78fa0d61 ]

The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this:

  skb_linearize()
    __pskb_pull_tail()
      pskb_expand_head()
        BUG_ON(skb_shared(skb))

Because we incremented the users refcnt from sk_psock_verdict_recv() we hit the BUG_ON with refcnt > 1 and trip it.

To fix, let's simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from the read_skb handlers and assume the verdict recv does any required kfree.

Bug found while testing in our CI, which runs in VMs that hit memory constraints rather regularly. William tested the TCP read_skb handlers.

  [ 106.536188] ------------[ cut here ]------------
  [ 106.536197] kernel BUG at net/core/skbuff.c:1693!
  [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
  [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
  [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
  [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
  [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
  [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
  [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
  [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
  [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
  [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
  [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
  [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
  [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 106.542255] Call Trace:
  [ 106.542383] <IRQ>
  [ 106.542487] __pskb_pull_tail+0x4b/0x3e0
  [ 106.542681] skb_ensure_writable+0x85/0xa0
  [ 106.542882] sk_skb_pull_data+0x18/0x20
  [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
  [ 106.543536] ? migrate_disable+0x66/0x80
  [ 106.543871] sk_psock_verdict_recv+0xe2/0x310
  [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
  [ 106.544561] tcp_read_skb+0x7b/0x120
  [ 106.544740] tcp_data_queue+0x904/0xee0
  [ 106.544931] tcp_rcv_established+0x212/0x7c0
  [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
  [ 106.545326] tcp_v4_rcv+0xe70/0xf60
  [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
  [ 106.545744] ip_local_deliver_finish+0xa7/0x150

Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Henning Schild authored
[ Upstream commit 3002b864 ]

In fact the device with chip id 0xD283 is called NCT6126D, and that is the chip id the Nuvoton code was written for. Correct that name to avoid confusion, because an NCT6116D in fact exists as well but has another chip id, and is currently not supported.

A look at the spec also revealed that GPIO group7 in fact has 8 pins, so correct the pin count in that group as well.

Fixes: d0918a84 ("gpio-f7188x: Add GPIO support for Nuvoton NCT6116")
Reported-by: Xing Tong Wu <xingtong.wu@siemens.com>
Signed-off-by: Henning Schild <henning.schild@siemens.com>
Acked-by: Simon Guinot <simon.guinot@sequanux.org>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Shay Drory authored
[ Upstream commit 8c253dfc ]

devcom events are sent to all registered components. Following the cited patch, it is possible for two components, e.g. two eswitches, to send devcom events while both components are registered. This means the eswitch layer will do double un/pairing, which means double allocation and freeing of resources, even though only one un/pairing is needed. Flow example:

  cpu0                                      cpu1
  ----                                      ----
  mlx5_devlink_eswitch_mode_set(dev0)
    esw_offloads_devcom_init()
      mlx5_devcom_register_component(esw0)
                                            mlx5_devlink_eswitch_mode_set(dev1)
                                              esw_offloads_devcom_init()
                                                mlx5_devcom_register_component(esw1)
                                                mlx5_devcom_send_event()
  mlx5_devcom_send_event()

Hence, check whether the eswitches are already un/paired before freeing/allocating resources.

Fixes: 09b27846 ("net: devlink: enable parallel ops on netlink interface")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit eca9bfaf ]

When the receive buffer is small we try to copy out the data from TCP into a skb maintained by TLS to prevent the connection from stalling. Unfortunately, if a single record is made up of a mix of decrypted and non-decrypted skbs, combining them into a single skb leads to loss of decryption status, resulting in decryption errors or data corruption.

Similarly, when trying to use the TCP receive queue directly we need to make sure that all the skbs within the record have the same status. If we don't, the mixed status will be detected correctly but we'll CoW the anchor, again collapsing it into a single paged skb without decrypted status preserved. So the "fixup" code will not know which parts of the skb to re-encrypt.

Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit c1c607b1 ]

We'll need to copy input skbs individually in the next patch. Factor that code out (without assuming we're copying a full record).

Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: eca9bfaf ("tls: rx: strp: preserve decryption status of skbs when needed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit 14c4be92 ]

If a record is partially decrypted we'll have to CoW it, anyway, so go into copy mode and allocate a writable skb right away. This will make the subsequent fix simpler because we won't have to teach tls_strp_msg_make_copy() how to copy skbs while preserving decrypt status.

Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: eca9bfaf ("tls: rx: strp: preserve decryption status of skbs when needed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit 8b0c0dc9 ]

We call tls_rx_msg_size(skb) before doing skb->len += chunk. So the tls_rx_msg_size() code will see the old skb->len, most likely leading to an over-read.

Worst case we will over-read an entire record; the next iteration will try to trim the skb but may end up turning the frag len negative or discarding the subsequent record (since we already told TCP we've read it during the previous read but now we'll trim it out of the skb).

Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit 210620ae ]

alloc_skb_with_frags() fills in page frag sizes but does not set skb->len and skb->data_len. Set those correctly, otherwise device offload will most likely generate an empty skb and hit the BUG() at the end of __skb_nsg().

Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Kicinski authored
[ Upstream commit b3a03b54 ]

skb->len covers the entire skb, including the frag_list. In fact we're guaranteed that rxm->full_len <= skb->len, so since the change under Fixes we were not checking the decrypt status of any skb but the first.

Note that the skb_pagelen() added here may feel a bit costly, but it's removed by subsequent fixes, anyway.

Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 86b259f6 ("tls: rx: device: bound the frag walk")
Tested-by: Shai Amiram <samiram@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Mario Limonciello authored
[ Upstream commit b54147fa ]

After a suspend/resume cycle there is an error message and auto-mode or CnQF stops working:

  [ 5741.447511] amd-pmf AMDI0100:00: SMU cmd failed. err: 0xff
  [ 5741.447523] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_RESPONSE:ff
  [ 5741.447527] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_ARGUMENT:7
  [ 5741.447531] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_MESSAGE:16
  [ 5741.447540] amd-pmf AMDI0100:00: [AUTO_MODE] avg power: 0 mW mode: QUIET

This is because the DRAM address used for accessing the metrics table needs to be refreshed after a suspend/resume cycle. Add a resume callback to reset this again.

Fixes: 1a409b35 ("platform/x86/amd/pmf: Get performance metrics from PMFW")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20230513011408.958-1-mario.limonciello@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
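A minimal sketch of wiring a driver resume callback through dev_pm_ops (illustrative names only, not the amd-pmf source):

  #include <linux/pm.h>

  static int example_resume(struct device *dev)
  {
          /* Re-program whatever firmware state is lost across suspend,
           * e.g. the DRAM address used for the metrics table. */
          return 0;
  }

  /* No suspend handler needed here, so pass NULL for it. */
  static DEFINE_SIMPLE_DEV_PM_OPS(example_pm_ops, NULL, example_resume);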
-
Jeremy Sowden authored
[ Upstream commit 5f5486b6 ]

When building sign-file, the call to get the CFLAGS for libcrypto is missing white-space between `pkg-config` and `--cflags`:

  $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null)

Removing the redirection of stderr, we see:

  $ make -C tools/testing/selftests/bpf sign-file
  make: Entering directory '[...]/tools/testing/selftests/bpf'
  make: pkg-config--cflags: No such file or directory
    SIGN-FILE sign-file
  make: Leaving directory '[...]/tools/testing/selftests/bpf'

Add the missing space.

Fixes: fc975906 ("selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Link: https://lore.kernel.org/bpf/20230426215032.415792-1-jeremy@azazel.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Sudeep Holla authored
[ Upstream commit c6e04536 ]

Commit bb1be749 ("firmware: arm_ffa: Add v1.1 get_partition_info support") added support for discovering the UUIDs of the partitions, or just fetching the partition count using the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag.

However, the commit doesn't handle the fact that older firmware versions don't understand the flag, which must be MBZ for them, so the firmware returns an invalid parameter error. That results in the driver probe failing, which is incorrect.

Limit the usage of the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag to versions above v1.0 (i.e. v1.1 and onwards), which fixes the issue.

Fixes: bb1be749 ("firmware: arm_ffa: Add v1.1 get_partition_info support")
Reported-by: Jens Wiklander <jens.wiklander@linaro.org>
Reported-by: Marc Bonnici <marc.bonnici@arm.com>
Tested-by: Jens Wiklander <jens.wiklander@linaro.org>
Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
Link: https://lore.kernel.org/r/20230419-ffa_fixes_6-4-v2-2-d9108e43a176@arm.com
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
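The shape of the fix can be sketched as a simple version gate; the version macro below is an assumption, only the flag name comes from the patch text:

  static u32 example_partition_info_flags(u32 version)
  {
          /* v1.0 requires the flags argument to be zero (MBZ); only
           * v1.1 and later understand the "return count only" flag. */
          if (version > FFA_VERSION_1_0)   /* assumed version macro */
                  return PARTITION_INFO_GET_RETURN_COUNT_ONLY;
          return 0;
  }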
-
Nicolas Dichtel authored
[ Upstream commit 3632679d ]

With a raw socket bound to IPPROTO_RAW (i.e. with hdrincl enabled), the protocol field of the flow structure, built by raw_sendmsg() / rawv6_sendmsg(), is set to IPPROTO_RAW. This breaks the ipsec policy lookup when some policies are defined with a protocol in the selector.

For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to specify the protocol. Just accept all values for IPPROTO_RAW sockets.

For ipv4, the sin_port field of 'struct sockaddr_in' could not be used without breaking backward compatibility (the value of this field was never checked). Let's add a new kind of control message, so that userland can specify which protocol is used.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Jakub Sitnicki authored
[ Upstream commit 91d0b78c ]

Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP and the destination port.

An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides:

   a) Search for a free port has to be implemented in user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if a 4-tuple is busy can be either easy (TCP) or hard (UDP). In the TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that local port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket to the same 4-tuple when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack:

        system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        s.bind(("192.0.2.1", 0))
        s.connect(("1.1.1.1", 53))
        # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 3632679d ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
Signed-off-by: Sasha Levin <sashal@kernel.org>
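A rough C equivalent of the Python snippet above; the fallback define of 51 for IP_LOCAL_PORT_RANGE is an assumption for older userspace headers, and the running kernel must be new enough to support the option:

  #include <netinet/in.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/socket.h>

  #ifndef IP_LOCAL_PORT_RANGE
  #define IP_LOCAL_PORT_RANGE 51          /* assumed uapi value */
  #endif

  int main(void)
  {
          /* Two host-order u16 bounds packed into one u32: hi << 16 | lo. */
          uint32_t range = (40511u << 16) | 40000u;
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          if (fd < 0)
                  return 1;
          /* IPPROTO_IP is the SOL_IP level. */
          if (setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                         &range, sizeof(range)))
                  perror("setsockopt");   /* e.g. kernel without support */
          return 0;
  }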
-