Commit message | Author | Date | Files | Lines
* dlfcn: Implement the RTLD_DI_PHDR request type for dlinfo [gentoo/2.34] | Florian Weimer | 2022-05-14 | 5 files, -5/+159
  The information is theoretically available via dl_iterate_phdr as well, but that approach is very slow if there are many shared objects.

  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  Tested-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit d056c212130280c0a54d9a4f72170ec621b70ce5)
  (cherry picked from commit 91c2e6c3db44297bf4cb3a2e3c40236c5b6a0b23)
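  For illustration, a minimal sketch of the new request type (not taken from the commit; error handling mostly elided). dlinfo with RTLD_DI_PHDR stores a pointer to the object's program header array and returns the number of headers:

      #define _GNU_SOURCE
      #include <dlfcn.h>
      #include <link.h>
      #include <stdio.h>

      int
      main (void)
      {
        void *handle = dlopen ("libm.so.6", RTLD_NOW);
        if (handle == NULL)
          return 1;
        const ElfW(Phdr) *phdr;
        /* One direct query instead of a dl_iterate_phdr walk over every
           loaded object.  */
        int count = dlinfo (handle, RTLD_DI_PHDR, &phdr);
        for (int i = 0; i < count; ++i)
          printf ("phdr %d: type %u\n", i, (unsigned int) phdr[i].p_type);
        dlclose (handle);
        return 0;
      }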
* manual: Document the dlinfo function | Florian Weimer | 2022-05-14 | 5 files, -13/+103
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  Tested-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit 93804a1ee084d4bdc620b2b9f91615c7da0fabe1)

  Also includes a partial backport of commit 5d28a8962dcb6ec056b81d730e (the addition of manual/dynlink.texi).

  (cherry picked from commit e4a2fb76efb45210c541ee3f8ef32f317783c3a8)
* x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896] | Noah Goldstein | 2022-05-14 | 2 files, -1/+16
  The overflow case for __wcsncmp_avx2_rtm should fall back to __wcscmp_avx2_rtm, not __wcscmp_avx2.

  commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Sun Jan 9 16:02:21 2022 -0600

      x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

  set the wrong fallback function for `__wcsncmp_avx2_rtm`: it was set to fall back to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm`, which can cause spurious aborts. This change will need to be backported.

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 9fef7039a7d04947bc89296ee0d187bc8d89b772)
  (cherry picked from commit e123f08ad5ea4691bc37430ce536988c221332d6)
* x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895] | Noah Goldstein | 2022-05-14 | 3 files, -5/+17
  The logic can read before the start of `s1` / `s2` if both `s1` and `s2` are near the start of a page. To avoid having the result contaminated by these comparisons, the `strcmp` variants mask off such comparisons. This masking was missing in the `strncmp` variants, causing the bug. This commit adds the masking to `strncmp` so that out-of-range comparisons don't affect the result.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass, as well as a full xcheck on x86_64 linux.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit e108c02a5e23c8c88ce66d8705d4a24bb6b9a8bf)
  (cherry picked from commit 5373c90f2ea3c3fa9931a684c9b81c648dfbe8d7)
* x86: Set .text section in memset-vec-unaligned-erms | Noah Goldstein | 2022-05-14 | 1 file, -0/+1
  commit 3d9f171bfb5325bd5f427e9fc386453358c6e840
  Author: H.J. Lu <hjl.tools@gmail.com>
  Date:   Mon Feb 7 05:55:15 2022 -0800

      x86-64: Optimize bzero

  removed setting the .text section for the code. This commit adds that back.
  (cherry picked from commit 7912236f4a597deb092650ca79f33504ddb4af28)
  (cherry picked from commit 70509f9b4807295b2b4b43bffe110580fc0381ef)
* x86-64: Optimize bzero | H.J. Lu | 2022-05-14 | 10 files, -105/+382
  memset with zero as the value to set is by far the majority case (99%+ for Python3 and GCC). bzero can be slightly more optimized for this case by using a zero-idiom xor for broadcasting the set value to a register (vector or GPR).

  Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 3d9f171bfb5325bd5f427e9fc386453358c6e840)
  (cherry picked from commit 5cb6329652696e79d6d576165ea87e332c9de106)
* x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only) | Noah Goldstein | 2022-05-14 | 1 file, -3/+4
  commit b62ace2740a106222e124cc86956448fa07abf4d
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Sun Feb 6 00:54:18 2022 -0600

      x86: Improve vec generation in memset-vec-unaligned-erms.S

  Revert the usage of 'pshufb' in the broadcast logic, as it is an SSSE3 instruction and memset.S is restricted to only SSE2 instructions.
  (cherry picked from commit 1b0c60f95bbe2eded80b2bb5be75c0e45b11cde1)
  (cherry picked from commit 190ea5f7e4e7e98b9b6e3f29835ae8b1f6a5442e)
* x86: Improve vec generation in memset-vec-unaligned-erms.S | Noah Goldstein | 2022-05-14 | 5 files, -87/+152
  No bug. Split vec generation into multiple steps. This allows the broadcast in AVX2 to use 'xmm' registers for the L(less_vec) case. This saves an expensive lane-cross instruction and removes the need for 'vzeroupper'.

  For SSE2, replace 2x 'punpck' instructions with the zero-idiom 'pxor' for byte broadcast.

  Results for memset-avx2 small (geomean of N = 20 benchset runs):

      size, New Time, Old Time, New / Old
         0,    4.100,    3.831,     0.934
         1,    5.074,    4.399,     0.867
         2,    4.433,    4.411,     0.995
         4,    4.487,    4.415,     0.984
         8,    4.454,    4.396,     0.987
        16,    4.502,    4.443,     0.987

  All relevant string/wcsmbs tests are passing.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit b62ace2740a106222e124cc86956448fa07abf4d)
  (cherry picked from commit ea19c490a3f5628d55ded271cbb753e66b2f05e8)
* x86-64: Fix strcmp-evex.S | H.J. Lu | 2022-05-14 | 1 file, -1/+1
  Change "movl %edx, %rdx" to "movl %edx, %edx" in:

  commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Mon Jan 10 15:35:39 2022 -0600

      x86: Optimize strcmp-evex.S

  (cherry picked from commit 0e0199a9e02ebe42e2b36958964d63f03573c382)
  (cherry picked from commit 53ddafe917a8af17b16beb794c29e5b09b86d534)
* x86-64: Fix strcmp-avx2.S | H.J. Lu | 2022-05-14 | 1 file, -1/+1
  Change "movl %edx, %rdx" to "movl %edx, %edx" in:

  commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Mon Jan 10 15:35:38 2022 -0600

      x86: Optimize strcmp-avx2.S

  (cherry picked from commit c15efd011cea3d8f0494269eb539583215a1feed)
  (cherry picked from commit d299032743e05571ef326c838a5ecf6ef5b3e9c3)
* x86: Optimize strcmp-evex.S | Noah Goldstein | 2022-05-14 | 1 file, -793/+919
  Optimizations are primarily to the loop logic and to how the page-cross logic interacts with the loop. The page-cross logic is at times more expensive for short strings near the end of a page but not crossing the page. This is done to retest the page-cross conditions with a non-faulting check and to improve the logic for entering the loop afterwards. This affects only particular cases, however, and is generally made up for by more-than-10x improvements on the transition from the page-cross case to the loop case. The non-page-cross cases are nearly universally improved as well.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9)
  (cherry picked from commit c41a66767d23b7f219fb943be6fab5ddf822d7da)
* x86: Optimize strcmp-avx2.S | Noah Goldstein | 2022-05-14 | 1 file, -652/+940
  Optimizations are primarily to the loop logic and to how the page-cross logic interacts with the loop. The page-cross logic is at times more expensive for short strings near the end of a page but not crossing the page. This is done to retest the page-cross conditions with a non-faulting check and to improve the logic for entering the loop afterwards. This affects only particular cases, however, and is generally made up for by more-than-10x improvements on the transition from the page-cross case to the loop case.

  The non-page-cross cases are improved most for smaller sizes [0, 128] and are about even for (128, 4096]. The loop page-cross logic is improved, so some more significant speedup is seen there as well.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45)
  (cherry picked from commit 0d5b36c8cc15f064e302d29692853f8a760e1547)
* manual: Clarify that abbreviations of long options are allowed | Siddhesh Poyarekar | 2022-05-14 | 1 file, -1/+2
  The man page and code comments clearly state that abbreviations of long option names are recognized correctly as long as they are unique. Document this fact in the glibc manual as well.

  Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
  Reviewed-by: Florian Weimer <fweimer@redhat.com>
  Reviewed-by: Andreas Schwab <schwab@linux-m68k.org>
  (cherry picked from commit db1efe02c9f15affc3908d6ae73875b82898a489)
  (cherry picked from commit 31af92b9c8cf753992d45c801a855a02060afc08)
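  As an illustration of the documented behavior, a minimal sketch (the option names and program are made up for the example, not taken from the commit):

      #include <getopt.h>
      #include <stdio.h>

      int
      main (int argc, char **argv)
      {
        static const struct option opts[] =
          {
            { "verbose", no_argument, 0, 'v' },
            { "version", no_argument, 0, 'V' },
            { 0, 0, 0, 0 }
          };
        int c;
        while ((c = getopt_long (argc, argv, "vV", opts, NULL)) != -1)
          switch (c)
            {
            case 'v': puts ("verbose"); break;
            case 'V': puts ("version"); break;
            }
        return 0;
      }

  Here "--verb" is accepted as "--verbose" because no other long option starts with "verb", while "--ver" is rejected as ambiguous.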
* Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h | Joseph Myers | 2022-05-14 | 1 file, -0/+2
  Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.

  Tested with build-many-glibcs.py for aarch64-linux-gnu.
  (cherry picked from commit 866c599182e87f116440b5d854f9e99533c48eb3)
  (cherry picked from commit 97cb8227b864b8ea0d99a4a50e4163baad3e1c72)
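  A sketch of how such hwcap bits are typically tested from userspace; the fallback #define is an assumption (the Linux 5.17 uapi value, as I recall it) for building against older headers:

      #include <sys/auxv.h>
      #include <stdio.h>

      #ifndef HWCAP2_AFP
      # define HWCAP2_AFP (1UL << 20)   /* assumed Linux 5.17 value */
      #endif

      int
      main (void)
      {
        unsigned long hwcap2 = getauxval (AT_HWCAP2);
        /* FEAT_AFP: alternate floating-point behavior available?  */
        printf ("AFP: %s\n", (hwcap2 & HWCAP2_AFP) ? "yes" : "no");
        return 0;
      }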
* aarch64: Add HWCAP2_ECV from Linux 5.16 | Szabolcs Nagy | 2022-05-14 | 1 file, -0/+1
  Indicates the availability of the enhanced counter virtualization extension of armv8.6-a, with the self-synchronized virtual counter CNTVCTSS_EL0 usable in userspace.

  (cherry picked from commit 5a1be8ebdf6f02d4efec6e5f12ad06db17511f90)
  (cherry picked from commit c108e87026d61d6744e3e55704e0bea937243f5a)
* Add SOL_MPTCP, SOL_MCTP from Linux 5.16 to bits/socket.h | Joseph Myers | 2022-05-14 | 1 file, -0/+2
  Linux 5.16 adds the constants SOL_MPTCP and SOL_MCTP to the getsockopt / setsockopt levels; add these constants to bits/socket.h.

  Tested for x86_64.
  (cherry picked from commit fdc1ae67fef27eea1445bab4bdfe2f0fb3bc7aa1)
  (cherry picked from commit f858bc309315a03ff6b1a048f59405c159d23430)
* Update kernel version to 5.17 in tst-mman-consts.py | Joseph Myers | 2022-05-14 | 1 file, -1/+1
  This patch updates the kernel version in the test tst-mman-consts.py to 5.17. (There are no new MAP_* constants covered by this test in 5.17 that need any other header changes.)

  Tested with build-many-glibcs.py.
  (cherry picked from commit 23808a422e6036accaba7236fd3b9a0d7ab7e8ee)
  (cherry picked from commit 0499c3a95fb864284fef36d3e9c5a54f6646b2db)
* Update kernel version to 5.16 in tst-mman-consts.py | Joseph Myers | 2022-05-14 | 1 file, -1/+1
  This patch updates the kernel version in the test tst-mman-consts.py to 5.16. (There are no new MAP_* constants covered by this test in 5.16 that need any other header changes.)

  Tested with build-many-glibcs.py.
  (cherry picked from commit 790a607e234aa10d4b977a1b80aebe8a2acac970)
  (cherry picked from commit 81181ba5d916fc49bd737f603e28a3c2dc8430b4)
* Update syscall lists for Linux 5.17 | Joseph Myers | 2022-05-14 | 26 files, -2/+28
  Linux 5.17 has one new syscall, set_mempolicy_home_node. Update syscall-names.list and regenerate the arch-syscall.h headers with build-many-glibcs.py update-syscalls.

  Tested with build-many-glibcs.py.
  (cherry picked from commit 8ef9196b26793830515402ea95aca2629f7721ec)
  (cherry picked from commit 6af165658d0999ac2c4e9ce88bee020fbc2ee49f)
* Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h | Joseph Myers | 2022-05-14 | 1 file, -0/+2
  Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along with ARPHRD_CAN, which was added to Linux in version 2.6.25 (commit cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol numbers for PF_CAN") but apparently missed for glibc at the time.

  Tested for x86_64.
  (cherry picked from commit a94d9659cd69dbc70d3494b1cbbbb5a1551675c5)
  (cherry picked from commit 5146b73d72ced9bab125e986aa99ef5fe2f88475)
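  A hedged sketch of where such ARPHRD_* values surface in practice: the SIOCGIFHWADDR ioctl reports the interface hardware type in sa_family (the interface name "can0" is an assumption for the example):

      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>
      #include <net/if_arp.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int
      main (void)
      {
        int fd = socket (AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;
        memset (&ifr, 0, sizeof ifr);
        strncpy (ifr.ifr_name, "can0", IFNAMSIZ - 1);
        if (ioctl (fd, SIOCGIFHWADDR, &ifr) == 0
            && ifr.ifr_hwaddr.sa_family == ARPHRD_CAN)
          puts ("can0 is a CAN device");
        close (fd);
        return 0;
      }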
* Update kernel version to 5.15 in tst-mman-consts.py | Joseph Myers | 2022-05-14 | 1 file, -1/+1
  This patch updates the kernel version in the test tst-mman-consts.py to 5.15. (There are no new MAP_* constants covered by this test in 5.15 that need any other header changes.)

  Tested with build-many-glibcs.py.
  (cherry picked from commit 5c3ece451d46a7d8721311609bfcb6faafacb39e)
  (cherry picked from commit fd5dbfd1cd98cb2f12f9e9f7004a4d25ab0c977f)
* Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.h | Joseph Myers | 2022-05-14 | 1 file, -1/+3
  Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add these constants to bits/socket.h.

  Tested for x86_64.
  (cherry picked from commit bdeb7a8fa9989d18dab6310753d04d908125dc1d)
  (cherry picked from commit bc6fba3c8048b11c9f73db03339c97a2fec3f0cf)
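  A minimal sketch exercising the new constant; the fallback #define is an assumption (the Linux 5.15 uapi value, as I recall it), and real MCTP messaging additionally needs struct sockaddr_mctp from the kernel's <linux/mctp.h>, which glibc does not provide:

      #include <sys/socket.h>
      #include <stdio.h>

      #ifndef AF_MCTP
      # define AF_MCTP 45   /* assumed Linux 5.15 value */
      #endif

      int
      main (void)
      {
        int fd = socket (AF_MCTP, SOCK_DGRAM, 0);
        if (fd < 0)
          perror ("socket(AF_MCTP)");   /* EAFNOSUPPORT on older kernels */
        return 0;
      }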
* posix/glob.c: update from gnulib | DJ Delorie | 2022-05-14 | 2 files, -12/+59
  Copied from gnulib/lib/glob.c in order to fix rhbz 1982608. Also fixes swbz 25659.

  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  Tested-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit 7c477b57a31487eda516db02b9e04f22d1a6e6af)
  (cherry picked from commit c66c92181ddbd82306537a608e8c0282587131de)
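  For context, a minimal sketch of the interface this file implements (the pattern is made up; glob and globfree are the documented POSIX API):

      #include <glob.h>
      #include <stdio.h>

      int
      main (void)
      {
        glob_t g;
        if (glob ("/etc/*.conf", 0, NULL, &g) == 0)
          {
            for (size_t i = 0; i < g.gl_pathc; ++i)
              puts (g.gl_pathv[i]);
            globfree (&g);
          }
        return 0;
      }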
* linux: Fix fchmodat with AT_SYMLINK_NOFOLLOW for 64 bit time_t (BZ#29097) | Adhemerval Zanella | 2022-05-14 | 5 files, -6/+32
  The AT_SYMLINK_NOFOLLOW emulation uses the default 32-bit internal stat calls, which fail with EOVERFLOW if the file contains timestamps beyond 2038.

  Checked on i686-linux-gnu.
  (cherry picked from commit 118a2aee07f64d605b6668cbe195c1f44eac6be6)
  (cherry picked from commit 88a8637cb4658cd91a002659db05867716b88b36)
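  The affected call shape, as a minimal sketch (path and mode are made up; on Linux, glibc emulates the AT_SYMLINK_NOFOLLOW case, which is where the internal 32-bit stat could overflow):

      #include <fcntl.h>
      #include <sys/stat.h>
      #include <stdio.h>

      int
      main (void)
      {
        /* Change the mode of "file" without dereferencing it if it is a
           symbolic link.  */
        if (fchmodat (AT_FDCWD, "file", 0644, AT_SYMLINK_NOFOLLOW) != 0)
          perror ("fchmodat");
        return 0;
      }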
* i386: Regenerate ulps | Carlos O'Donell | 2022-05-14 | 2 files, -2/+2
  These failures were caught while building glibc master for Fedora Rawhide, which is built with '-mtune=generic -msse2 -mfpmath=sse' using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon processor.

  (cherry picked from commit e465d97653311c3687aee49de782177353acfe86)
  (cherry picked from commit 55640ed3fde48360a8e8083be4843bd2dc7cecfe)
* linux: Fix missing internal 64 bit time_t stat usage | Adhemerval Zanella | 2022-05-14 | 2 files, -4/+4
  These are two spots missed by the initial conversion in 52a5fe70a2c77935.

  Checked on i686-linux-gnu.
  (cherry picked from commit 834ddd0432f68d6dc85b6aac95065721af0d86e9)
  (cherry picked from commit 9681691402052b727e01ae3375c73e0f76566593)
* x86: Optimize L(less_vec) case in memcmp-evex-movbe.S | Noah Goldstein | 2022-05-14 | 1 file, -193/+56
  No bug. The optimizations are twofold:

  1) Replace the page-cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch misses in the hot [0, 32] case.
  2) Change control flow so that the L(less_vec) case gets the fall-through.

  Change 2) helps copies in the [0, 32] size range but comes at the cost of copies in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range, so this appears to be the right tradeoff.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit abddd61de090ae84e380aff68a98bd94ef704667)
  (cherry picked from commit c796418d00f65c8c5fbed477f3ba6da2bee64ece)
* x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI | H.J. Lu | 2022-05-14 | 1 file, -2/+5
  Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI, since they won't lower the CPU frequency when ZMM load and store instructions are used.

  (cherry picked from commit ceeffe968c01b1202e482f4855cb6baf5c6cb713)
  (cherry picked from commit f3a99b2216114f89b20329ae7664b764248b4bbd)
* x86-64: Use notl in EVEX strcmp [BZ #28646] | Noah Goldstein | 2022-05-14 | 2 files, -6/+36
  Must use notl %edi here, as the lower bits are for CHAR comparisons potentially out of range and thus can be 0 without indicating a mismatch. This fixes BZ #28646.

  Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 4df1fa6ddc8925a75f3da644d5da3bb16eb33f02)
  (cherry picked from commit 4bbd0f866ad0ff197f72346f776ebee9b7e1a706)
* x86: Shrink memcmp-sse4.S code size | Noah Goldstein | 2022-05-14 | 1 file, -1621/+646
  No bug. This implementation refactors memcmp-sse4.S primarily with minimizing code size in mind. It does this by removing the lookup-table logic and removing the unrolled check for (256, 512] bytes.

      memcmp-sse4 code size reduction : -3487 bytes
      wmemcmp-sse4 code size reduction: -1472 bytes

  The current memcmp-sse4.S implementation has a large code-size cost. This has serious adverse effects on the ICache / ITLB. While in micro-benchmarks the implementation appears fast, traces of real-world code have shown that the speed in micro-benchmarks does not translate when the ICache/ITLB are not primed, and that the cost of the code size has measurable negative effects on overall application performance. See https://research.google/pubs/pub48320/ for more details.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 2f9062d7171850451e6044ef78d91ff8c017b9c0)
  (cherry picked from commit 7cb126e7e7febf9dc3e369cc3e4885e34fb9433b)
* x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h | Noah Goldstein | 2022-05-14 | 2 files, -14/+20
  No bug. This patch doubles the rep_movsb_threshold when using ERMS. Based on benchmarks, the vector copy loop, especially now that it handles 4k aliasing, is better for these medium-sized copies.

  On Skylake with ERMS:

      Size, Align1, Align2, dst>src, (rep movsb) / (vec copy)
      4096,      0,      0,       0,  0.975
      4096,      0,      0,       1,  0.953
      4096,     12,      0,       0,  0.969
      4096,     12,      0,       1,  0.872
      4096,     44,      0,       0,  0.979
      4096,     44,      0,       1,  0.83
      4096,      0,     12,       0,  1.006
      4096,      0,     12,       1,  0.989
      4096,      0,     44,       0,  0.739
      4096,      0,     44,       1,  0.942
      4096,     12,     12,       0,  1.009
      4096,     12,     12,       1,  0.973
      4096,     44,     44,       0,  0.791
      4096,     44,     44,       1,  0.961
      4096,   2048,      0,       0,  0.978
      4096,   2048,      0,       1,  0.951
      4096,   2060,      0,       0,  0.986
      4096,   2060,      0,       1,  0.963
      4096,   2048,     12,       0,  0.971
      4096,   2048,     12,       1,  0.941
      4096,   2060,     12,       0,  0.977
      4096,   2060,     12,       1,  0.949
      8192,      0,      0,       0,  0.85
      8192,      0,      0,       1,  0.845
      8192,     13,      0,       0,  0.937
      8192,     13,      0,       1,  0.939
      8192,     45,      0,       0,  0.932
      8192,     45,      0,       1,  0.927
      8192,      0,     13,       0,  0.621
      8192,      0,     13,       1,  0.62
      8192,      0,     45,       0,  0.53
      8192,      0,     45,       1,  0.516
      8192,     13,     13,       0,  0.664
      8192,     13,     13,       1,  0.659
      8192,     45,     45,       0,  0.593
      8192,     45,     45,       1,  0.575
      8192,   2048,      0,       0,  0.854
      8192,   2048,      0,       1,  0.834
      8192,   2061,      0,       0,  0.863
      8192,   2061,      0,       1,  0.857
      8192,   2048,     13,       0,  0.63
      8192,   2048,     13,       1,  0.629
      8192,   2061,     13,       0,  0.627
      8192,   2061,     13,       1,  0.62

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 475b63702ef38b69558fc3d31a0b66776a70f1d3)
  (cherry picked from commit cecbac52123456e2fbcff062a4165bf7b9174797)
* x86: Optimize memmove-vec-unaligned-erms.S | Noah Goldstein | 2022-05-14 | 6 files, -224/+381
  No bug. The optimizations are as follows:

  1) Always align entry to 64 bytes. This makes behavior more predictable and makes other frontend optimizations easier.
  2) Make the L(more_8x_vec) cases 4k-aliasing aware. This can have significant benefits in the case that: 0 < (dst - src) < [256, 512].
  3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%] improvement, and for FSRM [-10%, 25%].

  In addition to these primary changes, there is general cleanup throughout to optimize the aligning routines and control-flow logic.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit a6b7502ec0c2da89a7437f43171f160d713e39c6)
  (cherry picked from commit a7392db2ff2b9dd906500941ac6361dbe2211b0d)
* x86-64: Replace movzx with movzbl | Fangrui Song | 2022-05-14 | 2 files, -4/+4
  Clang cannot assemble movzx in the AT&T dialect mode:

      ../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
       movzx (%rsi), %ecx
             ^~~~

  Change movzx to movzbl, which follows the AT&T dialect and is used elsewhere in the file.

  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 6720d36b6623c5e48c070d86acf61198b33e144e)
  (cherry picked from commit 2e64237a8744dd50f9222293275fa52e7248ff76)
* x86-64: Remove Prefer_AVX2_STRCMP | H.J. Lu | 2022-05-14 | 5 files, -15/+2
  Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD and 1 TESTL, while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1 VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.

  (cherry picked from commit 14dbbf46a007ae5df36646b51ad0c9e5f5259f30)
  (cherry picked from commit a182bb7a3922404f79def09d79ef89678b4049f0)
* x86-64: Improve EVEX strcmp with masked load | H.J. Lu | 2022-05-14 | 1 file, -218/+243
  In strcmp-evex.S, to compare 2 32-byte strings, replace

      VMOVU   (%rdi, %rdx), %YMM0
      VMOVU   (%rsi, %rdx), %YMM1
      /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
      VPCMP   $4, %YMM0, %YMM1, %k0
      VPCMP   $0, %YMMZERO, %YMM0, %k1
      VPCMP   $0, %YMMZERO, %YMM1, %k2
      /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
      kord    %k1, %k2, %k1
      /* Each bit in K1 represents a NULL or a mismatch.  */
      kord    %k0, %k1, %k1
      kmovd   %k1, %ecx
      testl   %ecx, %ecx
      jne     L(last_vector)

  with

      VMOVU   (%rdi, %rdx), %YMM0
      VPTESTM %YMM0, %YMM0, %k2
      /* Each bit cleared in K1 represents a mismatch or a null CHAR
         in YMM0 and 32 bytes at (%rsi, %rdx).  */
      VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
      kmovd   %k1, %ecx
      incl    %ecx
      jne     L(last_vector)

  It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.

  Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55)
  (cherry picked from commit f35ad30da4880a1574996df0674986ecf82fa7ae)
* x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S | Noah Goldstein | 2022-05-14 | 1 file, -2/+2
  This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'. It could potentially be dangerous to use SSE2 if this function is ever called without using 'vzeroupper' beforehand. While compilers appear to use 'vzeroupper' before function calls if AVX2 has been used, using SSE2 here is more brittle. Since it is not absolutely necessary, it should be avoided.

  It costs 2 extra bytes, but the extra bytes should only eat into alignment padding.

  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit bad852b61b79503fcb3c5fc379c70f768df3e1fb)
  (cherry picked from commit baf3ece63453adac59c5688930324a78ced5b2e4)
* x86: Optimize memset-vec-unaligned-erms.S | Noah Goldstein | 2022-05-14 | 5 files, -95/+232
  No bug. The optimizations are:

  1. Change control flow for L(more_2x_vec) to fall through to the loop, and jump for L(less_4x_vec) and L(less_8x_vec). This uses less code size and saves jumps for length > 4x VEC_SIZE.
  2. For EVEX/AVX512, move L(less_vec) closer to the entry.
  3. Avoid complex address modes for length > 2x VEC_SIZE.
  4. Slightly better aligning code for the loop from the perspective of code size and uops.
  5. Align targets so they make full use of their fetch block and, if possible, cache line.
  6. Try to reduce the total number of icache lines that will need to be pulled in for a given length.
  7. Include a "local" version of the stosb target. For AVX2/EVEX/AVX512, jumping to the stosb target in the sse2 code section will almost certainly be to a new page. The new version does increase code size marginally by duplicating the target, but should get better iTLB behavior as a result.

  test-memset, test-wmemset, and test-bzero are all passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit e59ced238482fd71f3e493717f14f6507346741e)
  (cherry picked from commit 6d18a93dbbde2958001d65dff3080beed7ae675a)
* x86: Optimize memcmp-evex-movbe.S for frontend behavior and size | Noah Goldstein | 2022-05-14 | 1 file, -192/+242
  No bug. The frontend optimizations are to:

  1. Reorganize logically connected basic blocks so they are either in the same cache line or in adjacent cache lines.
  2. Avoid cases where basic blocks unnecessarily cross cache lines.
  3. Try to 32-byte-align any basic blocks possible without sacrificing code size. Smaller / less hot basic blocks are used for this.

  Overall code size shrank by 168 bytes. This should make up for any extra costs due to aligning to 64 bytes.

  In general, performance previously deviated a great deal depending on whether the entry alignment % 64 was 0, 16, 32, or 48. These changes essentially make it so that the current implementation is at least equal to the best alignment of the original for any arguments.

  The only additional optimization is in the page-cross case. The branch-on-equals case was removed from the size == [4, 7] case. As well, the [4, 7] and [2, 3] cases were swapped, as [4, 7] is likely the hotter argument size.

  test-memcmp and test-wmemcmp are both passing.
  (cherry picked from commit 1bd8b8d58fc9967cc073d2c13bfb6befefca2faa)
  (cherry picked from commit 5ec3416853c4150c4d13312e05f93a053586d528)
* x86: Modify ENTRY in sysdep.h so that p2align can be specified | Noah Goldstein | 2022-05-14 | 1 file, -2/+5
  No bug. This change adds a new macro ENTRY_P2ALIGN which takes a second argument, the log2 of the desired function alignment. The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4), so this doesn't affect any existing functionality.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit fc5bd179ef3a953dff8d1655bd530d0e230ffe71)
  (cherry picked from commit b5a44a6a471aafd3677659a610f32468c40a666b)
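  A simplified sketch of the relationship described (not the exact glibc definition; the real macro also emits symbol-size and CFI boilerplate):

      /* Hypothetical reduced form of the sysdep.h macros: align the entry
         point to 2^alignment bytes before emitting the label.  */
      #define ENTRY_P2ALIGN(name, alignment) \
        .globl name;                         \
        .type name, @function;               \
        .p2align alignment;                  \
        name:

      /* The classic ENTRY keeps its historical 16-byte alignment.  */
      #define ENTRY(name) ENTRY_P2ALIGN (name, 4)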
* x86-64: Optimize load of all bits set into ZMM register [BZ #28252] | H.J. Lu | 2022-05-14 | 10 files, -64/+11
  Optimize loads of all bits set into a ZMM register in the AVX512 SVML codes by replacing

      vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX

  and

      vmovups .L_2il0floatpacket.13(%rip), %zmmX

  with

      vpternlogd $0xff, %zmmX, %zmmX, %zmmX

  This fixes BZ #28252.
  (cherry picked from commit 78c9ec9000f873abe7a15a91b87080a2e4308260)
  (cherry picked from commit 16245986fb9bfe396113fc7dfd1929f69a9e748e)
* scripts/glibcelf.py: Mark as UNSUPPORTED on Python 3.5 and earlier | Florian Weimer | 2022-05-14 | 1 file, -0/+6
  enum.IntFlag and enum.EnumMeta._missing_ support are not part of earlier Python versions.

  (cherry picked from commit b571f3adffdcbed23f35ea39b0ca43809dbb4f5b)
  (cherry picked from commit 83cc145830bdbefdabe03787ed884d548bea9c99)
* dlfcn: Do not use rtld_active () to determine ld.so state (bug 29078) | Florian Weimer | 2022-05-14 | 15 files, -14/+160
  When audit modules are loaded, ld.so initialization is not yet complete, and rtld_active () returns false even though ld.so is mostly working. Instead, the static dlopen hook is used, but that does not work at all because this is not a static dlopen situation.

  Commit 466c1ea15f461edb8e3ffaf5d86d708876343bbf ("dlfcn: Rework static dlopen hooks") moved the hook pointer into _rtld_global_ro, which means that separate protection is not needed anymore and the hook pointer can be checked directly.

  The guard for disabling libio vtable hardening in _IO_vtable_check should stay for now.

  Fixes commit 8e1472d2c1e25e6eabc2059170731365f6d5b3d1 ("ld.so: Examine GLRO to detect inactive loader [BZ #20204]").

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
  (cherry picked from commit 8dcb6d0af07fda3607b541857e4f3970a74ed55b)
  (cherry picked from commit bc56ab1f4aa937665034373d3e320d0779a839aa)
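  For context, the failing scenario involves LD_AUDIT modules; a hedged skeleton of an auditor that uses dlopen early, which is the kind of call that previously hit the broken static-dlopen path (la_version is the one interface an audit module must export):

      /* audit.c: gcc -shared -fPIC audit.c -o audit.so
         Run as: LD_AUDIT=./audit.so some-program  (illustrative only) */
      #define _GNU_SOURCE
      #include <link.h>
      #include <dlfcn.h>

      unsigned int
      la_version (unsigned int version)
      {
        /* Runs before ld.so initialization is complete; with bug 29078,
           dlopen here could take the static-dlopen hook path and fail.  */
        void *h = dlopen ("libc.so.6", RTLD_NOW);
        if (h != NULL)
          dlclose (h);
        return LAV_CURRENT;
      }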
* INSTALL: Rephrase --with-default-link documentation | Florian Weimer | 2022-05-14 | 2 files, -9/+9
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit c935789bdf40ba22b5698da869d3a4789797e09f)
  (cherry picked from commit 0d477e92c49db2906b32e44135b98746ccc73c7b)
* misc: Fix rare fortify crash on wchar funcs. [BZ 29030] [gentoo/glibc-2.34-17] | Joan Bruguera | 2022-04-25 | 2 files, -6/+11
  If `__glibc_objsize (__o) == (size_t) -1` (i.e. `__o` is unknown size), fortify checks should pass, and `__whatever_alias` should be called.

  Previously, `__glibc_objsize (__o) == (size_t) -1` was explicitly checked, but on commit a643f60c53876b, this was moved into `__glibc_safe_or_unknown_len`.

  A comment says the -1 case should work as: "The -1 check is redundant because since it implies that __glibc_safe_len_cond is true.". But this fails when:

  * `__s > 1`
  * `__osz == -1` (i.e. unknown size at compile time)
  * `__l` is big enough
  * `__l * __s <= __osz` can be folded to a constant (I only found this to be true for `mbsrtowcs` and other functions in wchar2.h)

  In this case `__l * __s <= __osz` is false, and `__whatever_chk_warn` will be called by `__glibc_fortify` or `__glibc_fortify_n` and crash the program.

  This commit adds the explicit `__osz == -1` check again.

  moc crashes on startup due to this, see: https://bugs.archlinux.org/task/74041

  Minimal test case (test.c):

      #include <wchar.h>

      int
      main (void)
      {
        const char *hw = "HelloWorld";
        mbsrtowcs (NULL, &hw, (size_t) -1, NULL);
        return 0;
      }

  Build with: gcc -O2 -Wp,-D_FORTIFY_SOURCE=2 test.c -o test && ./test
  Output: *** buffer overflow detected ***: terminated

  Fixes: BZ #29030
  Signed-off-by: Joan Bruguera <joanbrugueram@gmail.com>
  Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
  (cherry picked from commit 33e03f9cd2be4f2cd62f93fda539cc07d9c8130e)
  (cherry picked from commit ca0faa140ff8cebe4c041d935f0f5eb480873d99)
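  A hedged sketch of the shape of the restored check in cdefs.h (simplified and written from memory, not copied from the patch):

      /* A length is "safe or unknown" if the object size is unknown (-1),
         or if __l * __s provably fits within __osz.  */
      #define __glibc_safe_or_unknown_len(__l, __s, __osz)                  \
        ((__osz) == (__SIZE_TYPE__) -1                                      \
         || (__glibc_unsigned_or_positive (__l)                             \
             && __builtin_constant_p (__glibc_safe_len_cond                 \
                                      ((__SIZE_TYPE__) (__l), __s, __osz))  \
             && __glibc_safe_len_cond ((__SIZE_TYPE__) (__l), __s, __osz)))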
* Default to --with-default-link=no (bug 25812) | Florian Weimer | 2022-04-22 | 8 files, -118/+192
  This is necessary to place the libio vtables into the RELRO segment. New tests elf/tst-relro-ldso and elf/tst-relro-libc are added to verify that this is what actually happens.

  The new tests fail on ia64 due to lack of (default) RELRO support in binutils, so they are XFAILed there.

  (cherry picked from commit 198abcbb94618730dae1b3f4393efaa49e0ec8c7)
  (cherry picked from commit f0c71b34f96c816292c49122d50da3a511b67bf2)
* scripts: Add glibcelf.py module | Florian Weimer | 2022-04-22 | 3 files, -0/+1402
  Hopefully, this will lead to tests that are easier to maintain. The current approach of parsing readelf -W output using regular expressions is not necessarily easier than parsing the ELF data directly.

  This module is still somewhat incomplete (e.g., coverage of relocation types and versioning information is missing), but it is sufficient to perform basic symbol analysis or program header analysis.

  The EM_* mapping for architecture-specific constant classes (e.g., SttX86_64) is not yet implemented. The classes are defined for the benefit of elf/tst-glibcelf.py.

  Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
  (cherry picked from commit 30035d67728a846fa39749cd162afd278ac654c4)
  (cherry picked from commit 3e0a91b79b409a6a4113a0fdb08221c0bb29cfce)
* nptl: Fix pthread_cancel cancelhandling atomic operations | Adhemerval Zanella | 2022-04-22 | 1 file, -1/+2
  The 404656009b reversion did not set up the atomic loop to set the cancel bits correctly. The fix is essentially what pthread_cancel did prior to 26cfbb7162ad.

  Checked on x86_64-linux-gnu and aarch64-linux-gnu.

  (cherry picked from commit 62be9681677e7ce820db721c126909979382d379)
  (cherry picked from commit 71326f1f2fd09dafb9c34404765fb88129e94237)
* mips: Fix mips64n32 64 bit time_t stat support (BZ#29069) | Joshua Kinard | 2022-04-18 | 2 files, -15/+25
  Add the missing support to the change initially made by 4e8521333bea6e89fcef1020 (which missed the n32 stat).

  (cherry picked from commit 78fb88827362fbd2cc8aa32892ae5b015106e25c)
  (cherry picked from commit b87b697f15d6bf7e576a2eeadc1f740172f9d013)
* hurd: Fix arbitrary error code | Samuel Thibault | 2022-04-18 | 1 file, -1/+1
  ELIBBAD is Linux-specific.

  (cherry picked from commit 67ab66541dc1164540abda284645e38be90b5119)
  (cherry picked from commit 5d8c7776343b3f1b96ef7777e4504378f23c041a)
* nptl: Handle spurious EINTR when thread cancellation is disabled (BZ#29029) | Adhemerval Zanella | 2022-04-18 | 16 files, -91/+484
  Some Linux interfaces never restart after being interrupted by a signal handler, regardless of the use of SA_RESTART [1]. This means that for pthread cancellation, if the target thread disables cancellation with pthread_setcancelstate and calls such interfaces (like poll or select), it should not see spurious EINTR failures due to the internal SIGCANCEL. However, recent changes made pthread_cancel always send the internal signal, regardless of the target thread's cancellation status or type.

  To fix it, the previous semantic is restored: the cancel signal is only sent if the target thread has cancellation enabled in asynchronous mode. The cancel state and cancel type are moved back to cancelhandling, and atomic operations are used to synchronize between threads. The patch essentially reverts the following commits:

  8c1c0aae20 nptl: Move cancel type out of cancelhandling
  2b51742531 nptl: Move cancel state out of cancelhandling
  26cfbb7162 nptl: Remove CANCELING_BITMASK

  However, I changed the atomic operations to follow the internal C11 semantics and removed the macro usage; this simplifies the resulting code a bit (and removes another usage of the old atomic macros).

  Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu, and powerpc64-linux-gnu.

  [1] https://man7.org/linux/man-pages/man7/signal.7.html

  Reviewed-by: Florian Weimer <fweimer@redhat.com>
  Tested-by: Aurelien Jarno <aurelien@aurel32.net>
  (cherry-picked from commit 404656009b459658138ed1bd18f3c6cf3863e6a6)
  (cherry picked from commit 290db09546b260a30137d03ce97a857e6f15b648)
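  For illustration, a minimal sketch of the usage pattern whose semantics this restores (timeout and polled descriptor are made up, and the sketch ignores startup ordering between the two threads): with cancellation disabled, poll must not fail with EINTR merely because another thread called pthread_cancel:

      #include <poll.h>
      #include <pthread.h>
      #include <stdio.h>

      static void *
      worker (void *arg)
      {
        (void) arg;
        int old;
        pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, &old);
        struct pollfd pfd = { .fd = -1, .events = 0 };
        /* Restored behavior: no internal SIGCANCEL is delivered while
           cancellation is disabled, so this times out instead of
           returning -1 with errno == EINTR.  */
        int r = poll (&pfd, 1, 2000);
        printf ("poll returned %d\n", r);
        pthread_setcancelstate (old, &old);
        pthread_testcancel ();   /* the pending cancellation acts here */
        return NULL;
      }

      int
      main (void)
      {
        pthread_t t;
        pthread_create (&t, NULL, worker, NULL);
        pthread_cancel (t);
        pthread_join (t, NULL);
        return 0;
      }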