INTEL MCE INJECTOR DRIVER DOWNLOAD
In either case, the hardware does not raise a machine check immediately; instead it flags the data unit as poisoned, and the error only surfaces when that data is read or consumed. The OS can then take appropriate action, such as killing the process that owns the corrupted data or logging the event to disk. See this LWN article for further details about this issue. However, pages containing critical kernel data cannot be isolated.
|Date Added:||6 June 2006|
|File Size:||24.90 Mb|
|Operating Systems:||Windows NT/2000/XP/2003/7/8/10 MacOS 10/X|
|Price:||Free* [*Free Registration Required]|
For “Action Optional” machine checks, which can happen asynchronously to program execution (for example, due to scrubbing), the OS can queue up a handler to go deal with the affected page, either by poisoning it or unmapping it or what-have-you.
In any case, this bit allows previously poisoned pages to be ignored by the handler.
Studies about memory errors: a good study on memory errors from the University of Rochester.
As memory density increases, error rates also rise. Dirty pages are unmapped from all associated processes, which are subsequently killed. Reserved kernel pages and zero-count pages are ignored, at the risk of a later system crash. Huge pages fail, since reverse mapping is not supported for them and so the process that owns the page cannot be identified.
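The per-page-type behaviour described above can be read as a small decision table. The following is only a toy sketch in Python, not kernel code: the `Page` fields, the `Action` names, and the fallback `ISOLATE` case for pages the text does not mention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    IGNORE = auto()          # leave the page alone and accept the risk
    UNMAP_AND_KILL = auto()  # unmap from every process, then kill those processes
    FAIL = auto()            # the handler cannot deal with this page type
    ISOLATE = auto()         # assumed default: simply take the page out of use


@dataclass
class Page:
    # Illustrative fields only; this is not the kernel's struct page.
    reserved: bool = False
    refcount: int = 1
    huge: bool = False
    dirty: bool = False
    already_poisoned: bool = False


def choose_action(page: Page) -> Action:
    """Map a page's state to the handling described in the text."""
    if page.already_poisoned:
        return Action.IGNORE          # the poison bit lets repeats be skipped
    if page.reserved or page.refcount == 0:
        return Action.IGNORE          # reserved kernel pages and zero-count pages
    if page.huge:
        return Action.FAIL            # no reverse mapping to find the owner
    if page.dirty:
        return Action.UNMAP_AND_KILL  # the only copy of the data is corrupt
    return Action.ISOLATE             # assumption: not covered by the text
```

For example, `choose_action(Page(dirty=True))` yields `Action.UNMAP_AND_KILL`, while a reserved page is ignored even when it is dirty, because the reserved check comes first.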
mcelog — further reading
This allow system software to perform recovery action on certain class of uncorrected errors and continue If I’m not mistaken, that’s the processor family this article was referring to. Second, bad-memory containment must be done at a level where the kernel manages memory.
If the erroneous data is never read, no machine check is necessary.
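To make the "flag now, fault only on use" idea concrete, here is a minimal Python model. It is a sketch under stated assumptions, not a description of real hardware: `PoisonedMemory`, `mark_poisoned`, and the `MachineCheck` exception are hypothetical names invented for this illustration.

```python
class MachineCheck(Exception):
    """Stands in for the machine check raised when poisoned data is consumed."""


class PoisonedMemory:
    def __init__(self, data):
        self._data = list(data)
        self._poisoned = set()   # addresses flagged as holding bad data

    def mark_poisoned(self, addr):
        # An uncorrected error was detected: only flag the unit, raise nothing.
        self._poisoned.add(addr)

    def read(self, addr):
        # The error becomes visible only when the flagged unit is consumed.
        if addr in self._poisoned:
            raise MachineCheck(f"poisoned data consumed at address {addr}")
        return self._data[addr]
```

Marking address 1 as poisoned has no visible effect until something calls `read(1)`; reads of other addresses keep working, which is the case where no machine check is necessary.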
And they go on to say that the poison handler runs some time after the specific bad subset is used. The delays include asynchronous hardware reporting of the machine check event, and delayed execution of the handler via a workqueue.
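The deferred handling can be sketched as a queue that is filled at report time and drained later. This Python model is a simplification and rests on assumptions: `PoisonHandler`, `report`, `run_deferred`, and the use of a page frame number (`pfn`) as the key are hypothetical, and a plain set stands in for the per-page poison bit.

```python
from collections import deque


class PoisonHandler:
    def __init__(self):
        self.pending = deque()   # events reported but not yet handled
        self.poisoned = set()    # stands in for the per-page poison bit
        self.handled = []        # pages actually processed, in order

    def report(self, pfn):
        """Report time: only record the event and return quickly."""
        self.pending.append(pfn)

    def run_deferred(self):
        """Later, as a workqueue item would: drain everything queued so far."""
        while self.pending:
            pfn = self.pending.popleft()
            if pfn in self.poisoned:
                continue         # already poisoned: ignore the repeat event
            self.poisoned.add(pfn)
            self.handled.append(pfn)
```

Because reports are only queued, several events arriving close together, including repeats for the same page, are all picked up by the next `run_deferred()` call, and each page is processed once.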
Whether or not the CPU referenced the particular word that triggered the fault, the existing MCA may consider such faults catastrophic at the task level, and so does not bother to precisely track which instruction(s) may have consumed the bogus data.
Newer Intel CPUs support a new class of machine checks called recoverable action optional. Do you have different documentation that suggests otherwise?
This document is dated June, so it’s not like it’s ancient. The handler must allow for multiple poisoning events occurring in a short time window. Includes an overview of modern mcelog.
Er, maybe I’m missing the thrust of your question, but I thought it was sort of straightforward: In any case, both are machine checks. Many details described in the old paper are outdated by now. Memory “poisoning”, with its delayed handling of errors, allows for a more graceful recovery from and isolation of uncorrected memory errors rather than just crashing the system.
That’s not how I read this. Yes, that’s the scenario in the sentences I excerpted from the article. Now flip with me to page and look at what SRAO errors are architecturally defined, there in section
So there we have it.
Maybe the article is confusing multiple scenarios. Background scrubbing is entirely asynchronous to process execution.
Alternatively, memory may be occasionally “scrubbed.