
Code 12 in Windows with Tesla M40 (24 GB) on an Asus P8Z77-V Deluxe with 4G decoding and ReBarUEFI enabled #45

Open · skenizen opened this issue Apr 2, 2023 · 8 comments
Labels: bios (issue with firmware that needs patching)

Comments


skenizen commented Apr 2, 2023

System

  • Motherboard: Asus P8Z77-V Deluxe
  • BIOS Version: 2104
  • GPU: Nvidia Tesla M40
  • CPU: i7-3770K (22nm Ivy Bridge with support for PCIe 3.0)
  • CSM is turned off
  • 4G decoding is enabled
  • UEFIPatch is applied
  • DSDT looks similar
  • I have read Common issues (and fixes)

Description

I'm trying to get an Nvidia Tesla M40 (24 GB) working on an Asus P8Z77-V Deluxe. I've done all the modifications required/advised here, but the card still reports a Code 12 in Windows. I did get Resizable BAR working on a Polaris RX 590: it showed 8 GB in GPU-Z and was properly listed under Large Memory in Device Manager on Windows (as can be seen in the screenshots). I've tried the registry trick for the AMD drivers, but the option never appeared in the Adrenalin software.
(Screenshot: Device Manager large memory allocation with the RX 590)

When I plug in the Tesla M40, only the PCI bus is displayed under Large Memory.
(Screenshot: Device Manager large memory allocation with the Tesla M40)

When booting Linux (Ubuntu 20.04 Server), the kernel keeps logging the following error (extract from dmesg):

[    0.077256] kernel: [mem 0xdf200000-0xf7ffffff] available for PCI devices
[...]
[    5.106057] kernel: nvidia: loading out-of-tree module taints kernel.
[    5.106073] kernel: nvidia: module license 'NVIDIA' taints kernel.
[    5.106075] kernel: Disabling lock debugging due to kernel taint
[    5.135499] kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[    5.135505] kernel: 
[    5.137085] kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[    5.137846] kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                       NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[    5.137868] kernel: nvidia: probe of 0000:01:00.0 failed with error -1
[    5.137882] kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
[    5.137883] kernel: NVRM: None of the NVIDIA devices were initialized.
[    5.138031] kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 234

I'm not sure what size was set with ReBarState.exe when I saved this particular dmesg log, so the first line reporting the memory available for PCI devices might differ. The rest (specifically NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)) was always displayed regardless of the size specified.

Troubleshooting tried

  • Deactivated as many devices in the BIOS as possible.
  • Used two separate power cables (from the same PSU) for the two power connectors of the GPU (in case it was a power issue).
  • Tested the GPU in a different system to check whether the GPU itself had an issue (it works in a different system with the same drivers tested here).
  • Tested Windows 10 (22H2) and 11.
  • Tested with less memory assigned via ReBarState.exe (2048).
  • Looked at the DSDT; it seemed to have the proper values, so I didn't change it.
  • Tinkered with hidden/unlocked PCIe-related options in the Asus BIOS.

I've noticed that GPU-Z on Windows, when the Polaris card was installed, reported 8 GB of VRAM on BAR0, while the Linux driver complains about BAR1. I thought this could maybe be the issue, but I have no idea how to change that.

I'm at my wit's end on this issue. I've seen it stated in a different issue that what a Tesla M40 requires is not exactly Resizable BAR but only large address space, but I would have expected it to work with this enabled, since that's all I had to enable on another motherboard to get the card working there.

If you have any ideas about what is wrong and can help, that would be great. Thanks for the great project!


xCuri0 commented Apr 2, 2023

@skenizen how much RAM is installed in your system? It looks like 32 GB, so it won't work: the firmware uses top-to-bottom allocation, leaving no space for the M40's 16 GB BAR. It should work with less RAM if you have applied UEFIPatch.

It is possible to do bottom-to-top allocation, but that would require hooking AllocateMemorySpace and replacing it with bottom-to-top allocation for 64-bit MMIO.

Most UEFI firmware allocates like this (top to bottom) with 4G decoding. Keep in mind that BARs must be aligned to their size and that Ivy Bridge only has 64 GB (36-bit) of physical address space.

  • 64 - 56 GB: Intel ME, audio, Ethernet, iGPU, etc.
  • 56 - 48 GB: RX 590 8 GB BAR
    No room for the M40's 16 GB BAR with 32 GB of RAM, because it would need to be at 32 - 48 GB or 16 - 32 GB

You can see that, with this allocation method, the only two other places the M40's 16 GB BAR could be allocated are 16 GB and 32 GB.
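
As a quick illustration of the size-alignment rule (my own sketch, not firmware code): a BAR must be naturally aligned, i.e. its base must be a multiple of its size, so a 16 GB BAR inside a 64 GB (36-bit) address space has exactly four candidate bases; the top-down layout above already uses 48 GB, and with 32 GB of RAM installed the remaining candidates collide with system memory.

/* bar_candidates.c - trivial illustration of natural BAR alignment.
 * Prints the only bases a 16 GB BAR can use in a 64 GB address space:
 * 0, 16, 32 and 48 GB. */
#include <stdio.h>

int main(void)
{
    const unsigned long long GiB   = 1ULL << 30;
    const unsigned long long bar   = 16 * GiB;   /* BAR size == required alignment */
    const unsigned long long space = 64 * GiB;   /* 36-bit physical address space  */

    for (unsigned long long base = 0; base + bar <= space; base += bar)
        printf("candidate base: %llu GB\n", base / GiB);
    return 0;
}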

Bottom-to-top allocation should look like this (with 32 GB RAM):

  • 64 - 48 GB: Tesla M40 16 GB BAR
  • 48 - 40 GB: RX 590 8 GB BAR
  • 40 - 32.5 GB: Intel ME, audio, Ethernet, iGPU, etc.

FYI, if you do want to try writing a hook for AllocateMemorySpace, this is what the call made by the PciHostBridge DXE for 64-bit BARs (4G decoding) looks like:

AllocateMemorySpace (EfiGcdAllocateMaxAddressSearchTopDown,
                     EfiGcdMemoryTypeMemoryMappedIo,
                     Alignment,
                     BarLength,
                     &BaseAddress,
                     ImageHandle,
                     DeviceHandle);
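
For reference (my addition, from the PI spec's DXE Services definition in MdePkg's Pi/PiDxeCis.h), the prototype of this service is shown below. Note that BaseAddress is IN OUT: for the MaxAddressSearch allocation types it carries the highest allowed address on input and receives the allocated base on output, and Alignment is the log2 of the required alignment.

typedef
EFI_STATUS
(EFIAPI *EFI_ALLOCATE_MEMORY_SPACE)(
  IN     EFI_GCD_ALLOCATE_TYPE  GcdAllocateType,
  IN     EFI_GCD_MEMORY_TYPE    GcdMemoryType,
  IN     UINTN                  Alignment,     // log2 of the required alignment
  IN     UINT64                 Length,
  IN OUT EFI_PHYSICAL_ADDRESS   *BaseAddress,  // in: search limit, out: result
  IN     EFI_HANDLE             ImageHandle,
  IN     EFI_HANDLE             DeviceHandle OPTIONAL
  );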



skenizen commented Apr 2, 2023

Thanks for the answer and the explanation.

You were right about the RAM: I have 32 GB (the maximum for that CPU). After reading your initial message, I removed three of the RAM sticks, left the computer with a single 8 GB stick, and tried again.
Unfortunately this did not solve the issue. I'm going to locate a 4 GB RAM stick and try with that too.

I don't have the RX 590 plugged into the system at the same time as the Tesla; I only mentioned it because I used it to validate that the patches were applied properly. When I test the Tesla, only the Tesla is in a PCIe slot and I'm using the Intel iGPU (with 32 MB of VRAM assigned to it in the BIOS).

I've applied UEFIPatch and, unless I made a mistake somewhere, everything should be fine on that front (I can send you the BIOS I flashed if that would help validate it).

Is there a way for me to read the allocation table to check how much the motherboard has already allocated to the other devices and how much is free for the Tesla?

Update: I've tried with a single 2 GB stick and a single 4 GB stick and saw the same issue.


xCuri0 commented Apr 2, 2023

@skenizen I just realized you have the 24 GB Tesla M40, which uses a 32 GB BAR.

In that case you will need to use bottom-to-top allocation with less than 32 GB of RAM (28 GB should work). So you will need to write the AllocateMemorySpace hook if you want it to work on this PC. I've added more details about hooking it in the previous comment.

It should look like this once done (with 24 GB RAM):

  • 64 - 32 GB: Tesla M40 32 GB BAR
  • 32 - 24.5 GB: Intel ME, audio, Ethernet, iGPU, etc.

If you want to use the RX 590 at the same time too, you will need less than ~22 GB of RAM. It will look like this (with 16 GB RAM):

  • 64 - 32 GB: Tesla M40 32 GB BAR
  • 32 - 24 GB: RX 590 8 GB BAR
  • 24 - 16.5 GB: Intel ME, audio, Ethernet, iGPU, etc.

/proc/iomem (as root) on Linux or Device Manager (View -> Resources by type) on Windows will show the PCI memory map.
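
If it helps, here is a small sketch (my addition, not from this thread) that filters /proc/iomem down to regions starting above 4 GB, i.e. the ranges competing with a 64-bit BAR. Run it as root, otherwise the kernel hides the addresses and they read back as zeros.

/* iomem_above_4g.c - print /proc/iomem entries that start above 4 GB. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/iomem", "r");
    if (!f) { perror("/proc/iomem"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        unsigned long long start, end;
        /* lines look like "  380000000-38fffffff : PCI Bus 0000:01" */
        if (sscanf(line, " %llx-%llx", &start, &end) == 2 &&
            start >= 0x100000000ULL)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}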


skenizen commented Apr 2, 2023

I'm going to look at setting up the build environment for the module and writing the allocator.

Do you have any advice on how to set up a test environment for iterating on the allocator? I'd like to avoid having to flash my motherboard to test the code. I suppose it's possible to patch a VM's UEFI firmware and use virtual machines; any advice on setting that up?

On a different note, do I understand correctly that the hexadecimal numbers to the left of the devices are the memory ranges allocated to the hardware, and that in this screenshot the memory is still allocated even though the device is disabled in the BIOS (the device appears in a lighter colour to show it's unused, but the address still seems to be reserved)?

(Screenshot: Device Manager memory allocation with the disabled device highlighted)

Thanks a lot for the pointers. It's an interesting problem to try to solve, and if you're right, the solution doesn't look too hard to implement.


xCuri0 commented Apr 2, 2023

@skenizen Greyed-out items are just devices that were there before but aren't now. You can disable Show hidden devices in the View menu to remove them.

You can use QEMU with OVMF to test it, which is what I used to figure out hooking in this module; the OVMF firmware can simply be opened in UEFITool. There are a lot of examples/articles about hooking EFI functions, so those can be useful.

You'd want the hook to only affect 64-bit allocations, which is why I've provided an example of the call made by PciHostBridge in a previous comment. So you'd only want to use the alternate method when:

  • GcdAllocateType: EfiGcdAllocateMaxAddressSearchTopDown
  • GcdMemoryType: EfiGcdMemoryTypeMemoryMappedIo
  • *BaseAddress: above 0x100000000 (4 GB). FYI, this is a pointer and you want the value it points to, not the pointer address itself. You'll have to change this check for QEMU testing because it doesn't support 4G decoding. *BaseAddress is also where you will have to write the newly found address.

You'd have to get the EFI memory map and find a valid address (properly aligned and large enough for the BAR) using a bottom-to-top search starting at 0x100000000 (4 GB). You can probably use the original AddMemorySpace EfiGcdAllocateMaxAddressSearchBottomUp code as a reference; just modify it to start at 0x100000000 (4 GB) instead of 0 like it currently does. A rough sketch of such a hook is shown below.
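
Here is a rough, untested sketch (my own, not the project's code) of what such a hook could look like as an EDK2 DXE driver. Names like BottomUpAllocateMemorySpace, mOrigAllocateMemorySpace and BOTTOM_UP_START are made up, and instead of reusing the DxeCore search code it naively retries EfiGcdAllocateAddress at each size-aligned candidate above 4 GB; it also assumes the MMIO range above 4 GB has already been added to the GCD map by the platform.

#include <Uefi.h>
#include <Pi/PiDxeCis.h>
#include <Library/BaseLib.h>
#include <Library/DxeServicesTableLib.h>
#include <Library/UefiBootServicesTableLib.h>

#define BOTTOM_UP_START  0x100000000ULL          // 4 GB

STATIC EFI_ALLOCATE_MEMORY_SPACE  mOrigAllocateMemorySpace;

STATIC
EFI_STATUS
EFIAPI
BottomUpAllocateMemorySpace (
  IN     EFI_GCD_ALLOCATE_TYPE  GcdAllocateType,
  IN     EFI_GCD_MEMORY_TYPE    GcdMemoryType,
  IN     UINTN                  Alignment,       // log2 of required alignment
  IN     UINT64                 Length,
  IN OUT EFI_PHYSICAL_ADDRESS   *BaseAddress,    // in: max address, out: result
  IN     EFI_HANDLE             ImageHandle,
  IN     EFI_HANDLE             DeviceHandle
  )
{
  //
  // Only redirect the 64-bit MMIO allocations made by PciHostBridge:
  // top-down search, MMIO type, upper limit above 4 GB.
  //
  if ((GcdAllocateType == EfiGcdAllocateMaxAddressSearchTopDown) &&
      (GcdMemoryType == EfiGcdMemoryTypeMemoryMappedIo) &&
      (*BaseAddress >= BOTTOM_UP_START)) {
    UINT64                Step  = LShiftU64 (1, Alignment);
    EFI_PHYSICAL_ADDRESS  Limit = *BaseAddress;
    EFI_PHYSICAL_ADDRESS  Base;

    //
    // Naive bottom-up search above 4 GB: try each aligned candidate with
    // EfiGcdAllocateAddress until one of them is free in the GCD map.
    //
    for (Base = ALIGN_VALUE (BOTTOM_UP_START, Step);
         (Base + Length - 1) <= Limit;
         Base += Step) {
      EFI_PHYSICAL_ADDRESS  Candidate = Base;
      if (!EFI_ERROR (mOrigAllocateMemorySpace (
                        EfiGcdAllocateAddress, GcdMemoryType, Alignment,
                        Length, &Candidate, ImageHandle, DeviceHandle))) {
        *BaseAddress = Candidate;
        return EFI_SUCCESS;
      }
    }
    // Nothing fit bottom-up; fall back to the original behaviour below.
  }

  return mOrigAllocateMemorySpace (GcdAllocateType, GcdMemoryType, Alignment,
                                   Length, BaseAddress, ImageHandle, DeviceHandle);
}

EFI_STATUS
EFIAPI
BottomUpAllocHookEntry (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  // Swap the DXE service pointer and fix up the table header CRC32.
  mOrigAllocateMemorySpace = gDS->AllocateMemorySpace;
  gDS->AllocateMemorySpace = BottomUpAllocateMemorySpace;
  gDS->Hdr.CRC32           = 0;
  gBS->CalculateCrc32 (gDS, gDS->Hdr.HeaderSize, &gDS->Hdr.CRC32);
  return EFI_SUCCESS;
}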


xCuri0 commented Apr 8, 2023

@skenizen any update?


skenizen commented Apr 8, 2023

Hi @xCuri0. No, unfortunately I only have my weekends for this, and although I find the prospect fun, I ended up ordering a new motherboard so I can use both 32 GB of RAM and two Tesla M40s (I bought a used X299 board for more PCIe lanes). I'm still interested in writing the custom allocator, but as a low priority. I will keep you informed if I have updates; feel free to close this ticket in the meantime.

One note though: contrary to what you said in your last comment, the greyed-out devices were not a history of what was previously connected in Windows Device Manager; the motherboard seems to assign memory addresses regardless of whether the devices are enabled in the BIOS. If you look at my screenshot, the addresses do not overlap. I tried toggling the hidden devices setting, and the screenshot I took was with "Show hidden devices" off.


xCuri0 commented Aug 18, 2023

It's working on Linux with Ivy Bridge LGA1155, see #77.

xCuri0 added the bios label (issue with firmware that needs patching) on Dec 3, 2023