More Xen troubleshooting

Updated: January 27, 2012

An earlier tutorial covered several common Xen problems that may occur when you try to test Xen in your environment. That first batch of tips and tricks focused on problems with the display, Python versions and modules, changing wrapper scripts and running explicit commands, and troubleshooting services.

Today, we dive deeper and try to understand, fix and work around several more problems that you may or may not have encountered. I must admit that some of the solutions could be a little ad-hoc, so don't take my word for it, but I do believe they work well overall. Hopefully, though, your Xen experience will be more, uh, zenny after this reading.

Tip 1: VNC connection to hypervisor host got refused or disconnected

After you start the Virtual Machine Manager (virt-manager), you have the option to view your virtual machines using VNC, which displays the console of the relevant domain. Now and then, the console may turn blank and throw the error written in the title above. The question is, what do you do now?

[Image: VNC refused]

My experience shows that this happens any time the guest operating system is restarted or shut down. Once you power it on again, you will have to, most inconveniently, reopen the console to view the contents. The same occurs if you use virt-viewer:

virt-viewer <domain>

Once the domain is rebooted, reset or powered off, you will lose the connection. For the time being, the simple solution is to reopen the console window, either through virt-manager or virt-viewer.

Alternatively, to establish the connection:

virsh vncdisplay <domain>
vncviewer <hostname:port returned by virsh command>
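If you reconnect often, the two manual steps can be wrapped in a small helper. This is only a sketch: the function names vnc_target and console_loop are made up for illustration, and it assumes that virsh vncdisplay prints either a bare display such as :0 or a full address such as 127.0.0.1:0.

```shell
#!/bin/sh
# vnc_target HOST DISPLAY
# Build the argument for vncviewer from the output of "virsh vncdisplay".
# Assumption: DISPLAY is either ":N" (prepend the host) or "ADDR:N" (use as is).
vnc_target() {
    case "$2" in
        :*) printf '%s%s\n' "$1" "$2" ;;
        *)  printf '%s\n' "$2" ;;
    esac
}

# console_loop HOST DOMAIN
# Reopen the VNC console each time the viewer exits, for example after a
# guest reboot. Stop it with Ctrl+C.
console_loop() {
    while true; do
        display=$(virsh vncdisplay "$2") || display=""
        if [ -n "$display" ]; then
            vncviewer "$(vnc_target "$1" "$display")"
        fi
        sleep 2
    done
}
```

For example, console_loop localhost test1 keeps reopening the console of the domain test1 on the local host. Some virt-viewer versions also offer --wait and --reconnect options that aim at the same thing, so it is worth checking virt-viewer --help on your system first.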

Tip 2: vif disappears while guest running, no more networking

It may happen that while your virtual machine is running, the virtual network interface, normally labeled vifX.X, disappears from the ifconfig list, turning your guest into a useless brick.

The problem may also manifest as the MAC address of the relevant network interface inside the virtual machine turning into all zeros. On the host (dom0), you may see entries like the following in /var/log/messages:

kernel: [ 109.743854] br0: port 2(vif1.0) entering learning state
kernel: [ 109.811880] br0: port 3(tap1.0) entering learning state
kernel: [ 117.369189] br0: port 3(tap1.0) entering disabled state
kernel: [ 117.398076] device tap1.0 left promiscuous mode
kernel: [ 117.398083] br0: port 3(tap1.0) entering disabled state
kernel: [ 117.815006] br0: port 2(vif1.0) entering disabled state
kernel: [ 117.850075] br0: port 2(vif1.0) entering disabled state
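To see at a glance which bridge ports dropped out, you can filter the log for the "entering disabled state" messages shown above. The helper below is a rough sketch; the name disabled_ports is hypothetical, and it assumes your kernel logs these messages in the same format as the excerpt.

```shell
#!/bin/sh
# disabled_ports LOGFILE
# List bridge ports that logged "entering disabled state", one per line,
# without duplicates.
disabled_ports() {
    sed -n 's/.*port [0-9]*(\([^)]*\)) entering disabled state.*/\1/p' "$1" | sort -u
}
```

Running disabled_ports /var/log/messages on dom0 and comparing the result with the output of brctl show tells you which interfaces have gone missing from the bridge.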

So how do you go about fixing something like this? There are several options. One, a kernel upgrade that might mysteriously resolve the problem. Well, not so mysteriously, more like fixing race conditions in the Xen networking stack.

The second option is to tweak the network settings in the Xend configuration file, /etc/xen/xend-config.sxp. This file contains all kinds of directives, including how to set up the bridge and virtual network interfaces. Many claim that the stock networking scripts are somewhat buggy and that you should manually configure your own.

In particular, change the following directives:

(network-script network-bridge) to (network-script )

Furthermore, you should manually configure your bridge. During testing, you ought to use the brctl and ifconfig commands, and once you're confident your changes are solid, you should permanently commit them to your network scripts. On RedHat and SUSE systems, this means creating ifcfg-br<number> as well as changing the existing network scripts for your network cards, found under /etc/sysconfig/network-scripts on RedHat and /etc/sysconfig/network on SUSE. On Debian-based systems, it means editing the /etc/network/interfaces configuration file. Needless to say, backups are a must.

Here's an example of a manual bridge configuration:

brctl addbr br0
brctl setfd br0 0
ifconfig br0 10.0.0.128 up netmask 255.255.255.0
brctl addif br0 eth0
route add -net default gw 10.0.0.254 br0
ifconfig eth0 0.0.0.0 up
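Since a typo here can cut you off from a remote host, it may help to review the commands before running them. The sketch below wraps the same sequence in a function with a dry-run switch. The names run, setup_bridge and DRY_RUN are hypothetical, not part of any Xen or bridge-utils tooling, and the real run needs root.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Print the command; execute it only when DRY_RUN is empty or unset.
run() {
    printf '%s\n' "$*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# setup_bridge BRIDGE NIC ADDRESS NETMASK GATEWAY
# Same sequence as the manual example above, with the values as parameters.
setup_bridge() {
    run brctl addbr "$1"
    run brctl setfd "$1" 0
    run ifconfig "$1" "$3" up netmask "$4"
    run brctl addif "$1" "$2"
    run route add -net default gw "$5" "$1"
    run ifconfig "$2" 0.0.0.0 up
}
```

Set DRY_RUN=1 and call setup_bridge br0 eth0 10.0.0.128 255.255.255.0 10.0.0.254 to print the six commands without executing them. Clear DRY_RUN and repeat the call once the output looks right.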

Next on the menu: increasing or decreasing the number of netloops, that is, how many pairs of virtual devices can be created. If you run into the upper limit, you may want to increase the number. This is done by creating a file called netloop under /etc/modprobe.d and writing inside:

options netloop nloopbacks=<some high number, e.g. 16, 32>

Or just adding this option to /etc/modprobe.conf.

Quite a bit more reading on this topic:

Xen networking

Xen network bridges explained with troubleshooting notes

SLES networking under xen, troubleshooting and recommendations

Hassle-free Xen networking

Linux networking sucks. XEN networking sucks.

Creating additional Xen virtual network bridges

If you encounter kernel crashes when this happens, these two links might also be of interest. Beware the ultra-geekiness:

Kernel BUG at mm/vmalloc.c:2165

[PATCH] mm: sync vmalloc address space page tables in alloc_vm_area()

The problem may actually be resolved by reducing the number of netloops to zero.

Tip 3: Enable legacy HTTP access

libvirt makes HTTP redundant when it comes to connecting to dom0 and managing it both locally and remotely. However, some of you might be interested in this option, especially if you intend to use tools like OpenXenManager, which uses HTTP/HTTPS to connect. To make Xen accessible from the Web, you will need to edit several directives in the xend-config.sxp configuration file:

(xend-http-server yes)
(xend-port 8000)
(xend-address '')

[Image: Configuration file]

Change no to yes for xend-http-server and specify the correct port. The xend-address directive specifies the IP address of the network interface that Xend will listen on. If you leave the default empty quotation marks, it will listen on all available interfaces. Next, reload the configuration file:

/etc/init.d/xend reload
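To double-check which values are actually active in the file, you can pull the uncommented directives out of xend-config.sxp. The helper below is a minimal sketch. The name sxp_get is hypothetical, and it assumes one-line directives of the form (key value), with disabled entries commented out using a leading #.

```shell
#!/bin/sh
# sxp_get KEY FILE
# Print the value of an active (uncommented) one-line directive such as
# "(xend-port 8000)". Lines starting with "#" do not match and are skipped.
sxp_get() {
    sed -n "s/^[[:space:]]*($1[[:space:]]\{1,\}\(.*\))[[:space:]]*\$/\1/p" "$2"
}
```

For example, sxp_get xend-http-server /etc/xen/xend-config.sxp should print yes once the change is in place, and sxp_get xend-port with the same file shows the port to point your browser or management tool at.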

And then you can test the Web access:

[Image: HTTP access]

Tip 4: ERROR: unable to connect to 'localhost:8000': Connection refused

A corollary of the above tip is that you may get an ugly connection refused error when you try to connect to dom0 using virt-manager. Python is notorious for being verbose, so be prepared for ugliness.

[Image: Port 8000 refused]

[Image: Port 8000 refused, more]

To resolve this problem, you may need to enable HTTP access. Alternatively, you need to ask yourself why this happens in the first place. The most likely reason is that one of the Xen services is not running, which forces virt-manager to fall back to legacy access. If you do not wish to use the HTTP protocol, make sure that your system boots with all the relevant Xen services, libvirtd and xend in particular.

Tip 5: domU startup errors

When you try to power on your domains with xm create <domain> or xm start <domain>, you may encounter several error messages that might anger or confuse you. Let's try to dissect some of these and understand what the problem may be.

#xm start test1
Error: Domain unable to be unpaused: an integer is required
Usage: xm start <DomainName>

And a graphical equivalent:

[Image: POST error]

This normally means you have specified a non-existent or unsupported hardware component, like a network card or a sound card. Please review your domain configuration file, normally stored under /etc/xen/vm, for errors. You may also want to consult /var/log/messages, as well as the Xen error log, /var/log/xen/xend.log.

Now, here's another one:

#xm create test1
Using config file "./test1".
Error: (2, 'Invalid kernel', "elf_xen_note_check: ERROR: Not a Xen-ELF image: No ELF notes or '__xen_guest' section found.\n")

This could happen if you try to power on a paravirtualized guest using the HVM kernel settings in the domain configuration file. Fairly trivial to debug and resolve.

#xm create test1
Using config file "./test1".
Error: Device 768 (vbd) could not be connected.
File /tmp/disk.raw is loopback-mounted through /dev/loop1,
which is mounted in a guest domain,
and so cannot be mounted now.

You may see this if you try to reuse a hard disk file already in use by another virtual machine. Alternatively, you may have mounted it as a loopback device somewhere in order to inspect its innards.

Speaking of vbd errors, here's another one:

#xm create test1
Using config file "./test1".
Error: Device 5632 (vbd) could not be connected. Device not found.

A variation of this kind of message shown in xend.log:

DEBUG (DevController:139) Waiting for devices vscsi.
DEBUG (DevController:139) Waiting for devices vbd.
DEBUG (DevController:144) Waiting for 768.
WARNING (XendDomain:1076) Failed to setup devices for <domain id=None name=test1 memory=4294967296 state=halted>: Device 768 (vbd) could not be connected. Device not found.

The reasons here might be a badly specified hard disk file, or again, a bad use of some device, probably due to a spelling error. This is similar to the first error listed above.

Now, one question that may arise is how to tell which device a given vbd number refers to. As the simplest workaround, you can verify this on any booted Xen domU. All of the devices can be found under /sys/devices/xen/. For example:

#cat /sys/devices/xen/vbd-768/block/hda/hda1/dev
3:1

Here we see that vbd-768 is a block device with the major number 3. Therefore, hda1 corresponds to 3:1, which stands to reason. More about this nomenclature in a sec.

And a complete listing for a particular virtual machine:

#ll
total 0
drwxr-xr-x 2 root root    0 Dec 18 12:07 power
-rw-r--r-- 1 root root 4096 Dec 18 12:07 uevent
drwxr-xr-x 4 root root    0 Dec 18 12:05 vbd-5632
drwxr-xr-x 4 root root    0 Dec 18 12:05 vbd-768
drwxr-xr-x 3 root root    0 Dec 18 12:05 vfb-0
drwxr-xr-x 4 root root    0 Dec 18 12:05 vif-0

The more complex way is to translate the number to hexadecimal, then take the two least significant hex digits (the low byte) as the minor number and the remaining higher digits as the major number, and finally refer to the devices.txt list, either online or under /usr/src/linux/Documentation, to decipher the exact device type. For example, 768 decimal is 0x0300, meaning major number 3, minor number 0. This translates into the first (zero) IDE device, or hda. Likewise, 5632 translates into 0x1600, meaning major 0x16 (22 decimal), minor 0, which for block devices is hdc, the first device on the second IDE interface and commonly the CD-ROM. Jolly well, no? This old Xen mailing list thread might also be of interest.
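The conversion is easy to script. The sketch below splits a vbd number into its major and minor parts, treating the low byte as the minor number and everything above it as the major number. The function name decode_vbd is made up for illustration, and the major is printed in decimal, which is how devices.txt lists it (0x16 is 22).

```shell
#!/bin/sh
# decode_vbd NUMBER
# Split a vbd device number into MAJOR:MINOR, both in decimal.
# The low byte is the minor number, the remaining high bits are the major.
decode_vbd() {
    printf '%d:%d\n' "$(( $1 >> 8 ))" "$(( $1 & 255 ))"
}
```

With this, decode_vbd 768 prints 3:0 and decode_vbd 5632 prints 22:0, which you can then look up in the block device section of devices.txt.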

Tip 6: Advanced settings

Finally, something lightweight. When booting your guests, you may want to tweak some of the advanced settings. For example, you might be interested in PAE or OpenGL. You should mark the correct options in the virtual machine wizard, or alternatively, if you're familiar with the command line and the virtual machine configuration files, add the correct strings there.

[Image: Advanced settings]

The last, unlisted tip is setting up multi-boot systems with GRUB2 in charge. Nothing new, we talked about this in the introduction article, so please hop there and take a look.

And we shall end here today.

Conclusion

I believe that any tool written in Python is designed with as much error verbosity as possible just to spite administrators and annoy them. Xen falls within this category, with some fairly profuse and inelegant messages about the problem it is facing. Not trivial, to say the least.

Still, hopefully, this tutorial was of some use to you. I must admit that there's quite a bit of margin for error here. You may very well try to apply these suggestions and fail miserably, feeling anger and disappointment. This is because there are truly hundreds of different scenarios that could afflict you, each ever so slightly different. Regardless, some of the tips should be useful: how to work around the VNC disconnects, how to investigate network problems, how to handle various errors and configuration problems, and more. Well, that would be all, see you around.

Cheers.