I’ve been using OpenIndiana oi_151a2 for over a year now as my home storage solution. It has been rock solid and gives me better IO over iSCSI for VMware than I get in the production NetApp environment at work. It also lets me have some nice big pools with commodity drives for media, backup, VMs, and pretty much everything else I do on the computer.
Problem
Every now and then something strange happens: the OS disk craps out, memory starts failing, and, most recently, a kernel panic and boot loop when trying to import one of my zpools:
[ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff000d1a5920 addr=30 occurred in module "zfs" due to a NULL pointer dereference
This zpool is a mirrored volume I use for backups. Just 2 disks, but dedup enabled. Dedup seems to be where my problem is.
My backup script only keeps 60 days of weekly backups for VMs, so the first thing it does is run find -mtime +60 -exec rm -rf {} \; to delete anything older. VMs are easily dedup-able, especially if you’re doing full backups of them and not much of them changes regularly, so my zpool has a dedup factor of 4.78x. When I first started having this panic, I noticed it was around the time of the VM backup cronjob.
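For reference, here is a minimal sketch of what that pruning step could look like. It is not my actual script: the function names and the directory argument are made up for illustration, and -mindepth/-maxdepth assume a find that supports them (GNU find does).

```shell
#!/bin/sh
# Hypothetical helpers illustrating the pruning step; not the original script.

# list_old_backups DIR DAYS
# Print top-level backup directories in DIR last modified more than DAYS days ago.
list_old_backups() {
    backup_dir=$1
    max_age_days=$2
    [ -d "$backup_dir" ] || return 1
    # -mindepth 1 / -maxdepth 1 keep find from matching DIR itself or
    # descending into the backups (assumes a find with these options).
    find "$backup_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days" -print
}

# prune_old_backups DIR DAYS
# Delete the same directories that list_old_backups would print.
prune_old_backups() {
    backup_dir=$1
    max_age_days=$2
    [ -d "$backup_dir" ] || return 1
    find "$backup_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days" \
        -exec rm -rf {} +
}
```

Running list_old_backups first is a cheap dry run before anything gets deleted. On a deduped pool, every one of those deletes has to free deduped blocks, which is the code path the panic points at.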
Googling around brought me to a post (in references below) where someone had a similar experience and tracked it back to code that had to do with freeing deduped blocks, which makes perfect sense in my situation. It seems this can occur randomly or not at all, and I just happened to get lucky.
Resolution
The solution is not ideal. You need to import the pool read-only, back up your data, destroy and recreate the pool, and copy your data back. I tried importing the pool read-only on my OI build, which is supposed to help because ZFS will not keep trying to free deduped blocks, and ended up with a different panic:
assertion failed: zio->io_type != ZIO_TYPE_WRITE || spa_writeable(spa), file: ../../common/fs/zfs/zio.c, line: 2327
So OI seems to have some deep underlying ZFS issues that are only now surfacing… great. I installed Solaris Express 11.1 on a spare machine, moved my pool over and tried importing it read-write, but ended up with the same original kernel panic, so SE11.1 isn’t any better with the pointer dereference. However, importing the pool read-only with this command did work:
zpool import -o readonly=on mypool
Checking my backup log, I see it failed on the first find-and-delete command of the script. An `ls -lrt` shows the oldest backup directory has a modification time matching when my backup script ran, so that is the directory it was trying to delete. Confirmation!
So I rsynced all my important data over to a temporary pool. I skipped the older VM backups as I don’t really need them, so I’m not sure what would happen if I tried copying the directory that was in the process of being deleted when the null pointer dereference occurred. Then it was just a matter of recreating the pool and putting it back on the mirror. However, I’m not going with dedup this time – not worth the wasted time. If I run out of room for backups, I’ll buy bigger disks. They’re cheap enough.
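The whole recovery can be sketched roughly as the sequence below. Treat it as an outline under assumptions rather than exactly what I typed: temppool and the disk names are placeholders, the mount points are assumed to be the default /poolname, and the run wrapper with DRY_RUN is only there so the commands can be printed and reviewed before anything destructive happens.

```shell
#!/bin/sh
# Sketch of the recovery sequence. Pool, temp pool, and disk names are
# placeholders; mount points are assumed to be the default /<poolname>.

# run CMD...: print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rebuild_pool POOL TEMP_POOL DISK_A DISK_B
rebuild_pool() {
    pool=$1
    temp=$2
    disk_a=$3
    disk_b=$4

    # Import read-only so ZFS does not resume freeing deduped blocks.
    run zpool import -o readonly=on "$pool" &&
    # Copy the data off (add --exclude options to skip anything not needed).
    run rsync -a "/$pool/" "/$temp/" &&
    # Export the damaged pool, then recreate it on the same two disks.
    # -f is needed because the disks still carry the old pool's labels.
    run zpool export "$pool" &&
    run zpool create -f "$pool" mirror "$disk_a" "$disk_b" &&
    # dedup is off by default on a new pool; set it explicitly anyway.
    run zfs set dedup=off "$pool" &&
    # Copy the data back onto the fresh mirror.
    run rsync -a "/$temp/" "/$pool/"
}

# Example (prints the commands only):
#   DRY_RUN=1 rebuild_pool mypool temppool c1t0d0 c1t1d0
```

Chaining the steps with && means nothing after a failed copy runs, so the pool is never recreated unless the data made it to the temporary pool first.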
I am going to stick with OI, though. I’ve been mostly happy with it, and with dedup disabled, I’m hoping I won’t run into any more issues like this.
References
[zfs-discuss] kernel panic during zfs import