Let’s talk about… lagged copies.
For most Exchange administrators, the subject of lagged database copies falls somewhere between “the Kardashians’ shoe sizes” and “which of the 3 Stooges was the funniest” in terms of interest level. The concept is easy enough to understand: a lagged copy is merely a passive copy of a mailbox database where the log files are not immediately played back, as they are with ordinary passive copies. The period between the arrival of a log file and the time when it’s committed to the database is known as the lag interval. If you have a lag interval of 24 hours set to a database, a new log for that database generated at 3pm on April 4th won’t be played into the lagged copy until at least 3pm on April 5th (I say “at least” because the exact time of playback will depend on the copy queue length). The longer the lag interval, the more “distance” there is between the active copy of the mailbox database and the lagged copy.
Lagged copies are intended as a last-ditch “goalkeeper” safety mechanism in case of logical corruption. Physical corruption caused by a hardware failure will happen after Exchange has handed the data off to be written, so it won’t be replicated. Logical corruption introduced by components other than Exchange (say, an improperly configured file-level AV scanner) that directly write to the MDB or transaction log files wouldn’t be replicated in any event, so the real use case for the lagged copy is to give you a window in time during which logical corruption caused by Exchange or its clients hasn’t yet been replicated to the lagged copy. Obviously the size of this window depends on the length of the lag interval, and whether or not it is sufficient for you to a) notice that the active database has become corrupted b) play the accumulated logs forward into the lagged copy and c) activate the lagged copy depends on your environment.
The prevailing sentiment in the Exchange world has largely been “ I do backups already so lagged copies don’t give me anything.” When Exchange 2010 first introduced the notion of a lagged copy, Tony Redmond weighed in on it. Here’s what he said back then:
For now, I just can’t see how I could recommend the deployment of lagged database copies.
That seems like a reasonable stance, doesn’t it? At MEC this year, though, Microsoft came out swinging in defense of lagged copies. Why would they do that? Why would you even think of implementing lagged copies? It turns out that there are some excellent reasons that aren’t immediately apparent. (It may help to review some of the resiliency and HA improvements delivered in Exchange 2013; try this this excellent omnibus article by Microsoft’s Scott Schnoll if you want a refresher.) Here are some of the reasons why Microsoft has begun recommending the use of lagged copies more broadly.
1. Lagged copies are better in 2013
Exchange 2013 includes a number of improvements to the lagged copy mechanism. In particular, the new loose truncation feature introduced in SP1 means that you can prevent a lagged copy from taking up too much log space by adjusting the the amount of log space that the replay mechanism will use; when that limit is reached the logs will be played down to make room. Exchange 2013 (and SP1) also make a number of improvements to the Safety Net mechanism (discussed fully in Chapter 2 of the book), which can be used to play missing messages back into a lagged copy by retrieving them from the transport subsystem.
2. Lagged copies are continuously verified
When you back up a database, Exchange checks the page checksum of every page as it is backed up by computing the checksum and comparing it to the stored checksum; if that check fails, you get the dreaded JET_errReadVerifyFailure (-1018) error. However, just because you can successfully complete the backup doesn’t mean that you’ll be able to restore it when the time comes. By comparison, the Exchange log playback mechanism will log errors immediately when they are encountered during log playback. If you’re monitoring event logs on your servers, you’ll be notified as soon as this happens and you’ll know that your lagged copy is unusable now, not when you need to restore it. If you’re not monitoring your event logs, then lagged copies are the least of your problems.
3. Lagged copies give you more flexibility for recovery
When your active and passive copies of a database become unusable and you need to fall back to your lagged copy, you have several choices, as described in TechNet. You can easily play back every log that hasn’t yet been committed to the database, in the correct order, by using Move-ActiveMailboxDatabase. If you’d rather, you can play back the logs up to a certain point in time by removing the log files that you don’t want to play back. You can also play messages back directly from Safety Net into the lagged copy.
4. There’s no hardware penalty for keeping a lagged copy
Some administrators assume that you have to keep lagged copies of databases on a separate server. While this is certainly supported, you don’t have to have a “lag server” or anything like unto it. The normal practice in most designs has been to store lagged copies on other servers in the same DAG, but you don’t even have to do that. Microsoft recommends that you keep your mailbox databases no bigger than 2TB. Stuff your server with a JBOD array of the new 8TB disks (or, better yet, buy a Dell PowerVault MD1220) and you can easily put four databases on a single disk: the active copy of DB1, the primary passive copy of DB2, the secondary passive copy of DB3, and the lagged copy of DB4. This gives you an easy way to get the benefits of a 4-copy DAG while still using the full capacity of the disks you have: the additional IOPS load of the lagged copy will be low, so hosting it on a volume that already has active and passive copies of other databases is a reasonable approach (one, however, that you’ll want to test with jetstress).
It’s always been the case that the architecture Microsoft recommends when a new version of Windows or Exchange is released evolves over time as they, and we, get more experience with it in the real world. That’s clearly what has happened here; changes in the product, improvements in storage hardware, and a shift in the economic viability of conventional backups mean that lagged copies are now much more appropriate for use as a data protection mechanism than they were in the past. I expect to see them deployed more and more often as Exchange 2013 deployments continue and our collective knowledge of best practices for them improves.