This article was supposed to be about the challenges of community management. I opened my laptop to work on that article, and found that the hard drive it was on had disappeared. From that moment, and for three weeks in one form or another, I was in recovery mode. This article is not disaster recovery 101. I’m not going to talk about backup solutions, recovery strategies or any of that. Suffice it to say, you should have a good, cloud based backup in place. It should be tested periodically, and you should have plans in place to facilitate recovery if data is lost. We had all of those things. What we discovered, however, were gaps in our understanding, and problems with our processes. These issues caused a simple hard drive failure to be costlier and more time consuming than we ever imagined. It forced us to move project deadlines and put some project components at risk. In this article I’m going to walk you through the unexpected problems, the challenges of recovery, and the impact. Finally, I’ll let you know what we’ve done to change our processes, and better prepare for the next recovery.
This Article Focuses on Marketing Issues
I do a lot of different things at my consultancy, but this article focuses primarily on the impact of disaster recovery on internal and external marketing activities.
Disaster Recovery Terms
There are two key disaster recovery terms you should be aware of:
Recovery Point Objective (RPO): This is the point in time to which you hope to be able to recover following a disaster. For example, the defined RPO might be the previous day. In that case you might back up data every night, and if there is a failure the following day, you can restore all data from the previous day.
Recovery Time Objective (RTO): This is the target length of time for restoring data to the RPO. Generally, this is a defined number of hours such as four, six, or eight hours. Our RTO was one business day, which, in discussions, was generally thought of as six to ten hours, or a long work day. As you can see, our RTO was a bit squishy. We’ll talk more about that later.
Our primary niche is serving hi-tech clients. We have internal subject matter expertise in a number of technical areas. We had a strong understanding of these terms and what they mean on paper. As I was to find out, we did not have a strong understanding of all the things that factor into the Recovery Time Objective.
Setting the Stage
Before we go any further, I need to share the where and when. My timing was impeccable. Home based in Portland, Oregon, I had just arrived in Rochester, New York for a two-day visit with a great client and partner of ours. I had worked on my laptop on the flights out, and in the hotel room after I arrived. After arriving at my client’s headquarters and doing a meet and greet, I settled into the office they had set up for me. My laptop had a solid state C: drive for the OS (Windows 10), and a D: drive on which I stored all my data. When I opened my laptop, I found that it no longer recognized the existence of my D: drive.
3 Lessons Learned Immediately
Taking stock, I could see it was a bad situation, but things could have been worse.
Lesson 1: Cloud Good! My C: drive was alive and well, so I could still boot, and access cloud based apps for marketing like Buffer, Hootsuite, DivvyHQ and more. I could also send email, and just had to download Slack to get back online with our internal group messaging. I could also access all our data that was stored in our corporate Google Drive. So, bottom line, if it’s in the cloud, the threshold for returning to a functional state is much lower. As long as you can get online, you can return to functionality. The benefit cloud apps provide of being able to work from anywhere is often undervalued.
Lesson 2: If it’s not in the cloud it’s not highly available. I was away from our main office. We don’t use on-site backups because we store critical data in the cloud, but even if we had on-site backups they wouldn’t have done me any good. With employees that work on-site and remotely, the cloud is the only way to keep our data available. While this won’t be the case for all organizations, it almost certainly applies to all remote or travelling users.
Lesson 3: Having hotspot capability is a must. I was fortunate to have access to a good Internet connection despite being on the road. I was able to start the restoration process for critical files immediately and get good throughput. There are many scenarios where this would not have been the case. Having a good Wi-Fi hotspot would have been invaluable if I had needed it.
Lessons 4 and 5: Diagnosis Time and Assistance
When planning for disaster recovery, many organizations assume the process is black and white: A computer fails, so you begin recovery. I now believe that this is seldom the case. When my PC didn’t show a D: drive, my first thought wasn’t to initiate recovery, but to try to get access to my drive and save my data. I spent two hours pulling my PC apart so that I could give the crashed drive to my client’s IT support person to check. I spent another two hours, in between client meetings, attempting various drive diagnostic options before I attempted to mount the drive and discovered the data was lost.
Lesson 4: Add hours to your RTO to account for diagnosis. People will naturally spend time trying to save the latest drafts of their documents and most recent data prior to throwing in the towel and moving to restore mode. If failures happen while travelling or at remote locations, the employee, not IT staff, will likely be doing this. Factor this analysis time into your planning and adjust your RTO.
Lesson 5: Have a Plan in Place to Help Remote Users. When considering disaster recovery for remote and travelling users, consider how IT staff might help them perform diagnostics remotely. This may involve having them go to a computer store, buy cables, and pull the hard drive from their PC. Alternatively, it may involve having them take their PC to a trusted nationwide computer recovery or repair center. If working with remote offices that have small staff, have the office keep USB drive connection cables and diagnostics software on-hand in the event of data loss. Try to find ways to give IT staff remote access to devices to perform diagnostics. Finally, have a rapid reaction plan in the event of a total failure, like a process to expedite buying a new PC so the person can get back online quickly.
Lessons 6, 7, and 8: Recovery in the Age of Big Data Can Take a Long Time
Applications, data, and hard drives are huge today. Many PCs ship with one terabyte or more of storage. In addition to base applications, content marketers also tend to have a lot of large files on their computers. Each image on a computer is likely between three and 10 megabytes. Videos can be 20 megabytes and up. Marketers may also have multiple versions of the same file if they are designing graphics to be included in online ads, rich media, or other types of online posts such as infographics. They may also have data files for doing research on marketing personas, keywords, competitors and so forth. Anything that’s been stored locally and marked up will need to be restored in order to pick up the project where it was left off. That’s a lot of data to restore. Restoring files isn’t such a problem over a hi-speed Local Area Network (LAN) connection. But downloading from the cloud is a different story because Internet connections are significantly slower than LAN links. It can take several hours, or even days, to restore large amounts of data.
Lesson 6: Collect real data, do calculations, and test data recovery times. Collect data about how much data would need to be restored for the average user. Know what connection options are available for data restoration. Make sure your plan provides access to the best available connection for data transfer, and make sure you account for restore times in your RTO.
Some Data Restoration Planning Tips
We have some tips and tools to help you plan for data restoration. First, know your network speeds. Wired Ethernet links are generally fastest. Wi-Fi is usually slower than wired Ethernet. Your Internet connection is usually slower than both. A helpful online tool is the File Transfer Time - Data Transfer Speed Calculator. It lets you select your file size and transfer speed, and it estimates the transfer time. Here are some common network speeds, and the time it would take to transfer 4.7 gigabytes of data (the size of a single DVD):
802.11b Wi-Fi (11 megabits per second (Mbps)): About one hour.
802.11g Wi-Fi (54 Mbps): 13 minutes.
802.11n Wi-Fi (300 Mbps): A little over 2 minutes.
Gigabit Ethernet and 802.11ac (1000 Mbps or 1 Gbps): About 40 seconds.
USB SuperSpeed (5 Gbps; you need the port on both devices and the cable): 8 seconds.
It’s possible you have an older network with 10 or 100 Mbps Ethernet, but gigabit Ethernet is fairly standard these days.
Your Internet connection is likely 150 Mbps at most. Depending on your level of service it may be significantly less. Also, you might have dedicated bandwidth or may share bandwidth with your building or neighbors. If you’re travelling and attempting a restore using a hotel Internet connection, I wouldn’t expect anything more than 10 Mbps, and that might be shared by you and your floor. For comparison, transferring a 4.7 GB file over an Internet connection:
10 Mbps Internet connection: A little over an hour.
150 Mbps Internet connection: About four minutes.
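If you would rather run these numbers yourself than use an online calculator, the arithmetic is simple enough to script. The sketch below is only an illustration: the function names are hypothetical, it assumes decimal units (1 GB = 8,000 megabits), and it assumes the link runs at its full rated speed. Real transfers are slower because of protocol overhead and shared bandwidth, and results can differ slightly from a calculator that uses binary units.

```python
def transfer_seconds(size_gb: float, speed_mbps: float) -> float:
    """Idealized time in seconds to move size_gb gigabytes at speed_mbps.

    Assumes decimal units (1 GB = 8,000 megabits) and a link running at its
    full rated speed, which real networks rarely sustain.
    """
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    return size_gb * 8 * 1000 / speed_mbps


def friendly(seconds: float) -> str:
    """Format a duration as seconds, minutes, or hours, whichever reads best."""
    if seconds < 90:
        return f"{seconds:.0f} seconds"
    if seconds < 90 * 60:
        return f"{seconds / 60:.0f} minutes"
    return f"{seconds / 3600:.1f} hours"


if __name__ == "__main__":
    dvd_gb = 4.7  # the single-DVD example used in the lists above
    for mbps in (10, 150, 1000):
        print(f"{mbps} Mbps: {friendly(transfer_seconds(dvd_gb, mbps))}")
```

Swap in your own data size and the measured speed of the connection you would actually use for a restore, then treat the result as a best case.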
If you’re on the road, and need to recover, doing what I did, and camping out in a spare office at your client’s building to use their hi-speed Internet connection can work very well. If you don’t have that option, look for cube rental or daily office rental facilities. These companies offer work spaces for people who don’t need dedicated offices all the time. Make sure they offer hi-speed Internet connections, and get specifics about the connection speed.
How much data should you plan on? A good place to start is looking at the size of program files or other application installation folders, as well as the size of any cloud storage services like Google Drive, Dropbox, Box, or OneDrive. Then add any other data. I had to restore about 158 GB of data. About 6 GB of that was applications; the rest was data. Our Google Drive synced up over the course of about three days once I got my new PC. In the meantime, I could selectively find and download any specific files I needed using the web interface for Google Drive. The big problem was the files that weren’t stored in the cloud.
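One way to get a real number for a given user is to total the size of the folders that would have to come back after a failure. The script below is a minimal sketch of that idea, not a tool we used: the function names are hypothetical, the folders to measure are whatever you pass on the command line, and files that cannot be read are simply skipped.

```python
import os
import sys


def folder_size_bytes(root: str) -> int:
    """Total size of all regular files under root, skipping unreadable ones."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):  # do not count link targets
                    total += os.path.getsize(path)
            except OSError:
                continue  # file vanished or access was denied
    return total


def to_gb(size_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1,000,000,000 bytes)."""
    return size_bytes / 1_000_000_000


if __name__ == "__main__":
    # Usage: python restore_size.py <folder> [<folder> ...]
    grand_total = 0
    for folder in sys.argv[1:]:
        size = folder_size_bytes(folder)
        grand_total += size
        print(f"{folder}: {to_gb(size):.1f} GB")
    print(f"Total to restore: {to_gb(grand_total):.1f} GB")
```

Run it against the application folders, the cloud sync folders, and any local data folders, and use the total as the input to your restore time estimate.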
As I said earlier, our RTO was a little squishy. Six hours to a long work day. So let’s stop right here and define the term squishy. Squishy is a cute little way of saying undefined or non-standard practices, processes, and software that are the result of poor planning, bad habits, and human nature. In a disaster recovery scenario, squishy practices and processes will cost you lots of extra time, lots of extra money, and will likely impact internal and external projects and deadlines.
Let’s look at the issue of non-standard software first. There are two issues:
Non-standard applications: This could be any application a marketer uses to do their job. Something they purchased, installed, or otherwise brought into the company. Before you say that type of thing can’t happen at your company, it most certainly can. Do you hire people? If so, someone can bring something with them. Something they’ve always used, or something they bought as a student, using a student license, and are now using in your company (which is also a license violation). This can be sound, video, or image editing software, wireframing software, or any number of marketing or content creation tools. Why do people do this? Content marketers have a lot to do and a finite amount of time to get it done. If they know how to do something quickly and with quality using a tool they know, and they have or can get access to that tool, they will use that tool.
Non-standard software installation files and licenses: If a non-standard application is lost, it will need to be reinstalled. The first thing you’ll need to do is find the license. I conveniently kept mine on the drive that crashed. If you can’t find the license, you have to retrieve it from the vendor online. Then you need to install the application. Since it’s non-standard, there’s a good chance it isn’t stored on the server or in your cloud storage. If you don’t have a copy of the installation files, the software will have to be downloaded again. But it may not be that simple. If the software is over two years old, that version may no longer be available. Now you have to upgrade to a new version or learn some other software. This all takes time, in some cases lots of time. It can take one to three hours to find a license and get the software installed and configured. This can also happen at the worst time. You might not realize you need an application until it’s required for a deliverable, either during or long after your disaster recovery. Then, what should take 10 minutes takes one to three hours, which can squeeze your deliverable deadlines, or force you to adjust the deliverable to work around the missing software.
Lesson 7: Identify non-standard software and deal with it. If it’s something you need, make sure the installation files are in your cloud storage where they can be accessed if necessary. Make sure the licensing is good and legal for use in your business. Implement a new policy that allows employees to request non-standard software to prevent future problems. Consider installing software inventory management tools to identify and manage installed applications and track licenses.
Lesson 8: Implement per-system image backup and restore. Tools such as Norton Ghost and others can take image backups of entire systems. Do this periodically, such as monthly, for each user’s system. This option could quickly bring users back online, with their tools, on a new or temporary computer if their primary system fails.
Tips for Identifying Non-Standard Software
Deciding what to do about non-standard software is the easy part. Identifying it may not be so easy. Employees don’t want to get into trouble for bringing in non-standard software, or using student licensed software for work. Make it simple, make it easy. Coordinate with the IT department. If you have a software inventory management tool, take an inventory, and talk to employees about what you find. Otherwise, call a meeting and ask your team. Make it non-threatening, and offer amnesty for policy violations. You’re trying to save time and money in the event of a data loss. Make that clear, and focus on that goal.
Lesson 9: Practice Good Asset and Revision Management
When marketers are working on content and other deliverables, they are creating or revising many different types of content. Frequently, content is developed in a collaborative environment with teams contributing to the final product. Through this disaster recovery we found that how individual team members work and save their data is critical to having true fault tolerance, and quick recoverability. Let’s address each of the two areas:
Asset storage: We do our content planning and graphic content creation in DivvyHQ. Our policy was to post assets to DivvyHQ so everything was in our content collaboration tool. This policy was made two years ago, and never strictly enforced. The result was that, while everything started in DivvyHQ, once ideas were finalized and tasks assigned, assets were only copied back to DivvyHQ part of the time. Instead, team members used other tools to share and collaborate. I was able to get to DivvyHQ within 30 minutes of my data loss. I had three content projects due. I had to spend two hours running down the assets for my projects just to get back to the point where I’d left off. The assets were in Slack, Google Drive, and some, like original stock images, were in local download folders. I had to ask team members to post them before I could access them. Sorting through assets can take a lot of time in a marketing scenario. There are often several versions of a single image to sort through, with different sizes, resolutions, and graphic modifications. Had the assets I needed been in DivvyHQ, I would have been able to resume work on my deliverables 30 minutes after I sat down to work on them, instead of two hours later.
Draft and revision storage: This is a similar problem to the one above. Our policy was to post content revisions at the end of every day. That’s a bad policy. Everyone saves locally when they stop working on something, but that’s not good enough. The hard drive you’re saving to might not be there when you boot up again. Remember that cloud storage services like Google Drive and Dropbox store local copies first, then sync to the cloud. If your Internet is down, or you’re using slow hotel Internet, syncing may not happen right away or at all. Several of my recent revisions were stored locally on my PC. I had not posted them to DivvyHQ. They were lost, and rewriting them took way too much time while I was also trying to get deliverables out the door. Posting to cloud collaboration tools needs to be as second-nature as saving. Simply put, it’s not saved until it’s in the collaboration tool or the cloud.
Lesson 9: Save assets and revisions to the collaboration tool or the cloud as soon as you are done with them. Post them to the designated cloud tool, or location, and confirm synchronization has occurred. It sounds tedious, but it takes only a few seconds, and it can save hours and prevent missed deadlines and deliverables if a data loss occurs. It also ensures that other team members can easily step in to help or take over for you if need be.
Lessons 10 and 11: Plan Your Recovery Hours Objective
I think we need to add a new term to the disaster recovery vernacular, Recovery Hours Objective (RHO). RHO should describe the hours spent by non-IT staff in the recovery process. The Recovery Time Objective usually describes the time it takes for IT staff to hand the affected employee a new system, with their core applications and access to cloud data. From that point, the person affected will need to take additional time that should be counted as part of the recovery process such as:
All of the things we’ve discussed thus far.
Configuring and personalizing applications.
Waiting for data to download (if it’s restoring from cloud based storage).
Accessing non-standard cloud software they’ve subscribed to.
Rebuilding browser favorites pages.
Resetting browser based password access (which they shouldn’t be using anyway, but many marketers do), to the cloud applications and portals.
Continuing to hunt for missing data and helping IT staff locate the best versions to restore. Marketers typically have lots of projects, and don’t always realize what data is missing until they revisit a project due to a deliverable deadline or a question from a team member. You won’t always know what data you’re missing right away after a data loss. Especially if you’ve been “squishy” about where you stored things.
The thing to remember about RHO is that every hour a person spends hunting for, sorting through, configuring, and restoring applications and data is an hour they cannot spend doing their real job. It impacts the workload of other team members, it impacts costs, it impacts deliverable quality, and it impacts deadlines.
Lesson 10: Figure on 50% of work hours for the five work days following a disaster, with that share halving each week for three weeks. To state it another way, you’ll spend half your time (20 hours out of a 40-hour week) on data recovery the first work week after a disaster, 25 percent (10 hours) the second week, and about 12 percent (roughly 5 hours) the third week.
Lesson 11: If you do the things we recommend in this article, your RHO drops dramatically. If you make a plan to deal with non-standard software, make sure assets, revisions, and data are in the cloud and collaboration tools, and use per-system imaging, your RHO will drop significantly. It won’t go to zero, because data has to be restored and applications configured, but it may only take a few hours.
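The rule of thumb in Lesson 10 is easy to adapt to your own work week. The sketch below is illustrative only: the function name and parameters are hypothetical, and the defaults (a 40-hour week, half of the first week lost, three weeks of recovery) come straight from the rule of thumb rather than from any formal model.

```python
def recovery_hours_by_week(
    work_week_hours: float = 40.0,
    first_week_share: float = 0.5,
    weeks: int = 3,
) -> list[float]:
    """Hours per week spent on recovery, halving every week.

    Week 1 uses first_week_share of the work week; each later week uses
    half the share of the week before it.
    """
    return [work_week_hours * first_week_share / (2 ** week) for week in range(weeks)]


if __name__ == "__main__":
    schedule = recovery_hours_by_week()
    for number, hours in enumerate(schedule, start=1):
        print(f"Week {number}: {hours:g} hours")
    print(f"Total: {sum(schedule):g} hours")
```

With the defaults, the schedule works out to 20, 10, and 5 hours. Multiply the total by the number of affected people and their hourly cost to get a planning figure for RHO.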
Lesson 12: Maintain and Replace Old Hardware When It’s Time
My PC had been throwing errors for months. We’d run diagnostics, and they were inconclusive. A few months before the hard drive crashed, we replaced the RAM, because that’s what the diagnostics pointed to. I never took it out of service to do a thorough diagnostic test. Despite being 13 months past its scheduled replacement age of two years, it worked fine except for these occasional lock ups. The budget was there for a new PC, but frankly I hate the solid week it takes to build a new PC from scratch, transfer all the apps and data, and get it working just the way I like.
I had to do all of that anyway, 3,000 miles from our office, and during a time when I was working on our website redesign and the launch of our new training offering.
Had I turned in my system for a thorough diagnosis, or bought a new one on schedule, I could have done all of that in a place of my choosing, before I had two major internal projects and three client projects on my plate.
Lesson 12: Maintain and replace your hardware when it’s time. Planned downtime is so much better and less impactful than unplanned downtime. It’s the difference between pulling yourself or a team member off of internal and external deliverables for a week, and fighting through each of them for two weeks just to come up for air.
Statistics from My Disaster Recovery
Disaster Date: August 23rd
Last date I performed any recovery related tasks: September 15
Time until I was back online and functional with cloud applications: 30 minutes.
Time diagnosing the drive to attempt direct recovery before restoring from backup: 4 hours.
Time to recover my most critical files (about 25 MB): 1 hour.
Applications restored (in size): 6 GB.
Time to restore 152 GB of data from cloud backup (BackBlaze): 3 Days doing selective restores non-sequentially (could have been done in 1.5 days sequentially).
Internal projects impacted: 2 (website re-design and launch of our new training classes).
Internal projects delayed: 2 (by two weeks each).
Client projects impacted:
Client projects delayed: 1 (an article for Social Media Today).
Total hours spent doing recovery related tasks: 60 (25 to remain functional with my old laptop while traveling, and 35 to set up my new laptop, install applications and transfer data).
Total recovery cost (not counting project and deliverable delays): $6,000, factoring in my internal billable rate.
Percentage of our staff that would have experienced the same problems if they lost data: 100%.
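As a rough cross-check, a couple of the statistics above can be turned into rates that are useful for planning. The sketch below is my own back-of-the-envelope arithmetic, not something measured during the recovery: the function names are hypothetical, it assumes decimal units, and it assumes the restore ran around the clock for the stated number of days.

```python
def effective_mbps(size_gb: float, days: float) -> float:
    """Average throughput in Mbps for restoring size_gb over the given days.

    Assumes decimal units (1 GB = 8,000 megabits) and a restore that runs
    24 hours a day, which overstates the time actually spent transferring.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    return size_gb * 8 * 1000 / (days * 24 * 3600)


def cost_per_hour(total_cost: float, hours: float) -> float:
    """Implied hourly cost of recovery work."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_cost / hours


if __name__ == "__main__":
    print(f"152 GB in 1.5 days: {effective_mbps(152, 1.5):.1f} Mbps average")
    print(f"152 GB in 3 days:   {effective_mbps(152, 3):.1f} Mbps average")
    print(f"$6,000 over 60 hours: ${cost_per_hour(6000, 60):.0f} per hour")
```

The averages land in the single-digit Mbps range, a reminder that a cloud restore of this size is measured in days, not hours, on an ordinary Internet connection.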
Over to You
If you have experienced some of these issues, or have insights about how to mitigate the impact of data loss or computer system failure, please let me know in the comments.