Learning to forget: Why web companies need to fix their data archiving policies

The good old days of paper records.  Image courtesy Ed Uthman via CC license.

It’s time for web companies to learn how to forget.  It’s particularly time for Web 2.0 companies to learn how to forget.

The digital nature of the Internet makes it easy for websites to collect massive amounts of data: every click, every interaction, every search term, every referrer, every error… you get the idea.   This massive data harvest can be dumped into a SQL database to be analyzed, cross-tabulated, summed, totaled, averaged, and dissected.  In general, this is good.  Web companies should learn from their visitors, and web companies should take advantage of the power of digital data collection.  Important trends can be spotted, and products can be improved.

The problem comes when companies keep too much data too long.  Take the example of a search engine.  To a search engine, it is very useful to know what search terms are popular today:  Google uses each day’s search terms to compile a list of the hottest search terms of the day, and undoubtedly uses the same data for anti-spam and quality control.  So far, so good.  Google is using its data in interesting ways for an appropriate amount of time.

More specifically, the problem comes when data connecting search terms to individual users is kept too long.  Six months from now, your individual search queries don’t matter to the business.  Maybe there’s some data that is useful in the aggregate (like the hottest search terms of the year, used to create Google Zeitgeist), but Google doesn’t need to know who entered each search query; the data has become stale and less valuable.  Keeping non-aggregate data around too long is an invitation to privacy breaches, like what happened in 2006 when AOL released the search histories of hundreds of thousands of users.  AOL claimed that the data was anonymized, but it was possible to identify many individuals from the content of their queries.  Even more data can be revealed when web servers are hacked: Google says that its servers were recently attacked from China, and it is not publicly known how much data was accessed.  The more data that was still on Google’s servers, the more data could have been revealed.  The same goes for insider theft, computers left unsecured, and any other means of getting at the data.

To put it simply, the cost-benefit tradeoff of keeping data changes as the data gets older.  The benefit of keeping data decreases as it ages; data that has business value today (like clickstream data, search queries, and website interactions) loses value over time because it becomes too stale to use for business decisions.  If long-term trends need to be spotted, then data can be aggregated and the original fine-grained data destroyed.
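
As a rough illustration of "aggregate, then destroy," here is a minimal Python sketch.  The record layout (user, query, timestamp), the function name `aggregate_and_forget`, and the 180-day cutoff are all assumptions made up for this example, not a description of how any search engine actually works.

```python
from collections import Counter
from datetime import datetime, timedelta


def aggregate_and_forget(records, now, max_age=timedelta(days=180)):
    """Return (kept_records, aggregate_counts).

    Records older than max_age are reduced to per-month query counts with
    no user identifier; newer records are kept unchanged.
    The (user_id, query, timestamp) layout is hypothetical.
    """
    cutoff = now - max_age
    kept = []
    counts = Counter()
    for user_id, query, ts in records:
        if ts < cutoff:
            # Keep only the trend: which query, in which month.
            # The user_id is deliberately dropped here.
            counts[(ts.strftime("%Y-%m"), query)] += 1
        else:
            kept.append((user_id, query, ts))
    return kept, counts


now = datetime(2010, 1, 22)
records = [
    ("alice", "snooki", datetime(2009, 3, 1)),
    ("bob", "snooki", datetime(2009, 3, 15)),
    ("alice", "privacy policy", datetime(2010, 1, 20)),
]
kept, counts = aggregate_and_forget(records, now)
print(kept)    # only the recent record still carries a user id
print(counts)  # old records survive only as monthly totals
```

One caveat: a count for a very rare query can still point to a single person, so a real policy would probably also drop aggregates that fall below some minimum count.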

But the cost of keeping old data doesn’t decrease: to end-users, revealing old data can be just as harmful as revealing new data.  A site that reveals embarrassing search queries from 2 years ago is just as dangerous as a site that reveals embarrassing search queries from last week.  Here, Web 2.0 companies are particularly at risk.  They know a ton about users’ social, political, and inner lives, and that information is often very personal.  They often know every interaction between two users.  What profiles have you been clicking on?  What messages have you been sending?  Who have you “poked” lately?  Were you on the Jersey Shore fanpage for an hour looking at pictures of Snooki?  A site that collects this information is constantly at risk of losing it.

The solution is to destroy data, or at least take it offline and preferably move it into non-digital form.  Search engines have recognized this in part, and have generally similar plans to destroy clickstream data within 6-18 months.  But it’s not clear that a lot of Web 2.0 companies do.  I know that many of my old Facebook interactions are still stored in a production database because I can still access them.  There is simply no need for this data to still be in a production database that is vulnerable to hacking, data leaks, insider theft, and more.  One data security incident could reveal the entire history of social interactions on the site.  This is a privacy Sword of Damocles, silently hanging over every user’s head.  What embarrassing thing have you done on Facebook in the last few years?  What private messages have you sent?  With one data dump, it could all be revealed.

Instead, Facebook could simply announce a policy to archive all interactions more than 12 months old, then move them offline.  Or it could just delete them entirely: do we really need 5 years of history of “pokes”?  Or, if users really want to keep their data, then let users download an archive with all their interactions and delete them from the server.
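
For what it’s worth, the mechanics of such a policy are not complicated.  The sketch below is hypothetical: the field names (`user`, `kind`, `at`), the function name, and the in-memory list standing in for a production database are all invented for illustration.  It splits interactions at a 12-month cutoff and bundles the old ones into a per-user archive that could be offered as a download before the originals are deleted from the server.

```python
import json
from datetime import datetime, timedelta


def archive_old_interactions(live, now, max_age=timedelta(days=365)):
    """Split interactions into those that stay live and per-user archives.

    `live` is a list of dicts with hypothetical keys "user", "kind", and
    "at" (an ISO-8601 timestamp string).  Returns (still_live, archives),
    where archives maps each user to a JSON string of their old items.
    """
    cutoff = now - max_age
    still_live = []
    old_by_user = {}
    for item in live:
        if datetime.fromisoformat(item["at"]) < cutoff:
            old_by_user.setdefault(item["user"], []).append(item)
        else:
            still_live.append(item)
    archives = {
        user: json.dumps(items, indent=2)
        for user, items in old_by_user.items()
    }
    return still_live, archives


now = datetime(2010, 1, 22)
live = [
    {"user": "dave", "kind": "poke", "at": "2008-06-01T12:00:00"},
    {"user": "dave", "kind": "message", "at": "2009-12-31T09:30:00"},
]
still_live, archives = archive_old_interactions(live, now)
print(still_live)        # the recent message stays in production
print(archives["dave"])  # the old poke goes into dave's archive
```

The hard part is not this loop; it is deciding on the cutoff, and then making sure the archived rows really are removed from production and from backups.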

To be fair, forgetting is hard.  Why don’t web companies forget more often?  Often, it’s just inertia.  It takes programmers’ energy to archive data, and it takes careful business decisions to determine when and how to archive data.  Sometimes it’s like an overdue library book: you know that you need to return it, but you just never get around to it until it is very overdue.

Sometimes, the good old days are best.  Remember paper files?  Paper records are nothing like digital ones: they are slow to process, hard to store, and degrade over time.  But maybe those are features rather than bugs.

In bullet points:

  • Web companies collect massive amounts of data
    • Clickstream, social interactions, emails and messages, credit cards and payment info, preferences, actions, and activities…
  • It often seems easier to keep old data than delete it
    • Disk space is nearly free, and databases make it easy to keep old records
    • Programmers often think that old data will have some kind of marketing value
    • Archiving is a pain
  • But old personal data can be embarrassing or dangerous
    • Information about people’s finances, social lives, and political beliefs can cause embarrassment
    • Some data that seems benign (like your Netflix movie rentals) can reveal a lot more (like your sexual orientation)
    • Some data that has identifying information removed can still be used to identify you (like your AOL search queries)
    • Information about people’s names, addresses, and family can cause safety issues and encourage identity theft
  • That said, information about places, things, and science should be more available
    • News reports, scientific papers, and scientific data generally do not present the same privacy problems
  • Old digital data is particularly likely to be problematic
    • Data that is instantly accessible in a production database is instantly accessible to a hacker or data accident
    • Insiders can leak data, intentionally or accidentally
    • Once out, it can be digitally scanned, searched, sorted, and remixed
  • Old data is less likely to be useful in a live environment
  • There are solutions
    • Move content into an archive that the user controls
    • Delete marketing and clickstream data
    • Research and trend data can be aggregated
  • There’s something to be said for paper records.  Paper records have a very high transaction cost; that can be a feature, not a bug.

6 comments ↓

#1 Raoul Duke on 01.22.10 at 1:40 pm

word.

#2 Rob Frappier on 01.22.10 at 2:13 pm

Usually, I wouldn’t approve a one word comment, but Raoul Duke gets the special treatment.

#3 Colm on 01.22.10 at 5:25 pm

Two words.

#4 Josh Zytkiewicz on 01.24.10 at 10:32 pm

I really disagree with you on this. Deleting data just because it’s old? How much would we not know about the past if our fathers and grandfathers threw away old letters and journals after a year?

Or transferring it to paper form? Aren’t businesses, governments, and other organizations spending millions of dollars converting paper to digital?

#5 Dave Thompson on 01.25.10 at 11:34 am

Hi Josh,

Thanks for writing. I think we agree in part. There’s lots of data that should be kept permanently: scientific information, news reports, and that sort of thing. And users should be given the choice whether to preserve their communications; if you want your grandchildren to be able to read your emails and Facebook messages 50 years from now, then that’s your choice.

But to many people, the risk of their personal data being disclosed outweighs the benefits of nostalgia. When lots of information gets kept for too long, we end up with people getting their privacy invaded when AOL releases their search queries or Netflix releases their video rental preferences. Most grandkids do not want to know every search query you have ever entered into Google; it’s just not very interesting to them. But keeping a list of search queries (or movie rentals, or clicks in a social network) creates a risk of privacy invasions when data gets hacked, leaked, analyzed, and so forth.

The point of paper records is that they are not vulnerable to large-scale hacking and distribution. If you really need to keep a piece of information forever, put it on paper and store it in a salt mine. I think we’ll see problems as the government converts paper records to digital: digital records are vulnerable to being copied and stolen. There are plenty of records that should be digital, but I’ll bet good money that we’ll see a serious hacking incident based on newly-digital records.

Maybe the right answer is even more choice? If you want to download your clickstream data and preserve it for posterity, you should be able to. But I would much prefer that mine get deleted.

Thanks for writing,

Dave Thompson

#6 Josh Zytkiewicz on 01.25.10 at 10:30 pm

I agree that more choice would be good. Though I would do the opposite of you, saving it unless you proactively ask to delete it.

The thing about information is that you never know what’s going to be important. To you or me, our daily searches on Google are practically meaningless, but who knows how important those could be in the decades or centuries to follow.

Take the Rosetta stone for example. The actual text is about taxes on temple priests, pretty mundane stuff. But without it we wouldn’t be able to read any ancient Egyptian hieroglyphs.

I just feel it’s pretty short-sighted to destroy information or make it harder to access, when there are so many instances of humanity losing information forever.
