Sep 10 2008

Google Improves Privacy, Petulantly

Category: Uncategorized | Oren Hurvitz @ 11:19 pm

Google have announced that they’ll reduce the amount of time that they keep individually-identifiable information about searches from 18 months to 9 months. I would like to think that my previous post on this topic played a small part in this decision, but it looks like it was mostly due to pressure from the European Union.

In that post, I showed a request that I had sent to Google, asking them to reduce the amount of time they keep private data to just one month. I asked everyone who read the post to send a similar request to Google, and judging from the comments some people did so. For the record, I did receive a reply from Google, but it was just a standard email that didn’t actually address my points: it just reiterated their position on data retention. In fine Google tradition, it would appear that no human was involved in sending that response.

Although I would like to see Google reduce the retention period further, to one month, this is a big step in the right direction. Google deserves credit for listening to the public and changing their practices. It is therefore unfortunate that they chose to pepper this announcement with vague threats:

  • “Back in March 2007, Google became the first leading search engine to announce a policy to anonymize our search server logs in the interests of privacy. [...] Although that was good for privacy, it was a difficult decision because the routine server log data we collect has always been a critical ingredient of innovation.”
  • “When we began anonymizing after 18 months, we knew it meant sacrifices in future innovations in all of these areas [search quality, security, fighting fraud and reducing spam]. We believed further reducing the period before anonymizing would degrade the utility of the data too much and outweigh the incremental privacy benefit for users.”
  • “While we’re glad that this will bring some additional improvement in privacy, we’re also concerned about the potential loss of security, quality, and innovation that may result from having less data.”

What on earth could they mean?

Translation #1: “OK, world, you win, we’ll keep the data for less time. But you’re going to be sorry!”

Translation #2: “We’ve now reduced our retention period as far as humanly possible, and then some. Please don’t make us reduce it any more!”

Google are keeping this discussion (of how long to keep the data) at a superficial level: they throw a number (“18 months”), the European Union throws a number (“6 months”?), I throw a number (“1 month”). You, too, can become a highly respected privacy advocate by coming up with your own number (that no one else has claimed yet) and writing about it!

A more substantive discussion would require Google to reveal some of their cards: how much of a benefit to fraud protection do they derive from keeping this data for 9 months (vs. a shorter length of time)? How do 9 months of individually-identifiable information help them improve their algorithms vs. 1 month of such information, especially given that they will always have an unlimited amount of anonymized data?

Of course, Google will never reveal this information because it would hurt their competitive position. But an experienced programmer, well-versed in the art, can make some reasonable guesses.

Individually-identifiable information is most important for security, fraud prevention, and fighting spam. But since these are time-sensitive tasks, the information quickly loses its value. I believe that the residual value of this information is close to zero after a few weeks have passed.

The other use for this data, improving search quality, can be handled with anonymized data for the most part. One example that Google commonly give is their automatic spell checker. But they don’t need individually-identifiable information in order to figure out that people who search for “brittaney” really mean “britney”. Yes, I can envision some types of search quality improvements that would benefit from studying individually-identifiable information, but they are a minority, and Google can learn how to do that while keeping data for a shorter period of time. I therefore stand by my position that 1 month of private data would strike the right balance between privacy and security/fraud prevention/spam detection/search quality.
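To make the spell-checker point concrete, here is a minimal sketch of how a correction could be derived from nothing but aggregated query counts. This is my own illustration, not Google's actual method: the function names, the sample data, and the thresholds (`min_similarity`, `min_ratio`) are all assumptions chosen for the example. The point is only that the input is a table of "query, number of times seen", with no user IDs, cookies, or IP addresses anywhere.

```python
from collections import Counter
from difflib import SequenceMatcher


def build_query_counts(queries):
    """Aggregate raw query strings into counts; no user or IP data is kept."""
    return Counter(q.strip().lower() for q in queries)


def suggest(query, counts, min_similarity=0.75, min_ratio=5):
    """Return a much more popular, similar-looking query, or None.

    min_similarity and min_ratio are illustrative thresholds (assumptions),
    not values taken from any real search engine.
    """
    query = query.strip().lower()
    own_count = counts.get(query, 0)
    best = None
    for candidate, seen in counts.items():
        if candidate == query:
            continue
        # Only suggest queries that are far more common than the one typed.
        if seen < min_ratio * max(own_count, 1):
            continue
        # Only suggest queries that look like the one typed.
        if SequenceMatcher(None, query, candidate).ratio() < min_similarity:
            continue
        if best is None or seen > counts[best]:
            best = candidate
    return best


# Hypothetical anonymized log: just query strings, nothing about who sent them.
sample_counts = build_query_counts(
    ["britney"] * 50 + ["brittaney"] * 3 + ["brittany"] * 4 + ["weather"] * 40
)
print(suggest("brittaney", sample_counts))
```

With this sample data, `suggest("brittaney", sample_counts)` picks "britney", because it is both similar in spelling and far more frequent, while `suggest("britney", sample_counts)` returns `None`. A real system would be far more sophisticated, but nothing in this kind of analysis needs to know which individual typed which query.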

(Photo by lesprit_descalier)