The reams of data that many modern businesses collect, dubbed “big data,” can provide powerful insights. This data is the key to Netflix’s recommendation engines, Facebook’s social ads, and even Amazon’s methods for speeding up the new Web browser, Silk, which comes with its new Fire tablet. But big data is like any powerful tool: using it carelessly can have dangerous results.
A new paper by Kate Crawford, an associate professor at the University of New South Wales, and Microsoft senior researcher Danah Boyd spells out the reasons that businesses and academics should proceed with caution. While privacy invasions, both deliberate and accidental, are obvious issues, the paper also warns that data can easily be incomplete and distorted. “With big data comes big responsibilities,” says Crawford. “There’s been the emergence of a philosophy that big data is all you need,” she adds. “We would suggest that, actually, numbers don’t speak for themselves.”
Google is a poster child for the power of data. The company has transformed a massive amount of information, gathered through its search engine, into a commanding ad network and a powerful role as the gatekeeper of much of the world’s information. Google’s director of research, Peter Norvig, demonstrated the true power of a large data set, using the example of machine translation. With enough data, Norvig said, even the worst training algorithm performs far better than what can be achieved with a smaller data set.
But Crawford and Boyd’s work shows that studying large data sets still requires finesse. Twitter, which is commonly scrutinized for insights about people’s moods, attitudes toward politics, and other aspects of daily life, presents a number of problems, the researchers say. About 40 percent of Twitter’s active users sign in to listen, not to post, which, Crawford and Boyd say, suggests that posts could come from a certain type of person rather than a random sample. They also note that few researchers have access to all Twitter posts; most use smaller samples provided by the company. Without better information about how those samples were collected, studies could arrive at skewed results, they argue.
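The sampling concern can be made concrete with a small sketch. Everything in it is hypothetical: the 100-user population, the `User` fields, and the opinion split are invented for illustration and do not come from the paper or from Twitter. Only the 40 percent share of listeners follows the figure quoted above. The sketch assumes that listeners and posters hold different views, then compares the approval rate across all users with the rate a study would measure from posts alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    posts: bool     # False for "listeners" who sign in but never post
    approves: bool  # hypothetical opinion a study wants to measure


def approval_rate(users):
    """Share of the given users who approve."""
    users = list(users)
    return sum(u.approves for u in users) / len(users)


# Hypothetical population of 100 users: 40 listeners and 60 posters.
# Assumed for illustration only: 30 of the 40 listeners approve,
# but just 18 of the 60 posters do.
population = (
    [User(posts=False, approves=True)] * 30
    + [User(posts=False, approves=False)] * 10
    + [User(posts=True, approves=True)] * 18
    + [User(posts=True, approves=False)] * 42
)

# Rate across everyone, which a study of posts cannot see.
true_rate = approval_rate(population)

# Rate among users who post, which is all a study of posts can measure.
observed_rate = approval_rate(u for u in population if u.posts)

print(f"all users:    {true_rate:.2f}")
print(f"posters only: {observed_rate:.2f}")
```

Under these made-up numbers, 48 of 100 users approve, but only 18 of the 60 posters do, so a study limited to posts would understate approval. The size and direction of the gap depend entirely on the assumed split; the point is only that a study of posts never observes the silent group.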
Crawford notes that many big data sets—particularly social data—come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.
The researchers add that big data can also raise serious ethical concerns. Many times, Crawford notes, combining data from different sources can lead to unexpected results for the people involved. For example, other researchers have previously shown that they can identify individuals by using social media data in combination with supposedly anonymized behavioral data provided by companies.
Handling big data sets takes extraordinary care. Given the quantity of information now available on the internet, Crawford argues, researchers need to slow down and think about the methods they use.