Google Flu Trends

From Wikipedia, the free encyclopedia

Google Flu Trends is a web service operated by Google. It provides estimates of influenza activity for more than 25 countries. By aggregating Google search queries, it attempts to make accurate predictions about flu activity. This project was first launched in 2008 by Google.org to help predict outbreaks of flu.[1]

Introduction

The idea behind Google Flu Trends (GFT) is that, by monitoring the online health-tracking behavior of millions of users, the large number of Google search queries gathered can be analyzed to reveal whether flu-like illness is present in a population. Google Flu Trends compares these findings to a historic baseline level of influenza activity for the corresponding region and then reports the activity level as minimal, low, moderate, high, or intense. These estimates have been generally consistent with conventional surveillance data collected by health agencies, both nationally and regionally.

Roni Zeiger helped develop Google Flu Trends.[2]

Methods

Google Flu Trends was described as using the following method to gather information about flu trends.[3][4]

First, a time series is computed for about 50 million common queries entered weekly within the United States from 2003 to 2008. A query's time series is computed separately for each state and normalized into a fraction by dividing the number of searches for that query by the number of all queries in that state. The state in which a query was entered is determined from the IP address associated with the search.
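The normalization step can be illustrated with a minimal sketch. The function name and the weekly counts below are hypothetical and chosen only for illustration; they are not taken from Google's data or code.

```python
def query_fraction_series(query_counts, total_counts):
    """Normalize one query's weekly counts in one state into query fractions.

    query_counts[i] -- times the query was entered in week i (hypothetical input)
    total_counts[i] -- all queries entered in that state in week i
    """
    if len(query_counts) != len(total_counts):
        raise ValueError("both series must cover the same weeks")
    # A week with no recorded queries gets a fraction of 0.0 (an assumption).
    return [q / t if t > 0 else 0.0 for q, t in zip(query_counts, total_counts)]


# Made-up counts for a single query in a single state over four weeks.
flu_query_counts = [120, 150, 310, 280]
all_query_counts = [1_000_000, 1_000_000, 1_250_000, 1_000_000]

fractions = query_fraction_series(flu_query_counts, all_query_counts)
for week, fraction in enumerate(fractions, start=1):
    print(f"week {week}: {fraction:.6f}")
```

Dividing by the total query volume keeps the series comparable across weeks and states even when overall search traffic changes.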

A linear model is used to relate the log-odds of an influenza-like illness (ILI) physician visit to the log-odds of an ILI-related search query:

\operatorname{logit}(P) = \beta_0 + \beta_1 \times \operatorname{logit}(Q) + \epsilon

P is the percentage of physician visits that are ILI-related and Q is the ILI-related query fraction computed in the previous step. β0 is the intercept and β1 is the coefficient, while ε is the error term.
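As a rough illustration of the model above, the following sketch estimates β0 and β1 by ordinary least squares on logit-transformed values. It is an assumption-laden example, not Google's implementation: P is expressed as a fraction between 0 and 1 rather than a percentage, the function names are hypothetical, and the data are synthetic, generated from known coefficients so that the fit can be checked.

```python
import math


def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logit_linear(ili_fraction, query_fraction):
    """Least-squares estimates of (beta0, beta1) in
    logit(P) = beta0 + beta1 * logit(Q)."""
    x = [logit(q) for q in query_fraction]
    y = [logit(p) for p in ili_fraction]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict_ili(beta0, beta1, q):
    """Predicted ILI visit fraction for a query fraction q."""
    return inverse_logit(beta0 + beta1 * logit(q))


# Synthetic data: ILI fractions generated without noise from chosen coefficients.
true_beta0, true_beta1 = 3.0, 1.0
query_fraction = [0.0002, 0.0004, 0.0008, 0.0016]
ili_fraction = [inverse_logit(true_beta0 + true_beta1 * logit(q))
                for q in query_fraction]

beta0, beta1 = fit_logit_linear(ili_fraction, query_fraction)
print(beta0, beta1)
print(predict_ili(beta0, beta1, 0.0004))
```

Because the synthetic data contain no error term, the fit recovers the chosen coefficients up to floating-point rounding; real surveillance data would leave a nonzero residual ε.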

Each of the 50 million queries is tested as Q to see whether the result computed from that single query matches the historical ILI data obtained from the U.S. Centers for Disease Control and Prevention (CDC). This process produces a ranked list of the queries that give the most accurate predictions of CDC ILI data under the linear model. The top 45 queries are then chosen because, when aggregated, they fit the historical data most accurately. Using the sum of the top 45 ILI-related queries, the linear model is fitted to the weekly ILI data between 2003 and 2007 to estimate the coefficients. Finally, the trained model is used to predict flu outbreaks across all regions in the United States.
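The query-selection procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: fit quality is measured here by the Pearson correlation between logit-transformed series, which may differ from the measure Google used; the function names are hypothetical; and the three candidate queries and all numbers are made up.

```python
import math


def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def correlation(a, b):
    """Pearson correlation between two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit_score(query_fraction, ili_fraction):
    """Assumed fit measure: correlation of logit(Q) with logit(P)."""
    return correlation([logit(q) for q in query_fraction],
                       [logit(p) for p in ili_fraction])


def select_queries(candidates, ili_fraction):
    """Rank queries individually, then keep the top-n set whose summed
    query fraction fits the ILI series best."""
    ranked = sorted(candidates,
                    key=lambda name: fit_score(candidates[name], ili_fraction),
                    reverse=True)
    best_names, best_score = [], -math.inf
    for n in range(1, len(ranked) + 1):
        top = ranked[:n]
        # Sum the weekly fractions of the n best-ranked queries.
        summed = [sum(week) for week in zip(*(candidates[name] for name in top))]
        score = fit_score(summed, ili_fraction)
        if score > best_score:
            best_names, best_score = top, score
    return best_names, best_score


# Made-up weekly ILI visit fractions and candidate query fractions.
ili = [0.01, 0.02, 0.04, 0.03, 0.015]
candidates = {
    "flu symptoms": [0.0001, 0.0002, 0.0004, 0.0003, 0.00015],
    "fever remedy": [0.00012, 0.00018, 0.00035, 0.00032, 0.00016],
    "basketball scores": [0.0004, 0.0001, 0.0003, 0.0001, 0.0005],
}

chosen, score = select_queries(candidates, ili)
print(chosen, round(score, 4))
```

In this toy setting the seasonal but unrelated query ranks last and is excluded, which mirrors how the cut-off at 45 queries was described: queries are added in rank order only as long as the aggregate fit improves.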

This algorithm has been subsequently revised by Google, partially in response to concerns about accuracy, and attempts to replicate its results have suggested that the algorithm developers "felt an unarticulated need to cloak the actual search terms identified".[5]

Privacy concerns

Google Flu Trends tries to avoid privacy violations by aggregating millions of anonymous search queries without identifying the individuals who performed the searches.[1][6] Google's search logs contain the IP address of each user, which can be used to trace the region from which a search query was submitted. Google runs programs on computers to access and process the data, so no human is involved in the process. Google has also implemented a policy of anonymizing IP addresses in its search logs after 9 months.[7]

However, Google Flu Trends has raised privacy concerns among some privacy groups. The Electronic Privacy Information Center and Patient Privacy Rights sent a letter in 2008 to Eric Schmidt, then the CEO of Google.[8] They conceded that the use of user-generated data could support public health efforts in significant ways, but expressed their worry that "user-specific investigations could be compelled, even over Google's objection, by court order or Presidential authority".

Impact

An initial motivation for GFT was that identifying disease activity early and responding quickly could reduce the impact of seasonal and pandemic influenza. According to one report, Google Flu Trends was able to predict regional outbreaks of flu up to 10 days before they were reported by the CDC.[9]

During the 2009 flu pandemic, Google Flu Trends tracked information about flu in the United States.[10] In February 2010, the CDC identified influenza cases spiking in the mid-Atlantic region of the United States. However, Google’s data on search queries about flu symptoms had shown the same spike two weeks before the CDC report was released.

“The earlier the warning, the earlier prevention and control measures can be put in place, and this could prevent cases of influenza,” said Dr. Lyn Finelli, lead for surveillance at the influenza division of the CDC. “From 5 to 20 percent of the nation’s population contract the flu each year, leading to roughly 36,000 deaths on average.” [9]

Google Flu Trends is an example of collective intelligence that can be used to identify trends and calculate predictions. The data amassed by search engines is highly informative because search queries represent people’s unfiltered wants and needs. “This seems like a really clever way of using data that is created unintentionally by the users of Google to see patterns in the world that would otherwise be invisible,” said Thomas W. Malone, a professor at the Sloan School of Management at MIT. “I think we are just scratching the surface of what’s possible with collective intelligence.” [9]

Accuracy

The initial Google paper stated that the Google Flu Trends predictions were 97% accurate compared with CDC data.[3] However, subsequent reports asserted that Google Flu Trends' predictions have sometimes been very inaccurate, especially over the interval 2011-2013, when it consistently overestimated flu prevalence,[5] and over one interval in the 2012-2013 flu season, when it predicted twice as many doctors' visits as the CDC recorded.[5][11]

One source of problems is that people making flu-related Google searches may know very little about how to diagnose flu; those searching for flu or flu symptoms may well be researching symptoms of diseases that resemble flu but are not actually flu.[12] Furthermore, analysis of search terms reportedly tracked by Google, such as "fever" and "cough", as well as the effects of changes in Google's search algorithm over time, has raised concerns about the meaning of its predictions.[5] In fall 2013, Google began attempting to compensate for increases in searches due to the prominence of flu in the news, which was found to have previously skewed results.[13] However, one analysis concluded that "by combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone."[5]

Competition with Wikipedia data

In April 2014, scientists at Boston Children's Hospital announced that they had analyzed traffic on Wikipedia to estimate flu activity. The scientists said that they had created a computer model that estimated "peak influenza activity" correctly 17 percent more often than Google Flu Trends did. Their computer model used data about how frequently certain Wikipedia articles were viewed by the public.[14]

References

  1. ^ a b "Google Flu Trends | How". Retrieved 10 November 2012. 
  2. ^ Zeiger, Roni (6 October 2009). "Google Flu Trends Overview - YouTube". youtube.com. Retrieved 6 June 2013. 
  3. ^ a b Ginsberg, Jeremy. "Detecting influenza epidemics using search engine query data". Retrieved 10 November 2012. 
  4. ^ Ginsberg, Jeremy; Mohebbi, Matthew H.; Patel, Rajan S.; Brammer, Lynnette; Smolinski, Mark S.; Brilliant, Larry (19 February 2009). "Detecting influenza epidemics using search engine query data". Nature 457: 1012–1014. doi:10.1038/nature07634. 
  5. ^ a b c d e Lazer, David; Kennedy, Ryan; King, Gary; Vespignani, Alessandro (14 March 2014). "The Parable of Google Flu: Traps in Big Data Analysis". Science 343 (6176): 1203–1205. doi:10.1126/science.1248506. 
  6. ^ Helft, Miguel (13 November 2008). "Is There a Privacy Risk in Google Flu Trends?". The New York Times. Retrieved 10 November 2012. 
  7. ^ "Privacy Policy – Policies & Principles – Google". Retrieved 10 November 2012. 
  8. ^ Peel, Deborah. "EPIC's November 12, 2008 Letter to Google Concerning Google Flu Trends". Retrieved 10 November 2012. 
  9. ^ a b c "Google Uses Searches to Track Flu’s Spread". Retrieved 10 November 2012. 
  10. ^ Cook, S.; Conrad, C.; Fowlkes, A. L.; Mohebbi, M. H. (2011). "Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic". In Cowling, Benjamin J. PLoS ONE 6 (8): e23610. doi:10.1371/journal.pone.0023610. PMC 3158788. PMID 21886802.
  11. ^ Butler, Declan (13 February 2013). "When Google got flu wrong". Nature 494: 155–156. doi:10.1038/494155a. 
  12. ^ http://siliconangle.com/blog/2014/03/24/google-flu-trends-a-case-of-big-data-gone-bad.
  13. ^ Richard Harris (2014-03-13). "Google's Flu Tracker Suffers From Sniffles". NPR. 
  14. ^ Parrish, Ryan (23 April 2014). "New model predicts flu trends using Internet traffic on Wikipedia articles". Vaccine News Daily. Retrieved 25 April 2014.