The U.S. federal statistical program began with the 1790 Census. Its establishment was an innovation: Enshrining this invention in our Constitution marked a turning point in world history. Previously, censuses had been used mainly to tax or confiscate property or to conscript youth into military service. The genius of the Founders was taking a tool of government and making it a tool of political empowerment for the governed over their government.
A few decades passed and another innovation, gerrymandering, opened up new uses for census data and any other public or private data that might further the goal of favorable redistricting.
Today, the Association of Public Data Users (APDU) guides professional statisticians and social scientists through the constant change and innovation that is happening inside and outside of federal statistical programs. The association also coordinates professional user community input into government policy developments affecting the quality of federal statistical surveys. Members are “public data producers, disseminators, and users”—including businesses, universities, policy organizations, and government organizations at the local, state, and federal levels.
APDU convened on Sept. 16–17 in Washington, D.C. to update members on current challenges to the quality of federal statistical data and opportunities to innovate around those challenges. The Census Bureau in particular is challenged by the rising costs of conducting surveys, a reduced budget, and Americans’ growing reluctance to spend time providing answers to government surveys. Census and other federal agencies face continuing cuts to budgets and staffing. Global enthusiasm for raw government data, Big Data, and social media data has obscured the effort that goes into existing statistical programs. (Along this line, APDU is sponsoring a free webinar on Government Data and Confidentiality: Compatible Companions With the Help of Statistical Disclosure Control. For this and other APDU webinars, see the website.) The meeting agenda and links to available slide decks for the APDU 2013 conference, titled A Sea Change for Public Data, are available online.
Social Media Intelligence
Rising costs and shrinking budgets have pushed statisticians to explore sources of data to complement survey research results. Acknowledging alternative sources, APDU included a panel on Social Media Data as a Public Resource to showcase research on using social media data for analysis. Michael J. Paul of the Johns Hopkins University computer science department presented the results of his work evaluating user-generated content on Twitter as a predictor of local flu outbreaks. He found correlations of up to 99% when his Twitter analysis was compared with the authoritative data on seasonal influenza from the Centers for Disease Control and Prevention. Paul and others on the panel acknowledged the limits of using Twitter as a predictor and the work needed to clean up sample sets to account for spammers, copycats, and other users whose posts were of little value.
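For readers who want to see what such a comparison involves, here is a minimal sketch of a Pearson correlation between two weekly series. It is not Paul's method, which the presentation summary does not detail. The function name and both data series are hypothetical, made up purely for illustration, and are not Twitter or CDC figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL weekly values, invented for illustration only:
# share of sampled tweets flagged as flu-related, and an official
# influenza-like-illness rate for the same weeks.
tweet_flu_share = [0.8, 1.1, 1.9, 2.7, 3.4, 2.9, 1.6]
official_ili_rate = [1.0, 1.3, 2.0, 3.1, 3.6, 3.0, 1.9]

r = pearson(tweet_flu_share, official_ili_rate)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 only says the two series rise and fall together over the weeks compared. It says nothing about whether the tweet sample was cleaned of spam and copied posts first, which is the caveat the panel raised.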
Paul presented other study results, including assessing the value of intelligence gathered from over-sharing drug abusers on an online forum. Mining the forum proved to be useful for identifying new drugs of abuse and new health symptoms associated with their use. The data was not as useful for demographic analysis given contributors’ incentives to lie about their personal location, age, and other data.
Another panelist, Paul Hitlin of the Pew Center for Excellence in Journalism, brought it all back to Earth with the observation that “[I]f Twitter users were reflective of the nation as a whole, Ron Paul would be President now.” Speaker Thomas Levine of the Zipfian Academy followed up with the observations that “a text search is not statistics,” statistics are applied to the search results, and we “still have to do good statistics.”