
1 | 15-09-2016

Big data: abundance and unease (Big data: overvloed en onbehagen)

Kees Aarts, SWR conference, 16-17 September 2016

2 | 15-09-2016

Contents
› KNAW exploratory committee
› What is big data?
› Big data and research methodology
› Areas of tension
› Future

3 | 15-09-2016

Big Data Committee
Established in September 2015. The committee has two tasks:
•  to carry out a broad exploration of the effects of 'big data' on scientific research, with an emphasis on research fields that deal with persons
•  to prepare a KNAW advisory report on a few selected topics.

4 | 15-09-2016

Working method
› Discussion meetings with focus groups:
  §  Researchers in big data
  §  Computer science specialists
  §  (still to come) Younger researchers in big data

5 | 15-09-2016

Dutcher (2014)

What Is Big Data? - Blog https://datascience.berkeley.edu/what-is-big-data/


6 | 15-09-2016

A vaguely delineated concept
› Big data: what is 'big'? The three Vs (volume, velocity, variety)
› Related but distinct terms
  §  Data science
  §  E-science (e-humanities)
  §  Computational social science
  §  Data-driven research
  §  Open access, open data, open science

7 | 15-09-2016

Volume, velocity
› Camera footage, GPS data, social media (Twitter; Hosch-Dayican et al. 2014), web search behavior

[...] subsample of the text documents. The hand-coded subsample is then used as a training set to classify and code the rest of the documents automatically in the second step. This "supervised learning approach" to classification has several advantages over automated methods based on the use of dictionaries. First of all, the need for a clear coding scheme urges researchers to develop clear definitions of concepts to be measured and studied. Second, supervised learning methods are easier to validate, with clear statistics that summarize model performance. Third, the probability of misclassification of text that does not contain straightforward language, such as tweets with sarcastic contents, can be reduced due to the use of human coders (see Grimmer & Stewart, 2013; Hopkins & King, 2010).

For taking the first step, a random subsample corresponding to approximately 1% of all the tweets was drawn from the corpus. Four coders were appointed to manually code this subsample independent of each other using a coding scheme that was developed by the authors. All coded variables were then tested for intercoder reliability using Krippendorff's α, the result of which showed a high level of agreement implying that the coding scheme is dependable.

The hand-coded data were then used as a training set to code the rest of the tweets. In order to classify the text, we implemented a naive Bayes classifier (in PHP). To improve the performance of the classifier, we used unigrams as well as bigrams. We also removed common Dutch stop words and used word stemming in order to deal with only the stems of the words. With this classifier, we first classified the tweets on whether they were related to politics. For this, two sets of 388 tweets were used for training purposes. One set consisting of politically related tweets and the other consisting of tweets not related to politics. We used 10-fold cross-validation to arrive at the values [...]
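The classification step described in the excerpt can be illustrated with a minimal sketch. It is not the authors' PHP implementation: the stop-word list, the training tweets, and the class labels below are invented for illustration, word stemming and cross-validation are left out, and add-one smoothing is an assumption the excerpt does not mention.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; the list used in the paper is not given.
STOP_WORDS = {"de", "het", "een", "en", "van"}


def features(text):
    """Lowercase, drop stop words, and return unigrams plus bigrams."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(features(text))
        self.vocabulary = set()
        for counts in self.feature_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in features(text):
                if feat in self.vocabulary:  # features never seen in training are ignored
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented hand-coded training set (two tweets per class).
train_texts = [
    "stem op de partij bij de verkiezingen",
    "het debat van de lijsttrekkers over de verkiezingen",
    "lekker weer vandaag in het park",
    "de trein heeft weer vertraging vandaag",
]
train_labels = ["political", "political", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("debat over verkiezingen"))
```

In a real application the training set would be the hand-coded subsample, and performance would be estimated on held-out folds rather than on a single example.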

Figure 1. An overview of the nested structure of the variables. [Diagram labels: Tweets on Dutch Elections 2012; Electoral campaigning; No electoral campaigning; Type of campaigning; Persuasive campaigning; Negative campaigning.]

Table 2. The Precision, Recall, and Accuracy of the Classifier for Predicting if a Tweet Is Related to Dutch Parliamentary Elections (Correct to Two Decimal Places).

Type of Tweet             | Precision | Recall | Accuracy | F Measure
Related to elections      | 0.93      | 0.73   | 0.84     | 0.82
Not related to elections  | 0.78      | 0.94   |          | 0.85
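The measures in Table 2 follow the standard definitions based on true and false positives and negatives. As a small worked check, the sketch below recomputes the F measure from the reported precision and recall; the function names are illustrative, and the underlying confusion counts are not given in the excerpt, so only the F measure can be verified.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true positives that the classifier finds."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Precision and recall as reported in Table 2.
reported = {
    "Related to elections": (0.93, 0.73),
    "Not related to elections": (0.78, 0.94),
}
for label, (p, r) in reported.items():
    print(f"{label}: F = {f_measure(p, r):.2f}")
# Related to elections: F = 0.82
# Not related to elections: F = 0.85
```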

Source: Social Science Computer Review, p. 8

8 | 15-09-2016

Variety
› System of social statistical datasets (CBS, Statistics Netherlands): virtual census (Bakker et al. 2014)

416 B.F.M. Bakker et al. / The System of social statistical datasets

Fig. 2. Conceptual model of the SSD register system. [Rectangles: object types; lines: relations between object types; PIN: person identification number; HIN: household identification number; AIN: address identification number; OIN: organization identification number; the indication x:y denotes the type of relation].

[...]sence of coordination. Moreover, data sharing among organizational units entails increased interdependency as well as the potential for unwanted output overlap. Therefore, being able to monitor the production schedules of other units is of paramount importance. In short, coordination is essential to simplify the combined use of data, to increase consistency between statistical registers, avoid duplicated work, ensure the appropriate application and interpretation of data, and for planning and control. Four types of coordination are distinguished: organizational, technical, content-related and output-related. These will be examined consecutively below.

Organizational coordination

SN's Division of Socioeconomic and Spatial Statistics consists of a number of organizational units. Each unit is responsible for the production of statistical output pertaining to a specific domain, e.g. employment, social security, demography. These units carry out register processing and store the resulting statistical registers in the central data library of the SSD. They are the formal owners of these registers, which means they are accountable for the timely processing as well as the quality of the registers. Several supporting tasks are performed by two central organizational units: one is responsible for assigning linkage keys to statistical registers. To that end, it maintains the CLFP and develops and applies matching algorithms. The other central organizational unit carries out a broad range of activities aimed at the integrity of the SSD and the efficient use of its contents. For instance, it performs micro-integration of different statistical registers, develops and maintains software tools and provides courses on the principles of the SSD. Lastly, two consultation bodies are worth mentioning. First, representatives of all organizational units participate in a consultative body which aims to coordinate the contents and technical aspects of the SSD. Second, a steering committee oversees current and future aspects of the SSD and takes action in the case of conflicts of interest.
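Once linkage keys such as the person identification number (PIN) have been assigned, combining registers comes down to a join on that key. The sketch below is only an illustration: the register contents, field names, and the function `link_on_pin` are invented, and the matching algorithms that assign the keys in the first place are not shown.

```python
# Hypothetical statistical registers, each keyed by PIN.
demography = {
    "P001": {"birth_year": 1980, "hin": "H10"},
    "P002": {"birth_year": 1992, "hin": "H11"},
}
employment = {
    "P001": {"employer_oin": "O77", "hours_per_week": 36},
    "P003": {"employer_oin": "O12", "hours_per_week": 20},
}


def link_on_pin(left, right):
    """Inner join: keep only persons present in both registers."""
    shared = left.keys() & right.keys()
    return {pin: {**left[pin], **right[pin]} for pin in sorted(shared)}


linked = link_on_pin(demography, employment)
print(linked)
```

Persons who appear in only one register (here P002 and P003) drop out of the inner join, which is one reason coverage has to be checked whenever registers are combined.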

Technical coordination

Standardization is the most prominent aspect of technical coordination within the SSD. File formats, data formats of linkage keys, naming conventions, metadata, IT infrastructure and planning tools are all standardized. Technical coordination also aims to prevent redundancy (the same variable in different statistical registers) and ambiguity (same variable under different names). In addition, a key feature of the SSD is an unambiguous link between data and metadata (Fig. 3). Meta-information and its structure is important for the proper processing and understanding of statistical data e.g. [17,44]. The transition to register-based statistics has broadened the demands on metadata as it entails a stronger dependence on external factors such as legislation underlying the administrative registers, variable definitions and data collection methods employed by the register keeper [11,36,51]. The metadata of the SSD are stored in a central metadata repository. Statistical registers are connected one-to-one with their corresponding metadata files, on the basis of the register name. Similarly, variables are related to their metadata on the basis of the variable name.
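A name-based link between data and metadata, together with checks for redundancy and ambiguity, might look roughly like the following sketch. The repository layout, register names, variable names, and definitions are all hypothetical; the excerpt does not describe how the SSD implements this.

```python
from collections import defaultdict

# Hypothetical central metadata repository:
# register name -> variable name -> definition.
METADATA = {
    "employment": {
        "pin": "person identification number",
        "wage": "gross annual wage",
    },
    "benefits": {
        "pin": "person identification number",
        "wage": "gross annual wage",
        "income": "gross annual wage",
    },
    "demography": {
        "pin": "person identification number",
        "birth_year": "year of birth",
    },
}
LINKAGE_KEYS = {"pin"}  # linkage keys are expected to recur in every register


def describe(register, variable):
    """Look up a definition by register name and variable name."""
    return METADATA[register][variable]


def redundant_variables(metadata):
    """Variables (other than linkage keys) stored in more than one register."""
    registers_by_name = defaultdict(set)
    for register, variables in metadata.items():
        for name in variables:
            registers_by_name[name].add(register)
    return {
        name: sorted(registers)
        for name, registers in registers_by_name.items()
        if len(registers) > 1 and name not in LINKAGE_KEYS
    }


def ambiguous_definitions(metadata):
    """Definitions that occur under more than one variable name."""
    names_by_definition = defaultdict(set)
    for variables in metadata.values():
        for name, definition in variables.items():
            names_by_definition[definition].add(name)
    return {
        definition: sorted(names)
        for definition, names in names_by_definition.items()
        if len(names) > 1
    }


print(describe("demography", "birth_year"))
print(redundant_variables(METADATA))
print(ambiguous_definitions(METADATA))
```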

Content-related coordination

Several processes are directed at the coordination of content. Firstly, when either new statistical registers or modifications of existing registers are developed, the specifications are sent to all organizational units to enable stakeholders to contribute comments that represent their interests. Secondly, a central production schedule is kept within the SSD framework. Organizational units make their own timetables using a standardized planning tool. These timetables are automatically incorporated into a central schedule which can be consulted by all the units. Thirdly, if a historical register is updated frequently in order to produce timely statistics, coordinated versions are identified which are to be used for all statistics with less strict timelines. For instance, the demographic register, which is de- [...]

9 | 15-09-2016

Paradigm shift (Hey et al. 2009)

10 | 15-09-2016

Data is a misleading term
› Data are never given but are always constructed (theory of observation, data theory)
  §  Someone chooses what is observed; that choice has consequences for validity and reliability
  §  A single observation can be interpreted as different kinds of data
› This is often forgotten when it comes to big data

11 | 15-09-2016

Tests lose their meaning
› Conventions for statistical tests were developed from a minimalist, experimental perspective (how large must n be to approximate a distribution? What is an acceptable type I error at that n?)
› With large n, virtually every relationship becomes significant under these conventions
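The point about large n can be made concrete with a small calculation. The sketch below uses the Fisher z approximation for testing whether a correlation differs from zero; the correlation of 0.01 and the sample sizes are arbitrary illustrative values, not figures from the presentation.

```python
import math


def correlation_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# The same negligible correlation, tested at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    p = correlation_p_value(0.01, n)
    print(f"n = {n:>9,}: p = {p:.3g}")
```

With 100 or 10,000 cases a correlation of 0.01 is far from significant at the conventional 5% level; with a million cases it is highly significant, although it still explains only 0.01% of the variance.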

12 | 15-09-2016

Validity becomes problematic
›  External validity: to what extent can the data and relationships be generalized?
›  Internal validity: to what extent can a correlation be interpreted causally?

13 | 15-09-2016

Verification and replication
› Data should comply with the FAIR principles:
  §  findable
  §  accessible
  §  interoperable
  §  re-usable
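One way to make the four principles checkable is to test a dataset's metadata record for the elements each principle depends on. The sketch below is only an illustration: the field names and the mapping from principle to fields are assumptions, since FAIR itself does not prescribe a particular metadata schema.

```python
# Hypothetical mapping from each FAIR principle to metadata fields that support it.
REQUIRED_FIELDS = {
    "findable": ["identifier", "title"],        # e.g. a persistent identifier
    "accessible": ["access_url"],               # where and how the data can be retrieved
    "interoperable": ["format"],                # a standard, documented format
    "reusable": ["license", "provenance"],      # terms of use and origin of the data
}


def fair_gaps(record):
    """Return, per principle, the fields that are missing or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [field for field in fields if not record.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


# Invented example record with no license information.
record = {
    "identifier": "doi:10.0000/example",
    "title": "Example survey data",
    "access_url": "https://example.org/data",
    "format": "CSV",
    "provenance": "Collected by a hypothetical research group",
}
print(fair_gaps(record))
# {'reusable': ['license']}
```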

14 | 15-09-2016

Ownership (Einav & Levin 2014)

RESEARCH | REVIEW SUMMARY | ECONOMICS

Economics in the age of big data

Liran Einav1,2* and Jonathan Levin1,2

Science, 7 November 2014, Vol. 346, Issue 6210, p. 715

BACKGROUND: Economic science has evolved over several decades toward greater emphasis on empirical work. The data revolution of the past decade is likely to have a further and profound effect on economic research. Increasingly, economists make use of newly available large-scale administrative data or private sector data that often are obtained through collaborations with private firms, giving rise to new opportunities and challenges.

ADVANCES: These new data are affecting economic research along several dimensions. Many fields have shifted from a reliance on relatively small-sample government surveys to administrative data with universal or near-universal population coverage. This shift is transformative, as it allows researchers to rigorously examine variation in wages, health, productivity, education, and other measures across different subpopulations; construct consistent long-run statistical indices; generate new quasi-experimental research designs; and track diverse outcomes from natural and controlled experiments.

Perhaps even more notable is the expansion of private sector data on economic activity. These data, sometimes available from public sources but other times obtained through data-sharing agreements with private firms, can help to create more granular and real-time measurement of aggregate economic statistics. The data also offer researchers a look inside the "black box" of firms and markets by providing meaningful statistics on economic behavior such as search and information gathering, communication, decision-making, and microlevel transactions. Collaborations with data-oriented firms also create new opportunities to conduct and evaluate randomized experiments.

Economic theory plays an important role in the analysis of large data sets with complex structure. It can be difficult to organize and study this type of data (or even to decide which variables to construct) without a simplifying conceptual framework, which is where economic models become useful. Better data also allow for sharper tests of existing models and tests of theories that had previously been difficult to assess.

OUTLOOK: The advent of big data is already allowing for better measurement of economic effects and outcomes and is enabling novel research designs across a range of topics. Over time, these data are likely to affect the types of questions economists pose, by allowing for more focus on population variation and the analysis of a broader range of economic activities and interactions. We also expect economists to increasingly adopt the large-data statistical methods that have been developed in neighboring fields and that often may complement traditional econometric techniques.

These data opportunities also raise some important challenges. Perhaps the primary one is developing methods for researchers to access and explore data in ways that respect privacy and confidentiality concerns. This is a major issue in working with both government administrative data and private sector firms. Other challenges include developing the appropriate data management and programming capabilities, as well as designing creative and scalable approaches to summarize, describe, and analyze large-scale and relatively unstructured data sets. These challenges notwithstanding, the next few decades are likely to be a very exciting time for economic research.

Figure: The rising use of non–publicly available data in economic research. Here we show the percentage of papers published in the American Economic Review (AER) that obtained an exemption from the AER's data availability policy, as a share of all papers published by the AER that relied on any form of data (excluding simulations and laboratory experiments). Notes and comments, as well as AER Papers and Proceedings issues, are not included in the analysis. We obtained a record of exemptions directly from the AER administrative staff and coded each exemption manually to reflect public sector versus private data. Our check of nonexempt papers suggests that the AER records may possibly understate the percentage of papers that actually obtained exemptions. The asterisk indicates that data run from when the AER started collecting these data (December 2005 issue) to the September 2014 issue. To make full use of the data, we define year 2006 to cover October 2005 through September 2006, year 2007 to cover October 2006 through September 2007, and so on.
[Stacked chart. Horizontal axis: Publication year* (2006 to 2014). Vertical axis: Share of all published papers with data (0.0 to 1.0). Series: No exemption; Exemption (private data); Exemption (administrative data).]

1 Department of Economics, Stanford University, Stanford, CA 94305, USA. 2 National Bureau of Economic Research, 1050 Massachusetts Avenue, Cambridge, MA 02138, USA. *Corresponding author.
Cite this article as L. Einav, J. Levin, Science 346, 1243089 (2014); DOI: 10.1126/science.1243089

Read the full article at http://dx.doi.org/10.1126/science.1243089

Published by AAAS

15 | 15-09-2016

AOL searcher No. 4417749 “My goodness, it’s my whole personal life…I had no idea somebody was looking over my shoulder.”

16 | 15-09-2016

Protection of personal data
› People are generally far from sufficiently aware of the integrated knowledge that is available about them and their behavior
› Disclaimers are not understood

17 | 15-09-2016

Infrastructure needed!
› Data infrastructure:
  §  For the quality of measurements
  §  For methodological and statistical expertise
  §  For maximum generalizability
  §  To make the FAIR principles operational
  §  To settle ownership
  §  To protect privacy

18 | 15-09-2016

Two steps taken

NDSW
Data platform for the human and social sciences (mens- en maatschappijwetenschappen)
Umbrella proposal for the new national roadmap
Start: 27 October

M3
Part of the KNAW Agenda for Large-Scale Research Infrastructure (Agenda Grootschalige Wetenschappelijke Infrastructuur)
Integrates biology, medicine, genetics, computer science
