Abstract
No data source is perfect. Mistakes inevitably creep in. Spotting errors is hard enough when dealing with survey responses from several thousand people, but the difficulty is multiplied hugely when that mysterious beast Big Data comes into play.
Statistics Netherlands is about to publish its first figures based on Big Data – specifically road sensor data, which counts the number of cars passing a particular point. Later, we plan to use cell phone data for statistics on the daytime population and tourism, and we are considering an indicator to capture the “mood of the nation” based on sentiment expressed through social media.1
Statistics derived from unedited data sets of any size would be biased or inaccurate. But the challenge Statistics Netherlands faces in dealing with Big Data sets is to find data editing processes that scale up appropriately to allow quick and efficient cleaning of a huge number of records.
How huge? For the sentiment indicator, we plan to use 3 billion public messages predominantly gathered from Facebook and Twitter,2 and for the road sensor data there are 105 billion records. But size is not the only distinguishing characteristic of a Big Data set.
A clear, generally accepted definition of “Big Data” does not exist, though descriptions often refer to the three Vs: volume, velocity, and variety.3 So, not only do we have a large amount of data to deal with (volume), but the frequency of observations is very high (velocity). For the road sensor data, for example, we have data on a minute‐by‐minute basis. Big Data also tends to be “messy” in comparison to traditional data (variety). Again, for the road sensor data, we only know how many vehicles passed by. We do not know who drove the cars. In addition, background characteristics, which are important for data editing and estimation methods, are lacking, thus making such methods difficult to apply.
Original language | English |
---|---|
Pages (from-to) | 26-29 |
Journal | Significance |
Volume | 12 |
Issue number | 3 |
Publication status | Published - 2015 |