We have developed a series of video tutorials that show you how to work with the National Science Experiment (NSE) data so that you can excel in the NSE Big Data Challenge. In these sessions, you will learn the basic principles of data analysis, with specific links to the NSE data.

What these videos are NOT:

  • A comprehensive statistics course
  • A how-to guide to winning the Big Data Challenge
  • Another boring lecture

What these videos ARE:

  • Useful and common techniques applied to the NSE data to answer important questions
  • A source of inspiration for winning entries
  • A fun way to learn about applied statistics in real-world contexts

Now, without further ado, please join us in viewing the video lectures (which are best enjoyed in sequence):

1 Averages

Main points about averages:

  • They allow us to make predictions based on the central tendency of a sample or population
  • Means are nice if the distribution is relatively ‘normal’ and bell-shaped
  • Otherwise medians may be a better choice to represent central tendency
  • Modes are useful for categorical data
  • Real data is never as simple to work with as ‘clean’ example data sets
  • Samples are subject to bias, so it is important to consider these biases when using a sample to make predictions about a population
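
As a minimal sketch of the points above, the following Python snippet compares the mean, median and mode using only the standard library. The step counts and transport labels are made-up values for illustration, not actual NSE data, and the helper name `summarise` is hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical daily step counts for seven students (illustrative values only).
# One very active student skews the distribution away from a bell shape.
steps = [4000, 4200, 4500, 4700, 5000, 5200, 30000]

# Hypothetical categorical data: how each student travelled to school.
transport = ["bus", "train", "bus", "walk", "bus", "car", "train"]


def summarise(values):
    """Return the (mean, median) of a list of numbers."""
    return mean(values), median(values)


avg, mid = summarise(steps)
most_common = mode(transport)

print(f"mean = {avg:.1f}, median = {mid}")
print(f"most common transport = {most_common}")
```

In this made-up sample the single outlier pulls the mean well above what a typical student walks, while the median stays close to the bulk of the values. The mode is the only one of the three that makes sense for the categorical transport labels.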

Further reading:

http://www.mathsisfun.com/definitions/average.html

http://www.wolframalpha.com/examples/DescriptiveStatistics.html

http://www.theweatherprediction.com/habyhints/190/

http://faculty.washington.edu/chudler/stat3.html

2 Bias

Main points about bias:

  • Distributions of events may not have nice, simple, normal bell shapes; instead, they may have features that can help us answer our questions if we apply the correct statistical treatments (which we will cover in future lessons)
  • Histograms are built by counting the events or items that fall into each bin, then plotting the frequency on the y axis against the bin or label on the x axis
  • Real NSE data is subject to many types of sample bias, and a few to watch out for are:
    • That our samples come exclusively from students, so beware of extrapolating to the whole population of Singapore
    • That only the students from the most passionate teachers are part of the set
    • That the areas the students live in may be unique relative to other geographies in Singapore
    • That only students who chose to actively participate have their data in the dataset
    • That we only have data for areas where Wi-Fi is present
    • That the experiments run during very specific times of the year
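
The histogram construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature readings are invented, the 2-degree bin width is an arbitrary choice, and the function name `histogram` is hypothetical.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width.

    Returns a dict mapping each bin's lower edge to its frequency,
    ordered by lower edge.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical temperature readings in degrees Celsius (illustrative only).
temperatures = [24.1, 25.3, 25.9, 26.2, 26.8, 27.0, 27.4, 29.5, 31.2, 31.8]

bins = histogram(temperatures, bin_width=2)

for lower_edge, frequency in bins.items():
    # Bin label on the x axis, frequency drawn as a bar of '#' characters.
    print(f"{lower_edge:4.0f}-{lower_edge + 2:<4.0f} {'#' * frequency}")
```

Printing the bars as text keeps the sketch free of plotting libraries. The same counts could be passed to any charting tool to draw a conventional histogram.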

Further reading:

http://www.ma.utexas.edu/users/mks/statmistakes/biasedsampling.html

http://www.mathsisfun.com/data/histograms.html

http://en.wikipedia.org/wiki/Wi-Fi_positioning_system

3 Time-series analysis

Main points about time-series analysis:

  • Be aware that certain processes require a faster sampling rate than the SENSg is capable of. For example, you cannot record a conversation, because sound pressure is only sampled every 13 seconds
  • Pay attention to trends that repeat over time, and try to convert them into the ‘frequency domain’ by measuring how often they occur (e.g. how many humidity increases occur in a given period)
  • Automating peak finding is very useful in these types of analyses
  • Visit the NSE portal for more inspiration on which analyses you could perform in the time domain
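
A simple form of automated peak finding, together with the step from peak counts to a frequency, might look like the sketch below. The humidity readings and the threshold of 70 are invented for illustration, the function name `find_peaks` is hypothetical, and the code assumes humidity is sampled at the same 13-second interval quoted above for sound pressure, which may not hold for every sensor.

```python
# Assumed sampling interval in seconds (the figure quoted for sound pressure).
SAMPLE_INTERVAL_S = 13


def find_peaks(samples, threshold):
    """Return indices of samples that exceed `threshold` and are strictly
    greater than both neighbours (a simple local-maximum test)."""
    return [
        i
        for i in range(1, len(samples) - 1)
        if samples[i] > threshold
        and samples[i] > samples[i - 1]
        and samples[i] > samples[i + 1]
    ]


# Hypothetical relative-humidity readings in percent (illustrative only).
humidity = [60, 62, 75, 63, 61, 64, 78, 66, 62, 61, 77, 65]

peaks = find_peaks(humidity, threshold=70)

# Time spanned by the recording: one interval between each pair of samples.
duration_s = (len(humidity) - 1) * SAMPLE_INTERVAL_S

# Frequency-domain style summary: how often a humidity peak occurs.
peaks_per_minute = len(peaks) / duration_s * 60

# A repeating pattern needs at least two samples per cycle to be resolved,
# so anything faster than this period is invisible at this sampling rate.
shortest_period_s = 2 * SAMPLE_INTERVAL_S

print(f"peaks at indices {peaks}")
print(f"{peaks_per_minute:.2f} peaks per minute over {duration_s} s")
print(f"shortest resolvable period: {shortest_period_s} s")
```

The last value illustrates why speech cannot be captured: a pattern that repeats faster than once every 26 seconds cannot be distinguished at a 13-second sampling interval. Real sensor data is noisier than this example, so a practical peak finder would usually smooth the signal first or require a minimum distance between peaks.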

Further reading:

http://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node149.html

http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf

http://community.wolfram.com/groups/-/m/t/571799