Evaluating Statistics and Experimental Design

From Game Technology Lab


Understanding the statistics that are generated from data collection is critical for all research areas.

Designing your experiments correctly can improve the quality and validity of the data you are collecting, and it can help others to replicate your work.


Experimental Design with Software

Developing software for research purposes is different from commercial development or student projects. It requires a different approach to programming and development.

Saving and Loading

The first thing you usually do is develop state save and load functions. These allow you to save the state of the system at any point. This creates traceability: when asked, you can record each state change and follow the progress of the most important part of your program. It also allows results which take a long time to run to be saved as partial results. A final advantage of parameter-based save and load is that you may be able to spread your code over several machines and manually allocate parts of the task to each.
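As a minimal sketch of this idea, the run configuration and progress can be written to a file so a run can be resumed, replayed, or split across machines. All names and parameters below (population_size, iteration, best_score, the file name) are illustrative, not from any particular project:

```python
import json

def save_state(path, state):
    """Write the full run configuration and progress to a JSON file."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

def load_state(path):
    """Reload a saved run so it can be resumed or inspected."""
    with open(path) as f:
        return json.load(f)

state = {
    "seed": 13,                # record the RNG seed with everything else
    "population_size": 200,    # hypothetical experiment parameter
    "iteration": 5000,         # partial-result marker: how far the run got
    "best_score": 0.87,        # hypothetical partial result
}
save_state("run_state.json", state)
restored = load_state("run_state.json")
assert restored == state       # everything needed to resume is on disk
```

Keeping the seed inside the same file as the other parameters means a single saved file is enough to reproduce the whole run.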


If you use random numbers you must seed the random generator with a number, and record that number along with all the other configuration parameters in the save file. With the seed recorded you can recreate the data. This allows you to investigate anything that seems wrong, such as bugs, and also to set up automatic regression tests.

If you are using the rand function in C, it might give different random numbers for the same seed on different machines. This damages repeatability and traceability. One way to check that the random numbers are going to give you the same result is to record a test sequence of random numbers saved in your program. You can then verify the random number generator by running, for example, seedRandom(13) followed by 10 calls to random and comparing against the stored sequence. If you want to make your results as repeatable as possible, consider importing a random number generator into your code so that you can regenerate the data regardless of the machine.
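A sketch of that self-test in Python (where random.Random is a Mersenne Twister with well-defined cross-platform behaviour): seed an isolated generator and check that the sequence is reproducible. In practice you would store the reference sequence in your save file and compare against it, rather than only checking determinism as done here:

```python
import random

def rng_self_test(seed=13, n=10):
    # Use an isolated generator, not the shared global one, so other
    # code cannot perturb the sequence between calls.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = rng_self_test()
second = rng_self_test()
# The same seed must always give the same sequence; if this fails on
# some platform, your saved seeds cannot be trusted for replay there.
assert first == second
assert len(first) == 10
```

If the generator ever changes between library versions or machines, this check fails loudly instead of silently producing unreproducible results.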

Readability of Code

Your thesis is a very long project; you can expect to be reading code that you wrote 18 months earlier. Since you will not remember what everything is for, you must write your code with maintenance in mind. When the easy thing is to put a "magic number" directly into the code, take the time to create a named constant that describes what the number is and why it is being used.
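A small illustration of the difference (the threshold value and its justification here are invented):

```python
# Instead of a bare "magic number" buried in the logic ...
def accept_bad(score):
    return score > 0.73          # why 0.73? future-you will not remember

# ... name the constant so the intent survives 18 months:
ACCEPTANCE_THRESHOLD = 0.73      # hypothetical: minimum score from pilot study

def accept(score):
    return score > ACCEPTANCE_THRESHOLD

assert accept(0.8)
assert not accept(0.7)
```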

All comments should be in English! Your code may need to be reviewed by other students or international collaborators.

Tracing data

For many of you, most of the people interested in your results will not be expert software engineers. When writing your code, keep these other audiences in mind.


There will always be a lot of statistics when you generate a lot of data. Understanding the significance of results relies on understanding the potential biases and sampling errors that come with any sort of measurement. In research we often want to say that the new thing we made is better than the current solution. The main way to do this is to show that when you test both solutions, the difference between them was not a result of sampling error. The standard value used for this is 95% confidence, which means there is a 5% chance that the result you are presenting happened by chance. The page Running multiple tests gives a view of running many independent tests, and XKCD has an excellent example of this: http://xkcd.com/882/.
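The danger of running many independent tests at 95% confidence can be made concrete with a short calculation: each test alone has a 5% false-positive rate, but across many tests the chance that at least one "significant" result is pure noise grows quickly (the jelly-bean effect in the xkcd comic above):

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    # P(at least one spurious hit) = 1 - P(no spurious hits),
    # assuming the tests are independent.
    return 1 - (1 - alpha) ** n_tests

print(chance_of_false_positive(1))    # 0.05 for a single test
print(chance_of_false_positive(20))   # roughly 0.64: more likely than not
```

With 20 independent tests, a spurious "discovery" is more likely than not, which is why corrections such as Bonferroni exist for multiple comparisons.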

For example, suppose I thought that men were taller than women, and my sample size is 2: I pick one man and one woman. It could be that that particular woman is taller than the man. Should this lead me to decide that women are taller than men? Obviously not. We need to work out how likely it is that the sample is representative. This depends on the population size, the standard deviation of the population, and the sample size. Generally, the more people you sample, the more accurately you can estimate the true population.
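How quickly the estimate improves can be seen from the standard error of the sample mean, which shrinks with the square root of the sample size. The population standard deviation of 7 cm used below is an illustrative assumption for adult height, not a measured figure:

```python
import math

def standard_error(sd, n):
    # Uncertainty of the sample mean: sd / sqrt(n)
    return sd / math.sqrt(n)

sd_height = 7.0                      # assumed population SD, in cm
for n in (2, 10, 100):
    print(n, round(standard_error(sd_height, n), 2))
# With n = 2 the mean is uncertain by about 5 cm, easily swamping the
# male/female difference; with n = 100 it is about 0.7 cm.
```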

There are some issues to be aware of:

  1. Selection bias. Are you randomly allocating people to your experimental and control groups, or are they able to select their group themselves?


Many of you will be using questionnaires to assess how users experience a particular tool, process, or system. What you ask and how you ask it can significantly affect the answers you get. It is generally good practice to find a standard set of questions that is already used by researchers in your field, and use those questions. If you need to ask the questions in another language, look for an existing translation; if none exists (quite possible for Norwegian), create a translation and make it publicly available.

For software acceptance there are several sets of questions. The benefit of using standard questions is that it becomes easier to compare results with other studies.

  1. SUS - System Usability Scale. A 10-question survey with a 5-point Likert scale. (Brooke, J., 1996. SUS: A "quick and dirty" usability scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester, & I. L. McClelland (Eds.), Usability Evaluation in Industry, 189–194. London.)
  2. TAM - Technology Acceptance Model. A relatively simple model of why people use a piece of software. (Davis, F. D., Bagozzi, R. P., and Warshaw, P. R. "User Acceptance of Computer Technology: A Comparison of Two Theoretical Models," Management Science, 35, 1989, 982-1003.)
  3. UTAUT and UTAUT2 - Unified Theory of Acceptance and Use of Technology. These have large numbers of potential variables that can modify the results, and are often used when you have a very large number of participants. (Venkatesh, V., Morris, M. G., Davis, F. D., and Davis, G. B. "User Acceptance of Information Technology: Toward a Unified View," MIS Quarterly, 27, 2003, 425-478.)

Likert Scale

The Likert scale is a system of asking radio-button questions with a range from agree to disagree. The scales usually have an odd number of points, between 5 and 11, where the center is "neither agree nor disagree".

One of the problems with simply averaging on the Likert scale is that it assumes you are sampling from a normal distribution. If you use a standard Student's t-test to say that one system is better than another, you are assuming that the underlying distribution is normal. If you look at the graph on the right, you can see two different graphs of answers to an 8-point question, the first on global warming and the second on genetically modified food. As you can see, there are two distinct groups in the left graph. If you merely average the numbers you will get an incorrect assessment of people's opinions.

So rather than just assigning the five-point scale the numbers 0-4 and reporting an average of, say, 2.1, you need to keep the groupings separate and look at other tests. The test you use depends on the labels you have used. If you use the standard Likert scale you can use a chi-squared test for each question. The chi-squared test compares the number of people who answered in a particular way with the expected number. It is standard to group responses, so you might decide to group 'strongly disagree' and 'disagree' into a single response group. The chi-squared test then tells you about the differences between the disagree group for one condition compared to the other condition.
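As a sketch, the grouped comparison above can be run as a hand-rolled 2x2 chi-squared test. The counts are invented, and in real work you would typically use a statistics package (for example scipy.stats.chi2_contingency) rather than computing it by hand:

```python
import math

# Invented counts: responses grouped into "disagree" (strongly disagree
# + disagree) versus everything else, for two conditions A and B.
observed = {
    "A": {"disagree": 30, "other": 70},
    "B": {"disagree": 50, "other": 50},
}

def chi2_2x2(table):
    rows = list(table)
    cols = list(table[rows[0]])
    total = sum(table[r][c] for r in rows for c in cols)
    stat = 0.0
    for r in rows:
        for c in cols:
            row_sum = sum(table[r].values())
            col_sum = sum(table[x][c] for x in rows)
            expected = row_sum * col_sum / total
            # Sum of (observed - expected)^2 / expected over all cells.
            stat += (table[r][c] - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom, where the chi-squared
    # survival function reduces to 1 - erf(sqrt(stat / 2)).
    p_value = 1 - math.erf(math.sqrt(stat / 2))
    return stat, p_value

stat, p = chi2_2x2(observed)
print(round(stat, 2), round(p, 4))   # roughly 8.33, p well below 0.05
```

Here the disagree groups differ enough between conditions that the difference is unlikely to be sampling error, without ever pretending the Likert answers are normally distributed.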


If you have a system which gives you a binary detector for something, you have to understand the nature of the data. A detector has four possible outcomes:

  1. True Positive: the test tells you the item is an A, and it is indeed an A
  2. False Positive: the system claims the item is an A, but it is not
  3. False Negative: the system says not A, but the item is an A
  4. True Negative: the system says not A, and the item is not an A

These four outcomes lead to two commonly used numbers: sensitivity and specificity.

The sensitivity of a test tells you how likely it is to detect each positive example in the sample (the true positive rate), while the specificity tells you how likely it is to correctly reject the negatives, rather than wrongly labelling everything as type A (the true negative rate).
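With the four counts in hand, both numbers are simple ratios. The counts below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Hypothetical confusion-matrix counts for a binary detector.
tp, fp, fn, tn = 80, 10, 20, 90

sensitivity = tp / (tp + fn)   # true positive rate: 80 / 100 = 0.8
specificity = tn / (tn + fp)   # true negative rate: 90 / 100 = 0.9

print(sensitivity, specificity)
```

A detector that simply labels everything as A would score a perfect sensitivity of 1.0 but a specificity of 0.0, which is why both numbers must be reported together.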

Control Group

Pet Peeves

Never use the word "nowadays".

A statistically significant result is not the same as a significant result. A 1% improvement in an algorithm may be statistically significant, but you should not say that you have a significant improvement in performance.
