Measures of Central Tendency
- Mean: the average of the data. Not resistant to outliers.
- Median: the midpoint of the sorted data. With an even number of data points, the median is the average of the two middle points. Resistant to outliers.
- If the mean and median are far from each other, this indicates the presence of outliers or a skewed distribution.
- Mode: number that appears the most often in a set of data.
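A minimal sketch of how an outlier affects these three measures, using Python's standard `statistics` module. The scores are made-up illustrative values, not data from these notes.

```python
from statistics import mean, median, mode

# Hypothetical scores (even count, so the median averages the two middle values).
scores = [68, 72, 72, 76, 78, 84]
print(mean(scores), median(scores), mode(scores))  # mean 75, median 74.0, mode 72

# Add one extreme value: the mean jumps, the median barely moves.
with_outlier = scores + [200]
print(mean(with_outlier))    # 650 / 7, roughly 92.86
print(median(with_outlier))  # 76
print(mode(with_outlier))    # 72
```

The mean shifts by almost 18 points while the median shifts by 2, which is the gap between mean and median that signals an outlier or skew.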
Distributions
- Normal Distribution: any normal distribution can be transformed to the standard normal distribution, which has a mean of zero and a standard deviation of one.
- Skewed Distribution: Contains a tail on one side of the data set and is thus not symmetric.
- Negatively skewed: has a tail on the left; the mean will be lower than the median.
- Positively skewed: has a tail on the right; the mean will be larger than the median.
- Bimodal Distribution: has two peaks; can sometimes be analyzed as two separate distributions.
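The transformation to a standard distribution is usually done by converting each value to a z-score, z = (x - mean) / standard deviation. A short sketch with hypothetical values chosen so the arithmetic is easy to follow (mean 5, population standard deviation 2); the formula is the usual textbook one, not something spelled out in these notes.

```python
from statistics import mean, pstdev

# Hypothetical data with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(values)
sigma = pstdev(values)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]
print(z_scores)          # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# The standardized values have mean 0 and standard deviation 1.
print(mean(z_scores), pstdev(z_scores))
```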
Measures of Distribution
- Range: difference between the largest and smallest values of a data set. Heavily affected by outliers. For roughly normal data without outliers, the standard deviation can be approximated as ¼ * range.
- Interquartile Range: The third quartile minus the first quartile
- Quartiles: divide data into groups that comprise one-fourth of the entire data set.
- To calculate the position of the first quartile: sort the data in ascending order and multiply n (the number of data points) by 1/4.
- If this is a whole number, the quartile is the mean of the values at this position and the next position.
- If this is a decimal, round up to the next whole number and take the value at that position as the quartile.
- For the third quartile, multiply n by 3/4 and follow the same process as for the first quartile.
- Outliers are points that fall more than 1.5*IQR below the first quartile or above the third quartile.
- Standard Deviation:
- If data point falls more than three standard deviations from the mean, it is considered an outlier.
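- On a normal distribution, the 68-95-99.7 rule applies: about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
- Outliers: usually result from one of three causes: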
- True statistical anomaly
- A measurement error
- Distribution is not approximated by a normal distribution.
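The quartile-position rule and the 1.5*IQR outlier check described above can be sketched as follows. The function name `quartile` and the data are hypothetical; note that statistical software often uses slightly different quartile conventions, so library results may not match this hand method exactly.

```python
import math

def quartile(sorted_data, fraction):
    """Quartile value using the position rule from these notes (1-indexed positions)."""
    n = len(sorted_data)
    pos = n * fraction
    if pos == int(pos):
        k = int(pos)
        # Whole number: average the values at position k and position k + 1.
        return (sorted_data[k - 1] + sorted_data[k]) / 2
    # Decimal: round up and take the value at that position.
    return sorted_data[math.ceil(pos) - 1]

data = sorted([7, 1, 30, 4, 8, 3, 10, 6])  # hypothetical data, n = 8
q1 = quartile(data, 0.25)  # position 2 is whole -> (3 + 4) / 2 = 3.5
q3 = quartile(data, 0.75)  # position 6 is whole -> (8 + 10) / 2 = 9.0
iqr = q3 - q1              # 5.5

lower_fence = q1 - 1.5 * iqr  # -4.75
upper_fence = q3 + 1.5 * iqr  # 17.25
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)  # 3.5 9.0 5.5 [30]
```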
Probability
- Mutually Exclusive Outcomes: cannot occur at the same time
- Exhaustive set of outcomes: no other possible outcomes.
Calculations
- For independent events, the probability of two or more events all occurring is the product of their individual probabilities.
- The probability of at least one of two events occurring is equal to the sum of their individual probabilities minus the probability that both will occur.
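Both rules can be checked on a small example. The sketch below assumes two fair six-sided dice (a made-up scenario), with event A "first die shows a six" and event B "second die shows a six", and verifies the formulas by counting all 36 outcomes.

```python
from fractions import Fraction
from itertools import product

p_a = Fraction(1, 6)  # A: first die shows a six
p_b = Fraction(1, 6)  # B: second die shows a six

p_both = p_a * p_b                   # independent events: multiply -> 1/36
p_at_least_one = p_a + p_b - p_both  # sum minus overlap -> 11/36

# Cross-check by enumerating every outcome of the two dice.
rolls = list(product(range(1, 7), repeat=2))
count_both = sum(1 for a, b in rolls if a == 6 and b == 6)
count_either = sum(1 for a, b in rolls if a == 6 or b == 6)
print(p_both, Fraction(count_both, len(rolls)))            # 1/36 1/36
print(p_at_least_one, Fraction(count_either, len(rolls)))  # 11/36 11/36
```

Subtracting the overlap matters because the outcome (6, 6) would otherwise be counted twice.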
Statistical Testing
Hypothesis Testing
- Null Hypothesis: hypothesis of equivalence, says that two populations are equal.
- Alternative Hypothesis: nondirectional (not equal) or directional (greater than or less than).
- Z-tests and t-tests are commonly used tests. A test statistic is calculated from the collected data and compared to a table to determine the probability of obtaining a statistic at least that extreme by random chance if the null hypothesis is true. This probability is known as the p-value.
- If p-value > level of significance (usually 0.05) then the null hypothesis cannot be rejected.
- When the null hypothesis is rejected, the results are statistically significant, indicating a difference between the two groups.
- Level of significance (α): the level of risk accepted for incorrectly rejecting the null hypothesis, i.e. the probability of a type I error.
- Type I Error: reporting a difference between the two populations when one does not actually exist.
- Type II Error: incorrectly failing to reject the null hypothesis, so that no difference is reported when one actually exists. Its probability is β.
- Power: the probability of correctly rejecting the null hypothesis (1 - β).
- Confidence: the probability of correctly failing to reject the null hypothesis when no difference exists.
Decision | Ho true (no difference) | Ha true (difference exists) |
Reject Ho | Type I error (α) | Power (1 - β) |
Fail to reject Ho | Confidence (1 - α) | Type II error (β) |
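As an illustration of the testing procedure, here is a sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, and it uses `statistics.NormalDist` in place of a printed z-table. The function name `two_sided_z_test` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, p) for a two-sided one-sample z-test with known population SD."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a statistic at least this extreme, in either tail, if Ho is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical study: sample of 100 with mean 103, against a population mean of 100 (SD 15).
z, p_value = two_sided_z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=100)
alpha = 0.05  # level of significance

reject_null = p_value <= alpha
print(z)            # 2.0
print(p_value)      # roughly 0.0455
print(reject_null)  # True: p-value is not greater than alpha
```

With a stricter significance level such as 0.01, the same data would fail to reject the null hypothesis.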
Confidence Intervals
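- Reverse of hypothesis testing: start with a desired confidence level (usually 95%) and use a table to find the corresponding z- or t-score. The score is then multiplied by the standard deviation (for an interval around a sample mean, the standard deviation of that mean, i.e. the standard error), and the result is added to and subtracted from the mean to give the bounds of the interval.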
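A sketch of a 95% confidence interval for a mean. Assumptions: the sample values are invented, the z-score comes from `statistics.NormalDist` rather than a table, and the score is multiplied by the standard error of the mean (sample standard deviation divided by the square root of n). For a sample this small, a t-score would normally be used and would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # hypothetical measurements
confidence = 0.95

# Two-sided critical z-score: about 1.96 for 95% confidence.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(len(sample))

margin = z_crit * standard_error
lower, upper = sample_mean - margin, sample_mean + margin
print(round(z_crit, 2))  # 1.96
print(lower, upper)      # interval centered on the sample mean
```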
Charts, Graphs, and Tables
Types of Charts
- Pie/Circle Charts: represent relative amounts of entities. They lose impact as the number of categories increases.
- Bar Charts and Histograms: Bar charts are used for categorical data, while histograms are for numerical data.
- Box Plots: used to show the range, median, quartiles and outliers for a set of data. Box-and-whisker is a labeled box plot.
- Box: bounded by Q1 and Q3, Q2 is the line in the middle (median).
- End of Whiskers: largest and smallest values in the data set that are not outliers.
- Maps: data are displayed geographically.
Graphs and Axes
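- Linear Graphs: plotted on linear axes; the curve itself can be linear, parabolic, exponential, or logarithmic.
- The axes of a linear graph have units that occupy the same amount of space (equal intervals are equally spaced).
- Semilog and Log-Log Graphs: one axis (semilog) or both axes (log-log) use a logarithmic scale, so equal distances represent equal ratios rather than equal differences.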
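One reason to use a semilog graph is that exponential data plot as a straight line on it. A small numeric sketch with a hypothetical exponential relationship, y = 3 * 10^(0.5x): on a linear axis the steps between points keep growing, while the steps in log10(y) are constant.

```python
import math

# Hypothetical exponential data: y = 3 * 10**(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [3 * 10 ** (0.5 * x) for x in xs]

# A semilog plot effectively places each point at log10(y) on the vertical axis.
log_ys = [math.log10(y) for y in ys]

raw_steps = [b - a for a, b in zip(ys, ys[1:])]          # keep growing
log_steps = [b - a for a, b in zip(log_ys, log_ys[1:])]  # all about 0.5

print(raw_steps)
print([round(s, 6) for s in log_steps])  # [0.5, 0.5, 0.5, 0.5]
```

Constant steps in log10(y) for equal steps in x are exactly what a straight line on a semilog graph looks like.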
Applying Data
- Correlation refers to a connection (direct relationship, inverse relationship, etc.) between data. This does not imply causation.
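Direct and inverse relationships can be quantified with a correlation coefficient. The sketch below assumes the Pearson coefficient r, one common measure that these notes do not name. The function `pearson_r` and the data are hypothetical. Values near +1 indicate a direct relationship and values near -1 an inverse one; neither says anything about causation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
direct = [2, 4, 6, 8, 10]   # rises with x
inverse = [10, 8, 6, 4, 2]  # falls as x rises

r_direct = pearson_r(x, direct)
r_inverse = pearson_r(x, inverse)
print(round(r_direct, 6), round(r_inverse, 6))  # 1.0 -1.0
```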