Data-Based and Statistical Reasoning


Measures of Central Tendency

  • Mean: the average of the data. Is not outlier resistant
  • Median: midpoint of data. If even number of data points, then median will be the average of two points. Outlier resistant.
    • If mean and median far from each other, indicates presence of outliers or skewed distribution.
  • Mode: number that appears the most often in a set of data.

Distributions

  • Normal Distribution: can transform any normal distribution to a standard distribution with a mean of zero and a standard deviation one 1.
  • Skewed Distribution: Contains a tail on one side of the data set and is thus not symmetric.
    • Negative-Skewed: has a tail on the left, mean will be lower than the median
    • Positively-skewed: has a tail on the right, mean will be larger than the median.
  • Bimodal Distribution: Has two peaks, can sometimes be measured as two different distributions.

Measures of Distribution

  • Range: difference between the largest and smallest values of a data set. Heavily affected by presence of data outliers. Standard deviation can be approximated as ¼ * range
  • Interquartile Range: The third quartile minus the first quartile
    • Quartiles: divide data into groups that comprise one-fourth of the entire data set.
      • To calculate position of first quartile: sort data in ascending order and multiply n by 1/4
      • If this is a whole number, the quartile is the mean of the value at this position and the next highest position
      • If this is a decimal, round up to the next whole number and take that as the quartile position.
      • For 3rd quartile, multiply n by 3/4. Do same process as first quartile.
    • Outliers are those points that fall outside of 1.5*IQR
  • Standard Deviation:
    • If data point falls more than three standard deviations from the mean, it is considered an outlier.
    • On a normal distribution: 68-95-99 rule applies.
  • Outliers: usually results from one of three causes:
    • True statistical anomaly
    • A measurement error
    • Distribution is not approximated by a normal distribution.

Probability

  • Mutually Exclusive Outcomes: cannot occur at the same time
  • Exhaustive set of outcomes: no other possible outcomes.

Calculations

  • For independent events, probability of two or more events occurring at the same time is the product of their probabilities alone
  • The probability of at least one of two events occurring is equal to the sum of their initial probabilities minus the probability that will both occur.

Statistical Testing


Hypothesis Testing
  • Null Hypothesis: hypothesis of equivalence, says that two populations are equal.
  • Alternative Hypothesis: non-direction (not equal) or direction (greater than or less than)
  • Z-tests or t-tests are commonly used tests. Test Statistic is calculated form collected data, and compared to a table in order to determine the likelihood that the statistic was obtained by random choice. This likelihood is known as the p-value.
  • If p-value > level of significance (usually 0.05) then the null hypothesis cannot be rejected.
    • When null is rejected, results are statistically significant since there is a difference between the two groups.
    • Level of significance is the level of risk that is accepted for incorrectly rejecting the null hypothesis. Also known as a type I error.
    • Type I Error: Likelihood that we report a difference between the two population when one does not actually exist
    • Type II Error: incorrectly fail to reject the null hypothesis. When no difference is reported when there actually is one. (b)
    • Power: the probability of correctly rejecting the null hypothesis: 1-b
    • Confidence: the probability of correctly failing to reject the null hypothesis when no difference exists.
Ho True
(no difference)
Ha true
(difference exists)
Reject HoType I error (a)Power (1-B)
Fail to Reject HoConfidenceType II error (B)

Confidence Intervals
  • Reverse of hypothesis testing, start off with a desired confidence (usually 95%) and use a table to find corresponding Z/t values. Scores are then multiplied by standard deviation and then added/subtracted from the mean

Charts, Graphs, and Tables


Types of Charts
  • Pie/Circle Charts: represent relative amounts of entities. Loses impact as number of categories increases.
  • Bar Charts and Histograms: Bar charts are used for categorical data, while histograms are for numerical data.
  • Box Plots: used to show the range, median, quartiles and outliers for a set of data. Box-and-whisker is a labeled box plot.
    • Box: bounded by Q1 and Q3, Q2 is the line in the middle (median).
    • End of Whiskers: largest and smallest values in the data set that are not outliers.
  • Maps: data is demonstrated geographically

Graphs and Axes
  • Linear Graphs: can be linear, parabolic, exponential or logarithmic
  • Axes of a linear graph will have units that occupy the same amount of space
  • Semilog and Log-Log Graphs: changes are made to one or both of the axis ratio’s.

Applying Data

  • Correlation refers to a connection – direction relationship, inverse relationship, etc. – between data. This does not imply causation.

Show Buttons
Hide Buttons
error: Content is protected !!