Statistics

For MathWiki statistics, see Special:Statistics and Special:WikiStats.

Statistics is a broad mathematical discipline which studies ways to collect, summarize, and draw conclusions from data. It is applicable to a wide variety of academic fields from the physical and social sciences to the humanities, as well as to business, government and industry.

Once data is collected, either through a formal sampling procedure or some other, less formal method of observation, graphical and numerical summaries may be obtained using the techniques of descriptive statistics. The specific summary methods chosen depend on the method of data collection. The techniques of descriptive statistics can also be applied to census data, which is collected on entire populations.

If the data can be viewed as a sample (a subset of some population of interest), inferential statistics can be used to draw conclusions about the larger, mostly unobserved population. These inferences, which are usually based on ideas of randomness and uncertainty quantified through the use of probabilities, may take any of several forms:

Answers to essentially yes/no questions (hypothesis testing)
Estimates of numerical characteristics (estimation)
Predictions of future observations (prediction)
Descriptions of association (correlation)
Modeling of relationships (regression)

The procedures by which such inferences are made are sometimes collectively known as applied statistics. In contrast, statistical theory (sometimes called mathematical statistics when treated as an academic subject) is the subdiscipline of applied mathematics which uses probability theory and mathematical analysis to place statistical practice on a firm theoretical basis. (If applied statistics is what you do in statistics, statistical theory tells you why it works.)

In academic statistics courses, the word statistic (no final s) is usually defined as a numerical quantity calculated from a set of data. In this usage, statistics would be the plural form meaning a collection of such numerical quantities. See Statistic for further discussion.
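As a small illustration of this usage, the sketch below computes three statistics from one data set using Python's standard statistics module. The data values are invented for the example and do not come from any real survey.

```python
import statistics

# Hypothetical data set: six heights in centimetres (illustrative values only).
heights_cm = [158.0, 162.5, 165.0, 167.5, 171.0, 174.0]

# Each of these is a "statistic": a single number calculated from the data.
sample_mean = statistics.mean(heights_cm)
sample_median = statistics.median(heights_cm)
sample_stdev = statistics.stdev(heights_cm)  # sample standard deviation (n - 1 denominator)

print(sample_mean, sample_median, sample_stdev)
```

Taken together, the three numbers are "statistics" in the plural sense described above.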

Less formally, the word statistics (singular statistic) is often used in a way roughly synonymous with data or simply numbers, a common example being sports "statistics" published in newspapers. In the United States, the Bureau of Labor Statistics collects data on employment and general economic conditions; also, the Census Bureau publishes a large annual volume called the Statistical Abstract of the United States based on census data.

Etymology

The word statistics comes from the modern Latin phrase statisticum collegium (lecture about state affairs), which gave rise to the Italian word statista (statesman or politician — compare to status) and the German Statistik (originally the analysis of data about the state). It acquired the meaning of the collection and classification of data generally in the early 19th century. The collection of data about states and localities continues, largely through national and international statistical services.

History of statistics

In the 9th century, the Islamic mathematician, Al-Kindi, was the first to use statistics to decipher encrypted messages and developed the first code-breaking algorithm in the House of Wisdom in Baghdad, based on frequency analysis. He wrote a book entitled Manuscript on Deciphering Cryptographic Messages, containing detailed discussions on statistics.^[1] It covers methods of cryptanalysis, encipherments, cryptanalysis of certain encipherments, and statistical analysis of letters and letter combinations in Arabic.^[2]

In the early 11th century, Al-Biruni's scientific method emphasized repeated experimentation. Biruni was concerned with how to conceptualize and prevent both systematic errors and observational biases, such as "errors caused by the use of small instruments and errors made by human observers." He argued that if instruments produce errors because of their imperfections or idiosyncratic qualities, then multiple observations must be taken and analyzed qualitatively, and on this basis one must arrive at a "common-sense single value for the constant sought", whether an arithmetic mean or a "reliable estimate."^[3]

Modern history

: History of Statistics by Prof. Noelson Manilay

The word statistics is derived from the Latin word "status" or the Italian word "statista", both of which refer to a political state or government. Shakespeare used the word "statist" in his play Hamlet (1602). In the past, statistics was used mainly by rulers. Its application was very limited, but rulers and kings needed information about the lands, agriculture, commerce, and population of their states in order to assess their military potential, wealth, taxation, and other aspects of government.

Gottfried Achenwall used the word Statistik at a German university in 1749 to mean the political science of different countries. In 1771 the Englishman W. Hooper used the word statistics in his translation of Elements of Universal Erudition, written by Baron B.F Bieford; in that book, statistics is defined as the science that teaches us the political arrangement of all the modern states of the known world. There is a big gap between the old statistics and modern statistics, but the old statistics is still used as part of the present discipline.

During the 18th century, English writers used the word statistics in their works, so the field has developed gradually over the last few centuries. A great deal of work was done at the end of the nineteenth century.

At the beginning of the 20th century, William S. Gosset developed methods for decision making based on small sets of data. During the 20th century, several statisticians were active in developing new methods, theories, and applications of statistics. Today, the availability of electronic computers is certainly a major factor in the modern development of statistics.

Basic concepts

There are several approaches to statistics, most of which rely on a few basic concepts.

Population vs. sample

In statistics, a population is the set of all objects (people, etc.) that one wishes to draw conclusions about. In order to do this, one usually selects a sample of objects: a subset of the population. By carefully examining the sample, one may make inferences about the larger population.

For example, if one wishes to determine the average height of adult women aged 20–29 in the U.S., it would be impractical to try to find all such women and ask or measure their heights. However, by taking a small but representative sample of such women, one may determine the average height of all such women quite closely. The matter of taking representative samples is the focus of sampling.
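The idea can be sketched with a small simulation. The population below is entirely hypothetical: 100,000 heights generated from an assumed normal distribution whose mean (163 cm) and standard deviation (7 cm) are made up for illustration. The sketch draws a random sample of 1,000 and compares its mean with the population mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in cm from an assumed normal
# distribution (mean 163, standard deviation 7). These parameters are made up.
population = [rng.gauss(163.0, 7.0) for _ in range(100_000)]

# A random sample of 1,000 members, drawn without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In a real study the population mean is unknown, which is why the sample is taken; the simulation has access to both only because the population was generated artificially.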

Randomness, probability and uncertainty

The concept of randomness is difficult to define precisely. In general, any outcome of an action, or series of actions, which cannot be predicted beforehand may be described as being random. When statisticians use the word, they generally mean that while the exact outcome cannot be known beforehand, the set of all possible outcomes is known — or, at least in theory, knowable. A simple example is the outcome of a coin toss: whether the coin will land heads up or tails up is (ideally) unknowable before the toss, but what is known is that the outcome will be one of these two possibilities and not, say, on edge (assuming that the coin cannot stand upright on its edge). The set of all possible outcomes is usually called the sample space.

The probability of an event is also difficult to define precisely but is basically equivalent to the everyday idea of the likelihood or chance of the event happening. An event that can never happen has probability zero; an event that must happen has probability one. (Note that the reverse statements are not necessarily true; see the article on probability for details.) Every other event has a probability between zero and one. The greater the probability the more likely the event, and thus the less our uncertainty about whether it will happen; the smaller the probability the greater our uncertainty.

There are several basic interpretations of probability used to assign or compute probabilities in statistics:

Relative frequency interpretation: The probability of an event is the long-run relative frequency of occurrence of the event. That is, after a long series of trials, the probability of event A is taken to be:
$P(A) = \frac{\text{number of trials in which event } A \text{ happened}}{\text{total number of trials}}$

To make this definition rigorous, the right-hand side of the equation should be preceded by the limit as the number of trials grows to infinity.

Subjective interpretation: The probability of an event reflects our subjective assessment of the likelihood of the event happening. This idea can be made rigorous by considering, for example, how much one should be willing to pay for the chance to win a given amount of money if the event happens. For more information, see Bayesian probability.
Classical approach to assigning probability: This approach requires that each outcome is equally likely. In this case, the probability that event A occurs is equal to the number of ways that A can occur divided by the number of possible outcomes of the experiment.
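The relative frequency and classical views can be compared in a short simulation. The sketch below assumes a fair six-sided die (an assumption of the example, not something the text specifies) and takes the event "the roll shows a 6". The classical probability is 1/6, and the simulated relative frequency should settle near that value as the number of trials grows.

```python
import random
from fractions import Fraction

rng = random.Random(42)  # fixed seed so the run is repeatable

# Classical approach: six equally likely faces, one way to roll a 6.
classical = Fraction(1, 6)

def relative_frequency(n_trials):
    """Fraction of n_trials simulated die rolls that show a 6."""
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == 6)
    return hits / n_trials

for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n), float(classical))
```

A finite simulation can only suggest the limit mentioned above; it does not prove that the relative frequency converges.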

Note that the relative frequency interpretation does not require that a long series of trials actually be conducted. Typically, probability calculations are ultimately based upon perceived equally likely outcomes, as obtained, for example, when one tosses a so-called "fair" coin or rolls a "fair" die. Many frequentist statistical procedures are based on simple random samples, in which every possible sample of a given size is as likely as any other.
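The defining property of a simple random sample can be checked empirically on a toy case. The sketch below uses a hypothetical five-member population and samples of size 2, lists all ten possible samples, and counts how often each one is drawn by Python's random.sample. Each should turn up in roughly one tenth of the draws.

```python
import random
from collections import Counter
from itertools import combinations

rng = random.Random(1)  # fixed seed so the run is repeatable
population = ["A", "B", "C", "D", "E"]  # hypothetical five-member population
sample_size = 2

# All possible samples of size 2 (order ignored): C(5, 2) = 10 of them.
all_samples = list(combinations(population, sample_size))

# Draw many simple random samples and count how often each one appears.
draws = 100_000
counts = Counter(
    tuple(sorted(rng.sample(population, sample_size))) for _ in range(draws)
)

for s in all_samples:
    print(s, counts[s] / draws)  # each proportion should be close to 1/10
```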

Prior information and loss

Once a procedure has been chosen for assigning probabilities to events, the probabilistic nature of the phenomenon under consideration can be summarized in one or more probability distributions. The data collected is then viewed as having been generated, in a sense, according to the chosen probability distribution.

Section requires expansion...

Data collection

Sampling

Main article: Sampling (statistics)

If your business has a million customers, it isn't really feasible to ask them all what their experience of your business was like. In order to take a survey of a population, statisticians use a sample of the population. For example, you might survey five thousand people instead of a million and still come up with some answers that are close to what you would have obtained if you had surveyed everyone.
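A rough sketch of the survey example follows. It assumes a hypothetical customer base of one million in which 72% are satisfied (an invented figure), surveys a simple random sample of five thousand, and reports the estimated proportion with an approximate 95% margin of error based on the normal approximation.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical customer base: 1,000,000 customers, 72% of them satisfied.
# The 72% figure is invented purely for illustration.
n_customers = 1_000_000
n_satisfied = 720_000
customers = [True] * n_satisfied + [False] * (n_customers - n_satisfied)

# Survey a simple random sample of 5,000 customers.
n_surveyed = 5_000
survey = rng.sample(customers, n_surveyed)
p_hat = sum(survey) / n_surveyed  # estimated proportion of satisfied customers

# Approximate 95% margin of error for a proportion (normal approximation).
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_surveyed)

print(f"estimate: {p_hat:.3f} +/- {margin:.3f} (true value 0.720)")
```

The margin of error depends mainly on the number surveyed rather than on the size of the customer base, which is why a few thousand responses can stand in for a million.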

Experimental design

Main article: Experimental design (statistics)

Data summary: descriptive statistics

Main article: Descriptive statistics

Levels of measurement

Main article: Level of measurement

Qualitative (categorical)
- Nominal
- Ordinal
Quantitative (numerical)
- Interval
- Ratio

Graphical summaries

Main article: Statistical graphs

Numerical summaries

Main article: Summary statistics

Data interpretation: inferential statistics

Main article: Statistical inference

Estimation

Main article: Statistical estimation

Prediction

Main article: Statistical prediction

Hypothesis testing

Main article: Statistical hypothesis testing

Relationships and modeling

Correlation

Main article: Correlation

Two quantities are said to be correlated if greater values of one tend to be associated with greater values of the other (positively correlated) or with lesser values of the other (negatively correlated). In the case of interval or ratio variables, this is often apparent in a scatterplot of the data: positive correlation is reflected in an overall increasing trend in the data points when viewed left to right on the graph; negative correlation appears as an overall decreasing trend.

The correlation between two variables is a number measuring the strength and usually the direction of this relationship. Most measures of correlation take on values from -1 to 1 or from 0 to 1. Zero correlation means that greater values of one variable are associated with neither higher nor lower values of the other, or possibly with both. A correlation of 1 implies a perfect positive correlation, meaning that an increase in one variable is always associated with an increase in the other (and possibly always of the same size, depending on the correlation measure used). Finally, a correlation of -1 means that an increase in one variable is always associated with a decrease in the other.
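One widely used measure is the Pearson correlation coefficient, which ranges from -1 to 1. The text above does not single out a particular measure, so choosing Pearson here is an assumption made for illustration. The sketch computes it from scratch for three small invented data sets: an increasing line, a decreasing line, and a symmetric U shape in which greater values of one variable go with both lower and higher values of the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # points on an increasing line
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # points on a decreasing line
print(pearson_r(xs, [4, 1, 0, 1, 4]))   # symmetric U shape
```

The U-shaped case shows that zero correlation does not mean the variables are unrelated, only that there is no overall linear trend.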

See the article on correlation for more information.

Regression

Main article: Regression

Time series

Main article: Time series

Data mining

Main article: Data mining

Statistics in other fields

Biostatistics
Business statistics
Chemometrics
Demography
Economic statistics
Engineering statistics
Epidemiology
Geostatistics
Psychometrics
Statistical physics

Subfields or specialties in statistics

Mathematical statistics
Reliability
Survival analysis
Quality control
Time series
Categorical data analysis
Multivariate statistics
Large-sample theory
Bayesian inference
Regression
Sampling theory
Design of experiments
Statistical computing
Non-parametric statistics
Density estimation
Simultaneous inference
Linear inference
Optimal inference
Decision theory
Linear models
Data modeling
Sequential analysis
Spatial statistics

Probability:

Stochastic processes
Queueing theory

Related areas of mathematics

Probability
- Set theory
- Finite mathematics
- Discrete mathematics
  - Combinatorics
Analysis
- Calculus
- Real analysis
  - Measure theory
  - Probability theory
  - Distribution theory
  - Asymptotic analysis
Linear algebra
- Matrix theory
Numerical analysis
- Scientific computing

Statistical software

Main article: List of statistical software

Commercial

CART
ECHIPS (EChips)
Excel
- add-ins: Analyse-It, SigmaXL, statistiXL, WinSTAT, XLSTAT (XLSTAT)
JMP
Minitab
NCSS
nQuery
PASS
SAS System (SAS)
S
- descendants: S-PLUS (S-Plus), S2, S3, S4, S5, S6
SPSS
Stata
STATISTICA (Statistica)
StatXact, LogXact
SUDAAN (Sudaan)
SYSTAT (Systat)

Free versions of commercial software

Gnumeric — not a clone of Excel, but implements many of the same functions
R — free version of S
FIASCO or PSPP — free version of SPSS

Other free software

BUGS — Bayesian inference Using Gibbs Sampling
ESS — a GNU Emacs add-on
...
see also [1]

Licensing unknown

Genstat
XLispStat
...

World Wide Web

StatLib — large repository of statistical software and data sets

Online data sources

StatLib
...

References

[1] Singh, Simon (2000). The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography (1st Anchor Books ed.). New York: Anchor Books. ISBN 0-385-49532-3.
[2] "Al-Kindi, Cryptgraphy, Codebreaking and Ciphers". http://www.muslimheritage.com/topics/default.cfm?ArticleID=372. Retrieved 2007-01-12.
[3] Glick, Thomas F.; Livesey, Steven John; Wallis, Faith (2005). Medieval Science, Technology, and Medicine: An Encyclopedia. Routledge. pp. 89–90. ISBN 0-415-96930-1.
