Некоторые идеи вероятности и статистики

В настоящее время, когда многие математики подходят к статистике как части теории вероятностей, трудно представить, что в начале статистика не имела ничего общего с вероятностью, хотя обе возникли в государствах Северной Италии в период зарождения капитализма, 17 век. Статистика – от слова «стата» (государство) – для отражения показателей народного хозяйства и населения, теория вероятности из потребности анализа карточных игр – карточная игра там была тогда средой для формирования отношений и заключения договоров. Можно сказать, что теория вероятностей зародилась в переписке П.Ферма (1601-1665) и Б. Паскаля (1623-1662).

Вероятность - это мера случайности. Что такое случайность? Вот четыре довольно разных подхода:

(а) Выбор из хорошо перемешанной колоды – классический взгляд. Вера в существование механизма, устойчиво перемешивающего колоду.

(б) Вероятность события сопряжена с частотой его появления в «вероятностном эксперименте» (по сути – та же хорошо перемешанная колода).

(в) Степень нашей уверенности в появлении события – это понимание отсеялось недавно в так называемые нечеткие множества.

(г) Подход А.Н. Колмогорова: случайность это отсутствие закономерности; тогда степень случайности данной последовательности – сложность или даже длина программы, ее порождающей. Этот взгляд объединяет случайность со сложностью и информативностью. Он связан с развитием вычислительной техники и необходимостью разработки алгоритмов для генерации «случайных» - математики говорят, псевдослучайных – чисел.

Первоначально статистика не имела ничего общего с вероятностью; она развивалась - сначала в Англии как политическая арифметика (родоначальник – Уильям Петти (1623-1687))– для демографических и народнохозяйственных оценок. Связь с вероятностью, равно как и обоснование идеологии статистического вывода, а также и практики статистического наблюдения – в трудах одного из основателей статистики как науки, включая переписи населения – бельгийца Адольфа Кетле (1796-1874) при исследовании частот различных социальных явлений таких как самоубийство. Серьезную роль сыграли российские ученые П.Л. Чебышев (1821-1894), А.А. Марков (1856-1922), А.И. Чупров (1842-1908), А.М. Ляпунов (1857-1918) и А.Н. Колмогоров (1903-1987).

5.1 Вероятностная статистическая модель как средство анализа данных.

Рассмотрим множество N наблюденных значений какого-либо показателя X={x₁,…,x_N}. Хотя рассматривается одномерная характеристика, все дальнейшее верно и для многомерных наблюдений.

Вероятностник будет считать, что эти числа – независимая случайная выборка из распределения с плотностью f(u/a), где a – параметры распределения, как например, в случае нормального распределения

f(u/(m, s))= Cexp{-(u - m)2 / 2s2},

где a=(m, s) - математическое ожидание и дисперсия.

Вероятностный статистик будет склонен рассматривать математическую модель наблюдений, например, в форме уравнения

x_i = a + ei, для всех i=1,2,…, N

где ei - аддитивные ошибки, подчиняющиеся определенному закону распределения.

И в том, и в другом случае данные используются для оценки параметров соответствующих моделей. Основные проблемы – насколько точно оценены параметры? Как сделать оценки поточнее? Нельзя ли уменьшить зависимость от вида функции распределения? Как убедиться, что распределения соответствуют предположениям?

5.2 Смысл коэффициента корреляции.

Рассмотрим множество N наблюденных пар значений показателей (x,y)={(x₁, y₁), …,(x_N, y_N)}. Вопрос – можно ли что-нибудь сказать о связи y с x? На этот вопрос Ф. Галтон – а его интересовала связь способностей родителей и их детей, - предложил свою теорию регрессионного анализа (1896), смысл которой состоял в том, что надо построить линейную зависимость y=ax+b с оценкой постоянных коэффициентов a и b исходя из идеи минимизации суммарной квадратичной погрешности оцененных через х значений y с наблюденными. Согласно этой теории, характеристикой связи x и y является коэффициент корреляции

r = [ S_i (xi – mx)(yi-my) ] ⁄ [ Ns(x) s(y) ]

А именно, оптимальные значения коэффициентов регрессионного уравнения определяются формулами

a = r s(y) /s(x)

b = my –a*mx

где mx, my – средние значения, а s(y), s(x) – среднеквадратичные (стандартные) отклонения переменных величин x_i и y_i, соответственно. При этом минимальное значение суммы квадратов погрешностей в значениях y_i, получаемых если принять линейную зависимость y_i = ax_i+b, равно

L_m(a,b) = Ns²(y)(1- r²)

так что величина r² показывает, насколько уменьшается дисперсия y, если учесть линейную зависимость y от x.

Таким образом, коэффициент корреляции, действительно, характеризует связь между переменными, но не вообще, как неправильно пишут в интернете, а именно ЛИНЕЙНОЙ связи. Например, на Рис. 2 представлены три ситуации нулевого коэффициента корреляции, из которых только одна характеризует полное отсутствие связи.

Figure 2. Three scatter-plots corresponding to zero or almost zero correlation coefficient r; the case on the left: no relation between x and y; the case in the middle: a non-random quadratic relation y=(x-2)²+5; the case on the right: two symmetric linear relations, y=2x-5 and y=-2x+3, each holding at a half of the entities.

К. Пирсон, сотрудничавший с Ф. Галтоном, нашел потрясающую вероятностную интерпретацию коэффициенту корреляции, тем самым дав фундамент математической статистике на последующие десятилетия. Оказалось, что коэффициент корреляции может трактоваться как выборочная оценка параметра двумерного Гауссового распределения. Функция плотности z-score нормированного Гауссова распределения:

f(u, å)= C* exp{-u^Tå^-1u/2}

где = (x, y) двумерный вектор с компонентами x и y, а å - ковариационная (корреляционная) матрица

в которой r - это параметр определяющий параметры эллиптических линий уровня функции плотности. На любой линии уровня u^Tå^-1u является постоянной, так что x²-2rxy+y²=const. Как известно, это – уравнение эллипса, вырождающегося в окружность при r=0 и в пару прямых при r =± 1.

Рис. 3. Пример двумерного нормального распределения.

Этот факт – в основе следующего недоразумения. Некоторые статистики утверждают, что расчет линейной регрессии и коэффициента корреляции имеет смысл только в ситуации двумерной нормальности (Гауссовости). Очевидно, что здесь –расширение импликации «Гаусс Þ Коэффициент корреляции» до эквивалентности «Гаусс Û Коэффициент корреляции», как обычно, совершенно неоправданное.

Линейная зависимость сама по себе ничего общего не имеет с Гауссовостью; как описано выше, в чисто аппроксимационном контексте коэффициент корреляции имеет вполне разумное, хотя и другое истолкование.

5.3. Проблема выбора вида модели описания корреляции: Бритва Окхама, принципы простоты и минимума сложности описания, сложность по Вапнику-Червоненкису.

To specify a problem of learning correlation in a data table, one has to distinguish between two parts in the feature set: predictor, or input, features and target, or output, features. Typically, the number of target features is small, and in generic tasks, there is just one target feature. Target features are usually difficult to measure or impossible to know beforehand. This is why one would want to derive a decision rule relating predictors and targets so that prediction of targets can be made after measuring predictors only. Examples of learning problems include:

(a) chemical compounds: input features are of the molecular structure, whereas target features are activities such as toxicity or healing effects;

(b) types of grain in agriculture: input features are those of the seeds, ground and weather, and target features are of productivity and protein contents,

(c) industrial enterprises: input features refer to technology, investment and labor policies, whereas target features are of sales and profits;

(d) postcode districts in marketing research: input features refer to demographic, social and economic characteristics of the district residents, target features – to their purchasing behavior;

(e) bank loan customers: input features characterize demographic and income, whereas output features are of (potentially) bad debt;

(f) gene expression data: input features relate to levels of expression of DNA materials in the earlier stages of an illness, and output features to those at later stages.

A decision rule predicts values of target features from values of input features. A rule is referred to as a classifier if the target is categorical and as a regression if the target is quantitative. A generic categorical target problem is defined by specifying just a subset of entities labeled as belonging to the class of interest – the correlation problem in this case would be of building such a decision rule that would recognize, for each of the entities, whether it belongs to the labeled class or not. A generic regression problem – the bivariate linear regression – has been considered выше.

Figure 5.1. Structure of a training/testing problem: In training, on the top, the decision rule is fitted to minimize the difference between the predicted and observed target data. In testing, the bottom part, the rule is used to calculate the difference so that no feedback to the rule is utilized.

A decision rule is learnt over a dataset in which values of the targets are available. These data are frequently referred to as the training data. The idea underlying the process of learning is to look at the difference between predicted and observed target feature values on the training data set and to minimize them over a class of admissible rules. The structure of such a process is presented on the upper part of Figure 5.1.

The notion that it ought to be a class of admissible rules pre-specified emerges because the training data is finite and, therefore, can be fit exactly by using a sufficient number of parameters. However, this would be valid on the training set only, because the fit would capture all the errors and noise inevitable in data collecting processes. Take a look, for example, at the 2D regression problem on Figure 5.2 depicting seven points on (x,u) -plane corresponding to observed combinations of input feature x and target feature u.

Figure 5.2. Possible graphs of interrelation between x and u according to observed data points (black circles).

The seven points on Figure 5.2 can be exactly fitted by a polynomial of 6^th order u = p(x) = a₀+a₁x+a₂x²+ a₃x³ +a₄x⁴+a₅x⁵+a₆x⁶. Indeed, they would lead to 7 equations u_i=p(x_i) (i=1,…,7), so that, in a typical case, the 7 coefficients a_k of the polynomial can be exactly determined. Having N points observed would require an N -th degree polynomial to exactly fit them.

However, the polynomial, on which graph all observations lie, has no predictive power both within and beyond the range. The curve may go either course (like those shown) depending on small changes in the data. The power of a theory – and a regression line is a theory in this case – rests on its generalization power, which, in this case, can be cast down as the relation between the number of observations and the number of parameters: the greater the better. When this ratio is relatively small, statisticians would refer to this as an over-fitted rule. The overfitting normally would produce very poor predictions on newly added observations. The blue straight line fits none of the points, but it expresses a simple and very robust tendency and should be preferred because it summarizes the data much deeper: the seven observations are summarized here in just two parameters, slope and intercept, whereas the polynomial line provides no summary: it involves as many parameters as the data entities. This is why, in learning decision rules problems, a class of admissible rules should be selected first. Unfortunately, as of this moment, there is no model based advice, within the data analysis discipline, on how this can be done, except very general ones like “look at the shapes of scatter plots”. If there is no domain knowledge to choose a class of decision rules to fit, it is hard to tell what class of decision rules to use.

A most popular advice relates to the so-called Occam’s razor, which meansthat the complexity of the data should be balanced by the complexity of the decision rule. A British monk philosopher William Ockham (c. 1285–1349) said: “Entities should not be multiplied unnecessarily.” This is usually interpreted as saying that all other things being equal, the simplest explanation tends to be the best one. Operationally, this is further translated as the Principle of Maximum Parsimony, which is referred to when there is nothing better available. In the format of the so-called “Minimum description length” principle, this approach can be meaningfully applied to problems of estimation of parameters of statistic distributions (see P.D. Grünwald 2007). Somewhat wider, and perhaps more appropriate, explication of the Occam’s razor is proposed by Vapnik (2006). In a slightly modified form, to avoid mixing different terminologies, it can be put as follows: “Find an admissible decision rule with the smallest number of free parameters such that explains the observed facts” (Vapnik 2006, p. 448). However, even in this format, the principle gives no guidance about how to choose an adequate functional form. For example, which of two functions, the power function f(x)=ax^b or logarithmic one, g(x)=blog(x)+a, both having just two parameters a and b, should be preferred as a summarization tool for graphs on Figure 5.3?

Figure 5.3. Graph of one two functions, f(x)=65x^0.3 and g(x)=50log(x)+30, both with an added normal noise N(0,15), is presented on each plot. Can the reader give an educated guess of which is which? (Answer: f(x) is on the right and g(x) on the left.)

Another set of advices, not incompatible with those above, relates to the so-called falsifability principle by K. Popper (1902-1994), which can be expressed as follows: “Explain the facts by using such an admissible decision rule which is easiest to falsify” (Vapnik 2006, p. 451). Inprinciple, to falsify a theory one needs to give an example contradicting to it. Falsifability of a decision rule can be formulated in terms of the so-called VC-complexity, a measure of complexity of classes of decision rules: the smaller VC-complexity the greater the falsifability.

Figure 5.4. Any two-part split of three points (not on one line) can be made by a linear function, but the presented case on four points cannot be solved by a line.

Let us explain the concept of VC-complexity for the case of a categorical target, so that a decision rule to be would be a classifier. However many categorical target features are specified, different combinations of target categories can be assigned different labels, so that a classifier is bound to predict a label. A set of classifiers F is said to shatter the training sample if for any possible assignment of the labels, a classifier exactly reproducing the labels can be found in F. Given a set of admissible classifiers F, the VC-complexity of a classifying problem is the maximum number of entities that can be shattered by classifiers from F. For example, 2D points have VC complexity 3 in the class of linear decision rules. Indeed, any three points, not lying on a line, can be shattered by a line; yet not all four-point sets can be shattered by lines, as shown on Figure 5.4, the left and right parts, respectively.

The VC complexity is an important characteristic of a correlation problem especially within the probabilistic machine learning paradigm. Under conventional conditions of the independent random sampling of the data, an reliable classifier “with probability a % will be b % accurate, where b depends not only on a, but also on the sample size and VC-complexity” (Vapnik 2006).

The problem of learning correlation in a data table can be stated, in general terms, as follows. Given N pairs (x_i, u_i),i =1, …, N, in which xi are predictor/input p -dimensional vectors x_i=(x_i1,…,x_ip) and u_i = (u_i1,…,u_iq) are target/output q -dimensional vectors (usually q =1), build a decision rule