Types of statistical data. Forms of presentation of statistical data. Statistical calculations of the confectionery market

The subject of statistics has changed throughout the history of statistical science, and to this day scholars have not reached an unambiguous answer on this question.

The subject of statistics is the study of social phenomena and their analysis.

Thus, the English statisticians G. U. Yule and M. G. Kendall held that: "Regardless of the branch of knowledge in which numerical data are obtained, they possess properties of a certain kind, the identification of which may require a special scientific method of processing. The latter is known as the statistical method, or statistics."

The universality of statistics as a science stems from the fact that it deals with methods of measurement and interpretation in both the social and the natural sciences. Statistics is recognized as a special method used in various fields of activity and in solving a variety of problems, defined as "the collection, presentation and interpretation of numerical data."

Statistical methodology and practice are inextricably linked, complement and develop each other. Statistical theory generalizes the experience of practical work, develops new ideas and methods that enrich the practical statistical activity. Statistical practice is scientifically organized work.

Thus, statistics is a science that studies the quantitative side of mass social phenomena, in inextricable connection with their qualitative side, in the specific conditions of place and time, and in their interconnection and interdependence (N. N. Ryauzov, "General Theory of Statistics").

The essence of this definition is related to six main points:

1. Not all phenomena are studied, but only social and socio-economic ones. These phenomena are complex and diverse (for example: production, labor, health care, cultural activity, population, etc.) and differ from natural phenomena, which have a relatively stable character and recur over time.

2. Mass socio-economic phenomena are investigated, not isolated ones, since patterns of development manifest themselves through a multitude of facts, when data are generalized over a sufficiently large number of units (the law of large numbers).

3. The phenomena are given a quantitative assessment, on the basis of which their qualitative content is revealed (for example: for a quantitative analysis of unemployment, the employment indicator and the unemployment rate are used).

4. The numerical characteristics of the same phenomenon are different in space and time.

5. Socio-economic phenomena are studied in dynamics in order to identify trends and directions of development, forecast future situations.

6. Study of phenomena in interconnection and interdependence.



Thus, when using statistical methods, it is important to remember the unity of the quantitative and qualitative aspects of the phenomenon under study.

So, statistics is concerned with the study of mass phenomena or aggregates.

An aggregate is a group that is homogeneous in some attribute and consists of a core and the phenomena surrounding it (the "layer"). The core is a concentrated expression of all the specific properties of the given group that distinguish one aggregate from others. The "layer" consists of units with an incomplete set of specific properties that belong to the given population with a certain probability.

For example: the population is students, among students there are:

- “ideal student” - studies well, reads a lot, actively participates in extracurricular work - this is the core.

- a student for whom only "interesting," specialized knowledge matters is one layer;

- a student who is interested only in extracurricular life, etc., is another layer.

Thus, some students can be assigned almost unmistakably to one type or another by their "quality," while for others this is quite difficult.

The ratio of the core to its environment differs from one aggregate to another and depends on the conditions of the aggregate's existence: its duration, stability, interaction with other aggregates, etc. In any case, the core should constitute the majority of the units of the aggregate, since it determines its characteristic features.

Since statistics studies phenomena at a particular place and time, it deals with a limited amount of data.

A statistical population is a set of objectively existing units of the studied phenomenon, united by a single qualitative basis and a common connection, but differing from each other in individual features (for example, a set of households, families, enterprises, firms, associations, etc.).

An aggregate must be distinguished from a system or a structure, since in an aggregate there is no ordering: all the elements are separate.

A feature is a characteristic property of a unit of the population.

By the nature in which they reflect the properties of the units of the studied population, features are divided into two main groups:

1. Quantitative features, which have a direct numerical expression and can therefore be summed (for example: age, income, number of children, years of study, work experience, etc.). They admit a "more-less" relationship.

2. Qualitative features, which have no direct numerical expression and cannot be summed (for example: gender, profession, nature of work, attitude to something). They admit only an "equal-unequal" relationship and do not allow a "more-less" relationship.

All qualitative features are divided into:

Attributive features, which express an inherent property of the phenomenon (for example: profession, nature of work, etc.);

Alternative features, whose variants are opposite in meaning (for example: a product is either sound or defective; a member of a given age group either survives or does not survive to the next age group; a person is either married or not, a man or a woman, etc.).

In addition, features in statistics can be divided into different groups depending on the classification basis. The main classifications of features are shown in Figure 1.2.

Feature classifications in statistics

Descriptive features are expressed verbally (form of ownership of the enterprise, type of raw materials used, profession, etc.). They are divided into nominal features, which cannot be ordered or ranked (nationality, industry affiliation of the enterprise, etc.), and ordinal features, which can be ranked (wage-grade category, student progress score, company ratings, etc.).

Quantitative features are those whose individual values have a numerical expression (the area of a region's territory, the value of a company's assets, the price of goods, etc.).

Primary features characterize a unit of the population as a whole. They can be measured, counted or weighed and exist on their own, regardless of any statistical study (the number of residents of a city, the gross grain harvest, the amount of insurance payments).

Secondary features are obtained by calculation, as ratios of primary features. They are products of human cognition, the results of studying the object under consideration.

Direct features are properties inherent in the object they characterize.

Indirect features are properties inherent not in the studied object itself, but in other aggregates related to the object.

Alternative features are those that take only one of two values (a person's sex, place of residence (city or village), possession or non-possession of something).

Discrete features take only integer values.

Continuous features can take any values, both whole and fractional. All secondary features are considered continuous.

Moment features characterize a state, the presence of something at a certain point in time.

Interval features characterize a process over a certain period of time: a year, half a year, a quarter, a month, a day, etc.

A peculiarity of statistical research is that only varying features are studied, i.e. characteristics that take different values (for attributive and alternative features) or have different quantitative levels for individual units of the population.

Variation is a significant property of a statistical population.

Variation is a property of a statistical population reflecting its capacity for change, caused by both external and internal factors, whether or not they are related to the essence of the object under study.

A statistical pattern is a regularity, established by the law of large numbers, in mass varying phenomena combined in a statistical population.

A statistical pattern manifests itself in trends.

The functions of statistics:

1. Descriptive: with the help of figures, a description of a particular situation, process or phenomenon is given.

2. Explanatory: cause-and-effect relationships between phenomena and processes are revealed, as are the factors that determine particular connections.

The nature of statistical data is determined by three main properties:

1. Uncertainty of statistical data

2. The probabilistic nature of statistical data (a feature may or may not take a given value)

3. Abstractness of statistical data.


Eliseeva, I. I. Workshop on the General Theory of Statistics. Moscow: Finance and Statistics, 2008, p. 8.

Statistical data must be adequate, firstly, to the object of study, and secondly, to the time at which they are collected and used.

This chapter describes the sources of statistical data, their types and methods of obtaining, as well as techniques for describing and presenting numerical and non-numerical data.

After studying this chapter, you should be able to:

  • build a program of statistical research;
  • determine the sources of statistical information;
  • produce a summary and grouping of statistical data and generate statistical tables;
  • display the results of grouping in the form of diagrams;
  • assess the main characteristics: relative value, mean value, variance, standard deviation, median, mode, range.
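
As a minimal illustration of these descriptive measures, here is a sketch assuming Python with NumPy; the sample of monthly confectionery sales figures is hypothetical:

```python
# Descriptive measures for a hypothetical sample of monthly sales, in tonnes.
import numpy as np

sales = np.array([12.0, 15.5, 15.5, 18.0, 21.0, 14.5, 16.0])

mean = sales.mean()                       # mean value
variance = sales.var(ddof=1)              # sample variance
std_dev = sales.std(ddof=1)               # standard deviation
median = np.median(sales)                 # median
values, counts = np.unique(sales, return_counts=True)
mode = values[counts.argmax()]            # mode (most frequent value)
value_range = sales.max() - sales.min()   # range
relative = sales / sales.sum()            # relative values (shares of the total)

print(mean, variance, std_dev, median, mode, value_range)
print(relative.round(3))
```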

Obtaining initial data

Obtaining information about the object of research is one of the main tasks of statistical research.

Statistical research should be guided by its objectives and by the requirements placed on its results. These define the methods of statistical analysis, on the basis of which the collection of initial data is organized. In the process of statistical research, one should beware of the following errors: goals that are not clearly formulated and observation methods that are incorrectly applied.

Obtaining initial data for a statistical study can be performed in two ways:

  • an active experiment, specially organized to determine statistical dependences;
  • statistical observation.

An active experiment is used in technical and economic research, when, for example, the task is to optimize the modes of technological processes according to economic criteria.

When conducting a statistical study of socio-economic processes, only observation can be used. The basis of this method of obtaining information is the observation program, which consists of three main stages:

  • determination of the research object;
  • selection of the unit of the population;
  • determination of the system of indicators to be registered.

The object of observation is a set of units of the studied phenomenon, about which statistical information can be collected. To clearly define the object of observation, you should answer the following questions:

  • what? (which elements we will investigate);
  • where? (where the observation will be conducted);
  • when? (for what period).

From the point of view of the organization of statistical observation, there are two main forms: reporting and specially organized statistical observation.

Reporting as a form of observation is characterized by the fact that statistical bodies systematically receive information from enterprises, institutions and organizations in a timely manner on the conditions and results of work for the past period, the volume and content of which are determined by the approved reporting forms.

A specially organized statistical observation is a collection of information in the form of censuses, one-time counts and surveys. Such observations are organized to study phenomena that cannot be covered by mandatory reporting.

The types of statistical observation are distinguished by the time of data registration and by the degree of coverage of the units of the studied population. By the nature of data recording over time, observation can be classified as:

  • continuous (for example, the accounting of manufactured products);
  • periodic (financial statements);
  • one-time, when information is needed on a particular occasion, for example, a population census.

By the degree of coverage of units of the studied population:

  • non-continuous (selective), when not the entire population is examined but only part of it;
  • continuous, i.e. a description of all units of the population;
  • monographic, when typical objects are described in detail.

The main methods of obtaining statistical information are direct observation, documentary method and questioning.

The method of direct observation is characterized by the fact that representatives of state statistical bodies or other organizations record data in statistical documents after personally inspecting, counting, measuring or weighing the units of observation.

In the case of a documentary method of observation, various documents serve as the source. This method is used in the preparation of statistical reporting by enterprises and institutions on the basis of documents of standard accounting.

In a survey, the source of information is the responses of the respondents. A survey can be organized in different ways: by the expeditionary method, self-registration, the correspondent method or the questionnaire method.

In the expeditionary method, representatives of the statistical authorities ask the person being examined and, from his words, write down the information in the observation forms.

When using the self-registration method, the surveyed units (enterprises or citizens) are handed a survey form and given instructions on how to fill it out. Completed forms are sent by mail within the specified period.

In the case of the correspondent method, volunteer correspondents provide information to the statistical authorities.

The questionnaire method of collecting data is based on the principle of voluntary filling in of questionnaires by addressees.

A statistical table is usually defined as a form of compact, visual presentation of statistical data.

Analysis of tables allows you to solve many problems in the study of changes in phenomena over time, the structure of phenomena and their relationships. Thus, statistical tables serve as a universal means of rational presentation, generalization and analysis of statistical information.

Outwardly, a statistical table is a system of specially constructed horizontal rows and vertical columns with a common heading and headings of the columns and rows, at the intersections of which statistical data are recorded.

Each figure in statistical tables is a specific indicator characterizing the size or levels, dynamics, structure or relationship of phenomena in specific conditions of place and time, that is, a certain quantitative and qualitative characteristic of the phenomenon under study.

If the table is not filled with numbers, that is, it has only a common title, headings of columns and rows, then we have a layout of the statistical table. It is with its development that the process of compiling statistical tables begins.

The main elements of a statistical table are its subject and its predicate.

The subject of a table is the object of statistical study, that is, individual units of the population, groups of them, or the population as a whole.

The predicate of a table is the set of statistical indicators characterizing the object under study.

The subject and the indicators of the predicate of the table must be determined very precisely. As a rule, the subject is located on the left side of the table and makes up the content of the lines, and the predicate is on the right side of the table and makes up the content of the columns.

Usually, when arranging the indicators of the predicate in the table, the following rule is followed: first, the absolute indicators characterizing the volume of the studied population are given, then the calculated relative indicators reflecting the structure, dynamics and relationships between the indicators.

Building analytical tables

The construction of analytical tables is as follows. Any table consists of a subject and a predicate. The subject identifies the economic phenomenon discussed in the table and contains a set of indicators reflecting that phenomenon. The predicate of the table explains by which features the subject is characterized.

Some tables reflect changes in structure of any kind. Such tables contain information on the composition of the analyzed economic phenomenon both in the baseline and in the reporting period. Based on these data, the proportion (specific weight) of each part in the total population is determined and deviations from the basic specific weights for each part are calculated.

Separate tables can reflect the relationship between economic indicators for some reason. In such tables, information on the economic indicator is arranged in ascending or descending order of the numerical values characterizing this indicator.

In the economic analysis, tables are also compiled, reflecting the results of determining the influence of individual factors on the value of the analyzed generalizing (effective) indicator. When drawing up such tables, first place information about the factors affecting the generalizing indicator, then information about the generalizing indicator itself and finally about the change in this indicator in the aggregate, as well as due to the impact of each analyzed factor. Separate analytical tables reflect the results of calculating the reserves for improving economic indicators, identified as a result of the analysis. Such tables show both the actual and theoretically possible size of the influence of individual factors, as well as the possible size of the reserve for the growth of the generalizing indicator due to the influence of each individual factor.

Finally, in the analysis of economic activity, tables are also compiled that are intended to summarize the results of the analysis.

The practice of statistics has developed the following rules for compiling tables:
  • The table should be expressive and compact. Therefore, instead of one cumbersome table for many features, it is better to make several small in volume, but visual, corresponding to the task of studying the tables.
  • The name of the table, the headings of the columns and lines should be formulated accurately and concisely.
  • The table must indicate: the object under study, the territory, and the time to which the data given in the table refer, the units of measurement.
  • If some data are absent, either an ellipsis is put in the table or "no information" is written; if some phenomenon did not take place, a dash is put.
  • The values of the same indicator are given in the table with the same degree of accuracy.
  • The table should have totals for groups, for subgroups and overall. If summation of the data is impossible, the multiplication sign "×" is put in that cell.
  • In large tables, insert a gap after every five rows to make the table easier to read and analyze.
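
As an illustration of the rules above, here is a minimal sketch assuming Python with pandas; the product groups and figures are hypothetical, and the table gives absolute indicators first, then the calculated relative ones, and a total row:

```python
# Build a small statistical table: absolute indicators, relative shares, totals.
import pandas as pd

data = pd.DataFrame(
    {
        "Product group": ["Chocolate", "Caramel", "Marmalade"],
        "Sales, tonnes": [120.0, 85.5, 44.5],
    }
)

# Relative indicators follow the absolute ones (share of the total, %).
data["Share, %"] = (data["Sales, tonnes"] / data["Sales, tonnes"].sum() * 100).round(1)

# Totals for the whole population, as the rules require.
total = pd.DataFrame(
    {"Product group": ["Total"],
     "Sales, tonnes": [data["Sales, tonnes"].sum()],
     "Share, %": [100.0]}
)
table = pd.concat([data, total], ignore_index=True)
print(table.to_string(index=False))
```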

Types of statistical tables

Among these methods, the most common is the tabular method of displaying the investigated digital data. Both the initial data for analysis and the various calculations, as well as the results of the study, are drawn up in the form of analytical tables. Tables are a very useful and visual form of displaying the numerical information used in analysis. In analytical tables, digital information about the studied economic phenomena is arranged in a certain order. Tabular material is much more informative and visual than a textual presentation of the same material. Tables make it possible to present analytical materials as a single, holistic system.

The type of a statistical table is determined by the nature of the development of the indicators of its subject.

There are three types of statistical tables:
  • simple
  • group
  • combinational

Simple tables contain a list of the individual units that make up the aggregate of the analyzed economic phenomenon. In group tables, digital information on the individual constituent parts of the studied data set is combined into certain groups according to some feature. Combination tables contain separate groups and the subgroups into which they are subdivided, characterizing the studied economic phenomenon; such subdivision is carried out not according to one but according to several criteria. Thus, group tables carry out a simple grouping of indicators, combination tables a combined grouping, and simple tables contain no grouping at all, only an ungrouped set of information about the analyzed economic phenomenon.

Simple tables

Simple tables contain in the subject a list of units of the population, time periods, or territories.

Group tables

Group tables are tables whose subject contains units of the aggregate grouped according to one attribute.

Combination tables

Combination tables are tables whose subject contains units of the aggregate grouped according to two or more attributes.

By the nature of the development of predicate indicators, they are distinguished:

  • tables with a simple development of predicate indicators, in which there is a parallel arrangement of predicate indicators.
  • tables with a complex development of predicate indicators, in which a combination of predicate indicators takes place: within groups formed according to one characteristic, subgroups are distinguished according to another characteristic.

A table with a simple development of predicate indicators

The predicate of this table gives data first on the distribution of students by gender and then by age, i.e. the characteristics by the two features are presented in isolation from each other.

Table with complex development of predicate indicators

(Table layout: the subject lists the departments, e.g. day, evening; the predicate gives the number of students (people), broken down by gender ("including"), and within each gender by age group ("of which aged ..., years", including "23 and over").)

The predicate of this table not only characterizes the distribution of students according to each of the two selected features, but also makes it possible to study the composition of each group formed by one criterion (gender) according to the other criterion (the student's age), i.e. the two features are combined.

Consequently, tables with complex development of predicate indicators provide more ample opportunities to analyze the studied indicators and the relationships between them. A simple and complex development of predicate indicators can have a table of any type: simple, group, combination.

Depending on the stage of statistical research, the tables are divided into:
  • development (auxiliary), the purpose of which is to summarize information for individual units of the population to obtain totals.
  • summary, whose task is to show the results for groups and for the entire population as a whole.
  • analytical tables, the task of which is the calculation of generalizing characteristics and the preparation of an information base for the analysis of both the structure and structural shifts, the dynamics of the studied phenomena and the relationship between the indicators.

So, we examined the tabular method of displaying the studied digital data, which is widely used in the analysis of economic phenomena, statistical data and the economic activities of organizations.

Statistical methodology - a system of techniques and methods aimed at studying quantitative patterns that are manifested in the structure, dynamics and relationships of socio-economic phenomena.

Statistical research consists of three stages:

1. Statistical observation;

2. Primary processing, summary and grouping of observation results;

3. Analysis of the received consolidated materials.

Each stage of the research involves the use of special methods determined by the content of the work performed.

1) Statistical observation - scientifically organized collection of information about the studied socio-economic processes or phenomena. The data obtained are the starting material for the subsequent stages of statistical research. This data must be processed in a certain way. This processing is the next stage of statistical research.

2) A summary of the initial data to obtain generalized characteristics of the investigated process or phenomenon. The results of the statistical summary and grouping are presented in the form of statistical tables.

3) Statistical analysis - the final stage of statistical research. In its process, the structure, dynamics and interconnections of social phenomena and processes are investigated. The following main stages of analysis are distinguished:

· Statement of facts and their assessment;

· Establishing the characteristic features and causes of the phenomenon;

· Comparison of the phenomenon with other phenomena;

· Formulation of hypotheses, conclusions and assumptions;

· Statistical testing of the hypotheses put forward using special statistical indicators.

General theory of statistics - the science of the most general principles, rules and laws of digital coverage of socio-economic phenomena. It is the methodological basis for all branches of statistics.

Statistical data - a set of quantitative characteristics of socio-economic phenomena and processes obtained as a result of statistical observation, their processing or corresponding calculations.

Statistical observation is a mass, planned, scientifically organized observation of the phenomena of social and economic life, which consists in registering selected features for each unit of the population. The statistical observation process includes the following steps:

  1. Observation preparation. At this stage, scientific and methodological (determination of the purpose and object of observation, the composition of signs to be registered; development of documents for collecting data; selection of the reporting unit and the unit for which the observation will be carried out, as well as methods, means and time for obtaining data, etc.) and organizational issues (determination of the composition of the bodies conducting the observation; selection and training of personnel for the observation; drawing up a timetable for the preparation, conduct and processing of observation materials; replication of documents for data collection, etc.).
  2. Conducting mass data collection.
  3. Development of proposals for improving statistical observation.

Program-methodological and organizational issues of statistical observation

Program-methodological issues determine the goals and objects of observation and the features to be registered; documents for collecting the data are developed, and the methods and means of obtaining the data are determined, among other things.

Organizational issues cover the following types of work: the selection and training of personnel, drawing up a work schedule for the preparation and conduct of statistical observation, and preparing the materials to be used in the statistical observation.

The objective of observation is to obtain reliable information in order to identify the dependences governing the development of phenomena and processes.

The object of observation is the statistical aggregate in which the investigated socio-economic phenomena and processes occur.

To define the object, it is necessary to determine the boundaries of the studied population, for which the most important features distinguishing it from other similar populations should be indicated. Each object consists of individual elements, i.e. observation units, which are the carriers of the characteristics to be registered.

The reporting unit is the entity from which the data about the observation unit is received.

The observation program is a list of features (questions) to be registered during the observation process.

A statistical form is a document of a standard format containing the program and the results of observation. Examples are a questionnaire, a census form, a survey form, etc. Two systems of statistical forms are distinguished:

1) The individual (card) form, which records the answers to the questions for only one observation unit.

2) The list form, which records the answers to the questions for several observation units.

The timing of the observation is based on two questions:

- establishment of a critical moment (date) or time interval.

- determination of the period or period of observation.

The critical moment (date) is a specific day of the year and hour of the day as of which the features of each unit of the studied population are registered.

The term (period) of observation is the time during which the statistical forms are filled in, i.e. the time required to carry out the mass data collection.

Forms, types and methods of statistical observation.

1) Reporting is the main form of statistical observation, with the help of which the statistical bodies receive the necessary data from enterprises and institutions in the form of established reporting documents at a certain time.

As a rule, reporting is based on primary accounting and is its generalization.

Primary accounting is the registration of various facts and events as they occur.

Registration is performed in a special primary document; current statistical reporting may be typical (standard) or specialized.

Typical reporting is the same for all enterprises, while in specialized reporting the composition of indicators varies depending on the characteristics of individual industries.

Reporting may be daily, weekly, bi-weekly, monthly, quarterly or annual. All of these, except annual reporting, constitute current reporting.

2) Specially organized statistical observation.

A striking example is the census: a specially organized observation that is repeated at regular intervals in order to obtain data on the number, composition and condition of an object according to a number of features.

Features of the census:

Simultaneity of its holding throughout the country

Unity of a statistical observation program

Registration of observation units as of the same critical moment.

This form includes budget surveys that characterize the structure of consumer spending and family income.

3) A register is a system that constantly monitors the state of an observation unit and assesses the strength of the impact of various factors on the studied indicators.

The population register is a named and regularly updated list of the country's inhabitants. Its observation program is limited to general characteristics (gender, date and place of birth, date of marriage).

Marital status is an example of a variable feature.

The register of enterprises covers all types of economic activity and contains the values of the main characteristics of each observation unit for a certain period or point in time: the date of establishment or registration of the enterprise, its name, address, telephone number, organizational and legal form, type of economic activity, number of employees, and other information about the company.

The object of research in applied statistics is statistical data obtained as a result of observations or experiments. Statistical data are a collection of objects (observations, cases) and of the features (variables) that characterize them. For example, the objects of research may be the countries of the world, and the features the geographical and economic indicators characterizing them: continent; elevation above sea level; average annual temperature; the country's place in the quality-of-life ranking; GDP per capita; public spending on health care, education and the army; average life expectancy; unemployment rate; illiteracy rate; quality-of-life index, etc.
Variables are quantities that, as a result of measurement, can take on different values.
Independent variables are variables whose values can be changed during the experiment, while dependent variables are variables whose values can only be measured.
Variables can be measured on various scales. The differences between the scales are determined by their information content. The following types of scales are considered, in ascending order of information content: nominal, ordinal, interval, ratio, absolute. These scales also differ in the number of permissible mathematical operations. The "poorest" scale is the nominal one, since not a single arithmetic operation is defined on it; the "richest" is the absolute scale.
Measurement in the nominal (classification) scale means determining the belonging of an object (observation) to a particular class. For example: gender, branch of service, profession, continent, etc. In this scale, you can only count the number of objects in classes - frequency and relative frequency.
Measurement on an ordinal (rank) scale, in addition to determining the class of membership, allows observations to be ordered by comparing them with each other in some respect. However, this scale does not determine the distance between classes, only which of two observations is preferable. Therefore, ordinal experimental data, even if they are represented by numbers, cannot be treated as numbers, and arithmetic operations cannot be performed on them. On this scale, in addition to calculating the frequency of an object, one can calculate its rank. Examples of variables measured on an ordinal scale: student scores, prize places in competitions, military ranks, a country's place in the quality-of-life ranking, etc. Nominal and ordinal variables are sometimes called categorical, or grouping, variables, since they allow the objects of study to be divided into subgroups.
When measurements are made on an interval scale, the ordering of the observations can be done so precisely that the distances between any two of them are known. The interval scale is unique up to linear transformations (y = ax + b). This means that the scale has an arbitrary reference point, a conditional zero. Examples of variables measured on an interval scale: temperature, time, terrain altitude. Variables on this scale can be used to determine the distance between observations. Distances are full-fledged numbers, and any arithmetic operations can be performed on them.
The ratio scale is similar to the interval scale, but it is unique up to a transformation of the form y = ax. This means that the scale has a fixed reference point, an absolute zero, but an arbitrary unit of measurement. Examples of variables measured on a ratio scale: length, weight, current, amount of money, public spending on health, education and the military, life expectancy, etc. Measurements on this scale are full-fledged numbers, and any arithmetic operations can be performed on them.
An absolute scale has both absolute zero and an absolute unit of measure (scale). An example of an absolute scale is a number line. This scale is dimensionless, so measurements on it can be used as an exponent or base of a logarithm. Examples of measurements on an absolute scale: unemployment rate; the proportion of illiterates, the quality of life index, etc.
Most statistical methods are parametric, i.e. based on the assumption that the random vector of variables follows some multivariate distribution, as a rule a normal one or one that can be transformed to normal. If this assumption is not confirmed, nonparametric methods of mathematical statistics should be used.

Correlation analysis. There can be a functional relationship between variables (random variables), which manifests itself in the fact that one of them is defined as a function of the other. But there can also be a connection of another kind between the variables, manifested in the fact that one of them reacts to a change in the other by changing its distribution law. Such a relationship is called stochastic. It appears when there are common random factors affecting both variables. The correlation coefficient (r), which varies from -1 to +1, is used as a measure of the dependence between the variables. If the correlation coefficient is negative, then as the values of one variable increase, the values of the other decrease. If the variables are independent, the correlation coefficient is 0 (the converse is true only for variables with a normal distribution), and such variables are called uncorrelated. If the correlation coefficient is not equal to 0, there is a dependence between the variables; the closer the value of r is to 1, the stronger the dependence. The correlation coefficient reaches its limiting values of +1 or -1 if and only if the relationship between the variables is linear. Correlation analysis makes it possible to establish the strength and direction of the stochastic relationship between variables (random variables). If the variables are measured at least on an interval scale and have a normal distribution, correlation analysis is carried out by calculating the Pearson correlation coefficient; otherwise the Spearman, Kendall tau or gamma correlations are used.
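
A minimal sketch of such a calculation, assuming Python with SciPy; the two series of confectionery prices and sales volumes are hypothetical:

```python
# Pearson correlation for normally distributed interval data,
# Spearman rank correlation as the nonparametric alternative.
import numpy as np
from scipy import stats

price = np.array([2.1, 2.3, 2.8, 3.0, 3.4, 3.9])   # hypothetical prices
volume = np.array([95, 90, 84, 80, 71, 65])         # hypothetical sales volumes

r_pearson, p_pearson = stats.pearsonr(price, volume)
r_spearman, p_spearman = stats.spearmanr(price, volume)

print(f"Pearson r = {r_pearson:.3f} (p = {p_pearson:.4f})")
print(f"Spearman rho = {r_spearman:.3f} (p = {p_spearman:.4f})")
```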

Regression analysis. Regression analysis models the relationship of one random variable to one or more other random variables. In this case, the first variable is called dependent, and the rest are called independent. The choice or appointment of the dependent and independent variables is arbitrary (conditional) and is carried out by the researcher depending on the problem he is solving. The explanatory variables are called factors, regressors, or predictors, and the dependent variable is called the outcome characteristic, or response.
If the number of predictors is 1, the regression is called simple (univariate); if the number of predictors is greater than 1, it is called multiple (multivariate). In general, the regression model can be written as follows:

Y = f(x1, x2, ..., xn),

where Y is the dependent variable (response), xi (i = 1, ..., n) are the predictors (factors), and n is the number of predictors.
Regression analysis can be used to solve a number of problems that are important for the study at hand:
1). Reducing the dimension of the space of the variables being analyzed (the factor space) by replacing some of the factors with one variable, the response. This problem is solved more fully by factor analysis.
2). Quantifying the effect of each factor: multiple regression allows the researcher to ask (and probably answer) the question of what the best predictor is for a given response. At the same time, the influence of individual factors on the response becomes clearer, and the researcher gains a better understanding of the nature of the phenomenon under study.
3). Calculation of predicted response values for certain values of the factors, i.e. regression analysis creates the basis for a computational experiment aimed at answering questions like "What will happen if ...?".
4). In regression analysis, the causal mechanism appears in a more explicit form. In this case, the forecast lends itself better to meaningful interpretation.
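
A minimal sketch of a multiple regression fit, assuming Python with scikit-learn; the factor and response values are hypothetical:

```python
# Fit Y = f(x1, x2) with a linear model and predict the response
# for new factor values ("what happens if ...").
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 20], [2.0, 18], [3.0, 15], [4.0, 14], [5.0, 10]])  # predictors x1, x2
y = np.array([10.2, 12.1, 14.3, 15.8, 18.4])                            # response

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2:", model.score(X, y))

# Computational experiment: predicted response for new factor values.
print("prediction:", model.predict(np.array([[6.0, 9]])))
```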

Canonical analysis. Canonical analysis is intended for analyzing dependences between two lists of features (independent variables) that characterize objects. For example, one can study the relationship between various unfavorable factors and the appearance of a certain group of disease symptoms, or the relationship between two groups of clinical and laboratory parameters (syndromes) of a patient. Canonical analysis is a generalization of multiple correlation as a measure of the relationship between one variable and many other variables. As is well known, multiple correlation is the maximum correlation between one variable and a linear function of other variables. This concept has been generalized to the case of relationships between sets of variables, i.e. the features that characterize objects. In this case, it is sufficient to restrict consideration to a small number of the most highly correlated linear combinations from each set. Suppose, for example, that the first set of variables consists of the features y1, ..., yp and the second set of x1, ..., xq; then the relationship between these sets can be estimated as the correlation between the linear combinations a1y1 + a2y2 + ... + apyp and b1x1 + b2x2 + ... + bqxq, which is called the canonical correlation. The problem of canonical analysis is to find the weight coefficients such that the canonical correlation is maximal.
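
A minimal sketch of estimating such canonical weights, assuming Python with scikit-learn; the two blocks of variables are hypothetical random data:

```python
# Canonical correlation between two sets of features X (q = 3) and Y (p = 2).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                        # first block of features
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=100),
                     X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=100)])  # related second block

cca = CCA(n_components=1)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# The canonical correlation is the correlation of the first pair of canonical variables.
r = np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]
print("first canonical correlation:", round(r, 3))
```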

Methods of comparing means. In applied research, there are often cases when the average result for some feature in one series of experiments differs from the average result in another series. Since the averages are the results of measurements, they will, as a rule, always differ; the question is whether the observed discrepancy between the means can be explained by the inevitable random errors of the experiment, or whether it is caused by definite reasons. If two means are being compared, Student's test (t-test) can be applied. This is a parametric criterion, since it assumes that the feature has a normal distribution in each series of experiments. At present, nonparametric criteria for comparing means have also come into wide use.
Comparison of mean results is one way of identifying dependences between the variable features that characterize the studied set of objects (observations). If, when the objects of study are divided into subgroups by a categorical independent variable (predictor), the hypothesis of inequality of the means of some dependent variable across the subgroups holds, then there is a stochastic relationship between this dependent variable and the categorical predictor. For example, if the hypothesis that the average indicators of the physical and intellectual development of children are equal in the groups of mothers who smoked and who did not smoke during pregnancy is found to be incorrect, then there is a relationship between the mother's smoking during pregnancy and the child's intellectual and physical development.
The most common method for comparing means is analysis of variance. In ANOVA terminology, a categorical predictor is called a factor.
ANOVA can be defined as a parametric statistical method designed to assess the influence of various factors on the result of an experiment, as well as for the subsequent planning of experiments. Analysis of variance thus makes it possible to study the dependence of a quantitative feature on one or several qualitative factor features. If one factor is considered, one-way analysis of variance is used; otherwise, multivariate analysis of variance is used.
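
A minimal sketch of both comparisons, assuming Python with SciPy; the measurement series are hypothetical:

```python
# Student's t-test for two series and one-way ANOVA for three groups
# formed by a categorical factor.
import numpy as np
from scipy import stats

series_a = np.array([5.1, 5.4, 4.9, 5.6, 5.2])
series_b = np.array([5.9, 6.1, 5.7, 6.3, 6.0])
t_stat, p_two = stats.ttest_ind(series_a, series_b)
print(f"t = {t_stat:.2f}, p = {p_two:.4f}")

group1 = np.array([5.1, 5.4, 4.9])
group2 = np.array([5.9, 6.1, 5.7])
group3 = np.array([6.8, 7.0, 6.6])
f_stat, p_anova = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```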

Frequency analysis. Frequency tables, or one-way tables as they are also called, are the simplest method for analyzing categorical variables. Frequency tables can also be used successfully to investigate quantitative variables, although interpretation of the results may be difficult. This type of statistical study is often used as one of the exploratory-analysis procedures to see how different groups of observations are distributed in the sample, or how the values of a feature are distributed over the interval from the minimum to the maximum value. Frequency tables are typically illustrated graphically with histograms.
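
A minimal sketch of a one-way frequency table, assuming Python with pandas; the categorical data are hypothetical:

```python
# Absolute and relative frequencies of a categorical variable.
import pandas as pd

products = pd.Series(
    ["chocolate", "caramel", "chocolate", "marmalade", "chocolate", "caramel"]
)

freq = products.value_counts()                    # absolute frequencies
rel_freq = products.value_counts(normalize=True)  # relative frequencies (shares)

print(pd.DataFrame({"frequency": freq, "share": rel_freq.round(2)}))
```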

Cross-tabulation (pairing) is the process of combining two (or more) frequency tables so that each cell in the constructed table is represented by a single combination of values or levels of the tabulated variables. Cross-tabulation makes it possible to combine the frequencies of occurrence of observations at different levels of the factors under consideration. By examining these frequencies, you can identify relationships between the tabulated variables and explore the structure of those relationships. Usually categorical variables, or quantitative variables with relatively few values, are tabulated. If a continuous variable (say, blood sugar level) needs to be tabulated, it must first be recoded by dividing its range of variation into a small number of intervals (for example: low, medium, high).
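
A minimal sketch of a cross-tabulation, assuming Python with pandas; the two categorical variables are hypothetical:

```python
# Two-way frequency table of two categorical variables.
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south", "north"],
        "product": ["chocolate", "caramel", "chocolate", "chocolate", "caramel", "caramel"],
    }
)

table = pd.crosstab(df["region"], df["product"], margins=True)  # with row/column totals
print(table)
```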

Correspondence analysis. Correspondence analysis provides more powerful descriptive and exploratory methods for analyzing two-way and multi-way tables than frequency analysis. Like contingency tables, the method makes it possible to explore the structure and relationships of the grouping variables included in the table. In classical correspondence analysis, the frequencies in the contingency table are standardized (normalized) so that the sum of the elements over all cells equals 1.
One of the goals of correspondence analysis is to represent the contents of the table of relative frequencies as distances between individual rows and/or columns of the table in a space of lower dimension.
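
A minimal sketch of the classical computation, assuming Python with NumPy; the contingency table is hypothetical, and the row and column coordinates are obtained from the singular value decomposition of the standardized residuals:

```python
import numpy as np

# Hypothetical 3x3 contingency table (rows: product groups, columns: regions).
N = np.array([[20, 35, 45],
              [30, 25, 15],
              [50, 40, 40]], dtype=float)

P = N / N.sum()                       # correspondence matrix (elements sum to 1)
r = P.sum(axis=1, keepdims=True)      # row masses
c = P.sum(axis=0, keepdims=True)      # column masses

# Standardized residuals; their SVD yields the principal coordinates.
S = (P - r @ c) / np.sqrt(r @ c)
U, s, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * s) / np.sqrt(r)        # row principal coordinates
col_coords = (Vt.T * s) / np.sqrt(c.T)   # column principal coordinates
print(row_coords[:, :2].round(3))
print(col_coords[:, :2].round(3))
```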

Cluster analysis. Cluster analysis is a method of classification analysis; its main purpose is to divide the set of objects and features under study into groups, or clusters, that are homogeneous in a certain sense. It is a multivariate statistical method, so it is assumed that the initial data may be large in volume, i.e. both the number of objects of study (observations) and the number of features characterizing them may be large. The great advantage of cluster analysis is that it makes it possible to partition objects not by one feature but by a whole set of features. In addition, cluster analysis, unlike most mathematical and statistical methods, imposes no restrictions on the type of objects under consideration and allows a variety of initial data of almost arbitrary nature to be studied. Since clusters are groups of homogeneity, the task of cluster analysis is to partition the set of objects into m (m an integer) clusters on the basis of the objects' attributes so that each object belongs to only one partition group. Objects belonging to one cluster must be homogeneous (similar), and objects belonging to different clusters must be heterogeneous. If the clustering objects are represented as points in an n-dimensional feature space (n being the number of features characterizing the objects), then the similarity between objects is defined through the concept of distance between points, since it is intuitively clear that the smaller the distance between objects, the more similar they are.
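
A minimal sketch of partitioning objects into clusters by several features, assuming Python with scikit-learn; the objects and the choice of m = 2 clusters are hypothetical:

```python
# k-means clustering of objects described by two features.
import numpy as np
from sklearn.cluster import KMeans

# Each row is an object (observation), each column a feature.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [8.0, 9.0], [8.3, 8.7], [7.9, 9.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # cluster membership of each object
print("labels:", labels)
print("cluster centers:\n", kmeans.cluster_centers_)
```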

Discriminant analysis. Discriminant analysis comprises statistical methods for classifying multivariate observations in situations where the researcher has so-called training samples. This type of analysis is multivariate, since it uses several features of an object, and their number can be arbitrarily large. The purpose of discriminant analysis is to classify an object on the basis of measurements of its various characteristics (features), that is, to assign it to one of several specified groups (classes) in some optimal way. It is assumed that the initial data, along with the attributes of the objects, contain a categorical (grouping) variable that determines which group each object belongs to. Discriminant analysis therefore provides a check of how consistent the classification produced by the method is with the original empirical classification. The optimal method is understood as either the minimum mathematical expectation of losses or the minimum probability of misclassification. In the general case, the problem of discrimination is formulated as follows. Let the result of an observation of an object be a k-dimensional random vector X = (X1, X2, ..., Xk), where X1, X2, ..., Xk are the features of the object. It is required to establish a rule by which, given the values of the coordinates of the vector X, the object is assigned to one of the possible populations i, i = 1, 2, ..., n. Discrimination methods can be roughly divided into parametric and nonparametric. In parametric methods it is assumed that the distribution of the feature vectors in each population is normal, but there is no information about the parameters of these distributions. Nonparametric discrimination methods do not require knowledge of the exact functional form of the distributions and allow discrimination problems to be solved on the basis of scant a priori information about the populations, which is especially valuable for practical applications. If the conditions for the applicability of discriminant analysis are met (the independent variables, i.e. the features, also called predictors, are measured at least on an interval scale and their distribution corresponds to the normal law), classical discriminant analysis should be used; otherwise, the method of general discriminant analysis models should be used.
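
A minimal sketch of classification with a training sample, assuming Python with scikit-learn; the Iris data set stands in for the grouped observations:

```python
# Linear discriminant analysis: learn class boundaries from a training
# sample and check agreement with the known empirical classification.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Share of correctly classified objects in the held-out part of the sample.
print("accuracy:", round(lda.score(X_test, y_test), 3))
```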

Factor analysis. Factor analysis is one of the most popular multivariate statistical methods. Whereas cluster and discriminant methods classify observations by dividing them into homogeneous groups, factor analysis classifies the features (variables) that describe the observations. The main goal of factor analysis is therefore to reduce the number of variables on the basis of a classification of the variables and a determination of the structure of the relationships between them. The reduction is achieved by identifying hidden (latent) common factors that explain the relationships between the observed features of the object; instead of the initial set of variables, it then becomes possible to analyze the data in terms of the selected factors, whose number is considerably smaller than the initial number of interrelated variables.
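
A minimal sketch of extracting latent factors, assuming Python with scikit-learn; the six observed variables driven by two hidden factors are simulated:

```python
# Reduce six correlated observed variables to two latent factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                        # two hidden factors
loadings = rng.normal(size=(2, 6))                        # how factors drive observed variables
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))   # observed data

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)                              # factor scores for each observation

print("estimated loadings:\n", fa.components_.round(2))
print("scores shape:", scores.shape)
```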

Classification trees. Classification trees are a method of classification analysis that makes it possible to predict the class an object belongs to from the values of the features that characterize it. The features are called independent variables, and the variable indicating which class the objects belong to is called dependent. Unlike classical discriminant analysis, classification trees can perform univariate branching on variables of different types: categorical, ordinal, interval. No restrictions are imposed on the distribution of quantitative variables. By analogy with discriminant analysis, the method makes it possible to analyze the contributions of individual variables to the classification procedure. Classification trees can be, and sometimes are, very complex. However, the use of special graphical procedures makes it possible to simplify the interpretation of the results even for very complex trees. The ability to represent the results graphically and the ease of interpretation largely explain the great popularity of classification trees in applied fields; however, their most important distinguishing properties are their hierarchy and wide applicability. The structure of the method is such that the user can build trees of arbitrary complexity using controlled parameters, achieving minimal classification errors. But classifying a new object on the basis of a complex tree is difficult because of the large set of decision rules. Therefore, when constructing a classification tree, the user must find a reasonable compromise between the complexity of the tree and the complexity of the classification procedure. The wide applicability of classification trees makes them a very attractive data-analysis tool, but it should not be assumed that they are recommended instead of traditional methods of classification analysis. On the contrary, if the more rigorous theoretical assumptions imposed by the traditional methods are fulfilled and the sample distribution has certain special properties (for example, the distribution of the variables corresponds to the normal law), the use of traditional methods will be more effective. However, as a method of exploratory analysis, or as a last resort when all traditional methods fail, classification trees, in the opinion of many researchers, are unmatched.
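
A minimal sketch of building such a tree with a bound on its complexity, assuming Python with scikit-learn; the Iris data set is used again as a stand-in:

```python
# A shallow classification tree: max_depth limits the tree's complexity,
# trading some accuracy for simpler decision rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print("training accuracy:", round(tree.score(X, y), 3))
print(export_text(tree))              # the decision rules in readable form
print(tree.feature_importances_)      # contribution of each variable to the classification
```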

Principal component analysis and classification. In practice, the task of analyzing data of large dimensions often arises. Principal component analysis and classification can solve this problem and serve two purposes:
- reducing the total number of variables (data reduction) in order to obtain "principal" and uncorrelated variables;
- classification of variables and observations, using the constructed factor space.
The method is similar to factor analysis in the formulation of the problems being solved, but it has a number of significant differences:
- principal component analysis does not use iterative methods to extract factors;
- along with active variables and observations used to extract principal components, auxiliary variables and / or observations can be specified; the auxiliary variables and observations are then projected onto the factor space calculated from the active variables and observations;
- the listed possibilities allow using the method as a powerful tool for classifying both variables and observations.
The solution to the main task of the method is achieved by constructing a vector space of latent (hidden) variables (factors) with a dimension smaller than the original one. The original dimension is determined by the number of variables being analyzed in the original data.
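
A minimal sketch of this reduction, assuming Python with scikit-learn; five hypothetical correlated variables are compressed into two principal components:

```python
# Reduce a five-variable data set to two uncorrelated principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
                     base[:, 1], base[:, 1] - base[:, 0],
                     rng.normal(size=100)])        # five correlated variables

X_std = StandardScaler().fit_transform(X)          # standardize before PCA
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)              # observations in the factor space

print("explained variance shares:", pca.explained_variance_ratio_.round(2))
print("new data shape:", components.shape)
```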

Multidimensional scaling. The method can be considered as an alternative to factor analysis, in which a reduction in the number of variables is achieved by highlighting latent (not directly observable) factors that explain the relationship between the observed variables. The purpose of multidimensional scaling is to find and interpret latent variables that enable the user to explain the similarities between objects specified by points in the original feature space. The indicators of the similarity of objects in practice can be the distance or the degree of connection between them. In factor analysis, similarities between variables are expressed using a matrix of correlation coefficients. In multidimensional scaling, an arbitrary type of object similarity matrix can be used as input data: distances, correlations, etc. Despite the fact that there are many similarities in the nature of the issues studied, the methods of multivariate scaling and factor analysis have a number of significant differences. For example, factor analysis requires that the studied data obey a multivariate normal distribution, and the dependences are linear. Multidimensional scaling does not impose such restrictions; it can be applied if a matrix of pairwise similarities of objects is specified. In terms of the differences in the results obtained, factor analysis tends to extract more factors - latent variables compared with multivariate scaling. Therefore, multidimensional scaling often leads to easier-to-interpret solutions. However, more importantly, the multidimensional scaling method can be applied to any type of distance or similarity, while factor analysis requires that the correlation matrix of the variables be used as the input data, or the correlation matrix is \u200b\u200bfirst calculated from the source data file. The main assumption of multidimensional scaling is that there is a certain metric space of essential basic characteristics, which implicitly served as the basis for the obtained empirical data on the proximity between pairs of objects. Therefore, objects can be thought of as points in this space. It is also assumed that closer (according to the initial matrix) objects correspond to smaller distances in the space of basic characteristics. Therefore, multidimensional scaling is a set of methods for analyzing empirical data on the proximity of objects, with the help of which the dimension of the space of the characteristics of the measured objects that are essential for a given meaningful problem is determined and the configuration of points (objects) in this space is constructed. This space ("multidimensional scale") is similar to commonly used scales in the sense that the values \u200b\u200bof the essential characteristics of the measured objects correspond to certain positions on the axes of space. The logic of multidimensional scaling can be illustrated by the following simple example... Suppose there is a matrix of pairwise distances (ie, the similarity of some features) between some cities. Analyzing the matrix, it is necessary to position the points with the coordinates of the cities in two-dimensional space (on a plane), keeping the actual distances between them as much as possible. The resulting placement of points on the plane can subsequently be used as an approximate geographic map. 
In the general case, multidimensional scaling thus allows placing objects (cities in our example) in a space of some small dimension (here it is equal to two) so as to adequately reproduce the observed distances between them. As a result, these distances can be expressed in terms of the found latent variables. So, in our example, we can explain the distances in terms of a pair of geographic coordinates: North/South and East/West.
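A minimal sketch of this city-distance example, assuming scikit-learn and a small hypothetical distance matrix (the distances are invented for illustration, not taken from the text):

    import numpy as np
    from sklearn.manifold import MDS

    # hypothetical symmetric matrix of pairwise distances between four cities
    D = np.array([
        [0.0, 450.0, 630.0, 900.0],
        [450.0, 0.0, 210.0, 520.0],
        [630.0, 210.0, 0.0, 310.0],
        [900.0, 520.0, 310.0, 0.0],
    ])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)  # 2-D coordinates that approximately reproduce the given distances
    print(coords)

The two recovered axes play the role of the latent variables, roughly corresponding to North/South and East/West on an approximate map.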

Structural equation modeling (causal modeling). Recent advances in multivariate statistical analysis and the analysis of correlation structures, combined with the latest computational algorithms, served as a starting point for the creation of a new but already recognized technique of structural equation modeling (SEPATH). This extremely powerful technique of multivariate analysis combines methods from various fields of statistics; multiple regression and factor analysis are naturally developed and extended within it.
The objects of modeling by structural equations are complex systems whose internal structure is not known ("black box"). By observing the parameters of such a system with SEPATH, you can investigate its structure and establish cause-and-effect relationships between the elements of the system.
The statement of the structural modeling problem is as follows. Let there be variables for which statistical moments are known, for example, a matrix of sample correlation coefficients or covariances. Such variables are called explicit. They can be characteristics of a complex system. The real relationships between the observed explicit variables can be quite complex, but we assume that there is a number of latent variables that explain the structure of these relationships with a certain degree of accuracy. Thus, with the help of latent variables, a model of relationships between explicit and implicit variables is built. In some problems, latent variables can be considered as causes and explicit ones as consequences; therefore, such models are called causal. It is assumed that hidden variables, in turn, can be related to each other. The structure of links is allowed to be quite complex, but its type is postulated: these are links described by linear equations. Some parameters of the linear models are known, some are not and are free parameters.
The basic idea behind structural equation modeling is that you can check whether the variables Y and X are related by the linear relationship Y = aX by analyzing their variances and covariances. This idea is based on a simple property of the mean and variance: if you multiply each number by some constant k, the mean is also multiplied by k, and the standard deviation is multiplied by the modulus of k. For example, consider a set of three numbers 1, 2, 3. These numbers have a mean of 2 and a standard deviation of 1. If you multiply all three numbers by 4, it is easy to calculate that the mean becomes 8, the standard deviation 4, and the variance 16. Thus, if there are sets of numbers X and Y related by the dependence Y = 4X, then the variance of Y must be 16 times greater than the variance of X. Therefore, you can test the hypothesis that Y and X are related by the equation Y = 4X by comparing the variances of the variables Y and X. This idea can be generalized in different ways to several variables related by a system of linear equations. In this case, the transformation rules become more cumbersome and the calculations more complicated, but the main meaning remains the same: you can check whether variables are related by a linear relationship by studying their variances and covariances.
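The numerical illustration above can be checked directly; a minimal sketch using only numpy:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = 4 * x                              # postulated linear relation Y = 4X

    print(x.mean(), x.std(ddof=1))         # mean 2.0, standard deviation 1.0
    print(y.mean(), y.std(ddof=1))         # mean 8.0, standard deviation 4.0
    print(y.var(ddof=1) / x.var(ddof=1))   # ratio of variances: 16.0 = 4**2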

Survival analysis methods. Survival analysis methods were originally developed in medical and biological research and in insurance, but then became widely used in the social and economic sciences, as well as in industry for engineering problems (reliability and failure time analysis). Imagine that you are studying the effectiveness of a new treatment or drug. Obviously, the most important and objective characteristic is the average life expectancy of patients from the moment of admission to the clinic or the average duration of remission of the disease. Standard parametric and nonparametric methods could be used to describe mean lifetimes or remissions. However, the analyzed data have a significant feature: there may be patients who survived during the entire observation period, and in some of them the disease is still in remission. A group of patients may also form with whom contact was lost before the end of the experiment (for example, they were transferred to other clinics). Using standard methods for estimating the mean, this group of patients would have to be excluded, thereby losing important information that was hard to collect. In addition, most of these patients survived (recovered) during the time they were observed, which argues in favor of the new treatment (drug).
This kind of information, when there are no data on the occurrence of the event of interest, is called incomplete. If there are data on the occurrence of the event of interest, then the information is called complete. Observations that contain incomplete information are called censored observations. Censored observations are typical when the observed quantity represents the time until some critical event occurs and the duration of observation is limited in time. The use of censored observations is specific to the method under consideration, survival analysis.
This method investigates the probabilistic characteristics of the time intervals between successive occurrences of critical events. Such research is called analysis of durations until the moment of termination, which can be defined as the time intervals between the beginning of observation of an object and the moment of termination, at which the object stops responding to the properties specified for observation. The purpose of the research is to determine the conditional probabilities associated with the durations until the moment of termination. The construction of life tables, fitting the survival distribution, and estimation of the survival function using the Kaplan-Meier procedure are descriptive methods for examining censored data. Some of the proposed methods allow one to compare survival in two or more groups. Finally, survival analysis contains regression models for estimating relationships between multidimensional continuous variables with values similar to lifetimes.
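A minimal sketch of the Kaplan-Meier estimate on hypothetical follow-up data, assuming the lifelines package; the times and censoring flags are invented for illustration:

    import pandas as pd
    from lifelines import KaplanMeierFitter

    # hypothetical follow-up data: time in months and whether the event (e.g. relapse) was observed;
    # observed == 0 marks censored patients (still in remission or lost to follow-up)
    data = pd.DataFrame({
        "time":     [5, 8, 12, 12, 15, 20, 24, 24],
        "observed": [1, 1,  0,  1,  0,  1,  0,  0],
    })

    kmf = KaplanMeierFitter()
    kmf.fit(data["time"], event_observed=data["observed"])
    print(kmf.survival_function_)       # Kaplan-Meier estimate of the survival function
    print(kmf.median_survival_time_)    # median survival, taking censored observations into account

Censored patients are not discarded here; they contribute to the estimate for as long as they remained under observation.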
General models of discriminant analysis. If the conditions of applicability of discriminant analysis (DA) are not met (the independent variables, or predictors, must be measured at least on an interval scale, and their distribution must correspond to the normal law), it is necessary to use the method of general models of discriminant analysis (ODA). The method has this name because it uses the General Linear Model (GLM) to analyze the discriminant functions. In this module, discriminant function analysis is treated as a general multivariate linear model in which the categorical dependent variable (response) is represented by vectors with codes denoting the different groups for each observation. The ODA method has a number of significant advantages over classical discriminant analysis. For example, no restrictions are imposed on the type of predictor used (categorical or continuous) or on the type of model being defined; stepwise selection of predictors and the choice of the best subset of predictors are possible; if there is a cross-validation sample in the data file, the best subset of predictors can be selected based on the misclassification rate for the cross-validation sample, and so on.
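The GLM-based ODA module described above is specific software functionality; for comparison only, here is a minimal sketch of classical discriminant analysis with cross-validation, assuming scikit-learn and its built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)            # continuous predictors, categorical response (3 groups)
    lda = LinearDiscriminantAnalysis()
    accuracy = cross_val_score(lda, X, y, cv=5)  # cross-validated share of correct classifications
    print(1 - accuracy.mean())                   # corresponding misclassification rate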

Time series. Time series analysis is one of the most intensively developing and promising areas of mathematical statistics. A time (dynamic) series is a sequence of observations of some feature X (a random variable) at successive, equally spaced moments t. Individual observations are called the levels of the series and are denoted x_t, t = 1, ..., n. When studying a time series, several components are distinguished:
x_t = u_t + y_t + c_t + e_t,  t = 1, ..., n,
where u_t is the trend, a smoothly changing component describing the net influence of long-term factors (population decline, income decline, etc.); y_t is the seasonal component, reflecting the recurrence of processes over not very long periods (a day, a week, a month, etc.); c_t is the cyclical component, reflecting the recurrence of processes over long periods of time exceeding one year; e_t is the random component, reflecting the influence of random factors that cannot be taken into account and registered. The first three components are deterministic. The random component is formed as a result of the superposition of a large number of external factors, each of which individually has an insignificant effect on the change in the values of the feature X. Analysis and study of the time series make it possible to build models for predicting the values of the feature X in the future if the sequence of observations in the past is known.
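A minimal sketch of separating such components on a synthetic monthly series, assuming pandas and statsmodels; the trend and yearly seasonal pattern are generated artificially:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # synthetic monthly series: linear trend + yearly seasonality + random noise
    idx = pd.date_range("2015-01-01", periods=72, freq="MS")
    t = np.arange(72)
    x = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(0, 2, 72)
    series = pd.Series(x, index=idx)

    result = seasonal_decompose(series, model="additive", period=12)
    print(result.trend.dropna().head())   # estimate of the trend u_t
    print(result.seasonal.head(12))       # estimate of the seasonal component y_t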

Neural networks. Neural networks are computing systems whose architecture is analogous to the structure of nervous tissue built from neurons. The values of the input parameters, on the basis of which certain decisions must be made, are fed to the neurons of the lowest layer. For example, in accordance with the values of a patient's clinical and laboratory parameters, it is necessary to assign the patient to one or another group according to the severity of the disease. These values are perceived by the network as signals that are transmitted to the next layer, being weakened or amplified depending on the numerical values (weights) assigned to the interneuronal connections. As a result, a certain value is generated at the output of the neuron of the upper layer, which is considered as the response of the entire network to the input parameters. In order for the network to work, it must be "trained" on data for which the values of the input parameters and the correct responses to them are known. Training consists in selecting the weights of the interneuronal connections that make the network's responses as close as possible to the known correct answers. Neural networks can be used, in particular, to classify observations.
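A minimal sketch of training a small network to classify observations, assuming scikit-learn and using its built-in breast cancer data as a stand-in for clinical and laboratory parameters:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # one hidden layer of 10 neurons; "training" selects the weights of the interneuronal connections
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0))
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))   # share of correct responses on observations not used for training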

Experiment planning. The art of arranging observations in a certain order, or of conducting specially planned tests so as to make full use of the possibilities of these methods, is the content of the subject of "experiment planning". At present, experimental methods are widely used both in science and in various fields of practical activity. Usually, the main goal of a scientific study is to show the statistical significance of the effect of a given factor on the dependent variable of interest. As a rule, the main goal of planning experiments is to extract the maximum amount of objective information about the influence of the studied factors on the indicator (dependent variable) of interest to the researcher using the smallest number of expensive observations. Unfortunately, in practice, insufficient attention is paid to research planning in most cases: data are collected (as much as can be collected), and then statistical processing and analysis are carried out. But correctly conducted statistical analysis alone is not sufficient to achieve scientific reliability, since the quality of any information obtained from data analysis depends on the quality of the data itself. Therefore, the planning of experiments is increasingly used in applied research. The purpose of experiment planning methods is to study the influence of certain factors on the process under study and to find the optimal levels of the factors that determine the required course of this process.
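A minimal sketch of the simplest kind of plan, a full factorial design in which every combination of factor levels is tested; the factors and their coded levels are hypothetical:

    from itertools import product

    # hypothetical factors, each at two coded levels (-1 = low, +1 = high)
    factors = {"temperature": [-1, 1], "pressure": [-1, 1], "catalyst": [-1, 1]}

    plan = list(product(*factors.values()))   # 2**3 = 8 runs cover all combinations of levels
    for run, levels in enumerate(plan, start=1):
        print(run, dict(zip(factors, levels)))

With three factors at two levels each, eight runs are enough to estimate the effect of every factor on the dependent variable.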

Quality control charts. In the modern world, the problem of the quality not only of manufactured products but also of services provided to the population is extremely urgent. The well-being of any firm, organization or institution largely depends on the successful solution of this important problem. The quality of products and services is formed in the process of scientific research, design and technological development, and is ensured by good organization of production and services. But the manufacture of products and the provision of services, regardless of their type, are always associated with a certain variability in the conditions of production and provision. This leads to some variability in their quality traits. Therefore, the development of quality control methods that allow timely identification of signs of a violation of the technological process or of the provision of services is a relevant issue. At the same time, in order to achieve and maintain a high level of quality that satisfies the consumer, methods are needed that are aimed not at eliminating defects in finished products and inconsistencies in services, but at preventing and predicting the causes of their occurrence.
A control chart is a tool that allows you to track the progress of a process and influence it (using appropriate feedback), preventing its deviations from the requirements imposed on the process. The quality control chart toolkit makes extensive use of statistical methods based on probability theory and mathematical statistics. The use of statistical methods makes it possible, with a limited amount of analyzed products, to judge the state of product quality with a given degree of accuracy and reliability. This provides forecasting, optimal regulation of quality problems, and correct management decisions made not on the basis of intuition, but through scientific study and identification of patterns in the accumulated arrays of numerical information.
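A minimal sketch of computing 3-sigma control limits for an individuals chart on simulated measurements of a quality trait (numpy only; the data are artificial):

    import numpy as np

    # simulated measurements of a quality trait taken while the process is running
    rng = np.random.default_rng(0)
    x = rng.normal(loc=50.0, scale=2.0, size=100)

    center = x.mean()                 # center line (CL)
    sigma = x.std(ddof=1)
    ucl = center + 3 * sigma          # upper control limit
    lcl = center - 3 * sigma          # lower control limit

    signals = np.where((x > ucl) | (x < lcl))[0]
    print(f"CL={center:.2f}  UCL={ucl:.2f}  LCL={lcl:.2f}  points outside the limits: {signals}")

Points falling outside the limits signal a possible violation of the technological process and prompt intervention before defective output accumulates.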