What is the term used to describe the set of rules for how data items are identified, named, and described?

This article is licensed under CC0 for maximum reuse.

A data dictionary is critical to making your research more reproducible because it allows others to understand your data. The purpose of a data dictionary is to explain what all the variable names and values in your spreadsheet really mean.

The first column should contain your variable names exactly as they appear in your spreadsheet.

This column should contain short but human-readable variable names.

For instance, if ‘VAR1’ is a variable name referring to weight, then an appropriate readable variable name for VAR1 is ‘weight’.
You can use spaces, characters, and capital letters.
This is the name that you would use to label graphs and other figures.

This column should contain the measurement units for the variable.

For instance, if a column contains measurements of time, it should be clear whether they are measured in hours, minutes, or seconds.

A column should contain the range of values or accepted values for the variable.

This helps identify data entry errors.
Minimum and maximum values should be included.
Chosen values (e.g., “male”, “female”) should be included and detailed, if needed, in the description column (see below).
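
As a minimal sketch of how this column can help catch data entry errors, the accepted values can be checked against the data. The variable names, the limits, the kilogram unit, and the `find_invalid` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical accepted values, as they might be recorded in a data dictionary.
ALLOWED = {
    "weight": {"min": 0.0, "max": 300.0},    # numeric range (assumed unit: kg)
    "sex": {"values": {"male", "female"}},   # chosen values
}

def find_invalid(variable, observations):
    """Return the observations that fall outside the accepted values."""
    rule = ALLOWED[variable]
    if "values" in rule:
        # Chosen values: anything not in the listed set is flagged.
        return [x for x in observations if x not in rule["values"]]
    # Numeric range: anything below the minimum or above the maximum is flagged.
    return [x for x in observations if not rule["min"] <= x <= rule["max"]]

print(find_invalid("weight", [72.5, 810.0, 64.0]))  # [810.0]
print(find_invalid("sex", ["male", "femal"]))       # ['femal']
```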

This column should contain a definition of the variable.

The variable definition reflects the way you use the term and intend the term to be used by others who wish to understand your work.
While there are many kinds of definition, where possible, please provide a definition with the following genus-differentia form:

“A is a B that Cs.”

For instance, “An a) attitude is a b) disposition c) to think or feel that is about something or someone, typically one that is reflected in a person's behavior.”
Avoid circular definitions (e.g. “A baseball is a ball used in baseball.”)

This column should contain, if relevant, one or more words that could be substituted for the variable name.
These synonyms should reflect the meaning of the variable name as you use it, and not merely as the variable name might be used in a different context.
Again, the purpose is to convey the meaning of the variable term you use in your data.

The final column should contain, where needed, a longer explanation of the variable.

This is a human readable description with enough information for others to understand what the variable refers to.
It should also explain terms in the variable’s definition in more depth if needed. For instance, a description of the variable might clarify what is intended by ‘disposition’ in the above definition.
It could provide sources for definitions if those definitions are not the researcher’s own.
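
The columns described above could be sketched as a single data dictionary row, for instance as follows. The field names, the example definition, the unit, and the range are assumptions made for illustration; only the 'VAR1' and 'weight' names come from the example earlier in this article.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class DictionaryEntry:
    variable_name: str    # exactly as it appears in the spreadsheet
    readable_name: str    # short, human-readable name used to label figures
    units: str            # measurement units
    allowed_values: str   # range or accepted values
    definition: str       # ideally in the form "A is a B that Cs."
    synonyms: str         # words that could replace the variable name
    description: str      # longer explanation, where needed

# A hypothetical entry; every value except the two names is invented.
entry = DictionaryEntry(
    variable_name="VAR1",
    readable_name="weight",
    units="kg",
    allowed_values="0 to 300",
    definition="A weight is a measurement that records a participant's body mass.",
    synonyms="body mass",
    description="Measured once at enrolment.",
)

# Write the entry as one row of a CSV data dictionary (held in memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DictionaryEntry)])
writer.writeheader()
writer.writerow(asdict(entry))
print(buffer.getvalue())
```

Keeping the dictionary in a plain CSV file is only one option; the point is that each variable gets one row and each kind of information gets one column.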


Last updated on March 10, 2022


Absolute frequency

The absolute frequency describes the number of times a particular value for a variable (data item) has been observed to occur.

See: Describing Frequencies
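
As a small illustration, absolute frequencies can be tallied with Python's standard `collections.Counter`. The eye-colour observations below are invented for the example.

```python
from collections import Counter

# Invented observations of one categorical variable.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]

# Counter maps each observed value to the number of times it occurs.
absolute_frequency = Counter(eye_colours)

print(absolute_frequency["brown"])  # 3
print(absolute_frequency["blue"])   # 2
print(absolute_frequency["green"])  # 1
```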

Administrative data

Administrative data are collected as part of the day-to-day processes and record keeping of organisations.

See: Data Sources

Bar chart

A bar chart is a type of graph in which each column (plotted either vertically or horizontally) represents a categorical variable or a discrete ungrouped numeric variable.

See: Frequency Distribution

Categorical variable

Categorical variables have values that describe a 'quality' or 'characteristic' of a data unit, like 'what type' or 'which category'.

See: What are Variables?

Causation

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

See: Correlation and Causation

Census (complete enumeration)

A census is a study of every unit, everyone or everything, in a population.

See: Census and Sample

Classifications

Classifications are used to collect and organise information into categories with other similar pieces of information.

See: What are Standards?

Class interval

A class interval is a range of data values. Each class interval has a lower and upper limit and contains all observations with values in that range. Class intervals cannot overlap with one another. For example 0 - 4, 5 - 8, 9 - 12.
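
A rough sketch of grouping observations into the example class intervals above. It assumes whole-number observations (so that no value falls between 4 and 5 or between 8 and 9); the helper name `class_interval_for` and the data are hypothetical.

```python
# The non-overlapping class intervals from the example, as (lower, upper) limits.
CLASS_INTERVALS = [(0, 4), (5, 8), (9, 12)]

def class_interval_for(value):
    """Return the class interval containing value, or None if there is none."""
    for lower, upper in CLASS_INTERVALS:
        if lower <= value <= upper:
            return (lower, upper)
    return None

# Count how many invented observations fall in each interval.
counts = {interval: 0 for interval in CLASS_INTERVALS}
for value in [0, 3, 5, 7, 8, 12]:
    counts[class_interval_for(value)] += 1

print(counts)  # {(0, 4): 2, (5, 8): 3, (9, 12): 1}
```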

Cohort

A cohort is a group of data units sharing a common experience or characteristic.

Comparability

Comparability is the ability to validly compare statistics that have been collected over time, or from different sources.

See: What are Standards?

Confidence interval

A confidence interval is a range in which it is estimated the true population value lies.

See: Measures of Error

Confidentiality

Confidentiality refers to the obligation of organisations that collect information to ensure that no person or organisation is likely to be identified from any data released.

See: Confidentiality

Continuous variable

A continuous variable is a numeric variable. Observations can take any value between a certain set of real numbers.

See: What are Variables?

Correlation

Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.

See: Correlation and Causation
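
One common correlation measure is Pearson's correlation coefficient, which ranges from -1 to 1. The sketch below assumes that measure (the definition above does not name a specific one) and uses invented data; the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]   # rises exactly in step with hours

print(pearson_r(hours, scores))        # 1.0 (perfect positive relationship)
print(pearson_r(hours, scores[::-1]))  # -1.0 (perfect negative relationship)
```

The sign gives the direction of the relationship and the magnitude gives its size. A value near 1 or -1 does not by itself show causation.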

Coverage

The coverage is the actual population of units within the scope of a data collection about which data can actually be collected. As it is not always possible to collect data from units in the population of interest, units may be in scope but not in coverage.

See also: Scope

Cyclical effect

A cyclical effect is any regular fluctuation in daily, weekly, monthly or annual data.

See: Time Series Data

Data

Data are measurements or observations that are collected as a source of information.

See: What are Data?

Data item (or variable)

A data item is a characteristic (or attribute) of a data unit which is measured or counted, such as height, country of birth, or income.

See: What are Data?

Dataset

A dataset is a complete collection of all observations.

See: What are Data?

Data unit

A data unit is one entity (such as a person or business) in the population being studied, about which data are collected.

See: What are Data?

Data visualisation

Data visualisation involves the visual presentation of data to communicate the stories contained in the dataset.

See: Data Visualisation

Descriptive (or summary) statistics

Descriptive statistics summarise the raw data and allow data users to interpret a dataset more easily.

See: What are Statistics?

Discrete variable

A discrete variable is a numeric variable. Observations can take a value based on a count from a set of distinct whole values.

See: What are Variables?

Error (Statistical error)

Statistical error describes the difference between a value obtained from a data collection process and the 'true' value for the population.

See: Types of Error

Estimate

An estimate is a value that is inferred for a population based on data collected from a sample of units from that population.

See: Estimate and Projection

Flow series

A flow series is a series which is a measure of activity over a given period.

See: Time Series Data

Frequency

The frequency is the number of times a particular value for a variable (data item) has been observed to occur.

See: Describing Frequencies

Frequency distribution

Frequency distributions are visual displays that organise and present frequency counts so that the information can be interpreted more easily.

See: Frequency Distribution

Histogram

A histogram is a type of graph in which each column represents a numeric variable, in particular one that is continuous and/or grouped.

See: Frequency Distribution

Index number

An index number is a ratio measuring the value of a data item at one time in relation to its value at a base period. Index numbers measure change without giving the actual numerical value of the data item.
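
A minimal sketch of the calculation, assuming the common convention of setting the base period to 100 (the definition above does not fix the scale). The prices and the `index_number` helper are invented for illustration.

```python
def index_number(value, base_value, base=100.0):
    """Value relative to the base period, scaled so the base period equals base."""
    return value * base / base_value

# Invented prices for one item, with 2020 taken as the base period.
prices = {2020: 50.0, 2021: 55.0, 2022: 60.0}
index = {year: index_number(price, prices[2020]) for year, price in prices.items()}

print(index)  # {2020: 100.0, 2021: 110.0, 2022: 120.0}
```

The index shows a 10% rise by 2021 and a 20% rise by 2022 without reporting the prices themselves.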

Inferential statistics

Inferential statistics are used to infer conclusions about a population from a sample of that population.

See: What are Statistics?

Interquartile range (IQR)

The interquartile range (IQR) is the difference between the upper (Q3) and lower (Q1) quartiles, and describes the middle 50% of values when ordered from lowest to highest.

See: Measures of Spread
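
The IQR can be sketched with Python's standard `statistics.quantiles`. Note that several quartile conventions exist and can give slightly different results; this example assumes the function's default ("exclusive") method and uses invented data.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

print(q1, q3)  # 3.0 9.0
print(iqr)     # 6.0
```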

Irregular effect

An irregular effect is any movement that occurred at a specific point in time, but is unrelated to a season or cycle.

See: Time Series Data

Mean

The mean is the sum of the value of each observation in a dataset divided by the number of observations. This is also known as the arithmetic average.

See: Measures of Central Tendency

Measures of central tendency (centre or central location)

A measure of central tendency (also referred to as measures of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.

See: Measures of Central Tendency

Measures of shape

Measures of shape describe the distribution (or pattern) of the data within a dataset.

See: Measures of Shape

Measures of spread

Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item).

See: Measures of Spread

Median

The median is the middle value in a distribution when the values are arranged in ascending or descending order.

See: Measures of Central Tendency

Metadata

Metadata is the information that defines and describes data.

See: What is Metadata?

Mode

The mode is the most commonly occurring value in a distribution.

See: Measures of Central Tendency
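
The three measures of central tendency defined in this glossary (mean, median and mode) can be compared on one small dataset using Python's standard `statistics` module. The values are invented for the example.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7]

centre_mean = mean(values)      # (2 + 3 + 3 + 5 + 7) / 5
centre_median = median(values)  # middle value of the ordered data
centre_mode = mode(values)      # most commonly occurring value

print(centre_mean, centre_median, centre_mode)  # 4 3 3
```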

Nominal variable

A nominal variable is a categorical variable. Observations can take a value that is not able to be organised in a logical sequence.

See: What are Variables?

Non-random (non-probability) sample

In a non-random (or non-probability) sample some units of the population have no chance of selection, the selection is non-random, or the probability of their selection cannot be determined.

See: Census and Sample

Non-sampling error

Non-sampling error is caused by factors other than those related to sample selection.

See: Types of Error

Normal distribution

A normal distribution is a true symmetric distribution of observed values.

See: Measures of Shape

Numeric variable

Numeric variables have values that describe a measurable quantity as a number, like 'how many' or 'how much'.

See: What are Variables?

Observation

An observation is an occurrence of a specific data item that is recorded about a data unit.

See: What are Data?

Ordinal variable

An ordinal variable is a categorical variable. Observations can take a value that can be logically ordered or ranked.

See: What are Variables?

Original time series

An original time series shows the actual movements in the data over time.

See: Time Series Data

Outlier

Outliers are extreme, or atypical data value(s) that are notably different from the rest of the data.

See: Measures of Central Tendency

Percentage

A percentage expresses a value for a variable in relation to a whole population as a fraction of one hundred.

See: Describing Frequencies

Population

A population is any complete group with at least one characteristic in common.

See: What is a Population?

Projection

A projection indicates what the future changes in a population would be if the assumptions about future trends actually occur.

See: Estimate and Projection

Proportion

A proportion describes the share of one value for a variable in relation to a whole.

See: Describing Frequencies
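
A short worked example relating a proportion to the corresponding percentage (the proportion expressed as a fraction of one hundred). The survey figures are invented.

```python
# Invented survey result: 15 of 60 respondents are left-handed.
left_handed = 15
total_respondents = 60

proportion = left_handed / total_respondents  # share of the whole
percentage = proportion * 100                 # the same share out of one hundred

print(proportion)  # 0.25
print(percentage)  # 25.0
```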

Qualitative data

Qualitative data are measures of 'types' and may be represented by a name, symbol, or a number code.

See: Quantitative and Qualitative Data

Quantitative data

Quantitative data are measures of values or counts and are expressed as numbers.

See: Quantitative and Qualitative Data

Quartiles

Quartiles divide an ordered dataset into four equal parts, and refer to the values of the point between the quarters. A dataset may also be divided into quintiles (five equal parts) or deciles (ten equal parts).

See: Measures of Spread

Random (probability) sample

In a random (or probability) sample each unit in the population has a chance of being selected, and this probability can be accurately determined.

See: Census and Sample

Range

The range is the difference between the smallest value and the largest value in a dataset.

See: Measures of Spread
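
In code, the range is simply the largest value minus the smallest. The data below are invented.

```python
values = [4, 9, 15, 21, 30]

value_range = max(values) - min(values)

print(value_range)  # 26
```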

Rate

A rate is a measurement of one value for a variable in relation to another measured quantity.

See: Describing Frequencies

Ratio

A ratio compares the frequency of one value for a variable with another value for the variable.

See: Describing Frequencies

Relative frequency

A relative frequency describes the number of times a particular value for a variable (data item) has been observed to occur in relation to the total number of values for that variable.

See: Describing Frequencies
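
As a sketch, relative frequencies can be obtained by dividing each count by the total number of observations. The transport data are invented for the example.

```python
from collections import Counter

# Invented observations of how eight people travel to work.
transport = ["bus", "car", "car", "bike", "car", "bus", "car", "bus"]

counts = Counter(transport)   # number of times each value occurs
total = len(transport)        # total number of observations

relative_frequency = {value: count / total for value, count in counts.items()}

print(relative_frequency["car"])   # 0.5
print(relative_frequency["bus"])   # 0.375
print(relative_frequency["bike"])  # 0.125
```

The relative frequencies of all values for a variable sum to 1.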

Relative standard error (RSE)

The relative standard error (RSE) is the standard error expressed as a proportion of an estimated value.

See: Measures of Error
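
A minimal sketch of the calculation, using an invented estimate and standard error. The helper name `relative_standard_error` is hypothetical.

```python
def relative_standard_error(standard_error, estimate):
    """Standard error expressed as a proportion of the estimate."""
    return standard_error / estimate

# Invented figures: an estimate of 200 with a standard error of 10.
rse = relative_standard_error(10.0, 200.0)

print(rse)  # 0.05
```

An RSE of 0.05 is often reported as 5%.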

Respondent

A respondent provides data about themselves as a unit, or as a representative of another unit in a population.

See: Data Sources

Sample (partial enumeration)

A sample is a subset of units in a population, selected to represent all units in a population of interest.

See: Census and Sample

Sampling error

Sampling error occurs solely as a result of using a sample from a population, rather than conducting a census (complete enumeration) of the population.

See: Types of Error

Scope

The scope is the set of units that comprise the population of interest (target population) about which data are being collected.