id | gender | score |
---|---|---|
id1 | man | 4.2 |
id2 | Man | 5,3 |
id3 | man | 5,9 |
id4 | woman | 3.1 |
id5 | woman | 7,2 |
Data Representation
Compiled on R 4.3.1
Current draft aims to introduce researchers to the key ideas in data representation that would help to prepare their data for data analysis.
Our target audience is primarily the research community at VUB / UZ Brussel, those who have some basic experience in R and want to know more.
We invite you to help improve this document by sending us feedback: wilfried.cools@vub.be
In preparation of data analysis, it is wise to think carefully about how to represent your data. The key ideas are listed first, and will be explained and exemplified in more detail throughout current draft.
- represent data so that
- you and fellow researchers understand it, now but also in the future,
- statistical algorithms understand it,
- the gap researcher - algorithm is minimized (efficient processing)
- allows for straightforward data manipulation, modeling, visualization.
- table formats combine rows and columns in cells:
- cells contain one and only one piece of information,
- rows relate cells to a research unit, could be a patient, a mouse, a center, … ,
- columns relate cells to a property,
- cells offer information for specific research unit - property combinations.
- ideally, data are TIDY, with meaning appropriately mapped into structure:
- each row an observation as research unit,
- each column a variable as property,
- each cell a value,
- note: data can be split into multiple tables.
- check data by
- eye-balling to ensure a correct and unambiguous interpretation of cell values,
- descriptive analysis to detect anomalies from frequency tables and summary statistics (eg., mean, median, minimum-maximum).
Challenge
Lets generate an exam with questions ordered differently depending on the exam series, using R-markdown.
Test yourself: create a data file for the following 4 participants (assuming many more), ready for analysis.
Read through this draft and if necessary alter your solution.
A possible solution is included at the end.
- Enid Charles, age 43,
- visual score 16, mathematical score 2.4,
- suggested methods A and B,
- performance score at first time point 101 and second time point 105.
- Gertrude Mary Cox, age 34,
- visual score 26, mathematical score 1.4,
- suggested methods A,
- performance score at first time point missing and second time point 115.
- Helen Berg, age 53,
- visual score 20, mathematical score missing,
- suggested methods none (not A, nor B, nor C),
- performance score at first time point 111 and second time point 110.
- Grace Wahba, age 50,
- visual score 30, mathematical score above cut-off 10,
- suggested methods A,
- performance score at first time point 91 and second time point 115.
Outline
Current draft addresses data representation with the following outline:
- a challenge: it is not always clear how (see above)
- errors and inconveniences
- common problems and solutions
In following drafts, data manipulation, modeling and visualization are considered. Typically, all are more straightforward when data are more tidy.
Errors and inconveniences
To avoid problems and frustration in your data analysis, it may be worthwhile to consider the checklist below. It points at various issues that have been encountered in actual data at ICDS and that are easy to avoid. In general most data offered by researchers whom did not attempt to do their own analysis, or at least the preliminary descriptives, is full with issues like the ones highlighted in this section.
In summary:
- inconsistencies
- ambiguities / incompleteness
- inconveniences for either software or user
Error: inconsistent specification of cell values
When labeling or scoring properties for research units (cells), avoid typo’s, inconsistent labeling, inconsistent scoring, …
Often observed problems:
- typing errors in values or labels, eg.,
man
-women
-womem
orlikely
-likly
-Likely
, - inconsistent use of capital letters, eg.,
man
-Man
-woman
. Most statistical software is case sensitive (eg., R), - inconsistent use of spaces (
_
), eg.,man__
-man
-_woman
-woman
, - inconsistent use of decimal indicators, eg.,
4.2
-5,3
-5,9
. A comma is often used locally, a dot is used internationally (scientifically), - inconsistent use of missing value indicators:
_
- NA - 99. Software differ in their default, but consistency is key !
Advice: frequency tables often suffice to detect most of these errors, or a summary for numeric values.
Var1 | Freq |
---|---|
man | 1 |
Man | 1 |
man | 1 |
woman | 2 |
Note that the average score for the table on the left appears to be 3.65, do you see what went wrong ?
Error: ambiguous and incomplete specification of cell values
When labeling or scoring properties for research units (cells), avoid ambiguity and incompleteness.
Often observed problems within cells:
- empty cells not implying missing values
- eg., those that imply the label above (eg., Excel showcase below with empty field meaning
group 1
), - eg., those implying either
missing
ornone
, no answer is different from the answer 0 or “” (eg.,types
variable in ambiguous - incomplete below),
- eg., those that imply the label above (eg., Excel showcase below with empty field meaning
- combined numerical and non-numerical values, eg.,
3.9
combined with>10
(eg.,score
variable in ambiguous - incomplete below), - combined information within a cell, eg.,
A:B
,A:C
,B
to signal treatments received (none or A, B, and/or C) (eg.,types
variable in ambiguous - incomplete below).
Each cell should best be fully interpretable on its own, with reference to both row and column only. A codebook, discussed below, serves to alleviate any possible discrepancy between the data representation and the actual data.
Often observed problems combining cells:
- multiple line headers (eg., Excel showcase
blood volume
for bothbaseline and after treatment
), - merged cells (eg., Excel showcase
baseline measurement
).
Inconvenience: use of special characters and numbers
When labeling or scoring, or when specifying a variable name, avoid characters that may not be understood properly. Note that some characters call for specific operations in certain statistical software.
Often observed inconveniences follow from using:
- special characters and spaces (eg.,
$, %, #, ", ',
), - use of names starting with numbers (eg., 1st).
Advice: keep columns with text, not part of the statistical analysis, in a separate file.
id | types | score |
---|---|---|
id1 | A:B | 4.2 |
id2 | A | |
id3 | B | 5.9 |
id4 | A:B | >10 |
id5 | 7.2 |
id | type | score |
---|---|---|
id1 | % use | 4.2 |
id2 | % use | 5,3 |
id3 | 'run' | 5,9 |
id4 | 'run' | 3.1 |
id5 | % use | 7,2 |
Inconvenience: complex and lengthy labels and values
When labeling variables or values, strike a balance between meaningful and simple. This is especially important when requesting help from data analysts who typically program their analysis and often do not understand your line of research. Some analysts may even prefer all values as numeric, (eg., 0 vs. 1) while others prefer short alphanumeric values (eg., male vs female).
Advice: To keep meaningful but long and complex headers, use a second line with simple headers to read in for the analysis. Maybe use patientID
and id1
instead of patient_identifiers_of_first_block
and patient_number_1
.
patient_identifiers_of_first_block | my type | %mg rating |
---|---|---|
patient identity number 1 | condition with extra air | 4.2 mg/s |
patient identity number 2 | condition without extra air | 5,3 mg/s |
patient identity number 3 | condition with extra air | 5,9 mg/s |
patient identity number 4 | condition with extra air (stopped early) | 3.1 mg/s |
patient identity number 5 | condition without extra air | 7,2 mg/s |
Advice: To ensure a correct interpretation, now and later, the researcher could make the following distinction,
- use numbers when values could be interpreted on a continuous scale,
- use text with clear order like
notAgree - neutral - agree
, - use text postfixed with numbers with unclear order like
r1 - r2 - r3
for ordinal scale not to be used as continuous, - use text for all remaining labels.
id | type | intensity | score | rank |
---|---|---|---|---|
id1 | black | low | 4.2 | rnk1 |
id2 | black | medium | 5.3 | rnk4 |
id3 | red | low | 5.9 | rnk3 |
id4 | yellow | high | 3.1 | rnk3 |
id5 | black | low | 7.2 | rnk2 |
A codebook could address the relation between labels and their interpretation as well.
Inconvenience: irrelevant data
When starting the analysis, or offering data to third parties, retain only the data of interest for the analysis. Store the remainder of the data in a secure place with an appropriate link.
An example of splitting up data into two tables, one for analysis and one for identification and context is included:
Note that factor turns a variable into a factor (full set of possible values, numeric in essence but used for its labels). Turning factors as numeric provides the underlying numbers.
name | score1 | score2 | sumscore | comments |
---|---|---|---|---|
Enid Charles | 3 | 4 | 7 | some problems at the start |
Gertrude Mary Cox | 3 | 3 | 6 | |
Helen Berg | 4 | 0 | 4 | patient showed no interest |
Grace Wahba | 4 | 4 | 8 |
Advice: To ensure a correct interpretation, now and later, the researcher could make the following distinction,
- use numbers when values could be interpreted on a continuous scale,
- use text with clear order like
notAgree - neutral - agree
, - use text postfixed with numbers with unclear order like
r1 - r2 - r3
for ordinal scale not to be used as continuous, - use text for all remaining labels.
id | score1 | score2 | sumscore |
---|---|---|---|
1 | 3 | 4 | 7 |
2 | 3 | 3 | 6 |
4 | 4 | 0 | 4 |
3 | 4 | 4 | 8 |
id | name | comments |
---|---|---|
1 | Enid Charles | some problems at the start |
2 | Gertrude Mary Cox | |
4 | Helen Berg | patient showed no interest |
3 | Grace Wahba |
Error: spreadsheets for human interpretation only
Spreadsheets are convenient for representing data because their base structure is a table, with rows and columns, which you need for most statistical analysis, and because they allow for straightforward manipulations of data.
Manually constructed spreadsheets, Excel or other, unfortunately, promote the use of implicit information rather than the required explicit information. For example, cells are left empty because it is, at least for a human, clear from the context what the value should be (eg., Excel showcase, empty field meaning group 1 or 2).
- incompleteness due to implicit information
- use of merged cells, not understood by algorithms
Excel deserves special attention. Understandably very popular, it often does more than expected and can cause serious problems.
Often observed problems:
- inappropriate cell types (eg., numeric values read in as if they are dates),
- inappropriate dimensions (eg., activated cells outside the data-frame or hidden columns),
Advice: A safe way to store data, once fully ready, could be a tab-delimited text file. While inconvenient to manipulate, risks for unwanted behavior are eliminated. It is straightforward to convert one into the other.
Common problems and solutions
For data analysis data is most often represented in one or more tables. It is repeated that:
- Tables combine rows and columns into cells (see key message):
- with rows that relate cells within a research unit (eg., a row contains all information about a particular observation),
- with columns that relate cells to a property (eg., a column contains all information about a particular variable)
- with cells that contain values which offer one and only one piece of information, combining a research unit and a property.
- Tables for different but related research units are linked by identifiers (eg., table for observation information, table for participant information).
A bad bad exemplary case, using R to turn it around
While it is best to avoid a bad data table from the start, it is in many cases not impossible to convert tables into more appropriate forms.
Purely for illustration purposes, R
code is included using the tidyverse
package to show a possible data transformation starting from a bad example turning it into another data representation. In current draft the focus is on data representation, not on changing it. More details on how to manipulate, visualize and model data are offered in future drafts.
Consider this monstrous dataset, showing various features that are common in data offered for analysis.