Data Concepts and Types
Data+ starts with foundational data literacy.
Data types: quantitative data is numerical, either discrete (a count of events) or continuous (temperature, revenue). Qualitative data is categorical, either nominal (no inherent order: colour, country) or ordinal (ordered: low/medium/high, star rating).
Data structure: structured data is organised in rows and columns, as in relational databases, and is easily queried with SQL. Semi-structured data has partial structure, typically key-value or hierarchical: JSON, XML. CSV is often listed here as well, although a CSV file with a consistent header row is tabular and is commonly treated as structured. Unstructured data has no predefined schema (text documents, images, video, audio) and requires techniques such as NLP or computer vision to extract structure.
Data sources: primary data is collected directly for the purpose at hand (surveys, sensors, experiments). Secondary data was collected for another purpose (public datasets, purchased data, operational databases).
Data pipelines: ETL extracts from the source, transforms (clean, reshape, join), then loads to the destination. ELT extracts, loads the raw data, then transforms inside the destination, which is common with cloud data warehouses.
Data quality dimensions: accuracy (values are correct), completeness (no missing data), consistency (the same values across systems), timeliness (current enough for the use case), uniqueness (no duplicates), validity (values conform to defined rules).
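Several of the quality dimensions above can be checked mechanically. The following is a minimal sketch, assuming a small list of records; the field names, the 1-to-5 rating rule, and the 90-day freshness threshold are all hypothetical and chosen only for illustration. Accuracy and consistency are left out because they need a trusted reference or a second system to compare against.

```python
from datetime import date

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "country": "DE", "rating": 4, "updated": date(2024, 5, 1)},
    {"id": 2, "country": None, "rating": 5, "updated": date(2024, 5, 2)},
    {"id": 2, "country": "FR", "rating": 9, "updated": date(2023, 1, 15)},
]


def quality_report(rows, today, max_age_days=90):
    """Count rows that violate each data quality dimension."""
    ids = [r["id"] for r in rows]
    return {
        # Completeness: at least one field is missing (None)
        "incomplete": sum(any(v is None for v in r.values()) for r in rows),
        # Uniqueness: number of rows that repeat an id already seen
        "duplicates": len(ids) - len(set(ids)),
        # Validity: rating must lie between 1 and 5 (assumed rule)
        "invalid": sum(not (1 <= r["rating"] <= 5) for r in rows),
        # Timeliness: last update is older than max_age_days (assumed rule)
        "stale": sum((today - r["updated"]).days > max_age_days for r in rows),
    }


report = quality_report(records, today=date(2024, 6, 1))
print(report)
```

In an ETL pipeline a check like this would typically sit in the transform step, before loading; in ELT it would run inside the destination after the raw load.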
Data Analysis and Statistics
Statistical analysis for Data+.
Descriptive statistics: mean (average: sum divided by count, sensitive to outliers), median (middle value: resistant to outliers, preferred for skewed distributions such as income), mode (most frequent value: useful for categorical data), range (max - min), variance (average squared deviation from the mean), standard deviation (square root of the variance, the typical distance of values from the mean; a larger SD means more spread).
Distributions: in a normal distribution (bell curve) mean = median = mode, and the 68-95-99.7 rule applies: about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 SDs of the mean. Skewness: right skew (tail on the right, mean > median, as with income data), left skew (tail on the left, mean < median).
Correlation: measures the strength and direction of the linear relationship between two variables, using the correlation coefficient (r) from -1 to 1. Positive correlation means that as X increases, Y increases; negative correlation means that as X increases, Y decreases; r = 0 means no linear relationship. Correlation is not causation: a third variable (a confounding variable) may explain both.
Regression: predicts the value of a dependent variable from one or more independent variables. Linear regression fits a straight line to the data.
Hypothesis testing: the null hypothesis (H0) states that there is no effect or no difference; the alternative hypothesis (H1) states that there is an effect. If the p-value is below the chosen significance level alpha (commonly 0.05), reject the null hypothesis; the result is statistically significant. The p-value is the probability of seeing a result at least this extreme if H0 were true, not the probability that H0 is true.
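The descriptive statistics and the correlation coefficient described above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up numbers: the `incomes`, `hours`, and `scores` samples are hypothetical, and `pearson_r` is a helper written here for illustration rather than a library function.

```python
import statistics as st

# Hypothetical right-skewed sample: one large value pulls the mean upward.
incomes = [30, 32, 35, 35, 40, 45, 200]

mean = st.mean(incomes)            # sensitive to the outlier (200)
median = st.median(incomes)        # resistant to the outlier
mode = st.mode(incomes)            # most frequent value
value_range = max(incomes) - min(incomes)
variance = st.pvariance(incomes)   # population variance
std_dev = st.pstdev(incomes)       # square root of the variance


def pearson_r(xs, ys):
    """Pearson correlation coefficient r, between -1 and 1."""
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical paired observations with a strong positive relationship.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)

print(f"mean={mean:.1f} median={median} mode={mode} range={value_range}")
print(f"variance={variance:.1f} sd={std_dev:.1f} r={r:.3f}")
```

Because the sample is right-skewed, the mean lands well above the median, which is the pattern the skewness note describes. The code uses the population formulas (`pvariance`, `pstdev`); for a sample drawn from a larger population, `st.variance` and `st.stdev` divide by n - 1 instead.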
Data Visualisation and Reporting
Choosing the right visualisation is a core Data+ competency.
Chart types: bar chart (compares categories; best for discrete groups), line chart (shows a trend over time; best for continuous time-series data), scatter plot (shows the relationship between two numeric variables; used to visualise correlation), pie chart (shows proportions of a whole; keep to 5-7 slices at most and use a bar chart for more), histogram (shows the distribution of a single continuous variable as binned frequencies), box plot (shows distribution statistics such as median, quartiles, and outliers; useful for comparing distributions across groups), heat map (shows matrix data with colour intensity; good for correlation matrices), waterfall chart (shows the cumulative effect of positive and negative changes, as in a financial P&L).
Visualisation best practices: match the chart type to the data type and the question; eliminate chart junk (3D effects, unnecessary gridlines, decorative elements); use colour purposefully (to highlight, to encode a third dimension, or for categorical grouping, not for decoration); always label axes; include the data source and date.
Dashboard design: executive dashboards show KPIs and trend indicators; operational dashboards show real-time metrics; analytical dashboards allow drill-down exploration.
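The chart-selection guidance above amounts to a lookup from the analytical question to a chart type. A minimal sketch of that mapping follows; the function `suggest_chart` and its goal labels are hypothetical names invented for this example, and the 7-slice cut-off simply takes the upper end of the 5-7 range given above.

```python
def suggest_chart(goal, n_categories=0):
    """Return a chart type for a (hypothetical) analysis goal label."""
    if goal == "proportion":
        # Pie charts become hard to read beyond a handful of slices.
        return "pie chart" if n_categories <= 7 else "bar chart"

    chart_by_goal = {
        "compare_categories": "bar chart",
        "trend_over_time": "line chart",
        "relationship": "scatter plot",
        "distribution": "histogram",
        "compare_distributions": "box plot",
        "matrix": "heat map",
        "cumulative_change": "waterfall chart",
    }
    if goal not in chart_by_goal:
        raise ValueError(f"unknown goal: {goal!r}")
    return chart_by_goal[goal]


print(suggest_chart("trend_over_time"))
print(suggest_chart("proportion", n_categories=12))
```

A table like this is a memory aid for the exam rather than a rule; real chart choices also depend on the audience and on how many data points are involved.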
Data Governance and Ethics
Data governance ensures data is trustworthy, secure, and used appropriately.
Data governance programme: data catalogue (a metadata inventory of all data assets: what exists, where it is, what it means, who owns it), data lineage (tracking data from its origin through transformations to final use; essential for debugging data quality issues), data classification (sensitivity labels such as public, internal, confidential, restricted; these drive access controls and retention policies), master data management (MDM: a single authoritative source for key business entities such as customer, product, and employee; prevents duplicates and inconsistencies across systems).
Data privacy regulations: GDPR (EU: processing needs a lawful basis, of which consent is one; right of access; right to erasure; breach notification to the supervisory authority within 72 hours; applies to organisations anywhere in the world that offer goods or services to, or monitor, individuals in the EU, based on location rather than citizenship), CCPA (California: right to opt out of the sale of personal information, right to know, right to delete; applies to California residents), HIPAA (US healthcare: protects PHI, protected health information; applies to covered entities and their business associates).
Data ethics: data collection (collect only what you need: data minimisation), data use (use data only for the stated purposes: purpose limitation), fairness (examine training data and model outputs for bias), transparency (be clear about how data is used and how decisions are made).
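Two of the ideas above, classification labels driving access control and data minimisation, lend themselves to a short illustration. This is a minimal sketch under stated assumptions: the `Sensitivity` enum, the `catalogue` entries, and the `can_read` and `minimise` helpers are hypothetical, and a real governance programme would enforce such rules in the data platform rather than in application code.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Sensitivity labels, ordered from least to most restricted."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical catalogue entries: dataset name -> sensitivity label.
catalogue = {
    "press_releases": Sensitivity.PUBLIC,
    "sales_pipeline": Sensitivity.CONFIDENTIAL,
    "patient_records": Sensitivity.RESTRICTED,
}


def can_read(clearance, dataset):
    """Allow access only when the clearance meets or exceeds the label."""
    return clearance >= catalogue[dataset]


def minimise(record, needed_fields):
    """Data minimisation: keep only the fields the stated purpose needs."""
    return {k: v for k, v in record.items() if k in needed_fields}


signup = {"email": "a@example.com", "age": 30, "national_id": "X123"}
print(can_read(Sensitivity.INTERNAL, "sales_pipeline"))
print(minimise(signup, {"email"}))
```

Ordering the labels as integers keeps the access rule to a single comparison; a label scheme with non-hierarchical categories would need an explicit permission table instead.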