This course assumes you have a working knowledge of basic mathematics and familiarity with computers.
Course Information Pack: Please read this document for orientation.
The Course Website contains all the learning content, which students can work through in their own time and space.
Before Class:
During Class:
After Class:
Flexible Learning: Classes are recorded, allowing you to watch them at your convenience via Canvas in EchoCenter.
Teamwork is encouraged (worksheet activities and group assignments). This closely mirrors real-world workforce learning, which happens through on-the-job interactions with peers and teammates.
The Data Wrangling course is supported by the DataCamp for Classroom initiative.
Course assessment comprises the following:
Self study:
Most statistical theory concentrates on data modelling, prediction, and statistical inference, and usually assumes that the data are already in the correct state for analysis. In practice, however, a data analyst spends most of their time (usually 50%-80%) preparing the data before any statistical operation can be carried out.
Despite the amount of time it takes, there has been surprisingly little emphasis on how to preprocess data well. Real-world data are commonly incomplete, noisy and inconsistent, and often lack the correct labels and codes required for the analysis.
Data Preprocessing, also commonly referred to as data wrangling, data manipulation, data cleaning, etc., is the process and collection of operations needed to prepare all forms of untidy data (incomplete, noisy and inconsistent data) for statistical analysis.
We will define 5 major tasks for the data preprocessing framework, namely: Get, Understand, Tidy & Manipulate, Scan and Transform.
In the following modules of this course, we will unwrap each of these preprocessing tasks by providing details of operations related to that task.
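To make the five tasks concrete, here is a minimal sketch in base R. The data set, the column names and the outlier threshold of 300 are all hypothetical and chosen only for illustration; the course modules cover each task properly with dedicated packages.

```r
# Hypothetical toy data: three people weighed in two years (wide format).
raw_csv <- "id,weight_2019,weight_2020
1,70.5,71.0
2,NA,82.3
3,65.2,640"

# 1. Get: import the data (here from an inline string rather than a file).
wide <- read.csv(text = raw_csv)

# 2. Understand: inspect dimensions, variable types and structure.
str(wide)

# 3. Tidy & Manipulate: reshape so that each row is one observation
#    (one id in one year).
long <- reshape(
  wide,
  direction = "long",
  varying   = c("weight_2019", "weight_2020"),
  v.names   = "weight",
  timevar   = "year",
  times     = c(2019, 2020),
  idvar     = "id"
)
rownames(long) <- NULL

# 4. Scan: count missing values and flag implausible values.
#    The threshold of 300 is an arbitrary assumption for this toy example.
n_missing  <- sum(is.na(long$weight))
is_outlier <- !is.na(long$weight) & long$weight > 300

# 5. Transform: drop the flagged rows and standardise the remaining weights.
clean <- long[!is.na(long$weight) & !is_outlier, ]
clean$weight_z <- as.numeric(scale(clean$weight))
```

Dropping rows is only one possible way to handle missing values and outliers; it is used here to keep the sketch short.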
We will use the readr, tidyr, dplyr, mlr, stringr, lubridate and RMarkdown packages, and many others.
We will cover eight different modules in this course, namely:
In each module, we will unwrap these data preprocessing tasks by providing details of operations related to that task.
By completion of these modules you should be able to:
Apply data integration techniques to import and combine different sources of data.
Critically reflect upon different data sources, types, formats and structures.
Apply different data manipulation techniques to recode, filter, select, split, aggregate, and reshape the data into a format suitable for statistical analysis.
Justify data by detecting and handling missing values, outliers, inconsistencies and errors.
By completing the class worksheets, module exercises, DataCamp modules and assignments, you will gain and demonstrate practical experience with real data problems.
Effectively use leading open source software for reproducible, automated data preprocessing.
R is a free programming language and environment for statistical computing - https://www.r-project.org/
Why learn R?
RStudio is a free integrated development environment for R and makes using R a lot easier and more efficient - https://www.rstudio.com/.
RStudio requires R to be installed.
Speaking of software, we will use R and RStudio in this course. You won't learn anything about Excel, SPSS, SQL, SAS, Python, Julia, or any other statistical package/programming language useful for data preprocessing. This isn't because I think that these tools are bad or redundant. They are not.
In practice, most data analytics teams use a mixture of these tools and programming languages. I strongly believe that R is a great place to start your data analysis journey, as it is a comprehensive language for data analysis. You can use R effectively in almost every step of data analysis, from data collection to reporting. You can collect, preprocess, visualise and analyse your data using R functions, and report and publish your findings using RMarkdown.
The RStudio interface consists of four main windows: the source window (or source editor), the console, the environment window, and the Files, Plots, Help and Viewer window.
Source window:
The Source window is where you can open or create an R script file, and add, edit, save and share your R code so that your analysis can be reproduced. Code sitting in your script is not executed until you select it and run it, either by clicking the Run button or by pressing Ctrl+Enter (Cmd+Return on macOS).
Packages are collections of related functions.
Comprehensive R Archive Network (CRAN) lists over 10,000 available packages!
Packages are the reason why R is so powerful.
Packages need to be installed first.
install.packages("dplyr", dependencies = TRUE)
Use the dependencies = TRUE option, as many packages require other packages to run. This option checks for and installs dependent packages where required.
Once a package is installed, load it with:
library(dplyr)
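Because a package only needs to be installed once but must be loaded in every new R session, a common pattern is to install it only when it is missing. The helper below is a minimal sketch of that pattern; `ensure_package` is a hypothetical name introduced here for illustration and is not part of R or of any package.

```r
# Hypothetical helper: install a package only if it is not already
# available, then attach it to the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
ensure_package("stats")

# In this course you would call, for example:
# ensure_package("dplyr")
```

Checking with `requireNamespace()` first avoids re-downloading a package every time the script is run.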
Read through the Course Information Pack.
How to access the course Canvas shell through myRMIT.
How to access our Course website.
Learn how to install R and RStudio
Know how to install and load R packages (See Module 1 notes).
Know how to get further help for R statistical programming language (refer to Module 1 notes).
Don’t panic. R has a steep learning curve, but you will get heaps of practice in this course!