Module 1 Demonstration
Data Preprocessing: From Raw Data to Ready to Analyse

Image credits: RMIT University, https://flic.kr/p/vArsRW


Get Started

Course Orientation

This course assumes you have a working knowledge of basic mathematics and familiarity with computers.
Course Information Pack: Please read this document for orientation.
The Course Website contains all the learning content for students to work through in their own time and space.


Course Orientation Cont.

Class time:
- Announcements and Questions (~ 5-30 mins)
- Demonstration (~ 1 hr - 1.5 hrs)
- In class activities (~ 1 hr - 1.5 hrs, hands on exercises)

Before Class:
- Watch the pre-recorded lectures.
- Read through the module notes.
- Work on Module 1 worksheet questions.
During Class:
- Actively engage in demonstrations, learning activities and supervised self-study.
After Class:
- Module-based assessments (online tests on Canvas).
- DataCamp modules (for extra study).


Course Orientation Cont.

Flexible Learning: Classes are recorded, so you can watch them at your convenience via Canvas in EchoCenter.
Teamwork is encouraged (worksheet activities and group assignments). This closely mirrors real-world workplace learning, which happens through on-the-job interactions with peers and teammates.


DataCamp Online Courses

The Data Wrangling course is supported by the DataCamp for Classroom initiative.

During this semester, you will have free access to DataCamp learning modules.
I have selected specific modules for you to complete as a skill builder here.
Note that you first need to sign up to DataCamp.

You will have 6 months of FREE access to the full DataCamp course curriculum (>250 hours).
- Access to premium courses (i.e., R, Python and SQL courses).
- You can participate in leaderboards and private discussion forums with your fellow classmates.
- You may also complete other online courses that interest you, as they will help you with your other studies.


Assessment

Course assessment comprises the following:
- Practical (project) assessments 1 & 2 (Weighting 35% and 45%)
- Module-based assessments 1-8 (Weighting each 2.5%)

Self study:
- Worksheet activities (not graded): Each module will be accompanied by worksheet activities.
- DataCamp assignments (not graded): Students can complete them as a skill builder or for extra study.


Module 1 Basics: What is Data Preprocessing?

What is Data Preprocessing?

Data Preprocessing is the process, and the collection of operations, needed to prepare all forms of untidy data (incomplete, noisy and inconsistent data) for statistical analysis.

We will define five major tasks in the data preprocessing framework, namely: Get, Understand, Tidy & Manipulate, Scan and Transform.


Most statistical theory concentrates on data modelling, prediction and statistical inference, and usually assumes that the data are already in the correct state for analysis. In practice, however, a data analyst spends most of their time (usually 50%-80%) getting the data ready before doing any statistical operation.

Despite the amount of time it takes, there has been surprisingly little emphasis on how to preprocess data well. Real-world data are commonly incomplete, noisy and inconsistent, and often lack the correct labels and codes required for the analysis.

Data Preprocessing, which is also commonly referred to as data wrangling, data manipulation, data cleaning, etc., is the process, and the collection of operations, needed to prepare all forms of untidy data (incomplete, noisy and inconsistent data) for statistical analysis.

In the following modules of this course, we will unwrap each of these preprocessing tasks by providing details of operations related to that task.
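To make the five tasks concrete, here is a minimal base R sketch that walks through them on a tiny data set. The data, the variable names (`raw`, `clean`, `height_cm`) and the 250 cm plausibility cut-off are all made up for illustration; they are assumptions, not part of the course material, and later modules cover each task properly.

```r
# Get: read raw data (an inline CSV stands in for a real file here)
raw <- read.csv(text = "id,height_cm,group
1,170,a
2,NA,b
3,165,a
4,999,b", stringsAsFactors = FALSE)

# Understand: inspect the structure and variable types
str(raw)

# Tidy & Manipulate: store the grouping variable as a factor
raw$group <- factor(raw$group, levels = c("a", "b"))

# Scan: count missing values and flag implausible heights
# (the 250 cm threshold is an arbitrary illustrative choice)
n_missing <- sum(is.na(raw$height_cm))
implausible <- !is.na(raw$height_cm) & raw$height_cm > 250

# Treat the flagged values as missing in a cleaned copy
clean <- raw
clean$height_cm[implausible] <- NA

# Transform: convert centimetres to metres
clean$height_m <- clean$height_cm / 100
```

Keeping `raw` untouched and working on a copy is one possible design choice: it lets you compare the data before and after preprocessing.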

What will you learn in this course?

Module 1: Basics: What is DP?
Module 2: Get
Module 3: Understand
Module 4: Tidy & Manipulate
Module 5: Scan: Missing Values
Module 6: Scan: Outliers
Module 7: Transform
Module 8: Special Operations

Technology: Open Source R/RStudio*
Practical experience:
- Class worksheets
- Data challenges
- Assignments
- DataCamp

* Including Base R functions, readr, tidyr, dplyr, mlr, stringr, lubridate, RMarkdown packages and many others.


We will cover eight modules in this course, as listed above. In each module, we will unwrap the corresponding data preprocessing task by detailing the operations related to it.

By completion of these modules you should be able to:

Apply data integration techniques to import and combine different sources of data.

Critically reflect upon different data sources, types, formats and structures.

Apply different data manipulation techniques to recode, filter, select, split, aggregate, and reshape the data into a format suitable for statistical analysis.

Justify data by detecting and handling missing values, outliers, inconsistencies and errors.

By completion of class worksheets, module exercises, DataCamp modules and assignments, you will demonstrate practical experience by having been exposed to real data problems.

Effectively use leading open source software for reproducible, automated data preprocessing.
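As a rough preview of the manipulation techniques named above (recode, filter, select, aggregate), here is a small sketch in base R only, so it runs without extra packages. The `survey` data frame is hypothetical and made up for illustration; dplyr provides equivalent verbs such as `filter()`, `select()` and `summarise()`.

```r
# Hypothetical survey data, made up for illustration
survey <- data.frame(
  id    = 1:6,
  sex   = c("M", "F", "F", "M", "F", "M"),
  age   = c(23, 35, 41, 19, 52, 30),
  score = c(7, 9, 6, 8, 5, 7)
)

# Recode: turn single-letter codes into labelled factor levels
survey$sex <- factor(survey$sex,
                     levels = c("F", "M"),
                     labels = c("Female", "Male"))

# Filter rows and select columns: respondents aged 21 or over,
# keeping only three variables
adults <- survey[survey$age >= 21, c("id", "sex", "score")]

# Aggregate: mean score for each sex
mean_by_sex <- aggregate(score ~ sex, data = adults, FUN = mean)
mean_by_sex
```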

R and RStudio Quick Overview

R is a free programming language and environment for statistical computing - https://www.r-project.org/
Why learn R?
- Recognised across industries
- Promotes coding and computational skills
- Provides access to the world’s largest and most comprehensive library of statistical functions
- Powerful and grows with you
- Works on all major operating systems
- R and RStudio can be used in combination to create new functions and statistical programs, build dynamic and interactive reports, dashboards, websites, slideshows, statistical web applications and all for ... FREE!
RStudio is a free integrated development environment for R and makes using R a lot easier and more efficient - https://www.rstudio.com/.
RStudio requires R to be installed.


Speaking of software, we will use R and RStudio in this course. You won't learn anything about Excel, SPSS, SQL, SAS, Python, Julia, or any other statistical package/programming language useful for data preprocessing. This isn't because I think that these tools are bad or redundant. They are not.

In practice, most data analytics teams use a mixture of these tools and programming languages. I strongly believe that R is a great place to start your data analysis journey, as it is a comprehensive language for data analysis. You can use R effectively in almost every step of data analysis, from data collection to reporting: you can collect, preprocess, visualise and analyse your data using R functions, then report and publish your findings using RMarkdown.

R and RStudio Quick Overview Cont.1


The RStudio interface consists of four main windows: the source window (or source editor), the console, the environment window, and the Files, Plots, Help and Viewer window.

R and RStudio Quick Overview Cont.2


Source window:

The Source window is where you can open or create an R script file, and add, edit, save and share your R code to reproduce your analysis. Code sitting in your script is not executed until you select it and run it by clicking the Run button or pressing Ctrl+Enter (Cmd+Return on a Mac).

Installing and Loading Packages

Packages are collections of related functions.
Comprehensive R Archive Network (CRAN) lists over 10,000 available packages!
Packages are the reason why R is so powerful.
Packages need to be installed first.

install.packages("dplyr", dependencies = TRUE)

Include the dependencies = TRUE option as many packages require other packages to run. This option checks and installs dependent packages where required.
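Since a package only needs to be installed once, a script can check first and install only what is missing. The sketch below shows one possible way to do this; `install_if_missing()` is a hypothetical helper written for this example, not a function from R or any package.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns FALSE (instead of stopping with an error)
  # when a package is not installed
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]

  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, dependencies = TRUE)
  }

  # Return the names of the packages that had to be installed, invisibly
  invisible(missing_pkgs)
}
```

For example, calling install_if_missing("dplyr") would install dplyr only if it is not already available on your computer.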


Installing Packages Overview


Installing and Loading Packages Cont.

Once a package is installed, it needs to be loaded into an R session in order to make its functions available.

library(dplyr)

You will need to reload packages each time you start a new R session. Always start your scripts, notebooks or markdown files by loading all the packages you will need.
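The short sketch below illustrates what loading does in practice. It uses the tools package, which ships with R, purely so that it runs without installing anything; this choice and the file name `survey_data.csv` are assumptions for illustration, and the same pattern applies to dplyr once it is installed.

```r
# Load a package for the current session
library(tools)

# After library(), the package's functions can be called by name
base_name <- file_path_sans_ext("survey_data.csv")

# search() lists what is attached to the current session
is_attached <- "package:tools" %in% search()

# A single function can also be called without library(),
# using the package::function notation
extension <- tools::file_ext("survey_data.csv")
```

The `package::function` form can be handy when you need just one function from a package, or when two loaded packages define functions with the same name.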


Loading Packages Overview


What do you need to know by Week 1?

Read through the Course Information Pack.
How to access the course Canvas shell through myRMIT.
How to access our Course website.

Learn how to install R and RStudio.
Know how to install and load R packages (see Module 1 notes).
Know how to get further help for the R statistical programming language (see Module 1 notes).
Don’t panic. R has a steep learning curve, but you will get heaps of practice in this course!
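As a starting point for getting help, here are a few of R's built-in help commands. This is only a small selection chosen for illustration, with `mean` and `sd` as arbitrary example functions; the Module 1 notes remain the reference for this topic.

```r
# Open the help page for a function (two equivalent forms)
?mean
help("mean")

# Show a function's arguments without opening the full help page
args(sd)

# List objects whose names contain a given piece of text,
# useful when you only remember part of a function name
matches <- apropos("mean")
matches

# Run the examples from a function's help page
example(mean)
```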


Worksheet questions

Complete the following worksheet:

Module 1 Worksheet
