Course Outline

Introduction to Data Science for Big Data Analytics

  • Overview of Data Science
  • Overview of Big Data
  • Data Structures
  • Drivers and complexities of Big Data
  • The Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and challenges
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to the Data Analytics Lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation and communication of results
  • Operationalisation
  • Exercise: Case study

From this point, approximately 80% of training time will be dedicated to practical examples and exercises using R and related big data technologies.

Getting Started with R

  • Installing R and RStudio
  • Key features of the R language
  • Objects in R
  • Working with data in R
  • Data manipulation
  • Big data challenges
  • Exercises

Getting Started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop operating modes
  • HDFS (Hadoop Distributed File System)
  • MapReduce architecture
  • Overview of Hadoop-related projects
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting to Hadoop
  • The RHadoop architecture
  • Hadoop streaming with R
  • Solving data analytics problems with RHadoop
  • Exercises

Pre-processing and Preparing Data

  • Steps in data preparation
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling and feature subset selection
  • Dimensionality reduction
  • Discretisation and binning
  • Exercises and case study

Exploratory Data Analysis Methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualisation – preliminary steps
  • Visualising single variables
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and case study

Data Visualisations

  • Basic visualisations in R
  • Data visualisation packages: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating Future Values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods: ridge regression and the lasso
  • Generalisations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalised additive models
  • Regression with RHadoop
  • Exercises and case study

Classification

  • Classification-related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbours
  • Decision tree algorithms
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and case study

Assessing Model Performance and Selection

  • Bias, variance and model complexity
  • Accuracy versus interpretability
  • Evaluating classifiers
  • Measures of model and algorithm performance
  • Hold-out validation method
  • Cross-validation
  • Tuning machine learning algorithms with the caret package
  • Visualising model performance using ROC and lift curves

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and case study

Support Vector Machines for Classification and Regression

  • Maximal margin classifiers
  • Support vector classifiers
  • Support vector machines
  • SVMs for classification problems
  • SVMs for regression problems
  • Exercises and case study

Identifying Unknown Groupings within a Dataset

  • Feature selection for clustering
  • Representative-based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic algorithms: expectation-maximisation (EM)
  • Density-based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and case study

Discovering Connections with Link Analysis

  • Link analysis concepts
  • Metrics for analysing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search (HITS)
  • Link prediction
  • Exercises and case study

Association Pattern Mining

  • Frequent pattern mining model
  • Scalability issues in frequent pattern mining
  • Brute force algorithms
  • Apriori algorithm
  • The FP-growth approach
  • Evaluation of candidate rules
  • Applications of association rules
  • Validation and testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and case study

Constructing Recommendation Engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems using the recommenderlab package
  • Evaluating recommender systems
  • Recommendations with RHadoop
  • Exercise: Building a recommendation engine

Text Analysis

  • Steps in text analysis
  • Collecting raw text
  • Bag of words
  • Term Frequency – Inverse Document Frequency (TF-IDF)
  • Determining sentiments
  • Exercises and case study
35 Hours
