Course Outline
Introduction to Data Science for Big Data Analytics
- Overview of Data Science
- Overview of Big Data
- Data Structures
- Drivers and complexities of Big Data
- The Big Data ecosystem and a new approach to analytics
- Key technologies in Big Data
- Data Mining process and challenges
- Association Pattern Mining
- Data Clustering
- Outlier Detection
- Data Classification
Introduction to the Data Analytics Lifecycle
- Discovery
- Data preparation
- Model planning
- Model building
- Presentation and communication of results
- Operationalisation
- Exercise: Case study
From this point, approximately 80% of training time will be dedicated to practical examples and exercises using R and related big data technologies.
Getting Started with R
- Installing R and RStudio
- Key features of the R language
- Objects in R
- Working with data in R
- Data manipulation
- Big data challenges
- Exercises
Getting Started with Hadoop
- Installing Hadoop
- Understanding Hadoop operating modes
- HDFS (Hadoop Distributed File System)
- MapReduce architecture
- Overview of Hadoop-related projects
- Writing programs in Hadoop MapReduce
- Exercises
Integrating R and Hadoop with RHadoop
- Components of RHadoop
- Installing RHadoop and connecting to Hadoop
- The RHadoop architecture
- Hadoop streaming with R
- Solving data analytics problems with RHadoop
- Exercises
Pre-processing and Preparing Data
- Steps in data preparation
- Feature extraction
- Data cleaning
- Data integration and transformation
- Data reduction – sampling and feature subset selection
- Dimensionality reduction
- Discretisation and binning
- Exercises and case study
Exploratory Data Analysis Methods in R
- Descriptive statistics
- Exploratory data analysis
- Visualisation – preliminary steps
- Visualising single variables
- Examining multiple variables
- Statistical methods for evaluation
- Hypothesis testing
- Exercises and case study
Data Visualisations
- Basic visualisations in R
- Data visualisation packages: ggplot2, lattice, plotly
- Formatting plots in R
- Advanced graphs
- Exercises
Regression (Estimating Future Values)
- Linear regression
- Use cases
- Model description
- Diagnostics
- Problems with linear regression
- Shrinkage methods: ridge regression and the lasso
- Generalisations and nonlinearity
- Regression splines
- Local polynomial regression
- Generalised additive models
- Regression with RHadoop
- Exercises and case study
Classification
- Classification-related problems
- Bayesian refresher
- Naïve Bayes
- Logistic regression
- K-nearest neighbours
- Decision trees
- Neural networks
- Support vector machines
- Diagnostics of classifiers
- Comparison of classification methods
- Scalable classification algorithms
- Exercises and case study
Assessing Model Performance and Selection
- Bias, variance and model complexity
- Accuracy versus interpretability
- Evaluating classifiers
- Measures of model and algorithm performance
- Hold-out validation method
- Cross-validation
- Tuning machine learning algorithms with the caret package
- Visualising model performance using ROC and lift curves
Ensemble Methods
- Bagging
- Random Forests
- Boosting
- Gradient boosting
- Exercises and case study
Support Vector Machines for Classification and Regression
- Maximal margin classifiers
- Support vector classifiers
- Support vector machines
- SVMs for classification problems
- SVMs for regression problems
- Exercises and case study
Identifying Unknown Groupings within a Dataset
- Feature selection for clustering
- Representative-based algorithms: k-means, k-medoids
- Hierarchical algorithms: agglomerative and divisive methods
- Probabilistic algorithms: expectation-maximisation (EM)
- Density-based algorithms: DBSCAN, DENCLUE
- Cluster validation
- Advanced clustering concepts
- Clustering with RHadoop
- Exercises and case study
Discovering Connections with Link Analysis
- Link analysis concepts
- Metrics for analysing networks
- The PageRank algorithm
- Hyperlink-Induced Topic Search (HITS)
- Link prediction
- Exercises and case study
Association Pattern Mining
- Frequent pattern mining model
- Scalability issues in frequent pattern mining
- Brute force algorithms
- Apriori algorithm
- The FP-growth approach
- Evaluation of candidate rules
- Applications of association rules
- Validation and testing
- Diagnostics
- Association rules with R and Hadoop
- Exercises and case study
Constructing Recommendation Engines
- Understanding recommender systems
- Data mining techniques used in recommender systems
- Recommender systems using the recommenderlab package
- Evaluating recommender systems
- Recommendations with RHadoop
- Exercise: Building a recommendation engine
Text Analysis
- Steps in text analysis
- Collecting raw text
- Bag of words
- Term Frequency – Inverse Document Frequency (TF-IDF)
- Determining sentiments
- Exercises and case study
Duration: 35 hours
Testimonials
Intensity, Training materials and expertise, Clarity, Excellent communication with Alessandra
Marija Hornis Dmitrovic - Marija Hornis
Course - Data Science for Big Data Analytics
The example and training material were sufficient and made it easy to understand what you are doing.