Course Outline

Each session is 2 hours

Day-1: Session 1: Business Overview of Why Big Data Business Intelligence is Essential for Government

  • Case studies from the NIH and Department of Energy (DoE)
  • Adoption rates of Big Data in government agencies and how they are aligning future operations around Big Data predictive analytics
  • Broad-scale application areas within the Department of Defense (DoD), NSA, IRS, USDA, and others
  • Integrating Big Data with legacy data systems
  • Basic understanding of enabling technologies in predictive analytics
  • Data integration and dashboard visualisation
  • Fraud management
  • Business rule generation for fraud detection
  • Threat detection and profiling
  • Cost-benefit analysis for Big Data implementation

Day-1: Session 2: Introduction to Big Data - Part 1

  • Main characteristics of Big Data: volume, variety, velocity, and veracity. MPP architecture for managing volume.
  • Data warehouses – static schemas and slowly evolving datasets
  • MPP databases such as Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
  • Hadoop-based solutions – no constraints on dataset structure.
  • Typical patterns: HDFS, MapReduce (processing), and retrieval from HDFS
  • Batch processing – suited for analytical and non-interactive tasks
  • Velocity handling: streaming data and CEP (complex event processing)
  • Typical choices – CEP products (e.g., Infostreams, Apama, MarkLogic)
  • Less production-ready options – Storm/S4
  • NoSQL databases – (columnar and key-value): Best suited as analytical adjuncts to data warehouses/databases

Day-1: Session 3: Introduction to Big Data - Part 2

NoSQL solutions

  • KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MotionDb, DovetailDB
  • KV Store (Hierarchical) - GT.M, Caché
  • KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache - Memcached, Repcached, Coherence, Infinispan, eXtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store - Gigaspaces, Coord, Apache River
  • Object Database - ZopeDB, db4o, Shoal
  • Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
  • Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning Issues in Big Data

  • RDBMS – static structure/schema, does not promote an agile, exploratory environment.
  • NoSQL – semi-structured, offering enough structure to store data without requiring an exact schema beforehand.
  • Data cleaning issues

Day-1: Session 4: Introduction to Big Data - Part 3: Hadoop

  • When to select Hadoop?
  • STRUCTURED data – Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration)
  • SEMI-STRUCTURED data – difficult to handle with traditional solutions (DW/DB)
  • Warehousing data = HUGE effort and remains static even after implementation
  • For handling variety and volume of data on commodity hardware – Hadoop
  • Commodity hardware is all that is needed to build a Hadoop cluster

Introduction to MapReduce/HDFS

  • MapReduce – distributed computing across multiple servers
  • HDFS – makes data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
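
The MapReduce pattern described above can be sketched in plain Python — a toy in-memory word count that mimics the map, shuffle, and reduce phases (no actual Hadoop cluster or Java involved):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analytics", "data pipeline"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

On a real cluster the shuffle is distributed across nodes and the data lives in HDFS, but the developer-facing contract is the same: write a mapper and a reducer, and the framework handles the grouping in between.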

Day-2: Session 1: Big Data Ecosystem - Building Big Data ETL: The Universe of Big Data Tools - Which One to Use and When?

  • Hadoop versus other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) on top of Hadoop
  • Random access to data but with restrictions (max 1 PB)
  • Not suitable for ad-hoc analytics; good for logging, counting, and time-series analysis
  • Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g., log data) into HDFS

Day-2: Session 2: Big Data Management System

  • Moving parts, compute nodes starting/failing: ZooKeeper - For configuration, coordination, and naming services
  • Complex pipelines/workflows: Oozie – manage workflows, dependencies, and daisy chains
  • Deployment, configuration, cluster management, upgrades, etc. (sys admin): Ambari
  • In Cloud: Whirr

Day-2: Session 3: Predictive Analytics in Business Intelligence - Part 1: Fundamental Techniques & Machine Learning-based BI:

  • Introduction to Machine Learning
  • Learning classification techniques
  • Bayesian Prediction - preparing training files
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Network
  • Big Data large variable problem - Random Forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytic tool - Treeminer
  • Agile learning
  • Agent-based learning
  • Distributed learning
  • Introduction to open-source tools for predictive analytics: R, RapidMiner, Mahout
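
Of the classification techniques listed above, k-nearest neighbours (KNN) is the easiest to sketch. Below is a plain-vanilla KNN vote in pure Python on made-up 2-D data — not the p-Tree/vertical-mining variant mentioned above:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort by Euclidean distance to the query, then majority-vote the k nearest.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes (labels are illustrative)
train = [((0, 0), "low"), ((1, 0), "low"), ((0, 1), "low"),
         ((5, 5), "high"), ((6, 5), "high"), ((5, 6), "high")]
print(knn_predict(train, (0.5, 0.5)))  # low
print(knn_predict(train, (5.5, 5.5)))  # high
```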

Day-2: Session 4: Predictive Analytics Ecosystem - Part 2: Common Predictive Analytic Problems in Government

  • Insight analytics
  • Visualisation analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation Engine
  • Pattern detection
  • Rule/Scenario discovery – failure, fraud, optimisation
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text Analytics
  • Technology-assisted review
  • Fraud analytics
  • Real-Time Analytics

Day-3: Session 1: Real-Time and Scalable Analytics Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama - Bulk Synchronous Parallel (BSP) distributed computing
  • Apache Spark - cluster computing for real-time analytics
  • GraphLab (from CMU) - graph-based asynchronous approach to distributed computing
  • KNN p-Tree Algebra-based approach from Treeminer for reduced hardware cost of operation

Day-3: Session 2: Tools for eDiscovery and Forensics

  • eDiscovery over Big Data versus Legacy data – a comparison of cost and performance
  • Predictive coding and technology-assisted review (TAR)
  • Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
  • Faster indexing through HDFS – velocity of data
  • NLP or Natural Language Processing – various techniques and open-source products
  • eDiscovery in foreign languages - technology for foreign language processing

Day-3: Session 3: Big Data BI for Cyber Security – Understanding the Full 360-Degree View from Speedy Data Collection to Threat Identification

  • Understanding the basics of security analytics - attack surface, security misconfiguration, host defences
  • Network infrastructure/Large data pipe / Response ETL for real-time analytics
  • Prescriptive versus predictive – Fixed rule-based versus auto-discovery of threat rules from metadata

Day-3: Session 4: Big Data in the USDA: Applications in Agriculture

  • Introduction to IoT (Internet of Things) for agriculture - sensor-based Big Data and control
  • Introduction to satellite imaging and its application in agriculture
  • Integrating sensor and image data for soil fertility, cultivation recommendations, and forecasting
  • Agriculture insurance and Big Data
  • Crop loss forecasting

Day-4: Session 1: Fraud Prevention BI from Big Data in Government - Fraud Analytics:

  • Basic classification of Fraud analytics - rule-based versus predictive analytics
  • Supervised versus unsupervised Machine Learning for fraud pattern detection
  • Vendor fraud/overcharging for projects
  • Medicare and Medicaid fraud - fraud detection techniques for claim processing
  • Travel reimbursement frauds
  • IRS refund frauds
  • Case studies and live demos will be provided wherever data is available.
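
The rule-based side of the rule-based-versus-predictive distinction above can be illustrated with a minimal sketch; the claim fields and thresholds below are hypothetical placeholders, not real screening rules:

```python
# Hypothetical claim records; field names and values are illustrative only
claims = [
    {"id": 1, "amount": 120.0, "provider_daily_claims": 4},
    {"id": 2, "amount": 9800.0, "provider_daily_claims": 3},
    {"id": 3, "amount": 150.0, "provider_daily_claims": 60},
]

# Each rule is a (name, predicate) pair; a predictive system would instead
# learn such boundaries from labelled historical claims
RULES = [
    ("large_amount", lambda c: c["amount"] > 5000),
    ("provider_volume_spike", lambda c: c["provider_daily_claims"] > 50),
]

def flag(claim):
    # Return the names of every rule this claim trips
    return [name for name, rule in RULES if rule(claim)]

for c in claims:
    hits = flag(c)
    if hits:
        print(c["id"], hits)  # claims 2 and 3 are flagged
```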

Day-4: Session 2: Social Media Analytics - Intelligence Gathering and Analysis

  • Big Data ETL API for extracting social media data
  • Text, image, metadata, and video
  • Sentiment analysis from social media feeds
  • Contextual and non-contextual filtering of social media feeds
  • Social Media Dashboard to integrate diverse social media sources
  • Automated profiling of social media profiles
  • Live demo of each analytics capability will be provided through the Treeminer Tool.
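
As a minimal illustration of sentiment analysis on social media text, here is a naive lexicon-based scorer; the word lists are tiny illustrative samples, not a real sentiment lexicon, and production systems use far richer models:

```python
# Illustrative word lists only — a real lexicon has thousands of scored entries
POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}

def sentiment(text):
    # Score = positive hits minus negative hits over whitespace tokens
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great service"))    # positive
print(sentiment("terrible response, very poor")) # negative
```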

Day-4: Session 3: Big Data Analytics in Image Processing and Video Feeds

  • Image storage techniques in Big Data - Storage solutions for data exceeding petabytes
  • LTFS and LTO
  • GPFS-LTFS (Layered storage solution for large image data)
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Day-4: Session 4: Big Data Applications in the NIH:

  • Emerging areas of Bioinformatics
  • Meta-genomics and Big Data mining issues
  • Big Data predictive analytics for Pharmacogenomics, Metabolomics, and Proteomics
  • Big Data in downstream Genomics processes
  • Application of Big Data predictive analytics in Public Health

Big Data Dashboards for Quick Accessibility of Diverse Data and Display:

  • Integration of existing application platforms with Big Data Dashboards
  • Big Data management
  • Case Study of Big Data Dashboards: Tableau and Pentaho
  • Using Big Data apps to push location-based services in government
  • Tracking systems and management

Day-5: Session 1: How to Justify Big Data BI Implementation Within an Organisation:

  • Defining ROI for Big Data implementation
  • Case studies on saving analyst time for data collection and preparation – increased productivity gains
  • Case studies of revenue gains from saving licensed database costs
  • Revenue gains from location-based services
  • Savings from fraud prevention
  • An integrated spreadsheet approach to calculate approximate expenses versus revenue gains/savings from Big Data implementation.
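
The spreadsheet approach described above amounts to a simple cost-versus-gain comparison, sketched here with made-up figures (every number below is a placeholder, not real cost data):

```python
# All figures are hypothetical placeholders for illustration
annual_costs = {
    "hardware": 200_000,
    "staff": 300_000,
    "support": 50_000,
}
annual_gains = {
    "license_savings": 250_000,     # retired licensed-database costs
    "analyst_time_saved": 180_000,  # productivity from faster data prep
    "fraud_prevented": 400_000,
}

total_cost = sum(annual_costs.values())
total_gain = sum(annual_gains.values())
roi_pct = 100 * (total_gain - total_cost) / total_cost
print(f"cost={total_cost} gain={total_gain} ROI={roi_pct:.1f}%")
```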

Day-5: Session 2: Step-by-Step Procedure to Replace Legacy Data Systems with Big Data Systems:

  • Understanding the practical Big Data Migration Roadmap
  • What important information is needed before architecting a Big Data implementation
  • Different ways of calculating the volume, velocity, variety, and veracity of data
  • How to estimate data growth
  • Case studies
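
One common way to estimate data growth, as mentioned above, is a compound-growth projection; the starting volume and growth rate below are hypothetical:

```python
def projected_volume(current_tb, annual_growth_rate, years):
    # Simple compound-growth projection; a fuller estimate would also
    # model velocity (ingest rate) and variety (new sources coming online)
    return current_tb * (1 + annual_growth_rate) ** years

# Hypothetical: 100 TB today, growing 40% per year
for year in range(1, 4):
    print(year, round(projected_volume(100, 0.40, year), 1))
# 1 140.0
# 2 196.0
# 3 274.4
```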

Day-5: Session 4: Review of Big Data Vendors and Their Products. Q&A Session:

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • MU Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Requirements

  • Basic knowledge of business operations and data systems within the government sector in your specific domain
  • Basic understanding of SQL/Oracle or relational databases
  • Basic understanding of Statistics (at spreadsheet level)

Duration: 35 hours
