Course Outline

Day 01

Overview of Big Data Business Intelligence for Criminal Intelligence Analysis

  • Case studies from law enforcement: Predictive policing
  • Big Data adoption rates within law enforcement agencies and how they are aligning future operations around Big Data predictive analytics
  • Emerging technology solutions such as gunshot sensors, surveillance video, and social media
  • Using Big Data technology to mitigate information overload
  • Interfacing Big Data with legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data integration and dashboard visualisation
  • Fraud management
  • Business rules and fraud detection
  • Threat detection and profiling
  • Cost-benefit analysis for Big Data implementation

Introduction to Big Data

  • Main characteristics of Big Data: Volume, Variety, Velocity, and Veracity.
  • MPP (Massively Parallel Processing) architecture
  • Data warehouses: static schema, slowly evolving datasets
  • MPP databases: Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
  • Hadoop-based solutions: no conditions on dataset structure.
  • Typical pattern: load data into HDFS, crunch it with MapReduce, retrieve results from HDFS
  • Apache Spark for stream processing
  • Batch: suited for analytical/non-interactive tasks
  • Velocity (streaming): CEP (Complex Event Processing) of streaming data
  • Typical choices: CEP products (e.g., Infostreams, Apama, MarkLogic, etc.)
  • Less production-ready: Storm/S4
  • NoSQL databases: (columnar and key-value): best suited as an analytical adjunct to data warehouses/databases
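
The CEP bullet above can be illustrated with a minimal sliding-window rule in plain Python. This is a toy sketch with invented event names and thresholds; real CEP products such as Apama express such rules declaratively over live streams.

```python
from collections import deque

def detect_burst(events, window=5, threshold=3):
    """Flag timestamps where `threshold` or more 'alert' events
    fall within the last `window` seconds (a toy CEP-style rule)."""
    recent = deque()            # timestamps of recent 'alert' events
    hits = []
    for ts, kind in events:     # events assumed sorted by timestamp
        if kind != "alert":
            continue
        recent.append(ts)
        while recent and ts - recent[0] > window:
            recent.popleft()    # drop events outside the window
        if len(recent) >= threshold:
            hits.append(ts)
    return hits

stream = [(1, "alert"), (2, "ok"), (3, "alert"), (4, "alert"), (20, "alert")]
print(detect_burst(stream))   # → [4]: three alerts within 5 seconds at t=4
```

The same rule over an unbounded stream is what distinguishes CEP engines from batch analytics: the window state is kept in memory and evaluated per event.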

NoSQL solutions

  • KV Store: Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store: Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
  • KV Store (Hierarchical): GT.M, InterSystems Caché
  • KV Store (Ordered): Tokyo Tyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache: Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBoss Cache, Velocity, Terracotta
  • Tuple Store: Gigaspaces, Coord, Apache River
  • Object Database: ZopeDB, db4o, Shoal
  • Document Store: CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML databases, ThruDB, CloudKit, Persevere, Riak (Basho), Scalaris
  • Wide Columnar Store: BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to data cleaning issues in Big Data

  • RDBMS: static structure/schema, does not promote an agile, exploratory environment.
  • NoSQL: semi-structured, sufficient structure to store data without an exact schema prior to storage.
  • Data cleaning issues
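
One recurring cleaning issue in semi-structured data is inconsistent field formats across sources. A minimal sketch, assuming hypothetical date formats, of normalising dates before loading:

```python
from datetime import datetime

# Candidate formats are assumptions for illustration; a real pipeline
# would derive them from the actual source systems.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def normalise_date(raw):
    """Try each known format; return ISO 8601, or None if unparseable."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass
    return None

records = ["2023-04-01", "01/04/2023", "bad-value"]
print([normalise_date(r) for r in records])
# → ['2023-04-01', '2023-04-01', None]
```

Records that normalise to None would be routed to a quarantine set for manual review rather than silently dropped.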

Hadoop

  • When to select Hadoop?
  • STRUCTURED: Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration).
  • SEMI-STRUCTURED data: difficult to handle using traditional solutions (DW/DB).
  • Warehousing data requires HUGE effort and remains static even after implementation.
  • For variety and volume of data, processed on commodity hardware: HADOOP.
  • Commodity hardware is sufficient to build a Hadoop cluster.

Introduction to MapReduce/HDFS

  • MapReduce: distributed computing across multiple servers.
  • HDFS: makes data available locally for the computing process (with redundancy).
  • Data: can be unstructured/schema-less (unlike RDBMS).
  • Developer responsibility to interpret data.
  • Programming MapReduce: working with Java (pros/cons), manually loading data into HDFS.
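
The map/shuffle/reduce pattern above can be sketched in plain Python. This only mimics the phases in memory; a real job would run each phase as distributed tasks over HDFS blocks.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Mapper: emit (word, 1) pairs for each word in a line."""
    return [(w.lower(), 1) for w in line.split()]

def reduce_phase(pairs):
    """Reducer (after the shuffle): group by key and sum the counts."""
    pairs.sort(key=itemgetter(0))   # the shuffle sorts by key
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big", "data tools"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))
# → {'big': 2, 'data': 2, 'tools': 1}
```

The developer-responsibility bullet applies here: nothing in HDFS enforces that a "line" splits into words; the mapper imposes that interpretation.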

Day 02

Big Data Ecosystem: Building Big Data ETL (Extract, Transform, Load) – Which Big Data tools to use and when?

  • Hadoop versus other NoSQL solutions
  • For interactive, random access to data.
  • HBase (column-oriented database) on top of Hadoop.
  • Random access to data but with restrictions (max 1 PB).
  • Not ideal for ad-hoc analytics; suitable for logging, counting, and time-series analysis.
  • Sqoop: import from databases to Hive or HDFS (JDBC/ODBC access).
  • Flume: stream data (e.g., log data) into HDFS.

Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper for configuration, coordination, and naming services.
  • Complex pipeline/workflow: Oozie to manage workflows, dependencies, and daisy chains.
  • Deploy, configure, cluster management, upgrades, etc. (sys admin): Ambari.
  • In Cloud: Whirr.

Predictive Analytics: Fundamental Techniques and Machine Learning-based Business Intelligence

  • Introduction to machine learning.
  • Learning classification techniques.
  • Bayesian prediction: preparing a training file.
  • Support Vector Machine.
  • KNN p-Tree algebra and vertical mining.
  • Neural networks.
  • Big Data large variable problem: Random Forest (RF).
  • Big Data automation problem: Multi-model ensemble RF.
  • Automation through Soft10-M.
  • Text analytic tool: Treeminer.
  • Agile learning.
  • Agent-based learning.
  • Distributed learning.
  • Introduction to open-source tools for predictive analytics: R, Python, Rapidminer, Mahout.
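
One classification technique from the list, k-nearest neighbours, can be sketched in a few lines of plain Python (the two-feature training data is invented for illustration; in practice R, Python libraries, Rapidminer, or Mahout would be used):

```python
from collections import Counter
import math

def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among its k nearest
    training examples. `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "benign"), ((1, 0), "benign"),
         ((5, 5), "fraud"), ((6, 5), "fraud"), ((5, 6), "fraud")]
print(knn_predict(train, (5, 5.5)))   # → 'fraud'
```

The p-Tree variant mentioned above replaces this brute-force distance scan with vertical bit-column operations, which is what makes it tractable at Big Data scale.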

Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis

  • Technology and the investigative process.
  • Insight analytics.
  • Visualisation analytics.
  • Structured predictive analytics.
  • Unstructured predictive analytics.
  • Threat/fraudster/vendor profiling.
  • Recommendation engine.
  • Pattern detection.
  • Rule/scenario discovery: failure, fraud, optimisation.
  • Root cause discovery.
  • Sentiment analysis.
  • CRM analytics.
  • Network analytics.
  • Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
  • Technology-assisted review.
  • Fraud analytics.
  • Real-time analytics.
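
The sentiment-analysis bullet can be illustrated with a minimal lexicon-based scorer (the word lists are toy assumptions; production systems use trained models over much larger lexicons):

```python
# Toy lexicons, invented for illustration.
POSITIVE = {"safe", "calm", "helpful"}
NEGATIVE = {"threat", "attack", "angry"}

def sentiment_score(text):
    """Crude lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(sentiment_score("angry crowd threat reported, response was helpful"))
# → -1  (two negative hits, one positive)
```

Even this crude score, aggregated over a social media feed, gives the time-series signal that the dashboards later in the course visualise.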

Day 03

Real-Time and Scalable Analytics Over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS.
  • Apache Hama: Bulk Synchronous Parallel (BSP) distributed computing.
  • Apache Spark: for cluster computing and real-time analytics.
  • GraphLab (from CMU): graph-based asynchronous approach to distributed computing.
  • KNN p-Tree: algebra-based approach from Treeminer for reduced hardware cost of operation.

Tools for eDiscovery and Forensics

  • eDiscovery over Big Data versus legacy data: a comparison of cost and performance.
  • Predictive coding and Technology-Assisted Review (TAR).
  • Live demo of vMiner to demonstrate how TAR enables faster discovery.
  • Faster indexing through HDFS: velocity of data.
  • NLP (Natural Language Processing): open-source products and techniques.
  • eDiscovery in foreign languages: technology for foreign language processing.

Big Data BI for Cyber Security: Getting a 360-degree view, speedy data collection, and threat identification

  • Understanding the basics of security analytics: attack surface, security misconfiguration, host defences.
  • Network infrastructure / Large datapipe / Response ETL for real-time analytics.
  • Prescriptive versus predictive: fixed rule-based versus auto-discovery of threat rules from metadata.
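
The prescriptive side of that last bullet is just hand-written predicates over event fields. A minimal sketch (field names and thresholds are hypothetical):

```python
# Fixed (prescriptive) rules: hand-written predicates over event fields.
# A predictive system would instead discover such thresholds from metadata.
RULES = [
    ("failed_logins", lambda e: e.get("failed_logins", 0) > 5),
    ("off_hours",     lambda e: e.get("hour", 12) < 5),
]

def match_rules(event):
    """Return the names of all fixed rules the event triggers."""
    return [name for name, predicate in RULES if predicate(event)]

event = {"failed_logins": 8, "hour": 3}
print(match_rules(event))   # → ['failed_logins', 'off_hours']
```

The limitation is visible in the code: an attack pattern not anticipated by a rule author triggers nothing, which is the gap auto-discovery aims to close.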

Gathering disparate data for Criminal Intelligence Analysis

  • Using IoT (Internet of Things) as sensors for capturing data.
  • Using satellite imagery for domestic surveillance.
  • Using surveillance and image data for criminal identification.
  • Other data gathering technologies: drones, body cameras, GPS tagging systems, and thermal imaging technology.
  • Combining automated data retrieval with data obtained from informants, interrogation, and research.
  • Forecasting criminal activity.

Day 04

Fraud Prevention BI from Big Data in Fraud Analytics

  • Basic classification of fraud analytics: rules-based versus predictive analytics.
  • Supervised versus unsupervised machine learning for fraud pattern detection.
  • Business-to-business fraud, medical claims fraud, insurance fraud, tax evasion, and money laundering.
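
The unsupervised side of fraud pattern detection can be sketched as simple outlier scoring on transaction amounts (figures invented; real systems score many features jointly):

```python
import statistics

def flag_outliers(amounts, z_cutoff=2.0):
    """Flag amounts more than z_cutoff standard deviations from the mean."""
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / sd > z_cutoff]

claims = [120, 130, 110, 125, 118, 2000]
print(flag_outliers(claims))   # → [2000]
```

No labelled fraud examples are needed, which is the point of the unsupervised approach; supervised models need historical cases labelled fraudulent or legitimate.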

Social Media Analytics: Intelligence gathering and analysis

  • How social media is used by criminals to organise, recruit, and plan.
  • Big Data ETL API for extracting social media data.
  • Text, image, metadata, and video.
  • Sentiment analysis from social media feeds.
  • Contextual and non-contextual filtering of social media feeds.
  • Social media dashboard to integrate diverse social media sources.
  • Automated profiling of social media profiles.
  • Live demo of each analytic will be provided through the Treeminer tool.

Big Data analytics in image processing and video feeds

  • Image storage techniques in Big Data: storage solutions for data exceeding petabytes.
  • LTFS (Linear Tape File System) and LTO (Linear Tape Open).
  • GPFS-LTFS (General Parallel File System - Linear Tape File System): layered storage solution for large image data.
  • Fundamentals of image analytics.
  • Object recognition.
  • Image segmentation.
  • Motion tracking.
  • 3-D image reconstruction.
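
One image-segmentation fundamental, connected-component labelling, can be sketched over a tiny binary grid in plain Python (production pipelines would use an image library, not hand-rolled code):

```python
def label_components(grid):
    """4-connected component labelling on a binary grid.
    Returns a grid of integer labels (0 = background)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not labels[r][c]:
                next_label += 1
                stack = [(r, c)]            # flood-fill this component
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and grid[y][x] and not labels[y][x]:
                        labels[y][x] = next_label
                        stack += [(y+1, x), (y-1, x), (y, x+1), (y, x-1)]
    return labels

image = [[1, 1, 0],
         [0, 0, 0],
         [0, 1, 1]]
print(label_components(image))
# → [[1, 1, 0], [0, 0, 0], [0, 2, 2]]  (two separate objects)
```

Each labelled region is a candidate object, the starting point for the object recognition and motion tracking topics above.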

Biometrics, DNA, and Next Generation Identification Programs

  • Beyond fingerprinting and facial recognition.
  • Speech recognition, keystroke analysis (analysing a user's typing pattern), and CODIS (Combined DNA Index System).
  • Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples.

Big Data dashboard for quick accessibility of diverse data and display:

  • Integration of existing application platforms with Big Data dashboards.
  • Big Data management.
  • Case study of Big Data dashboards: Tableau and Pentaho.
  • Using Big Data apps to push location-based services in government.
  • Tracking systems and management.

Day 05

How to justify Big Data BI implementation within an organisation:

  • Defining the ROI (Return on Investment) for implementing Big Data.
  • Case studies for saving analyst time in data collection and preparation: increasing productivity.
  • Revenue gain from lower database licensing costs.
  • Revenue gain from location-based services.
  • Cost savings from fraud prevention.
  • An integrated spreadsheet approach for calculating approximate expenses versus revenue gain/savings from Big Data implementation.
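
The spreadsheet approach reduces to a simple ROI formula; all figures below are invented placeholders for the categories listed above:

```python
def big_data_roi(costs, gains):
    """ROI = (total gains - total costs) / total costs."""
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    return (total_gain - total_cost) / total_cost

# Illustrative figures only.
costs = {"hardware": 200_000, "licences": 50_000, "staff": 150_000}
gains = {"fraud_prevention": 300_000, "analyst_time": 180_000,
         "licence_savings": 40_000}
print(f"{big_data_roi(costs, gains):.0%}")   # → 30%
```

The spreadsheet version simply breaks each dictionary into rows so stakeholders can adjust individual line items.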

Step-by-step procedure for replacing a legacy data system with a Big Data system

  • Big Data migration roadmap.
  • What critical information is needed before architecting a Big Data system?
  • What are the different ways for calculating the Volume, Velocity, Variety, and Veracity of data?
  • How to estimate data growth.
  • Case studies.
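
Data growth is commonly estimated with compound annual growth; the rate below is an assumption for illustration:

```python
def project_volume(current_tb, annual_growth, years):
    """Compound annual growth projection of data volume, in TB."""
    return current_tb * (1 + annual_growth) ** years

# 100 TB today, assumed to grow 40% per year, projected over 5 years
print(round(project_volume(100, 0.40, 5), 1))   # → 537.8
```

The projected volume, together with the velocity and variety estimates above, drives cluster sizing in the migration roadmap.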

Review of Big Data vendors and their products.

  • Accenture
  • APTEAN (formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (formerly 10Gen)
  • MU Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (part of EMC)

Q/A session

Requirements

  • Knowledge of law enforcement processes and data systems
  • Basic understanding of SQL/Oracle or relational databases
  • Basic understanding of statistics (at spreadsheet level)

Audience

  • Law enforcement specialists with a technical background

Duration: 35 hours
