Course Outline
Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis
- Case studies from law enforcement: Predictive policing
- Big Data adoption rates within law enforcement agencies and how they are aligning future operations around Big Data predictive analytics
- Emerging technology solutions such as gunshot sensors, surveillance video, and social media
- Using Big Data technology to mitigate information overload
- Interfacing Big Data with legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data integration and dashboard visualisation
- Fraud management
- Business rules and fraud detection
- Threat detection and profiling
- Cost-benefit analysis for Big Data implementation
Introduction to Big Data
- Main characteristics of Big Data: Volume, Variety, Velocity, and Veracity.
- MPP (Massively Parallel Processing) architecture
- Data warehouses: static schema, slowly evolving datasets
- MPP databases: Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
- Hadoop-based solutions: no conditions on dataset structure.
- Typical pattern: HDFS, MapReduce (crunch), retrieve from HDFS
- Apache Spark for stream processing
- Batch: suited for analytical/non-interactive tasks
- Velocity: CEP (Complex Event Processing) for streaming data
- Typical choices: CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.)
- Less production-ready: Storm/S4
- NoSQL databases (columnar and key-value): best suited as an analytical adjunct to data warehouses/databases
NoSQL solutions
- KV Store: Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
- KV Store: Dynamo, Voldemort, Dynomite, SubRecord, MotionDb, DovetailDB
- KV Store (Hierarchical): GT.M, Caché
- KV Store (Ordered): TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
- KV Cache: Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBossCache, Velocity, Terracotta
- Tuple Store: Gigaspaces, Coord, Apache River
- Object Database: ZopeDB, db4o, Shoal
- Document Store: CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
- Wide Columnar Store: BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to data cleaning issues in Big Data
- RDBMS: static structure/schema, does not promote an agile, exploratory environment.
- NoSQL: semi-structured, sufficient structure to store data without an exact schema prior to storage.
- Data cleaning issues
Hadoop
- When to select Hadoop?
- STRUCTURED: Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration).
- SEMI-STRUCTURED data: difficult to handle using traditional solutions (DW/DB).
- Warehousing data requires HUGE effort and remains static even after implementation.
- For variety and volume of data, processed on commodity hardware: Hadoop.
- Commodity hardware required to create a Hadoop cluster.
Introduction to MapReduce/HDFS
- MapReduce: distributed computing across multiple servers.
- HDFS: makes data available locally for the computing process (with redundancy).
- Data: can be unstructured/schema-less (unlike RDBMS).
- Developer responsibility to interpret data.
- Programming MapReduce: working with Java (pros/cons), manually loading data into HDFS.
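The MapReduce pattern listed above can be illustrated without a cluster. The following is a minimal single-process sketch in Python, assuming a word-count task; the function names and the sample records are hypothetical, and a real Hadoop job would distribute the map and reduce steps across nodes and read its input from HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over all records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical records standing in for lines read from HDFS blocks.
records = [
    "burglary reported downtown",
    "vehicle theft downtown",
    "burglary suspect",
]
counts = word_count(records)
print(counts)
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both steps can run in parallel on separate servers; only the shuffle requires moving data between them.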
Day 02
Big Data Ecosystem: Building Big Data ETL (Extract, Transform, Load) – Which Big Data tools to use and when?
- Hadoop versus other NoSQL solutions
- For interactive, random access to data.
- HBase (column-oriented database) on top of Hadoop.
- Random access to data but with restrictions (max 1 PB).
- Not ideal for ad-hoc analytics; suitable for logging, counting, and time-series analysis.
- Sqoop: import from databases to Hive or HDFS (JDBC/ODBC access).
- Flume: stream data (e.g., log data) into HDFS.
Big Data Management System
- Moving parts, compute nodes start/fail: ZooKeeper for configuration, coordination, and naming services.
- Complex pipeline/workflow: Oozie to manage workflows, dependencies, and daisy chains.
- Deploy, configure, cluster management, upgrades, etc. (sys admin): Ambari.
- In Cloud: Whirr.
Predictive Analytics: Fundamental Techniques and Machine Learning-based Business Intelligence
- Introduction to machine learning.
- Learning classification techniques.
- Bayesian prediction: preparing a training file.
- Support Vector Machine.
- KNN p-Tree algebra and vertical mining.
- Neural networks.
- Big Data large variable problem: Random Forest (RF).
- Big Data automation problem: Multi-model ensemble RF.
- Automation through Soft10-M.
- Text analytic tool: Treeminer.
- Agile learning.
- Agent-based learning.
- Distributed learning.
- Introduction to open-source tools for predictive analytics: R, Python, Rapidminer, Mahout.
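As a small illustration of the classification techniques listed above, here is a minimal sketch of plain k-nearest-neighbour (kNN) classification in Python. It is not the p-Tree variant named in the outline, and the feature values and risk labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query point.
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training file: (feature vector, label) pairs.
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 8.5), "high"),
    ((8.5, 9.5), "high"),
]

print(knn_predict(training, (8.2, 8.8)))
print(knn_predict(training, (1.2, 1.4)))
```

Every prediction scans the full training set, which is one reason a naive kNN does not scale to Big Data volumes without indexing or distribution.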
Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis
- Technology and the investigative process.
- Insight analytics.
- Visualisation analytics.
- Structured predictive analytics.
- Unstructured predictive analytics.
- Threat/fraudster/vendor profiling.
- Recommendation engine.
- Pattern detection.
- Rule/scenario discovery: failure, fraud, optimisation.
- Root cause discovery.
- Sentiment analysis.
- CRM analytics.
- Network analytics.
- Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
- Technology-assisted review.
- Fraud analytics.
- Real-time analytics.
Day 03
Real-Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS.
- Apache Hama: for Bulk Synchronous distributed computing.
- Apache Spark: for cluster computing and real-time analytics.
- CMU GraphLab: graph-based asynchronous approach to distributed computing.
- KNN p-Tree algebra: approach from Treeminer for reduced hardware cost of operation.
Tools for eDiscovery and Forensics
- eDiscovery over Big Data versus legacy data: a comparison of cost and performance.
- Predictive coding and Technology-Assisted Review (TAR).
- Live demo of vMiner to demonstrate how TAR enables faster discovery.
- Faster indexing through HDFS: velocity of data.
- NLP (Natural Language Processing): open-source products and techniques.
- eDiscovery in foreign languages: technology for foreign language processing.
Big Data BI for Cyber Security: Getting a 360-degree view, speedy data collection, and threat identification
- Understanding the basics of security analytics: attack surface, security misconfiguration, host defences.
- Network infrastructure / Large datapipe / Response ETL for real-time analytics.
- Prescriptive versus predictive: fixed rule-based versus auto-discovery of threat rules from metadata.
Gathering disparate data for Criminal Intelligence Analysis
- Using IoT (Internet of Things) as sensors for capturing data.
- Using satellite imagery for domestic surveillance.
- Using surveillance and image data for criminal identification.
- Other data gathering technologies: drones, body cameras, GPS tagging systems, and thermal imaging technology.
- Combining automated data retrieval with data obtained from informants, interrogation, and research.
- Forecasting criminal activity.
Day 04
Fraud Prevention BI from Big Data in Fraud Analytics
- Basic classification of fraud analytics: rules-based versus predictive analytics.
- Supervised versus unsupervised machine learning for fraud pattern detection.
- Business-to-business fraud, medical claims fraud, insurance fraud, tax evasion, and money laundering.
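The difference between a rules-based check and an unsupervised one can be sketched in a few lines of Python. This is an illustrative example under stated assumptions: the claim amounts are invented, the fixed limit is arbitrary, and the unsupervised method shown (a modified z-score based on the median absolute deviation) is only one of many possible techniques.

```python
from statistics import median


def rule_based_flags(amounts, limit):
    """Rules-based: flag any amount above a fixed, analyst-chosen limit."""
    return [a > limit for a in amounts]


def mad_outlier_flags(amounts, threshold=3.5):
    """Unsupervised: flag amounts far from the median, measured in
    median absolute deviations (the modified z-score)."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        # All values are (nearly) identical; nothing stands out.
        return [False] * len(amounts)
    return [0.6745 * abs(a - med) / mad > threshold for a in amounts]


# Hypothetical claim amounts; the last one is unusually large.
claims = [120, 135, 110, 140, 125, 130, 900]

print(rule_based_flags(claims, limit=1000))
print(mad_outlier_flags(claims))
```

With these figures the fixed limit of 1000 misses the 900 claim, while the data-driven check flags it because it lies far outside the typical range. A supervised approach would instead need labelled examples of past fraud to train on.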
Social Media Analytics: Intelligence gathering and analysis
- How social media is used by criminals to organise, recruit, and plan.
- Big Data ETL API for extracting social media data.
- Text, image, metadata, and video.
- Sentiment analysis from social media feeds.
- Contextual and non-contextual filtering of social media feeds.
- Social media dashboard to integrate diverse social media sources.
- Automated profiling of social media profiles.
- Live demo of each analytic will be provided through the Treeminer tool.
Big Data analytics in image processing and video feeds
- Image storage techniques in Big Data: storage solutions for data exceeding petabytes.
- LTFS (Linear Tape File System) and LTO (Linear Tape Open).
- GPFS-LTFS (General Parallel File System - Linear Tape File System): layered storage solution for large image data.
- Fundamentals of image analytics.
- Object recognition.
- Image segmentation.
- Motion tracking.
- 3-D image reconstruction.
Biometrics, DNA, and Next Generation Identification Programs
- Beyond fingerprinting and facial recognition.
- Speech recognition, keystroke analysis (analysing a user's typing pattern), and CODIS (Combined DNA Index System).
- Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples.
Big Data dashboard for quick accessibility of diverse data and display:
- Integration of existing application platforms with Big Data dashboards.
- Big Data management.
- Case study of Big Data dashboards: Tableau and Pentaho.
- Using Big Data apps to push location-based services in government.
- Tracking systems and management.
Day 05
How to justify Big Data BI implementation within an organisation:
- Defining the ROI (Return on Investment) for implementing Big Data.
- Case studies for saving analyst time in data collection and preparation: increasing productivity.
- Cost savings from lower database licensing costs.
- Revenue gain from location-based services.
- Cost savings from fraud prevention.
- An integrated spreadsheet approach for calculating approximate expenses versus revenue gain/savings from Big Data implementation.
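The spreadsheet-style calculation mentioned above can be approximated in code. The following Python sketch assumes a simple one-year ROI defined as net benefit divided by total cost; every figure and category name is hypothetical and would be replaced by an agency's own estimates.

```python
def roi(costs, benefits):
    """Return (net benefit, ROI ratio) from itemised costs and benefits."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    net = total_benefit - total_cost
    return net, net / total_cost


# Hypothetical annual figures, in the same currency unit.
costs = {
    "cluster_hardware_and_hosting": 250_000,
    "staff_and_training": 150_000,
}
benefits = {
    "analyst_time_saved": 150_000,
    "database_licensing_savings": 100_000,
    "fraud_losses_prevented": 250_000,
}

net, ratio = roi(costs, benefits)
print(f"Net benefit: {net}, ROI: {ratio:.0%}")
```

Keeping costs and benefits as itemised dictionaries mirrors the rows of a spreadsheet, so individual assumptions can be challenged or updated without changing the calculation.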
Step-by-step procedure for replacing a legacy data system with a Big Data system
- Big Data migration roadmap.
- What critical information is needed before architecting a Big Data system?
- What are the different ways for calculating the Volume, Velocity, Variety, and Veracity of data?
- How to estimate data growth.
- Case studies.
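One common way to approach the data growth estimate listed above is a compound-growth projection. The Python sketch below assumes a constant annual growth rate, which is a simplification; the starting volume and the rate are hypothetical.

```python
def project_volume(current_tb, annual_growth_rate, years):
    """Project stored volume (in TB) for each year, assuming compound growth."""
    return [current_tb * (1 + annual_growth_rate) ** y for y in range(years + 1)]


# Hypothetical: 100 TB today, growing 50% per year, projected over 3 years.
projection = project_volume(100, 0.5, 3)
for year, volume in enumerate(projection):
    print(f"Year {year}: {volume:.1f} TB")
```

The projection feeds directly into cluster sizing: the final-year figure, multiplied by the storage replication factor, gives a rough lower bound on the raw disk capacity required.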
Review of Big Data vendors and their products.
- Accenture
- APTEAN (formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (formerly 10Gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (part of EMC)
Q&A session
Requirements
- Knowledge of law enforcement processes and data systems
- Basic understanding of SQL/Oracle or relational databases
- Basic understanding of statistics (at spreadsheet level)
Audience
- Law enforcement specialists with a technical background