Course Outline
Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis
- Case studies from law enforcement: Predictive policing
- Big Data adoption rates within law enforcement agencies and how they are aligning future operations around Big Data predictive analytics
- Emerging technology solutions such as gunshot sensors, surveillance video, and social media
- Using Big Data technology to mitigate information overload
- Interfacing Big Data with legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data integration and dashboard visualisation
- Fraud management
- Business rules and fraud detection
- Threat detection and profiling
- Cost-benefit analysis for Big Data implementation
Introduction to Big Data
- Main characteristics of Big Data: Volume, Variety, Velocity, and Veracity.
- MPP (Massively Parallel Processing) architecture
- Data warehouses: static schema, slowly evolving datasets
- MPP databases: Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
- Hadoop-based solutions: no conditions on dataset structure.
- Typical pattern: load into HDFS, process with MapReduce (crunch), retrieve results from HDFS
- Apache Spark for stream processing
- Batch: suited for analytical/non-interactive tasks
- Velocity: CEP (Complex Event Processing) streaming data
- Typical choices: CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.)
- Less production-ready: Storm/S4
- NoSQL databases (columnar and key-value): best suited as an analytical adjunct to data warehouses/databases
NoSQL solutions
- KV Store: Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
- KV Store: Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
- KV Store (Hierarchical): GT.M, Caché
- KV Store (Ordered): TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
- KV Cache: Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBossCache, Velocity, Terracotta
- Tuple Store: Gigaspaces, Coord, Apache River
- Object Database: ZopeDB, db4o, Shoal
- Document Store: CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
- Wide Columnar Store: BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to data cleaning issues in Big Data
- RDBMS: static structure/schema, does not promote an agile, exploratory environment.
- NoSQL: semi-structured, sufficient structure to store data without an exact schema prior to storage.
- Data cleaning issues
Hadoop
- When to select Hadoop?
- STRUCTURED: Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration).
- SEMI-STRUCTURED data: difficult to handle using traditional solutions (DW/DB).
- Warehousing data requires HUGE effort and remains static even after implementation.
- For variety and volume of data, processed on commodity hardware: HADOOP.
- Commodity hardware required to create a Hadoop cluster.
Introduction to MapReduce/HDFS
- MapReduce: distributed computing across multiple servers.
- HDFS: makes data available locally for the computing process (with redundancy).
- Data: can be unstructured/schema-less (unlike RDBMS).
- Developer responsibility to interpret data.
- Programming MapReduce: working with Java (pros/cons), manually loading data into HDFS.
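The map, shuffle, and reduce steps named above can be illustrated without a cluster. The following is a minimal single-process sketch in Python rather than the Java-on-Hadoop setup the course covers; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample records are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(records):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input; on a real cluster each record would be read from an HDFS block.
records = [
    "burglary reported downtown",
    "vehicle theft reported",
    "burglary reported uptown",
]
counts = word_count(records)
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves data across the network; the sketch only shows the data flow.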
Day 02
Big Data Ecosystem: Building Big Data ETL (Extract, Transform, Load) – Which Big Data tools to use and when?
- Hadoop versus other NoSQL solutions
- For interactive, random access to data.
- HBase (column-oriented database) on top of Hadoop.
- Random access to data but with restrictions (max 1 PB).
- Not ideal for ad-hoc analytics; suitable for logging, counting, and time-series analysis.
- Sqoop: import from databases to Hive or HDFS (JDBC/ODBC access).
- Flume: stream data (e.g., log data) into HDFS.
Big Data Management System
- Many moving parts, with compute nodes starting and failing: ZooKeeper for configuration, coordination, and naming services.
- Complex pipeline/workflow: Oozie to manage workflows, dependencies, and daisy chains.
- Deploy, configure, cluster management, upgrades, etc. (sys admin): Ambari.
- In Cloud: Whirr.
Predictive Analytics: Fundamental Techniques and Machine Learning-based Business Intelligence
- Introduction to machine learning.
- Learning classification techniques.
- Bayesian prediction: preparing a training file.
- Support Vector Machine.
- KNN p-Tree algebra and vertical mining.
- Neural networks.
- Big Data large variable problem: Random Forest (RF).
- Big Data automation problem: Multi-model ensemble RF.
- Automation through Soft10-M.
- Text analytic tool: Treeminer.
- Agile learning.
- Agent-based learning.
- Distributed learning.
- Introduction to open-source tools for predictive analytics: R, Python, Rapidminer, Mahout.
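As a small illustration of the classification techniques listed above, the sketch below implements plain k-nearest-neighbours (KNN) voting in standard-library Python. It is the textbook algorithm, not the p-Tree variant named in the outline, and the feature vectors, labels, and the `knn_predict` helper are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training items by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Made-up two-feature training set, for illustration only.
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.9), "normal"), ((0.8, 1.1), "normal"),
    ((5.0, 5.2), "suspicious"), ((5.1, 4.9), "suspicious"), ((4.8, 5.0), "suspicious"),
]
label = knn_predict(train, (4.9, 5.1))
```

Because every query is compared against every training point, this brute-force form scales poorly, which is one motivation for the distributed and vertical-mining approaches the course lists.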
Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis
- Technology and the investigative process.
- Insight analytics.
- Visualisation analytics.
- Structured predictive analytics.
- Unstructured predictive analytics.
- Threat/fraudster/vendor profiling.
- Recommendation engine.
- Pattern detection.
- Rule/scenario discovery: failure, fraud, optimisation.
- Root cause discovery.
- Sentiment analysis.
- CRM analytics.
- Network analytics.
- Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
- Technology-assisted review.
- Fraud analytics.
- Real-time analytics.
Day 03
Real-Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS.
- Apache Hama: for Bulk Synchronous distributed computing.
- Apache Spark: for cluster computing and real-time analytics.
- CMU GraphLab: graph-based asynchronous approach to distributed computing.
- KNN p-Tree: algebra-based approach from Treeminer for reduced hardware cost of operation.
Tools for eDiscovery and Forensics
- eDiscovery over Big Data versus legacy data: a comparison of cost and performance.
- Predictive coding and Technology-Assisted Review (TAR).
- Live demo of vMiner to demonstrate how TAR enables faster discovery.
- Faster indexing through HDFS: velocity of data.
- NLP (Natural Language Processing): open-source products and techniques.
- eDiscovery in foreign languages: technology for foreign language processing.
Big Data BI for Cyber Security: Getting a 360-degree view, speedy data collection, and threat identification
- Understanding the basics of security analytics: attack surface, security misconfiguration, host defences.
- Network infrastructure / Large datapipe / Response ETL for real-time analytics.
- Prescriptive versus predictive: fixed rule-based versus auto-discovery of threat rules from metadata.
Gathering disparate data for Criminal Intelligence Analysis
- Using IoT (Internet of Things) as sensors for capturing data.
- Using satellite imagery for domestic surveillance.
- Using surveillance and image data for criminal identification.
- Other data gathering technologies: drones, body cameras, GPS tagging systems, and thermal imaging technology.
- Combining automated data retrieval with data obtained from informants, interrogation, and research.
- Forecasting criminal activity.
Day 04
Fraud Prevention BI from Big Data in Fraud Analytics
- Basic classification of fraud analytics: rules-based versus predictive analytics.
- Supervised versus unsupervised machine learning for fraud pattern detection.
- Business-to-business fraud, medical claims fraud, insurance fraud, tax evasion, and money laundering.
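The contrast between rules-based and unsupervised detection can be sketched in a few lines of Python. The claim amounts, the fixed limit, and the z-score threshold below are made-up values chosen for illustration, and a z-score test is only one simple stand-in for the unsupervised methods the course covers.

```python
from statistics import mean, stdev

# Hypothetical claim amounts; the values are invented for illustration only.
claims = [120.0, 135.0, 110.0, 128.0, 5000.0, 131.0, 117.0]

def rule_based_flags(amounts, limit=1000.0):
    """Rules-based: flag any amount above a fixed, analyst-chosen limit."""
    return [a > limit for a in amounts]

def zscore_flags(amounts, threshold=2.0):
    """Unsupervised: flag amounts far from the sample mean, using no labels."""
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        # All amounts identical: nothing stands out.
        return [False] * len(amounts)
    return [abs(a - mu) / sigma > threshold for a in amounts]

rule_flags = rule_based_flags(claims)
stat_flags = zscore_flags(claims)
```

The rule needs someone to pick the limit in advance, while the statistical test derives its cut-off from the data itself; a supervised model would instead learn from claims already labelled as fraudulent or legitimate.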
Social Media Analytics: Intelligence gathering and analysis
- How social media is used by criminals to organise, recruit, and plan.
- Big Data ETL API for extracting social media data.
- Text, image, metadata, and video.
- Sentiment analysis from social media feeds.
- Contextual and non-contextual filtering of social media feeds.
- Social media dashboard to integrate diverse social media sources.
- Automated profiling of social media profiles.
- Live demo of each analytic will be provided through the Treeminer tool.
Big Data analytics in image processing and video feeds
- Image storage techniques in Big Data: storage solutions for data exceeding petabytes.
- LTFS (Linear Tape File System) and LTO (Linear Tape Open).
- GPFS-LTFS (General Parallel File System - Linear Tape File System): layered storage solution for large image data.
- Fundamentals of image analytics.
- Object recognition.
- Image segmentation.
- Motion tracking.
- 3-D image reconstruction.
Biometrics, DNA, and Next Generation Identification Programs
- Beyond fingerprinting and facial recognition.
- Speech recognition, keystroke analysis (analysing a user's typing pattern), and CODIS (Combined DNA Index System).
- Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples.
Big Data dashboard for quick accessibility of diverse data and display
- Integration of existing application platforms with Big Data dashboards.
- Big Data management.
- Case study of Big Data dashboards: Tableau and Pentaho.
- Using Big Data apps to push location-based services in government.
- Tracking systems and management.
Day 05
How to justify Big Data BI implementation within an organisation
- Defining the ROI (Return on Investment) for implementing Big Data.
- Case studies for saving analyst time in data collection and preparation: increasing productivity.
- Cost savings from lower database licensing costs.
- Revenue gain from location-based services.
- Cost savings from fraud prevention.
- An integrated spreadsheet approach for calculating approximate expenses versus revenue gain/savings from Big Data implementation.
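The spreadsheet-style comparison of expenses against gains and savings reduces to simple arithmetic. A minimal sketch, assuming ROI is defined as (total gains minus total costs) divided by total costs; every category name and figure below is a hypothetical placeholder.

```python
def simple_roi(gains, costs):
    """Return ROI as (total gains - total costs) / total costs."""
    total_gain = sum(gains.values())
    total_cost = sum(costs.values())
    return (total_gain - total_cost) / total_cost

# All figures are hypothetical placeholders, in one currency over one period.
costs = {"hardware": 200_000, "licences": 50_000, "staff_training": 50_000}
gains = {
    "analyst_time_saved": 180_000,
    "licence_savings": 120_000,
    "fraud_prevented": 150_000,
}
roi = simple_roi(gains, costs)  # 0.5, i.e. a 50% return on these made-up numbers
```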
Step-by-step procedure for replacing a legacy data system with a Big Data system
- Big Data migration roadmap.
- What critical information is needed before architecting a Big Data system?
- What are the different ways for calculating the Volume, Velocity, Variety, and Veracity of data?
- How to estimate data growth.
- Case studies.
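One common way to estimate data growth is a compound-growth projection. The sketch below assumes a constant annual growth rate, which is a simplification; the starting volume, the rate, and the `projected_volume_tb` helper are hypothetical.

```python
def projected_volume_tb(current_tb, annual_growth_rate, years):
    """Compound growth: V_n = V_0 * (1 + r) ** n."""
    return current_tb * (1 + annual_growth_rate) ** years

# Hypothetical inputs: 100 TB today, growing 40% per year, projected over 3 years.
projection = [round(projected_volume_tb(100, 0.40, y), 1) for y in range(4)]
```

In practice the rate would be fitted from historical storage measurements, and separate projections per data source often give a better picture than a single aggregate figure.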
Review of Big Data vendors and their products
- Accenture
- APTEAN (formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (formerly 10Gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (part of EMC)
Q/A session
Requirements
- Knowledge of law enforcement processes and data systems
- Basic understanding of SQL/Oracle or relational databases
- Basic understanding of statistics (at spreadsheet level)
Audience
- Law enforcement specialists with a technical background