Big Data Analysis for Network Resiliency – NSTAC – John S. Eberhardt III – 8 Dec 2015
Big Data Analytics for Network Resiliency
John S. Eberhardt III
Adjunct Professor - Volgenau School of Engineering, George Mason University
Partner - 3E Services, LLC
8 December 2015
Professional Background
• Adjunct Professor at the Volgenau School, George Mason University
• Partner and Founder, 3E Services (data consulting)
• Founder and former Chief Scientist at Decision Q Corp (machine learning)
• 48 publications and conference presentations
Disclaimer: This presentation represents the personal opinions of Mr. Eberhardt, based upon his professional experience. It should not be viewed as a complete overview of the sector, and does not represent the institutional views of either George Mason University or 3E Services, LLC.
Biography
John is a Data Scientist with nearly 20 years of experience in the Analytical Sector. John has led the development of multiple advanced analytical products and methods, managing teams of scientists and engineers to rapidly create customer-centered analytical solutions.
With one patent and five patent applications in process and over 35 publications, John is a thought leader in advanced analytics with experience in machine learning, statistical algorithms, and user interface design for decision support in Security, Healthcare, Financial Services, Life Sciences, and Consumer Products.
John has developed over 20 analytical solutions in clinical decision support, cyber security, molecular diagnostics, risk management, and product marketing, including award-winning healthcare quality applications. John has applied his expertise with the Department of Defense, Altamira, Roche, Genentech, Novartis, Walter Reed Army Medical Center, Memorial Sloan Kettering, the University of Wisconsin, University of Mississippi, and Thomas Jefferson University, among others. John has a BA cum laude from Duke University in Economics and History.
Understanding the NSTAC Scoping Report
• Explore how private sector data sets and infrastructural resources can be utilized in support of its national security and emergency preparedness (NS/EP) activities
• Create policies to:
  • Identify current and emerging big data sets within the public and private sectors
  • Select and/or develop models that will further encourage information sharing
  • Access big data sets to support NS/EP capabilities, when appropriate
Key Questions I Will Focus On
• How is data being created and collected?
• What data is identified and made available for analysis?
• How do we get the data to work with the analytics?
• How do the analytic outputs support the mission?
Key definition of a term I will be using in this briefing:
An Ontology is a system of knowledge.
(My apologies if you already know this)
Big Data Process
• NIST Draft Special Publication 1500-6 describes a big data analytics value chain
  • Collection, Preparation/Curation, Analysis, Visualization, and Access
• This briefing is focused on the issues of Preparation/Curation, Analysis, and Access
  • How they relate to the NSTAC scoping questions of data collection, access, analysis, and use
Network Behavior and Resiliency
Key Aspects of Understanding Network Behavior and Resiliency
• Highly polymorphic, subject to emergent behavior
  • Emergent Behavior defined as: “the arising of novel and coherent structures, patterns and properties during the process of self-organization in complex systems” (1)
  • e.g., Program Trading in the 1987 stock crash (2)
    • Facilitated by the DOT program of the NYSE
    • Systems behaved rationally locally but irrationally globally
    • Data was coming too quickly for humans: “One notable problem was the difficulty gathering information in the rapidly changing and chaotic environment.” (2)
• This is an extremely challenging data collection and analysis problem
Network Behavior and Resiliency – Cont.
Key Aspects of Understanding Network Behavior and Resiliency
• Data structures and formats in networking data are semantically inconsistent
  • This makes curating data for use in analytics extraordinarily labor intensive; and
  • Handicaps the ability to use computers to detect emergent patterns
  • What does “RFC” mean?
• While there is a great deal of established syntax, networking is continuously evolving
  • This makes creating knowledge structures difficult: biology changes, but slowly; the internet changes suddenly
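To make the curation burden concrete, here is a minimal Python sketch of what reconciling two semantically inconsistent flow records can involve. Everything in it is an assumption for illustration: the two "tools", their field names and units, and the common field names are hypothetical and are not drawn from any real product or standard.

```python
# Hypothetical example: two tools describe the same network flow with
# different field names and units. None of these names are from a real product.
RECORD_A = {"src": "10.0.0.5", "dst": "10.0.0.9", "bytes": 2048, "dur_ms": 1500}
RECORD_B = {"source_ip": "10.0.0.5", "dest_ip": "10.0.0.9",
            "octets": 2048, "duration_s": 1.5}

# Hand-written mapping: source field -> (common field, unit/type converter).
# Someone has to author and maintain one of these per data source.
FIELD_MAPS = {
    "tool_a": {
        "src": ("source_address", str),
        "dst": ("destination_address", str),
        "bytes": ("byte_count", int),
        "dur_ms": ("duration_seconds", lambda ms: ms / 1000.0),
    },
    "tool_b": {
        "source_ip": ("source_address", str),
        "dest_ip": ("destination_address", str),
        "octets": ("byte_count", int),
        "duration_s": ("duration_seconds", float),
    },
}


def normalize(record, source):
    """Translate one record into the common (hypothetical) vocabulary."""
    mapping = FIELD_MAPS[source]
    unknown = set(record) - set(mapping)
    if unknown:
        # A new or renamed field means the mapping must be revised by hand.
        raise KeyError(f"unmapped fields from {source}: {sorted(unknown)}")
    return {
        common: convert(record[field])
        for field, (common, convert) in mapping.items()
        if field in record
    }


if __name__ == "__main__":
    # Once normalized, the two records can be compared directly.
    print(normalize(RECORD_A, "tool_a") == normalize(RECORD_B, "tool_b"))
```

The point of the sketch is the mapping table, not the function: each new source, and each change to an existing one, requires another hand-built entry, which is where the labor goes.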
Bottom Line Up Front
We don’t have a technology problem. We have an understanding problem.
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
Antoine de Saint-Exupéry
Where are we today? Cyber Response
• Forensic, backward-looking, and heavily human-dependent, with a need for high technical competency
• Examples (Technology and Commercial Response)
  • Antivirus/Anti-Malware: Symantec, Intel/McAfee, Kaspersky, Microsoft, F5, Barracuda, Palo Alto Networks
    • Signature and forensic based
  • Threat Intelligence: Norse, FireEye, Barracuda, Palo Alto Networks
    • Focused on specific threats and activity and requires user subscription
    • Doesn’t provide countermeasure
  • Analytics: Palo Alto Networks, Splunk
    • A good start, but still very rudimentary
Where are we today? Big Data in Cyber
• Big data technologies are sufficient to address this problem
  • The proprietary and open source tools for data collection, storage/retrieval, and analysis are more than adequate in their own right; the challenge is access to data and semantic structuring
• Competing, limited standards
  • Message formats focused on sharing threat intel rather than providing a knowledge/semantic structure for raw data analysis
    • e.g., TAXII, STIX, CybOX
    • Useful for sharing threat intel but do not provide the knowledge structure needed to create common analysis of raw network traffic
  • Structure for moving raw analytical data limited (e.g., Cisco NetFlow)
    • Very high level
Where are we today? Big Data in Cyber Cont.
• Currently Human-to-Human Oriented Exchange
  • Cyber Threat Intelligence Integration Center still being stood up
  • DHS has a number of collaboratives and working groups. This is a great first step, but to my knowledge no one is working toward a common semantics (ontology)
• We need a common ontology to anchor analysis and research: a language of cyber that allows us to compare results objectively
• Technology is advanced, not accessible (cost and sophistication limit it to government and big corporations)
  • Cloud may help by moving basic information technology services to shared providers that have the scale to protect them (e.g., AWS)
  • However, individual devices cannot be protected this way; medical devices and IPv6/IoT are great examples of a continuing attack surface
  • Configuring and using current tools, especially in network security, requires an extraordinarily high level of technical and subject matter expertise
Gaps in the Current Architecture
• Availability of Raw Data for Research and Development
• Exchange Standards with Semantic Structure to make research results objectively comparable
• Accessible Tools to implement findings beyond large organizations
Lack of Raw Data
• data.gov
  • 188,420 data sets
  • Number with actual IP traffic data: 2 (3)
• “PCAP, PCAP everywhere, but not a drop to drink” (Borrowed from Samuel Taylor Coleridge)
• Other Data Sets?
  • NETRESEC: private company, data from CTF exercises
  • DHS PREDICT: useful, more current, but relatively narrow and focused on static problems
    • Only one intrusion detection data set from 2005-2010 (U Wisconsin, log data only) with attack events, not raw data
    • SKAION simulation
    • C-State and Merit Network: flow data
  • AFRL DARPA Intrusion Detection Data Set: 1998!
• Here’s the key: almost all of it is simulated!
Exchange Standards with Semantic Structure
• NetFlow is inadequate
• Threat Intel interchange standards do not support raw data analytical structures
• Current standards are like disease coding in healthcare
  • But healthcare also has extensive languages (SNOMED, LOINC) for describing the components of the systems underlying the diagnosis, to support systems research
  • In network security, we have developed standards for the diagnosis, but not to describe the system (RFCs provide syntax, but no ontology)
    • This means that structuring real data for analysis is extremely time consuming and challenging
    • This also means that the terms of reference for different analysis projects can be radically different, making comparison of research conclusions extremely challenging
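As a rough illustration of what a shared semantic layer would buy, the Python sketch below gives two research projects a common set of concepts with is-a relationships, so that findings recorded in each project's own vocabulary can be counted under the same heading. All concept names, project names, and labels are hypothetical; this is not SNOMED, LOINC, or any existing cyber ontology, only a toy stand-in for one.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
IS_A = {
    "port_scan": "reconnaissance",
    "host_sweep": "reconnaissance",
    "reconnaissance": "network_event",
    "data_exfiltration": "network_event",
}

# Each project's local vocabulary mapped onto the shared concepts.
# Project names and labels are invented for illustration.
LOCAL_TERMS = {
    "project_x": {"SCAN-TCP": "port_scan",
                  "PING-SWEEP": "host_sweep",
                  "EXFIL": "data_exfiltration"},
    "project_y": {"probe": "port_scan",
                  "sweep": "host_sweep",
                  "leak": "data_exfiltration"},
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def count_under(project, labels, concept):
    """Count a project's local labels that fall under a shared concept."""
    terms = LOCAL_TERMS[project]
    return sum(concept in ancestors(terms[label]) for label in labels)


if __name__ == "__main__":
    # Different local vocabularies, one comparable question:
    # how many observed events were reconnaissance?
    print(count_under("project_x", ["SCAN-TCP", "PING-SWEEP", "EXFIL"],
                      "reconnaissance"))
    print(count_under("project_y", ["probe", "probe", "leak"],
                      "reconnaissance"))
```

Without the shared `IS_A` layer, the two projects' labels have no defined relationship to each other, which is the comparability problem described above.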
Accessible Tools
Tools are either too rudimentary, or require a high level of technical sophistication. We need to enable humans to detect patterns, and take action, sooner.
Recommendations
• Facilitate making data sets available through creating the conditions for safe, trusted sharing on a broad basis
  • This may mean letting a few shady characters under the tent!
• Facilitate data exchange standards so the data is meaningful; form an ontology working group like SNOMED
  • Make the results of data sharing comparable
• Facilitate the development and delivery of tools that are accessible
  • Support research in basic methodologies for making data analysis more broadly accessible
Thank You!
Thank you for your time today!
If you want to reach me: [email protected]
“And so, in the end, the only thing that fails to conform to our wishes is reality”
Vaclav Havel
References
1. Goldstein, Jeffrey, Emergence: Complexity and Organization 1 (1): 49–72, doi:10.1207/s15327000em0101_4
2. Mark Carlson, Board of Governors of the Federal Reserve, November 2006
3. Based on “OR” keyword search of data.gov using the following terms: PCAP, PCAP Data, PCAP Files, IP Traffic, IP Network Traffic, Internet traffic, Netflow, raw IP network data, internet protocol data, network log files, netflow data