Data mining paper survey for Health Care Support System

Electrical Engineering, 4th year: 王鴻鈞

Cloud Data Mining in Health Care System

2014-12-30


Summary

- Typical content of a generic electronic health record (EHR) system

- How data mining on health data fills knowledge gaps and assists informed clinical decision making

- How the integration of EHR and genetic data with systems biology approaches facilitates genotype–phenotype association studies


Classification of Electronic Health Record (EHR) Data

• Administrative data

- Data that serve administrative purposes

• Ancillary clinical data

- Provided by laboratories, pharmacies, and radiological and medical imaging

(Another ancillary source of potentially structured data is genotype and sequence data.)

• Clinical text

- Written or dictated clinical narratives


Why Integrate Health Care Data

• Correlating Clinical Features

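- Some distinct diseases share the same symptoms or tend to occur together (co-occurrence).

- Linking common observations to important diseases can reveal new signs of disease.

• Prediction from Data

- In some cases, previously discovered correlations or other known facts can be used to build a clinical decision model that gives doctors a reference for predicting a patient's condition.

• Patient Stratification

- Group patients into clusters; patients in the same group usually show similar symptoms.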
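The co-occurrence and stratification ideas above can be sketched in a few lines. This is only an illustration under stated assumptions: the patient records are made-up toy data, and Jaccard similarity is one possible way to compare patients, not a method named in the surveyed papers.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: patient ID -> set of recorded diagnoses.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension", "obesity"},
    "p3": {"asthma"},
    "p4": {"diabetes", "obesity"},
}

def cooccurrence_counts(records):
    """Count how many patients carry each unordered pair of diagnoses."""
    counts = Counter()
    for diagnoses in records.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(diagnoses), 2):
            counts[pair] += 1
    return counts

def jaccard(a, b):
    """Similarity of two diagnosis sets: shared items over all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

pair_counts = cooccurrence_counts(patients)
similarity = jaccard(patients["p1"], patients["p2"])
```

Pairs with high counts are candidates for a closer comorbidity analysis, and patients with high pairwise similarity are candidates for the same stratum.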


Electronic health record content

The electronic health record (EHR) of a patient can be viewed as a repository of information regarding his or her health status in a computer-readable form. An encounter with the health-care system generates various types of patient-linked data.


Four ways to analyze EHR data

1) Comorbidity 2) Machine Learning 3) Patient Clustering 4) Cohort Querying


Comorbidity


Patient clustering


Classification by Machine Learning


Cohort querying


Dealing with Clinical Text

Using natural language processing (NLP)

1. Sentence boundary detection splits the text into individual sentences.

2. Tokenization splits the text into individual tokens (typically individual words), using spaces and punctuation as a guide, with rules for handling special cases such as dates.

3. Normalization reduces tokens to a base form.

4. Part-of-speech tagging assigns each token a tag that identifies its grammatical category in context.

5. Chunking identifies syntactic units, most importantly noun phrases (NPs), which are grammatical units built from a noun with optional modifiers such as adjectives.

6. NPs and various lexical permutations are then mapped to controlled vocabularies.
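A minimal sketch of steps 1, 2, 3 and 6 using only the Python standard library. Steps 4 and 5 (part-of-speech tagging and noun-phrase chunking) are skipped here and replaced by a plain n-gram lookup, which is a simplification. The vocabulary, the concept IDs, and the lemma table are hypothetical placeholders, not real codes.

```python
import re

# Hypothetical controlled vocabulary: normalized phrase -> placeholder concept ID.
VOCABULARY = {
    "chest pain": "C001",
    "diabetes mellitus": "C002",
    "fever": "C003",
}

# Hypothetical lemma table used for normalization (step 3).
LEMMAS = {"pains": "pain", "fevers": "fever"}

def split_sentences(text):
    """Step 1: naive boundary detection on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: keep dates such as 2014-12-30 whole, otherwise take word runs."""
    return re.findall(r"\d{4}-\d{2}-\d{2}|\w+", sentence)

def normalize(token):
    """Step 3: lowercase, then map known inflected forms to a base form."""
    token = token.lower()
    return LEMMAS.get(token, token)

def map_to_vocabulary(tokens, max_len=3):
    """Step 6: look up every n-gram (longest first) in the vocabulary."""
    found = []
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in VOCABULARY:
                found.append((phrase, VOCABULARY[phrase]))
    return found

note = "Patient reports chest pains since 2014-12-30. History of diabetes mellitus."
for sentence in split_sentences(note):
    tokens = [normalize(t) for t in tokenize(sentence)]
    print(tokens, map_to_vocabulary(tokens))
```

A production pipeline would typically rely on an NLP toolkit for tagging and chunking so that only noun phrases, rather than all n-grams, are looked up.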


How the System Is Actually Implemented

Take the health care system “GEMINI” as an example

The GEMINI system consists of two components:

1. The PROFILING component extracts data for each patient from various sources and stores it as information in a patient profile graph.

2. The ANALYTICS component analyzes the patient profile graphs to infer implicit information and extract relevant features for the prediction tasks.

(Whole view of GEMINI)


Input of GEMINI

- Clinical Data

The repository has multiple sources of patient data: 1) structured sources containing patients’ demographics, lab test results, medication history, etc., and 2) unstructured data sources storing free-text doctor’s notes.

- Medical Knowledge Base

GEMINI uses a well-known medical knowledge base, UMLS, to interpret unstructured doctor’s notes, i.e., to identify medical concepts (e.g., diabetes mellitus) and relationships between concepts (e.g., HbA1c measures control of diabetes mellitus).


How to do “Patient Profiling”

- This component uses NLP engines to extract named entities, called mentions. It then applies collective inference to simultaneously map mentions to their semantically matched concepts in the knowledge base and discover additional relationships.

- To improve the accuracy of this process, the component asks doctors to verify or corroborate the mention-concept mappings and concept relationships it has identified.
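A rough sketch of the mapping-plus-verification loop. Plain string similarity (`difflib.SequenceMatcher`) stands in for GEMINI's collective inference, which the slides do not detail. The concept IDs, the threshold, and the function names are assumptions made for illustration and are not UMLS identifiers or GEMINI APIs.

```python
from difflib import SequenceMatcher

# Hypothetical miniature knowledge base: placeholder concept ID -> preferred name.
CONCEPTS = {
    "K1": "diabetes mellitus",
    "K2": "hemoglobin a1c",
    "K3": "hypertension",
}

def best_concept(mention):
    """Return (concept_id, similarity) for the closest concept name."""
    scored = [
        (cid, SequenceMatcher(None, mention.lower(), name).ratio())
        for cid, name in CONCEPTS.items()
    ]
    return max(scored, key=lambda item: item[1])

def build_profile(patient_id, mentions, threshold=0.8):
    """Map mentions to concepts; queue uncertain mappings for a doctor."""
    graph = {patient_id: set()}  # adjacency sets: patient -> concept IDs
    review_queue = []            # mappings a doctor should verify
    for mention in mentions:
        cid, score = best_concept(mention)
        if score >= threshold:
            graph[patient_id].add(cid)
        else:
            review_queue.append((mention, cid, round(score, 2)))
    return graph, review_queue

graph, queue = build_profile(
    "patient-1", ["Diabetes Mellitus", "hypertention", "HbA1c"]
)
```

Confident matches go straight into the profile graph, while the abbreviation "HbA1c" scores too low against its full name and lands in the review queue, which mirrors the role the slides give to doctor verification.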


How to do “Healthcare Analytics”

The ANALYTICS component of GEMINI consists of three major steps:

1) Feature Selection

- All features contained in the patient profile graphs can be used for the analytics tasks. With expert input from healthcare professionals, ANALYTICS can also derive implicit but important features.

2) Training Data Labelling

- Leverage doctors’ input to label a small number of patients with the most informative data (i.e., patient profile graphs) to derive a training set

- The goal is a diverse set of labeled patients that covers as much of the data space as possible

- Avoid overwhelming the doctors with too much information

3) Analytics Algorithms

- Use conventional analytics algorithms, such as classification, clustering and prediction, to perform the various analytics tasks

- May include expert rules/heuristics for the analytics tasks (e.g., majority voting)
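Two of the ideas above lend themselves to a short sketch: picking a small but diverse set of patients for doctors to label, and combining predictions by majority voting. Greedy farthest-first selection is one possible way to spread the labeled set over the data space; the slides do not say which selection strategy GEMINI uses, and the feature vectors below are made up.

```python
from collections import Counter
import math

def farthest_first(points, k):
    """Greedily pick up to k indices that spread over the data space."""
    if not points:
        return []
    chosen = [0]  # start from the first patient; an arbitrary choice
    while len(chosen) < min(k, len(points)):
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Pick the point whose nearest already-chosen point is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(nxt)
    return chosen

def majority_vote(predictions):
    """Return the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical 2-D feature vectors, one per patient.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
to_label = farthest_first(features, 3)

# Hypothetical outputs of three classifiers for one patient.
label = majority_vote(["high risk", "low risk", "high risk"])
```

Asking doctors to label only the selected patients keeps their workload small while still covering the distinct regions of the feature space.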


How to Implement the “Supporting Platform”

Using “epiC”

GEMINI uses a flexible parallel processing framework (epiC) to support:

1) Distributed data storage that effectively partitions clinical data and stores it across multiple nodes.

2) Scalable NLP processing and data analytics that involve various computation models, such as the MapReduce model for entity extraction, the Pregel model for graphical inference, deep learning for analytics, etc.
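The partitioning and MapReduce ideas can be illustrated in plain Python. This does not use the epiC API, which the slides do not show. The function names, the hash-based partitioning scheme, and the sample notes are assumptions for illustration only, and everything runs in a single process.

```python
from collections import defaultdict
import zlib

def partition(records, num_nodes):
    """Assign each (patient_id, note) record to a node by hashing the ID."""
    nodes = defaultdict(list)
    for patient_id, note in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        node = zlib.crc32(patient_id.encode("utf-8")) % num_nodes
        nodes[node].append((patient_id, note))
    return nodes

def map_phase(record, terms):
    """Map: emit (term, 1) for every known term found in one note."""
    _, note = record
    text = note.lower()
    return [(term, 1) for term in terms if term in text]

def reduce_phase(pairs):
    """Reduce: sum the counts per term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

# Hypothetical sample notes and entity terms.
records = [
    ("p1", "Diabetes mellitus, well controlled."),
    ("p2", "Hypertension and diabetes mellitus."),
    ("p3", "No fever today."),
]
terms = ["diabetes mellitus", "hypertension", "fever"]

nodes = partition(records, num_nodes=2)
# Each node would run the map phase on its own partition in parallel.
emitted = [
    pair
    for part in nodes.values()
    for rec in part
    for pair in map_phase(rec, terms)
]
totals = reduce_phase(emitted)
```

Hashing the patient ID keeps all records of one patient on the same node, so per-patient extraction needs no cross-node communication before the reduce step.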


Linking to the molecular level

- Integrating genetics

- Systems biology and gene-network-based decision support


Take a Genome Analysis Startup as an Example


Connecting to the Database

Using “ANNOVAR” as the Annotation Engine


Workflows in electronic health record-driven genomic research


Limiting factors — key problems to overcome

- Privacy, autonomy and consent

- Interoperability across institutions, countries and continents


References

• GEMINI: An Integrative Healthcare Analytics System

• epiC: an Extensible and Scalable System for Processing Big Data

• Semantics Driven Approach for Knowledge Acquisition From EMRs

• Mining electronic health record toward better research applications and clinical care

• Using electronic health records to drive discovery in disease genomics

• Contextual Crowd Intelligence

• Opportunities for genomic clinical decision support interventions

• The role of primary care in early detection and follow-up of cancer
