蔣以仁
Search Problem
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007
Search Query: Jaguar
Possible senses: Jaguar (Animal), Jaguar (Automobile), Jaguar (Watch), Jaguar (OS)
The purpose of records is to create knowledge for the future
"…records are recognized as agency assets used to underpin current business and legal needs, as well as the basis for a knowledge management system to meet future goals."
– HOWARD P. LOWELL, Director, Modern Records Programs, NARA
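Data mining toward decision support
Consolidate data of the same nature
Mine the data to produce association and dependency rules
Visualize the results to help experts assess topics
Define handling guidelines to make decision support easier to build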
KDD Process
Data Warehouse → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Pattern → Interpretation/Evaluation → Knowledge
BI architecture
Data Sources: Operational DBs, Other sources
Monitor & Integrator: Extract, Transform, Load, Refresh
Server: Complete Data Warehouse, Data Marts, metadata, OLAP Server
Tools: 1. Comprehensive Performance Management 2. Analysis 3. Query 4. Reports 5. Data mining
Business Intelligence
Gaining market intelligence from news feeds
Sreekumar Sukumaran and Ashish Sureka
Signal
Dr. Bhandari said, “I first noticed this when the New York Times did an analysis after the fact showing that early indications of the Ford-Explorer-Firestone-tire problem went undetected in a federal database. Recently, a similar analysis by CNN showed that early indications of security problems at Logan, Dulles, and Newark airports went undetected in a federal database well before the September 11 tragedy. It is clear that the cost of missing these patterns is too high to be ignored.”
Information integration
Unstructured Data (original data)
Call Taker: James  Date: Aug. 30, 2002  Duration: 10 min.  CustomerID: ADC00123
Q: cust sys has stopped working.
A: checked cust bios and it need updated. …
Structured Data (meta data)
[Call Taker] James  [Date] 2002/08/30  [Duration] 10 min.  [CustomerID] ADC00123
[Noun] Customer  [Software] BIOS  [Subj...Verb] customer system..stop  [SW..Problem] BIOS..need
Linguistic Analysis: Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis
Category Dictionary
Synonym Dictionary
Category Item
Mining
Visualization & Interactive Mining
IBM TAKMI (Nasukawa, Nagano, 1999)
Mining target: individual text
Mining unit: > texts > category-labeled items extracted from text using NLP
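What the medical literature tells us
Source of the medical literature: Medline
Causal associations between diseases, symptoms, and drugs or compounds can be discovered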
1. Swanson DR. Searching natural language text by computer. Machine indexing and text searching offer an approach to the basic problems of library automation. Science. 132:1099–1104, 21 Oct. 1960.
2. Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 30(1):7–18, 1986.
3. Swanson, D.R., Complementary structures in disjoint science literatures. In A. Bookstein, et al (Eds.), SIGIR91: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval Chicago, Oct 13-16, 280-289, 1991.
Migraine?
Stress is associated with migraines
Stress can lead to loss of magnesium
Calcium channel blockers prevent some migraines
Magnesium is a natural calcium channel blocker
Spreading cortical depression (SCD) is implicated in some migraines
High levels of magnesium inhibit SCD
Migraine patients have high platelet aggregability
Magnesium can suppress platelet aggregability
Smalheiser, N.R. & Swanson, D.R. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15, 1-9, 1994.
Evidence from the literature
migraine and magnesium, linked through: stress, CCB (calcium channel blockers), PA (platelet aggregability), SCD
All Nutrition Research
All Migraine Research
Finding new leads: fish oils and Raynaud's phenomenon
Intermediate concepts: vasoconstriction, platelet aggregation, blood viscosity
Swanson, D.R. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. Autumn;30(1):7-18, 1986.
Hypothesis generation
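A technology that has to be mentioned: natural language processing (NLP)
Began in 1948 with a dictionary look-up system at Birkbeck College, London
1949: Warren Weaver and American interest (code breaking)
1950: machine translation (German to English, Russian to English)
1966 onward: loud thunder but little rain; word-for-word machine translation (Dr. Eye?)
NLP brought the first hostility of research funding agencies.
NLP gave AI a bad name before AI had a name.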
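Massive growth of information
By 2006 the volume of digital information had reached 161 billion GB (equivalent to 161 exabytes). IDC estimates that information will grow roughly sixfold between 2006 and 2010. By 2010, nearly 70% of the information in the digital universe will be created by individual users, while organizations and enterprises will be responsible for the security, privacy, reliability, and regulatory compliance of at least 85% of it.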
The Expanding Digital Universe, http://www.emc.com/leadership/digital-universe/expanding-digital-universe.htm
[Chart, 0 to 100 scale: data volume vs. market value, for unstructured data and structured data]
Examples: web messages, news reports, patents, email, documents… Oracle
Search Engine Roadmap
Full Text Search, including complex Boolean search
Exploratory Search
Wikis
Forum/Blogger
Dictionary/Ontology
Taxonomy Search
Synonym/Anatomy
Clustering/Categorize
Customized meta-search
Crawler
Personal tagging, Sharing
Custom Search
Integrate other search engines
Collaborative Filtering
Feature Ranking
Web Page Features Extraction (semi-structured and unstructured)
Feature Mapping
Filtering
Summarization (mobile)
Taxonomy search
Document Abstraction
Multiple abstracts organization
Natural language processing/understanding
Knowledge collaborative search
Visualization
Ajax
Visual Technology
Affiliation (Topic Relevance Analysis)
Search log recorder
Topological Graphics
Web 2.0 or upper
Recommendation
Web search engines
Pages are crawled offline and the data is stored in an internal structure called an inverted index
Retrieval is performed online
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007
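The offline-index, online-retrieval split on this slide can be sketched in a few lines of Python. This is only a toy illustration, not how any production engine is built: the page ids, the page texts, and the function names `build_inverted_index` and `search` are all made up for the example, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Offline step: map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Online step: return the ids of pages that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection standing in for crawled pages.
pages = {
    "p1": "jaguar is a big cat",
    "p2": "jaguar builds luxury cars",
    "p3": "the cat sat on the car",
}
index = build_inverted_index(pages)
print(search(index, "jaguar"))
print(search(index, "jaguar cat"))
```

The query "jaguar" matches both the animal page and the automobile page, which is the ambiguity the Jaguar example on the neighbouring slides points at; a plain inverted index cannot tell the senses apart.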
Search Engine Problems
Index Comprehensiveness
Relevance
Deterministic Search
Search Query: Jaguar (Animal), Jaguar (Automobile), Jaguar (Watch), Jaguar (OS)
Problem: Scalability
J. Beall, The Weaknesses of Full-Text Searching. The Journal of Academic Librarianship, 34(5):438-444, 2008.
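Evolution of search engines
First generation: uses only on-page text data (term frequency, language). 1995-1997: AV, Excite, Lycos, etc.
Second generation: uses off-page, web-specific data: link analysis, click-through data (what results people click on), anchor text (hyperlinks, how people refer to this page). From 1998; made popular by Google but everyone now.
Third generation: answers "the need behind the query": semantic analysis (what is this about?), focus on what the user needs rather than on the query alone, inference of key information, assistance for the user, integration of search and document analysis. Still experimental.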
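Web search problems
Problems:
Queries are too short and imprecise
Synonyms and similar terms make query matching hard to predict
Page authors arrange content deceptively, so search results are unsatisfactory
Users need extra functions, such as filters
Solutions:
Better understanding
Result ranking
Example: Trailblazer (car or basketball team)
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007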
Expand
Crawler
Basic Crawler: Heritrix
Wrapper/Clipper
HTML Parser
Structural Features Extraction
Semantic Crawler
Feature Mapping
Unstructured Document Features Extraction (NLP)
Clipper Windows Specify
Machine Learning Approach
P2P Knowledge Sharing Crawler
Filtering
Specific Feature Parse
Ontology
……
XML Parser
XHTML, DHTML Parser
Ontological Organization
Scheduling
Feature Transformation
Crawler Classes: Annotated Crawler
Crawler: crawl with specific terms/phrases
Outside Search Engine, Web
Data Sources
Filter
Supporting information from original sources & reference contents
Relevant Information
Feed into Reference List
Authoring (user process)
Filtering NE records
Learner
Crawler Classes: Page/Section/Block/Item Specify
Crawler
Repository
Feature Extractor
Scheduler
Log
Comparator: compare the extracted structure between two stages
Adaptor
Notify for manual tuning
Named Entities Recognition
GUI Specification System
Logger
Temporal information aggregation
Event analysis
Clustered retrieval
1. Walter Warnick, Problems of Searching in Web Databases. Science, Vol. 316, no. 5829, 1284, June 2007.
2. I-Jen Chiang, Discover the Semantic Topology in High-Dimensional Data, Expert Systems with Applications, 33 (1), September, 2007.
Technical architecture sketch
Processing steps: Raw text, Tokenized text, Stemming & Stop words, Term Weighting
Term weight matrix (documents d1 … dm as rows, terms t1 … tn as columns):
      t1   t2   …   tn
d1    w11  w12  …   w1n
d2    w21  w22  …   w2n
…     …
dm    wm1  wm2  …   wmn
Term similarity
Doc similarity
Vector centroid
Clustering
Classification / document tracking: META-DATA/ANNOTATION
Sentence selection
Summarization
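As a rough illustration of the "Term Weighting" step, the sketch below builds a small document-by-term weight matrix. The slide does not say which weighting scheme is used; TF-IDF is assumed here purely for the example. The stop-word list, the sample documents, and the names `tokenize` and `tfidf_matrix` are hypothetical, and stemming is left out to keep the sketch short.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "of", "and", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, W) where W[i][j] is the TF-IDF weight of term j in doc i."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted({t for c in counts for t in c})
    m = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    weights = [[c[t] * math.log(m / df[t]) for t in terms] for c in counts]
    return terms, weights


docs = ["the jaguar is a cat", "the jaguar is a car", "a cat and a dog"]
terms, W = tfidf_matrix(docs)
for row in W:
    print([round(w, 3) for w in row])
```

Each row of `W` plays the role of one row (wi1 … win) in the matrix on the slide; terms that occur in few documents get larger weights than terms that occur in many.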
Salton's Vector Space Model
Bag of Words
Cosine Similarity: based on the angle θ between document vectors A and B
Jaccard index (also called the Jaccard similarity coefficient or Tanimoto coefficient)
G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, nr. 11, 613–620, 1975.
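The two similarity measures named on this slide can be sketched as follows, treating each document as a bag of words. The sample documents and function names are invented for illustration. Cosine similarity uses term counts, cos θ = (A · B) / (|A| |B|), while the Jaccard index here compares only the sets of distinct terms; for binary vectors the Tanimoto coefficient gives the same value as the Jaccard index.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts, ignoring word order."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_index(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical three-word documents.
doc_a = bag_of_words("jaguar car speed")
doc_b = bag_of_words("jaguar cat speed")
print(cosine_similarity(doc_a, doc_b))
print(jaccard_index(doc_a, doc_b))
```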
Curse of Dimensionality
Ambiguous sentence: I saw the man on the hill with a telescope
Reading 1: Using a telescope, I saw a man who was on a hill.
Reading 2: I saw a man who was on a hill and who had a telescope.
Reading 3: I saw a man who was on the hill that has a telescope on it.
New directions in natural language processing
1. M. Marcus. New trends in natural language processing: Statistical natural language processing. PNAS, 92, 10052-10059, 1995.
2. Current Trends in Biomedical Natural Language Processing, Ohio State University, June 2008.
3. Tanveer Siddiqui. Natural Language Processing and Information Retrieval. Oxford Univ Press, 2008.
4. Yorick Wilks. Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends® in Web Science, 1(3-4), 199-327, 2009.
Named entity extraction
Inputs: Text; Speech (via Speech Recognition)
Extractor: uses NE Models to produce Entities
Entity types: location, person, organization
Training Program: training sentences and answers produce the NE Models
Example sentence: The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
• Prior to 1997: no learning approach competitive with hand-built rule systems
• Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance
I-Jen Chiang
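Knowledge map
Event tracking
Information retrieval
Knowledge concepts
Dependencies among the events within an issue
Query to understand the related arguments within an issue
Perspectives on an argument (by agency, case subject, etc.)
Influences acting on an event within an issue
Influence of an event within an issue
Track how events are handled over time
Drill into details to understand phenomena and responses
Weigh priorities to understand the principles of handling
Event tracking analyzes shifts in the main thread of an issue
Under the prefabricated housing issue: the government's credit guarantee fund for rebuilding victims' housing in earthquake-hit areas, 100 billion to let victims obtain loans
Under the prefabricated housing issue: the reconstruction statute is drafted to cover construction works and incentive grants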
Integrated BI Systems
Structured Data: DBMS, File System, XML, EA, Legacy → ETL → Complete Data Warehouse
Unstructured Data: CMS, Scanned Documents, Email → Text tagger & Annotator → Intermediate Data (RDBMS, XML) → ETL → Complete Data Warehouse
Sreekumar Sukumaran and Ashish Sureka
Annotation
On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount.
Text Annotator labels: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount
Output to RDBMS:
Date       Organization   Place              Amount
Nov. 16    IBM            Redwood City, CA   Undisclosed
XML output:
On <Date>November 16, 2005</Date>, <ACQUIRING ORG>IBM</ACQUIRING ORG> announced it had <ACQUISITION EVENT>acquired</ACQUISITION EVENT> <ACQUIRED ORG>Collation</ACQUIRED ORG>, a privately held company based in <PLACE>Redwood City, California</PLACE> for <AMOUNT>undisclosed</AMOUNT> amount.
McIlraith, S.A., Son, T.C., Zeng, H.: Semantic web services. IEEE Intelligent Systems 16, 46–53, 2001
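The annotation step on this slide, wrapping recognized spans in tags, can be mimicked with a short Python sketch. It does no recognition of its own: the (surface string, tag) pairs are supplied by hand, as if they came from a text annotator. The function name `annotate` is hypothetical, and the tag names use underscores (for example `ACQUIRING_ORG`) instead of the spaces shown on the slide, an assumption made so that the output is well-formed XML.

```python
from xml.sax.saxutils import escape


def annotate(text, annotations):
    """Wrap the first occurrence of each surface string in an XML-style tag."""
    spans = []
    for surface, tag in annotations:
        start = text.find(surface)
        if start >= 0:
            spans.append((start, start + len(surface), tag))
    spans.sort()

    out, pos = [], 0
    for start, end, tag in spans:
        if start < pos:
            continue  # skip spans that overlap one already emitted
        out.append(escape(text[pos:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


sentence = (
    "On November 16, 2005, IBM announced it had acquired Collation, "
    "a privately held company based in Redwood City, California "
    "for undisclosed amount."
)
# Hand-supplied annotations standing in for the annotator's output.
annotations = [
    ("November 16, 2005", "DATE"),
    ("IBM", "ACQUIRING_ORG"),
    ("acquired", "ACQUISITION_EVENT"),
    ("Collation", "ACQUIRED_ORG"),
    ("Redwood City, California", "PLACE"),
    ("undisclosed", "AMOUNT"),
]
tagged = annotate(sentence, annotations)
print(tagged)
```

Once the spans are tagged this way, each tag can be read back out as one column of the relational output shown on the slide.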
Knowledge-based Persistent Archives
Columns: Ingest, Manage, Access
(Topic Maps / Model-based Access)
(Data Handling System - Storage Resource Broker)
Interface labels: MCAT/HDF, XML DTD, XTM DTD, GRIDS, EMCAT/MIX, Rules - KQL
Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query
Information: Attributes, Semantics; Information Repository; Attribute-based Query
Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
NASA iLoC SBA Workspace
[Diagram labels: Electrical Power Analysis; Cost Modeling; Performance Modeling; Risk Modeling; Trade-Offs Analysis; Structure and Connectivity; COVE; Ontology Authoring; IDT DB; RFx DB; TopSCAPE Discipline Ontology Models; NExIOM Ontology Models; Mapping; Translation Models; WS; SI; T1; T2; IL, AL, BL]
Semantic Application: Interaction Logic, Application Logic, Semantic Interface
Text Mining for Hypertext Creation
...
Subtopic 1
A general topic
Subtopic i Subtopic M
Concept map
Hypertext
Doc 1 Doc 2 Doc N
Type of Links
...
Subtopic 1
A general topic
Subtopic i Subtopic M
Doc-Doc Links
Doc 1 Doc 2 Doc N
Term-Term Links
Doc-Term Links
Term-Doc Links
Example from an Enterprise Architecture Process Ontology
Task
Goal
Role Measure
Agent
Process
fea: Mission
fea: intentOf
fea: Agency
brm: provides
brm: SubFunction
fea: hasIntent
brm: hasProcessbrm: Process
brm: usesResource
brm: Resource
brm: hasPerformance
prm: PerformanceMeasure
prm:hasIndicator
prm: GenericMeasurementIndicator
fea: Customer
prm:hasSpecialization
prm: OperationalizedMeasurementIndicator
brm: hasCustomer
srm: Service
brm: realizedWith
Medical record integration
ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
15 Dec 1993 General Surgical
I reviewed this patient in clinic today. She has been followed
up for a left breast carcinoma for which she was treated with a
mastectomy. She had a prosthesis removed last year and has had
some improvement in the symptoms of chest wall discomfort since
then although she still gets quite sharp pains intermittently.
She has been reviewed in the pain clinic local to where she
lives but has not had much relief of her symptoms. She feels
though that she can bear with these and does not want any
further intervention at present.
On examination today there is no sign of recurrence of her
disease. Chest and abdominal examination were unremarkable. We
will see her again in a year's time.
28/03/2003, 10:35:26
ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
27 Aug 1998 Seen in the Follow Up Staging Clinic
This 65 year old lady has been reviewed in the Breast staging clinic.
As you know, she was originally diagnosed with a carcinoma of the left
breast in 1974 and treated with a total mastectomy. This was followed
with MEFUP chemotherapy. In 1982 she noticed a lump in the
infraclavicular region which was excised and this was followed by
radiotherapy. In 1994 she developed a tumour in the chest cavity that
was diagnosed with a CT guided biopsy and this was treated with VAC
chemotherapy and radiotherapy to the mediastinum. Since 1994 she had
noticed a slight deterioration and earlier this year she had problems
with occasional episodes of vomiting, nausea and general lethargy. She
was found to have lymphadenopathy in the right supraclavicular fossa
and was treated with Arimidex. Since being on Arimidex there was
originally stabilisation of her disease but recently it appears that the
node has started to enlarge.
On examination today, she has a 1.5x1cm lymph node in the right
supraclavicular fossa and an essence of thickening probably due to
previous therapy in the left supraclavicular fossa. She also has
radiation changes in the lung which produced some physical sign at both
bases and there was no evidence of abdominal organomegaly.
Her recent staging investigations show that she has C5 carcinoma cells
present in the lymph node fine needle aspirate. A right mammogram is
unremarkable. An ultrasound of the liver was normal and a chest x-ray
showed some soft tissue thickening present in the left axilla due to
previous therapy. There is also some loss of volume in the left upper
zone but no lung nodules seen. A bone scan shows evidence of
degenerative changes but no specific evidence of bony metastases. Her
thyroid function tests show that the TSH is 0.12 and her free T3 are 4
which indicates that the TSH is slightly low. This does not amount to
primary hypothyroidism but it would be worth repeating the thyroid
function tests in three months time.
Overall, it appears that the patient has stable disease on Arimidex
apart from in the right supraclavicular fossa. The Arimidex is not
holding the disease completely and we feel that the best approach to
management would be to consider some radiotherapy to the right
supraclavicular fossa. She has previously had radiation therapy to the
left clavicular region and mediastinum. We have discussed performing a
CT scan of the thorax but she was unable to lie flat for the duration
of the investigation some months ago. We shall ask our radiotherapy
colleagues to review her and consider her for therapy. We shall review
her again in the follow up clinic in six weeks time.
28/03/2003, 10:50:25
ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
24 Jan 1997 Seen in the Chemotherapy Clinic (TPFRIDAY)
I saw ##### today in clinic. I am very pleased to say that she has had
a complete response in her superior mediastinum and right
supraclavicular fossa lymphadenopathy. There is some minimal thickening
remaining in the soft tissues around the superior mediastinum and in
fact it is felt that this might now be related to previous
radiotherapy. To be honest, however, symptomatically there has been
little in the way of benefit with overall palliative response of no
change. She is tolerating the treatment fairly well. Interestingly she
has had virtually complete alopecia with the treatment. She has been on
warfarin for about the same amount of time and I wonder whether this
may be partly responsible. We have given her a fourth cycle of
treatment today and we will see her in three weeks for consideration of
her fifth.
28/03/2003, 10:44:20
ROYAL MARSDEN NHS TRUST - DIAGNOSTIC RADIOLOGY - CT REPORT
######:#######,MRS #####
Exam 18 Dec Examination LIVER/THORAX/ABDOMEN/PELVIS
Exam Number [NUM]
Date of Birth 17 May 1933
Ref [HCA1] OUTPATIENT
Clinical
BR Verified by [HCA2]
DIAGNOSIS: Carcinoma of breast.
CT scans have been obtained through chest, abdomen and pelvis with oral
contrast only.
There is thickening in the left clavicular fossa and small-
volume residual abnormalities in the mediastinum. Comparison is made
with the most recent scan (21.7.95) and there is no discernible change
by CT criteria.
Lung changes, which may have been related to radiotherapy, are now less
extensive.
There are no abnormally-enlarged nodes in the retroperitoneum
or pelvis. There are no focal hepatic masses.
CONCLUSION: No CT evidence of disease progression.
28/03/2003, 12:35:06
Disease diagnosis
Consider a 62-year-old man with a 3-month history of severe back pain. His weight remained stable. CBC and routine biochemistry were normal. ESR was 52 mm/hour. An x-ray of the lumbar and thoracic spine was reported as showing degenerative changes.
Cancer
Low back pain
History and physical examination
History of previous cancer
Age > 50 years, or failure of treatment, or weight loss
No significant finding
ESR
ESR < 20 and only one clinical finding
ESR > 20 or more than one clinical finding
ESR, spine films: 9% with cancer
No cancer
X-ray: 2.3% cancer
Features
What was done… What happened… And why
[Graph nodes: Human:1382; Mass:1666; Pain:5735; Ulcer:1945; Cancer:1914; Breast:1492; Radio:1812; Chemo:6502; Clinic:4096; Biopsy:1066; Clinic:1024; Clinic:2010]
[Edge labels: locus, plans, target, attends, treats, finding, reason, time]
Concept Lattice
Given the context (D1, T1), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}
R    t1  t2  t3  t4  t5  t6
d1   1   0   1   0   1   1
d2   1   0   1   0   1   1
d3   0   1   0   1   0   0
d4   1   0   0   1   0   1
Table: The input relation R = documents × keywords
Hasse Diagram
C1: (D1, Ø)
C2: ({d1,d2,d4}, {t1,t6})    C3: ({d3,d4}, {t4})
C4: ({d1,d2}, {t1,t3,t5,t6})    C5: ({d4}, {t1,t4,t6})    C6: ({d3}, {t2,t4})
C7: (Ø, T1)
The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
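The seven formal concepts listed on this slide can be reproduced from the relation R with a brute-force sketch. It enumerates every subset of documents, takes the terms they share (the intent), and then collects all documents carrying those terms (the extent). The helper names are hypothetical, and exhaustive enumeration is only reasonable for a context this small; it is not meant as an efficient lattice-construction algorithm.

```python
from itertools import combinations

# The input relation R from the slide: document -> set of terms.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Terms shared by every document in docs (all terms when docs is empty)."""
    terms = set(ALL_TERMS)
    for d in docs:
        terms &= R[d]
    return terms


def docs_having(terms):
    """Documents that contain every term in terms."""
    return {d for d, doc_terms in R.items() if terms <= doc_terms}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every document subset."""
    concepts = set()
    docs = sorted(R)
    for k in range(len(docs) + 1):
        for subset in combinations(docs, k):
            intent = common_terms(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

Ordering the concepts by extent inclusion gives the Hasse diagram: C1 at the top with all documents and no terms, C7 at the bottom with no documents and all terms.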
Text Analysis Spectrum
Entity Extraction
Targeted Facts and Events
Classification
Clustering
Concept Identification
What is this document about?
Who did what to whom, when, where, etc.
Why is getting dimensional data so hard?
Hank bought plastic explosives from Henry in Tucson yesterday.
Named Entity Extraction
Entity types: People, Weapons, Vehicles, Dates
NER Engine output: Hank, Henry; Plastic explosives; Tucson; 11/01/07
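To make the input and output of the NER engine concrete, here is a deliberately naive dictionary-lookup sketch. It is a stand-in only: the slides discuss statistical and pattern-learning extractors, not gazetteer lookup, and the `GAZETTEER` entries and the labels are made up for this one sentence. It also leaves "yesterday" as a string; resolving it to 11/01/07 as on the slide would need a reference date.

```python
# Hypothetical gazetteer covering only the example sentence.
GAZETTEER = {
    "Hank": "Person",
    "Henry": "Person",
    "plastic explosives": "Weapon",
    "Tucson": "Place",
    "yesterday": "Date",
}


def extract_entities(text):
    """Return (surface string, label) pairs in order of first appearance."""
    lowered = text.lower()
    found = []
    for phrase, label in GAZETTEER.items():
        pos = lowered.find(phrase.lower())
        if pos >= 0:
            found.append((pos, text[pos:pos + len(phrase)], label))
    return [(surface, label) for _, surface, label in sorted(found)]


sentence = "Hank bought plastic explosives from Henry in Tucson yesterday."
for surface, label in extract_entities(sentence):
    print(f"{label}: {surface}")
```

A lookup like this breaks as soon as a name is missing from the dictionary or is ambiguous, which is one reason learned extractors are attractive.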
Automatic Pattern-Learning Systems
Pros:
Portable across domains
Tend to have broad coverage
Robust in the face of degraded input
Automatically find appropriate statistical patterns
System knowledge not needed by those who supply the domain knowledge
Cons:
Annotated training data, and lots of it, is needed
Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog, Soderland WHISK (UMass); Mooney et al. Rapier (UTexas); Ciravegna (Sheffield). Learn lexico-syntactic patterns from templates.
Trainer: Language Input and Answers produce the Model
Decoder: Language Input and the Model produce Answers
P14 performed
P11 participated in
P94 has created
E31 Document
“Yalta Agreement”
E7 Activity
“Crimea Conference”
E65 Creation Event
*
E38 Image
P86 falls within
P7 took place at
P67 is referred to by
E52 Time-Span
February 1945
P81 ongoing throughout
P82 at some time within
E39 Actor
E39 Actor
E39 Actor
E53 Place
7012124
E52 Time-Span
11-2-1945
Explicit Events, Object Identity, Symmetry
Rules Extraction
The formal concept C4 yields the following rules:
R1: t3 → t1, t6
R2: t5 → t1, t6
R3: t3 ↔ t5
Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6.
Rule R3 expresses the mutual equivalence of the terms {t3, t5}: all the documents that have the term t3 also have the term t5, and vice versa.
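The three rules can be checked directly against the documents × keywords relation given with the concept lattice. In the sketch below, a rule "premise → conclusion" holds when every document containing all premise terms also contains all conclusion terms; the function name `holds` is hypothetical, and the relation is copied from the earlier table.

```python
# Document -> terms, copied from the input relation R on the concept lattice slide.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def holds(premise, conclusion):
    """True if every document with all premise terms has all conclusion terms."""
    return all(conclusion <= terms for terms in R.values() if premise <= terms)


r1 = holds({"t3"}, {"t1", "t6"})                     # R1: t3 -> t1, t6
r2 = holds({"t5"}, {"t1", "t6"})                     # R2: t5 -> t1, t6
r3 = holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"}) # R3: t3 <-> t5
converse = holds({"t1"}, {"t3"})                     # not a rule: d4 has t1 but not t3
print(r1, r2, r3, converse)
```

The failed converse shows why R1 and R2 are one-directional: t1 and t6 are inherited from concept C2, which also covers d4.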
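Subsidies for low- and middle-income households
Causal diagram: children left without support
Welfare in each county and city; establishment of trust funds
Status of children left without support in each county and city
Intervention by county and city governments, social affairs bureaus, etc.
Subsidies for single-parent families; post-disaster reconstruction and the related use of funds
Post-disaster reconstruction fund
Rules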
Chinese NER – Example 2
黑色當道 少了尖叫 女星太規矩 城城活跳跳 金馬獎星光大道不若前晚金鐘獎
「峰芒」畢露,女星們規矩平穩的服裝,讓星光大道上少了一些特色,並未出現讓人眼睛一亮的驚喜。其中,在金鐘獎上讓人血脈僨張的蕭淑慎,在金馬獎上可以看出服裝「規矩」了些。總體來說,今年的星光大道造型略顯平庸。 秋冬主流黑色更在金馬星光大道上大量出現,凱渥模特兒公司老闆、也是專業資深時尚人洪偉明說:「可以發現他們選擇合適的服裝,規矩、正式的選擇,可避免遭受批評,今年確實少了些特色,但重要的國際場合,平穩的黑色服裝,也是出席正式場合的安全造型。」 洪偉明表示:「楊千嬅的服裝和她的人很搭,黑色蕾絲讓她不至於顏色過重,正式中又帶點活潑,感覺很棒。」台中市長胡志強女兒胡婷婷桃紅色的緞面禮服,也讓洪偉明很欣賞,他說:「整體感覺落落大方,亮色服裝和她的人也很適合,她的自信和星光大道主持人蔣怡的乾淨大方一樣,讓人感覺舒服,也是不錯的造型。」 舒淇鵝黃色的禮服,洪偉明笑說:「羅曼蒂克的感覺和她的笑容很搭配,讓氣色宛如戀愛中的女人一樣美好。」梁詠琪的黑色短禮服,雖然露出她的修長美腿,但洪偉明也建議:「她至少可以搭雙絲襪,整體感覺會更好。她在演唱會上展現性感,其實星光大道上也可以大膽改變。」 至於男星們的服裝,今年則是絲絨的天下,洪偉明笑說:「男星們服裝不易做出變化,敢大膽嘗試不同造型的人也不多,其中郭富城神采奕奕的精神,十分突出,張震的服裝則顯得穩重而規矩。」
Proper nouns: word, part of speech, occurrence count
張震 [Nb] 專有名稱 1
賴雅妍 [Nb] 專有名稱 1
米蘭 [Nb] 專有名稱 1
林熙蕾 [Nb] 專有名稱 2
楊貴媚 [Nb] 專有名稱 1
林志玲 [Nb] 專有名稱 1
楊采妮 [Nb] 專有名稱 1
藍正龍 [Nb] 專有名稱 1
侯佩岑 [Nb] 專有名稱 3
梁詠琪 [Nb] 專有名稱 2
黃子佼 [Nb] 專有名稱 1
楊千嬅 [Nb] 專有名稱 1
胡婷婷 [Nb] 專有名稱 2
戴起 [Nb] 專有名稱 1
Word, part of speech, occurrence count
背後 [Nc] 地方詞 1
中途 [Nc] 地方詞 1
世界 [Nc] 地方詞 1
天下 [Nc] 地方詞 1
原地 [Nc] 地方詞 1
舒淇 [Nb] 專有名稱 2
高達 [Nb] 專有名稱 1
白 [Nb] 專有名稱 1
竹幼婷戴榮賢 [Nb] 專有名稱 1
郭富城 [Nb] 專有名稱 1
范文芳 [Nb] 專有名稱 1
金馬獎 [Nb] 專有名稱 3
舒淇鵝 [Nb] 專有名稱 1
金城武 [Nb] 專有名稱 2
蕭淑慎 [Nb] 專有名稱 4
黃志瑋 [Nb] 專有名稱 1
天心 [Nb] 專有名稱 1
洪偉明 [Nb] 專有名稱 2
師李 [Nb] 專有名稱 1
Time words: word, part of speech, occurrence count
昨天 [Nd] 時間詞 4
新春 [Nd] 時間詞 1
昨晚 [Nd] 時間詞 1
早春 [Nd] 時間詞 1
前晚 [Nd] 時間詞 2
先後 [Nd] 時間詞 1
今年 [Nd] 時間詞 6
週末 [Nd] 時間詞 1
Word, part of speech, occurrence count
露美腿 [LN] 人名類 2
台中市長胡志強女兒胡婷婷桃紅色
[LN] 人名類 1
Generative Discriminative
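Home reconstruction project
Financial institutions
Loans; provisional statute for earthquake reconstruction; affected households
Houses
Interest
Damage
Affected households
object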
method
Object:attribute
Object:attribute
Object:attribute Object:
condition
Object:attribute
Object:Attribute (condition)
Object:attribute
Specify
Generalize
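Examples (one respondent per line)
Well suited to machine washing; pleasant scent; strong stain removal
Washing takes little effort; fresh scent; removes 99 kinds of stains
Washes especially clean; pleasant scent; white socks come out cleanest
Smells very good; gentle on the hands
Removes stains well; clothes do not fade easily
Washing is effortless; removes 99 kinds of stains
Little is needed; washes clean
Little skin irritation; cleans all kinds of stains well
Washes clean; reasonable price
Better washing results; nice smell; have always used this brand
Washed clothes are whiter; pleasant smell; memorable advertising
Washes clean; rinses out easily; fairly gentle on the hands
Washes clean; little is needed
Washes clean; needs less than other brands
Heavily advertised; washes clean; little is needed
Good quality; little is needed; washes clean
Good packaging; lots of appealing advertising; pleasant scent
Washes clean and white; good promotion and fun advertising; many people say it is good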
Semantic concept extraction for Malignancy DSS
Bag of "Words" extraction: Patient ID; ESR; severe; back; pain; x-ray; lumbar; spine; degenerative; changes
Expressions extraction: Patient ID; ESR; severe back pain; x-ray; lumbar; spine; degenerative changes
Named Entities extraction: Patient ID; malignancy? (Diagnostic term); ESR (screening test); Lumbar, Spine (Anatomy Term); degenerative changes (Symptom)
Events/Sentiment Extraction: Patient (Patient ID); ESR Screening (Positive); Symptom (Positive Indication); Cancer
Combined with structured data
Decision Making: malignancy, Treatment
Knowledge Inference, Information Extraction, Information Retrieval
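(Document) data mining toward decision support
Consolidate data of the same nature: Clustering
Mine the data to produce association and dependency rules: Association Rules
Visualize the results to help experts assess topics: Visualization
Define handling guidelines to make decision support easier to build: Processing Guideline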
Development
FTP, Gopher, HTML
Crawling, Indexing, Search
Local data
Monitor, Mine, Modify
Web Servers
Web Browsers
Social Network of Hyperlinks
Relevance Ranking
Latent Semantic Topology
Topic Directories
More structure
Clustering
Scatter-Gather
Semi-supervised Learning
Automatic Classification
Web Communities
Topic Distillation
Focused Crawling
User Profiling
Collaborative Filtering
WebSQL, WebL
XML