An Empirical Study of XML Data Management in Business Information Systems

1

An Empirical Study of XML Data Management in Business Information Systems

Speaker: 呂瑞麟國立中興大學資訊管理學系教授

Email: [email protected]: http://web.nchu.edu.tw/~jlu

2

動機念頭起源於大約公元 2000 年

XML 剛剛興起預期會有大量的 XML 資料 / 文件被交換或者儲存

應該會使用資料庫來處理吧？因為速度比較快、同時多人存取（ concur

rency 、 locking ）、資料一致性（ consistency ）等等優點

3

動機要使用什麼樣的資料庫？

有沒有專門為 XML 設計的資料庫？（稱之為 Native XML DB ； XML 原生型資料庫； NXD ）

查（ google 、線上論文、最好是 conference 的 papers ）

常見的 RDBMS 結構不同，會有什麼樣的問題？

目前有沒有任何學者對此發表過任何論文可以參考？

4

動機可能可以進行的方向

文件還是資料？從企業的角度

使用資料庫來處理 XML 的資料的方式有哪些？其各自的優缺點為何？原因為何？

真的能用嗎？ Eric Jui-Lin Lu, Bo-Chan Wu, and Po-Yun Chuang, "An Emp

irical Study of XML Data Management in Business Information Systems", The Journal of Systems and Software, vol. 79, no. 7, July 2006, pp. 984-1000. [SCI,EI]

從資訊科技的角度現有方式的優缺點？是否能改進？想法：使用 Hierarchical 、好的儲存方式、應用在行動設備、好的 indexing 方式、好的查詢方式、好的 mapping 方式、好的 concurrency 方式、好的存取控制結構等等等

同時涵蓋 RDBMS 以及 NXD

5

目標 a complete and systematic survey of the

current development and challenges of processing XML data in relational and native XML databases

Insert a new (undefined) value XML Query SQL Query

a useful benchmark for IT practitioners who need to process XML data effectively

experimental results and detailed analysis which reveal several interesting tips that can be helpful to XML document designers

a conclusion, based on the findings of using native XML databases in EDI processes, that it is practical to use native XML databases for daily operations

6

Literature Review XML Files Relational Databases

XML RDB RDB XML 支援以上功能的資料庫稱之為 XML-enable

d Databases Native XML Databases

7

XML RDB 儲存時的挑戰－－結構不同

8

Storage Mapping 目前在對應 (mapping) 的規則上已有許多研究的成果 Tian et al. [2001] 的索引對應方法 Bourret [2001] 的方法

Table-based model Object relational model

利用 DTD 的圖型 Element Graph (Shanmugasundaram et al. [1999] 所提 )

Edge (Florescu and Kossmann [1999] 所提 )

9

Tian et al. (2001) 處理方式：

把 XML 的文件或其內容的個別元素，視為單一的一個物件

將這個物件儲存至資料庫的一個欄位內接著利用元素的長度作索引

優點簡單

缺點非常沒效率

10

Bourret (2001) – table-based 把文件視為單一或是多個表格

Used by MS SQL and Oracle Cannot handle complex XML documents

11

Bourret (2001) – object-based Create a table for each complex element Based on cardinality, assign PK and FK

12

Element Graph Depth first traversal on DTD

How to deal with attributes? Create tables for root and nodes under “*”

13

Element Graph Depth first traversal on DTD

How to deal with attributes? Create tables for root and nodes under “*”

14

Florescu & Kossmann [1999]

15

RDB XML Typical steps:

query data using XML query languages translates XML queries into SQL statements and submits the

m to the underlying RDB. the retrieved data will be mapped against some XML schema

definitions called XML views. There are two major problems with this approach (Suciu, 2

001). automatic conversions from XML queries into SQL statements

generally create redundant joins. To eliminate redundant joins, a general purpose query minimization algorithm is required. However, query minimization is NP complete for conjunctive queries.

when storing XML documents into RDBs, some conversion is required, and, consequently, structure information may be lost during the conversion.

16

RDB 之挑戰 Multiple-table joins can be costly Rich structure may be lost

Ordering Elements vs. Attributes PIs and Comments

Forward and backward mapping is costly

XML Query SQL Query

17

原生型資料庫 XML:DB consortium 的定義：

一個原生型 XML 資料庫是被設計用來儲存和操作 XML 的資料，其資料的存取是經由 XML 和其相關的標準技術，如： XPath 、 XSLT 、 DOM 和 SAX 。原生型 XML 資料庫必須能保持 XML 的結構和相關的 metadata 。其儲存的格式並不重要 ( 例如 : 物件導向技術，關聯式資料庫技術 .. 等等 ) ，但若要存取資料庫的資料時，就必須使用 XML 的相關技術。而原生型 XML 資料庫的基本儲存單元是一份 XML的文件。

18

使用 NXD 可行嗎？一般都會採用或者設計 benchmark

目前已經存在的 benchmarks 有： Micro Benchmark: Michigan benchmark (Runapongsa

et al., 2002) Application Benchmark: XBench (Yao et al., 2002a,b), X

Mach-1 (Bohme and Rahm, 2001), XMark (Schmidt et al., 2002), and XOO7 (Bressan et al., 2001, 2002; Nambiar et al., 2002a,b)

Update is not considered (except for XMach-1 which consider three simple update operations)

Number of files and the size of files are not considered No comparisons between RDBs and NXDs

19

設計 benchmark 所考量的因素各式各樣的查詢語言（例如： XPath ， XQu

ery ， XQL 及 XML-QL 等），使得系統的效能更難以評估 (Schmidt et al. [2001]) 。只考慮功能，而非語法

The design of XML queries is based on five different dimensions query target query path conditional path operators functions

20

設計 benchmark 所考量的因素

query target Ex. Simple elements, complex elements, attributes, or whole

documents query path conditional path

Ex. for /order[com/com_id = 1], /order is the query path and com/com_id = 1 is the conditional path.

to measure how the depth of both query path and conditional path affect the performance of databases. Ex. Depth(/order)=1 and depth(com/com_id = 1)=2.

Operators Ex. AND, OR, NOT, etc.

Functions Ex. Count(), max(), min(), etc.

21

設計 benchmark 所考量的因素 Missing elements (or null value)

RDB contains a null value NXD contains nothing

Insertions In addition to previous factors, also consider

the number of files and the size of files Ex. One file for each document

Insert or append Insert a complex element may involve

several tables

22

設計 benchmark 所考量的因素 Deletions

The number of elements to be deleted The distribution of to-be-deleted

elements Remove empty files

Modifications The content of elements The element itself?

23

我們的評估環境硬體資料

台灣花卉市場的資料（之前的研究） 6000 purchase orders (about 7.5 MB) were randomly gene

rated and saved in either 1, 6, or 6000 files 軟體

NXDs eXcelon 3.1 ,dbXML 1.0b2, Apache Xindicce version 1.0, Soft

ware AGs Tamino XML Server version 3.1.1.4 RDBs

Microsoft SQL Server 2000, Software AGs Tamino XML Server version 3.1.1.4

24

我們的評估環境 SQL Tables

Edge 產生 1 個表格 Element Graph 產生 2 個表格 Relational Object 產生 4 個表格

25

我們的評估環境 Because of the structure differences, the update o

perations performed on RDBs and NXDs are quite different.

Deleting an element in NXDs means removing a node and all of its sub-nodes completely. However, in RDBs, deleting an element only means putting null values to the fields corresponding to the node and all of its subnodes.

In NXDs, to create a new element, it is required to add both element names and their values.

Index on NXDs: “index on element name” or “index on element value”

indexing on an element with unique values and on non-unique values.

26

Experimental Results: Queries

27

Experimental Results: Query Depth

28

Experimental Results: Elements vs. Attributes

29

Experimental Results: Insert

30

Summary It is found that RDBs in general perfor

m better than NXDs in processing XML data. If the structure of XML documents is com

plex, the XML documents need to be flatten before further processing.

When complex elements are inserted and whole documents are inserted or deleted, NXDs run faster than RDBs.

31

Summary It is shown that the time required to process attrib

utes, including query, delete, insert, and update, is shorter than the time required to process elements. Also, the shorter the depths of query path and conditional path, the faster the databases to process XML data.

Without losing readability, it is recommended to encode data in attributes rather than elements and design XML documents as ‘‘flat’’ as possible. One example approach is to use attributes to replace elements. For example, <com><City>Taipei</City></com> can be replaced by <com City = ‘‘Taipei’’></com>. Designing ‘‘flat’’ XML documents will improve not only the performance of NXDs but also RDBs. This is because less tables are needed to represent ‘‘flat’’ XML documents in RDBs.

32

Summary For NXDs, appending data to the end of files

is faster than inserting data in the beginning of files in most cases.

The number of documents saved in a file and the number of files saved in NXDs do have impact on the performance. Therefore, it is a good practice for developers to evaluate different alternatives when employing NXDs.

It is shown that the element graph method outperforms the other methods

33

Summary In general, MS-SQL outperforms Oracle

the query result generated by SQLXML is actually saved in a table of one field with each record contains no more than 2033 characters, while the query result generated by XDK is a string formatted in XML style

Oracle runs faster than MS-SQL if programs retrieve data directly from tables and insert tags without using either SQLXML or XDK.

In other words, the reason why MS-SQL outperforms Oracle is because SQLXML is more effective than XDK.

34

Issues worth further investigation It is clear that the number of documents saved in a file has

significant impact on the performance of a native XML database. However, the details of how many documents are saved in a file should be hidden from developers.

The database management system itself should be able to optimize the storage dynamically.

As shown in the experimental results on both eXcelon and Tamino, when all documents are saved in one big file, adding indexes does not improve the query performance much.

the development of effective indexing schemes become imperative.

Almost all RDBs provide some sort of utilities to automatically convert XML documents into tables. However, at the current stage, only XML documents with simple structure can be processed.

35

個案分析研究訪問對象是電子產業中

生產顯示器的中光電 (CTX) 和誠洲 (ADI) 生產主機板的華碩 (ASUS) 及環電 (USI)

36

目前業界 EDI 處理流程

37

EDI 流程 All orders placed by customers are temporarily stored in a temporary zon

e. These orders are in plain text format. For every pre-determined period of time, the orders are automatically tra

nsformed into a database. Each order is saved into three tables which are orders, order-items, and other-info. Two additional fields are added into the table order. One is status which indicates the current status of an order.

other-info contains miscellaneous information about the order such as when this order is received. The main purpose of other-info is for dispute resolution if occurred.

After the orders in the database are processed by backend systems, the processed orders are written into a table called sales-orders. The content of the table salesorders are mainly retrieved from the tables orders and order-items. Additionally, whether or not the orders are approved (either pass or fail) and the actual sales quantities are written into sales-orders.

Later, the sales orders will be delivered to the customers who placed the orders.

38

EDI 流程 It is assumed that orders and sales orders are all encoded in XML for

mat. Also, it is assumed that when company A places orders to company B, there are already 5900 orders in company B’s database.

1. After certain period of time, company B appends the orders in the temporary zone to the database. Supposed that 100 orders are added each time.

2. After appending the orders to the database, company B appends two simple elements status and date to each order.

3. Back-end systems then query these 100 orders. After these orders are processed, back-end systems write either pass or fail and the actual sales quantity for each sales order to the database. Moreover, the information for sales orders are retrieved from the tables orders and order-items and written into the table salesorders.

4. It is also assumed that company A cancels two orders. In other words, the values of status are modified from a to c.

39

EDI 流程 the following operations are summarized:

100 orders are added twice. One is in step 1, and the other is in step 3.

For each order, 4 simple elements are added, and they are status, date, and the fields for storing pass or fail as well as the actual sales quantity.

100 orders are queried twice. All happen in step 3.

Two simple elements are modified in step 4.

40

情況模擬如依上述情況則

SQL (Element Graph) 6957*2+10*4+157*2+10*2=14288(毫秒 )

eXcelon(1 file) 9280*2+594*4+808*2+2577*2=27706 (毫秒 ) eXcelon(6 file) 6976*2+618*4+1018*2+694*2=19848 (毫秒 ) eXcelon(6000 file) 9952*2+945*4+1392*2+891*2=28250 (毫秒 )

41

情況模擬 (cont.) 假設另一情況，把新增筆數改為 10筆，而其餘流程不變則

SQL (DTD Graph) 5985*+10*4+157*2+10*2=12344(毫秒 )

eXcelon(1 file)848*2+594*4+808*2+2577*2=10842(毫秒 ) eXcelon(6 file) 731*2+618*4+1018*2+694*2=7358(毫秒 ) eXcelon(6000 file)1444*+945*4+1392*2+891*2=11234(毫秒 )

42

43

未來工作 The missing pieces (discussed in

motivation) Mobile databases Document-centric applications

Documents

An Empirical Study of XML Data Management in Business Information Systems