Korea University 8, Week 9: Big Data

Korea University, College of Informatics, Department of Computer Science
Prof. 강장묵 (Kang Jang-Mook)

([email protected] ; [email protected])

Special Topics in Educational Information Services, Week 9; 2015.4.29 (Wed.)

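What does processing structured, semi-structured, and unstructured data mean for educational information services? (Which data should carry the core weight in educational information? Recommend a few and discuss the reasons.)

Keywords: educational information, educational data mining, educational big data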

Source: http://www.korea.ac.kr/search/search.jsp
Source: http://analyticstraining.com/wp-content/uploads/2014/09/30-Sept.jpg

Educational information +

Weights


Site of the week

https://www.pinterest.com/

- Image-based curation
- Social curation, relationship curation, cosmetics curation

- Education curation?

www.slideshare.net/mooknc

Share your reports and activities on SlideShare as well

Lecture notes

http://www.slideshare.net/mooknc

Everyday storytelling, hidden sensing, and context analysis
https://www.youtube.com/watch?v=OptqxagZDfM

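Questions to consider before the lecture

- In an Internet of Things (Internet of Everything) environment, what are unstructured, semi-structured, and structured data?

- In a cloud computing/network environment, what do unstructured, semi-structured, and structured data mean?

- For student wearables such as school uniforms, badges, student watches, and student shoes, what are structured, unstructured, and semi-structured data, and how can they be used?

- Or rather, must they be used at all? And what would proper use be (how can educational meaning be attached to it)?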

Source: http://news.chosun.com/site/data/html_dir/2015/03/02/2015030202126.html
Source: http://cfile10.uf.tistory.com/image/182A5D50506E440612E7FF

Men may not know this, but how many times a day do women touch up their makeup?

If we analyze the text posted as social big data,


Source: http://news.chosun.com/site/data/html_dir/2015/03/02/2015030202126.html

So then,

why do they touch up their makeup at 10 p.m.?


Source: http://news.chosun.com/site/data/html_dir/2015/03/02/2015030202126.html
[Source] This article was written by Chosun.com.

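If you ask someone why they touch up their makeup late at night, few will answer, "To take a selfie."

This is hard to ask about, and even if you do ask, people will not readily answer.

Because we cannot imagine that anyone touches up their makeup at that hour, we do not even think to ask,

and even if we did, people often cannot answer, either out of embarrassment or because they do not remember their own behavior.

Things like the "10 p.m. selfie" could be found only because we observed these women's lives through big data, the accumulated traces of life that they left behind without realizing it.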
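The kind of observation described above can be sketched in a few lines: count posts that mention a keyword by hour of day, then look at which other words appear alongside it at the peak hour. This is only a minimal illustration; the sample posts, the function names `mentions_by_hour` and `co_mentions`, and the naive whitespace tokenizer are all assumptions, not the method used in the cited article.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample posts (timestamp, text); not real data.
sample_posts = [
    ("2015-03-01 22:05", "fixed my makeup, selfie time"),
    ("2015-03-01 22:40", "touching up makeup before a selfie"),
    ("2015-03-01 13:10", "quick makeup fix after lunch"),
    ("2015-03-02 22:15", "new lipstick, selfie!"),
    ("2015-03-02 08:30", "morning coffee"),
]

def hour_of(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and return the hour of day."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour

def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace, strip punctuation."""
    return [t.strip(",.!?") for t in text.lower().split()]

def mentions_by_hour(posts, keyword):
    """Count posts whose tokens include `keyword`, grouped by hour of day."""
    counts = Counter()
    for stamp, text in posts:
        if keyword in tokenize(text):
            counts[hour_of(stamp)] += 1
    return counts

def co_mentions(posts, keyword, hour):
    """Count the other words that appear with `keyword` in posts at `hour`."""
    words = Counter()
    for stamp, text in posts:
        tokens = tokenize(text)
        if hour_of(stamp) == hour and keyword in tokens:
            words.update(t for t in tokens if t != keyword)
    return words

makeup_hours = mentions_by_hour(sample_posts, "makeup")
peak_hour = makeup_hours.most_common(1)[0][0]
companions = co_mentions(sample_posts, "makeup", peak_hour)
print(peak_hour, companions.most_common(3))
```

Nobody is asked a question here; the pattern comes out of traces that were posted for other reasons, which is the point the slide makes.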

Many companies use questions to identify consumers' needs.


Source: 김은정 (EunJung Kim), 이혜선 (HyeSun Lee), "A Study on an Alternative Design Research Model Using Online Unstructured Data: Focusing on Design Ethnography Methodology," 디자인융복합학회, <디자인융복합연구>, Vol. 12, No. 5, 2013, pp. 205-223

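The "unstructured data" used in this study starts from this concept of big data. Depending on its degree of structure, big data is broadly divided into two types: "structured data" and "unstructured data". "Structured data" refers to structured data stored in fixed fields, whereas "unstructured data" refers to data that is hard to organize into a uniform structure because each item differs in size and content.

Data not stored in a fixed format, such as news posts uploaded to the Internet in real time, blog and community board posts, videos, music, and photos, belongs to this category. By data type it consists of text, images, audio and video, log files, and so on.

Such online unstructured data currently accounts for more than about 95% of the digital universe (source: IDC, The Expanding Digital Universe, 2007) and is projected to account for more than about 90% of all data generated in the future (IDC, The Expanding Digital Universe, 2011; 함유근･채승병, 빅데이터, 경영을 바꾸다, Seoul: 삼성경제연구소, 2012, p. 32). In particular, with the popularity of SNS services and the growing use of personal digital devices that produce information, the explosive production of unstructured data will inevitably continue, and its share will rise sharply year by year. The characteristics of this unstructured data are expressed by the 3Vs: volume, variety, and velocity.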
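To make the distinction concrete, the sketch below contrasts a structured record (fixed fields that can be queried directly) with unstructured text (which must first be turned into fields). The `EnrollmentRecord` class, the sample comments, and the chosen features are hypothetical examples from an educational setting, not part of the cited study.

```python
import re
from dataclasses import dataclass

@dataclass
class EnrollmentRecord:
    """Structured data: every record has the same fixed fields."""
    student_id: str
    course: str
    score: int

# Hypothetical structured records.
structured = [
    EnrollmentRecord("s001", "big-data", 87),
    EnrollmentRecord("s002", "big-data", 92),
]

# Hypothetical unstructured data: free text whose size and content differ per item.
unstructured = [
    "Week 9 was hard, but the MapReduce example finally made sense!",
    "Can we get more reading on semi-structured data? #bigdata",
]

def average_score(records):
    """Fixed fields can be aggregated directly, with no preprocessing."""
    return sum(r.score for r in records) / len(records)

def extract_features(text):
    """Unstructured text has to be converted into fields before analysis."""
    return {
        "length": len(text),
        "hashtags": re.findall(r"#(\w+)", text),
        "is_question": "?" in text,
    }

features = [extract_features(t) for t in unstructured]
print(average_score(structured))
print(features)
```

Which features to extract is itself a design decision; in the lecture's terms, it is where weights on educational data begin to be assigned.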

Source: http://www.edureka.co/blog/answering-the-big-question-what-is-big-data/

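Source: 최요셉 (JoSeph Choi), 최용석 (YongSeok Choi), "Structuring of Unstructured Big Data and Visual Interpretation Using MapReduce and Correspondence Analysis," 한국통계학회 (The Korean Statistical Society), <응용통계연구>, Vol. 27, No. 2, 2014, pp. 169-183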

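As of 2010, the volume of information accumulated in digital space is estimated to approach 1.2 billion TB (terabytes) (Gantz and Reinsel, 2010). According to Special Report (2010.02.25), Wal-Mart, the world's largest retail chain, stores more than 1 million transaction records per hour and had accumulated about 2,500 TB of information by 2008. In addition, as of January 2011, about 110 million tweets were sent on Twitter every day (Chiang, 2011), and the amount of data to be managed is projected to grow 50-fold by 2020 (Jeong, 2011).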
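A quick unit conversion helps put these figures on a common scale. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 ZB = 10^21 bytes) and treats the quoted rates as uniform over time, which is a simplification; the variable names are illustrative.

```python
TB = 10**12  # bytes per terabyte (decimal units, an assumption)
ZB = 10**21  # bytes per zettabyte

# 1.2 billion TB expressed in zettabytes.
digital_universe_tb = 1.2e9
digital_universe_zb = digital_universe_tb * TB / ZB

# "More than 1 million transactions per hour" as a per-second rate.
walmart_per_second = 1_000_000 / 3600

# "About 110 million tweets per day" as a per-second rate.
tweets_per_second = 110_000_000 / 86_400

print(f"{digital_universe_zb:.1f} ZB")
print(f"{walmart_per_second:.0f} transactions/s")
print(f"{tweets_per_second:.0f} tweets/s")
```

Under these assumptions, 1.2 billion TB is about 1.2 ZB, and the two activity figures correspond to roughly 278 transactions and 1,273 tweets per second.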


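Unstructured data: "Unstructured data" is one type of big data. Unlike numeric data, which has a fixed size or format, it refers to data that is not structured and whose form and structure vary, such as pictures, video, and documents. (Source: 정용찬, 빅데이터, 커뮤니케이션북스, 2012, p. 14)

In this study, "unstructured data" does not mean traditional unstructured data such as books, magazines, documents, and medical records; the term is limited to data generated and accumulated on mobile devices and online, such as e-mail, Twitter, and blogs.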



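Recently, research on and use of unstructured big data has become active. Examples include Google's automatic translator, flu outbreak information derived from regional search-term frequencies, the high-quality search-term features of Internet portal sites, link analysis of connections between keywords, and U.S. President Obama's election campaign, which analyzed voters' tweets to run tailored campaigns.

In particular, Kim and Cho (2013) list the following statistical methodologies related to big data analysis: high-dimensional regression, classification, multiple comparison, ensembles, optimization algorithms, dimension reduction, network analysis, clustering, visualization, online analysis, parallel computation, and the R program RHive. To analyze such big data, this study uses MapReduce, a distributed processing system, to structure unstructured big data, and uses correspondence analysis to analyze and visualize it.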
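The idea of using MapReduce to turn unstructured text into a structured table can be sketched in plain Python. This is a single-process imitation of the map, shuffle, and reduce phases, not the Hadoop API and not the cited authors' implementation; the documents, the keyword set, and all function names are assumptions. The resulting category-by-keyword contingency table is the kind of input that correspondence analysis works on.

```python
from collections import defaultdict

# Hypothetical documents: (category, free text).
documents = [
    ("student", "exam stress exam schedule"),
    ("teacher", "exam grading schedule"),
    ("student", "club schedule"),
]
KEYWORDS = {"exam", "schedule", "grading"}  # illustrative keyword set

def map_phase(category, text):
    """Map: emit ((category, keyword), 1) for each keyword occurrence."""
    for token in text.split():
        if token in KEYWORDS:
            yield (category, token), 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one (category, keyword) key."""
    return key, sum(values)

emitted = [pair for cat, text in documents for pair in map_phase(cat, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Arrange the reduced counts as a contingency table (rows: category, cols: keyword).
rows = sorted({cat for cat, _ in counts})
cols = sorted(KEYWORDS)
table = [[counts.get((r, c), 0) for c in cols] for r in rows]

print(cols)
for r, row in zip(rows, table):
    print(r, row)
```

In a real cluster the map and reduce calls would run on different nodes; only the table construction at the end is specific to preparing data for correspondence analysis.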




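This can be replaced by trend discovery through keyword analysis of everyday consumption and interest domains, and by the search for new lifestyles using life logs. First, trend discovery with unstructured data uses social media analysis to classify ordinary people's everyday consumption and interest domains, assembles keyword sets that cover the main vocabulary for the concepts, behaviors, and product types and items in each domain, and then carries out trend analysis. If keywords are extracted by factors such as age group and gender, changes over time in the lifestyle, life values, and sentiment of the exact target group a company wants to appeal to can be analyzed. (송길영, 여기에 당신의 욕망이 보인다, 빅 데이터가 찾아낸 70억 욕망의 지도, 쌤앤파커스, 2012, pp. 180-181)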
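The keyword-set approach can be illustrated with a small sketch: define a keyword set per interest domain, then count hits per period for one demographic group. The domains, keyword sets, sample posts, and the `trend` function are hypothetical, and counting distinct keyword hits per post is just one simple scoring choice.

```python
from collections import defaultdict

# Hypothetical keyword sets, one per consumption/interest domain.
KEYWORD_SETS = {
    "beauty": {"makeup", "lipstick", "selfie"},
    "study": {"exam", "library", "lecture"},
}

# Hypothetical posts: (month, age_group, text).
tagged_posts = [
    ("2015-03", "20s", "makeup and selfie before the lecture"),
    ("2015-03", "20s", "library all night, exam tomorrow"),
    ("2015-04", "20s", "new lipstick"),
    ("2015-04", "30s", "exam proctoring"),
]

def trend(posts, age_group):
    """Count distinct keyword hits per (month, domain) for one age group."""
    hits = defaultdict(int)
    for month, group, text in posts:
        if group != age_group:
            continue
        tokens = set(text.lower().replace(",", " ").split())
        for domain, keywords in KEYWORD_SETS.items():
            hits[(month, domain)] += len(tokens & keywords)
    return dict(hits)

twenties = trend(tagged_posts, "20s")
for key in sorted(twenties):
    print(key, twenties[key])
```

Filtering by `age_group` before counting corresponds to extracting keywords by age or gender, so that changes are tracked within one target group over time.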

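Searching for new lifestyles with life logs means using the log files generated as an individual's life leaves traces online to discover representative or new lifestyles of a target group. Although services that analyze unstructured data from life logs are currently limited, more detailed and quantified lifestyle information is expected to become available once the life-log information sets (location, purchase history, and so on) transmitted by smart sensors in smart devices, such as GPS, cameras, and NFC, are put to use. In particular, if life-log information is linked to a company's CRM, the company can obtain more customized information for discovering the lifestyles and trends of its main consumer base.

The sociocultural context, consumer trends, and broad currents in consumer values derived with these unstructured-data analysis techniques feed into the goal of deriving an overall design concept, through current design trends, new lifestyles, and an awareness of sentiment, for creating markets and developing products that did not previously exist.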

Source: http://www.edureka.co/blog/answering-the-big-question-what-is-big-data/
Source: https://i-msdn.sec.s-msft.com/dynimg/IC197174.gif

A distributed file system (DFS) is a file system whose data is stored on a server. The data can be accessed and processed as if it were stored on the local machine. A DFS makes it convenient to share information in a controlled manner.



A DFS allows more efficient and better-managed sharing of data and storage on a network than other options. It allows faster processing of huge amounts of data by processing the data at various locations and then combining the results to give the desired output. With Big Data technologies like Hadoop, a cluster can be scaled to hundreds or even thousands of nodes. In this way, MapReduce functions can be executed on smaller subsets of larger data sets, thereby providing the scalability needed for Big Data processing.
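The "process at various locations, then combine" pattern can be imitated on one machine. In the sketch below, worker threads stand in for cluster nodes and list slices stand in for data blocks; this is an analogy under those assumptions, not Hadoop code, and the names `split`, `process_chunk`, and `combine` are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split(records, n_parts):
    """Partition records into n_parts chunks (stand-ins for blocks on nodes)."""
    return [records[i::n_parts] for i in range(n_parts)]

def process_chunk(chunk):
    """Work done locally on one node: count the words in its own chunk."""
    return Counter(word for line in chunk for word in line.split())

def combine(partials):
    """Merge the per-node partial results into the final output."""
    return reduce(lambda a, b: a + b, partials, Counter())

lines = ["big data", "data node", "big cluster", "data data"]

# Each worker thread plays the role of one node processing a subset.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_chunk, split(lines, 2)))

total = combine(partials)
print(total)
```

Because each chunk is counted independently and the merge is a simple sum, adding more workers changes only how the data is split, not the final result, which is what makes this pattern scale.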


Source: http://docs.oracle.com/cd/B19306_01/appdev.102/b14259/xdb02rep.htm


This diagram shows four boxes: a, b, c, and d. Box a includes the words Data Structure? inside it. Box b includes, from top to bottom, the word Access?, a box labeled Repository Path Access, and a box labeled SQL Query Access. Box c includes, from top to bottom, the word Language?, and the bullet points Java, JDBC, PL/SQL, and C or C++. Box d includes, from top to bottom, the words Processing and Data Manipulation?, and the bullet points DOM, SQL inserts/updates, XSLT, Queriability, and Updatability.



This diagram shows the data storage model. The words How Structured is Your Data? appear at the top of the diagram. Three lines connect below to three separate boxes named, from left to right, Structured Data, Semi-structured Pseudo-structured Data, and Unstructured Data.

The box labeled Structured Data has two lines that connect below to the words XML Schema Based? and the words Non-Schema Based?. XML Schema Based? connects below to the words Use either: CLOB or Structured Storage. Non-Schema Based? connects below to the words Store as:, which list three bullet points: CLOB in XMLType Table, File in Repository Folder Views, and Access through Resource APIs.

The box labeled Semi-structured Pseudo-structured Data has two lines that connect below to the words XML Schema Based? and Non-Schema Based?. The words XML Schema Based? connect below to the words Use either:, which have three bullet points listed below: CLOB, Structured, and Hybrid Storage (semi-structured storage). The words Non-Schema Based? connect below to the words Store as:, which have three bullet points listed below: CLOB in XMLType Table, File in Repository Folder Views, and Access through Resource APIs.

The box labeled Unstructured Data connects to the words Store as:, which have three bullet points listed below: CLOB in XMLType Table, File in Repository Folder Views, and Access through Resource APIs.
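The decision tree described in this diagram can be written down as a lookup table. The sketch below only encodes the labels given in the description; the function name `storage_options` and the dictionary layout are illustrative assumptions, not an Oracle API.

```python
# Options listed under "Store as:" for every non-schema-based branch.
NON_SCHEMA_OPTIONS = [
    "CLOB in XMLType Table",
    "File in Repository Folder Views",
    "Access through Resource APIs",
]

# (how structured, XML Schema based?) -> options named in the diagram.
STORAGE_OPTIONS = {
    ("structured", True): ["CLOB", "Structured Storage"],
    ("structured", False): NON_SCHEMA_OPTIONS,
    ("semi-structured", True): [
        "CLOB",
        "Structured",
        "Hybrid Storage (semi-structured storage)",
    ],
    ("semi-structured", False): NON_SCHEMA_OPTIONS,
}

def storage_options(structure, schema_based=False):
    """Return the storage options the diagram lists for a kind of data."""
    if structure == "unstructured":
        # The diagram asks no schema question for unstructured data.
        return NON_SCHEMA_OPTIONS
    return STORAGE_OPTIONS[(structure, schema_based)]

print(storage_options("semi-structured", schema_based=True))
print(storage_options("unstructured"))
```

Written this way, it is easy to see that the three non-schema branches all end in the same three options, and that only schema-based data opens up structured or hybrid storage.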



The figure shows a tree structure with two branches. The top node is labeled Oracle XML DB Data Access Options. The children of Oracle XML DB Data Access Options are Query-Based Access and Path-Based Access. The Query-Based Access node expands to Use SQL, which expands to Available Language and XMLType APIs (one node), which has three branches: JDBC, PL/SQL, and C (OCI). The Path-Based Access node expands to Use Repository, which expands to Available Languages and APIs (one node), which has three branches: SQL (RESOURCE_/PATH_VIEW), FTP, and HTTP/WebDav.
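The access-options tree in this figure can likewise be held as a small nested structure and queried. The labels are copied from the figure description; the dictionary layout and the helper names `apis_for` and `modes_offering` are hypothetical and exist only for this illustration.

```python
# Tree from the figure: access mode -> what it uses -> available APIs.
ACCESS_OPTIONS = {
    "Query-Based Access": {
        "use": "SQL",
        "apis": ["JDBC", "PL/SQL", "C (OCI)"],
    },
    "Path-Based Access": {
        "use": "Repository",
        "apis": ["SQL (RESOURCE_/PATH_VIEW)", "FTP", "HTTP/WebDav"],
    },
}

def apis_for(mode):
    """Return the leaf APIs listed under one access mode."""
    return ACCESS_OPTIONS[mode]["apis"]

def modes_offering(api_prefix):
    """Return the access modes that list an API starting with `api_prefix`."""
    return [
        mode
        for mode, node in ACCESS_OPTIONS.items()
        if any(api.startswith(api_prefix) for api in node["apis"])
    ]

print(apis_for("Path-Based Access"))
print(modes_offering("FTP"))
```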

Source: http://docs.oracle.com/cd/B19306_01/appdev.102/b14259/img/adxdb006.gif


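Next week (Week 10):

Why are meta tags needed in educational information? What is an educational information search service?

Recording: From Week 11 onward, we will use Q&A sessions to study in depth the educational values, educational meta values, weights, and so on that you have in mind.

Midterm report check
- Share your materials on www.slideshare.net by midnight on Tuesday of Week 9 (based on e-mail arrival time)
- Send PPT, PDF, and other files to [email protected]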

http://www.slideshare.net/


The following materials were prepared to supplement my modest presentation. If you need anything, please e-mail me at this address at any time.

[email protected]

[email protected]