Design and construction of a text-mining-based platform for classifying and rating posts in a blog network, for Betazeta Networks S.A.

Camilo López A.

Thesis topic presentation v1


General Objective

The general objective of this work is to support the manual processing of large volumes of posts in the Betazeta blog network by designing and implementing a prototype that categorizes this data automatically using text mining.

Specific Objectives

1. Understand in depth the company's problem and context, and acquire the necessary knowledge of text mining and of the relevant models and methodologies.

2. Select the historical data, the queries over it, and the models that allow successful categorization predictions.

3. Establish methods and metrics for evaluating the proposed solution.

4. Using the knowledge acquired in the previous objectives, design the automatic post categorization process.

5. Design and implement a prototype that lets the user enter information in a form suitable for analysis, and lets the company process, filter, and publish it according to business criteria.

6. Implement the evaluation methodology.

The Problem

Betazeta

7.5 million monthly visits

User Generated Content

Volume

Google Categories

Spam

Content filter

Categorize

Theoretical Background

Data Mining

Text cleaning

Stemming

Stop words

Dictionary

Latent Dirichlet Allocation (LDA)

[Illustration: a block of placeholder text in which passages are progressively labelled A, B and C, followed by the labels A, B and C shown next to LDA]

Platform Description

Training

Stages: historical data, cleaning, training

Historical data: scraping

Cleaning: stop words, frequency 1, cross-cutting frequency
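The cleaning filters listed above can be sketched as follows. This is a minimal illustration under assumptions the slides do not state: "frequency 1" is read as dropping words that occur only once in the whole corpus, and "cross-cutting frequency" (frecuencia transversal) as dropping words that appear in a large share of the posts of every blog. The function name `build_vocabulary`, the `max_share` threshold and the tiny stop-word list are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny Spanish stop-word list for illustration.
STOP_WORDS = {"el", "la", "de", "y", "en", "que", "un", "una"}


def build_vocabulary(posts, max_share=0.5):
    """Return the set of words kept after the three cleaning filters.

    posts: list of (blog_name, list_of_tokens) pairs.
    max_share: assumed threshold; a word counts as cross-cutting when it
    appears in more than this share of the posts of every blog.
    """
    total = Counter()                  # corpus-wide term frequency
    doc_freq = defaultdict(Counter)    # blog -> word -> number of posts
    posts_per_blog = Counter()

    for blog, tokens in posts:
        posts_per_blog[blog] += 1
        total.update(tokens)
        doc_freq[blog].update(set(tokens))

    vocabulary = set()
    for word, count in total.items():
        if word in STOP_WORDS:
            continue                   # filter 1: stop words
        if count == 1:
            continue                   # filter 2: frequency 1
        cross_cutting = all(
            doc_freq[blog][word] / posts_per_blog[blog] > max_share
            for blog in posts_per_blog
        )
        if cross_cutting:
            continue                   # filter 3: common to every blog
        vocabulary.add(word)
    return vocabulary


# Toy corpus (invented for this example, not taken from the Betazeta data).
posts = [
    ("Ferplei", "hoy el gol de chile en la copa".split()),
    ("Ferplei", "hoy chile gana la copa".split()),
    ("FayerWayer", "hoy un nuevo telefono android".split()),
    ("FayerWayer", "hoy android en el telefono".split()),
]
vocabulary = build_vocabulary(posts)
print(sorted(vocabulary))
```

In this toy corpus "hoy" is dropped because it is frequent on both blogs, while blog-specific words such as "copa" or "android" survive; a real threshold would need tuning against the historical data.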

Training: LDA, storage, filtering
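To make the LDA training step concrete, here is a small collapsed Gibbs sampler written with the standard library only. The slides do not say which LDA implementation or inference method the platform uses, so this is a generic sketch rather than the author's method; the hyperparameters `alpha` and `beta`, the iteration count and the helper `top_words` are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists.
    Returns (doc_topic counts, topic_word counts, vocabulary list).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # n(d, k)
    topic_word = [[0] * V for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                      # n(k)
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Most frequent words of each topic, e.g. for storing or inspecting."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Toy corpus (invented for this example).
docs = [
    "chile copa gol partido chile".split(),
    "gol partido copa seleccion".split(),
    "android telefono app android".split(),
    "telefono app android pantalla".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(k, words)
```

The count tables returned here are the kind of artefact a storage step could persist; which topics are kept or discarded afterwards would depend on filtering criteria that the slides do not detail.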

Categorization

Stages: plain text, cleaning, classification

Cleaning: stop words

Classification: LDA model, multinomial Naïve Bayes, other methods
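One of the classification methods named above, multinomial Naïve Bayes, can be sketched in a few lines. The slides do not show how the classifier is wired to the LDA model or which features it receives, so this example simply classifies cleaned token lists by blog; the class name, the add-one smoothing and the use of blog names as labels are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)         # documents per class
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / n_docs)
            for w in tokens:
                if w in self.vocab:    # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (invented for this example).
train_docs = [
    "chile copa gol partido".split(),
    "gol seleccion partido".split(),
    "android telefono app".split(),
    "telefono pantalla app".split(),
]
train_labels = ["Ferplei", "Ferplei", "FayerWayer", "FayerWayer"]

model = MultinomialNB().fit(train_docs, train_labels)
sports_guess = model.predict("gol de chile".split())
tech_guess = model.predict("nuevo telefono android".split())
print(sports_guess, tech_guess)
```

Working in log space avoids numeric underflow on long posts, and the smoothing keeps a single unseen word from zeroing out a class.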

Python

Django

Model View Controller

MySQL

Web service

Results

Training

4,000 posts

1,000 FayerWayer

1,000 Wayerless

1,000 Belelú

1,000 Ferplei

5 topics detected

chile colo equipo copa partido barcelona futbol universidad jugador partidos

seleccion ex jugadores alexis goles club tecnico fecha sanchez chileno real torneo

madrid final america nacional catolica gran primera argentino estadio jugar

mundial san ahora luego volante delantero clausura campeon agosto nuevo gol

argentina primer sostuvo carlos mejor liga

93% Ferplei

Validation

350 posts

FayerWayer and Wayerless

            Sum    Frequency   TF-IDF   NB     NBM
Precision   88%    85%         83%      91%    92%
Recall      96%    97%         91%      94%    93%
F-Measure   92%    91%         87%      93%    93%

Belelú

            Sum    Frequency   TF-IDF   NB     NBM
Precision   91%    93%         79%      86%    84%
Recall      74%    67%         62%      85%    86%
F-Measure   82%    78%         70%      85%    85%

Ferplei

            Sum    Frequency   TF-IDF   NB     NBM
Precision   96%    94%         91%      100%   100%
Recall      96%    96%         96%      89%    94%
F-Measure   96%    95%         94%      94%    97%
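For reference, the three metrics reported above can be computed as in the sketch below. It assumes that F-Measure means the balanced F1 score (the harmonic mean of precision and recall), which the slides do not state explicitly; the function names and the toy label lists are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall and F-measure for one class against the rest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy validation labels (invented for this example).
true_labels = ["Ferplei", "Ferplei", "Ferplei", "Belelú", "Belelú", "Wayerless"]
predicted = ["Ferplei", "Ferplei", "Belelú", "Ferplei", "Belelú", "Wayerless"]
scores = precision_recall_f(true_labels, predicted, positive="Ferplei")
print(scores)

# Recomputing one reported cell from its precision and recall.
check = round(f_measure(0.88, 0.96), 2)
print(check)
```

Under this assumption, a precision of 88% and a recall of 96% give an F-Measure of 0.92, in line with the first column of the FayerWayer and Wayerless table.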

Training

14,400 posts

Validation

1,600 posts

Video games
Football
Music, parties and events
Mobile telephony
Automobiles
Relationships and social life
Global environment
Science and technology
Small-scale environment
Women and sexuality
Family and society
Space research
Tech gadgets
Services and technology
Automobiles: Top Gear

Complexity

Future Work

Improve text cleaning

Multi-model system

Blog prediction

Model improvements over time

Incorporate stemming

Thank you very much