
Empower Public Health through Social Media
Zhen Wang, Ph.D.
Insight Health Data Science

http://54.191.168.240

Text → Cleaning, Tokenizing → Convert to Feature Vectors

I like food!
Food is good!
I had some good food.

i, like, food
food, is, good
i, had, some, good, food

e.g., TF-IDF

I'm really good with numbers!

        i   like   food   is   good   had   some
doc 1   1    1      1     0     0      0     0
doc 2   0    0      1     1     1      0     0
doc 3   1    0      1     0     1      1     1
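The binary table above can be reproduced with a bag-of-words vectorizer; here is a minimal sketch using scikit-learn's CountVectorizer (note it orders the vocabulary alphabetically rather than by first appearance, so the columns are permuted relative to the table):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like food!", "Food is good!", "I had some good food."]

# Custom token pattern so one-letter words like "i" are kept
vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # ['food' 'good' 'had' 'i' 'is' 'like' 'some']
print(X.toarray())                  # one row of counts per document
```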

Downweight, Normalize → Numbers → Machine Learning
Natural Language Processing

Text Classification
[Plot: Distribution of Tweets — Number of Tweets vs. Normalized Retweet Counts, showing the sample imbalance]
Classification (0/1: Not Retweeted / Retweeted) with Logistic Regression

Threshold: 0.005

Misclassification Error: 22%
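A one-line illustration of the labeling rule: only the 0.005 threshold comes from the slide, and the retweet counts below are made up.

```python
import numpy as np

# Hypothetical normalized retweet counts; label 1 = retweeted above threshold
retweets_norm = np.array([0.000, 0.003, 0.012, 0.0007, 0.048])
labels = (retweets_norm > 0.005).astype(int)
print(labels)  # [0 0 1 0 1]
```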

[Diagram: 0/1 class balance in the Train/Test split; the majority class is downsampled for training]

Normalized Confusion Matrix:

             Predicted 0   Predicted 1
Actual 0         0.81          0.19
Actual 1         0.26          0.74

Code: github.com/zweinstein/SpreadHealth_dev
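A minimal, self-contained sketch of the train/test protocol described above: the majority class is downsampled for training, the test set keeps the original imbalance, and the confusion matrix is row-normalized. The data here is synthetic; the real pipeline (see the GitHub repo) trains on tf-idf vectors of tweets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for tf-idf features and 0/1 retweet labels
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=10_000) > 0.8).astype(int)  # imbalanced

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Downsample the majority class (0) so the training set is balanced
idx0 = np.where(y_train == 0)[0]
idx1 = np.where(y_train == 1)[0]
keep = np.concatenate([rng.choice(idx0, size=len(idx1), replace=False), idx1])

clf = LogisticRegression().fit(X_train[keep], y_train[keep])

# Row-normalized confusion matrix on the untouched, imbalanced test set
cm = confusion_matrix(y_test, clf.predict(X_test))
print((cm / cm.sum(axis=1, keepdims=True)).round(2))
```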

Zhen (Jen) Wang

Beta Tester since 2015

Editor since 2015

Traditional Medicine, Science Fiction, Public Speaking

Online Education
Ph.D. in Physical Chemistry

Thank you!

See the App in Action: http://54.191.168.240

Text Preprocessing Pipeline

Text Cleaning:
- Convert to lower case
- Replace URLs, hashtags (#), and mentions (@)
- Remove special characters other than emoticons
- Remove stopwords
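A sketch of such a cleaning step, assuming regex-based replacements; the exact placeholder tokens ('url', 'hashtag', 'user') and the stopword list are illustrative assumptions, not the app's actual choices.

```python
import re

# Illustrative stopword list; the real pipeline likely used a fuller
# list (e.g., NLTK's English stopwords) -- an assumption.
STOPWORDS = {"i", "a", "an", "the", "is", "was", "of", "to", "and"}

# Common sideways emoticons such as :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP]")

def clean_tweet(text):
    """Lowercase, replace URLs/#/@ with placeholder tokens, strip special
    characters while keeping emoticons, and drop stopwords."""
    text = re.sub(r"https?://\S+|www\.\S+", " url ", text)  # URLs -> 'url'
    emoticons = EMOTICON_RE.findall(text)                   # save emoticons
    text = EMOTICON_RE.sub(" ", text).lower()
    text = re.sub(r"#(\w+)", r" hashtag \1 ", text)         # '#flu' -> 'hashtag flu'
    text = re.sub(r"@\w+", " user ", text)                  # mentions -> 'user'
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # drop other specials
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens + emoticons)

print(clean_tweet("I like #food :) http://example.com @friend"))
# -> 'like hashtag food url user :)'
```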

Tokenizing:
- Split each document into individual elements (Bag-of-Words or N-grams)
- Stemming: the Porter stemmer was used; the Snowball and Lancaster stemmers are faster but more aggressive
- Lemmatization is computationally more expensive but has little impact on the performance of text classification

Term Frequency-Inverse Document Frequency (tf-idf):
- Term frequency tf(t, d): the number of times a term t occurs in a document d
- Document frequency df(d, t): the number of documents d that contain the term t
- The inverse document frequency is used to downweight frequently occurring words in the feature vectors tf(t, d)
- The implementation in Scikit-learn (smoothed by default): idf(t) = ln[(1 + n) / (1 + df(d, t))] + 1, so tf-idf(t, d) = tf(t, d) × idf(t)

Lastly, L2 normalization is applied to each tf-idf vector.
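A minimal sketch of this tokenize → stem → tf-idf step, using NLTK's Porter stemmer and scikit-learn's TfidfVectorizer; the toy documents reuse the earlier example, while the real pipeline runs on cleaned tweets.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    """Split on whitespace and Porter-stem each token."""
    return [stemmer.stem(tok) for tok in text.split()]

# Scikit-learn's default tf-idf: tf(t, d) * idf(t) with
# idf(t) = ln((1 + n) / (1 + df(t))) + 1, then L2 row normalization.
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, norm="l2", smooth_idf=True)
docs = ["i like food", "food is good", "i had some good food"]
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # each row has unit L2 norm
```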

Prediction Algorithms

Train dataset: 10,000 tweets on diabetes (4,782 retweeted)
Test set accuracy (random chance 0.49 on the positive class):
- KNN: 60%
- Naive Bayes: 67%
- Logistic Regression: 75% (chosen, and tested on the imbalanced test data)

Potential improvements:
- Decision trees with bagging/boosting (e.g., Random Forest, XGBoost)
- Other features: polarity & sentiment, tweet length
- Out-of-core incremental learning with Stochastic Gradient Descent, an advantage of Logistic Regression (see the sketch below)
- Automatic updates to the SQLite database and to the classifier
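For the out-of-core item above, a sketch using scikit-learn's SGDClassifier with partial_fit; fetch_batches is a hypothetical stand-in for reading new labeled tweets from the SQLite database.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def fetch_batches(n_batches=5, batch_size=200):
    """Hypothetical generator standing in for mini-batches of newly
    labeled tweets pulled from the SQLite database."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 20))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

# Logistic regression trained with SGD supports incremental updates,
# so the classifier can keep learning from data that never fits in memory.
# (loss="log_loss" is named "log" in older scikit-learn versions.)
clf = SGDClassifier(loss="log_loss", random_state=0)
for X_batch, y_batch in fetch_batches():
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])
```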