Empower Public Health through Social Media
Zhen Wang, Ph.D. | Insight Health Data Science
http://54.191.168.240
Text → Cleaning, Tokenizing → Convert to Feature Vectors
Example documents: "I like food!" | "Food is good!" | "I had some good food."
After cleaning and tokenizing: [i, like, food] | [food, is, good] | [i, had, some, good, food]
e.g., TF-IDF
I'm really good with numbers!
Bag-of-words count matrix (columns: vocabulary; rows: documents):

       i  like  food  is  good  had  some
Doc 1  1   1     1    0    0     0    0
Doc 2  0   0     1    1    1     0    0
Doc 3  1   0     1    0    1     1    1
Downweight and normalize the counts: natural language processing turns text into numbers for machine learning.
Text Classification

[Figure: distribution of tweets, normalized retweet counts vs. number of tweets, showing sample imbalance]

Classification (0/1: not retweeted / retweeted) with logistic regression.
Threshold: 0.005
Misclassification Error: 22%
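The thresholding step can be sketched as follows; this is a minimal sketch assuming a pandas DataFrame with a hypothetical `retweets_per_follower` column (the real column names and downsampling code are not shown on the slides):

```python
import pandas as pd

THRESHOLD = 0.005  # normalized retweet count above which a tweet counts as "retweeted"

def label_and_downsample(tweets, seed=0):
    """Binarize normalized retweet counts, then balance classes by downsampling."""
    tweets = tweets.copy()
    tweets["label"] = (tweets["retweets_per_follower"] > THRESHOLD).astype(int)
    n = tweets["label"].value_counts().min()  # minority-class size
    # Sample n rows from each class so the training set is balanced (0/1)
    parts = [grp.sample(n, random_state=seed) for _, grp in tweets.groupby("label")]
    return pd.concat(parts)
```

Downsampling is applied only to the training data; the test set keeps its original imbalance.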
[Figure: class counts (0/1) in the train and test sets after downsampling]

Normalized confusion matrix:

            Predicted 0   Predicted 1
True 0         0.81          0.19
True 1         0.26          0.74

Code: github.com/zweinstein/SpreadHealth_dev
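A row-normalized confusion matrix like the one above can be computed directly with scikit-learn; the labels here are toy values, not the project's data:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]

# normalize="true" divides each row by the number of true samples in that class,
# so rows sum to 1 (rows = true class, columns = predicted class)
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
```

Row normalization makes the diagonal entries directly readable as per-class recall, which is the useful view when the classes are imbalanced.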
Zhen (Jen) Wang
Beta Tester since 2015
Editor since 2015
Traditional Medicine | Science Fiction | Public Speaking | Online Education
Ph.D. in Physical Chemistry
Thank you!
See the App in Action: http://54.191.168.240
Text Preprocessing Pipeline

Text Cleaning:
- Convert to lower case
- Replace URLs, hashtags (#), and mentions (@)
- Remove special characters other than emoticons
- Remove stopwords
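The cleaning steps above can be sketched with regular expressions; this is a minimal sketch, and the exact patterns, replacement tokens (`url`, `user`), emoticon list, and stopword set are assumptions rather than the project's actual code:

```python
import re

# Stand-in for nltk.corpus.stopwords.words("english")
STOPWORDS = {"i", "is", "a", "the", "and", "to", "of"}
# Toy emoticon pattern: eyes, optional nose, mouth
EMOTICONS = re.compile(r"[:;=][-']?[)(DdPp]")

def clean_tweet(text):
    text = text.lower()                            # 1. lower case
    text = re.sub(r"https?://\S+", " url ", text)  # 2. replace URLs
    text = re.sub(r"#(\w+)", r" \1 ", text)        #    replace hashtags
    text = re.sub(r"@\w+", " user ", text)         #    replace @mentions
    emoticons = EMOTICONS.findall(text)            # 3. save emoticons...
    text = re.sub(r"[^a-z0-9\s]", " ", text)       #    ...drop other special chars
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 4. drop stopwords
    return " ".join(tokens + emoticons)
```

Keeping the emoticons is worthwhile for tweets: they carry sentiment that would otherwise be stripped with the punctuation.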
Tokenizing:
- Split each document into individual elements (bag-of-words or n-grams)
- Stemming: the Porter stemmer was used; the Snowball and Lancaster stemmers are faster but more aggressive
- Lemmatization is computationally more expensive but has little impact on text-classification performance

Term Frequency-Inverse Document Frequency (tf-idf):
- Term frequency tf(t,d): the number of times a term t occurs in a document d
- Document frequency df(d,t): the number of documents d that contain the term t
- idf is used to downweight frequently occurring words in the feature vectors tf(t,d)
- Implemented with scikit-learn
Lastly, L2 normalization is applied to each feature vector.
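The tf-idf weighting with L2 normalization is what scikit-learn's TfidfVectorizer does by default; a minimal sketch on the toy documents from earlier (the widened `token_pattern` is again needed to keep "i"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["i like food", "food is good", "i had some good food"]

# norm="l2" (the default) rescales each document vector to unit length,
# so documents of different lengths become comparable
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", norm="l2")
X = tfidf.fit_transform(docs)

print(X.toarray())  # rare terms (like, is, had, some) get the largest weights
```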
Prediction Algorithms

Train dataset: 10,000 tweets on diabetes (4,782 retweeted).

Test set accuracy (random chance on the positive class: 0.49):
- KNN: 60%
- Naive Bayes: 67%
- Logistic regression: 75% (chosen, and tested on imbalanced test data)

Potential improvements:
- Decision trees with bagging/boosting (e.g., Random Forest, XGBoost)
- Other features: polarity & sentiment, tweet length
- Out-of-core incremental learning with stochastic gradient descent (an advantage of logistic regression)
- Automatic updates to the SQLite database and to the classifier