Data Mining of Chinese Social Media

dc.contributor.advisor: Wallach, Daniel S. [en_US]
dc.contributor.committeeMember: Jermaine, Christopher M [en_US]
dc.contributor.committeeMember: Bronk, Chris [en_US]
dc.creator: Shu, Anhei [en_US]
dc.date.accessioned: 2016-01-25T21:29:03Z [en_US]
dc.date.available: 2016-01-25T21:29:03Z [en_US]
dc.date.created: 2014-12 [en_US]
dc.date.issued: 2014-10-31 [en_US]
dc.date.submitted: December 2014 [en_US]
dc.date.updated: 2016-01-25T21:29:03Z [en_US]
dc.description.abstract: We present measurements and analysis of censorship on Weibo, a popular microblogging site in China. Since we were limited in the rate at which we could download posts, we identified users likely to participate in sensitive topics and recursively followed their social contacts, biasing our search toward a subset of Weibo where we hoped to be more likely to observe censorship. Our architecture enables us to detect post deletions within one minute of the deletion event, giving us a high-fidelity view of what is being deleted by the censors and when. We found that deletions happen most heavily in the first hour after a post has been submitted. Focusing on original posts, not reposts/retweets, we observed that nearly 30% of the total deletion events occur within 5-30 minutes. Nearly 90% of the deletions happen within the first 24 hours. Leveraging our data, we also consider a variety of hypotheses about the mechanisms used by Weibo for censorship, such as the extent to which they use retrospective keyword-based censorship, and how repost/retweet popularity interacts with censorship. By leveraging natural language processing techniques, we also perform a topical analysis of the deleted posts, overcoming the neologisms, named entities, and informal language that typify Chinese social media. Using Independent Component Analysis, we find that the topics where mass removal happens the fastest are those that combine events that are hot topics in Weibo as a whole (e.g., the Beijing rainstorms or a sex scandal) with themes common to sensitive posts (e.g., Beijing, government, China, and policeman).
Air pollution is a pressing concern for industrialized countries. Air quality measurements and their interpretations often take on political overtones. Similar concerns reflect our understanding of what levels of measured pollution correspond to different levels of human nuisance, impairment, or injury. In this paper, we consider air pollution metrics from four large Chinese cities (both U.S. embassy/consulate data and Chinese domestic measurements) and compare them to a large volume of discussions on Weibo (a popular Chinese microblogging system). In the city with the worst PM2.5, Beijing, we found a strong correlation (R=0.82) between Chinese use of pollution-related terms and the ambient pollution. In other Chinese cities with lower pollution, the correlation was weaker. Nonetheless, our results show that social media may be a valuable proxy measurement for pollution, particularly when traditional measurement stations are unavailable or when their output is censored or misreported. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Shu, Anhei. "Data Mining of Chinese Social Media." (2014) Diss., Rice University. https://hdl.handle.net/1911/88119. [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/88119 [en_US]
dc.language.iso: eng [en_US]
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.subject: Social network [en_US]
dc.subject: microblog [en_US]
dc.subject: data mining [en_US]
dc.subject: topic extraction [en_US]
dc.subject: censorship [en_US]
dc.title: Data Mining of Chinese Social Media [en_US]
dc.type: Thesis [en_US]
dc.type.material: Text [en_US]
thesis.degree.department: Computer Science [en_US]
thesis.degree.discipline: Engineering [en_US]
thesis.degree.grantor: Rice University [en_US]
thesis.degree.level: Doctoral [en_US]
thesis.degree.name: Doctor of Philosophy [en_US]
Files
Original bundle
Name: SHU-DOCUMENT-2014.pdf
Size: 2.7 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text
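
The abstract reports a correlation of R=0.82 between the use of pollution-related terms on Weibo and ambient pollution in Beijing. As a rough illustration of what such a figure measures, the sketch below computes a Pearson correlation coefficient between two equal-length series. The function name `pearson_r`, the variable names, and the sample values are all hypothetical placeholders; they are not data or code from the thesis, and the record does not state which correlation statistic or aggregation interval the author used.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values for illustration only (NOT from the thesis):
# one PM2.5 reading and one count of pollution-related posts per day.
pm25_readings = [35.0, 80.0, 150.0, 220.0, 310.0]
pollution_term_posts = [12, 30, 41, 77, 95]

r = pearson_r(pm25_readings, pollution_term_posts)
print(f"r = {r:.2f}")
```

A value near 1 means the two series rise and fall together, which is the sense in which the abstract describes social media as a possible proxy measurement for pollution.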