<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ref="http://purl.org/rss/1.0/modules/reference/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/">
	<channel rdf:about="http://www.di.uniovi.es/~dani/PFCblog/rss.rdf">
		<title>Blog para proyectantes de Dani Gayo</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php</link>
		<description><![CDATA[Ad verba per numeros]]></description>
		<items>
			<rdf:Seq>
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry100222-163216" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091226-222300" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091217-172617" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091208-122700" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091125-020557" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090922-233600" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090910-120021" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090816-211753" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090805-235915" />
				<rdf:li resource="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090731-214703" />
			</rdf:Seq>
		</items>
	</channel>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry100222-163216">
		<title>Not a kinky post - Reviewing &quot;Superfreakonomics&quot; </title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry100222-163216</link>
		<description><![CDATA[Last week I finished <a href="http://freakonomicsbook.com/" target="_blank" >&quot;Superfreakonomics&quot;</a> by Levitt and Dubner. In case you haven&#039;t read <a href="http://freakonomicsbook.com/" target="_blank" >&quot;Freakonomics&quot;</a> by the same authors, just skip this review and read that book. In case you have read it and didn&#039;t like it, just skip this review. <br /><br />Still here? OK. My opinions on the book. First of all, I recommend it, although I think Freakonomics is much better because, to me, Superfreakonomics is just a second installment of the first book. I mean, I enjoyed S-F, but in the same way I&#039;d enjoy a good sequel to a great movie. <br /><br />As with Freakonomics, S-F has a rather zigzagging narrative which, in a few chapters, makes it hard to follow (or even find) the authors&#039; line of argument. However, I think it&#039;s a rather common style nowadays (I&#039;m not sure if this is good, bad, or just the opposite) with some authors (for instance Malcolm Gladwell).<br /><br />Again, the authors have chosen some topics which they (probably) thought could be controversial (aka sale-boosting) like prostitution and fighting global warming. Needless to say, none of them are that controversial, and the global cooling measures described in the book are, at most, intriguing.<br /><br />There are, however, two chapters that I think are really worth reading: those on terrorism and altruism. The first one, &quot;Why should suicide bombers buy life insurance?&quot;, will probably appeal to those of you with data-mining inclinations (although it just provides some glimpses of the topic). 
The latter, &quot;Unbelievable stories about apathy and altruism&quot;, is, well, really interesting, and helps to understand the issues with controlled experiments in sociology and psychology (after all, any measurement changes the thing measured even when the instrument is just the researcher).<br /><br />Additionally, I found the epilogue, &quot;Monkeys are people too&quot;, really funny, which is a kind of bonus for the book :)<br /><br />My recommendation? Get the book and read it. Is it going to be useful for you? I don&#039;t really know whether, or in what way, it can be useful. However, if you are a researcher somewhat related to sociology/psychology/human interaction it can help you to be a little &quot;freakier&quot;, encourage you to ask tougher questions, and push you to think outside the box. To me, this alone makes Superfreakonomics worth the money.<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091226-222300">
		<title>Papers to read from WSDM 2010</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091226-222300</link>
		<description><![CDATA[The list of <a href="http://www.wsdm-conference.org/2010/accepted-papers.html" target="_blank" >accepted papers for WSDM 2010</a> is available (thanks to <a href="http://twitter.com/mstrohm" target="_blank" >@mstrohm</a>). As usual, I&#039;ve prepared my list of papers to read (with their corresponding PDFs); the papers I&#039;d like to read but which are not yet available appear at the end of the list :(<br /><ul><li><a href="http://ciir.cs.umass.edu/~metzler/metzler-wsdm10a.pdf">Learning Concept Importance Using a Weighted Dependence Model</a>
<li><a href="http://research.yahoo.com/files/wsdm339-goyal.pdf">Learning Influence Probabilities In Social Networks</a>
<li><a href="http://www.ccs.neu.edu/home/amislove/publications/Inferring-WSDM.pdf">You are who you know: Inferring user profiles in Online Social Networks</a>
<li><a href="http://maroo.cs.umass.edu/pub/web/getpdf.php?id=900">Query Reformulation Using Anchor Text</a>
<li><a href="http://research.microsoft.com/pubs/115681/wsdm-yu.pdf">SBotMiner: Large Scale Search Bot Detection</a>
<li><a href="http://isiosf.isi.it/~cattuto/papers/wsdm2010_schifanella.pdf">Folks in folksonomies: Social link prediction from shared metadata</a>
<li><a href="http://www.cs.columbia.edu/~gravano/Papers/2010/wsdm10.pdf">Learning Similarity Metrics for Event Identification in Social Media</a>
<li><a href="http://www.mysmu.edu/staff/jsweng/papers/TwitterRank_WSDM.pdf">TwitterRank: Finding Topic-sensitive Influential Twitterers</a>
<li>Ranking with Query-Dependent Loss for Web Search
<li>Personalized Click Prediction in Sponsored Search
<li>Towards Recency Ranking in Web Search
<li>Adapting Information Bottleneck Method for Automatic Construction of Domain-oriented Sentiment Lexicon
<li>Cumulating Relevance: A Model to Estimate Document Relevance from the Clickthrough Logs
<li>Leveraging Temporal Dynamics of Document Content in Relevance Ranking
<li>Beyond DCG: User Behaviour as a Predictor of a Successful Search
<li>A Novel Click Model and Its Applications to Online Advertising
</ul><br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091217-172617">
		<title>On Twitter and influence...</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091217-172617</link>
		<description><![CDATA[<a href="http://www.inqmobile.com/lang/en/blog/twitterati/" target="_blank" >INQ has just released a Twitterati list</a> (their name); that is, a list containing the most influential twitterers. I learned of this thanks to <a href="http://twitter.com/Zee" target="_blank" >Zee M Kane</a> who, according to that list, is the 7th most influential Briton. <br /><br />I was curious enough to find out how influence was computed and I eventually reached <a href="http://www.inqmobile.com/src/pr/20091217_Twitterati_revealed.pdf" target="_blank" >the actual report</a> (PDF). There, in a footnote, you can find that INQ did not actually compute the twitterers&#039; influence but, instead, relied on the <a href="http://www.twitalyzer.com" target="_blank" >Twitalyzer.com</a> service.<br /><br />Needless to say, Twitalyzer is one of the few systems that compute a twitterer&#039;s influence. Another one is <a href="http://tunkrank.com/" target="_blank" >TunkRank</a>, which is an implementation of <a href="http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/" target="_blank" >an idea by Daniel Tunkelang</a>. TunkRank is a kind of PageRank for Twitter and, thus, I find it much more appealing: I mean, there are not really (or at least not many) <i>ad hoc</i> decisions and its results can be replicated (provided you have a Twitter graph).<br /><br />Actually, I have assembled such a subgraph (about 1.5 million users) and have applied PageRank (not yet TunkRank) to it, and this is my list of &quot;most influential British Twitter users&quot;. But first, a word of warning: I&#039;ve just looked for the users appearing in the <a href="http://www.telegraph.co.uk/technology/twitter/6832287/Most-influential-British-Twitter-users-revealed.html" target="_blank" >Telegraph&#039;s list</a> and just added <a href="http://en.wikipedia.org/wiki/Richard_Branson" target="_blank" >Richard Branson</a>, who seems to be an odd omission. 
Now, the list:<br /><ol>
<li>stephenfry
<li>mashable
<li>rustyrockets
<li>richardbranson
<li>imogenheap
<li>richardpbacon
<li>calvinharris
<li>andy_murray
<li>mayoroflondon
<li>sarahbrown10
<li>mrpeterandre
<li>tommcfly
<li>zee
<li>suziperry
<li>johnprescott
<li>tom_watson
<li>campbellclaret
<li>dougiemcfly</ol><br />As you can see, apart from including Richard Branson (and not including some users that do not appear in my sample, such as Eddie Izzard) there are only minor, but interesting, differences: for instance, Stephen Fry is more influential than Pete Cashmore from Mashable and the Mayor of London is more influential than the PM&#039;s wife, Sarah Brown; with regard to Tom and Dougie from McFly, Tom is much more influential than Dougie (remember, there are other twitterers in between these celebrities).<br /><br />Oh, and what about that &quot;contest&quot; between Ashton Kutcher and CNN? It seems that both aplusk and cnnbrk share first place among Twitter influencers. Yup...<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091208-122700">
		<title>The Numerati reviewed by a &quot;numerati wannabe&quot;</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091208-122700</link>
		<description><![CDATA[I learned of <a href="http://thenumerati.net/" target="_blank" >&quot;The Numerati&quot;</a> by Stephen Baker a couple of weeks ago in El Pais (a Spanish newspaper). <a href="http://thenumerati.net/index.cfm?postID=449" target="_blank" >He signed the short story &quot;Nos vigilan&quot;</a> (They are watching us) which depicts a scenario where &quot;scientists&quot; (mainly mathematicians and engineers) apply different tools and techniques in order to detect, track, and predict users&#039; behaviour in different situations (e.g. in the supermarket, during elections, when using the Internet, etc.)<br /><br />The tone of the story is a bit Big-Brother-esque for my taste and, in my opinion, it fails to provide a realistic, albeit simplified, picture of the actual state of the art. <br /><br />Because of that, I was reluctant to buy the book and, thus, I searched for some reviews of it. <a href="http://www.ams.org/notices/200909/rtx090901109p.pdf" target="_blank" >This one</a>, by Jeffrey Shallit, was pretty useful because it helped me to set my expectations for the book, which I eventually bought and read.<br /><br />After reading it, I must say that I mostly agree with Shallit&#039;s review: &quot;The Numerati&quot; is not a book for the technically savvy, and it probably also fails when introducing the field to the lay person. However, I haven&#039;t found it totally unenjoyable: despite the title and the short story in the newspaper, the book is not the mixture of Big Brother paranoia and Dan Brown I was afraid of. 
It describes (or tries to describe) several fields where data mining and machine learning can be applied, the challenges researchers are finding, and the goals they try to reach.<br /><br />All in all, I really recommend this book to those interested in data mining; hopefully, some of you could write a kind of version 2 of &quot;The Numerati&quot; providing a more accurate picture, while appealing to the general public at the same time.<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091125-020557">
		<title>Swarm Information Foraging (i.e. another way of exploiting crowd intelligence to improve search engines)</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry091125-020557</link>
		<description><![CDATA[Last Friday I learned of <a href="http://www.wowd.com" target="_blank" >Wowd</a>; it is, in their words, <i>&quot;a real-time search engine for discovering what&#039;s popular on the web right now&quot;</i>. Wowd exploits crowd intelligence in a really smart way: first, there is no crawler, since the pages visited by the users are submitted to the index and, second, ranking is determined from the attention the users pay to each page.<br /><br />All of this is somewhat related to a paper we have under review at this moment and, thus, we decided to release a <a href="http://arxiv.org/abs/0911.3979" target="_blank" >draft report</a> as a preprint:<br /><br /><blockquote><b>Making the road by searching - A search engine based on Swarm Information Foraging</b> by Daniel Gayo-Avello and David J. Brenes. <i><b>Abstract:</b> Search engines are nowadays one of the most important entry points for Internet users and a central tool to solve most of their information needs. Still, there exist a substantial amount of users&#039; searches which obtain unsatisfactory results. Needless to say, several lines of research aim to increase the relevancy of the results users retrieve. In this paper the authors frame this problem within the much broader (and older) one of information overload. They argue that users&#039; dissatisfaction with search engines is a currently common manifestation of such a problem, and propose a different angle from which to tackle with it. As it will be discussed, their approach shares goals with a current hot research topic (namely, learning to rank for information retrieval) but, unlike the techniques commonly applied in that field, their technique cannot be exactly considered machine learning and, additionally, it can be used to change the search engine&#039;s response in real-time, driven by the users behavior. 
Their proposal adapts concepts from Swarm Intelligence (in particular, Ant Algorithms) from an Information Foraging point of view. It will be shown that the technique is not only feasible, but also an elegant solution to the stated problem; what&#039;s more, it achieves promising results, both increasing the performance of a major search engine for informational queries, and substantially reducing the time users require to answer complex information needs.</i></blockquote><br /><br />Curiously enough, a few hours after the preprint was available I received an interview request from a French journalist, which resulted in <a href="http://www.atelier.fr/applications/10/24112009/moteur-de-recherche-fourmi-essaim-sous-couche-logicielle-resultats-ponderer-popularite-39016-.html" target="_blank" >this description of our research (in French)</a>.<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090922-233600">
		<title>Interesting papers, useful software, and dubiously ethical research</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090922-233600</link>
		<description><![CDATA[It&#039;s only Tuesday and I&#039;ve already read (OK, skimmed) a bunch of really inspiring/enjoyable papers.<br /><br />The first one I&#039;d like to highlight is <b><a href="http://www.law.berkeley.edu/files/MonroeColaresiQuinn.pdf" target="_blank" >&quot;Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict&quot;</a></b> by Monroe, Colaresi and Quinn.<blockquote>Entries in the burgeoning &quot;text-as-data&quot; movement are often accompanied by lists or visualizations of how word (or other lexical feature) usage differs across some pair or set of documents. These are intended either to establish some target semantic concept (like the content of partisan frames) to estimate word-specific measures that feed forward into another analysis (like locating parties in ideological space) or both. We discuss a variety of techniques for selecting words that capture partisan, or other, differences in political speech and for evaluating the relative importance of those words. We introduce and emphasize several new approaches based on Bayesian shrinkage and regularization. We illustrate the relative utility of these approaches with analyses of partisan, gender, and distributive speech in the U.S. Senate.</blockquote><br /><br />This paper discusses different ways to discover those terms which really define a party&#039;s position on a certain topic; from that starting point the authors discuss different approaches and point out their weaknesses. Many of those issues are highly relevant to other NLP or IR tasks and, thus, I selected some excerpts for your reflecting pleasure: <blockquote>One approach, standard in the machine learning literature, is to treat this [finding partisan terms] as a classification problem. In our example, we would attempt to find the words <i>(w)</i> that significantly predict partisanship <i>(p)</i>. 
A variety of established machine learning methods could be used [...]. These approaches would attempt to find some classifier function that mapped words to some unknown party label. The primary problem of this approach, for our purposes, is that it gets the data generation process backwards. <b>Party is not plausibly a function of word choice. Word choice is (plausibly) a function of party.</b></blockquote><blockquote>A common response [...] in many natural language processing applications is to eliminate &quot;function&quot; or &quot;stop&quot; words that are deemed unlikely to contain meaning. [...] We note, however, the practice of stop word elimination has been found generally to create more problems than it solves, across natural language processing applications. Manning et al. (2008) observe: <i>&quot;The general trend ... over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever&quot;</i>. They give particular emphasis to the problems of searching for phrases that might disappear or change meaning without stop words (e.g., <i>&quot;to be or not to be&quot;</i>). More to the point, this <i>ad hoc</i> solution diagnoses the problem incorrectly. Function words are not dominant in the partisan word lists here because they are function words, but because they are frequent. [...] Eliminating function words not only eliminates words inappropriately but it also elevates high-frequency non–stop words inappropriately.</blockquote><blockquote>Eliminating low-frequency words. Although this is a very basic statistical idea, it is commonly unacknowledged in simple feature selection and related ranking exercises. A common response is to set some frequency &quot;threshold&quot; for features to &quot;qualify&quot; for consideration. 
Generally, this simply removes the most problematic features without resolving the issue.</blockquote><br /><br />Suffice it to say that Monroe <i>et al.</i> clearly prefer model-based approaches and they provide a thorough description of their Bayesian approach. <br /><br />Not totally unrelated to this, I think these two nifty pieces of software can be of interest to you: <a href="http://www.idsia.ch/~giorgio/jncc2.html" target="_blank" >The Java Implementation of Naive Credal Classifier 2</a> by Corani and Zaffalon and <a href="http://gking.harvard.edu/readme/" target="_blank" >ReadMe: Software for Automated Content Analysis</a> by Hopkins <i>et al</i>.<br /><br />Two other interesting/intriguing papers: <a href="http://gking.harvard.edu/files/discov.pdf" target="_blank" ><b>&quot;Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology&quot;</b></a> by Grimmer and King and <a href="http://gking.harvard.edu/files/infoex.pdf" target="_blank" ><b>&quot;An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design&quot;</b></a> by King and Lowe. After reading the work by Monroe <i>et al.</i> I&#039;d love to know what they would say about the approach by Grimmer and King in the first paper :)<blockquote>We begin with a set of text documents of variable length. For each, we adopt the most common procedures for representing them quantitatively: we transform to lower case, remove punctuation, replace words with their stems, and drop words appearing in fewer than 1% or more than 99% of documents. 
For English documents, about 3,500 unique word stems usually remain in the entire corpora.</blockquote><br /><br />Should you wonder about my modest opinion: I see no problem with using thresholds; most of the time they work fine (after you tune them, of course) and, by the way, word independence (required by Bayes approaches) is just a different working assumption ;)<br /><br />The last paper I&#039;d like to mention is <a href="http://people.lis.illinois.edu/~mefron/papers/efron-cocitation.pdf" target="_blank" ><b>&quot;Using cocitation information to estimate political orientation in web documents&quot;</b></a> by Miles Efron.<blockquote>This paper introduces a simple method for estimating cultural orientation, the affiliation of online entities in a polarized field of discourse. In particular, cocitation information is used to estimate the political orientation of hypertext documents. A type of cultural orientation, the political orientation of a document is the degree to which it participates in traditionally left- or right-wing beliefs. Estimating documents’ political orientation is of interest for personalized information retrieval and recommender systems. In its application to politics, the method uses a simple probabilistic model to estimate the strength of association between a document and left- and right-wing communities. The model estimates the likelihood of cocitation between a document of interest and a small number of documents of known orientation. The model is tested on three sets of data, 695 partisan web documents, 162 political weblogs, and 198 nonpartisan documents. 
Accuracy above 90% is obtained from the cocitation model, outperforming lexically based classifiers at statistically significant levels.</blockquote><br /><br />This paper is extremely interesting; the main idea is that &quot;a man is known by the company he keeps&quot;, that is, if your website is frequently cocited with left- or right-wing websites then your website can be placed on the left-right political spectrum. <br /><br />This reminds me of an extremely upsetting piece of research disclosed very recently: the so-called project Gaydar at MIT, the aim of which is to determine whether or not a user is gay according to his/her contacts on Facebook. IMHO these guys have totally crossed the line.]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090910-120021">
		<title>My to-read list from CIKM&#039;09</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090910-120021</link>
		<description><![CDATA[<a href="http://www.searchenginecaffe.com/2009/09/cikm-2009-papers.html" target="_blank" >Thanks to Jeff Dalton</a> I&#039;ve noticed the <a href="http://www.comp.polyu.edu.hk/conference/cikm2009/program/accepted_papers.htm" target="_blank" >list of accepted papers for CIKM 2009</a>.<br /><br />Needless to say, there is a bunch of them I&#039;d really like to read (see below). Unfortunately, most of them are still not available online. This is the much shorter list of material I&#039;m going to skim in the coming days:<ul><li><strong><a href="http://faculty.washington.edu/efthimis/pubs/Pubs/cikm09.qreform.huang-efthimiadis.pdf">Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs</a></strong></li><li><strong><a href="http://eprints.sics.se/3651/01/km0612-sahlgren.pdf">Terminology Mining in Social Media</a></strong></li><li><a href="http://research.yahoo.com/files/fp0631-leeuwen.pdf">Compressing Tags to Find Interesting Media Groups</a></li><li><a href="http://staff.science.uva.nl/~mdr/Publications/Files/cikm2009-volume.pdf">Predicting the Volume of Comments on Online News Stories</a></li></ul>And the list of papers I&#039;ll read once they are available:<ul><li>Mining Data Streams with Periodically Changing Distributions / Mining Frequent Itemsets in Time-Varying Data Streams</li><li>Detecting Topic Evolution in Scientific Literature: How Can Citations Help?</li><li>Beyond Hyperlinks: Organizing Information Footprints in Search Logs to Support Effective Browsing</li><li>Clustering and Exploring Search Results using Timeline Constructions</li><li>Event Detection from Flickr Data through Wavelet-based Spatial Analysis</li><li>Voting in Social Networks</li><li>Semi-Supervised Learning of Semantic Classes for Query Understanding -- from the Web and for the Web</li><li>SELC: A Self-Supervised Model for Sentiment Classification</li><li>Learning to Recommend Questions Based on User Ratings</li><li>Improving Web Page 
Classification by Label-propagation over Click Graphs</li><li>Practical Lessons of Data Mining at Yahoo!</li><li>Who Tags the Tags?</li><li>Context Sensitive Synonym Discovery for Web Search Queries</li><li>Identifying Comparable Entities on the Web</li><li>Exploiting Bidirectional Links: Making Spamming Detection Easier</li><li>Exploring Relevance for Clicks</li><li>Learning from Past Queries for Resource Selection</li><li>What Makes Categories Difficult to Classify?</li><li>Effective Anonymization of Query Logs</li><li>Pure Spreading Activation is Pointless</li><li>Collaborative Resource Discovery in Social Tagging Systems</li><li>Data Extraction from the Web Using Wild Card Queries</li><li>Aging Effects on Query Flow Graphs for Query Suggestion</li><li>An Analysis Framework for Search Sequences</li><li>Evaluation of Methods for Relative Comparison of Retrieval Systems Based on Clickthroughs</li><li>Identifying Interesting Assertions from the Web</li><li>Opinion Classification with Tree Kernel SVM Using Linguistic Modality Analysis</li></ul>]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090816-211753">
		<title>Yet another post on mood analysis</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090816-211753</link>
		<description><![CDATA[A couple of months ago I blogged about <a href="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090616-171050" target="_blank" >mood analysis</a>; needless to say, I&#039;m interested in its applications to user-generated content, in particular, tweets.<br /><br />Yesterday I came across this really interesting paper:<br /><br /><blockquote><b><a href="http://www.springerlink.com/content/757723154j4w726k/fulltext.pdf" target="_blank" >Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents</a></b> by Peter Sheridan Dodds and Christopher M. Danforth. Journal of Happiness Studies (2009).<br /><br /><i>The importance of quantifying the nature and intensity of emotional states at the level of populations is evident: we would like to know how, when, and why individuals feel as they do if we wish, for example, to better construct public policy, build more successful organizations, and, from a scientific perspective, more fully understand economic and social phenomena. Here, by incorporating direct human assessment of words, we quantify happiness levels on a continuous scale for a diverse set of large-scale texts: song titles and lyrics, weblogs, and State of the Union addresses. Our method is transparent, improvable, capable of rapidly processing Web-scale texts, and moves beyond approaches based on coarse categorization. 
Among a number of observations, we find that the happiness of song lyrics trends downward from the 1960s to the mid 1990s while remaining stable within genres, and that the happiness of blogs has steadily increased from 2005 to 2009, exhibiting a striking rise and fall with blogger age and distance from the Earth’s equator.</i></blockquote><br /><br /><a href="http://bits.blogs.nytimes.com/2009/08/11/using-twitter-as-a-collective-mood-ring/" target="_blank" >According to the New York Times</a>, its authors are about to launch an online tool applied to Twitter: <a href="http://www.onehappybird.com/" target="_blank" >one happy bird</a>.<br /><br />All in all, an interesting approach; I&#039;m looking forward to giving <i>one happy bird</i> a try.<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090805-235915">
		<title>Time series analysis in query logs</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090805-235915</link>
		<description><![CDATA[I&#039;ve just found this recent paper by some Googlers:<br /><br /><blockquote><b><a href="http://sites.google.com/site/massiciara/Home/emnlp_2009.pdf" target="_blank" >Gazpacho and summer rash: Lexical relationships from temporal patterns of web search queries</a></b>. E. Alfonseca, M. Ciaramita and K. Hall. Empirical Methods in Natural Language Processing (EMNLP). 2009.<br /><br /><i>In this paper we investigate temporal patterns of web search queries. We carry out several evaluations to analyze the properties of temporal profiles of queries, revealing promising semantic and pragmatic relationships between words. We focus on two applications: query suggestion and query categorization. The former shows a potential for time-series similarity measures to identify specific semantic relatedness between words, which results in state-of-the-art performance in query suggestion while providing complementary information to more traditional distributional similarity measures. The query categorization evaluation suggests that the temporal profile alone is not a strong indicator of broad topical categories.</i></blockquote><br /><br />I&#039;ve found it really enjoyable, especially because one of my students (Manuel Tejeiro) recently finished his final year project, and it was nothing other than a framework to perform time series analysis. 
In fact, for most of the testing he used the <a href="http://www.gregsadetsky.com/aol-data/" target="_blank" >AOL 2006 query log</a>, obtaining fairly interesting results (the input query is in bold):<dl><dt><b>california lottery</b></dt><dd>lottery, ny lottery, georgia lottery, michigan lottery, mass lottery, calottery.com, ohio lottery, new jersey lottery, njlottery, ...<dt><b>academy awards</b></dt><dd>oscars, crash, oscar winners, box office, walk the line, ...<dt><b>disney channel</b></dt><dd>www.disneychannel.com, disneychannel, cartoonnetwork, disneychannel.com, ikea <b>(?)</b>, nick.com, hilary duff, ...</dl>Thus, in addition to the paper I mentioned above, I&#039;d suggest the following readings (Manuel&#039;s project was a really nice integration of ideas and methods from all of them):<ul><li><b><a href="http://www.cs.ucr.edu/~mvlachos/pubs/sigmod04.pdf">Identifying similarities, periodicities and bursts for online search queries</a></b> by Michail Vlachos, Christopher Meek, Zografoula Vagena, Dimitrios Gunopulos. In SIGMOD&#039;04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data (2004), pp. 131-142.<li><b><a href="http://www.kde.cs.uni-kassel.de/stumme/papers/2006/hotho2006trend.pdf">Trend Detection in Folksonomies</a></b> by A. Hotho, R. Jaschke, C. Schmitz, G. Stumme. In Proceedings of the First International Conference on Semantics and Digital Media Technology, Vol. 4306 (2006), pp. 56-70.<li><b><a href="http://www2007.org/papers/paper520.pdf">Why we search: visualizing and predicting user behavior</a></b> by Eytan Adar, Daniel S. Weld, Brian N. Bershad, Steven S. Gribble. In WWW&#039;07: Proceedings of the 16th international conference on World Wide Web (2007), pp. 161-170.<li><b><a href="http://acl.ldc.upenn.edu/eacl2006/companion/pd/31_balogetal_64.pdf">Why Are They Excited? Identifying and Explaining Spikes in Blog Mood Levels</a></b> by K. Balog, G. Mishne, M. de Rijke. 
In Association for Computational Linguistics (2006)</ul>Enjoy the reading!<br /><br />P.S. If you are still hungry you can find yet another paper related to query log analysis and including food in the title :) <b><a href="http://chato.cl/papers/boldi_2009_query_reformulation_patterns.pdf" target="_blank" >&quot;From &#039;dango&#039; to &#039;japanese cakes&#039;: Query Reformulation Models and Patterns&quot;</a></b> by Paolo Boldi, Francesco Bonchi, Carlos Castillo and Sebastiano Vigna.<br />]]></description>
	</item>
	<item rdf:about="http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090731-214703">
		<title>Worth reading</title>
		<link>http://www.di.uniovi.es/~dani/PFCblog/index.php?entry=entry090731-214703</link>
		<description><![CDATA[<blockquote>M.G. Noll, C.M. Au Yeung, N. Gibbins, C. Meinel, N. Shadbolt <b><a href="http://www.michael-noll.com/blog/2009/06/05/telling-experts-from-spammers-expertise-ranking-in-folksonomies/" target="_blank" >&quot;Telling Experts from Spammers: Expertise Ranking in Folksonomies&quot;</a></b>, Proceedings of 32nd ACM SIGIR Conference, Boston, USA, July 2009, pp. 612-619</blockquote>]]></description>
	</item>
</rdf:RDF>

