Blog para proyectantes de Dani Gayo

Saturday, December 26, 2009, 10:23 PM
The list of accepted papers for WSDM 2010 is available (thanks to @mstrohm). As usual, I've prepared my list of papers to read (with their corresponding PDFs); the papers I'd like to read but which are not yet available appear at the end of the list :(




HOT!
Thursday, December 17, 2009, 05:26 PM
INQ has just released a Twitterati list (their name); that is, a list of the most influential twitterers. I learned of this thanks to Zee M Kane who, according to that list, is the 7th most influential Briton.

I was curious enough to find out how influence was computed and I eventually reached the actual report (PDF). There, in a footnote, you can find that INQ did not actually compute the twitterers' influence but, instead, relied on the Twitalyzer.com service.

Needless to say, Twitalyzer is one of the not-so-many systems that compute a twitterer's influence. Another one is TunkRank, which is an implementation of an idea by Daniel Tunkelang. TunkRank is a kind of PageRank for Twitter and, thus, I find it much more appealing: I mean, there are not really any (or at least not many) ad hoc decisions and its results can be replicated (provided you have a Twitter graph).

Actually, I have assembled such a subgraph (about 1.5 million users) and have applied PageRank (not yet TunkRank) to it, and this is my list of "most influential British Twitter users". But first, a word of warning: I've just looked for the users appearing in the Telegraph's list and added Richard Branson, who seems to be an odd omission. Now, the list:
  1. stephenfry
  2. mashable
  3. rustyrockets
  4. richardbranson
  5. imogenheap
  6. richardpbacon
  7. calvinharris
  8. andy_murray
  9. mayoroflondon
  10. sarahbrown10
  11. mrpeterandre
  12. tommcfly
  13. zee
  14. suziperry
  15. johnprescott
  16. tom_watson
  17. campbellclaret
  18. dougiemcfly

As you can see, apart from including Richard Branson (and not including some users who do not appear in my sample, such as Eddie Izzard) there are only minor, but interesting, differences: for instance, Stephen Fry is more influential than Pete Cashmore from Mashable, and the Mayor of London is more influential than the PM's wife, Sarah Brown; with regard to Tom and Dougie from McFly, Tom is much more influential than Dougie (remember, there are other twitterers in between these celebrities).

Oh, and what about that "contest" between Ashton Kutcher and CNN? It seems that aplusk and cnnbrk share first place among Twitter influencers. Yup...
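
For those who want to try something similar, here is a minimal sketch of PageRank by power iteration over a follower graph. It is not the code I ran on the 1.5 million user sample: the function name, the damping factor of 0.85, the number of iterations and the toy user names are all assumptions made for illustration. An edge (follower, followee) means that the follower passes part of its score on to the followee.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over (follower, followee) pairs.

    All parameter values are illustrative assumptions, not the settings
    used for the ranking in the post.
    """
    nodes = sorted({user for edge in edges for user in edge})
    following = {node: [] for node in nodes}
    for follower, followee in edges:
        following[follower].append(followee)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Users who follow nobody spread their score uniformly over everyone.
        dangling = sum(rank[node] for node in nodes if not following[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Every other user splits their score evenly among the users they follow.
        for node in nodes:
            if following[node]:
                share = damping * rank[node] / len(following[node])
                for followee in following[node]:
                    new_rank[followee] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three users follow "celebrity", who follows "alice" back.
toy_edges = [
    ("alice", "celebrity"),
    ("bob", "celebrity"),
    ("carol", "celebrity"),
    ("celebrity", "alice"),
    ("bob", "alice"),
]
for user, score in sorted(pagerank(toy_edges).items(), key=lambda item: -item[1]):
    print(user, round(score, 3))
```

On a real sample you would want sparse matrices or a graph library instead of plain dictionaries, but the idea is the same: influence flows from followers to the people they follow.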




Tuesday, December 8, 2009, 12:27 PM
I learned of "The Numerati" by Stephen Baker a couple of weeks ago in El País (a Spanish newspaper). He wrote the short piece "Nos vigilan" (They are watching us), which depicts a scenario where "scientists" (mainly mathematicians and engineers) apply different tools and techniques in order to detect, track, and predict users' behaviour in different situations (e.g. in the supermarket, during elections, when using the Internet, etc.)

The tone of the story is a bit Big-Brother-esque for my taste and, in my opinion, it fails to provide a realistic, albeit simplified, picture of the actual state of the art.

Because of that, I was reluctant to buy the book and, thus, I searched for some reviews of it. This one, by Jeffrey Shallit, was pretty useful because it helped me to set my expectations for the book, which I eventually bought and read.

After reading it, I must say that I mostly agree with Shallit's review: "The Numerati" is not a book for the technically savvy, and it probably also fails at introducing the field to the lay person. However, I haven't found it totally unenjoyable: despite the title and the short story in the newspaper, the book is not the mixture of Big Brother paranoia and Dan Brown I was afraid of. It describes (or tries to) several fields where data mining and machine learning can be applied, the challenges researchers are finding, and the goals they are trying to reach.

All in all, I really recommend this book to those interested in data mining; hopefully, some of you could write a kind of version 2 of "The Numerati" providing a more accurate picture, while appealing to the general public at the same time.




Wednesday, November 25, 2009, 02:05 AM
Last Friday I came across Wowd; it is, in their words, "a real-time search engine for discovering what's popular on the web right now". Wowd exploits crowd intelligence in a really smart way: first, there is no crawler, since the pages visited by the users are submitted to the index; second, ranking is determined by the attention the users pay to each page.

All of this is somewhat related to a paper we have under review at this moment and, thus, we decided to release a draft report as a preprint:

"Making the road by searching - A search engine based on Swarm Information Foraging" by Daniel Gayo-Avello and David J. Brenes.

Abstract: Search engines are nowadays one of the most important entry points for Internet users and a central tool to solve most of their information needs. Still, there exist a substantial amount of users' searches which obtain unsatisfactory results. Needless to say, several lines of research aim to increase the relevancy of the results users retrieve. In this paper the authors frame this problem within the much broader (and older) one of information overload. They argue that users' dissatisfaction with search engines is a currently common manifestation of such a problem, and propose a different angle from which to tackle with it. As it will be discussed, their approach shares goals with a current hot research topic (namely, learning to rank for information retrieval) but, unlike the techniques commonly applied in that field, their technique cannot be exactly considered machine learning and, additionally, it can be used to change the search engine's response in real-time, driven by the users behavior. Their proposal adapts concepts from Swarm Intelligence (in particular, Ant Algorithms) from an Information Foraging point of view. It will be shown that the technique is not only feasible, but also an elegant solution to the stated problem; what's more, it achieves promising results, both increasing the performance of a major search engine for informational queries, and substantially reducing the time users require to answer complex information needs.


Curiously enough, a few hours after the preprint was available I received an interview request from a French journalist, which resulted in this description of our research (in French).



Papers
Tuesday, September 22, 2009, 11:36 PM
It's only Tuesday and I've already read (OK, skimmed) a bunch of really inspiring/enjoyable papers.

The first one I'd like to highlight is "Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict" by Monroe, Colaresi and Quinn.
Entries in the burgeoning "text-as-data" movement are often accompanied by lists or visualizations of how word (or other lexical feature) usage differs across some pair or set of documents. These are intended either to establish some target semantic concept (like the content of partisan frames) to estimate word-specific measures that feed forward into another analysis (like locating parties in ideological space) or both. We discuss a variety of techniques for selecting words that capture partisan, or other, differences in political speech and for evaluating the relative importance of those words. We introduce and emphasize several new approaches based on Bayesian shrinkage and regularization. We illustrate the relative utility of these approaches with analyses of partisan, gender, and distributive speech in the U.S. Senate.


This paper discusses different ways to discover those terms which really define a party's position on a certain topic; from that starting point the authors discuss different approaches and point out their weaknesses. Many of those issues are highly relevant to other NLP or IR tasks and, thus, I selected some excerpts for your reflecting pleasure:
One approach, standard in the machine learning literature, is to treat this [finding partisan terms] as a classification problem. In our example, we would attempt to find the words (w) that significantly predict partisanship (p). A variety of established machine learning methods could be used [...]. These approaches would attempt to find some classifier function that mapped words to some unknown party label. The primary problem of this approach, for our purposes, is that it gets the data generation process backwards. Party is not plausibly a function of word choice. Word choice is (plausibly) a function of party.
A common response [...] in many natural language processing applications is to eliminate "function" or "stop" words that are deemed unlikely to contain meaning. [...] We note, however, the practice of stop word elimination has been found generally to create more problems than it solves, across natural language processing applications. Manning et al. (2008) observe: "The general trend ... over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever". They give particular emphasis to the problems of searching for phrases that might disappear or change meaning without stop words (e.g., "to be or not to be"). More to the point, this ad hoc solution diagnoses the problem incorrectly. Function words are not dominant in the partisan word lists here because they are function words, but because they are frequent. [...] Eliminating function words not only eliminates words inappropriately but it also elevates high-frequency non–stop words inappropriately.
Eliminating low-frequency words. Although this is a very basic statistical idea, it is commonly unacknowledged in simple feature selection and related ranking exercises. A common response is to set some frequency "threshold" for features to "qualify" for consideration. Generally, this simply removes the most problematic features without resolving the issue.


Suffice it to say that Monroe et al. clearly prefer model-based approaches, and they provide a thorough description of their Bayesian approach.
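
To give a flavour of the model-based alternative, here is a minimal sketch of a log-odds ratio smoothed with a Dirichlet prior and scaled by its approximate standard deviation, which is how I understand one of the measures in the paper. Treat it as my reading rather than the authors' implementation: the function name, the uniform prior of 0.01 per word and the toy word counts are assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, prior=0.01):
    """z-scored log-odds ratio of word use in group A versus group B.

    `prior` is a uniform Dirichlet pseudo-count per word (an assumption made
    for brevity). The vocabulary must contain at least two distinct words.
    """
    vocabulary = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + prior * len(vocabulary)
    total_b = sum(counts_b.values()) + prior * len(vocabulary)
    scores = {}
    for word in vocabulary:
        y_a = counts_a.get(word, 0) + prior
        y_b = counts_b.get(word, 0) + prior
        # Difference of the smoothed log-odds of the word in each group.
        delta = math.log(y_a / (total_a - y_a)) - math.log(y_b / (total_b - y_b))
        # Approximate variance of that difference.
        variance = 1.0 / y_a + 1.0 / y_b
        scores[word] = delta / math.sqrt(variance)
    return scores


# Hypothetical toy "speeches" for two parties.
party_a = Counter("women rights the the the health".split())
party_b = Counter("tax the the the cut tax".split())
for word, z in sorted(log_odds_z_scores(party_a, party_b).items(), key=lambda item: item[1]):
    print(word, round(z, 2))
```

Note that a frequent word used equally by both groups (such as "the" in the toy data) gets a score of zero without any stop list, and rare words are pulled towards zero by the variance term instead of being cut off by a threshold.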

Not totally unrelated to this, I think these two nifty pieces of software may be of interest to you: The Java Implementation of Naive Credal Classifier 2 by Corani and Zaffalon and ReadMe: Software for Automated Content Analysis by Hopkins et al.

Two other interesting/intriguing papers: "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology" by Grimmer and King and "An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design" by King and Lowe. After reading the work by Monroe et al. I'd love to know what they would say about the approach by Grimmer and King in the first paper :)
We begin with a set of text documents of variable length. For each, we adopt the most common procedures for representing them quantitatively: we transform to lower case, remove punctuation, replace words with their stems, and drop words appearing in fewer than 1% or more than 99% of documents. For English documents, about 3,500 unique word stems usually remain in the entire corpora.


Should you wonder about my modest opinion: I see no problem with using thresholds; most of the time they work fine (after you tune them, of course) and, by the way, word independence (required by Bayes approaches) is just a different working assumption ;)
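
The threshold-based preprocessing quoted above from Grimmer and King can be sketched in a few lines. This is only an illustration under my own assumptions: the function name and the toy documents are made up, and stemming is left out because the standard library has no stemmer.

```python
import string
from collections import Counter


def filter_by_document_frequency(documents, min_df=0.01, max_df=0.99):
    """Lowercase, strip punctuation, and drop words that are too rare or too common.

    A word is kept when the fraction of documents containing it lies
    between min_df and max_df (inclusive). Stemming is omitted here.
    """
    strip_punctuation = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(strip_punctuation).split() for doc in documents]
    # Count each word once per document.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(tokenized)
    keep = {word for word, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    return [[word for word in tokens if word in keep] for tokens in tokenized]


# Hypothetical toy corpus; the thresholds are widened so that they bite on three documents.
toy_docs = ["The cat sat.", "The dog sat!", "The bird flew."]
print(filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.9))
```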

The last paper I'd like to mention is "Using cocitation information to estimate political orientation in web documents" by Miles Efron.
This paper introduces a simple method for estimating cultural orientation, the affiliation of online entities in a polarized field of discourse. In particular, cocitation information is used to estimate the political orientation of hypertext documents. A type of cultural orientation, the political orientation of a document is the degree to which it participates in traditionally left- or right-wing beliefs. Estimating documents’ political orientation is of interest for personalized information retrieval and recommender systems. In its application to politics, the method uses a simple probabilistic model to estimate the strength of association between a document and left- and right-wing communities. The model estimates the likelihood of cocitation between a document of interest and a small number of documents of known orientation. The model is tested on three sets of data, 695 partisan web documents, 162 political weblogs, and 198 nonpartisan documents. Accuracy above 90% is obtained from the cocitation model, outperforming lexically based classifiers at statistically significant levels.


This paper is extremely interesting; the main idea is that "a man is known by the company he keeps", that is, if your website is frequently cocited with left- or right-wing websites then your website could be placed on the left-right political spectrum.

This reminds me of an extremely upsetting piece of research disclosed very recently: the so-called project Gaydar at MIT, the aim of which is to determine whether or not a user is gay according to his/her contacts on Facebook. IMHO these guys have totally crossed the line.
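
A toy illustration of the "company he keeps" idea follows. It is not Efron's estimator, whose details are in the paper: I simply assume a smoothed log-ratio of cocitation counts with a handful of sites of known orientation, and the function name, the smoothing constant and the site names are all hypothetical.

```python
import math


def orientation_score(cocitations, left_sites, right_sites, smoothing=0.5):
    """Smoothed log-ratio of cocitations with right- versus left-wing exemplars.

    `cocitations` maps an exemplar site to the number of pages citing both it
    and the document of interest. Positive scores lean right, negative lean
    left. This is an assumed stand-in, not the model from the paper.
    """
    left = sum(cocitations.get(site, 0) for site in left_sites)
    right = sum(cocitations.get(site, 0) for site in right_sites)
    return math.log((right + smoothing) / (left + smoothing))


# Hypothetical exemplar sites and cocitation counts for one document.
left_sites = ["left-blog.example", "left-news.example"]
right_sites = ["right-blog.example", "right-news.example"]
cocitations = {"left-blog.example": 8, "left-news.example": 3, "right-blog.example": 1}
print(orientation_score(cocitations, left_sites, right_sites))
```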


