akicif

I never got around to including my copious list of interests on my userinfo page, so the interest predictor meme doesn't work for me.

Instead, I thought I'd dig out the interests lists of all the people and communities on my friends list and do a bit of playing around with them.

The simple way to do this is to cut and paste everyone's userinfo page into one large file and do some serious editing. Alternatively, you can take something like http://www.livejournal.com/misc/interestdata.bml?user=akicif and merge it with a list of userids before topping and tailing it with a little PHP.

Save the source of the page you generate and sort it into alphabetical order (this makes it easier to throw away what you don't need), and you find you have a three-columned table: the first two columns are the code number for each interest (the lower the number, the older the interest) and the number of people on LJ who share that interest, and the third is the interest itself.
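As a rough sketch of reading that table, assuming each data line is "code count interest" separated by single spaces (as in "10076798 1 anticommunitarianism"), something like the following would do. The names Interest and parse_interest_line are made up for illustration; they aren't anything LJ provides.

```python
from typing import NamedTuple, Optional


class Interest(NamedTuple):
    code: int       # LJ's code number for the interest (lower = older)
    lj_count: int   # number of people on LJ listing the interest
    name: str       # the interest text, which may itself contain spaces


def parse_interest_line(line: str) -> Optional[Interest]:
    """Parse one 'code count interest' line; return None for anything else."""
    parts = line.strip().split(" ", 2)  # split on the first two spaces only
    if len(parts) != 3:
        return None  # blank line, comment or other noise
    if not (parts[0].isdecimal() and parts[1].isdecimal()):
        return None  # does not begin with two numbers
    return Interest(int(parts[0]), int(parts[1]), parts[2])
```

Splitting on only the first two spaces keeps multi-word interests such as "snow skiing" in one piece.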

There are 15,128 interests listed in total, of which 8048 are unique. The ten oldest interests are linux, programming, perl, unix, women, beer, biking, snow skiing, java and c. The ten newest are division two football, chic charnley, kenneth white, geopoetics, friends of the ham, daygloradio, ferry halim's orisinal, uncoperative hair, jacobites by name, and anticommunitarianism.

The ten most popular interests on LJ (of the ones listed, anyway) are music, movies, reading, friends, writing, computers, dancing, art and photography. There are 732 unique interests, though, so it's not really possible to list the ten least popular.

So much for the generalities. There's probably more to be learned by looking at the frequencies of interests on the friends list. The ten most popular interests are science fiction, books, reading, music, writing, sf, fantasy, cats, fandom and computers.

No surprises yet, really, so next I looked at the top thirty: science fiction (123), books (83), reading (78), music (57), writing (56), sf (55), fantasy (54), cats (52), fandom (51), computers (49), beer (47), edinburgh (46), cooking (44), chocolate (42), fanzines (41), history (40), sushi (38), food (35), neil gaiman (33), films (33), science fiction fandom (32), travel (31), sex (30), conventions (30), photography (28), iain banks (28), buffy the vampire slayer (28), monty python (27), dave langford (27) and movies (26). Still not incredibly surprising (okay, maybe the ordering towards the end), and still nothing I'm actively uninterested in.

Next, I looked at those interests where everyone who had them was on my friends list. Again ignoring the unique ones, we get swisstone (5), steer's true stories (3), independent art-wank cinema (3), thomas mcmahon (2), the convertible bus (2), stafford beer (2), slagging off scotland (2), secret nazi weapons (2), nova awards (2), longing for sunshine (2), long wide-ranging conversations (2), lilian edwards (2), internet regulation (2), fwagg (2), dorothy heydt (2), dave mooring (2), damp tweed (2), cybermog (2), cullen skink (2), citizens income (2), bloody microsoft (2), being an old leftie (2) and application development advisor (2). All at once things are looking a good deal less obvious, but maybe a little too obscure.
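That filter is easy enough to sketch in code, assuming the data has already been boiled down to (interest, count on the FL, count on LJ) triples. The function name and the tuple layout are my own invention for the example.

```python
def only_on_friends_list(rows):
    """rows: iterable of (name, fl_count, lj_count) tuples.

    Keep interests whose every holder on LJ is on the friends list
    (fl_count == lj_count), skipping those held by just one person,
    sorted by count (highest first) and then by name.
    """
    kept = [(name, fl) for name, fl, lj in rows if fl == lj and fl > 1]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


rows = [
    ("swisstone", 5, 5),
    ("lemurs in the rain", 7, 184),
    ("cullen skink", 2, 2),
    ("some unique interest", 1, 1),  # hypothetical row, for illustration only
]
print(only_on_friends_list(rows))
```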

What I need is a way of scoring interests that selects for things rare on LJ in general but common on the FL and vice versa. I can sort of do this by sorting on the number I get if I multiply the percentage of a given interest on LJ that's on my FL by the number of times it appears on the FL. This has the advantage that it sorts all the unique interests into the middle somewhere where I can ignore them.
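Spelled out as a minimal sketch (the function name fl_score is just a label for this example), the score is the percentage of an interest's LJ holders who are on the FL, multiplied by the number of times it appears on the FL:

```python
def fl_score(fl_count: int, lj_count: int) -> float:
    """Percentage of an interest's LJ holders who are on the FL, times its FL count."""
    percentage_on_fl = 100.0 * fl_count / lj_count
    return percentage_on_fl * fl_count


# An interest held by 5 people, all of them on the FL
print(fl_score(5, 5))
# A unique interest (one holder, who is on the FL)
print(fl_score(1, 1))
# An interest held by 184 people on LJ, 7 of them on the FL
print(round(fl_score(7, 184), 2))
```

Every unique interest gets the same score of 100, which is why they all land together in one ignorable band. Anything widely held on LJ but barely present on the FL scores close to zero.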

The top thirty interests by this metric are dave langford, plokta, the cult of livejournal, rasff, superfluous technology, science fiction fandom, corflu, swisstone, eastercon, the pointy bear game, novacon, science fiction foundation, ken macleod, bleepy shite, rec.arts.sf.fandom, conrunning, rasseff, steer's true stories, independent art-wank cinema, fanzines, smoffing, rassef, bsfa, perky gothness, reading sf group, ian mcdonald, damn fine convention, british science fiction association, charles stross and holyrood tavern. This list still contains some of the entries from the last one, but some new stuff's made its way in.

Finally, the bottom thirty interests - those that are most common out in LJ-land, but least well represented here: bowling, horror movies, you, blue, cds, the doors, drama, sunsets, hanging out, smashing pumpkins, marilyn manson, led zeppelin, surfing, fight club, drums, partying, the beach, tool, beach, skiing, coldplay, skateboarding, weezer, family guy, traveling, summer, nirvana, soccer, concerts and guys.

Before I close the lj-cut, though, there were two oddities: interests with wildcards in them are dead tricky, 'cos they matched against longer strings, and there was one interest on the FL that LJ tried to claim no-one had.

Oh, and does anyone remember "lemurs in the rain"? It's still an interest for 184 people on LJ, of whom seven are on the FL.


Date: 2005-06-21 07:47 pm (UTC)

From: akicif.livejournal.com

Doing it manually....

Sure - as long as everyone doesn't want one.

The php file contains:

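<?php

/* many lines like the one below, one per userid */
/* readfile() just passes the remote page through to the output; */
/* include would need allow_url_include and would try to run it as PHP */
readfile('http://www.livejournal.com/misc/interestdata.bml?user=akicif');

?>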

Put the file on a php-enabled web server, and browse to the page. When the page has loaded, look at the source and save it as a text file.

For the next bit, you'll need a text editor that can handle regular expressions and sorting.

First, sort the file, and throw away the lines that do not begin with numbers. You then have a bunch of lines that look like:

10076798 1 anticommunitarianism

Before we can stick these in a spreadsheet and do the counting, though, we need to replace the spaces between columns (and only those spaces) with tabs. Also, because we don't want to count, say, instances of "hot dogs" as "dogs", we need to stick a marker of some sort at the beginning and end of the text string.

So, we substitute "\n\([0-9]+\) \([0-9]+\) " with "PLUGH\n\1\t\2\tXYZZY" throughout, and this gives us something we can paste into three adjacent columns of a spreadsheet.
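For anyone who would rather not do this in a text editor, here is a rough line-at-a-time equivalent of that substitution. It is a sketch only, assuming the "code count interest" line layout shown above; the to_tsv name is made up.

```python
import re

# Two runs of digits, each followed by a single space, then the interest text
LINE = re.compile(r"^([0-9]+) ([0-9]+) (.*)$")


def to_tsv(line: str) -> str:
    """Turn 'code count interest' into 'code<TAB>count<TAB>XYZZYinterestPLUGH'."""
    match = LINE.match(line)
    if match is None:
        raise ValueError("not an interest line: %r" % line)
    code, count, interest = match.groups()
    # Only the two column-separating spaces become tabs; spaces inside the
    # interest itself are left alone, and the markers fence the text in.
    return f"{code}\t{count}\tXYZZY{interest}PLUGH"


print(to_tsv("10076798 1 anticommunitarianism"))
```

With the markers in place, "XYZZYdogsPLUGH" can no longer be mistaken for part of "XYZZYhot dogsPLUGH".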

Use the fourth column to count how many times each interest appears by pasting "=countif($c$2:$c$11093,c2)" into D2, and then copying it down to the bottom - this counts, for each row, how many times that interest appears in the third column. That's a formula, though, and subject to change, so take that column, paste it into your text editor, and then cut and paste from there back into the spreadsheet (you might as well remove the XYZZYs and PLUGHs at this point).

This gives you a bunch of raw data, but it's easier to work with the summary generated by deleting all the duplicate lines.
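The counting and de-duplicating steps could also be sketched in a few lines of Python rather than a spreadsheet. This assumes the rows are already (code, count on LJ, interest) tuples, one per appearance on the FL; the summarise name and the sample numbers are hypothetical.

```python
from collections import Counter


def summarise(rows):
    """rows: iterable of (code, lj_count, name), one per appearance on the FL.

    Returns one (code, lj_count, name, fl_count) tuple per distinct
    interest, sorted by interest name.
    """
    rows = list(rows)
    # Counter compares whole strings exactly, so "dogs" never matches "hot dogs"
    fl_counts = Counter(name for _, _, name in rows)
    distinct = sorted(set(rows), key=lambda row: row[2])
    return [(code, lj, name, fl_counts[name]) for code, lj, name in distinct]


# Made-up codes and counts, purely to show the shape of the result
rows = [(1, 10, "dogs"), (2, 20, "hot dogs"), (1, 10, "dogs")]
print(summarise(rows))
```

Because the comparison is on the whole string, no XYZZY/PLUGH markers are needed here, and interests containing wildcard characters get no special treatment.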

By a strange coincidence, I have a spreadsheet containing this information for nhw - oh, except some of the folk on your FL seem to have used other alphabets in places - which I have just emailed you.

Oh, and there are bound to be less manual ways to do this, but as I was originally doing it for a one-off, I didn't bother looking.