kiwitobes.com

Author, Software Developer, and Data Magnate


Personal data integration (part 1)

I’ve been toying with the idea of attempting “semantic integration” of a lot of personal data in my life. I’ll be sure to share more later, but so far I’ve managed to pull together my September phone records, my email history, my contacts, my calendar and my Facebook friends (via the API, not something sketchy!) into a single triple-store.

Using this data, I was able to create this chart, which shows my friend network (I have removed myself and Brooke, since we’re connected to everyone and it ruins the layout). The people who I emailed, texted or called in September are shown in green.

[Image: social_graph_2gml-yed.jpg]

You can see tight clusters of my friend groups. The tightest is the big hairball near the bottom that makes up much of Brooke’s Stanford GSB class, but also clear are groupings for my friends from MIT, Chapel Hill, Boston (post-MIT return), my San Francisco tech friends and my family. My family is the only group that is isolated from the rest of the graph — everyone else is connected, which is partly because I’ve introduced some of these groups to each other, and partly just because it’s a small world.

Also good to see is that almost every cluster has at least one green node (my family notably doesn’t, but that’s because my parents aren’t on Facebook), so I’ve generally done a good job of keeping in touch with at least a few people from different phases of my life.
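A minimal sketch of how a chart like this could be assembled, not the code actually used for it. It assumes the friend network is available as pairs of names, drops the "hub" people who are connected to everyone, and writes GML with contacted people filled green. The function name, the sample names and the `graphics [ fill ... ]` attribute (the form yEd is assumed to read) are all illustrative.

```python
def friend_graph_gml(friendships, contacted, exclude=()):
    """Return GML text for a friend graph.

    friendships: iterable of (name_a, name_b) pairs
    contacted:   set of names emailed, texted or called in the period
    exclude:     hub names to drop so they don't dominate the layout
    """
    excluded = set(exclude)
    edges = [(a, b) for a, b in friendships
             if a not in excluded and b not in excluded]
    # Give every remaining person a stable integer id.
    # People left with no edges at all are not emitted.
    names = sorted({name for edge in edges for name in edge})
    ids = {name: i for i, name in enumerate(names)}

    lines = ["graph [", "  directed 0"]
    for name in names:
        fill = "#00CC00" if name in contacted else "#CCCCCC"
        lines.append('  node [ id %d label "%s" graphics [ fill "%s" ] ]'
                     % (ids[name], name, fill))
    for a, b in edges:
        lines.append("  edge [ source %d target %d ]" % (ids[a], ids[b]))
    lines.append("]")
    return "\n".join(lines)


# Hypothetical data: "Me" is connected to everyone, so it is excluded.
sample = [("Me", "Ann"), ("Ann", "Bob"), ("Me", "Bob"), ("Bob", "Cat")]
print(friend_graph_gml(sample, contacted={"Bob"}, exclude=["Me"]))
```

The resulting text can be saved to a .gml file and opened in a graph layout tool.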

There’s a lot of talk about breaking the silos in the enterprise and, in the semantic-web community, data integration across the entire web. But right now, people don’t even have decent integration across their own personal information. The current proliferation of single-feature applications encourages you to store different aspects of your life in different places — the advantage of course, is that something highly specialized is much more pleasant to use, but the disadvantage is that there’s no way to query across these aspects. I’m interested in experimenting with ways that help people “break the silos” with their own information, in the hope that this will both yield useful applications and help us get a better grip on the bigger problems.

I now have code to keep my triple-store synced with my friend network, my contacts, my phone records, my email and my calendar. I can construct queries across all of this (who did I forget to call on their birthday? Who have I seen recently who went to Stanford?). I’ll be sharing this code at some point, but I want to see how far I can take this. I’m also interested in hearing from anyone who has tried similar experiments and wants to collaborate.
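To make the idea of "queries across all of this" concrete, here is a toy in-memory triple store and one cross-source query. It is only a sketch under stated assumptions: the `TripleStore` class, the predicate names `birthday` and `called_on`, and the sample people are hypothetical, and a real store would be synced from the actual data sources.

```python
from datetime import date


class TripleStore:
    """A tiny set of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Yield triples matching the pattern; None is a wildcard."""
        for s, p, o in self.triples:
            if ((subject is None or s == subject) and
                    (predicate is None or p == predicate) and
                    (obj is None or o == obj)):
                yield (s, p, o)


def forgotten_birthdays(store):
    """People with a birthday on record but no call logged on that day."""
    forgotten = []
    for person, _, birthday in store.match(predicate="birthday"):
        called_on = {o for _, _, o in store.match(person, "called_on")}
        if birthday not in called_on:
            forgotten.append(person)
    return sorted(forgotten)


store = TripleStore()
# Birthdays would come from contacts, calls from phone records.
store.add("alice", "birthday", date(2008, 9, 12))
store.add("alice", "called_on", date(2008, 9, 12))
store.add("bob", "birthday", date(2008, 9, 20))
store.add("bob", "called_on", date(2008, 9, 3))
print(forgotten_birthdays(store))
```

The point of the design is that the birthday and the call come from different silos, yet the query never needs to know which one supplied which fact.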

So, anyone have any thoughts on other sources of personal data or questions you might want to ask once it’s integrated?

Web 2.0 NYC, Freebase UG meeting, and Taleb

A few quick updates:

  • I’ll be speaking at Web 2.0 in New York City this Thursday at 3pm. If you’re at the conference, find me and say hi!
  • While I’m gone, Freebase is having a user group meeting. Here is the info. There are great speakers, and you’ll seriously love the GeoSearch API
  • A new article by my favorite non-fiction author, Nassim Taleb, is at Edge. Highly recommended

I’m working on a lot of new projects right now, and I’ll have more to share soon.

O’Reilly interview at OSCON

While I was at OSCON earlier this year, I did a 20-minute video interview with O’Reilly. I think the idea is to take a lot of interviews and edit them down to shorter segments for some kind of video supplement, but they’ve also posted the entire thing on YouTube.

I talk a little bit about my biotech experience, my book, working at Freebase and the importance of open data to new applications. The whole 20-minute segment is embedded below.

Let me know what you think!

A San Francisco Restaurant Health-map (with code)

If you’re just interested in a heat-map of restaurant sketchiness in San Francisco, here it is! If you want to learn about how it was done, keep reading below.
[Image: San Francisco map. Click to see the interactive map.]

(In addition to showing the “sketchy areas”, it’s also great just for seeing where the clusters of restaurants are, which is really what defines the neighborhood centers.)

Thanks to the ever resourceful Adrian Holovaty, I figured out that one could actually get license-free restaurant listings by scanning city-government records (those of you not looking to republish the data could just use the Yelp API or something). I got about 3000 restaurants into Freebase, along with their health department scores.

Addresses in Freebase are geocoded automatically by the geobot, so it was pretty easy for me to make this map of San Francisco, along with all the restaurants colored by their score.

If you’d like to do something similar yourself, I’ve created a few templates and scripts that you can work through. The Google Maps API and Freebase API are well-documented, but sometimes it’s nice to have a really basic tutorial to get you started.

  1. You’ll need an API key from Google
  2. Download this Base map template
  3. If you want custom icons, you’ll need to draw or generate them. In this case, I wrote a Python script to make a set of colored dots going from red to green. You can (right-click) download the script make_icons.py (requires PIL).
  4. I uploaded the icons to a directory called “icons”, and then created the icons in the page with this script:
    // Create your icons: rated2.png through rated10.png, stored in r[2]..r[10]
    r = [];
    for (i = 2; i < 11; i++) {
      r[i] = new GIcon();
      r[i].image = "icons/rated" + i + ".png";
      r[i].shadow = null;
      r[i].iconSize = new GSize(8, 8);
      r[i].shadowSize = new GSize(0, 0);
      r[i].iconAnchor = new GPoint(3, 3);
    }

    Understanding how to do custom icons was a little tricky; a lot of attributes need to be set before they work properly.
  5. If you don’t have the Freebase API installed, you’ll need to run “easy_install simplejson” and “easy_install freebase” to get it
  6. Finally, you can generate all the overlays and paste them into your code with a script like make_overlays.txt. This script just prints a bunch of javascript, which you can capture and paste into your map HTML file.
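For step 3, a script along the following lines could generate the colored dots. This is a sketch, not the original make_icons.py: the function names and the linear red-to-green ramp are assumptions, while the 8x8 size and the rated2.png through rated10.png file names are taken from the icon-loading script in step 4. It expects the “icons” directory to exist already.

```python
def score_color(i, low=2, high=10):
    """Interpolate from pure red (at low) to pure green (at high)."""
    t = (i - low) / float(high - low)
    return (int(round(255 * (1 - t))), int(round(255 * t)), 0)


def make_icons(directory="icons", size=8):
    """Write one transparent PNG with a colored dot per icon number."""
    # Imported here so the color ramp above works without PIL installed.
    from PIL import Image, ImageDraw
    for i in range(2, 11):
        img = Image.new("RGBA", (size, size), (0, 0, 0, 0))  # transparent
        draw = ImageDraw.Draw(img)
        # Fully opaque dot that fills the whole icon.
        draw.ellipse((0, 0, size - 1, size - 1), fill=score_color(i) + (255,))
        img.save("%s/rated%d.png" % (directory, i))


if __name__ == "__main__":
    make_icons()
```

A ramp through a different pair of colors would only require changing `score_color`.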

Easy, wasn’t it? I suggest you take a look at make_overlays.py and see what it’s doing. Essentially, there’s a query at the top to pull out the scores and addresses of all the topics that have my personal type “health_department_rated_business”. The addresses have geolocations attached by the geobot.
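The printing half of such a script might look roughly like this. It is a hedged sketch rather than the contents of make_overlays.py: the rows are hypothetical stand-ins for the query results, the map variable is assumed to be called `map`, and bucketing a 0-100 score into icon numbers 2 to 10 by dividing by ten is a guess at how scores line up with the `r` icon array from step 4.

```python
def icon_index(score):
    """Map a 0-100 score to an icon number in 2..10 (assumed bucketing)."""
    return max(2, min(10, int(score) // 10))


def overlay_line(lat, lng, score):
    """Return one line of JavaScript that adds a marker to the map."""
    return ("map.addOverlay(new GMarker(new GLatLng(%.6f, %.6f), "
            "{icon: r[%d]}));" % (lat, lng, icon_index(score)))


# Hypothetical rows standing in for the query results.
restaurants = [
    {"name": "Example Taqueria", "lat": 37.7599, "lng": -122.4148, "score": 96},
    {"name": "Example Diner", "lat": 37.7836, "lng": -122.4089, "score": 58},
]

for row in restaurants:
    print(overlay_line(row["lat"], row["lng"], row["score"]))
```

Redirecting the output to a file gives a block of JavaScript ready to paste into the map page.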

If you wanted to use this recipe to make a map of something else, you can change the query. For example, I could just pull out businesses that start with the letter ‘M’.

query = {'type': '/business/business_location/address',
         'name~=': '^M',
         'name': None,
         '/business/business_location/address': {
             'citytown': 'San Francisco',
             '/location/location/geolocation': {'latitude': None,
                                                'longitude': None}}}

(If you try this, remember to get rid of the references to “score” in the loop below)

If you make a map using this recipe, let me know and I’ll link to it!

Update: Fixed the Python code links. Sorry about the red-green problem; I’ll fix it as soon as I get a chance.

The “excluded middle” of technical books

(ok, I know that “excluded middle” has a specific meaning in philosophy and I’m using it incorrectly here, but I like the way it sounds)

A couple of months ago I read a book called A Semantic Web Primer, published by the MIT Press. It was recommended to me by my coworker Jamie, who said it was about the only thing worth reading on the subject.

I will say that I’m glad I read it, because now I understand the terminology and the way that the Semantic Web community talks about knowledge and ontology. What I found intriguing about the book, however, was the nature of the content:

  • Begin with hyperbolic vision of the future where software agents are negotiating my doctor’s appointment
  • Explain a little about RDF concepts, then spend almost half the book describing the XML serialization
  • Explain a little about ontology and then do a deep-dive on OWL, which is mostly a way of describing which properties are valid for certain classes
  • This is where it gets weird — The “applications” section. Suddenly we’re talking about how Elsevier and Audi are using “The Semantic Web” to solve all manner of problems by having shared ontologies. Or maybe they’re planning to. The descriptions of what they’re doing are no more than well-padded paragraphs with no detail

It struck me that a massive chunk of the book was missing: the part that bridges the technical details of the RDF spec and how a car company might actually design and implement a shared ontology using the specs just described. This gap was so apparent that it got me thinking about the different kinds of computer books out there, which generally come in three flavors:

  • High level books about technology concepts, principles or “the future”
  • Learning a specific technology at a pretty deep level
  • Algorithms and computer science books that are math-heavy and at best use pseudocode

The first category tends to sell the most, because it’s accessible to the largest group. The second has been struggling for a while because the internet makes it so easy to learn and reference information about specific technologies, and the final group will probably always have a place amongst people who really want to deep-dive into the algorithms.

When I wrote Programming Collective Intelligence, I was hoping to find a middle ground, which would introduce readers gently to the algorithms, show them working code and then try it out on real data that they could find on the web. The book was criticized by different people for not fitting into the aforementioned categories: “Not big-picture enough”, “This isn’t real production code”, “Not deep enough on algorithms”, “Why did he use Python instead of pseudocode?” or “Why would I want to learn 3 things at once?”.

The overall response, however, was overwhelmingly positive — most people loved that they could learn something new, actually try it, and then have an idea about how they could work it into their project. Tim O’Reilly called it the “start of a new category rather than one more entry into an existing one”.

Anyway, I guess what I’m getting at here is that I’d like to see more books that fill out that middle ground: show me concepts, implementation and applications all at once. People can read any number of online tutorials to get a deeper understanding of how to do something with a particular technology. Once they understand the basics of algorithms, there are plenty of textbooks and journals to teach them more. As for big picture, throw out a few ideas and people’s creativity will fill in far more than can be covered by any book.

Snoozemail

The other day my friend Jeff asked me if there was a way to “delay” email. We looked around for a service that would let us do this, but couldn’t really find anything. So I threw together snoozemail… it’s a couple of Python scripts and a cronjob, nothing fancy, but I thought it might be useful to some people:

http://snoozemail.com/

Forward an email to snoozemail and it comes back to you a specified number of days later. I’m not planning on adding any more to it, but if you find a good use for it, let me know!
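The core of a service like this can be very small. The sketch below is not the snoozemail code; it only illustrates the scheduling idea under stated assumptions: messages sit in a plain list, the function names are hypothetical, and how the number of days is read from the forwarded email and how mail is actually re-sent are left out. A cron job would call `pop_due` periodically and send back whatever it returns.

```python
from datetime import datetime, timedelta


def snooze(queue, message, days, now):
    """Record a message to be sent back `days` days after `now`."""
    queue.append({"due": now + timedelta(days=days), "message": message})


def pop_due(queue, now):
    """Remove and return the messages whose due time has passed."""
    due = [item["message"] for item in queue if item["due"] <= now]
    queue[:] = [item for item in queue if item["due"] > now]
    return due


queue = []
start = datetime(2008, 9, 1, 9, 0)
snooze(queue, "Renew passport", days=3, now=start)
snooze(queue, "Call Jeff back", days=1, now=start)

# What a cron run two days later would pick up.
print(pop_due(queue, start + timedelta(days=2)))
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the logic easy to try out with made-up dates.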

Yes, I’m on Twitter

A couple of people have asked me if I use Twitter.

For those interested, you can find me here. If you find it interesting, my updates are public, so you can follow me.

I can also be found on Friendfeed and Flickr.

If you’re sick of the proliferation of social media, feel free to ignore this…

Upcoming talks

I have a few talks coming up. If you’re planning to attend any of these events, be sure to drop me a line!

  • OSCON 2008: Data-mining Wikipedia and other semantically weak sources
  • Web 2.0 Expo New York: The ecosystem of business and social data
  • Web 2.0 Tokyo: Haven’t decided on a topic yet
  • A secret one that I’m not allowed to talk about :)

Freebase User Group Meeting

I’ll be speaking at the Freebase User Group Meeting tomorrow evening, on the subject of “Business data in Freebase”. I’ll be showing how I’ve used data about public companies from the government and various other sources to populate Freebase and build cool interactive visualizations.

It’s being held at the Metaweb offices (New Montgomery and Howard, San Francisco) starting at 6:30pm. There will be free pizza and beer :) If you want to come, please RSVP at Upcoming. I’m looking forward to meeting you!

Recordings of “Recommendations” Lecture

A few people commented/emailed and asked if they could get recordings of the lecture I mentioned in my last post. You can find the audio here. I’m not sure how much use it will be without the slides, but I hope you find it helpful!

http://www.weigend.com/files/teaching/stanford/recordings/WeigendStanford2008Class6/