Client Side API Mashups With CORS

At Heroku we have APIs for pretty much everything. Need logs for an app? Is that database available? You just beat someone at ping pong? There’s an API for that. Having such rich datasets available is great. It allows us to build dashboards with mashups of different datasets and serve them from a web browser.

Here are some of the techniques implemented in order to wire up a Backbone.js application speaking to remote hosts in a secure manner. We will explore Cross-Origin Resource Sharing (CORS) as well as HMAC based auth tokens with cryptographically timestamped data that an attacker wouldn’t be able to auto-replay. The end goal is to have an application running on a browser, and securely request data from an API running on a remote host.

The first problem when issuing XHR requests across hosts will be the same-origin policy violation. Go ahead, issue an AJAX request against a remote host. The browser should fail with an error similar to the following:

XMLHttpRequest cannot load https://some.remote.com. Origin https://your.site.com is not allowed by Access-Control-Allow-Origin

This is where Cross-Origin Resource Sharing (CORS) comes in. For any request that isn’t a “simple” one (ours carry a custom header, so they qualify), the browser first issues what’s called a pre-flight request on behalf of the Origin (the client), asking the server “hey, can I make a request with HTTP verb foo to path /bar with headers x-baz?”, to which the server responds, “Sure, bring it!”, or “No, you may not”. This pre-flight request is made to the same path as the actual request, but the HTTP OPTIONS verb is used instead. Should the request be allowed, the server responds with the following headers:

  • Access-Control-Allow-Origin: Specifies which Origin is allowed to make remote XHR requests against this server. Allowed values are a single origin (eg: https://your.site.com) or an asterisk indicating all origins are allowed. The header does not accept a list, so a server that trusts several origins has to pick the value per request.
  • Access-Control-Allow-Headers: Specifies a comma separated list of headers that the Origin is allowed to include in requests to this server. There are many reasons to include custom headers - we’ll see an example of this further down.
  • Access-Control-Max-Age: This is optional, but it allows the browser to cache this response for the given number of seconds, saving itself the pre-flight request on subsequent calls. Feel free to set it to a large number, like 30 days (2592000), but keep in mind that browsers enforce their own, much lower, upper limit.

There are more headers that allow you to whitelist and otherwise control access to the resource. Be sure to read up on CORS.

Thus, a Sinatra app acting as the remote end of the system can respond to pre-flight OPTIONS requests like so:

options '/resources' do
  headers 'Access-Control-Allow-Origin' => 'https://your.site.com',
          'Access-Control-Allow-Headers' => 'x-your-header',
          'Access-Control-Max-Age' => '2592000'
end

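The Access-Control-Allow-Origin header is also required on the responses to the actual cross-origin requests, not just on the pre-flight response. Access-Control-Allow-Headers only matters for the pre-flight, but sending both everywhere is harmless, so we can extract the headers directive to a helper and use it on both pre-flight and other requests: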

options '/resources' do
  cors_headers
  headers 'Access-Control-Max-Age' => '2592000'
end

post '/resources' do
  cors_headers
  # do_work
end

private

def cors_headers
  headers 'Access-Control-Allow-Origin' => 'https://your.site.com',
          'Access-Control-Allow-Headers' => 'x-your-header'
end

And just like that, browsers can now issue XHR requests against remote APIs. Of course, there is no authentication in place yet.
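One practical wrinkle: a response carries a single Access-Control-Allow-Origin value, so an API that trusts several front ends usually compares the request’s Origin header against a whitelist and echoes it back on a match. Here is a minimal sketch of that decision as a plain Ruby function, independent of Sinatra. The names ALLOWED_ORIGINS, ALLOWED_HEADERS and cors_headers_for, as well as the staging URL, are made up for illustration.

```ruby
# Hypothetical whitelist; replace with the origins your API actually trusts.
ALLOWED_ORIGINS = ['https://your.site.com', 'https://staging.your.site.com'].freeze
ALLOWED_HEADERS = 'x-your-header'

# Returns the CORS headers to send for a request coming from +origin+,
# or an empty hash when the origin is not whitelisted.
def cors_headers_for(origin, preflight: false)
  return {} unless ALLOWED_ORIGINS.include?(origin)

  headers = {
    'Access-Control-Allow-Origin'  => origin,          # echo the one matching origin
    'Access-Control-Allow-Headers' => ALLOWED_HEADERS,
    'Vary'                         => 'Origin'         # the response now depends on the Origin header
  }
  headers['Access-Control-Max-Age'] = '2592000' if preflight
  headers
end
```

In a Sinatra route this could be wired in with something like headers cors_headers_for(request.env['HTTP_ORIGIN']), since Rack exposes the Origin request header under that key.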

We will implement an HMAC based auth token mechanism. Both the remote server and your app share a secret. This secret is used to generate a token containing a timestamp that is used for validating token recency. The HMAC digest is a signature that is generated with the shared secret, and it can be used to verify the authenticity of the entity that generated the token. It answers the question of whether the client making the API request is authentic.

To generate the token, we create a JSON document containing an issued_at timestamp, and we calculate its SHA-256 HMAC signature using the secret known to both parties. Finally, we append this signature to the JSON document and we base64 encode it to make it safe to send over the wire. Here’s an example implementation:

require 'openssl'
require 'json'
require 'base64'

def auth_token
  data = { issued_at: Time.now }
  secret = ENV['AUTH_SECRET']
  # OpenSSL::HMAC.hexdigest takes (digest, key, data), in that order.
  signature = OpenSSL::HMAC.hexdigest('sha256', secret, JSON.dump(data))
  Base64.urlsafe_encode64(JSON.dump(data.merge(signature: signature)))
end

This token is used on the API server to authenticate requests. The client sends it in a custom header, let’s call it X-App-Auth-Token, which Rack exposes to the app as HTTP_X_APP_AUTH_TOKEN (it also needs to be listed in Access-Control-Allow-Headers, in place of the x-your-header placeholder used earlier). The server must be able to regenerate the signature from the JSON data, and then validate that the token is recent enough. For example, in a Sinatra application:

require 'time' # for Time.parse

def not_authorized!
  halt 401, "Not authorized\n"
end

def authenticate!
  token = request.env["HTTP_X_APP_AUTH_TOKEN"] or not_authorized!
  # The token was built with urlsafe_encode64, so decode it the same way.
  token_data = JSON.parse(Base64.urlsafe_decode64(token))
  received_sig = token_data.delete('signature')
  regenerated_mac = OpenSSL::HMAC.hexdigest('sha256', ENV['AUTH_SECRET'], JSON.dump(token_data))
  # Reject a bad signature, or a token issued more than two minutes ago.
  if regenerated_mac != received_sig || Time.parse(token_data['issued_at']) < Time.now - 2*60
    not_authorized!
  end
end

In the above code, we consider a token invalid if it was issued more than 2 minutes ago. Real applications will probably include more data in the auth token, such as the email address or some user identifier that can be used for auditing and whitelisting. All of the above token generation and verification has been extracted to a handy little gem called fernet. Do not reimplement this, just use fernet. In addition to the HMAC signature, fernet also makes it easy to encrypt the message’s payload, which opens it up for other interesting use cases.
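For intuition only (fernet remains the better choice in production), the whole round trip can be exercised end to end with nothing but the standard library. This is a sketch under a few assumptions: the names generate_token, valid_token?, secure_equal? and MAX_AGE are invented here, the clock is injectable so the behavior is easy to test, and, unlike the code above, it signs the exact payload string rather than re-serializing the JSON, which avoids depending on key order.

```ruby
require 'openssl'
require 'json'
require 'base64'
require 'time'

MAX_AGE = 2 * 60 # seconds a token stays valid

# Constant-time comparison, so timing does not reveal how much of a signature matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

# Builds a signed, base64 encoded token. +now+ is injectable for testing.
def generate_token(secret, now: Time.now)
  payload = JSON.dump('issued_at' => now.getutc.iso8601)
  signature = OpenSSL::HMAC.hexdigest('sha256', secret, payload)
  Base64.urlsafe_encode64(JSON.dump('payload' => payload, 'signature' => signature))
end

# True only when the signature matches and the token is at most MAX_AGE seconds old.
def valid_token?(token, secret, now: Time.now)
  envelope = JSON.parse(Base64.urlsafe_decode64(token))
  payload = envelope['payload'].to_s
  expected = OpenSSL::HMAC.hexdigest('sha256', secret, payload)
  return false unless secure_equal?(expected, envelope['signature'].to_s)

  issued_at = Time.iso8601(JSON.parse(payload)['issued_at'])
  (now - issued_at).between?(0, MAX_AGE)
rescue StandardError
  false # any malformed token is simply treated as invalid
end
```

A token generated at noon is accepted thirty seconds later, and rejected once it is older than two minutes, signed with a different secret, or edited in transit.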

The authenticate! method must be invoked before serving any request. This means that the auth token must be included on every request the client makes. There are many ways of doing this. One approach, if you’re using jQuery to back Backbone.sync(), is to use its $.ajax beforeSend hook to include the header, as can be seen in the following CoffeeScript two-liner:

$.ajaxSetup beforeSend: (jqXHR, settings) ->
  jqXHR.setRequestHeader "x-app-auth-token", App.authToken

App.authToken can come from a number of places. I decided to bootstrap it when the page is originally served, something like:

<script type="text/javascript">
  App.authToken = "<%= auth_token %>";
</script>

In addition to that, it should be updated on an interval, so that in a single page app, which never requests a page refresh, the auth token is always fresh and subsequent API requests can be made.

The final client side code that provides the auth token and keeps it updated looks like so:

<script type="text/javascript">
  App.authToken = "<%= auth_token %>"; // bootstrap an initial value
  App.refresh_auth_token = function() {
    $.getJSON('/auth_token', function(data) {
      App.authToken = data.token; // request updated values
    });
  };
  window.setInterval(App.refresh_auth_token, 29000); // every 29 seconds
</script>

The /auth_token server side endpoint simply responds with a new valid token.
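As a rough idea of what that endpoint has to return, here is a dependency-free sketch written as a Rack-compatible lambda. The names fresh_token and AuthTokenEndpoint are hypothetical, the demo-secret fallback exists only so the snippet runs on its own, and the token format is modeled on the auth_token helper from earlier. The only real contract is a JSON body with a token key, since that is what the client code reads as data.token.

```ruby
require 'openssl'
require 'json'
require 'base64'

# Builds a signed token; the fallback secret is for demonstration only.
def fresh_token(secret = ENV.fetch('AUTH_SECRET', 'demo-secret'))
  data = { issued_at: Time.now }
  signature = OpenSSL::HMAC.hexdigest('sha256', secret, JSON.dump(data))
  Base64.urlsafe_encode64(JSON.dump(data.merge(signature: signature)))
end

# A Rack-style endpoint: takes the request env, returns [status, headers, body].
AuthTokenEndpoint = lambda do |_env|
  body = JSON.dump(token: fresh_token)
  headers = {
    'content-type'  => 'application/json',
    'cache-control' => 'no-store' # a cached token would defeat the refresh
  }
  [200, headers, [body]]
end
```

In the Sinatra app the same thing is presumably a short get '/auth_token' route that sits behind the same user authentication as the page itself.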

The fernet token expires every minute by default. I decided to update it every 29 seconds instead so that it can be updated at least twice before it has a chance to hold and use an expired token against a remote API.

In this app, the server side is used for one thing only: user authentication. The way it works is that when a request is made, the Sinatra app performs OAuth authentication against our Google Apps domain. Once the OAuth dance has succeeded, the app generates a token that is handed on to the client for authenticating against backend, remote APIs.

This whole setup has worked great for some months now.

On Top-down Design

There are many strategies for writing software.

Some developers like writing tests first, letting the tests drive the implementation and essentially become the first clients of the software. Others like writing production code first, figuring out how it works as they go, and then applying some tests after the fact. I am convinced that writing tests in this way is far less effective, but this is not an article on the merits of TDD. Others don’t like writing tests at all, so it’s a sliding scale.

You can write software from the bottom up, where you are forced to figure out what the data model and base objects should be and how it is to support all known use cases. This is hard, particularly in this day and age where software requirements are unpredictable and are meant to change very rapidly - at least in the fun industries.

You can also write software from the top down. In this case you let the outer shell of your software dictate what the inner layers need. For example, start all the way on the user interactions via some sketches and HTML/CSS in a web app. Or think through the CLI that you want. Readme driven development can help with this, too.

Outside-in development puts you in a great position to practice top-down design.

The advantage of outside-in development is twofold. Not only are you left with acceptance tests that will catch regressions and help you refactor later, but the top layers of your software also become a client of the code you are about to write, helping define its interface in much the same way that practicing TDD for your unit tests helps guide the design of software components at a very granular level.

These practices will help you define internal APIs that feel good faster, because you will notice cumbersome interfaces sooner, and are therefore given the opportunity to fix them when they have the least possible number of dependencies.

I know of many developers who prefer writing no tests, or prefer a bottom-up strategy. This by no means implies that their software is of poor quality. But I will observe that I can’t ship software of the quality standard that I hold myself to unless I write tests first and follow outside-in methodologies. If anything, this seems to indicate they are smarter than me.

Boston.io Recap

Boston.io took place yesterday at the Microsoft NERD Center. The event is aimed at students in the Greater Boston area who are interested in entrepreneurship and coding, whether that’s design, development, or ambidextrous.

There were technical talks on every aspect of the stack, from the metal all the way up to serving and consuming APIs and user experience design. There were also some not-so-technical talks about topics like the Boston startup scene, open source and hacker culture.

My talk was about PostgreSQL and it focused on how to use SQL to mine data and get information out of it. I stored the Twitter stream for about 24 hours prior to the conference and showed how to look at that data, showcasing CTEs, Window Functions and other Postgres features. Among the insights, we saw that the tweeps who post hashtags and URLs the most are spam accounts. Also, the most posted URLs come from shortener services, but I was surprised to see livingsocial.com among those as well. We found other fun facts during the talk too. It went great.

Hopefully the talk gave a good taste of what you can do with Postgres and SQL. It was SQL heavy, but that did not come without warning.

Other talks worth mentioning included Mike Burns’ classic UNIX talk. This time around he used curl, sed and grep to automate a SQL injection attack on a hilarious page he staged for this purpose. Erik Michaels-Ober’s talk was also great and surely inspired a few students to put code out there for the world to see. So many great talks altogether though, by people like Nick Quaranto on TDD, Ben Orenstein on vim, Bryan Lyles on OOP and more.

Oh, and there was a dude at the afterparty with a heroku shirt. Here’s a photo.

Overall a great event; thanks to organizers thoughtbot and Greenhorn Connect and drinkup sponsors hashrocket and github.

SSH Tunnels for Remote Pairing

Yesterday was a good day. @pvh and I paired for a few hours, even though we’re on opposite coasts.

Here’s what you need:

  • A server somewhere that both pairs have access to. We used an EC2 instance. We’ll use it to meet up and create the tunnel.
  • An SSH client - you should already have this unless you’re on Windows, in which case you probably want PuTTY.
  • Skype for audio.

As you know, The Internet is made of nodes connected by Tubes. Unfortunately, there is no tube from your machine to your pair’s machine. What we’ll do here is use a third node that has tubes to both of your machines to relay traffic through, in essence creating an Internet Tube from your machine to your pair’s. This kind of Internet Tube is called a Tunnel. Since we’re using SSH to do the traffic encryption and forwarding, it’s an SSH Tunnel. Yes, that’s somewhat made up, but sounds legit!

As it turns out, setting up a tunnel is fairly simple. For example, let’s set up a tunnel between you and Jane using a remote server saturn.

  1. You: Open up a shell and forward traffic from your local port 9999 over to saturn’s port 1235:
     ssh -L 9999:localhost:1235 saturn_user@saturn
  2. Jane: Open up a shell and forward traffic from saturn’s port 1235 to her port 22:
     ssh -R 1235:localhost:22 saturn_user@saturn
  3. You: Open up another shell, and ssh into your local port 9999, specifying a username on Jane’s machine:
     ssh jane_user@localhost -p 9999

And you’re good to go. Create a shared screen session, open up $EDITOR, use Skype, Google Hangouts, FaceTime or whatever for audio and start ping ponging.

The latency was surprisingly minimal. We left this tunnel open most of the day. It sat idle for periods of time and the connection was left active. All in all, a great setup.
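After the tunnel has been sitting idle for a while, it can be handy to check that your end of it is still listening before trying to reconnect. A small sketch using only Ruby’s standard library follows. The helper name port_open? is made up, and the check only proves that something accepts connections on the local port, not that Jane’s side of the tunnel is still attached.

```ruby
require 'socket'

# True when something accepts TCP connections on host:port within +timeout+ seconds.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false # refused, unreachable, unresolvable or timed out
end
```

With the setup above, port_open?('localhost', 9999) returning false would suggest the ssh -L session has died and needs to be restarted.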

If this is something you’ll do often, you might as well add a more permanent configuration to ~/.ssh/config. For example, you might add:

~/.ssh/config on your machine
Host tunnel_to_jane
  Hostname saturn
  LocalForward 9999 localhost:1235
  User saturn_user

# Jane's sshd, reached through the local end of the tunnel
Host jane
  Hostname localhost
  User jane_user
  Port 9999

Then you’d do, on one terminal, ssh tunnel_to_jane, and on the other ssh jane.

And Jane might add:

~/.ssh/config on Jane’s machine
Host tunnel_from_you
  Hostname saturn
  RemoteForward 1235 localhost:22
  User saturn_user

And she’d just do ssh tunnel_from_you.

This can be used not only for remote pairing, but also to forward any traffic on a port, over an SSH encrypted channel, to a remote host. For more see ssh_config(5) and ssh(1), and happy pairing!

Redis Talk at PostgreSQL Conf West

In case you’ve missed it so far, PostgreSQL West will take place next week in San Jose, California. You should go.

This weekend I worked on the slides and content for my Redis talk. Why would I decide to speak about Redis at a PostgreSQL conference? As it turns out, we’ve had great success in using Redis to complement our architecture, not to replace a main data store. I will speak about our experience in scaling out write heavy apps with Redis.

I will first introduce Redis from a general perspective, and then dig into the data types it offers and the operations it can do on each data type. During the course of the talk, I will demonstrate how we’ve used certain data types to solve some rather tricky scaling problems: real use cases solving real problems.

I hope this talk will be entertaining and informative to both DBAs and application developers. I will be sharing the slides here and on twitter afterwards, so stay tuned.

Additionally, on Tuesday I will be teaching a one day workshop on Ruby and Rails with a PostgreSQL focus. This is the second time I will do this at PostgreSQL conf, the last time being at PostgreSQL conf east in New York City. The class size was small, and the feedback I received was very positive in that attendees got a good grasp of the Ruby programming language and how Rails and Postgres fit in the ecosystem. I hope this time around is even better.

Hope to see you there.

PostgreSQL 9.1 Released

Version 9.1 of PostgreSQL was released yesterday.

Among the exciting new features are:

  • pg_basebackup - this can be used alongside Streaming Replication to perform a backup or clone of your database. I can imagine Heroku adopting this as an even faster and more reliable way to sync your production and staging databases, for example (when and if they upgrade to 9.1). However, it can also be used to create plain old tarballs and standalone backups.

  • Another replication goodie: Synchronous replication. On Postgres 9.0, replication was asynchronous. By enabling synchronous replication, you are basically telling the master database to only report the transaction as committed when the slave(s) have successfully written it to their own journal. You can also control this for a specific session by doing SET synchronous_commit TO off.

  • The new pg_stat_replication view displays the replication status of all slaves from a master node.

  • Unlogged tables. What? Postgres supports unlogged tables now? Yes, it does. They can be used for data you don’t necessarily care about, and a crash will truncate all data. They are much faster to write to and could be useful for caching data or keeping stats that can be rebuilt. You can create them like so:

CREATE UNLOGGED TABLE some_stats (value int);
  • SQL/MED gets the ability to define foreign tables. This is rich. It means that you can define a foreign data wrapper for a different database and access it from Postgres seamlessly. Some hackers have already built some nifty foreign data wrappers for MySQL, Oracle, and even Redis and CouchDB. I’m of the mind that if you’re actually using any of these databases to supplement your app’s data needs, you should just talk to them directly from your app. However, it may be possible to write some higher performance reports that use different data sources, and let Postgres do the heavy lifting of munging all the data together.

  • You no longer need to list every selected column in your GROUP BY clause: functionally dependent columns are inferred by the planner, so grouping by the primary key is sufficient. I talked about this before, and it has been a cause of great frustration to users coming from MySQL.

There’s much more in this release. Here are the release notes.

Huge thanks, Postgres hackers!

Design Tweaks

Today I made a few design tweaks to this blog. The goal is to move a bit away from the stock default octopress theme - but my design chops aren’t all that great and I can’t really budget the time to do a design from scratch.

A few simple changes and I’m OK with how it looks for now:

  1. I created the basic font logo that you now see on the header. This was the idea from the beginning, but I never had a chance to include it here. Commit: 27d0107f.

  2. Changed the typography in the site, as the original seemed a bit heavy for my taste. I started out by using Helvetica Neue on headers, but decided to go with Antic from the Google web font service. Commits: 2dd8f46b4 and 84bad437.

  3. I wanted something different for that dark background. Went hunting for tiles and patterns. There’s good stuff out there, but I settled for the dark wood background found here. Commit: 9593a00431. At this point, it also made sense to make the header and footer have a light background and dark font.

  4. Finally, added a gradient to the header. Commit: fe918b0e.

I’m fine with the result for now, but will probably revisit soon. Here’s a before and after.

Before

After


You Should Work for These Guys

For the last few months we’ve been working for an awesome company on a greenfield project here in Boston/Cambridge.

The industry: Healthcare. The goal: Improve patients’ lives by changing the way the entire system works. It’s exciting, and it is happening, and you can be a part of it.

The stack

Rails 3.1, Backbone.js, Coffeescript, faye, PostgreSQL, Cucumber, RSpec and Jasmine.

The process

Daily standups, TDD, code reviews via github pull requests.

The result

A highly responsive non-trivial app with a very clean code base and beautiful design.

The future holds a mobile app, web service integrations and ongoing maintenance to the current code base.

If you live in the Boston area, you should apply. If you don’t, you should move here. Right here.

New Domain Name, New Blog Engine

I haven’t touched my blog for a while. Part of it is that I just didn’t identify myself with “Awesomeful” any more. On the other hand, have I got a deal for you! Both awesomeful.net and awesomeful.org are for sale, so hit me up if you’re interested - I’m talking to you mister whois awesomeful.com.

Welcome to the new blog: Practice Over Theory. I hope that the new name and engine inspire me to post more often.

Migrating was not a huge task at all. I decided to give octopress a try. It prescribes a really weird method of deploying to github pages which involves cloning yourself into a subdirectory (!!), but now I have a pretty neat setup. It’s backed by jekyll and has a few nice addons, the most useful of which is its code highlighting theme, which is based on Solarized.

Speaking of code highlighting, let me show you a little rack app that redirects the old awesomeful.net posts to their new warm locations:

require 'rubygems'
require 'sinatra'

REDIRECTS = {
  'awesomeful-post-1' => 'new-location-1',
  'awesomeful-post-2' => 'new-location-2'
}.freeze

get '/*' do |path|
  one_year_in_seconds = 31536000
  headers 'Cache-Control' => "public, max-age=#{one_year_in_seconds}",
          'Expires' => (Time.now + one_year_in_seconds).httpdate
  redirect to("http://practiceovertheory.com/#{REDIRECTS[path]}"), 301
end

Pretty neat, huh? The syntax highlighting, I mean.

Regarding the above Sinatra app, I just have a dictionary[1] mapping the old paths to the new ones, and respond with an HTTP 301 Moved Permanently. The interesting bit is the HTTP caching employed. Heroku’s (awesome) varnish servers will remember that response for one year. Try it here.

[1] it’s a hash!
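The two decisions the app makes, where to send a visitor and how long caches may remember it, can also be pulled out into plain functions that are easy to poke at in irb. This is only a sketch: redirect_target and cache_headers are names invented here, and the mapping is the same placeholder data as above. Note that an unknown path falls back to the home page, which is also what the route above effectively does when the lookup returns nil.

```ruby
require 'time' # for Time#httpdate

SITE = 'http://practiceovertheory.com'
ONE_YEAR = 31_536_000 # seconds

# Placeholder mapping of old paths to new ones.
REDIRECTS = {
  'awesomeful-post-1' => 'new-location-1',
  'awesomeful-post-2' => 'new-location-2'
}.freeze

# Full URL for an old path; unknown paths fall back to the home page.
def redirect_target(path)
  "#{SITE}/#{REDIRECTS.fetch(path, '')}"
end

# Headers that let caches keep the redirect for a year.
def cache_headers(now = Time.now)
  {
    'Cache-Control' => "public, max-age=#{ONE_YEAR}",
    'Expires'       => (now + ONE_YEAR).httpdate
  }
end
```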

Machine Learning - Who’s the Boss?

In the Machine Learning field, there are two types of algorithms that can be applied to a set of data to solve different kinds of problems: Supervised and Unsupervised learning algorithms. Both aim to extract information or gain knowledge from raw data that would otherwise be very hard and impractical to obtain. This is because we live in very dynamic environments with changing parameters and vast amounts of data being gathered. This data hides important patterns and correlations that are sometimes impossible to deduce manually, and where computing power and smart algorithms excel. Both kinds of algorithms are also heavily dependent on the quantity and quality of the input data, and as such, evolve in their output and accuracy as more and better input data becomes available.

In this article we will walk through what constitutes Supervised and Unsupervised Learning. An overview of the language and terms is presented, as well as the general workflow used for machine learning tasks.

Supervised Learning

In supervised machine learning we have a set of data points or observations for which we know the desired output, class, target variable or outcome. The outcome may take one of many values called classes or labels. A classic example is that given a few thousand emails for which we know whether they are spam or ham (their labels), the idea is to create a model that is able to deduce whether new, unseen emails are spam or not. In other words, we are creating a mapping function where the inputs are the email’s sender, subject, date, time, body, attachments and other attributes, and the output is a prediction as to whether the email is spam or ham. The target variable is in fact providing some level of supervision in that it is used by the learning algorithm to adjust parameters or make decisions that will allow it to predict labels for new data. Finally of note, when the algorithm is predicting labels of observations we call it a classifier. Some classifiers are also capable of providing the probability of a data point belonging to a class, in which case the model is often referred to as a probabilistic model or a regression - not to be confused with a statistical regression model.

Let’s take this as an example of supervised learning. Given the following dataset, we want to predict whether new emails are spam or not. In the dataset below, note that the last column, Spam?, contains the labels for the examples.

| Subject | Date | Time | Body | Spam? |
|---|---|---|---|---|
| I has the viagra for you | 03/12/1992 | 12:23 pm | Hi! I noticed that you are a software engineer so here’s the pleasure you were looking for… | Yes |
| Important business | 05/29/1995 | 01:24 pm | Give me your account number and you’ll be rich. I’m totally serial | Yes |
| Business Plan | 05/23/1996 | 07:19 pm | As per our conversation, here’s the business plan for our new venture Warm regards… | No |
| Job Opportunity | 02/29/1998 | 08:19 am | Hi !I am trying to fill a position for a PHP … | Yes |
| [A few thousand rows omitted] | | | | |
| Call mom | 05/23/2000 | 02:14 pm | Call mom. She’s been trying to reach you for a few days now | No |

A common workflow approach, and one that I’ve taken for supervised learning analysis is shown in the diagram below:

The process is:

  1. Scale and prepare training data: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.
  2. Create a training set and a validation set by randomly splitting the universe of data. The training set is the data that the classifier uses to learn how to classify the data, whereas the validation set is fed to the already trained model in order to get an error rate (or other measures) that can help us gauge the classifier’s performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data. Note that there is also cross-validation, but that is beyond the scope of this article.
  3. Train the model. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.
  4. Validation and tuning: After we’ve created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why on step 2 we separated out a subset of the data that was not used for training. We are indeed testing our model’s generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed on the training data, and we can achieve a very low error in doing that, but how do the very same rules or mappings perform on new data that may have different input to output mappings? If the classification error of the validation set is very big compared to the training set’s, then we have to go back and adjust model parameters. The model will have essentially memorized the answers seen in the training data, losing its generalization capabilities. This is called overfitting, and there are various techniques for overcoming it.
  5. Validate the model’s performance. There are numerous techniques for achieving this, such as ROC analysis and many others. The model’s accuracy can be improved by changing its structure or the underlying training data. If the model’s performance is not satisfactory, change model parameters, inputs and/or scaling, go to step 3 and try again.
  6. Use the model to classify new data. In production. Profit!
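Steps 2 and 4 above can be sketched in a few lines of Ruby. This is a minimal illustration under assumptions, not a full workflow: the helper names train_validation_split and error_rate are made up, the 80/20 ratio is just the example figure from step 2, and the fixed seed is there only to make the split reproducible.

```ruby
# Randomly splits +rows+ into [training, validation] sets.
# A fixed seed makes the shuffle, and therefore the split, reproducible.
def train_validation_split(rows, train_fraction: 0.8, seed: 42)
  shuffled = rows.shuffle(random: Random.new(seed))
  cut = (rows.size * train_fraction).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

# Fraction of examples whose predicted label differs from the known label.
def error_rate(predictions, labels)
  wrong = predictions.zip(labels).count { |predicted, actual| predicted != actual }
  wrong.to_f / labels.size
end
```

Computing error_rate once on the training set and once on the validation set gives the comparison described in step 4: a validation error much larger than the training error is the usual sign of overfitting.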

Unsupervised Learning

The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different from those suited for supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations in one way or another. Viewed from 20 thousand feet, the main difference is that we are not providing a target variable like we did in supervised learning.

This marks a fundamental difference in how both types of algorithms operate. On one hand, we have supervised algorithms which try to minimize the error in classifying observations, while unsupervised learning algorithms don’t have such luxuries because there are no outcomes or labels. Unsupervised algorithms try to create clusters of data that are inherently similar. In some cases we don’t necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and group them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data where data points belonging to a cluster are as similar to each other as possible, while making the difference between the clusters as high as possible.

Another main difference is that in a clustering problem, the concept of “Training Set” does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships in the data that bucket the observations into different clusters. We could of course apply the same clustering model to new data, but unless it is too impractical to do so (perhaps for performance reasons), we will most certainly want to rerun the algorithm on new data, as it will typically find new relationships that surface given the new observations.

As a simple example, you could imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age ranges, gender and other attributes in such way that we can develop targeted marketing programs. Another example may be to cluster patients by their chronic diseases and comorbidities in such a way that targeted interventions can be developed to help manage their diseases and improve their lifestyles.

For unsupervised learning, the process is:

  1. Scale and prepare raw data: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.
  2. Build model: We run the unsupervised algorithm on the scaled dataset to get groups of like observations.
  3. Validate: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.
  4. Once we are satisfied with the clusters created there is no need to run the model with new data (although you can). Profit!
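The within group sum of squares mentioned in step 3 is simple enough to compute by hand. The sketch below assumes each cluster is an array of equal-length numeric vectors and uses squared Euclidean distance to the cluster’s mean point; the function names are chosen for illustration.

```ruby
# Mean point (centroid) of a cluster of equal-length numeric vectors.
def centroid(points)
  points.transpose.map { |coords| coords.sum.to_f / coords.size }
end

# Sum of squared Euclidean distances from each point to its own cluster's
# centroid, added up over all clusters. Lower means tighter clusters.
def within_group_sum_of_squares(clusters)
  clusters.sum do |points|
    center = centroid(points)
    points.sum { |point| point.zip(center).sum { |x, c| (x - c)**2 } }
  end
end
```

For the same data, a grouping that puts nearby points together yields a smaller value than one that mixes distant points, which is what makes the statistic useful for comparing candidate clusterings.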

Step zero

A common step that I have not outlined above and should be performed when working on any such problem is to get a strong understanding for the characteristics of the data. This should be a combination of visual analysis (for which I prefer the excellent ggplot2 library) as well as some basic descriptive statistics and data profiling such as quartiles, means, standard deviation, frequencies and others. R’s Hmisc package has a great function for this purpose called describe.

I am convinced that not performing this step is a non starter for any datamining project. It will allow you to identify missing values, general distributions of data, early outlier detection, among many other characteristics that drive the selection of attributes for your models.
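R is the better tool for this kind of profiling, but the idea fits in a few lines of Ruby as well. The sketch below is a much smaller cousin of Hmisc’s describe, limited to a single numeric column; profile_column is a made-up name, nil is assumed to mark a missing value, and at least two present values are needed for the sample standard deviation.

```ruby
# Basic profile of one numeric column; nil entries count as missing values.
# Assumes at least two non-nil values.
def profile_column(values)
  present = values.compact.sort
  n = present.size
  mean = present.sum.to_f / n
  variance = present.sum { |v| (v - mean)**2 } / (n - 1) # sample variance
  mid = n / 2
  median = n.odd? ? present[mid].to_f : (present[mid - 1] + present[mid]) / 2.0
  { n: n, missing: values.size - n, min: present.first, max: present.last,
    mean: mean, median: median, sd: Math.sqrt(variance) }
end
```

Running it over every column before modeling is a cheap way to spot missing values and suspicious minimums or maximums early.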

Wrapping up

This is certainly quite a bit of info, especially if these terms are new to you. To summarize:

| | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Objective | Classify or predict a class. | Find patterns inherent to the data, creating clusters of like data points. Dimensionality Reduction. |
| Example Implementations | Neural Networks (Multilayer Perceptrons, RBF Networks and others), Support Vector Machines, Decision Trees (ID3, C4.5 and others), Naive Bayes Classifiers | K-Means (and variants), Hierarchical Clustering, Kohonen Self Organizing Maps |
| Who’s the Boss? | The target variable or outcome. | The relationships inherent to the data. |

Hopefully this article shows the main differences between Unsupervised and Supervised Learning. In followup posts we will dig into some of the specific implementations of these algorithms with examples in R and Ruby.