There is an informative blog post on O’Reilly today titled ‘Need faster machine learning? Take a set-oriented approach’. In summary, it is about bringing processing efficiency to a ‘machine learning’ algorithm that categorises job vacancies in the field of Electronic Medical Records. Even for this ‘small’ categorisation set, the calculations are vast. The LifestyleLinking project is using this approach to classify lifestyles instead of jobs. While the processes read similarly, the technology applied and the categorisation seeding start from different places. The LL project has chosen Wikipedia to seed lifestyle definitions, and wisdom-of-the-crowd mathematics is used instead of a Bayes classifier. Regardless of these differences, the goals are the same: to give individuals the best information to live their lives by, or to find the right job.
As for the question of scaling, the post says that smarts in the database and in managing data are where the ‘Set Oriented’ approach has delivered dramatic improvements in processing times. The LL project has concluded (thus far) that it should move away from this centralized data processing towards a framework of decentralized peer-to-peer processing, and then find a way to aggregate (share) the data each peer needs. We have yet to put this into practice, but from our experience to date, scaling in the cloud breaks down at even ‘small’ levels of categorisation, let alone for all the information that makes up life.
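
For readers unfamiliar with the ‘Set Oriented’ idea mentioned above, here is a minimal sketch of the general contrast, not the O’Reilly post’s code and not the LL project’s. It scores documents against categories twice: once by looping over rows and querying per row, and once by letting the database do a single join and aggregate. The table names, terms, weights and the simple sum-of-weights scoring are all made up for illustration; a real Bayes classifier or wisdom-of-the-crowd calculation would score differently.

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc_terms (doc_id INTEGER, term TEXT)")
    conn.execute(
        "CREATE TABLE term_weights (term TEXT, category TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO doc_terms VALUES (?, ?)",
        [(1, "nurse"), (1, "records"), (2, "surfing"), (2, "records")],
    )
    conn.executemany(
        "INSERT INTO term_weights VALUES (?, ?, ?)",
        [
            ("nurse", "health", 2.0),
            ("records", "health", 1.0),
            ("records", "music", 1.5),
            ("surfing", "sport", 3.0),
        ],
    )
    return conn


def scores_row_by_row(conn):
    """Row-at-a-time: one extra query per (document, term) row."""
    scores = {}
    doc_rows = conn.execute("SELECT doc_id, term FROM doc_terms").fetchall()
    for doc_id, term in doc_rows:
        weight_rows = conn.execute(
            "SELECT category, weight FROM term_weights WHERE term = ?",
            (term,),
        )
        for category, weight in weight_rows:
            key = (doc_id, category)
            scores[key] = scores.get(key, 0.0) + weight
    return scores


def scores_set_oriented(conn):
    """Set-oriented: a single join plus GROUP BY does all the work."""
    rows = conn.execute(
        """
        SELECT d.doc_id, w.category, SUM(w.weight)
        FROM doc_terms AS d
        JOIN term_weights AS w ON w.term = d.term
        GROUP BY d.doc_id, w.category
        """
    )
    return {(doc_id, category): total for doc_id, category, total in rows}


if __name__ == "__main__":
    conn = build_db()
    print(scores_row_by_row(conn) == scores_set_oriented(conn))
```

Both functions return the same scores; the difference is that the second hands the whole problem to the database as one operation on sets of rows, which is where the processing-time gains described in the post would be expected to come from on large data.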