Thursday, August 1, 2013

User-based recommendation using Mahout and Solr

Recommendation is an important task for e-commerce sites, news portals and websites in general: the goal is to provide users with relevant information related to what they are searching for. In this post, we will discuss how to achieve that by retrieving indexed data (a "wishlist" per user) from Solr and using it to build a recommender with Apache Mahout that infers related items from users' preferences.

Some background: Apache Mahout

Mahout is a scalable machine learning framework which supports recommendation (collaborative filtering), clustering and classification tasks by providing a set of well-known algorithm implementations. Being scalable, Mahout also supports distributed computing by using Hadoop as its underlying map-reduce framework.

Through its DataModel interface, Mahout provides mechanisms to process data stored in files, databases or memory, which makes it easy to integrate with different sources, not just Hadoop. Here we have chosen Solr as the data source, for simplicity.

Since our goal is recommendation, we will set up a user-based recommendation environment with Mahout for testing. For now, we are not worried about performance issues related to the size of the available data, so the next section uses a toy sample to illustrate the framework's functionality.

By user-based recommendation we mean that the items recommended to a user are inferred from the preferences of other, similar users.

Example setup: user-based recommendation

First of all, we have indexed some data in Solr. Each document can be seen as a user wishlist, where the userid field is the unique user identifier and the itemlist field is a list of product ids. So if we execute a Solr query, we get a response like this:
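
The original response listing is not reproduced here; assuming the userid/itemlist fields described above and the Mahout in Action toy data mentioned below, the response body (shown in JSON for readability) would look roughly like this:

{
  "response": {
    "numFound": 5,
    "docs": [
      { "userid": "1", "itemlist": ["101", "102", "103"] },
      { "userid": "2", "itemlist": ["101", "102", "103", "104"] },
      { "userid": "3", "itemlist": ["101", "104", "105", "107"] },
      { "userid": "4", "itemlist": ["101", "103", "104", "106"] },
      { "userid": "5", "itemlist": ["101", "102", "103", "104", "105", "106"] }
    ]
  }
}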



This sample data is the same used in the early chapters of Mahout in Action. The main difference here is that we discard the preference values, because in this case we are not considering users' ratings of items. This is called boolean preferences: if there is an itemID associated with a userID, then a preference exists and its value does not matter.

The code snippet below creates a model as a GenericBooleanPrefDataModel instance from the data indexed in Solr. This model is basically the data structure used by the recommendation algorithm in the next steps.
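
The original snippet is not shown above, so here is a minimal sketch of the idea, assuming SolrJ 4.x (HttpSolrServer), the userid/itemlist fields from the previous section and a hypothetical core URL:

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrDataModelBuilder {

    public DataModel buildModel(String solrUrl) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer(solrUrl);
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(1000); // enough for the toy dataset
        QueryResponse response = solr.query(query);

        // One wishlist document per user: userid -> set of item ids.
        FastByIDMap<FastIDSet> userData = new FastByIDMap<FastIDSet>();
        for (SolrDocument doc : response.getResults()) {
            long userID = Long.parseLong(doc.getFieldValue("userid").toString());
            FastIDSet itemIDs = new FastIDSet();
            for (Object item : doc.getFieldValues("itemlist")) {
                itemIDs.add(Long.parseLong(item.toString()));
            }
            userData.put(userID, itemIDs);
        }

        // Boolean preferences: only the user-item association matters.
        return new GenericBooleanPrefDataModel(userData);
    }
}

It could be used like this: DataModel model = new SolrDataModelBuilder().buildModel("http://localhost:8983/solr/wishlists");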



After creating the DataModel object, we are able to build a recommender:
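
A sketch of the recommender construction, continuing from the model built above; the classes come from the org.apache.mahout.cf.taste packages and the neighborhood size of 2 is just an illustrative choice:

// Log-likelihood similarity works without preference values.
UserSimilarity similarity = new LogLikelihoodSimilarity(model);

// The 2 most similar users form the neighborhood.
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);

Recommender recommender =
        new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

// Ask for one recommendation for user #4.
List<RecommendedItem> recommendations = recommender.recommend(4, 1);
System.out.println(recommendations);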



The code snippet above does all the magic: the nearest-neighborhood approach means that the items recommended to a user are calculated from the (log-likelihood) similarity between that user and the users contained in the model. The GenericBooleanPrefUserBasedRecommender class is the appropriate recommender for boolean preferences.

The LogLikelihoodSimilarity class does not need preference values. Other similarity metrics, like Euclidean distance and Pearson correlation, throw IllegalArgumentException when used with boolean preferences.

After executing this code, the recommender's answer is:
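
The original output listing is missing here; with the toy data it should contain item 102 for user #4, printed by Mahout roughly in this form (the estimated value depends on the similarity scores):

[RecommendedItem[item:102, value:...]]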



This means that user #4 has already liked items 101, 103, 104 and 106, and item 102 is recommended to him, according to our user-based recommendation setup.

Anonymous users

If the user is not indexed (i.e. a new, not-logged-in or unregistered user), you can change the code to use the PlusAnonymousUserDataModel class, which represents a user about whom we have no prior information. Here we assume that some anonymous user has marked items 102 and 105 in some kind of wishlist.
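
The original snippet is not shown, so here is a minimal sketch of the idea, again continuing from the model and Mahout classes used above; the item IDs 102 and 105 come from the text, the neighborhood size is illustrative:

PlusAnonymousUserDataModel anonymousDataModel = new PlusAnonymousUserDataModel(model);

// Temporary boolean preferences for a user we know nothing about: items 102 and 105.
PreferenceArray prefs = new BooleanUserPreferenceArray(2);
prefs.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
prefs.setItemID(0, 102);
prefs.setItemID(1, 105);
anonymousDataModel.setTempPrefs(prefs);

// Build the same kind of recommender, but on top of the wrapping model.
UserSimilarity similarity = new LogLikelihoodSimilarity(anonymousDataModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, anonymousDataModel);
Recommender recommender =
        new GenericBooleanPrefUserBasedRecommender(anonymousDataModel, neighborhood, similarity);

List<RecommendedItem> recommendations =
        recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 2);
System.out.println(recommendations);

// Release the temporary preferences once done with this anonymous user.
anonymousDataModel.clearTempPrefs();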



The anonymousDataModel object wraps our previously loaded DataModel object. The PreferenceArray contains the preferred items and a temporary userID that represents this anonymous user. The output is:



These are the items recommended to this new user, based on the items selected by previous users.

Remarks

We have illustrated how to retrieve Solr data and build a Mahout DataModel in order to make user-based recommendations. Modeling the input and then building a recommender is quite simple. We focused on items with no preference values, which is why we applied the log-likelihood similarity implementation, one of the similarity metrics that does not require such values to produce results.

You may have noticed that evaluating recommendation performance and response quality metrics (accuracy, precision, recall and so on) was out of our scope. These topics will certainly be explored later.

Sunday, June 9, 2013

Clustering documents with Solr and Carrot2

It is very common nowadays to hear about machine learning techniques that try to infer similarities and labels from data. Carrot2 is an open-source clustering framework that does exactly that. My intent in this post is to configure Solr to use Carrot2 so that, given a search result, similar documents are clustered into groups and a label is suggested for each group. But first, some background that will help you understand the clustering mechanism.

Some background: Unsupervised learning

Unsupervised learning is the kind of learning in which the class (or label) of each input example is unknown. Techniques therefore learn how similar documents are, based on their attributes, and then cluster them into groups. It is similar to the way we human beings think: we are able to associate similar objects, people and other things by analysing their characteristics. In other words, clustering algorithms work in an unsupervised way: there is no label in the training set to guide the algorithm towards the right answer (by minimizing an error rate).
This is how Carrot2 algorithms work: in an unsupervised fashion, the implementations try to group similar documents and suggest a label for each group.

Carrot2

As already mentioned, Carrot2 is a clustering framework. It contains several unsupervised learning algorithm implementations and can easily be integrated with Solr through the ClusteringComponent. This lets us set up a clustering searchComponent in Solr and plug it into one or more requestHandlers. In the end, when a request is made, the documents in the search results are clustered into groups based on their similarity.

Example setup

A test dataset is shown right below. There are a few sample documents, and we can play with the parameters:
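
The original sample documents are not reproduced here; a hypothetical dataset in the same spirit (judging by the "Pencil" and "Expensive" labels discussed below) could be posted to Solr like this, assuming only the default id and title fields:

<add>
  <doc><field name="id">1</field><field name="title">Cheap yellow pencil</field></doc>
  <doc><field name="id">2</field><field name="title">Mechanical pencil with eraser</field></doc>
  <doc><field name="id">3</field><field name="title">Blue pencil for drawing</field></doc>
  <doc><field name="id">4</field><field name="title">Expensive fountain pen</field></doc>
  <doc><field name="id">5</field><field name="title">Expensive leather notebook</field></doc>
  <doc><field name="id">6</field><field name="title">Expensive silver ballpoint</field></doc>
</add>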



Now let's set up solrconfig.xml in order to configure the clustering component:
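
A sketch of the searchComponent declaration, assuming the Carrot2 bisecting k-Means algorithm; the engine name, attribute keys and the clusterCount/maxIterations values are illustrative:

<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
    <!-- Carrot2 attribute keys: number of clusters and iteration limit -->
    <int name="BisectingKMeansClusteringAlgorithm.clusterCount">2</int>
    <int name="BisectingKMeansClusteringAlgorithm.maxIterations">15</int>
  </lst>
</searchComponent>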



We have chosen the BisectingKMeansClusteringAlgorithm class for this example, an implementation of the widely known k-Means clustering algorithm. Basically, this algorithm randomly chooses n centroids (where n is the clusterCount). By centroid we mean the instance (document) that best represents a certain group. As the algorithm iterates, it recalculates the centroids according to the documents seen in its training phase, until the centroids stop changing or maxIterations is reached. Another characteristic of k-Means is that it is a hard-clustering algorithm, which means that each document is assigned to exactly one group.

Once the clustering searchComponent is set, let's declare it in a requestHandler definition:
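
A possible requestHandler wired to that component; the handler name and field mappings are assumptions (carrot.title and carrot.snippet tell Carrot2 which fields to cluster on):

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">title</str>
    <str name="fl">id,title,score</str>
    <int name="rows">100</int>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>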



Assuming you have already indexed the sample data, let's make a simple search request:
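
With the handler sketched above, the request could be as simple as this (the URL assumes a local default Solr instance):

http://localhost:8983/solr/clustering?q=*:*&rows=100&wt=json&indent=true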



So, let's take a look at the response:
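
The original response is not reproduced here, but the clusters section of the response has this general shape (the document ids are placeholders tied to the hypothetical dataset above; the labels come from the discussion below):

"clusters": [
  {
    "labels": ["Pencil"],
    "docs": ["1", "2", "3"]
  },
  {
    "labels": ["Expensive"],
    "docs": ["4", "5", "6"]
  }
]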



Interesting! Solr and Carrot2 have grouped similar documents (represented by their id fields) and suggested a label that should best represent each group. In the case of the "Pencil" label this is quite clear. The other group (not pencils, if you look at the title contents) was labeled "Expensive", which, according to the algorithm, is the most likely label given the input dataset. If you try tuning the clusterCount parameter in solrconfig.xml, you should notice that documents that were previously in different groups may end up in the same group.

Remarks

Carrot2 is a really simple clustering framework. It offers open-source algorithm implementations as well as commercial ones, and Apache Solr has built-in integration with it.

On the other hand, Carrot2 computes clusters over the documents in the search result. This means that, due to performance restrictions, you cannot retrieve the whole document set in a single request; the fewer documents returned, the lower the clustering quality. If you try to increase the Solr rows parameter on a large dataset, you will find that response time can be compromised. At this point, the Carrot2 clustering.documents option, which could avoid this situation, is not available yet. You should also take care with Solr field filters and analyzers, which can add noise to label suggestion.

Finally, if you need more accurate clustering with fast response times, you should consider doing it at index time rather than at query time. You may also want to check out the clustering algorithms provided by Mahout, the scalable machine learning library, which we shall certainly discuss in future articles.

That's all. See you next post.

Tuesday, June 4, 2013

Automated tests in four J-steps

Hi there! In software development it is really important to also think about automated functional tests, for many reasons that I do not intend to discuss today. However, it is very common for QA teams (at least I have seen it a lot) to use complex automation tools to accomplish simple tests, and these tools can make test creation and evolution very hard to follow.

Our purpose in this article is simply to build an example with the four J's: automate an integration test with Jersey as the REST client and JUnit as the main test framework, publish test documentation with Javadoc and, at last, set up Jenkins to show the JUnit test reports and Javadoc documentation in its dashboard. As you will see, this configuration is very easy to create and maintain, so you should think twice before adopting a more powerful test strategy; this simple setup can cover most automated test needs.

Example setup

In order to show the power of the J's, we have implemented a simple webservice test and deployed it in a Jenkins environment. It is meant to be a very practical example, which is why it is illustrated in four J-steps.

J-step #1: Jersey

Jersey is a RESTful library, the reference implementation of the JAX-RS specification. We have easily implemented a webservice request client in the code snippet below:
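
The original snippet is not shown above; a minimal sketch of such a generic client with Jersey 1.x could look like this (the requestWebService() method name is the one mentioned later in the post, the class name is a hypothetical choice):

import javax.ws.rs.core.MediaType;

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.WebResource;

public class RestTestClient {

    /**
     * Performs a GET request and unmarshals the XML response body
     * into an instance of the given class (JAXB behind the scenes).
     */
    public <T> T requestWebService(String url, Class<T> responseType) {
        Client client = Client.create();
        WebResource resource = client.resource(url);
        return resource.accept(MediaType.APPLICATION_XML).get(responseType);
    }
}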



The purpose of the code is to make a request to a URL and store the response in an instance of a parameterized class, using the Java generics mechanism. We have also specified that the webservice returns XML, which is unmarshalled behind the scenes into our Java object.

J-step #2: Javadoc

We expect Maven to generate the test javadoc (supposing we are required to write test documentation using Javadoc). To make that happen, let's set up the maven-javadoc-plugin in pom.xml:
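
A possible pom.xml configuration; the custom @input and @assert tags and their head texts match what shows up in the generated javadoc later, while the plugin version and execution binding are just examples:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <version>2.9</version>
  <configuration>
    <tags>
      <tag>
        <name>input</name>
        <placement>a</placement>
        <head>Test input:</head>
      </tag>
      <tag>
        <name>assert</name>
        <placement>a</placement>
        <head>Test assertion:</head>
      </tag>
    </tags>
  </configuration>
  <executions>
    <execution>
      <id>test-javadoc</id>
      <phase>package</phase>
      <goals>
        <!-- generates javadoc for the test sources -->
        <goal>test-javadoc</goal>
      </goals>
    </execution>
  </executions>
</plugin>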



The Javadoc plugin for Maven allows us to customize tags, so we can write all automated test documentation in javadoc. I think this is a great advantage, given that QA teams are always looking for tools that provide a structured way to document tests, and javadoc in automated tests can reach that goal. In the configuration we have created custom javadoc tags: @input and @assert. Whenever these tags appear in the code, they are rendered with their respective head values in the generated javadoc.

J-step #3: JUnit

Now that the javadoc plugin is configured, let's create our first (and documented, for sure!) test class:
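
A sketch of such a test; the Summation class, the requestWebService() helper (here on the hypothetical RestTestClient from the Jersey step) and the webservice URL are assumptions carried over from the previous steps:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class SummationWebServiceTest {

    private final RestTestClient client = new RestTestClient();

    /**
     * Calls the summation webservice and checks its arithmetic.
     *
     * @input the operands a=2 and b=3, sent as query parameters
     * @assert the result returned by the webservice equals 5
     */
    @Test
    public void shouldReturnTheSumOfTwoNumbers() {
        Summation summation = client.requestWebService(
                "http://localhost:8080/calculator/sum?a=2&b=3", Summation.class);
        assertEquals(5, summation.getResult());
    }
}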



This test is really, really, really simple. A request is made using the requestWebService() method and the response is stored in a Summation instance. You can implement the Summation class using the JAXB (another "J") classes from the javax.xml.bind package. The getResult() method returns the result of the sum operation provided by the webservice, which is then compared with the expected value using JUnit's assertEquals.

J-step #4: Jenkins

Let's deploy the whole structure in our continuous integration environment. The last part is setting up Jenkins, where the Javadoc and JUnit report views can be enabled in the dashboard by filling in the form below:
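
The original screenshot is not reproduced here. In a standard Maven job the relevant post-build settings come down to two entries; the paths below are the usual Maven defaults and may need adjusting to your build:

Publish JUnit test result report -> Test report XMLs: target/surefire-reports/*.xml
Publish Javadoc -> Javadoc directory: target/site/testapidocs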



We just have to enter the Maven-generated test report and javadoc directories. This way, we can easily view the JUnit reports and the test documentation in the Jenkins dashboard. The test javadoc written earlier should look like this:



As expected, the @assert and @input tags were rendered as "Test assertion" and "Test input", respectively.

Remarks

Integration testing is an essential part of any software development cycle. However, it is sometimes neglected because of the inherent complexity of the activity. This approach gives you a simple way to write automated webservice tests without a "silver bullet" tool, which can be very hard to maintain. There are many interesting and powerful test tools out there, but sometimes we are required to build something really simple under time and/or budget restrictions.

Another advantage is that by deploying test cases, test reports and javadoc to Jenkins, we concentrate all these artifacts in the same environment, which contributes to test maintainability.

That's acceptable for now. See you next post.

Friday, May 31, 2013

Configure Solr Did You Mean

Apache Solr is one of my favorite tools. This open-source enterprise search engine is very powerful, scalable, resilient and full of features. In this article, my purpose is to show how to configure the 'did you mean' feature. I always tell coworkers that 'did you mean' is the Solr feature with the best cost-benefit ratio: as you will see, it is very easy to set up and gives the target user the feeling that your search mechanism is actually very smart.

Some background

Solr 'did you mean' support can be configured using the SpellCheckComponent, which is built on top of the Lucene implementations of the org.apache.lucene.search.spell.StringDistance interface. This interface provides the getDistance(String string1, String string2) method: the implementations compare the two given strings and return a (float) factor, and the closer the factor is to 1.0, the more similar the words are.

A dummy test I like to do is to write a simple Java main program, take some sample words from the index and try the Lucene implementations:
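
Something along these lines; the word pair and the implementations tried here are just an illustration, not the original sample:

import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.NGramDistance;
import org.apache.lucene.search.spell.StringDistance;

public class StringDistanceSandbox {

    public static void main(String[] args) {
        // A word taken from the index and a likely misspelling of it.
        String indexedWord = "samsung";
        String misspelledWord = "sansung";

        StringDistance[] implementations = {
                new LevensteinDistance(),
                new JaroWinklerDistance(),
                new NGramDistance()
        };

        for (StringDistance distance : implementations) {
            System.out.println(distance.getClass().getSimpleName() + ": "
                    + distance.getDistance(misspelledWord, indexedWord));
        }
    }
}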



After this test, I have some idea of a reasonable value for the accuracy parameter we will configure in the SpellCheckComponent. This is important because if you are too restrictive (accuracy too close to 1.0), Solr may not suggest any word at all, while a very low accuracy is not good either, because Solr may suggest completely different words.

The output of this code snippet is:



After that, all we have to do is set up Solr to use one of these Lucene implementations behind the scenes, through the SpellCheckComponent configuration.

Example setup

Now, hands on! I have used the default schema.xml from the Solr installation and indexed the sample documents located in the /example/exampledocs folder. The first snippet shows the solrconfig.xml setup:
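
A sketch of the configuration, assuming the dictionary is built from the text field of the default example schema and an accuracy picked with the help of the distance test above; the request parameters are added to the existing /select handler:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.7</float>
    <int name="maxEdits">2</int>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collationExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>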



Before Solr 4.x, a separate index had to be built by Solr to compute spellcheck suggestions. Currently, Solr can use the main index itself as the dictionary (the solr.DirectSolrSpellChecker configuration). This option reduces the total index size, which is more than desirable. :)

Another point: when we analyse the output of the earlier example, it clearly shows that if we set an accuracy higher than 0.75, only the JaroWinklerDistance implementation would return suggestions. So you have to pay attention when setting this accuracy value.

After configuring solrconfig.xml, we can test a query. It looks like this:
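
Using the misspelled brand name from the discussion below, the test query could be something like this (local default URL assumed):

http://localhost:8983/solr/select?q=sansung&wt=json&indent=true&spellcheck=true&spellcheck.collate=true&spellcheck.collationExtendedResults=true&spellcheck.maxCollationTries=5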



And the response:
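
The original response is omitted here; the relevant spellcheck block has roughly this shape (the counts and collation hits will vary with your index):

"spellcheck": {
  "suggestions": [
    "sansung", {
      "numFound": 1,
      "startOffset": 0,
      "endOffset": 7,
      "suggestion": ["samsung"]
    },
    "collation", {
      "collationQuery": "samsung",
      "hits": 1,
      "misspellingsAndCorrections": ["sansung", "samsung"]
    }
  ]
}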



An important point is the use of the spellcheck.collationExtendedResults and spellcheck.maxCollationTries parameters. If these parameters are missing, Solr may combine words into a collation that returns no results and still suggest it, which is not desirable behavior. For example, if you query for ipod and sansung, even though Solr correctly suggests samsung, the collation list will not include that combination, because those words together in the same query would not return any documents.

Complexity

Distance algorithms implemented by Lucene, like the LevensteinDistance class (Solr's default choice), have complexity O(m*n), where m and n are the lengths of the compared strings. Considering that the average word length in English is about 5 characters, this distance computation offers very acceptable performance. If your tests show poor performance, you should run load and/or stress tests on your other queries as well, because your overall system performance is probably the real problem.

Conclusions

Providing a 'did you mean' experience to users can be achieved easily, and it is a powerful way to minimize the "zero results" issue. You can learn more about Solr spellcheck in the Solr wiki: http://wiki.apache.org/solr/SpellCheckComponent.

That's all! See you next post.

Tuesday, May 28, 2013

Welcome aboard!


Hi everyone! I have created this blog to share quick tips that can help you develop performant, scalable and (why not) smart systems. Our discussion topics include (but are not limited to): Java and Python programming, frameworks, performance testing, enterprise search engines, asynchronous event-driven communication, machine learning, statistics, mathematics and so on.


The main idea is to show you practical, feasible tools that can improve your software development and testing experience, as well as your target users' experience.

Are you ready? Hope you enjoy it!