comSysto at the TDWI Conference

Next week, comSysto and MapR will jointly present themselves as silver sponsors at the 13th European TDWI Conference, which takes place from June 17 to 19, 2013 at the MOC in Munich. It is the largest independent event on business intelligence, data warehousing and data management in the German-speaking region.

On Wednesday, June 19, Dr. Markus Schmidberger of comSysto will hold a session on “Using R for Business Intelligence in Big Data”. The full program of the TDWI Conference can be found here!

Join the winning side and visit comSysto and MapR at our joint booth (# 27 and 28)! The prizes are:

1.         a free ticket for a MapR training
2. + 3.  one workshop each (max. 12 participants) on various Big Data technologies
4.         a new Nexus 7 in its original packaging

Quick facts:
TDWI Conference, Monday – Wednesday, June 17–19, 2013

Venue
M.O.C. München, Lilienthalallee 40
Directions

Session W3A, 10:00 – 13:15, Room K3
Dr. Markus Schmidberger: “Using R for Business Intelligence in Big Data”

Information on further events and trainings is available on our events page.

Integrating Google Calendar into a Wicket Application

I recently integrated Google Calendar into a Wicket application I use at home. As some requirements are atypical I want to describe them before going into too much detail:

  • As I am the only user of the application, I want to allow anonymous access to it. If multiple users access your system, you have to implement “proper” authentication in your application though.
  • For the sake of simplicity I do not care about encrypting keys stored on my disk. However, if you use the code from this post in a real production system with multiple users, please do yourself a favor and ensure that any keys are stored as securely as possible.

Technical options

If you want to use a Google API in your application there is a chance that they already provide a client library, which is actually the case for Google Calendar. I decided to go another route for two reasons:

  1. As this is yet another REST API, there is no need for a special library and a new API. If I use Spring already I can achieve the same result by simply using Spring’s RestTemplate.
  2. I just wanted to create, update and delete all-day events with an event title. There is no need for a full blown API that allows all options such as support for recurring events, multiple participants or reminders.

Prior to accessing a Google API an application needs authorization from users to access their data which is handled by OAuth 2. As the application already uses Spring for various aspects, Spring Security OAuth 2 was the obvious choice for me as it integrates nicely into the rest of the application.

However, I faced two challenges: First, I needed to integrate Wicket and Spring Security OAuth 2 and second, Spring Security OAuth 2 needed some tweaks to work together with my setup.

Apart from Spring Security OAuth 2 I use the following libraries/frameworks:

  • Wicket as web framework
  • Spring as dependency injection container
  • Jackson for storing tokens and serializing and deserializing messages to the Google Calendar API
  • Gradle as build tool
  • Jetty as container

I have created a simple demo application which allows you to create an all-day event on any of your calendars. It is available on Github.

OAuth 2 in less than 100 words

There is plenty of information available on the web about OAuth 2, so I just want to describe briefly what OAuth 2 does:

Using OAuth 2 users can grant your application access to their data available through third-party services without providing you the credentials of the accounts you want to access. It involves three parties: Your users, your application (OAuth 2 client) and the authorization provider (OAuth 2 server). Apart from being users of your application they are also known to the authorization provider. When users access your application (step 1), it requests permission to e.g. access their data from a third party service (step 2). The authorization provider redirects to a page where users can grant your application access (step 3).

OAuth2 authorization scenario (Icons courtesy of Adam Whitcroft)

Two great resources for in-depth information about OAuth 2 are Google’s OAuth 2 documentation and RFC 6749.

OAuth 2 for Web applications

Google’s OAuth 2 documentation describes several scenarios. I assume that you want to use the Webserver scenario, and specifically “online access” (more on that later). In the Webserver scenario the user is redirected to Google’s authorization server before the application’s first call to the Google Calendar API. After the user has granted access, the authorization server redirects the user back to your application along with an authorization code. The application exchanges this code for an access token and may then access the Google Calendar API with this access token.
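To make the first leg of the Webserver scenario more tangible, here is a minimal Java sketch of how the redirect URL to the authorization server could be assembled by hand. The parameter names (response_type, client_id, redirect_uri, scope) are the ones defined in RFC 6749. The class name, the client id and the redirect URL are placeholders invented for this illustration. In the demo application Spring Security OAuth 2 builds this URL for you, so this is not code taken from it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the demo application or of Spring Security OAuth 2.
public class AuthorizationUrlSketch {

  // Builds the URL the user's browser is sent to in order to grant access.
  public static String buildAuthorizationUrl(String endpoint, String clientId,
                                             String redirectUri, String scope) {
    // LinkedHashMap keeps the parameters in insertion order.
    Map<String, String> params = new LinkedHashMap<>();
    params.put("response_type", "code"); // ask for an authorization code
    params.put("client_id", clientId);
    params.put("redirect_uri", redirectUri);
    params.put("scope", scope);

    // URL-encode every value and join the pairs with '&'.
    String query = params.entrySet().stream()
        .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
        .collect(Collectors.joining("&"));
    return endpoint + "?" + query;
  }

  public static void main(String[] args) {
    // Client id and redirect URL are made-up placeholder values.
    System.out.println(buildAuthorizationUrl(
        "https://accounts.google.com/o/oauth2/auth",
        "my-client-id",
        "http://localhost:8080/oauth2callback",
        "https://www.googleapis.com/auth/calendar"));
  }
}
```

After the user has approved the request, the provider calls the redirect URL with a code parameter, which is then exchanged for the access token.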

Getting started

Before we dive into the code, we have to register the application with the OAuth provider. For Google, this is done via the API console. The process is pretty straightforward:

1. Create a new project and provide a descriptive name:

Create a new Google API project

2. Create a new client id. The client id identifies your application against the OAuth provider. Be sure to provide a custom redirect URL in the second screen.

Create a new client id for a Google API project (part 1)

Create a new client id for a Google API project (part 2)

3. After you have clicked “Create” the following overview is presented:

Google API project has been created

The most relevant pieces for your application are the client id and the client secret. As I have already described, the client id uniquely identifies your application with the OAuth 2 provider. The client secret authenticates your application with the OAuth 2 provider. If you want to think of traditional username/password based authentication, the client id loosely corresponds to the username of your application and the client secret corresponds to the password of your application with the OAuth 2 provider. Obviously, these two properties should not be shared.

Copy the values for “client id” into the property “google.calendar.client.id” and “client secret” into the property “google.calendar.client.secret” in the file application.properties if you follow along with the demo application.

4. Next, request access to the Google Calendar API for your application in the “Services” menu

Activate access to Google calendar API

Google Calendar Access using Spring OAuth 2

All calls of the demo application to the Google Calendar API are encapsulated in the class GoogleCalendarRepositoryImpl. It uses an extension of the Spring standard interface RestOperations called OAuth2RestOperations, which can additionally handle OAuth 2 authorization. Similar to RestTemplate, the standard implementation of RestOperations, Spring Security OAuth 2 provides an OAuth2RestTemplate. In the demo application, the OAuth2RestTemplate is configured in com/github/gcaldemo/calendar/spring-context.xml. First, let’s have a look at the “oauth:resource” element there:

<oauth:resource id="google"
                type="authorization_code"
                client-id="${google.calendar.client.id}"
                client-secret="${google.calendar.client.secret}"
                access-token-uri="https://accounts.google.com/o/oauth2/token"
                user-authorization-uri="https://accounts.google.com/o/oauth2/auth"
                scope="https://www.googleapis.com/auth/calendar"
                client-authentication-scheme="form"
/>

This definition describes the resources we want to access. Apart from the client id and the client secret we have already discussed it also contains the URL to which users will be redirected when the application needs their approval to access their data (user-authorization-uri). The “scope” attribute defines the privileges the application wants to acquire. In this case we want read and write access to calendar data as indicated by the URL. The proper URL can be found in the Google Calendar API documentation.

Whenever the application tries to call the Google Calendar API, Spring OAuth 2 checks whether the application has a valid access token. This access token is issued by the OAuth 2 provider and handed to the application after the user has granted access to their data. However, the access token is only valid for a limited time period, which varies across OAuth providers. Google’s access tokens are currently valid for one hour. After that time period the access token is invalid and users have to grant the application access to their data again.

Fortunately, there are multiple means to prevent nagging users continuously:

  • The OAuth 2 RFC specifies a so-called refresh token. The refresh token is sent by the OAuth 2 provider along with the first access token. It can be used to obtain a new access token upon expiration of the old one without user intervention. The validity period of the refresh token may be limited and depends on the OAuth provider. Google’s refresh token is valid until the application explicitly prompts the user for authorization again. Note that Google sends the refresh token only in the “offline” scenario. Offline basically means that an application can act on behalf of users without the user needing to be present (think batch processes). For more information on the refresh token please refer to the section on refresh tokens in Google’s OAuth 2 documentation.
  • In the online scenario (the user is present when your application accesses the Google Calendar API), Google does not send a refresh token but rather stores a cookie in the client’s browser. This cookie is used instead of a refresh token to prevent repeated explicit approval from the user. As I have hinted earlier, this scenario applies to the demo application.

The demo application provides a simple JSON based token store which is sufficient for a single user. It is implemented in JsonClientTokenServices, which performs the following tasks:

  • It stores the access token in a JSON file
  • It adjusts the expires_in value: The expiry that is sent by the OAuth 2 provider is given in seconds, counted from the point in time at which the OAuth 2 provider issued the access token. So, if the access token is valid for one hour, the initial value will be 3600 (60 seconds per minute * 60 minutes per hour). However, a relative value like this is not suitable for persistent storage, because it says nothing about when the token was issued. Therefore, the token store adjusts the expiry accordingly when loading an access token.
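The expiry adjustment can be sketched roughly as follows. This is not the code of JsonClientTokenServices, just a minimal illustration of the idea under the assumption that the store persists an absolute expiry timestamp and recomputes the remaining lifetime on load. The class and method names are made up for this example.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the expiry adjustment, not taken from the demo application.
public class TokenExpirySketch {

  // On save: turn the relative lifetime (seconds since issuance) into an absolute timestamp.
  public static Instant toAbsoluteExpiry(Instant issuedAt, long expiresInSeconds) {
    return issuedAt.plusSeconds(expiresInSeconds);
  }

  // On load: recompute the remaining lifetime; an expired token yields 0, never a negative value.
  public static long remainingSeconds(Instant absoluteExpiry, Instant now) {
    long remaining = Duration.between(now, absoluteExpiry).getSeconds();
    return Math.max(0, remaining);
  }

  public static void main(String[] args) {
    Instant issuedAt = Instant.parse("2013-06-01T10:00:00Z");
    Instant expiry = toAbsoluteExpiry(issuedAt, 3600);          // valid until 11:00
    Instant loadedAt = Instant.parse("2013-06-01T10:10:00Z");   // token loaded ten minutes later
    System.out.println(remainingSeconds(expiry, loadedAt));     // prints 3000
  }
}
```

Storing the absolute timestamp means the token store does not need to remember separately when the token was issued.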

The token store is configured along with the access token provider:

<bean id="accessTokenProviderChain" class="org.springframework.security.oauth2.client.token.AccessTokenProviderChain">
<!-- Redefinition of the default access token providers  -->
  <constructor-arg index="0">
    <list>
      <bean class="org.springframework.security.oauth2.client.token.grant.code.AuthorizationCodeAccessTokenProvider"/>
      <bean class="org.springframework.security.oauth2.client.token.grant.implicit.ImplicitAccessTokenProvider"/>
      <bean class="org.springframework.security.oauth2.client.token.grant.password.ResourceOwnerPasswordAccessTokenProvider"/>
      <bean class="org.springframework.security.oauth2.client.token.grant.client.ClientCredentialsAccessTokenProvider"/>
    </list>
  </constructor-arg>
  <property name="clientTokenServices">
    <bean class="com.github.gcaldemo.calendar.repository.impl.token.JsonClientTokenServices"/>
  </property>
</bean>

After defining both the resource to access and the access token provider, the OAuth2RestTemplate can be configured:

<oauth:rest-template id="googleCalendarRestTemplate"
                     resource="google"
                     access-token-provider="accessTokenProviderChain"/>

Integrating Spring OAuth 2 into Wicket

If the application has a valid access token, the OAuth2RestTemplate performs the API call, otherwise a UserRedirectRequiredException is thrown. Typically, the OAuth2ClientContextFilter, which is part of the Spring security chain, should catch this exception and redirect the user to the user-authorization-uri specified earlier. However, by default Wicket catches all exceptions that occur in the Web application and just dumps them in development mode or provides an error page in production mode. Hence, the UserRedirectRequiredException will never reach the Spring security filter chain. Therefore, we have to tweak exception handling by implementing a custom IExceptionMapper:

public class OAuth2ExceptionMapper implements IExceptionMapper {
  private final IExceptionMapper delegateExceptionMapper;

  public OAuth2ExceptionMapper(IExceptionMapper delegateExceptionMapper) {
    this.delegateExceptionMapper = delegateExceptionMapper;
  }

  @Override
  public IRequestHandler map(Exception e) {
    Throwable rootCause = getRootCause(e);
    if (rootCause instanceof UserRedirectRequiredException) {
      //see DefaultExceptionMapper
      Response response = RequestCycle.get().getResponse();
      if (response instanceof WebResponse) {
        // we don't want to cache an exceptional reply in the browser
        ((WebResponse)response).disableCaching();
      }
      throw ((UserRedirectRequiredException) rootCause);
    } else {
      return delegateExceptionMapper.map(e);
    }
  }

  private Throwable getRootCause(Throwable ex) {
    if (ex == null) {
      return null;
    }
    if (ex.getCause() == null) {
      return ex;
    }
    return getRootCause(ex.getCause());
  }
}

The custom exception mapper has to be created in the application by the exception mapper provider:

public class CalendarDemoApplication extends WebApplication {
  private IProvider<IExceptionMapper> exceptionMapperProvider;

  @Override
  protected void init() {
    super.init();
    this.exceptionMapperProvider = new OAuth2ExceptionMapperProvider();
    //details left out - see original class on Github
  }

  @Override
  public IProvider<IExceptionMapper> getExceptionMapperProvider() {
    return exceptionMapperProvider;
  }

  /**
   * Custom Exception Mapper provider that integrates the OAuth2ExceptionMapper into the application.
   */
  private static class OAuth2ExceptionMapperProvider implements IProvider<IExceptionMapper> {
    @Override
    public IExceptionMapper get() {
      return new OAuth2ExceptionMapper(new DefaultExceptionMapper());
    }
  }
}

Now UserRedirectRequiredException will not be handled by Wicket but propagated further up the call stack which allows the OAuth2ClientContextFilter to handle the exception properly. We are almost done but one last piece is still missing.

System-internal Authentication

As I have written in the introduction, I do not want to authenticate against my own application as I am the only user. However, Spring Security OAuth 2 expects user credentials within the application prior to authenticating the application against an OAuth 2 server. If we try to perform a Google Calendar API call as anonymous user we get the following trace:

org.springframework.security.authentication.InsufficientAuthenticationException: Authentication is required to obtain an access token (anonymous not allowed)
at org.springframework.security.oauth2.client.token.AccessTokenProviderChain.obtainAccessToken(AccessTokenProviderChain.java:88)
at org.springframework.security.oauth2.client.OAuth2RestTemplate.acquireAccessToken(OAuth2RestTemplate.java:217)
at org.springframework.security.oauth2.client.OAuth2RestTemplate.getAccessToken(OAuth2RestTemplate.java:169)
at org.springframework.security.oauth2.client.OAuth2RestTemplate.createRequest(OAuth2RestTemplate.java:90)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:479)
at org.springframework.security.oauth2.client.OAuth2RestTemplate.doExecute(OAuth2RestTemplate.java:124)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:446)
at org.springframework.web.client.RestTemplate.getForObject(RestTemplate.java:214)
at com.github.gcaldemo.calendar.repository.impl.GoogleCalendarRepositoryImpl.loadCalendars(GoogleCalendarRepositoryImpl.java:45)
[...]

As the exception message states, anonymous users are not allowed to perform OAuth 2 operations. Therefore, we have to restrict access to the application to authenticated users. This is done by configuring an interceptor in the http security configuration:

<security:intercept-url pattern="/**" access="ROLE_USER" />

Next, we have to trick Spring Security OAuth 2 by implementing a custom authentication processing filter which publishes a system user with proper privileges to the SecurityContext:

//some details omitted - see original class on Github
public class SystemAuthenticationProcessingFilter extends AbstractAuthenticationProcessingFilter {
  @Override
  public Authentication attemptAuthentication(HttpServletRequest request, HttpServletResponse response) throws AuthenticationException, IOException, ServletException {
    // Populate an internal system user in the security context with proper access privileges. This is typically not
    // necessary for multi-user systems in production, as users typically have to authenticate against your
    // application before using it.
    Authentication authentication = new TestingAuthenticationToken("internal_system_user", "internal_null_credentials", "ROLE_USER");
    authentication.setAuthenticated(true);
    return getAuthenticationManager().authenticate(authentication);
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
    throws IOException, ServletException {

    if (SecurityContextHolder.getContext().getAuthentication() == null) {
      SecurityContextHolder.getContext().setAuthentication(attemptAuthentication((HttpServletRequest) req, (HttpServletResponse) res));
    }
    chain.doFilter(req, res);
  }
}

Now, Spring Security OAuth 2 can perform authorization requests properly and we are redirected before accessing the Google Calendar API for the first time:

Demo application requests access permission from Google

After the user has granted the application access to their data we can create an event. Note that the calendar drop-down is already pre-filled with all Google calendars to which the user has at least write access:

Creating a new Google calendar event

New Google calendar event has been created
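For reference, the request body for such an all-day event can be sketched as follows. The field names (summary, start.date, end.date) reflect my understanding of the Google Calendar API v3 event resource, where the end date of an all-day event is exclusive. The demo application serializes its messages with Jackson; the hand-rolled string formatting and the class name below are simplifications made up for this illustration.

```java
import java.time.LocalDate;

// Hypothetical sketch of an all-day event payload, not taken from the demo application.
public class AllDayEventSketch {

  // Minimal JSON string escaping for the title (backslash and double quote only).
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // Builds the JSON body for a one-day, all-day event with the given title.
  public static String allDayEventJson(String title, LocalDate day) {
    // The end date is exclusive, so a one-day event ends on the following day.
    LocalDate end = day.plusDays(1);
    return String.format(
        "{\"summary\":\"%s\",\"start\":{\"date\":\"%s\"},\"end\":{\"date\":\"%s\"}}",
        escape(title), day, end);
  }

  public static void main(String[] args) {
    System.out.println(allDayEventJson("TDWI Conference", LocalDate.of(2013, 6, 19)));
  }
}
```

A real implementation should rely on a JSON library such as Jackson for serialization instead of string formatting.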

Summary

Although a client library for the Google Calendar API is available, it is sometimes preferable to use libraries and technologies that are already used in a project. With a few tweaks I was able to use the Google Calendar API in a Wicket application using Spring Security OAuth 2. The example application on Github demonstrates the integration, but beware of the limitation mentioned above: This setup is primarily suited for single-user applications. If you want to reuse the sample code in a production environment you should use a ClientTokenServices implementation backed by a database and a real implementation of AbstractAuthenticationProcessingFilter such as UsernamePasswordAuthenticationFilter.

I hope that this post described in enough detail how to integrate the Google Calendar API into a Wicket application. Otherwise, feel free to ask questions in the comments section.

AWS Summit 2013 Berlin Roundup

This year’s German AWS Summit took place in Berlin on May 2nd at the Berlin Congress Center right at Alexanderplatz.

Amazon seems to have hit a nerve with their product and conference topic, as the place was completely packed. Quite a long queue of attendees had formed right in front of the entrance.

Most of the conference rooms were also quite full. All in all a very good sign that people are interested in cloud computing and willing to shift their attention to their core business instead of continually reinventing the wheel in terms of IT infrastructure.

Werner Vogels, CTO of Amazon, opened the summit with his keynote speech. He presented some interesting case studies, explained where the product has come from and where it is now, and gave a short glimpse into the future of AWS. He also shared his view of how cloud computing should shape the future of human-computer interaction and how content is experienced using all those new possibilities. For him “devices are only windows to your content in the cloud”. As an example he brought up the treadmills he sometimes uses in hotels. Vogels wants them built in such a way that he can authenticate to them and then access all his personal content, be it work documents or any form of entertainment. And if you look at Google Glass, for example, he has a point: how much of a “device” is really left in a pair of glasses that you wear? It doesn’t really matter whether the device has this hardware spec or a slightly better one. It all comes down to whether you can access your content anytime, anywhere, and have that content augmented with whatever information benefits your current situation. Vogels also stressed that Amazon’s vision is to bring the prices for their services down so far that you don’t really think about whether to use them or not. “It’s just like switching on the light in the evening – you don’t think about that!”

Most of the other talks were mainly about how to design applications and infrastructure in the cloud. The main driver for those design decisions is scalability. This does not only mean that you should be able to scale up or out for peak times but also, and much more importantly from a cost perspective, that you should be able to scale down when there are fewer requests to your application and your provisioned assets are under-utilised. This is an important and interesting aspect when making decisions about e.g. what instance size to use on AWS. People coming from traditional IT backgrounds need to shift their mindset, because they are used to over-provisioning physical hardware just to be on the safe side. With elastic resource provisioning this is no longer necessary.

To make all of this work there is of course a heavy need for automation of all aspects of provisioning systems and deploying software artefacts, a fact which was made clear in nearly all of the talks. But this ultimately also calls for a higher level of abstraction when you think of the services you use. What applications really need is some sort of runtime environment, which in the case of e.g. Java, Ruby and Python apps is totally accepted. You have a virtual machine and nearly no one questions its internals. We just use it. But with things like web servers or any other “traditional” services, people tend to set up complete systems from the OS level because they want to push all the shiny knobs and buttons that these offer. I had an interesting discussion with an attendee who was more on the traditional side of things. But it seemed that he had something like an epiphany as he said: “Maybe you’re right. I don’t think about OS processes on a low level and how they work. I just trust them to do the right thing!”

In summary, it’s quite an exciting time to see how usage paradigms slowly but steadily change towards cloud-based offerings, which are not only used by start-ups but also by larger traditional enterprises who recognise that if they want to move fast, be agile and keep up with their competition, the cloud is the place to be!

Data Analysis with the Unix Shell

The Hadoop-based software company Cloudera is currently creating a new certification called Data Science Essentials Exam (DS-200). One goal of the certification is to learn tools, techniques, and utilities for evaluating data from the command line. That’s why I am writing this blog post. The Unix shell provides a huge set of commands that can be used for data analysis. A good introduction to Unix commands can be found in this tutorial.

The data analyst friendly commands are: cat, find, grep, wc, cut, sort, uniq

These commands are called filters. Data passes through a filter, and the filter can modify the data on its way through. All filters read data from standard input and write data to standard output. Using the pipe operator “|”, a filter can take the standard output of another filter as its standard input. For example, the cat command writes a file to standard output, and the grep command uses this output as its standard input to check whether the city ‘Munich’ is in the city file. The example dataset is available on github.

bz@cs ~/data $ cat city | grep Munich
    3070,Munich [München],DEU,Bavaria,1194560

In the example above you can see the structure of the sample data set. The dataset is a comma-separated list. The first number represents the id of an entry, followed by the name of the city, the country code and the district. The last number represents the population of the city.

Now, let’s answer an analytical question: Which city in the data set has the largest population? The second and the fifth column can be selected with awk, which prints a list with the population in the first position and the city name in the second. The sort command then sorts this list numerically in descending order, and head returns the first entry, i.e. the city with the largest population.

bz@cs ~/data/ $ awk -v OFS="  " -F"," '{print $5, $2}' city | sort -n -r | head -n 1
    10500000  Mumbai (Bombay)

It is also possible to perform joins in the Unix shell with the join command. The join command assumes that the input data is sorted on the key on which the join takes place. You can find another dataset on github which contains countries. This dataset is a comma-separated list as well. The 14th column in the country dataset represents the capital id, which corresponds to the id in the city data set. This makes it possible to create a list of countries with their capitals.

bz@cs ~/data/ $ cat city | head -n 2
    1,Kabul,AFG,Kabol,1780000
    2,Qandahar,AFG,Qandahar,237500
bz@cs ~/data/ $ cat country | head -n 2
    AFG,Afghanistan,Asia,Southern and Central Asia,652090,1919,22720000,45.9,5976.00,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
    NLD,Netherlands,Europe,Western Europe,41526,1581,15864000,78.3,371362.00,360478.00,Nederland,Constitutional Monarchy,Beatrix,5,NL
bz@cs ~/data/ $ join -t "," -1 1 -2 14 -o '1.2,2.2' city country | head -n 2
    Kabul,Afghanistan
    Amsterdam,Netherlands

Finally, let’s take a deeper look at the city data set. The question for this example is: How are the cities in the data set distributed across countries? A combination of the sort and the uniq commands allows us to create data for a density plot. This data can be redirected (>) to a file.

bz@cs ~/data/ $ cat city | cut -d , -f 3 | sort | uniq -c | sort -rn | head -n 4
    363 CHN
    341 IND
    274 USA
    250 BRA
bz@cs ~/data/ $ cat city | cut -d , -f 3 | sort | uniq -c | sort -rn > count_vs_country

Gnuplot is a command which allows us to visualize the density data file. We have to tell gnuplot what to plot and how to plot it. You can also use gnuplot within a telnet or ssh session because plots can be rendered in ASCII characters. To do so, the terminal type has to be set to ‘dumb’.

bz@cs ~/data/ $ gnuplot

	G N U P L O T
	Version 4.6 patchlevel 2    last modified 2013-03-14
	Build System: Darwin x86_64

	Copyright (C) 1986-1993, 1998, 2004, 2007-2013
	Thomas Williams, Colin Kelley and many others

	gnuplot home:     http://www.gnuplot.info
	faq, bugs, etc:   type "help FAQ"
	immediate help:   type "help"  (plot window: hit 'h')

Terminal type set to 'x11'
gnuplot> plot 'count_vs_country' with points
gnuplot> set term dumb
Terminal type set to 'dumb'
Options are 'feed  size 79, 24'
gnuplot> plot 'count_vs_country' with points

  400 ++------------+-------------+------------+-------------+------------++
      +             +             +            + 'count_vs_country'   A    +
  350 A+                                                                  ++
      A                                                                    |
      |                                                                    |
  300 ++                                                                  ++
      |A                                                                   |
  250 +A                                                                  ++
      |                                                                    |
  200 ++                                                                  ++
      |A                                                                   |
      | A                                                                  |
  150 ++                                                                  ++
      | A                                                                  |
  100 ++                                                                  ++
      | AA                                                                 |
      |  AAA                                                               |
   50 ++    AA                                                            ++
      +       AAAAAAAA            +            +             +             +
    0 ++------------+AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA---++
      0             50           100          150           200           250
Density plot for the Distribution of City related to a Country

I hope you enjoyed this little excursion into data analysis with the Unix shell. It should be useful for students who are currently working through the study guide of Data Science Essentials (DS-200) Beta. Furthermore, I demonstrated how powerful the Unix shell can be for basic analytics: it can handle basic tasks that an analyst would normally perform in statistical software such as R.

Cassandra 1.1 – Tuning for Frequent Column Updates

Cassandra is known for its good write performance. But there are scenarios in which you might run into trouble, especially when a particular use case generates heavy disk IO. This can be the case for columns which receive frequent updates. However, you can avoid those problems with proper configuration, or just by updating to a recent Cassandra version. The good news is that this can be applied to an already running system, so if you are already having problems, there is still hope.

Memtables are flushed to immutable SSTables, and a single column value can end up in several SSTables if it kept changing over a long enough period of time. This guarantees fast inserts, because data is only appended to disk. On the other hand, unnecessary writes decrease disk performance, not only because many SSTables have to be written, but mainly because the duplicates on disk have to be compacted later on.

The idea is to tune Cassandra in such a way that we benefit from frequent updates. This can be achieved by keeping data in memory and by delaying disk flushes. In this case new updates replace existing values in memory. This generates less disk traffic, because it decreases the number of flushed duplicates. That is not all: it also creates a write-through cache, and read requests benefit from it. Here are some configuration tips:

  • Make sure that you have at least Cassandra 1.1, which contains an optimization for frequently changing values (CASSANDRA-2498). When a single value is stored in multiple SSTables, older Cassandra versions need to read the column value from all SSTables in order to find the most recent one. Now SSTables are sorted by modification time, so it is enough to read the most recent value and simply ignore the remaining outdated values.
  • Increase the thresholds for flushing memtables. Each update that replaces a value still held in the memtable results in one entry less in an SSTable.
  • Each read operation first checks the memtables. If the data is still there, it is simply returned, which is the fastest possible access. It is like a non-blocking write-through cache (based on a skip list).
  • A memtable that is too large, on the other hand, results in a larger commit log. This is not a problem until your instance crashes. It will then need some time to start, because it has to read the whole commit log.
  • Compaction merges SSTables together, and this increases read performance, since there is less data to go through. But this process does not have high priority. When Cassandra is nearly exhausted, it will skip compaction, and this can lead to data fragmentation.
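To make the effect of a larger flush threshold more tangible, here is a minimal toy model in Python. It is not Cassandra code: the memtable is a plain dict, and `flush_threshold` is a hypothetical stand-in (number of distinct columns held in memory) for Cassandra's real flush criteria.

```python
def count_flushed_entries(updates, flush_threshold):
    """Return how many entries end up in SSTables for a stream of updates.

    Toy model: the memtable is a dict keyed by (row_key, column_name) and is
    flushed as soon as it holds flush_threshold distinct keys (an assumption
    made for illustration only).
    """
    memtable = {}
    flushed = 0
    for key, value in updates:
        memtable[key] = value            # a newer value replaces the old one in memory
        if len(memtable) >= flush_threshold:
            flushed += len(memtable)     # every key held is written out once
            memtable = {}
    flushed += len(memtable)             # final flush of whatever is left
    return flushed


# 1000 updates spread round-robin over 10 frequently changing columns
updates = [(("row1", f"col{i % 10}"), i) for i in range(1000)]

small = count_flushed_entries(updates, flush_threshold=2)
large = count_flushed_entries(updates, flush_threshold=50)
print(small, large)
```

With the small threshold every single update reaches disk, while with the large one only the last value of each of the 10 columns is written.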

Caching 

  • Row cache really makes sense for frequent reads of the same row(s), and additionally when you read most of the columns of each row.
  • With an active row cache, access to a single column of a particular row loads the whole row with all its columns into memory. Analyze your data access patterns, and make sure that this is not an overhead and that you have enough memory. It would be a real waste of resources to load a hundred columns into memory just to access a few.
  • Row cache works as write-through for data that is already in it. Data is loaded into the row cache only when it is read and was not found in the memtable. From this point in time it gets updated on each write operation. A frequently changing entry without read access does not affect the row cache, because it is not there.
  • Updates on data in the row cache decrease performance, and those frequently changing columns are probably also available in the memtable anyway. The read process first searches the memtable, and in case of a hit ignores the row cache. From this point of view row cache makes sense if you also read other columns which do not change frequently. For example, a single row has 200 columns, 50 receive frequent updates, 100 sporadic ones, and the read process always reads all of them. In this case row cache makes sense: we have to update 50 columns on each insert, but we gain fast access to the remaining 150.
  • It might be a good idea to disable row cache, increase the memtable size in the hope of reducing disk writes, and use the memtable as a cache.
  • Disabling row cache does not necessarily mean additional disk seeks. Cassandra uses memory mapped files, which means that each file access is cached by the operating system. Relying on memory mapped files is nothing new. Mongo does not have a cache at all; it is not needed, since the file system cache works just fine. But Mongo has a different data structure on disk: it stores BSON documents optimized for reads, so everything is in one place, whereas Cassandra might (not always) first need to collect data from different locations.
  • Row cache also helps in situations where a single row is spread over many SSTables. In this case putting all the data together is a CPU intensive operation, not to mention the possible disk access to read each column value.
  • When row cache is disabled, key cache must be used. Key cache is like an index, and you definitely want to load your whole index into memory.
  • When row cache is disabled, key cache is enabled, and read operations get hits on the key cache, we have a quite performant solution. Searching SSTables runs fully in memory; only reading the column value itself requires disk access. And maybe not even that, since it is a memory mapped file.
  • When disabling row cache remember to tune read ahead. The idea is to read only a single column from disk, and no more data than is needed.

To summarize: run performance tests, check the Cassandra statistics, and verify how many SSTables have to be searched to find data and what the cache usage is. This might be a good starting point for changing the memtable size or tuning caching. In my case, disabling row cache, using a large key cache and increasing the memtable thresholds was the right decision.


Background of collaborative filtering with Mahout

In order to set up Apache Mahout, a library written in Java to perform scalable machine learning algorithms based on Hadoop, in the architecture of Mario’s fabulous online shop for pizza, pasta and co (see blog post Building an Online-Recommendation Engine with MongoDB and Mahout) we’d like to know which recommendation strategy is the best for our so far fictional use case (which is computing recommendations for 32 products and 101 users in real time). With this small amount of data we could also use other tools, e.g. Weka, but in an actual online shop the occurring data would be a lot more than what we simulate here, which is why we choose Apache Mahout. Before we dive into coding details let’s have a look at what Mahout’s collaborative filtering actually does.

Collaborative Filtering

In order to be able to transfer the recommendation logic to use cases of different businesses we opt for collaborative filtering, a technique for producing recommendations based solely on the users' preferences for products (instead of including product features and/or user properties). Collaborative filtering can be user- or item-based. User-based recommendation promotes products to a user that were bought by users who are similar to her.

User-based Recommendation

User-based Recommendation: recommend products to a user based on what similar users have bought

Item-based recommendation proposes products that are similar to the ones the user already buys.

Item-based Recommendation

Item-based Recommendation: recommend products to a user that are similar to the ones he already bought

User-Item Preferences and Similarity

Alright, but what does similar mean in this context? In collaborative filtering, similarity between users (for user-based recommendations) or items (for item-based recommendations) is computed based on the user-item preferences only. We use the number of times a user bought a product as a proxy for the user's preference. It's not a perfect proxy, but it does the trick and it's easy to gather. One could also use the number of clicks or views, or a combination of those.

Based on these user-item preferences we can use the Euclidean distance or the Pearson correlation to determine the similarity between users or between items (products). Based on the Euclidean distance, two users are similar if the distance between their preference vectors, projected into a Cartesian coordinate system, is small. The Pearson correlation, in turn, coincides with the cosine of the angle between the demeaned preference vectors. That is, two users are similar if the angle between their preference vectors is small or, formulated in terms of correlation, two users are similar if, intuitively speaking, they rate the same products high and the other products low.

Euclidean and Cosine/Pearson Similarity

Difference between Euclidean and Cosine/Pearson User-Similarity
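The difference between the two measures can be sketched in a few lines of Python. These are the plain textbook formulas, not Mahout's implementation, and the users and purchase counts are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two preference vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Cosine of the angle between the demeaned preference vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return numerator / denominator


# Hypothetical purchase counts for four products
alice = [5, 1, 0, 3]
bob = [10, 2, 0, 6]    # same taste as alice, but buys twice as often
carol = [0, 4, 5, 1]   # roughly the opposite taste

print(euclidean_distance(alice, bob), pearson_correlation(alice, bob))
print(euclidean_distance(alice, carol), pearson_correlation(alice, carol))
```

Alice and Bob are perfectly correlated although their Euclidean distance is clearly above zero, which is exactly the kind of difference the two measures can show on real data.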

However, user-item preferences can be (intentionally) limited to pure association, i.e. the user buys or doesn't buy the product (or views or doesn't view the product, etc.). In this case, similarities between users or items can be computed based on the Tanimoto coefficient or the log-likelihood ratio. Both measures express how likely or unlikely it is that two users both have an association with some items but not with other items.

Tanimoto similarity

The Tanimoto similarity between 2 users is computed as the number of products the 2 users have in common divided by the total number of distinct products that at least one of them bought (or clicked or viewed).
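As a small illustration, the Tanimoto coefficient can be written with Python sets. This is a sketch of the textbook formula, not Mahout's code, and the product names are invented.

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0   # convention chosen here for two users without any purchases
    return len(a & b) / len(union)


# Hypothetical shopping histories
user_1 = {"pizza", "pasta", "salad"}
user_2 = {"pizza", "pasta", "tiramisu"}

similarity = tanimoto_similarity(user_1, user_2)
print(similarity)
```

The two users share 2 products out of 4 distinct products overall, so the similarity is 0.5.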

This isn’t really a detailed description of similarity measures and it doesn’t need to be one: Even if one fully understands the concept and computational details of these similarities, in the end one would probably still prefer a data driven decision in order to choose between them for the particular use case at hand.

So Mario decided to implement all of the above mentioned recommenders, that is user- and item-based, each combined with one of the four similarity measures, plus the Slope One recommender, which doesn't need any similarity measure as input at all. Once all 9 Mahout recommendation strategies are implemented he wants to evaluate and compare them.

Stay tuned for the coding details of how to integrate the open source recommendation framework Mahout into Mario’s online shop.

Please feel free to attend our talk “Building an Online-Recommendation Engine with MongoDB” at the Free GOTO NoSQL Munich – part II in Munich, April 9, 2013 to get a live and comprehensive presentation of our online recommendation engine. Furthermore, we would love to meet you at the NoSQL Roadshow Munich 2013, a great place to learn more about NoSQL and Big Data technologies. To get a 30% discount please use the comSysto Code COMSYSTO30.

 

Cassandra 1.1 – Reading and Writing from SSTable Perspective

Apache Cassandra is another NoSQL database, one that provides scalability and high availability.

To keep things simple I will stick to reading and writing the value of one column within a single row, in a single node deployment.

Writing

We will store one column, identified by row key and column name.

Each Thrift insert request blocks until the data is stored in the commit log and the memtable. That is all; other operations (like replication) are asynchronous. Additionally the client can provide a consistency level, in which case the call blocks until the required replicas respond, but aside from this, a write operation can be seen as a simple append. The commit log is required because the memtable exists only in memory; in case of a system crash, Cassandra recreates the memtables from the commit log.

A memtable can be seen as a dedicated cache created individually for each column family. It is based on ConcurrentSkipListMap, so there is no blocking on read or insert. The memtable contains all recent inserts, and each new insert for the same key and column overwrites the existing one. Multiple updates on a single column result in multiple entries in the commit log and a single entry in the memtable. The memtable is flushed to disk when predefined criteria are met, like maximum size, timeout, or number of mutations. Flushing a memtable creates an SSTable, and since this one is immutable, it can simply be saved to disk as a sequential write.

The compaction process merges a few SSTables into one. The idea is to clean up deleted data, and to merge different modifications of a single column. Before compaction, several SSTables could contain a value for a single column; after compaction there is only one.
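The write path described above can be sketched as a toy model in Python. This is not how Cassandra is implemented: the class `ToyColumnFamily`, its `flush_after` parameter (number of distinct columns in the memtable) and the integer clock are assumptions made purely for illustration.

```python
class ToyColumnFamily:
    """Toy write path: commit log append, memtable overwrite, flush, compaction."""

    def __init__(self, flush_after):
        self.commit_log = []      # append-only list of every mutation
        self.memtable = {}        # (row_key, column) -> (value, timestamp)
        self.sstables = []        # flushed tables, oldest first
        self.flush_after = flush_after
        self.clock = 0            # stand-in for a write timestamp

    def insert(self, row_key, column, value):
        self.clock += 1
        self.commit_log.append((row_key, column, value, self.clock))
        self.memtable[(row_key, column)] = (value, self.clock)  # overwrites older value
        if len(self.memtable) >= self.flush_after:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))  # treated as immutable from now on
            self.memtable = {}

    def compact(self):
        merged = {}
        for table in self.sstables:
            for key, (value, ts) in table.items():
                if key not in merged or ts > merged[key][1]:
                    merged[key] = (value, ts)           # keep only the newest version
        self.sstables = [merged] if merged else []


cf = ToyColumnFamily(flush_after=2)
for value in ("a", "b", "c"):
    cf.insert("row1", "name", value)
log_entries, memtable_entries = len(cf.commit_log), len(cf.memtable)

cf.flush()
cf.insert("row1", "name", "d")
cf.flush()
tables_before = len(cf.sstables)
cf.compact()
print(log_entries, memtable_entries, tables_before, len(cf.sstables))
```

Three updates of the same column produce three commit log entries but only one memtable entry, and after compaction the two SSTables holding the column collapse into one that keeps the newest value.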

Reading

We will try to find the value of a single column within one row.

First the memtable is searched. It is like a write-through cache: a hit on it provides the most recent data, within a single instance of course, not in the whole cluster.

As the second step Cassandra searches SSTables, but only those within a single column family. SSTables are grouped by column family; this is also reflected on disk, where the SSTables for each column family are stored together in a dedicated folder.

Each SSTable contains a row bloom filter, which is built on row keys, not on column names. This lets Cassandra quickly verify whether a given SSTable can contain a particular row at all. Row bloom filters are always held in memory, so checking them is fast. False positives are also no longer a problem, because the latest Cassandra versions have improved hashing and increased the size of the bit masks.

So Cassandra has scanned all possible SSTables within a particular column family, and found those with a positive bloom filter for the row key. However, the fact that a given SSTable contains a given row does not necessarily mean that it also contains the given column. Cassandra needs to “look into the SSTable” to check whether it also contains the given column. But it does not have to blindly scan all SSTables with a positive bloom filter on the row key. First it sorts them by last modification time (max time from metadata). Now it has to find the first (youngest) SSTable which contains our column. It is still possible that this particular column is also stored in other SSTables, but those are definitely older, and therefore not interesting. This optimization was introduced with Cassandra 1.1 (CASSANDRA-2498); previous versions need to go over all SSTables.
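The search order described above can be sketched as follows. This is a simplified Python model of the description in this post, not Cassandra's actual read path: `ToySSTable` is a hypothetical class, and a plain set stands in for the row bloom filter (so it never yields false positives).

```python
class ToySSTable:
    """Toy SSTable: rows, a max modification timestamp, and a row 'bloom filter'."""

    def __init__(self, max_timestamp, rows):
        self.max_timestamp = max_timestamp
        self.rows = rows                 # {row_key: {column_name: value}}
        self.row_filter = set(rows)      # stand-in for the bloom filter on row keys

    def might_contain_row(self, row_key):
        return row_key in self.row_filter


def read_column(sstables, row_key, column):
    """Return (value, number of SSTables inspected), checking the youngest first."""
    candidates = [t for t in sstables if t.might_contain_row(row_key)]
    candidates.sort(key=lambda t: t.max_timestamp, reverse=True)
    inspected = 0
    for table in candidates:
        inspected += 1
        columns = table.rows.get(row_key, {})
        if column in columns:
            return columns[column], inspected   # older SSTables are not interesting
    return None, inspected


tables = [
    ToySSTable(10, {"row1": {"a": "old"}}),
    ToySSTable(30, {"row1": {"b": "x"}}),
    ToySSTable(20, {"row1": {"a": "new"}}),
    ToySSTable(40, {"row2": {"a": "other"}}),
]

value, inspected = read_column(tables, "row1", "a")
print(value, inspected)
```

The SSTable for row2 is ruled out by the row filter, the youngest candidate has the row but not the column, and the search stops at the second candidate without touching the oldest one.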

Cassandra has found all SSTables with a positive bloom filter on the row key and has sorted them by last modification time. Now it needs to find the one which actually has our column, so it is time to look inside the SSTable:
First Cassandra reads the row keys from index.db and finds our row key using binary search. The found key contains the offset to the column index. This index holds two pieces of information: the file offset for each column value, and a bloom filter built on column names. Cassandra checks the bloom filter for the column name, and if it is positive, it tries to read the column value. That is all.

For the record:

  • index.db contains sorted row keys, not the column index as the name would suggest. The column index can be found in data.db, under a dedicated offset, which is stored together with each row key.
  • An SSTable has one bloom filter built on row keys. Additionally each row has its own bloom filter, which is built on column names. An SSTable containing 100 rows will have 101 bloom filters.
  • In order to find a given column in an SSTable, Cassandra does not immediately access the column index. It first checks the key cache, where a hit leads directly from the row key to the column index. In this case only one disk access is required: to read the column value.
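The lookup from row key to column index offset, with the key cache as a shortcut, might be sketched like this. The function, the in-memory lists and the offsets are hypothetical and only mirror the description above; they say nothing about the real file layout.

```python
from bisect import bisect_left


def find_column_index_offset(sorted_keys, offsets, row_key, key_cache):
    """Return (offset of the column index, where the answer came from)."""
    if row_key in key_cache:
        return key_cache[row_key], "key cache"     # no index.db access needed
    pos = bisect_left(sorted_keys, row_key)         # binary search over sorted row keys
    if pos < len(sorted_keys) and sorted_keys[pos] == row_key:
        key_cache[row_key] = offsets[pos]           # remember it for the next read
        return offsets[pos], "index.db"
    return None, "miss"


# Hypothetical content of one index.db: sorted row keys with their offsets into data.db
sorted_keys = ["row1", "row2", "row7"]
offsets = [0, 128, 512]
key_cache = {}

first = find_column_index_offset(sorted_keys, offsets, "row2", key_cache)
second = find_column_index_offset(sorted_keys, offsets, "row2", key_cache)
missing = find_column_index_offset(sorted_keys, offsets, "row3", key_cache)
print(first, second, missing)
```

The first read has to search the sorted keys, while the second read of the same row is answered from the key cache.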

Conclusion

Bloom filters for rows are always in memory, so accessing them is fast. But accessing the column index might require extra disk reads (row keys and column index), and this applies per SSTable.
Reading can get really slow if Cassandra needs to scan a large number of SSTables and the key cache is disabled or not loaded yet.

Cassandra sorts all SSTables by modification time, which at least optimizes the case where a single column is stored in many locations. On the other hand, it might need to go over many SSTables to find an “old” column. In such a situation the key cache increases performance significantly.

Row keys for each SSTable are stored in a separate file called index.db. During startup Cassandra “goes over those files” in order to warm up. Cassandra uses memory mapped files, so there is hope that, having read the files during startup, the first access to those files will be served from memory.

cassandra_read_db

cassandra_read_sstable