Currently, the Hadoop based software company Cloudera creates the new certification called Data Science Essentials Exam (DS-200). One goal of the certification is to learn tools, techniques, and utilities for evaluating data from the command line. That’s why I am writing this blog post. The Unix shell provides a huge set of commands that can be used for data analysis. A good introduction to Unix commands can be found in this tutorial.
The data analyst friendly commands are: cat, find, grep, wc, cut, sort, uniq
This commands are called filters. Data passes through a filter. Moreover, a filter can modify data a bit on the way through them. All filters read data from the standard input and writes data to standard output. Filter can use the standard output of another filter to be its standard input while using the pipe “|” operator. E.g. the cat command reads a file to standard output and the grep command uses this output of cat as standard input to search if the city ‘Munich’ is in a city file. The example dataset is available on github.
bz@cs ~/data $ cat city | grep Munich 3070,Munich [München],DEU,Bavaria,1194560
In the example above you can see the structure of the sample data set. The dataset is a comma separated list. The first number represents the id of an entry, followed by the name of a city, the countrycode, district and the last number represents the population of a city.
Now, let’s answer an analytical question: What is the city with the biggest population in the data set? The second and the fifth column can be selected with the help of awk. Awk creates a list where the population is on the first position and the city name is on the second position. The sort command can be used for sorting. Therefore, it is possible to find out which city in the dataset has the biggest population.
bz@cs ~/data/ $ awk -v OFS=" " -F"," '{print $5, $2}' city | sort -n -r | head -n 1 10500000 Mumbai (Bombay)
It is also possible to make joins in the Unix shell with the command called join. The join command assumes that input data is sorted based on the key on which the join is going to take place. You can find another dataset on github which contains countries. This dataset is a comma separated list as well. The 14th column in the country dataset represents the capital id which is similar to the id in the city data set. This makes it possible to create a list of countries with their capitals.
bz@cs ~/data/ $ cat city | head -n 2 1,Kabul,AFG,Kabol,1780000 2,Qandahar,AFG,Qandahar,237500 bz@cs ~/data/ $ cat country | head -n 2 AFG,Afghanistan,Asia,Southern and Central Asia,652090,1919,22720000,45.9,5976.00,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF NLD,Netherlands,Europe,Western Europe,41526,1581,15864000,78.3,371362.00,360478.00,Nederland,Constitutional Monarchy,Beatrix,5,NL bz@cs ~/data/ $ join -t "," -1 1 -2 14 -o '1.2,2.2' city country | head -n 2 Kabul,Afghanistan Amsterdam,Netherlands
Finally, let’s get a deeper look in the city data set. The question for this example is: How is the distribution of cities in the city data set? A combination of the sort and the uniq commands allows us to create data for a density plot. This data can be streamed (>) to a file.
bz@cs ~/data/ $ cat city | cut -d , -f 3 | uniq -c | sort -r | head -n 4 363 CHN 341 IND 274 USA 250 BRA bz@cs ~/data/ $ cat city | cut -d , -f 3 | uniq -c | sort -r > count_vs_country
Gnuplot is a command which allows us to visualize the density data file. We have to tell gnuplot what it has to print and how it should be printed. You can use gnuplot while telnet or ssh session as well because plots can be printed in ACSII-Characters. Therefore, the terminal type has to be set to ‘dumb‘.
bz@cs ~/data/ $ gnuplot G N U P L O T Version 4.6 patchlevel 2 last modified 2013-03-14 Build System: Darwin x86_64 Copyright (C) 1986-1993, 1998, 2004, 2007-2013 Thomas Williams, Colin Kelley and many others gnuplot home: http://www.gnuplot.info faq, bugs, etc: type "help FAQ" immediate help: type "help" (plot window: hit 'h') Terminal type set to 'x11' gnuplot> plot 'count_vs_country' with points gnuplot> set term dumb Terminal type set to 'dumb' Options are 'feed size 79, 24' gnuplot> plot 'count_vs_country' with points 400 ++------------+-------------+------------+-------------+------------++ + + + + 'count_vs_country' A + 350 A+ ++ A | | | 300 ++ ++ |A | 250 +A ++ | | 200 ++ ++ |A | | A | 150 ++ ++ | A | 100 ++ ++ | AA | | AAA | 50 ++ AA ++ + AAAAAAAA + + + + 0 ++------------+AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA---++ 0 50 100 150 200 250
I hope you enjoyed this little excurse in data analysis with the Unix shell. It is useful for students which are currently working on the study guide of Data Science Essentials (DS-200) Beta. Furthermore, I demonstrated how powerful the Unix shell can be used for basic analytics. The Unix shell is also able to do basic things like an analyst normally is executing in a statistical software as R.
Nice overview of basic working with gnu tools.
One remark: Don’t use “cat” if there is only one file – see also:
http://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_cat
Just use grep instead of cat | grep
Cheers
The above ‘join’ command does not work -at least not in Ubuntu 3.13. The problem seems to be that ‘join’ expects string order (not numerical order, which is the way the raw data files are sorted). In other words, the ‘capital id’ field is seen as a string, not a number. To solve the problem, one can use sort to sort the files without specifying the -n option:
$ sort -t, -k1 city >sortcity
$ sort -t, -k14 country >sortcountry
This will sort the files by join column but using string comparison. Even then there are a few issues;
$ join -t “,” -1 1 -2 14 -o ‘1.2,2.2’ sortcity sortcountry >result
join: sortcity:11: is not sorted: 100,Paraná,ARG,Entre Rios,207041
join: sortcity:11: is not sorted: 100,Paraná,ARG,Entre Rios,207041
join: sortcountry:10: is not sorted: AZE,Azerbaijan,Asia,Middle East,86600,1991,7734000,62.9,4127.00,4100.00,Azärbaycan,Federal Republic,Heydär Äliyev,144,AZ
Still, it’s pretty close:
$ wc -l result
200 result