World Cup Data Analysis For Fun - Part II

Continuing from Part I ( http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-i.html ), the following chart shows the density of the number of goals scored by each country in a World Cup tournament.  The black line in the foreground is the average density of goals.

Some interesting facts:
* The purple peak is Wales, with four goals in 1958, the only year they played.
* The orange-yellowish peak is Bolivia, scoring no goals twice and one goal once.
* A large percentage of countries (~80%) score no more than 10 goals in a tournament.





Goals For Summary (per country per cup):
  • Min.   :  0.0
  • 1st Qu.:  2.0
  • Median :  4.0
  • Mean   :  5.7
  • 3rd Qu.:  8.0
  • Max.   : 27.0

Goals Against Summary (per country per cup):
  • Min.   :  0.0
  • 1st Qu.:  4.0
  • Median :  5.0
  • Mean   :  5.7
  • 3rd Qu.:  7.0
  • Max.   : 17.0

While the number of goals scored by a country in each World Cup is low (see chart above), it is also interesting to see the trend, over many decades, of all goals (scored + allowed) per game.  Here I applied LOWESS (locally weighted scatterplot smoothing), a non-parametric regression, to better fit the data (blue line).
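For reference, a minimal R sketch of such a fit, assuming a data frame goals_per_game with columns year and goals built from the match data (the names are illustrative, not from the original analysis):

plot(goals_per_game$year, goals_per_game$goals,
     xlab = "Year", ylab = "Goals per game")
fit <- lowess(goals_per_game$year, goals_per_game$goals, f = 2/3)   # LOWESS smoother
lines(fit, col = "blue", lwd = 2)                                   # smoothed trend line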


  
Though there were a lot more goals per game in the early years, in the recent past (after 1970) it has stabilized at around 2.7 goals per game.  But how do the soccer powerhouses (Argentina, Brazil, Germany, etc.) compare with seven other countries chosen from another cluster (see Part I)?  As one would expect, you have to score more than you allow :) which is represented by the gray dashed line on the Y-axis, i.e.,

Goals Scored / Goals Allowed > 1




The colored year shows the winner of the World Cup that year, while the size of the bubble shows the total goals (scored plus allowed).  These six countries won all the World Cups between 1930 and 2006 except for 1930 and 1950, when Uruguay won; there were no World Cups in 1942 and 1946.

The outlier at the top left (BR, 10) is from 1986, when Brazil scored 10 goals and allowed only 1 in 5 matches, while Argentina won the World Cup that year, scoring 14 goals and allowing 5 in 7 matches.

And the big dot at the bottom (US, 0.14) is from 1934, when the US scored 1 goal and allowed 7.



World Cup Data Analysis For Fun - Part I

With the World Cup fever of 2014 all around, it is interesting to do some analysis and dig deeper into the stats.  Here is an attempt made over a weekend.

I pulled some publicly available data for all World Cups from 1930 to 2006, and after cleaning it up for my purposes it had the following fields for each match/game:
Country, Year, FIFA_Winner, Country_Code,
Goals_For, Goals_Against, Matches, Penalties,
Won, Drawn, Lost, Corners, Offsides,
Shots_On_Goal, Free_Kicks, etc.


My first attempt was to look at how the countries cluster together; clustering is also easy to validate against some prior knowledge of the World Cup. For example, one would expect Brazil, Germany, Argentina and a few others to cluster together.

As in any statistical analysis, it is a bit of a challenge to decide how to handle missing values.  In the above data, fields like "Shots on Goal, Shots Wide, Free Kicks, Corners" were not available until 2002.  These values can either be set to 0 or filled with the mean of the available data (over the available period) with a function like

mean_vec <- function(vec) {
    m <- mean(vec, na.rm = TRUE)    # mean of the non-missing values
    vec[is.na(vec)] <- m            # replace NAs with that mean
    return(vec)
}


where 'NA' is replaced with the mean.  It can be used either column-wise or row-wise through the apply function.  Note that this is the grand mean of each column, which introduces its own errors into the model.  Better would be to take the mean at the country level (simple, straightforward, and works well for data with Gaussian characteristics), or to use other techniques including regression substitution, most probable value substitution, etc.  For some more details see http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
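For example, the column-wise use looks like the following sketch, assuming the cleaned data sits in a data frame wc and the columns are the ones listed above (both names are assumptions for illustration):

cols <- c("Shots_On_Goal", "Free_Kicks", "Corners", "Offsides")
wc[, cols] <- apply(wc[, cols], 2, mean_vec)   # fill NAs column by column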

Running the sum of squared errors (SSE) over a range of cluster counts yielded the chart below. With the elbow/bend between 4 and 6, a minimum of 4 clusters would be sufficient; I chose 10 for the analysis below.
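A sketch of how such an SSE/elbow curve can be computed with k-means, assuming wc_scaled is the NA-filled and scaled numeric country-by-cup matrix (an assumed name):

sse <- sapply(1:15, function(k) {
    kmeans(wc_scaled, centers = k, nstart = 10)$tot.withinss   # within-cluster SSE for k clusters
})
plot(1:15, sse, type = "b", xlab = "Number of clusters", ylab = "Within-cluster SSE")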




With 10 clusters, the clustering resulted in the following dendrogram:
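A minimal sketch of the hierarchical clustering behind such a dendrogram, again assuming the scaled matrix wc_scaled with country names as row names:

hc <- hclust(dist(wc_scaled))                      # hierarchical clustering on Euclidean distances
plot(hc, labels = rownames(wc_scaled), cex = 0.7)  # the dendrogram
rect.hclust(hc, k = 10)                            # outline the 10 clusters
clusters <- cutree(hc, k = 10)                     # cluster membership per country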



How would the soccer powerhouses like Brazil, Germany and a few others (cluster 7 from the left in the above diagram) compare with the rest?  One metric is how many goals they score in each match while allowing some.  A density plot is one visualization: here I plotted a 3-dimensional density with "Goals For" on the X axis and "Goals Against" on the Y axis. I left Sweden out of the list for now.  There is a twin peak at 1 and 2 goals in favor with ~0.5 goals against per game.  Contrast this with the other set of countries further below.
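Such a surface can be estimated with a 2D kernel density; a rough sketch, assuming goals_for and goals_against are per-game vectors for the countries in this cluster (assumed names):

library(MASS)
dens <- kde2d(goals_for, goals_against, n = 50)     # 2D kernel density estimate
persp(dens$x, dens$y, dens$z, theta = 30, phi = 30,
      xlab = "Goals For", ylab = "Goals Against", zlab = "Density")
contour(dens$x, dens$y, dens$z)                     # flat, top-down view of the same surface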






Comparing with the 7 other countries from the last cluster (#10 in the above dendrogram), I get a different density plot, where the peak occurs at ~0.6 goals in favor and ~2 goals against per game.

PS: Note the difference in scales between these two plots.  It would be interesting to superimpose one on the other with the same scale along all 3 dimensions.




A heat map is another visualization with more detail, including the deviation of each variable (represented by the light blue vertical lines below).  Compare "Games Lost and Goals Against" with "Games Won and Goals For" for the two clusters below, and also Shots on Goal.
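A base-R sketch of such a heat map, assuming the same scaled country-by-variable matrix; the trace lines in the images resemble those drawn by heatmap.2 from the gplots package, so that variant is noted as an assumption, not a certainty:

heatmap(as.matrix(wc_scaled), scale = "column", margins = c(8, 6))
# gplots::heatmap.2(as.matrix(wc_scaled), scale = "column", trace = "column")  # variant with trace lines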



More (part II) analysis at: http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-part-ii.html


Your genome data through API

Recently 23andme.com reduced the price of their DNA test kit to $99, so now you get your hereditary and other related information for less than $100!  That is a big drop from where it started: $999, and then $299.  I know little about genomes/genomics, but it is a very interesting, dynamic, fast-growing field with the potential to change the way we view health (one is now empowered to know whether they are at risk of potential congenital diseases, now or in the future) or one's ancestry!

My interest was in the data you can play with.  With the API in place you can pull your own or demo data.  To do that, I first needed to set up the data pull through the API; the following is a quick summary of the setup I had on my Mac.

Some important links:
23andme API
OAuth Introduction
OAuth flow

After creating a developer login account, you can set up the application/client with the given credentials - client_id and client_secret.  See below.



For quick testing and pulling, I used the Dancer web framework on my local Mac; 23andme uses OAuth2 with 3-legged authentication.  As a first step, get the 'code' by creating a simple page with a link to their login page.

For example, the link below takes the user to authorize your client, and once the user successfully logs in, the authentication dance happens between the client and server.

<a href="https://api.23andme.com/authorize/?redirect_uri=http://localhost:5000/receive_code/&response_type=code&client_id=YOUR_CLIENT_ID&scope=basic rs123">Connect with 23andMe. [pid: $$]</a>

Note: pid above is a process id for me to know when I killed and restarted the Dancer.

User clicks on the link


and then logs in to authorize your client to access her resources.  This is the point where the 'code' is received and exchanged for an access_token.


After a successful OAuth dance you can call any of the endpoints ( https://api.23andme.com/docs/reference/ ).  Here is the demo user call ( https://api.23andme.com/1/demo/user/ )
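(Aside: once the access_token is in hand, any HTTP client can make these calls, not just the Dancer app. A rough sketch in R with the httr package, where access_token is assumed to hold the token from the exchange above:)

library(httr)
resp <- GET("https://api.23andme.com/1/demo/user/",
            add_headers(Authorization = paste("Bearer", access_token)))   # OAuth2 bearer token header
content(resp, as = "parsed")                                              # JSON body as an R list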


Data bandwidth diagram - Washington Post article

In the last few days there has been a lot of news with respect to the NSA leaks. One of the presentation slides (#2) in the article shows the bandwidth capacity, as shown below. The diagram utilized D3 with data in a CSV file.




Shown below is the same data in a chord diagram, with JavaScript animation highlighting each region's bandwidth on mouse-over.



Asia/Pacific Region highlighted below:



Git useful links

Here are some git links that are very useful and have helped me over time in working with git.  If you have other suggested links, feel free to send them to me or add them in the comments.


http://git-scm.com/book - A must read (free pdf book!)

http://gitimmersion.com/index.html - Downloadable tutorial and try out commands on sample

http://sitaramc.github.com/master-toc.html - Excellent coverage of git

http://marklodato.github.com/visual-git-guide/index-en.html - A nice quick visual guide of git

http://www.vogella.com/articles/Git/article.html - Another excellent intro to git

http://ndpsoftware.com/git-cheatsheet.html#loc=workspace - A dynamic visual cheat sheet

http://www.gitguys.com/topics/ - A detailed intro with presentations

http://nvie.com/posts/a-successful-git-branching-model/  - Great article on branching strategy

http://steveko.wordpress.com/2012/02/24/10-things-i-hate-about-git/ - A view on why git is bad for version control

http://stackoverflow.com/questions/871/why-is-git-better-than-subversion - A view on why git better than svn


Other very specific git related topics:

http://stackoverflow.com/questions/67699/how-do-i-clone-all-remote-branches-with-git

http://longair.net/blog/2009/04/16/git-fetch-and-merge/ - fetch/merge Vs pull commands

http://toroid.org/ams/git-central-repo-howto - Sharing git repo with others

http://stackoverflow.com/questions/457927/git-workflow-and-rebase-vs-merge-questions - Detailed discussion of rebase vs merge

http://stackoverflow.com/questions/1241720/git-cherry-pick-vs-merge-workflow - When to use cherry pick vs merge/rebase

http://stackoverflow.com/questions/1398329/take-all-my-changes-on-the-current-branch-and-move-them-to-a-new-branch-in-git - Apply current branch changes to other

http://stackoverflow.com/questions/2187000/untracked-files-between-branches-in-git - Switching between branches with untracked files

http://stackoverflow.com/questions/2498458/why-did-git-set-us-on-no-branch - Solution to "no branch" issue

https://ariejan.net/2010/06/10/cherry-picking-specific-commits-from-another-branch - Cherry pick a specific commit

http://stackoverflow.com/questions/2862590/how-to-replace-master-branch-in-git-entirely-from-another-branch - Replacing master branch itself!

https://github.com/git/git/blob/master/contrib/completion/git-completion.bash - Git bash utility

http://nathaniel.themccallums.org/2010/10/18/using-git-fast-forward-merging-to-keep-branches-in-sync/ - git fast-forward example

http://stackoverflow.com/questions/3895453/how-do-i-make-a-git-commit-in-the-past - Perform a commit in the past!

Cheers,
Shiva



Avoid git prompt for username and password

While working with git, you can either enter a username and password each time you perform operations like git pull, push, etc., or set things up so that git authenticates without prompting.  In the following examples I will be using github repositories for the remotes, with local repos on my Mac.  I will also be using my own account and a company user account to access two different github accounts and their respective repos.

Simple setup:


   One account only and access to its respective repositories.

Here, just modify your ~/local_repo/.git/config file to use either an https or an ssh connection.  A typical config looks like

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    # url = https://github.com/USER_OR_ORG_NAME/REPO.git
    url = ssh://git@github.com/USER_OR_ORG_NAME/REPO.git
[branch "master"]
    remote = origin
    merge = refs/heads/master
...
...
With this, command line git pull/push works with an ssh pub-key set up at github.com.  See how to provide the pub-key to github.  If you already have an ssh key and would like to reuse it, just add an entry in the ~/.ssh/config file with the corresponding details.  See below for more on how to do this.

By default, when cloning is done with the git clone command, the generated config file will have an entry of "url = https://github.com/..."  (assuming you are cloning from a github repo).   See how to switch from https to ssh for a given repo, or you can comment out the line as above and enter the "ssh" URL shown in the example.  Git config (git_config) has a great many options, and you can control different aspects of git behavior with it.

Multiple account setup:


Another scenario, among many, is using two or more different github accounts from the same computer/system.  In this case the ~/.ssh/config file is useful.  Here an ssh setup is required; https doesn't help much.

Other variations would be multiple accounts with different service providers, not just github.  That is a little easier than having multiple accounts with the same service provider.

Example: two Github accounts - doe_work and doe_personal - and for each account you create a separate ssh key with ssh-keygen.

> ssh-keygen -t rsa -f ~/.ssh/id_rsa_doe_work_git  -C "Job new_key doe@company.com"
> ssh-keygen -t rsa -f ~/.ssh/id_rsa_doe_personal   -C "Personal key"

Now configure both accounts to use ssh and its config - ssh_config.  The important section in ssh_config is "Host", a string pattern matched in the config file to pick up its respective options/variables.  You obviously should add the newly created public keys to github under Settings -> SSH Keys -> Add SSH Key. A sample pub key for the above:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdnbxKkCrYUv3YbutC2Dw6jIhQWLNIzNA3Ec6inlmrngwB33fCaEP4ZiOzPq8A0BRBCyV HYhC3txA9Jn1tRXVZ4tUGEslvN2qF2HNXJhSx8V5Vk1r3LmWe1uehOjAekSK0apELpkafSwigzgkm9oAmbNQ5p0N1e8ar/TXbOOzWVMRu9K G/fILuHf90UZ4H5hOrZov9eZSwabnSMvORirizFXYZPp/FQ30fV3wZJKJoNnmOY+/txjnNc+mikYiezjeA66vWlDGfJQ+Xlb+i1bnXoxBfv hrE/nSuSUVNmGy0bYPOFwbxPrnz0jFGCgdUh7KfKD2yE/gc0abhW0nyxkP Job new_key doe@company.com

...
ServerAliveInterval 60
ServerAliveCountMax 60
...
Host github.com
    HostName github.com
    # User doe_work
    IdentityFile ~/.ssh/id_rsa_doe_work_git
...
Host github-mine
    HostName github.com
    # User doe_personal
    IdentityFile ~/.ssh/id_rsa_doe_personal
...
Now, in each repo, change its .git/config to use the above ssh hosts.
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = ssh://git@github.com/USER_OR_ORG_NAME/REPO.git
and
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = ssh://git@github-mine/USER_OR_ORG_NAME/REPO.git

The three configuration files discussed are used by git to perform the handshake with the remote server and to authenticate for each git pull, push, fetch, etc.


Large data mysqldump to Greenplum

Recently I needed to load a single table with a few hundred million rows from a transactional MySQL system into Greenplum/PostgreSQL.  The MySQL schema didn't have many tables, but this single table was large, around 50G including data and index.  I ended up testing the 2 different techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straightforward: with mysqldump I should be able to use the postgres load utility
> psql -h HOST -U USER -f FILENAME
but it turned out to be an interesting challenge, with many changes needed to load successfully.
Another minor quirk was that the transaction system was using Clustrix, a vendor-specific version of MySQL. Its dump creates a file that is not fully compatible with a direct load into postgresql.  Dumping even with the --compatible=postgresql option didn't help much.

One of the major issues while loading a huge file with the psql utility is the "Out of memory" error, which occurs even with a reasonably small file, say 1G.
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.
As a first step, I removed all MySQL comments and anything other than the data in the INSERT INTO statements.

Example lines removed are below.
-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
And retained only the lines between
   the INSERT INTO line and the ENABLE KEYS line,
using a script to perform the filtering.

This gave me all the data I needed in only a few hundred lines, with each line as long as 10 MB or more!   These are long lines with thousands and thousands of records.  At certain intervals, every 100,000 rows or so, Clustrix inserted a new "INSERT INTO ..." statement.  I removed these extra insert commands and split the records with a simple perl one-liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g' FILENAME
thus inserting a newline at the end of each record; the new file now had around 200 million lines.

With the continued "Out of memory" errors you may be misled into believing that Greenplum is slurping all the data into memory and then trying to load it, which shouldn't be the case in the first place; with an INSERT INTO .... VALUES ( .... ) statement there is no need to do so.  The next option was to find the possible error by splitting the file into smaller files, adding an INSERT INTO statement at the beginning of each split file, and then removing the trailing "," at the end of its last line.

After trying splits of 10 million, 1 million and 0.5 million lines, Greenplum started throwing appropriate errors, like a non-existent table (because the schema search path was not set for postgresql), a missing ",", etc.

The split command used:
> split --lines=500000 FILENAME
Then "INSERT INTO ...." was added to each of these files, and instead of seeking to the end of each file and removing the extra ",", I appended a dummy row which I could delete later from the uploaded table.

> for fn in `ls x*`;
    do echo "Working on $fn";
      echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
      cat $fn >> "${fn}_r_l";
      echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l" ;
   done
This created, for each split file, a corresponding file with the "_r_l" suffix (ready_to_load).

Then loaded the table:
> for fn in `ls xd*_r_l`;
    do
      echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "$fn";
    done


Systems and utilities used:

  Greenplum DB - Greenplum Database 4.0.6.0 build 4
  Postgresql - PostgreSQL 8.2.14
  MySQL - 5.0.45-clustrix-v4.1
  Perl - 5.8.8 multithreaded
  Bash
  All running on linux x86_64 with 24G memory

There were more than 400 files (0.5 million lines each) loaded in less than three hours.  Still substantial, but it was a one-time load and that was acceptable.


Technique 2:   Using mysqldump and Greenplum gpload

Greenplum's bulk loading utility (gpload) is an excellent way to load large data sets.  After dumping the data and cleaning and formatting it into a few files of 10G each, you can use gpload as below.

gpload  -f  $gpload_ctl_file

with the control file created dynamically from a template.  For example, in the control file below, replace all placeholders with their respective values.  With a dynamically created control file (and no hard-coded values), the technique can be used for daily bulk loads as well.

VERSION: 1.0.0.1
DATABASE:
USER:  
HOST:  
PORT:  
GPLOAD:
   INPUT:
     - SOURCE:
         LOCAL_HOSTNAME:
           -
         PORT:
         FILE:
           -
     - FORMAT: text
     - DELIMITER: '|'
     - NULL_AS: 'NULL'
     - ERROR_LIMIT: 25
     - ERROR_TABLE: sandbox_log.gpload_errors
     - COLUMNS:
         - timestamp: text
         - priority: text
     ...
     ...
   PRELOAD:
     - TRUNCATE: false
   OUTPUT:
     - TABLE:
     - MODE: INSERT

This is a much faster and more efficient load than technique 1.

HTH,
Shiva

Recursion defined

Recursion, see Recursion.  :)

Something defined in terms of itself.  Or sometimes CS scientists or programmers making the point through names like
GNU - "GNU's Not Unix"
YAML - "YAML Ain't Markup Language"
Or beautiful Sierpinski triangles.

When a function calls itself, some interesting things happen behind the scenes, like holding onto the variables that are used later when execution unwinds the stack.  In a typical example of recursion, computing a factorial, one may write
#!/usr/bin/env perl

use strict;
sub factorial {
    my $v = shift;
    return 1 if $v == 1;
    return $v * factorial($v - 1);
}
factorial(5);
When the first call is made to factorial(5), execution jumps to the factorial function (subroutine) and gets to the last line, where, while evaluating it, it encounters another call, factorial($v - 1), which again invokes the subroutine. This pushes a new call frame onto the stack (with its args).  When a function returns, its frame is popped off the stack and is gone (lost).

A few things work together here: the call stack, the heap, the garbage collector (which reclaims the memory of any variable or function object whose reference count drops below 1), and the execution system.

Now, to see more of what happens during the recursion, you can try the following.


  1 #!/usr/bin/env  perl
  2 $! = 1;
  3 use strict;
  4 use IO::Handle;
  5 use Carp qw(cluck);
  6
  7 STDOUT->autoflush(1);      # Flush output immediately
  8 STDERR->autoflush(1);
  9
 10 sub factorial {
 11     my $v = shift;
 12  
 13     dummy_func();             # Sub that returns immediately printing call stack
 14     return 1 if $v == 1;
 15     print "Variable v value: $v and its address:", \$v,
                     "\nCurrent sub factorial addr:", \&factorial, "\n","-"x40;
 16     return $v * factorial($v - 1);    # Builds on call for each func call
 17 }
 18  
 19 sub dummy_func {
 20     cluck;
 21 }
 22
 23 factorial(5);
Resulting output:
  1     main::dummy_func() called at ./t_recursion.pl line 13
  2     main::factorial(5) called at ./t_recursion.pl line 23
  3 Variable v value: 5 and its address:SCALAR(0x7ff6240546a0)
  4 Current sub factorial addr:CODE(0x7ff62402f2c0)
  5 ----------------------------------------
  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9 Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)
 11 ----------------------------------------
 12     main::dummy_func() called at ./t_recursion.pl line 13
 13     main::factorial(3) called at ./t_recursion.pl line 16
 14     main::factorial(4) called at ./t_recursion.pl line 16
 15     main::factorial(5) called at ./t_recursion.pl line 23
 16 Variable v value: 3 and its address:SCALAR(0x7ff6240612f8)
 17 Current sub factorial addr:CODE(0x7ff62402f2c0)
 18 ----------------------------------------
 19     main::dummy_func() called at ./t_recursion.pl line 13
 20     main::factorial(2) called at ./t_recursion.pl line 16
 21     main::factorial(3) called at ./t_recursion.pl line 16
 22     main::factorial(4) called at ./t_recursion.pl line 16
 23     main::factorial(5) called at ./t_recursion.pl line 23
 24 Variable v value: 2 and its address:SCALAR(0x7ff624061538)
 25 Current sub factorial addr:CODE(0x7ff62402f2c0)
 26 ----------------------------------------
 27     main::dummy_func() called at ./t_recursion.pl line 13
 28     main::factorial(1) called at ./t_recursion.pl line 16
 29     main::factorial(2) called at ./t_recursion.pl line 16
 30     main::factorial(3) called at ./t_recursion.pl line 16
 31     main::factorial(4) called at ./t_recursion.pl line 16
 32     main::factorial(5) called at ./t_recursion.pl line 23

When the recursion script is kicked off, it pushes the first factorial(5) frame onto the call stack (line 2 above), which calls dummy_func, which then also goes onto the stack (line 1).   Hence, when cluck is called in dummy_func, there are two calls on the stack, along with any arguments passed.

Then dummy_func returns and is popped from the stack.  The program moves to line 14 (of the script above), whose condition evaluates to false, and line 15 then prints output lines 3 & 4 ($v and its address, and the factorial sub's address).

Script line 16 calls factorial, which pushes a new function call onto the stack; at this point the value of $v is 5.  The call frame and this variable are in the same scope on the stack, so later, when the inner call returns, its result is multiplied by $v (value 5).

When factorial is called the 2nd time (the first time from line 16, pushed onto the call stack), the argument is reduced by 1 ($v - 1), a copy is made for the new frame, and execution starts at the top of the subroutine again.  Remember that the definition of the function is always the same copy at one location (CODE(0x7ff62402f2c0)) in program memory.

This execution then calls dummy_func, which spits out the call stack; as you would expect, you now have dummy_func at the top, the 2nd factorial call in the middle and the 1st factorial call at the bottom.  The stack is FILO (First In, Last Out; equivalently LIFO, Last In, First Out).  Then execution moves to lines 14 & 15.   The output looks like:

  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9  Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)

At script line 16 the recursion continues, and you get output lines 12 to 32.  In the last call, the base or terminal condition of the recursion is met ( return 1 if $v == 1; ) and it returns 1.

factorial of 1 => 1! = 1;

Now the stack unwinding begins: the return value of 1 (when factorial(1) returned) is multiplied by the variable $v (value 2), resulting in 2, which is returned by the  return $v * factorial($v - 1);  statement.

Finally,  5! = 120.

All this happens behind the scenes, and it may be enough just to know and recognize the common pattern when it happens :).  I wouldn't worry about how the implementation works when I run a query like

SELECT column_N FROM table_X;
It is so darn simple, but so much goes on behind that SQL statement, from mapping the table to a file to finding the exact location in the file to extract the correct values.  It is all hidden from the application program.

For more details take a look at "Call Stack" or "Activation Record".

But if you would like to dig deeper through debugging, try


> perl -d t_recursion.pl
Loading DB routines from perl5db.pl version 1.33
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(t_recursion.pl:2):	$! = 1;
  DB<1> n
main::(t_recursion.pl:7):	STDOUT->autoflush(1);
  DB<1> n
main::(t_recursion.pl:8):	STDERR->autoflush(1);
  DB<1> n
main::(t_recursion.pl:23):	factorial(5);
  DB<1> s
main::factorial(t_recursion.pl:11):
11:	    my $v = shift;
  DB<1> s
main::factorial(t_recursion.pl:13):
13:	    dummy_func();
  DB<1> s
main::dummy_func(t_recursion.pl:20):
20:	    cluck;
  DB<1> T
. = main::dummy_func() called from file `t_recursion.pl' line 13
. = main::factorial(5) called from file `t_recursion.pl' line 23
  DB<1> 


Using Webfonts or font-face

With all new browsers supporting CSS3, it is easy to use .woff (Web Open Font Format) fonts, and content/page developers aren't limited to the fonts locally available on the user's computer.  W3 describes CSS3 as:

With CSS3, web designers are no longer forced to use only web-safe fonts

Along with many others, Google provides access to a repertoire of fonts that can be used.  As of early 2013 there were more than 615 font families available.


Using @font-face, the browser links to the requested font and uses it.  Multiple @font-face rules can be applied and used selectively within the same page or site. In Google dynamic views, for example, with the embedded code

@font-face {
  font-family: 'Great Vibes';
  font-style: normal;
  font-weight: 400;
  src: local('Great Vibes'), local('GreatVibes-Regular'), url(http://themes.googleusercontent.com/static/fonts/greatvibes/v1/6q1c0ofG6NKsEhAc2eh-3brIa-7acMAeDBVuclsi6Gc.woff) format('woff');
}

one can set the required selector to this font.  To get the @font-face rules above, click on the "Quick-use" link on Google webfonts and paste the href into a browser.  The W3 draft on CSS3 explains these rules and all the CSS3 details.

To access the Blogger dynamic views template, log in to your Blogger home and select Template > Customize (button) > Advanced > Add CSS.  Paste @font-face rules like the ones below.



In the above example the 'Great Vibes' font is applied to the titles of the pages/content.


Github weekend issue - Dec.23, 2012

Uh oh. Github seems to have had a major issue on Saturday starting around 4.30pm PST. It has been down for a while now :(.  The site status page https://status.github.com/ is showing





And the main site:



Greenplum Postgres Regexp fun

In the last few months I started working in a Greenplum/postgres environment and was pleasantly surprised by its regular expression capabilities.  Compared to MySQL (5.1), it has quite advanced RE features. For example, I ended up using the following query to extract the decimal numbers from a string.

The input strings have the format "BucketN_N[-N|+]" where N is a decimal number.  For example, "Bucket1_1", "Bucket7_21-50" and "Bucket5_100+" are all valid, and the output should extract the bucket number (first N), the minimum value (second N) and the maximum value of the bucket (third, optional N).

The query 
SELECT
  bucket_string
  ,substring(bucket_string, 7, 1)::integer as bucket_num
  ,regexp_replace(bucket_string, '^Bucket\\d{1,}_(\\d{1,})([-+].*)?', E'\\1')::integer as minimum_qty
  ,regexp_replace(regexp_replace(bucket_string, '.*-(\\d{1,})', E'\\1'),  '.*_(\\d{1,})([+]?)$', E'\\1')::integer
  as maximum_qty
FROM a_table
ORDER BY bucket_string
Though I could have used other functions like substring, it was fun using regular expressions for the smallest amount of code.  regexp_replace is the workhorse, taking a string and replacing it with the matched group (\\1 or $1).  "^" and "$" are anchors to the beginning and end of the string, \\d matches decimal digits, and {1,} is a one-or-more quantifier.


And more, with regexp_matches and array selection.  In this simplified example I wanted to add an offset number of days to 2012 December 01.  The offset is text in another external Greenplum table, and the data is not always clean: it can have some text before the offset.

SELECT d
  , regexp_matches(d, '(?:.*_)?(\\d+)$')    re_ary
 , '2012-12-01'::date+((regexp_matches(d, '(?:.*_)?(\\d+)$'))[1] || 'days')::interval   new_d
FROM
(
    SELECT '90' d
    union
    SELECT 'new_90' d
    union
    SELECT 'old_90_10' d
) tt
returns

d         re_ary      new_d
-------------------------------------
90         {90}      2013-03-01 00:00:00
new_90     {90}      2013-03-01 00:00:00
old_90_10  {10}      2012-12-11 00:00:00

"?:" marks a non-capturing group, and that matched text is optional, meaning any text before the string-ending number is not captured for later use; the original string may or may not have it. '$' anchors the match to the end of the string, and regexp_matches returns an array with a single element, with the subscript [1] selecting it.



Nice chart usage for dynamic pricing

Real-time analytics allows tracking and monitoring users and their activities, and then adjusting what is presented to them.  It could be a relevant advertisement, like showing a Nikon camera ad to a user searching for a new camera or camera prices, or anything similar.

Taking it one step further, the price of an item can be adjusted depending on whether the user is a loyal customer and/or there is a higher possibility that he/she may buy other accessories.  Real-time predictive analytics makes this possible.

Below is a nice graph presented by the Wall Street Journal, also discussed at http://www.ritholtz.com/blog/2012/09/lucky-us-toilet-paper-priced-like-airline-tickets/

The graph shows three companies' (Sears, BestBuy and Amazon) price variation over a day for a microwave.  Amazon increased prices during peak hours by more than 10% (~8am to 12.30pm and again 3pm to ~9pm EDT). All times shown in the graph are in the PDT (Pacific Daylight Time) timezone.

Even more interesting would be to observe whether prices varied based on user location or on where Amazon's servers were located, as it is simple to geo-map the IP address of a user's computer/device and vary the prices accordingly!  Different users from different cities at different times would see different prices. The price points and user experience can be optimized for improved sales!





Using d3 visualization for fraud detection and trending

d3 (Data Driven Documents) is a great data visualization tool; I recently used it to track possible fraud and to see how some metrics behaved over periods ranging from a few hours to a few weeks.  You can filter a group of interest out of, say, a million or more, and then use d3 to work through it manually for a more unique set.

The graph is dynamic (unlike in this blog): you can select a range of lines by moving the cursor over any of the Y-axes and selecting them.  It is much more interesting in action than the snapshot images below.

Below is users' "A" metric over the last 5 hours, 5 to 12 hours, 12 hours to 1 day, and similarly up to the last 14 to 28 days.  The user group was selected from a few different segments and each major color corresponds to a segment; d3 automatically varies the color slightly for each line.  Though it is not very clear in the image below, the main colors were blue, green, orange and purple.



The input to d3 is a simple csv file, and all the presentation is handled by d3, unlike many packages I had used previously, where I ended up creating an output file in HTML or some XML for Flash.  The big advantage of d3 over these is that it attaches a DOM element to each data point in your code, and the built-in data visualization functions do the rest of the magic.

In the above scenario, for example, you can move the cursor to the leftmost scale (5 hours) and zoom in on the lines between 1,700 and 2,000.  There is only one user well above the rest, who all have a metric of 200 or lower.  This user hasn't done much in the last 4 weeks until the last 5 hours!  Time to look into what this user is doing, and we use other tools for further analysis.



Similar to the above scenario, below is another graph where I am interested in all users whose score was between 600 and 1,400 over the last 2 to 7 days.  There is not much that is exciting in this graph, though I have seen more interesting details at other times.


Happy data visualization!

Users Password Analysis

As a data engineer it is always interesting to work on large, unique data sets.  With the recently released Yahoo user details (453K), a lot of insightful info can be gleaned from the data.  For example, even though password hacking has been well known for a long time, a large number of users still use simple passwords, sometimes as simple as "password" or "123456".  Here are the top 10 passwords and the number of users who used them!


123456        1667
password       780
welcome        437
ninja          333
abc123         250
123456789      222
12345678       208
sunshine       205
princess       202
qwerty         172

It is interesting to see how many users had a unique password, one not used by anyone else in this data set.  There were 10.6K users with no password, which might be due to a data issue; these were ignored for many of the calculations.  Only ~304K (69%) of users had unique passwords.
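A rough sketch of these counts in R, assuming pw is a character vector holding the non-empty passwords from the dump (the variable name is mine):

freq <- sort(table(pw), decreasing = TRUE)
head(freq, 10)                    # top 10 passwords and their user counts
sum(freq == 1) / length(pw)       # share of users whose password is unique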

Another interesting insight: if a password is used by more than one user, there is a good likelihood that it is a common word or phrase ("whatever", "iloveyou"), a proper name ("jordon", "ginger"), a number (123321), or something easily guessed (for example, "q1w2e3r4" from a qwerty keyboard, or "asdfgh", etc.).  Even when only two users used the same password there was some certainty that it is a guessable password, and with each additional user the certainty increases quite quickly.  Under these circumstances, even if the passwords are hashed (with md5, sha or other algorithms) by service providers, a brute-force application can recover the passwords for these users.

Looking at how users from different email service providers set up their passwords showed the following.  As expected, Yahoo had more users (x-axis), while smaller providers ("others" in the chart) had a higher share of users (71.7%) with unique passwords.  At the same time, gmail and live users' password length averages more than 8.87 characters.  The length of the passwords is represented by the size of the bubble.


A bigger bubble higher up on the Y-axis is better, as it represents more users using unique passwords with longer password strings.  See the table below for more details.



Even more interesting analysis can be done on people's or places' names appearing in passwords.  One could use popular names from the US Social Security Administration, whose name lists go back as far as 1880!  There were a lot more passwords that simply used these names, and many more matches can be found with minor modifications like changing i to 1 or o to 0 (zero), etc.
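A sketch of that check, reusing the pw vector from above and assuming ssa_names is a lower-cased character vector of the SSA names (again, illustrative names):

variants <- unique(c(ssa_names, chartr("io", "10", ssa_names)))   # add i -> 1, o -> 0 variants
sum(pw %in% variants)                                             # passwords that are just a (modified) name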
 
With so many users using simple passwords, service providers and websites should force each user to have a stronger password by enforcing rules during registration or at each login.  Users should also be required to change passwords once every few months.  It might be even better if each computer were equipped with a fingerprint or iris reader for user authentication, thus avoiding this whole password mess.



Simple time series metric

Here is an example of a Fusion Charts chart recently created using MySQL, Perl, PHP and Fusion Charts, with fudged data. The X axis is date & hour for the last 7 days.  The data is pulled from MySQL and an XML file is generated for consumption by PHP and Fusion Charts.


The time data is in UTC (a unix timestamp for the first second of each hour; for example, 1262307600 => 2010-01-01 01:00:00), summarized along with other dimensions and metrics. The simplified fact table looks like

ts_begin_hr, dim_1, dim_2, metric_1, metric_2, etc.

The query used to pull the required report data:

         SELECT 'dim1' dim, concat(D_H.d, ' ', D_H.hr) pst,
             CASE WHEN P.met1 IS NULL THEN  0.0  ELSE P.met1 end METRIC
         FROM
            (
             SELECT d.d, h.hr 
             FROM
              (SELECT '00' hr UNION SELECT '01' UNION SELECT '02' UNION SELECT '03' UNION SELECT '04'
               UNION SELECT '05' UNION SELECT '06' UNION SELECT '07' UNION SELECT '08' 
               UNION SELECT '09' UNION SELECT 10 UNION  SELECT 11 UNION SELECT 12 
               UNION SELECT 13  UNION SELECT 14 UNION  SELECT 15 UNION SELECT 16 
               UNION SELECT 17 UNION SELECT 18 UNION   SELECT 19 UNION SELECT 20 
               UNION SELECT 21 UNION SELECT 22 UNION SELECT 23
              ) h
              CROSS JOIN
              (SELECT full_date d FROM DW_DB.dim_date
               WHERE full_date BETWEEN date_add(current_date(), interval -6 day) AND current_date()
              ) d  -- 1
            ) D_H
            LEFT OUTER JOIN
            (
              SELECT 'dim1' dim
                  , date(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) d
                  , hour(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) hr
                  , sum(met1) met1
              FROM DW_DB.FACT_TABLE FT       
              JOIN DW_DB.DIM_TABLE DM
                ON FT.dim_id = DM.dim_id
              WHERE from_unixtime(ts_begin_hr) >= date_add(now(), interval -7 day)
              GROUP BY dim, d, hr
            ) P  -- 2
         ON D_H.d = P.d and D_H.hr = P.hr
         WHERE D_H.d < date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
            OR (D_H.d = date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
                AND D_H.hr <= hour(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
              )
         ORDER BY D_H.d, D_H.hr
        ;";


Subquery 1 (marked -- 1 above) builds a cross-joined table of the 24 hours (0..23) with the last 7 calendar dates, resulting in output like YYYY-MM-DD HH.  Single-digit hours are written as two-character strings (SELECT '0n').  The system and MySQL were running in the UTC timezone while business users needed the reports in the PST timezone.  Subquery 2 (marked -- 2) generates the required metric in PST and is joined with subquery 1.  The WHERE clause filters out any extra hours (later than the current hour) left over from the left outer join.

Subquery 1 is joined with a left outer join to handle the possibility that during a given hour there may be no metric (say, dollar amount) available, because there may not have been any transactions during that hour for a given dimension.

The result set is dumped to an XML file for Fusion Charts consumption, producing the output above through a PHP wrapper. The chart is updated every hour.

Visualizing daily metric with tiles

One effective way to present time series data over a long period is the typical line chart or some modified version of it.  It gets a little harder when you would like to see clustering of data points.  Here is a tiled series that gives a quick glimpse of a metric varying over a few years.  This is a daily metric starting January 01, 2010 and running up to a recent week of May, 2012; nearly 2 and 1/2 years of metric variation.

The Y-axis represents the day of the week (wday), with 1 for Monday, 2 for Tuesday and so forth, with 7 as Sunday.  I set the values this way so that I could cluster the weekend metrics (5, 6 & 7) together vs. the weekdays (1, 2, 3, 4 - Monday to Thursday).  The X-axis represents the week of the year (1 to 52 or 53).  Years 2010, 2011 and 2012 are the series.  Metric values vary from 0.00 to 10.00 and the color of each tile varies slightly based on the metric value.

If you were to use discrete metric values, say 0 to 5, the color coding is similarly quite distinct; see graph 2 below.  That data runs from the 2nd week of March, 2010 to May, 2012.
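For what it's worth, this kind of tile chart can be sketched in R with ggplot2, assuming a data frame daily with columns year, week, wday (1-7) and metric (names assumed for illustration):

library(ggplot2)
ggplot(daily, aes(x = week, y = wday, fill = metric)) +
    geom_tile(colour = "white") +              # one tile per day
    facet_wrap(~ year, ncol = 1) +             # one panel per year
    labs(x = "Week of year", y = "Day of week (1 = Monday)")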




Graph 1



Graph 2

User traversal digraphs

Users' traversal of a website, the path traversed in an application, or the nodes visited may need to be analyzed for a better user experience, to improve the efficiency of the path traversed, or to improve the conversion rate by getting the user to the end state. For example, it may be that users visiting one set of nodes (a node meaning a particular state of the application, like a step in a tutorial state machine) convert better than users going through a different set of states. There are many cases where a graph or directed graph is a necessity.

You can use free tools like 'dot' (from Graphviz) to draw directed graphs. In the following example, I assume the state/node is represented by a simple integer and the path taken by an edge (directed line).

Here I have shown a simplified version of a project I worked on, quickly generating a digraph that can be sent or uploaded to a site for internal business users at regular intervals, in near real time. A few assumptions are made to help explain the main idea: each visit to a node is timestamped, and a user can only traverse forward (meaning a visit to a node number higher than the current one). To start with, the data is summarized in a table with a unique identifier for each user or device, the node visited, and the time of the visit.

Order the data chronologically for each unique id so that, by doing a self-join in SQL (see the code snippet below), we can simply find the next node the user visited.

SELECT T1.node_id current_node, T2.node_id next_node, count(1) ct
FROM table T1 JOIN table T2 ON T1.row_id+1 = T2.row_id
WHERE T1.unique_id = T2.unique_id # say, user_id
GROUP BY T1.node_id, T2.node_id

You can also assign a weight/cost to each edge by normalizing the count, which results in a set of rows like
      current_node, next_node, count, count/max_count

This is all the data we need to generate the input file for the 'dot' application. Write a program that takes the above input and dumps it into a file with content like the following (a small sketch of such a program appears after the next paragraph) -

Digraph G {
   # splines=false;
   rankdir=LR

   2 ->3 [penwidth=4.9, label="1190"]
   3 -> 4 [penwidth=4.9, label="1150"]
   4 -> 5 [penwidth=4.8]
   5 -> 6 [penwidth=4.8]
   6 -> 7 [penwidth=4.8]
   7 -> 8 [penwidth=4.8]
   …
   …
}

By providing this as input, you can generate output in multiple formats including pdf, ps, etc.; see graph 1 below.  You can provide more parameters in the file to fancy up the graph or add more information, like the drop-off (% of users dropped) between states; see graph 2. In essence, you are generating a digraph to visualize the data in a more meaningful way.
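As promised, a minimal sketch of such a generator in R (the same idea works in any scripting language), assuming a data frame edges with columns current_node, next_node and ct from the SQL above, and an assumed output file name:

edges$w <- round(1 + 4 * edges$ct / max(edges$ct), 1)      # normalized pen width from count/max_count
lines <- sprintf('   %s -> %s [penwidth=%s, label="%s"]',
                 edges$current_node, edges$next_node, edges$w, edges$ct)
writeLines(c("Digraph G {", "   rankdir=LR", lines, "}"), "user_paths.dot")
# then render it, e.g.:  dot -Tpdf user_paths.dot -o user_paths.pdf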

Digraph 1 - with sample weights between states 2 & 3, 3 & 4



With dot input file content like

  subgraph cluster1 {
    rankdir="LR"
    2 -> 3  [penwidth="23", label="23.48%",  taillabel="4450",  headlabel="3405"]
    3 -> 4  [penwidth="25", label="24.9%",  taillabel="3405",  headlabel="2557"]
    4 -> 5  [penwidth="18", label="18.34%",  taillabel="2557",  headlabel="2088"]
    5 -> 6  [penwidth="19", label="19.3%",  taillabel="2088",  headlabel="1685"]
    6 -> 7  [penwidth="20", label="20.18%",  taillabel="1685",  headlabel="1345"]
    7 -> 8  [penwidth="26", label="26.47%",  taillabel="1345",  headlabel="989"]
    8 -> 9  [penwidth="35", label="35.29%",  taillabel="989",  headlabel="640"]
    9 -> 10  [penwidth="39", label="38.59%",  taillabel="640",  headlabel="393"]
    10 -> 11 [penwidth="36", label="35.88%",  taillabel="393",  headlabel="252"]
  }


Digraph 2 - with user drop-off between states in %



Joins between tables of different charset

Not too long ago I got bitten by a slow-performing MySQL query even though I had the right indices (in fact a unique index) and the tables were not big. A query similar to the one below was taking more than 20 mins, and I ended up killing it.

The issue was that one table was storing data in "latin1" while the other was in "utf8".

select count(1)
from table_1 T1
left outer join table_2 T2
  on T1.id = T2.id
where T2.id is null

Note: the "id"s here are varchars, not integers.

T1 had around 100K rows while T2 had around 800K, joined on MySQL 5.1 on a quad-core machine with 16G RAM.  I expected a much quicker response, yet on running the explain plan I could see the query doing a full table scan!  At first I wasn't sure why this was happening, but after digging a little I found that one table was defined with "charset=latin1" and the other with "charset=utf8".

MySQL was converting latin1 to utf8 and then performing the join, which ended up as a full table scan. On converting T2 to utf8 with collation utf8_unicode_ci, the query time dropped to 10 secs or so!

With that said, by default we now create all tables with charset=utf8.

Life with Regular Expressions!

Damn this is good!
(Credit:  xkcd.com)


Timezone conversion for hourly report

We recently built a real-time hourly report where data is streamed in, summarized through multiple jobs, and pushed into reporting tables.  The business users wanted the reports as quickly as possible and the first phase of the project needed to be implemented in a short time.  With very limited time from concept to design and implementation, I utilized a 3rd-party web charting service (www.chart.io) and pushed the highly summarized data to a set of tables.

For the final reporting, I needed to convert the date and hour from UTC to PST.  The table had date (date type) and hour (tinyint) columns.  Chart.io uses simple x, y charts (line, bar, etc.), taking whatever coordinates are passed and displaying them.  See below for one of the final charts.


The X axis is date and hour with format 'YYYY-MM-DD HH'.  The following SQL converts the date and hour from the corresponding 2 columns (formats YYYY-MM-DD and HH respectively) in the UTC timezone to PST.

SELECT substr(convert_tz(concat(date, ' ', if(length(hour) = 1, concat('0',hour), hour)), 'GMT', 'America/Los_Angeles'), 1, 13) as x,
...
FROM
WHERE ...
GROUP BY x...
ORDER BY x ASC

Maintaining the order is important: for example, hour '2011-11-11 02' comes right after '2011-11-11 01' and before '2011-11-11 03', so single-digit hours (1 through 9) are prefixed with '0'.  Then convert_tz converts the GMT time to PST (America/Los_Angeles) time.  There was a suggestion to simply hardcode a -8:00 timezone offset, but that would produce an incorrect hour during the daylight saving (summer) months.   Since data is stored for each hour in tables spanning many months and years, with the above technique one doesn't need to worry about this issue; the system will automatically and correctly offset the hours in the report.  Handling timezone conversion is not trivial; see the timezone wiki for the different timezones around the world.  For timezone conversion to work in MySQL, one should load the appropriate data into the mysql timezone tables like this.

The reports/charts generated were very useful to the business users and helped them make some quick, decisive decisions, build a better user experience, and improve sales and revenue!