Friday, August 31, 2012

Calculating Pi to 5 Digits

Yesterday I was reading 20 controversial programming opinions, which has some interesting debates going on in it. I was particularly drawn to the following programming question that one responder likes to ask potential new hires in job interviews:

Given that Pi can be estimated using the function 4 * (1 – 1/3 + 1/5 – 1/7 + …) with more terms giving greater accuracy, write a function that calculates Pi to an accuracy of 5 decimal places.
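
To get a feel for the series, the first few estimates are 4, 2.6667, 3.4667, 2.8952, 3.3397 and so on, closing in on Pi from alternate sides; that turns out to matter later when deciding when to stop.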

The first part of this is easy, depending on the language you use.  In Java it's something simple like:


float Pi=(float)1.0;
int mult=-1;
for (int dem=3; dem<1000; dem=dem+2){
    float Add=(float)(1.0/(float)dem);
    Pi=Pi+(float)(mult*Add);
    System.out.println("Dem "+dem+"  "+Add+"  Pi "+4.0*Pi);
    mult=-1*mult;
}



The problem is getting the answer to 5 decimal places of accuracy. How can we know it's accurate unless we already know the value of Pi? My solution (which does have its problems) is to iterate until the calculated value of Pi stops changing at the accuracy we require. In the examples below I've used string formatting to compare the old value and the new value until they are the same. Both versions give the same answer: 3.14159.

In Java

import java.text.*;
public class Pi {
   public static void main(String[] args) {
      double Pi=(double)1.0;
      //http://www.javaprogrammingforums.com/java-programming-tutorials/297-how-format-double-value-2-decimal-places.html
      DecimalFormat df = new DecimalFormat("#.#####");
      String oldPi;
      String newPi;
      long dem=3;
      oldPi =df.format((double)4.0);
      newPi =df.format(Pi);
      int mult=-1;
      while (oldPi.compareTo(newPi)!=0){
        oldPi=df.format((double)4.0*Pi);
        double Add=(double)(1.0/(double)dem);
        Pi=Pi+(double)(mult*Add);
        newPi=df.format((double)4.0*Pi);
        System.out.println("Dem "+dem+"  "+Add+ "  Pi "+df.format((double)4.0*Pi)+ "  Pi "+4.0*Pi+" : "+oldPi+" : "+newPi);
        mult=-1*mult;
        dem+=2;
      }
   }
}

And in C
#include <stdio.h>
#include <string.h>
int main(){
   double Pi=(double)1.0;
   char oldPi[100];
   char newPi[100];
   long dem=3;
   sprintf(oldPi,"%.5f",(double)4.0);
   sprintf(newPi,"%.5f",(double)Pi);
   int mult=-1;
   while (strcmp(oldPi,newPi)!=0){
      sprintf(oldPi,"%.5f",(double)4.0*Pi);
      double Add=(double)(1.0/(double)dem);
      Pi=Pi+(double)(mult*Add);
      sprintf(newPi,"%.5f",(double)4.0*Pi);
      printf(" %ld Pi %.5f %s %s \n",dem,4.0*Pi,oldPi,newPi);
      mult=-1*mult;
      dem+=2;
   }
}
The problem with both of these versions is that they don't work if you try to increase the accuracy. If you want 6 decimal places the value doesn't settle down; it oscillates between two values and never stays the same.

So two questions:


  1. What does the code look like in other languages (particularly something like Erlang)?
  2. How do we deal with the oscillation problem? (One possible approach is sketched below.)
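
My current thinking on question 2: the partial sums approach Pi alternately from above and below, and Pi to 6 places (3.141593) sits very close to a rounding boundary, so the two formatted strings keep disagreeing for an awfully long time. Because consecutive partial sums always bracket Pi, one possible fix (a sketch only, with my own class name, and not something I've run on the Pi) is to take the midpoint of the last two sums and stop once half the bracket width is safely below the precision we want:

public class Pi6 {
   public static void main(String[] args) {
      int places = 6;                            // decimal places wanted
      double tol = Math.pow(10, -(places + 2));  // a couple of digits of headroom
      double sum = 1.0;                          // partial sum of 1 - 1/3 + 1/5 - ...
      double prev = sum;
      int mult = -1;
      long dem = 3;
      double errBound;
      do {
         prev = sum;
         sum += (double) mult / dem;             // add the next term of the series
         errBound = 2.0 / dem;                   // 4 * (half the bracket width)
         mult = -mult;
         dem += 2;
      } while (errBound > tol);
      double pi = 4.0 * (sum + prev) / 2.0;      // midpoint of the last bracket
      System.out.printf("%." + places + "f%n", pi);
   }
}

This should print 3.141593, at the cost of brute-forcing something like a hundred million terms; a proper series acceleration would get there far faster, but at least the stopping rule no longer depends on two rounded strings happening to agree.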

Tuesday, July 24, 2012

Cassandra on a Raspberry Pi, 5 and 6 node insert stress tests:

Here's a quick update to the performance graphs for Cassandra on the Raspberry Pi, this time showing the results for 5 and 6 node inserts on a stress test.



I'm getting to the point in this project where I can start to build pseudo data centers and test performance there.

Thursday, July 19, 2012

Java performance on Raspbian vs Debian

Over the past few weeks I’ve been blogging about my experience of running Apache Cassandra on the Raspberry Pi.  I plan to use the Pi as an educational resource at the university I work in, hopefully giving students the chance to play with large clusters and experiment with configurations, database models and practices in a nosql environment.  Of course performance isn’t great, but for me it’s a cheap way of getting lots of nodes and tackling real network configuration problems.

A couple of days ago a new Debian based distro for the Pi called Raspbian “wheezy” was released, and is now the official Raspberry Pi Debian distro (I believe).  This is the first OS release for the Pi to take advantage of the Pi’s floating point hardware, which should make the OS a lot faster for general use.  I downloaded it for testing in my rig; sadly, this is a tale of woe.

Apache Cassandra is a Java application and needs a JRE in order to run.  I’ve always used an Oracle supplied JVM, “Oracle’s Java SE for Embedded”:

http://www.oracle.com/technetwork/java/embedded/downloads/javase/index.html


Sadly, it seems this can’t be used on Raspbian.  Trying to run it gives:

Java: error while loading shared libraries: libjli.so: cannot open shared object file: No such file or directory

It seems that this version of Java uses the soft float ABI (armel), “which is incompatible with Raspbian” (thanks to mpthompson on the Raspberry Pi forum for the information), so it’s looking like it can’t run.  Back to OpenJDK?

But wait !  Why did I not use openjdk in the first place ?

That’s simple: performance.  In my experience (and perhaps this is a configuration problem I’m not aware of) OpenJDK is a lot slower than the Oracle version.  And I mean a lot slower!  I set up two single node Cassandra server images, one with the old Debian image and Oracle Java, the other with Raspbian and OpenJDK.  I then ran stress tests from an Apple MacBook Air (something I’ve done many times!).  Here are the results.  The second column is interval_op_rate, which you want to be as high as possible; the third column is avg_latency, which you want to be as low as possible.

Raspbian and OpenJDK


>Lifeintheairage:bin Administrator$ ./stress -d 192.168.1.12 -o insert -I DeflateCompressor
Unable to create stress keyspace: Keyspace names must be case-insensitively unique ("Keyspace1" conflicts with "Keyspace1")
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
485,48,48,0.8705896907216495,10
1042,55,55,0.9123070017953321,20
1436,39,39,1.2947030456852793,30
2010,57,57,0.9009128919860627,40
2510,50,50,0.961294,51
2743,23,23,1.922206008583691,61
3306,56,56,1.0665861456483126,71
3863,55,55,0.9055601436265709,81
4118,25,25,2.0272901960784315,91
4659,54,54,0.9333364140480591,102
5031,37,37,0.916733870967742,112
5498,46,46,1.480710920770878,122

 

Debian Squeeze and Java SE for Embedded


>lifeintheairage:bin Administrator$ ./stress -d 192.168.1.10 -o insert -I DeflateCompressor
Unable to create stress keyspace: Keyspace names must be case-insensitively unique ("Keyspace1" conflicts with "Keyspace1")
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
2565,256,256,0.18891695906432748,10
4604,203,203,0.2503182932810201,20
7093,248,248,0.20536078746484532,30
9289,219,219,0.23249635701275045,40
11516,222,222,0.22830354737314773,51
14107,259,259,0.19691161713624084,61
16297,219,219,0.22646849315068493,71
18092,179,179,0.29083064066852365,81
19756,166,166,0.30374939903846154,91
21689,193,193,0.2648339368856699,102
23404,171,171,0.18766355685131195,112
25395,199,199,0.3459779005524862,122
27646,225,225,0.2330964015992892,132
29684,203,203,0.24136898920510305,142


And a graph of interval_op_rate:



(red is Java SE for embedded, blue is OpenJDK)

Java SE for Embedded really is a lot faster for Apache Cassandra (and I wouldn’t be surprised if the same holds for other Java apps such as the Arduino IDE).  For now I need to stick with the Debian release; I hope it doesn’t become unsupported.  Hopefully someone can get in touch with Oracle and encourage them to produce an official port of Java SE for Embedded for the Raspberry Pi that supports the correct Raspbian libraries.

Saturday, June 16, 2012

3 Node / 4 Node Cassandra Stress test on a Raspberry Pi cluster




One of the things I’m interested in is using tiny Raspberry Pi computers for teaching database and network admin to undergraduate and MSc students.  In the first instance I’ve been looking at building a large cluster of these devices to run a cluster of Apache Cassandra database servers.  I’m in no way expecting these to get anywhere near the performance of real servers or even VM installations but, for me at least, they give a feeling of working with real hardware.  The first thing I’m doing is conducting stress tests with various configurations, but I’m limited by availability of the devices.  I started out with a cluster of 3 and have just managed to add another node.  The stress test uses the stress command Cassandra provides in the tools directory of a standard installation (some distributions miss out the directory, so you may need to get the source and build the stress tool yourself).  After we’ve looked at the chart, I’ll look a little at the process of adding a new node to a Cassandra cluster.  For the record, the commands I used to stress the cluster are as follows:

 Insert:
./stress -d 192.168.1.10,192.168.1.11,192.168.1.12 -o insert -I DeflateCompressor

Read:

./stress -d 192.168.1.10,192.168.1.11,192.168.1.12 -o read

For the 4 node test I added the new node into the list of hosts (the command is shown below).  Note also that I’m using DeflateCompressor, as I’ve not yet managed to get the Snappy compressor compiled for the Pi.  I used a MacBook Air to drive the stress test over a wifi connection to the cluster, which is connected via a Netgear 10Meg switch that should handle the data rates from a Pi.
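
For the 4 node insert that would be something like:

./stress -d 192.168.1.10,192.168.1.11,192.168.1.12,192.168.1.13 -o insert -I DeflateCompressor

(192.168.1.13 being the new node, as in the 4 node ring output further down.)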

Here then  is a graph combining inserts and reads for 3 and 4 node clusters:




One thing I do want to note here: for both the 3 and 4 node clusters the insert performance drops suddenly towards the end of the run.  I’m not sure why that happens.  The clusters were in both cases balanced, with each node running at 90% CPU.  Here’s the ring information for the two cluster arrangements (obtained from the nodetool command ./nodetool -h 192.168.1.10 ring):

Address         DC          Rack        Status State   Load            Effective-Owership  Token                                       
                                                                                           113427455640312821154458202477256070485     
192.168.1.11    datacenter1 rack1       Up     Normal  14.67 MB        33.33%              0                                           
192.168.1.10    datacenter1 rack1       Up     Normal  14.42 MB        33.33%              56713727820156410577229101238628035242      
192.168.1.12    datacenter1 rack1       Up     Normal  14.51 MB        33.33%              113427455640312821154458202477256070485     


pi@raspberrypi:/home/space/apache-cassandra-1.1.0/bin$ ./nodetool -h 192.168.1.12 ring
Address         DC          Rack        Status State   Load            Effective-Owership  Token                                       
                                                                                           127605887595351923798765477786913079296     
192.168.1.11    datacenter1 rack1       Up     Normal  11.24 MB        25.00%              0                                           
192.168.1.10    datacenter1 rack1       Up     Normal  11.24 MB        25.00%              42535295865117307932921825928971026432      
192.168.1.12    datacenter1 rack1       Up     Normal  11.38 MB        25.00%              85070591730234615865843651857942052864      
192.168.1.13    datacenter1 rack1       Up     Normal  11.1 MB         25.00%              127605887595351923798765477786913079296     

Moving from 3 to 4 nodes.

Here’s the procedure I used to move from 3 to 4 nodes. Providing your cluster is already balanced, with initial_token correctly set in the cassandra.yaml file, you can add the new node with its correct token.  Once it’s bootstrapped, on each of the other nodes you can use nodetool move to change that node’s token, something like:

sudo ./nodetool -h 192.168.1.10 move 42535295865117307932921825928971026432

Do this on each node that needs to be moved, i.e. not the first node (with a token of 0) and not the new node you've just added (already at its correct initial token).  After a node is moved you will need to run cleanup to delete any data that the node no longer needs:

./nodetool -h 192.168.1.10 cleanup

There’s a simple Python script you can use to calculate the tokens (this version courtesy of a good friend on twitter):

import sys
if (len(sys.argv) > 1):
   num = int(sys.argv[1])
else:
   num = int(raw_input("How many nodes? :"))
for i in range(0,num):
   print 'node %d: %d' % (i, (i*(2**127)/num))
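
If that script is saved as, say, tokens.py (the file name is mine), running it for a 4 node cluster reproduces exactly the tokens in the 4 node ring output above:

python tokens.py 4
node 0: 0
node 1: 42535295865117307932921825928971026432
node 2: 85070591730234615865843651857942052864
node 3: 127605887595351923798765477786913079296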

I’m looking forward to going beyond 4 nodes soon !

Getting more memory on the Pi

The Pi is a little short on memory for this type of server.  The situation isn’t helped by some of the memory being shared with the GPU, the default being 64M.  You can move this down to 32M by changing the start.elf file:

Change to /boot  on the pi
Copy start.elf to start.elf.old  (sudo cp start.elf start.elf.old)
Copy arm224_start.elf to start.elf (sudo cp arm224_start.elf start.elf)

Reboot.  You can use the top command to see the performance of your Pi and how much memory it has.   See http://elinux.org/RPi_Advanced_Setup for more information on the elf files available and how much memory the GPU uses for each.

A Pic of the setup
Just for completeness, here's a pic of 4 Raspberry Pis running Apache Cassandra.



Thursday, June 7, 2012

Raspberry Pi, not just for teaching programming


There’s quite rightly been a lot of talk about the Raspberry Pi, and quite some discussion on what it’s good for.  Whether it will succeed in its mission to create a new army of programmers is anyone’s guess at this point, but for me it’s already succeeding.  No, not for programming, but for teaching computer administration.  I’ve got a small cluster of Pis (three, two borrowed at the moment) and I’ve been having a lot of fun configuring Apache Cassandra on them.  So for less than £100 I’ve got a Linux cluster I can blow away at any moment and start again.  I can reconfigure the Cassandra settings, start the cluster again and run stress tests on the thing.


I take backups of the SD cards once in a while so I can go back to previous configs at any point, which is quite easy.  On a Mac (or Linux) just put the card in a card reader and use the following command:

dd bs=1m if=/dev/rdisk1 of=disk.img

where /dev/rdisk1 is the device for the card reader (the same one you will have identified when creating your first image) and disk.img is the file you want to create.
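
Restoring an image is just the same command with if and of swapped (make sure the card is unmounted first), something like:

dd bs=1m if=disk.img of=/dev/rdisk1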

As a teacher, having cheap hardware like this around is going to allow us to give students machines to play on, to set up and to muck around with, with little or no chance of damage.  Our undergraduate networking course is going to get a whole lot more hands on!  Sure, you could do it with VMs, but that just won’t feel as real as plugging a cluster together.  Once we have our Data Science MSc up and running (News here) I’m hoping that we can give the students access to a 2 data center cluster of 20 to 30 machines, all for around 100.  Doing that with VMs is possible but a lot of work (although you can of course automate it) and would probably cost a lot more to set up!

Looks like the Pi can do a lot more than teach programming.

Friday, May 25, 2012

Cassandra Stress test on a Raspberry Pi


So here are my initial results from using a Raspberry Pi to run Cassandra.  At the moment I’m running a single Pi with Cassandra 0.8.10 (compression is not available and Snappy will not currently run on the Pi).  I’m using the Java stress tests that come with the Cassandra source.  These tests were run with the stress test classes running on the same Pi as the Cassandra instance.  To be honest it’s not looking great, but I’m looking at ways of getting faster IO and tuning:

Pi with a class 10 SD card

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
289,28,28,0.5914359861591696,11
1266,97,97,0.13869907881269192,22
2056,79,79,0.5817556962025316,32
3744,168,168,0.31356575829383887,43
5737,199,199,0.20355143000501758,54
7703,196,196,0.179558494404883,64
9514,181,181,0.2858961899503037,74
11305,179,179,0.1481675041876047,85

As you can see this ramps up to an interval_op_rate of nearly 200.  Using an external HD on USB is actually slightly worse (extract from later in the run):

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
889736,133,133,0.263335837716003,7116
890287,55,55,0.5460671506352087,7126
891445,115,115,0.37030224525043176,7136
892668,122,122,0.4226631234668847,7147
894149,148,148,0.23508845374746792,7159
895317,116,116,0.3649169520547945,7170
896475,115,115,0.23375302245250432,7180
897574,109,109,0.47590354868061874,7190
898558,98,98,0.41302845528455284,7201
899626,106,106,0.39788951310861426,7211
900910,128,128,0.21852570093457943,7221
902078,116,116,0.41339640410958906,7232

For comparison, here are the results from my new Apple MacBook Air with an SSD:

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
30581,3058,3058,0.013996631895621465,10
137194,10661,10661,0.003453800193222215,20
277241,14004,14004,0.0024328332631188103,30
408411,13117,13117,0.0025887245559197986,40
535147,12673,12673,0.0027360339603585406,51
662768,12762,12762,0.0026654704163107954,61
792233,12946,12946,0.0026509326845093268,71
919061,12682,12682,0.002678690825369792,81
1000000,8093,8093,0.0026410383128034694,88
END

Monday, May 21, 2012

Snappy Compression fails for Apache Cassandra on a Raspberry Pi

Although I’ve managed to get Apache Cassandra running on a Raspberry Pi, I’ve been struggling to make any use of it.  I’ve been using the latest build of Cassandra (1.1.0), and whenever I’ve tried creating a column family I’ve been getting the following error:

SnappyCompressor.create() threw an error: org.xerial.snappy.SnappyError [FAILED_TO_LOAD_NATIVE_LIBRARY] null

As you are aware, compression on column families was introduced in Cassandra 1.0 for space saving and increased disk IO (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression).  By default the compression is Snappy, using the snappy-java library (http://code.google.com/p/snappy-java/), which is a wrapper for Google Snappy (http://code.google.com/p/snappy/), written in C++.  There is no port of Google Snappy for the Pi, and my initial attempts to build one have not been successful (autogen.sh is failing with undefined macro AC_DEFINE).

You should be able to create a column family with DeflateCompressor from  cqlsh as follows:

create TABLE users (KEY varchar Primary key, password varchar, gender varchar) WITH compression_parameters:sstable_compressor = 'DeflateCompressor';

However that’s failing with the same SnappyCompressor error.  Describing the keyspace with cassandra-cli gives the following:
Keyspace: test:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:1]
  Column Families:
    ColumnFamily: users
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.1
      DC Local Read repair chance: 0.0
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: default
      Built indexes: []
      Column Metadata:
        Column Name: password
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
        Column Name: gender
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        sstable_compressor: DeflateCompressor
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Note the two compressors in action.  I’m not quite sure what’s going on, or whether this is the correct result, but I’m continuing to investigate (any answers gratefully received).  So for now it’s back to Apache Cassandra 0.8.10, which doesn’t have compression and seems to work on the Pi (once the env file has been changed).

Update

Thanks to Jonathan Ellis for pointing out the command should be:

Create TABLE users2 (KEY varchar Primary key, password varchar, gender varchar) WITH compression_parameters:sstable_compression = 'DeflateCompressor';

Thursday, May 17, 2012

Starting Cassandra on Raspberry Pi restart



(may be useful for other Raspberry Pi services)



A couple of days ago I blogged about getting Cassandra running on a Raspberry Pi, a fairly straightforward procedure!  However, the way I had it set up required Cassandra to be started manually each time the Pi was turned on, which is not ideal.  We really want Cassandra to start each time the Pi is started and to run as a service in the background.  Services on Debian (which I am using) are defined in /etc/init.d and are managed by the update-rc.d command.  So, we need a script to put into /etc/init.d that will be run when the Pi restarts.  There is an example script at http://www.jansipke.nl/centos-cassandra-init-start-stop-script/, but for the Pi this has a number of problems.


The first problem is that the script uses the daemon command to run the Cassandra script.  This command does not exist under Debian; instead we will need to use the start-stop-daemon command.  Instead of:


daemon $CASSANDRA_BIN -p $CASSANDRA_PID >> $CASSANDRA_LOG 2>&1


we will use:


start-stop-daemon --start  --pidfile $CASSANDRA_PID --startas $CASSANDRA_BIN -p $CASSANDRA_PID  >> $CASSANDRA_LOG 2>&1


The second problem is that usleep doesn’t exist under this Debian; use “sleep 0.500000” instead.  Finally, we will need to add the Java path and JAVA_HOME to the start of the file:


PATH=$PATH:/usr/local/bin/java
JAVA_HOME=/usr/local/bin/java


The full file is available on github here: https://github.com/acobley/CassandraStartup.  Remember you will need to change the variables that point to the locations of Cassandra, its PID file and its lock file.  Once the file is copied to /etc/init.d (remember to make it executable: “sudo chmod +x Cassandra”) you can use update-rc.d to add it to all the /etc/rc?.d files:


update-rc.d  cassandra defaults
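
Should you ever want to take it out of the runlevels again, the standard way is:

update-rc.d -f cassandra remove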


No doubt this procedure and the Cassandra file can be improved !


For more info on the update-rc.d file see: http://www.debuntu.org/how-to-manage-services-with-update-rc.d


Thursday, May 10, 2012

Apache Cassandra on a Raspberry Pi


One of the reasons I got hold of a Raspberry Pi (the $35 ARM based Linux machine) was to play around with building a cluster of them for handling "Big Data".  This is a real exercise in tinkering, very much a “what if” scenario.  The first thing I wanted to play with was getting Apache Cassandra to run on the Pi.  Of course Cassandra is built with Java, and there is no Java on the Pi out of the box.  Several people suggested building OpenJDK (http://openjdk.java.net/) but I plumped for Oracle’s Java SE for Embedded, available here: http://www.oracle.com/technetwork/java/embedded/downloads/javase/index.html


Download Java SE for Embedded 7 (ARMv6/7 Linux - Headless) and install it on the Pi.  Once done, and with the PATH and JAVA_HOME correctly set, you should be able to run Java on your Pi.
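
(The paths I used crop up again in the startup script post above: PATH=$PATH:/usr/local/bin/java and JAVA_HOME=/usr/local/bin/java.)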

Next get hold of a version of Apache Cassandra (http://cassandra.apache.org/); I was using version 1.1.0.  Install it as usual.  If you try to run Cassandra from the bin directory it will fail to start with the error:

“Invalid initial eden size: -Xmn0M”

The problem is that cassandra-env.sh is trying to work out the heap sizes by taking a sensible per-core figure and multiplying it by the number of processors (line 69):

max_sensible_yg_in_mb=`expr $max_sensible_yg_per_core_in_mb "*" $system_cpu_cores`

The number of system CPU cores (line 22 or thereabouts):

system_cpu_cores=`egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo`

is failing on the Pi (there is no matching processor line in /proc/cpuinfo).  For now I’ve manually altered cassandra-env.sh to report only one core:

system_cpu_cores=1

Cassandra now runs on the Pi.  There must be a better way of doing this (my Linux programming is failing me for the moment), but for now I’m just tinkering so it will do.

Next up: try to get some performance figures for the Pi running Cassandra.

Update

The correct way to fix this is to change the Linux section of cassandra-env.sh to:

            system_memory_in_mb=`free -m | awk '/Mem:/ {print $2}'`
            system_cpu_cores=`egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo`
            # fall back to a single core if egrep finds no matching lines (as happens on the Pi)
            if [ "$system_cpu_cores" -lt "1" ]
            then
               system_cpu_cores="1"
            fi
            echo "Linux"
            echo "memory" $system_memory_in_mb
            echo "cores" $system_cpu_cores