Wednesday, March 19, 2014

Saving an image in Cassandra BLOB field

We had occasion today to store images in a blob field of a Cassandra table.  More to the point, I needed to extract an image and send it from a Java servlet to a web browser.  The code for storing the image is quite easy, but there is a small gotcha when retrieving it.  So, suppose we have a table that looks something like:

String CreateTweetTable = "CREATE TABLE if not exists Messages ("+
                "user varchar,"+
                " interaction_time timeuuid,"+
                " tweet varchar,"+
                " image blob," +
                " imagelength int,"+
                " PRIMARY KEY (user,interaction_time)"+
                ") WITH CLUSTERING ORDER BY (interaction_time DESC);";

Our image will be in the blob field and we also store the size of the image for reference. We can load a picture from a file on the local machine's hard disk like this:

// Files.readAllBytes (java.nio.file, Java 7+) reads the whole file, so the
// array is exactly the size of the image with no spare byte at the end
byte[] b = Files.readAllBytes(Paths.get("/Users/Administrator/Desktop/mystery.png"));
int length = b.length;

We now need to wrap the byte array in a ByteBuffer:

ByteBuffer buffer =ByteBuffer.wrap(b);

Writing the record becomes simply:

 PreparedStatement ps = session.prepare("insert into Messages ( image, user, interaction_time,imagelength) values(?,?,?,?)");
BoundStatement boundStatement = new BoundStatement(ps);
session.execute(  boundStatement.bind( buffer, "Andy",  convertor.getTimeUUID(),length));


Getting the image back is simple.  Use a Select to get the result set:

PreparedStatement ps = session.prepare("select user,image,imagelength from Messages where user =?");
BoundStatement boundStatement = new BoundStatement(ps);
ResultSet rs =session.execute ( boundStatement.bind("Andy"));

We can now loop through the result set (here we are assuming only one image comes back)

ByteBuffer bImage=null;
for (Row row : rs) {
 bImage = row.getBytes("image") ;
 length=row.getInt("imagelength");
}

However, to display the image we will need it as a byte array.  We can’t simply use bImage.array(), as this reaches down into the raw backing buffer, which may hold more than just our blob (see: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/blob$20ByteBuffer/java-driver-user/4_KegVX0teo/2OOZ8YOwtBcJ for details).  Instead we can use:

// Bytes is com.datastax.driver.core.utils.Bytes; getArray copies out just the blob
byte[] image = Bytes.getArray(bImage);
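
To see why the raw backing array is not safe to use, here is a minimal sketch using only java.nio (no Cassandra driver needed). The class and method names are made up for illustration. It simulates a blob that sits inside a larger array, as can happen when several values share one buffer, and copies out only the bytes between the buffer's position and limit. I assume this is roughly what Bytes.getArray does for us, but I have not checked the driver source.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the DataStax driver.
class BlobBytes {

    // Copy only the readable bytes (position..limit) out of the buffer.
    // duplicate() gives an independent position, so the caller's buffer is untouched.
    static byte[] readable(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        // Simulate a frame holding other data before and after the blob.
        byte[] frame = "headerIMAGEtrailer".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer blob = ByteBuffer.wrap(frame, 6, 5).slice();

        System.out.println(blob.array().length);    // length of the whole backing array
        System.out.println(readable(blob).length);  // length of just the blob
        System.out.println(new String(readable(blob), StandardCharsets.US_ASCII));
    }
}
```

The point is that array() hands back everything the buffer is sitting on, while the position and limit mark out the part that is actually ours.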

In the servlet we can return this image in one of 2 ways:

OutputStream out = response.getOutputStream();
response.setContentType("image/png");
response.setContentLength(image.length);
out.write(image);

This writes the image as a single lump, which may use too much memory.  You might be better off using a BufferedInputStream (see http://stackoverflow.com/questions/2979758/writing-image-to-servlet-response-with-best-performance):

InputStream is = new ByteArrayInputStream(image);
BufferedInputStream input = new BufferedInputStream(is);
byte[] chunk = new byte[8192];
// n is the number of bytes read on each pass; the loop ends when nothing is left
for (int n; (n = input.read(chunk)) > 0;) {
    out.write(chunk, 0, n);
}
out.close();
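
The chunked loop can be pulled out into a small helper so it is easy to try outside a servlet. This is only a sketch: ChunkedCopy and copy are names I have invented, and in the servlet the output stream would be response.getOutputStream() rather than the in-memory stream used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; not part of the servlet API.
class ChunkedCopy {

    // Copy everything from in to out in fixed-size chunks and return the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] image = new byte[20000];  // stand-in for the image bytes
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = copy(new ByteArrayInputStream(image), sink);
        System.out.println(written + " bytes copied");
    }
}
```

Chunking pays off most when the source is itself a stream; when the whole image is already in a byte array, as it is here, the saving is smaller.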

Wednesday, January 22, 2014

Running Cassandra 2.x.x on Windows 7 and 8

This blog post describes how to get the Cassandra 2.x.x family running on a Windows machine.  It's clear that Cassandra should not be run for production on Windows (except perhaps on Azure), but if you're a student learning to use C* it may well be you have no choice but to run it on Windows 7 or 8 on your laptop.  Let's get started:

Install JRE 7

Open a command prompt and type java -version to see if Java is installed properly.  If not, find a JRE from Oracle and install it.  Make sure it's at least version 7 (version 6 will not work).

You'll need to set JAVA_HOME. Find the control panel (on Windows 8 search for it).  Go to "System and Security" and then "System". Click on "Advanced Settings" and then the "Environment Variables" button. Click on the New button and in the Variable name box type JAVA_HOME.  Under the value you'll need to put in the path to the Java you are using.  Mine is

c:\program files\java\jre7

but yours may be different, especially if you have a JDK.  If you are going to program Java clients for C* you will need a JDK, but that's a different post.

 

Install Cassandra

Download Cassandra from http://cassandra.apache.org/, probably a file like
apache-cassandra-2.0.4-bin.tar.gz
You'll need to unpack this file, and how you do that will depend on which flavor of Windows you have.  At this point I'll assume you have a legal copy of WinZip or similar.  Unpack the downloaded file to the root of the c: or d: drive on your machine.

In your command prompt, change to the Cassandra install directory and then to its bin directory, and type cassandra to start it.  The window will print a lot of information, but you are looking for a line like:

 INFO 19:00:31,031 Listening for thrift clients...

to make sure it's working.

CQLSH

So now we have C* running, we need to check we can connect to it.  Start by opening another command prompt and typing cqlsh. Sadly it won't start: cqlsh now needs an installation of Python, so let's get one installed. Go to http://www.python.org/, then to downloads and "individual release". Click on the 2.x stable release and then scroll down to the download section.  You're looking for the Windows MSI installer.  I used:

http://www.python.org/ftp/python/2.7.6/python-2.7.6.msi

Download it and run it to install Python. You'll need version 2 of Python; NOTE THIS WELL, version 3 will not work! This installs a nice Windows version of Python but does not add the executable to your path, so you'll need to set it by hand.  Once again, find the control panel (on Windows 8 search for it).  Go to "System and Security" and then "System". Click on "Advanced Settings" and then the "Environment Variables" button.  Under the system variables find PATH.  Highlight it and click Edit.

Careful!  We don't want to wipe the current contents (if you do, hit Cancel).  Go to the end of the current path and enter

;c:\python27

Note the ; at the beginning.  Again this will depend on the version of Python you've installed and should mirror the path to your Python installation. Click OK to close the dialog boxes and open a command prompt again.

Now you should be able to change to the cassandra directory and then the bin directory and type cqlsh.  With luck you should get the
cassandra cqlsh prompt:

Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.0 | Cassandra 2.0.4 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>

Type "use system;" followed by "describe keyspaces;" and cqlsh should reply:

system  system_traces

You're now connected and ready to start work.

BTW, folks at datastax and apache Cassandra, why is this so hard ?  Would Datastax Devcenter work easier ?

Setting up a Cassandra cluster on Windows with a vagrant virtualbox

Setting up Cassandra on Windows can now be a pain with all its dependencies, but it's something I'll cover in a later post.  One simpler way is to get C* running in a VirtualBox machine and perhaps even run it as a mini cluster.  This can be helped a lot by using Vagrant, but even that isn't quite straightforward.

The following has worked for me and is based heavily on the work done by calebgroom and his github contribution vagrant-cassandra.  I've altered it a bit for use with the latest C*, which adds virtual nodes etc.  Using these instructions you should be able to provision a 3 node C* cluster with vnodes.

1: Install Oracle VM VirtualBox from https://www.virtualbox.org/ (the latest version should do).
2: Install git for Windows from http://msysgit.github.io/ and ensure you select the option to run git from the command line.
3: Install Ruby for Windows V2.x.x from http://rubyinstaller.org/ and select all options.
4: Download the DevKit, DevKit-mingw64-32-4.7.2-20130224-1151-sfx.exe
4.1: Extract it to a permanent location
4.2: Start "commandline" with ruby from
4.3: Change to the DevKit location and run
    ruby dk.rb init
    ruby dk.rb install

5: At any location run gem install librarian-chef (this may take some time)
6: Download Vagrant (http://www.vagrantup.com/) and install it
7: Open a command prompt and git clone https://github.com/acobley/vagrant-cassandra.git
8: Change to the directory vagrant-cassandra\vagrant and run
   librarian-chef install
9: Open vagrant/cookbooks/java/attributes and edit default.rb so that
    default['java']['jdk_version'] = '7'
   
10: If you want, comment out the “DL is deprecated, please use Fiddle” warning at C:\HashiCorp\Vagrant\embedded\lib\ruby\2.0.0\dl.rb
11: Change to the vagrant-cassandra directory and run
    vagrant up
This could take some time, but once it's finished you should be able to ssh to the virtual machine if you have ssh installed.

   vagrant ssh node1

If you don't have ssh installed, the git installation comes with an ssh client, so add c:\program files\git\bin to your path

 set PATH=%PATH%;c:\program files\git\bin

 Or set the path environment variable from the control panel.

 You can then ssh to the virtual host
  
  ssh vagrant@127.0.0.1 -p 2222 -i c:/users/*username*/.vagrant.d/insecure_private_key
 
Once inside the virtual machine you can test that it works by getting the C* status:

/usr/local/cassandra/bin/nodetool -h 192.168.2.10 status





You can bring down the cluster with vagrant halt  and remove it with vagrant destroy (but then you'll need to start again!)

Vagrant can also be run on a Mac.  Make sure you have VirtualBox installed, clone https://github.com/acobley/vagrant-cassandra.git and follow the instructions in the readme.

Saturday, November 2, 2013

Hadoop 2.x : jar file location for wordcount example

The Jar files  for Hadoop 2.x have moved location from Hadoop 1.x.  I found the following command

javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes myWordCount.java

will allow you to compile the standard wordcount example code.  You can see that the common files are in /share/hadoop/common/ and the mapreduce files are in /share/hadoop/mapreduce/.  Finally, the common lib files are in /share/hadoop/common/lib.

This post is in answer to this stackoverflow question:

http://stackoverflow.com/questions/19488894/compile-hadoop-2-2-0-job

(or set your classpath like this

 export CLASSPATH=$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar

and compile like this:

javac -classpath $CLASSPATH -d myWordCountClasses myWordCount.java

)

Wednesday, October 30, 2013

Hadoop 2 on Ubuntu on Azure.

This is to be read in conjunction with http://ac31004.blogspot.co.uk/2013/10/installing-hadoop-2-on-mac_29.html

Fire up an Azure Ubuntu server and ssh to it

Install a Java JDK:
apt-get install default-jdk

On your home machine, download a copy of Hadoop and secure copy it to the Azure machine (your username and machine will be different)
scp hadoop-2.2.0.tar.gz user@Hadoopmachine.cloudapp.net:

Unzip it and untar it
gunzip hadoop-2.2.0.tar.gz
tar xvf  hadoop-2.2.0.tar

You'll still need to set up the env variables
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_INSTALL=/home/user/hadoop-2.2.0
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin


Also add JAVA_HOME and HADOOP_INSTALL, and change PATH, in /etc/environment; see http://trentrichardson.com/2010/02/10/how-to-set-java_home-in-ubuntu/ for details

After setting up core-site.xml and hdfs-site.xml you'll make the namenode and datanode directories

mkdir -p /home/hadoop/yarn/namenode
mkdir /home/hadoop/yarn/datanode

Everything else should be the same.

Tuesday, October 29, 2013

Installing Hadoop 2 on a Mac

I've had a lot of trouble getting Hadoop 2 and YARN running on my Mac.  There are some tutorials out there but they are often for beta and alpha versions of the Hadoop 2.0 family.  These are the steps I used to get Hadoop 2.2.0 working on my Mac running OS X 10.9.


Get hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/common/

make sure JAVA_HOME is set (if you have Java 6 on your machine):
export JAVA_HOME=`/usr/libexec/java_home -v1.6`

point HADOOP_INSTALL to the hadoop installation directory
export HADOOP_INSTALL=/Applications/hadoop-2.2.0

And set the path
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

You can test hadoop is found with
hadoop version

make sure ssh is set up on your machine:
system preferences -> sharing -> remote login is ticked

try:
ssh <username>@localhost

where <username> is the name you used to log on.

In $HADOOP_INSTALL/etc/hadoop these are the conf files I changed.

core-site.xml

 <configuration>  
 <property>  
   <name>fs.default.name</name>  
   <value>hdfs://localhost:9000</value>  
  </property>  
 </configuration>  


hdfs-site.xml

 <configuration>  
 <property>  
   <name>dfs.replication</name>  
   <value>1</value>  
  </property>  
  <property>  
   <name>dfs.namenode.name.dir</name>  
   <value>file:/Users/Administrator/hadoop/namenode</value>  
  </property>  
  <property>  
   <name>dfs.datanode.data.dir</name>  
   <value>file:/Users/Administrator/hadoop/datanode</value>  
  </property>  
 </configuration>  


Make the directories for the namenode and datanode data (note the file above and the mkdir below will need to reflect where you  want to store the files, I've stored mine in the home directory of the Administrator user on my Mac).

mkdir -p /Users/Administrator/hadoop/namenode
mkdir -p /Users/Administrator/hadoop/datanode

hadoop namenode -format

yarn-site.xml
 <configuration>  
 <!-- Site specific YARN configuration properties -->  
 <property>  
 <name>yarn.resourcemanager.address</name>  
 <value>localhost:8032</value>  
 </property>  
 <property>  
 <name>yarn.nodemanager.aux-services</name>  
 <value>mapreduce_shuffle</value>  
 </property>  
 </configuration>  


start-dfs.sh
start-yarn.sh
jps

should give
9430 ResourceManager
9325 SecondaryNameNode
9513 NodeManager
9225 DataNode
9916 Jps
9140 NameNode

If not, check the log files.  If the datanode has not started and you get an incompatible IDs error, stop everything, delete the datanode directory and recreate it.

Try an ls:
hadoop fs -ls

if you get

ls: `.': No such file or directory

then there is no home directory in the hadoop file system.  So

hadoop fs -mkdir /user
hadoop fs -mkdir /user/<username>
where <username> is the name you are logged onto the machine with.

now change to $HADOOP_INSTALL directory and upload a file

hadoop fs -put LICENSE.txt


finally try a mapreduce job:

cd share/hadoop/mapreduce
hadoop jar ./hadoop-mapreduce-examples-2.2.0.jar wordcount LICENSE.txt out
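
If you want a feel for what the wordcount job should produce before reading its output in HDFS, here is a plain-Java sketch of the same tally. It does not use Hadoop at all; LocalWordCount is a name I made up, and it assumes words are simply separated by whitespace.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; the real example runs as a Hadoop MapReduce job.
class LocalWordCount {

    // Split on whitespace and tally each token (the map and reduce steps rolled into one).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;  // empty input yields one empty token
            }
            Integer seen = counts.get(word);
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the quick fox jumps over the lazy dog the end"));
    }
}
```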

Friday, October 11, 2013

Mapping CQL's sets and maps to column families

In this post we are going to explore how CQL implements sets and maps in Cassandra’s column store.

(In a bizarre twist of fate, John Berryman created this post http://www.planetcassandra.org/blog/post/understanding-how-cql3-maps-to-cassandras-internal-data--structure yesterday on the same subject. I swear I hadn't seen it when I started working on this post, yesterday as well!  It's just how it goes sometimes.  John's post is great, it has to be said!)

In CQL version 3, wide tables are supported through the use of sets, maps and lists.  These features have been supported since Cassandra 1.2 (http://www.datastax.com/dev/blog/cql3_collections) and should now be the de facto way of creating “wide tables”.  The canonical example of sets is the use of multiple email addresses for a user.  In the relational world you might create an email address table with a foreign key pointing to the user id for each address.  This is going to cause a join for any request that needs details of the user and their valid addresses.

Suppose we create a simple keyspace in the usual fashion:

create keyspace Keyspace3 WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

In Cassandra (from 1.2) you would create a table like this:

CREATE TABLE Users (   
    id uuid Primary Key,
    name text,    
    email_addresses set<text>);

(This is similar to  Sylvain Lebresne’s example here http://www.datastax.com/dev/blog/cql3_collections)

We can insert data into the table (a user with 2 email addresses) like this:

insert into users(id,name,email_addresses) values (88b8fd18-b1ed-4e96-bf79-4280797cba81,'tim',{'tim@example.org','timothy@example.org'});

This user has a UUID, a name and two email addresses.   You can of course get the email addresses with a select command:

select email_addresses from Users;

which will return the addresses as a set:

email_addresses
-------------------------------------------------------
            {'tim@example.org', 'timothy@example.org'}

However, how is this implemented in the column store ?  If you had used a thrift based interface (such as Hector) you may have created the column family and had the following structure:

Id: 88b8fd18-b1ed-4e96-bf79-4280797cba81 (Key)
    name: tim
  email_address: 'tim@example.org'
  email_address: 'timothy@example.org'

but how is it implemented in CQL3 ?  If you fire up Cassandra-cli you can use the list command to see what is stored in the column family:

LifeintheAirAge:bin Administrator$ ./cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 2.0.0

Please consider using the more convenient cqlsh instead of CLI
CQL3 is fully backwards compatible with Thrift data; see http://www.datastax.com/dev/blog/thrift-to-cql3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use keyspace3;
Authenticated to keyspace: keyspace3
[default@keyspace3] list users;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=email_addresses:74696d6f746879406578616d706c652e6f7267, value=, timestamp=1381497810072000)
=> (name=name, value=74696d, timestamp=1381497810072000)

So we can see the row key as expected and the name of the user as a name value pair (the value is ASCII in hex; in this case 74696d is tim).
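
If you don't fancy decoding the hex by hand, a few lines of Java will do it. This is just a convenience sketch; HexText and decode are names I have made up, and it assumes the values are plain ASCII as they are in this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading cassandra-cli output; not part of any Cassandra tool.
class HexText {

    // Turn a string of hex digit pairs into the ASCII text it encodes.
    static String decode(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(decode("74696d"));                          // prints tim
        System.out.println(decode("74696d406578616d706c652e6f7267"));  // prints tim@example.org
    }
}
```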

But for the email_addresses the values are not set.  The values of the email addresses are encoded into the name along with the “column” schema name  (name=email_addresses:74696d406578616d706c652e6f7267 is the column name email_addresses followed by tim@example.org in ASCII hex).  Why do this?  Why not have the name as email_addresses and the value as the hex email address?  One reason perhaps is that this allows us to implement maps in a similar way without needing special cases.  Suppose we alter the table to include a map: we want to store details about our user, but we don’t yet know which details the user will provide (a contrived example, I’ll grant you).  You can alter the table as follows:

alter table users add details map<text, text>;

and insert some details as follows:

update users set details= {'tel' : '555 232341', 'twitter' : '@andycobley'} where id =88b8fd18-b1ed-4e96-bf79-4280797cba81;

What does our column now look like? Using the list command we get :

RowKey: 88b8fd18-b1ed-4e96-bf79-4280797cba81
=> (name=, value=, timestamp=1381498805511000)
=> (name=details:74656c, value=3031333832333435303738, timestamp=1381498805511000)
=> (name=details:74776974746572, value=40616e6479636f626c6579, timestamp=1381498805511000)

You can see the map key is stored with the column name in the name part of the column family name value pair. So name=details:74656c contains ‘tel’ as an ASCII hex value.  The map value is simply stored in the value part of the column family name value pair.
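
The layout can be modelled in a few lines of Java. This is only a toy model of what cassandra-cli prints, not Cassandra's own code: the class and method names are mine, and the colon-joined strings stand in for the real composite column names, which are binary.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the cell layout shown by cassandra-cli; not Cassandra code.
class CollectionCells {

    // ASCII text to lower-case hex, the way the cli displays it.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.US_ASCII)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // A set element goes into the cell name; the cell value stays empty.
    static Map<String, String> setCells(String column, Set<String> elements) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (String e : new TreeSet<>(elements)) {
            cells.put(column + ":" + hex(e), "");
        }
        return cells;
    }

    // A map key goes into the cell name; the map value becomes the cell value.
    static Map<String, String> mapCells(String column, Map<String, String> entries) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : new TreeMap<>(entries).entrySet()) {
            cells.put(column + ":" + hex(e.getKey()), hex(e.getValue()));
        }
        return cells;
    }

    public static void main(String[] args) {
        Set<String> emails = new TreeSet<>();
        emails.add("tim@example.org");
        emails.add("timothy@example.org");
        System.out.println(setCells("email_addresses", emails));

        Map<String, String> details = new TreeMap<>();
        details.put("twitter", "@andycobley");
        System.out.println(mapCells("details", details));
    }
}
```

Seen this way a set is just a map whose values are all empty, which is perhaps why the two can share one implementation.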

So, we’ve seen how CQL3’s  maps and sets map on to the column family name value pairs by storing the CQL table’s column name in the name part of the column family name value pair.  It’s quite simple and elegant really.

(as ever I’m more than happy to receive corrections or further explanations !)