NoSQL – The Famous Four

Relational databases (RDBMS) have been the staple of data storage for a long time, but they have been joined by a new generation of databases that fall under the category NoSQL. The term was coined by Carlo Strozzi in 1998 to describe his lightweight, open-source relational database, which did not have an SQL interface. Today NoSQL is commonly understood to mean Not Only SQL (although it is sometimes believed to mean No SQL). Strozzi believes the movement should be called NoREL (Not Only Relationship), as the databases are losing the relational aspect, not only the SQL. The underlying idea is that rows and relationships are not the most efficient way to store data for many applications. NoSQL databases are, however, usually more specialised than an RDBMS, and they usually have few built-in functions, whereas an RDBMS has a multitude for tasks such as time handling, comparison and search. They usually aim for greater scalability than an RDBMS, but this means they lose some of the database fundamentals that have come to be relied upon.

The four most common forms of NoSQL are listed below:

Document Store

This takes the commonly understood concept of a schema-less document as the container for the information. Metadata is then attached to each document, allowing documents to be accessed, retrieved and organised. This can be done through collections, tags, hierarchical trees of documents and metadata.

As with any form of NoSQL system, similarities can be drawn with RDBMS systems. Here a collection of documents could be seen as a table, with each document being a row. However, there is no guarantee that each document will have the same fields, as the store is schema-less.

Retrieval of documents is based on a ‘key-value’ system in which each document has a key; however, different systems also allow different queries that search the content of each document.
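To make the two styles of retrieval concrete, here is a minimal sketch in plain Python. It is not the API of any real document store: the collection, the field names and the helper functions `get_by_key` and `find` are all hypothetical, and a dictionary stands in for the database.

```python
# Toy document collection: a dict mapping document keys to schema-less documents.
# All names and data here are hypothetical, for illustration only.
collection = {
    "doc1": {"title": "Bus timetable", "tags": ["transport"], "pages": 12},
    "doc2": {"title": "Road map", "tags": ["transport", "maps"]},  # no "pages" field
    "doc3": {"title": "Staff list", "author": "HR"},               # different fields again
}


def get_by_key(key):
    """Key-value style retrieval: a single lookup, no scan of the collection."""
    return collection.get(key)


def find(**criteria):
    """Content query: return the keys of documents matching every criterion.

    A list-valued field matches if it contains the wanted value.
    """
    matches = []
    for key, doc in collection.items():
        ok = True
        for field, wanted in criteria.items():
            value = doc.get(field)
            if isinstance(value, list):
                ok = wanted in value
            else:
                ok = value == wanted
            if not ok:
                break
        if ok:
            matches.append(key)
    return matches
```

Calling `get_by_key("doc2")` goes straight to one document, while `find(tags="transport")` has to inspect the content of every document, which is why real systems differ in how much of this kind of querying they support.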

There are a number of well-used document-based databases, the leading one being MongoDB. This is an open-source system supported by 10gen. It stores documents in a JSON-like format with dynamic schemas, allowing documents to have different fields and values. It has been adopted by a number of large companies such as MTV, Craigslist and Foursquare for the ease and speed with which they can access and modify documents, which can represent real entities better than a traditional relational system can.

Graph Databases

These have been designed for systems where the data is best represented as a graph, with nodes, edges and properties. They have been used to represent public transport systems, roads and network topologies, where there is a need to represent an undetermined number of connections between points in a simple way. Nodes are similar to objects stored within object databases, allowing IDs, names and other information to be stored about each node, whereas edges connect the nodes and represent the relationships between them.

For associative data sets they allow for a performance increase over traditional relational databases, allowing the structure of a system to be mapped more directly. Like document stores they have a less rigid schema, which means they can adapt to storing different types of objects, e.g. bus stations and bus stops may require different types of data. They allow for more powerful queries of the graph than RDBMSs, as they can use the full power of graph theory and do not have to be layered on top of a relational system. Neo4j is currently the leading graph database; it is written in Java but has built-in APIs and wrappers that allow it to be accessed from different languages and systems.
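As a rough illustration of nodes, edges and properties, the sketch below models a tiny bus network in plain Python and answers a graph query (fewest hops between two points) with a breadth-first search. The node names, routes and helper functions are invented for the example, and this is not how Neo4j or any other graph database is actually queried.

```python
from collections import deque

# Nodes carry arbitrary properties; stations and stops need not share fields.
nodes = {
    "central": {"kind": "bus station", "name": "Central", "bays": 8},
    "park": {"kind": "bus stop", "name": "Park Road"},
    "mill": {"kind": "bus stop", "name": "Mill Lane", "shelter": True},
    "quay": {"kind": "bus stop", "name": "Quayside"},
}

# Edges are (from, to, properties); a node may have any number of them.
edges = [
    ("central", "park", {"route": "7"}),
    ("park", "mill", {"route": "7"}),
    ("central", "quay", {"route": "12"}),
    ("quay", "mill", {"route": "12"}),
    ("mill", "central", {"route": "7"}),
]


def neighbours(node_id):
    """Return the nodes reachable from node_id over one outgoing edge."""
    return [dst for src, dst, _ in edges if src == node_id]


def shortest_path(start, goal):
    """Breadth-first search: the path with the fewest hops, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

In a relational system the same question would typically need a self-join on an edge table for every hop, whereas here the traversal simply follows edges from node to node.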

Key-Value Store

As with all the systems mentioned so far, this one is again schema-less. These stores are similar to the key-value pairs used within data structures in all the major programming languages, such as Java and Python: a key is associated with a value, and to access the value one must look up the associated key.

Each value must have a unique key by which to retrieve it. This allows for quicker access, which is typical of simpler data stores: instead of having to query nodes that could be spread across multiple systems, the flat structure gives direct access to the key and thus the value. The simple structure of the data means that the code to access it is also very simple in comparison to SQL. However, because an RDBMS has tables, access control can be implemented per table; this is more challenging to implement here as there is no separation of data.

This category can be broken down into a number of subcategories, as this form of data store can be adapted for different purposes. Some of the main variants store the data in memory, allowing fast access, while others store it on disk, allowing the data to persist. Redis is the one leading the key-value database movement; it is open-source, networked and stores data in memory.
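The trade-off between in-memory speed and on-disk persistence can be sketched as follows. The `KeyValueStore` class is a hypothetical toy, not the interface of Redis or any other product, and it assumes values are JSON-serialisable so that the whole store can be written to a snapshot file.

```python
import json
import os
import tempfile


class KeyValueStore:
    """Flat in-memory store with an optional snapshot to disk (illustrative only)."""

    def __init__(self, path=None):
        self.path = path
        self.data = {}
        # Reload an earlier snapshot if one exists at the given path.
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def set(self, key, value):
        self.data[key] = value  # keys are unique, so this overwrites any old value

    def get(self, key, default=None):
        return self.data.get(key, default)

    def save(self):
        """Write the whole store to disk so that it survives a restart."""
        if self.path is None:
            raise ValueError("store was created without a path")
        with open(self.path, "w") as f:
            json.dump(self.data, f)


# Usage: values survive being reloaded from the snapshot file.
path = os.path.join(tempfile.mkdtemp(), "kv.json")
store = KeyValueStore(path)
store.set("user:1", "alice")
store.save()
reloaded = KeyValueStore(path)
```

Reads and writes touch only the in-memory dictionary, which is what makes them fast; anything written after the last `save()` would be lost if the process stopped.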

Object Databases

Object database (OODBMS) systems store information in objects in much the same way as object-oriented programming languages (OOPL) do. This allows objects to be created, stored and retrieved without any need to convert them into another form, as would have been needed when storing them in a traditional DBMS. As they are so heavily integrated with the programming languages, the schema can be maintained and manipulated within the same development environment.

In the early days, OODBMS were seen as adding persistence to object programming languages. Like any database, they give developers the ability to query large sets of objects with a simple language. Objects are usually retrieved by following pointers, so the need for expensive joins on tables is removed, reducing the access time to the data. Many features have been added over time that are only possible by making objects persistent, such as versioning, which allows developers to access past states of objects.
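The two ideas in this paragraph, following pointers instead of joining tables and keeping past versions of an object, can be sketched in a few lines of Python. The `ObjectStore` class below is a hypothetical in-memory stand-in and not a real OODBMS: it simply keeps a deep copy of an object each time it is saved. The `Author` and `Book` classes and their data are invented for the example.

```python
import copy


class Author:
    def __init__(self, name):
        self.name = name


class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author  # direct reference to another object, not a foreign key


class ObjectStore:
    """Toy stand-in for an object database: keeps every saved version of an object."""

    def __init__(self):
        self._versions = {}  # object id -> list of snapshots, oldest first

    def save(self, oid, obj):
        self._versions.setdefault(oid, []).append(copy.deepcopy(obj))

    def load(self, oid, version=-1):
        """Return a copy of the latest snapshot, or of an earlier one by index."""
        return copy.deepcopy(self._versions[oid][version])


store = ObjectStore()
book = Book("Graph Theory", Author("A. Writer"))
store.save("book:1", book)
book.title = "Graph Theory, 2nd ed."
store.save("book:1", book)

latest = store.load("book:1")
first = store.load("book:1", version=0)
author_name = latest.author.name  # follow the pointer; no join needed
```

The objects go in and come out in the same form the program uses, and the author is reached through `latest.author` rather than through a lookup in a second table.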

Wide-column Databases

These are all column-family databases, which are used for extremely large data sets that would have large access overheads if stored in rows; these systems are therefore designed to scale to petabytes in size. They can be compared to RDBMSs as follows:

  • A table in an RDBMS corresponds to one or more column families (a column family being a collection of columns)
  • A column in a column family is a triplet consisting of a key, a value and a timestamp.
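The comparison above can be pictured as a nested structure. The sketch below is a plain-Python illustration with invented row keys, family names and values. Treating the third element of the column triplet as a timestamp is an assumption, based on how Cassandra-style stores usually describe a column.

```python
from collections import namedtuple

# A column as a triplet. The timestamp field is an assumption (see the text above).
Column = namedtuple("Column", ["key", "value", "timestamp"])

# One "table": row key -> column family name -> list of columns.
users = {
    "row1": {
        "profile": [Column("name", "Ann", 1), Column("city", "Leeds", 1)],
        "activity": [Column("last_login", "2012-05-01", 7)],
    },
    "row2": {
        "profile": [Column("name", "Bob", 3)],  # rows need not share columns
    },
}


def read_family(row_key, family):
    """Return one column family of a row as a plain key -> value mapping."""
    return {c.key: c.value for c in users[row_key].get(family, [])}
```

Here `read_family("row1", "profile")` touches only the profile family and leaves the activity family unread.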

The reason for serialising the data in columns instead of rows is performance. Firstly, if you only want to query three out of ten columns in a table, you only have to access those three, whereas data stored in rows requires the whole table to be accessed. This also makes performing tasks on one column at a time quicker, as again the read and write time to access the column is reduced. There are also performance gains when adding or removing large amounts of data at a time, as you add or remove a whole column and so do not have to iterate through each row. However, if one row in the data set needs to be accessed, multiple operations will have to be performed; such accesses are usually limited in large datasets.
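The three-out-of-ten argument can be checked with a small experiment. The sketch below stores the same made-up table in a row layout and a column layout, then counts how many values each layout has to read to answer the same query. It uses in-memory lists as a stand-in for disk, so it only models the amount of data touched, not real I/O time.

```python
NUM_ROWS, NUM_COLS = 1000, 10
names = [f"c{i}" for i in range(NUM_COLS)]

# Row layout: one record per row, each holding all ten fields.
row_store = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Column layout: one contiguous list per column.
column_store = {name: [row[i] for row in row_store] for i, name in enumerate(names)}


def scan_rows(wanted):
    """Row layout: every record is read in full, then unwanted fields are dropped."""
    idx = [names.index(w) for w in wanted]
    values_read = 0
    result = []
    for row in row_store:
        values_read += len(row)
        result.append([row[i] for i in idx])
    return result, values_read


def scan_columns(wanted):
    """Column layout: only the wanted columns are read."""
    cols = [column_store[w] for w in wanted]
    values_read = sum(len(c) for c in cols)
    return [list(t) for t in zip(*cols)], values_read


wanted = ["c1", "c4", "c7"]
row_result, row_cost = scan_rows(wanted)
col_result, col_cost = scan_columns(wanted)
```

Both layouts return the same answer, but the row layout reads all 10,000 values while the column layout reads only 3,000. Rebuilding a single row from `column_store` shows the opposite side of the trade-off, since it needs one lookup in each of the ten columns.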

There are many variations of wide-column databases, with Google’s BigTable credited with starting the trend for this type of database design. Other notable ones are HBase, Cassandra and Hypertable, with Facebook having introduced Cassandra for storing vast amounts of user information across multiple servers.

As you can see, there has been fragmentation and specialisation in the database field. This has led to many improvements, giving developers systems that are easier to use and better suited to the data they store. However, it has also led to a fragmentation of knowledge about each system. Initially, RDBMS had SQL as a simple and universal language, which allowed anyone who knew SQL to access the data within the majority of RDBMS; now there are many different ways of accessing and setting up these data stores, each requiring specialist knowledge. That knowledge is also needed when deciding which one to use. I also agree heavily with Carlo Strozzi: as I have shown, it is not the removal of SQL that defines the field but the move away from the traditional relational model of databases towards an ever-increasing use of schema-less systems, which should allow for more specialised and adaptive development of existing and new systems.