Web consultancy based in Tel Aviv, Israel, and Berlin, Germany.
About us.
Read about
how we work
with our clients.
Contact us, we are friendly.
We're hiring.
Read our blog.
Hi, my name is Vitaly Kushner and I’m a founder of Astrails.
For the last 5 years we’ve built many web applications.
There is a growing trend lately in this space which we are going to discuss today.
And that is NoSQL.
You surely heard about it.
In short it seems like Everybody does it.
Someone calculated that lately there is a new NoSQL data store appearing every 2 weeks. We are talking tens already and will probably get to over hundred sometimes soon.
The problem is, if you ask 10 people what is NoSQL, you get 10 different answers.
So what is NoSQL and Why should you care?
As I said there are many different things that are being called NoSQL.
Usually what unifies them is that they are non-relational.
Apart from that we got:
There is a lot of different things under the same term, as you see, and there are quite a few people criticizing NoSQL for that. They try to invent some other names for it, or to partition the space into parts.
One word of advice: Leave it already, NoSQL is here to stay. Deal with it ;)
I’d say: NoSQL is a new generation of databases built for high traffic web applications (but not only ;).
So. Why and When should we use NoSQL?
there are clusters of over 100K machines at yahoo and google
I’d like to see Oracle doing that ;)
At one point Twitter were writing 7G of user generated data per day! and that amount doubles several times within a year.
So the common pattern here is massive scale.
The question is, can you benefit from NoSQL if you are not Google?
Definitely YES.
As I already said, the flexible schema can be very good even at the very small scale.
But there is another and much more important aspect to it.
One word:
Ask any business owner and they will tell you that they absolutely can not tollerate any downtime whatsoever. This is of course not true. A lot of smaller scale websites can have a little downtime w/o any adverse effect to the company.
This being said, availability is sitll a very important goal.
You can get that with RDBMS. Just ask any conventional enterprise. Redundancy and NSPOF are the keys here.
But while it can be done it is definitely hard (and expensive). You can’t just throw a 5 node MySQL cluster and treat it as a highly available SQL database.
You need to carefully configure it first. Master-Master + lots of slaves. And you loose some of the RDBMS goodies in the process.
For example, if you have master-slave MySQL cluster and you write something to the master, and then go and try to read it from the slave, you will probably get the old value as there is a replication delay.
So consistency goes out of the window.
We forfeit consistency to get availability.
This is in fact a very generic principle.
There is this thing called “The CAP theorem” which states that from Consistency, Availability and Partition tollerance you can only pick two!
Now that we looked at the availability, lets get back to scale.
What are the usual ways to scale?
Buy a 16-way 128G RAM server to hold your SQL. This is not that bad btw. You can have a 128G RAM server for 2-3K/month this days. If you can fit all your data (or at least all of the working-set) into memory - MySQL is blazing fast. And so if you already have an existing conventional SQL applicaiton, this might be the cheapest way to scale it in the short term.
In both cases you will have to change the application to work with this setup. And it will cost a lot in operational overhead.
Your systems team will hate you.
Backups setup and monitoring will take much much more then in a regular non-partitioned setup. And the overhead only grows with time.
Note that if you do any kind of partitioning, you can no longer use transactions and joins across the partition. You will have to do some application logic instead.
So the question is:
If we don’t use RDBMS features why do we need RDBMS at all?
Or in other words if you have sharded MySQL with Memcached you are building a “square wheel Cassandra” :)
OK now. You decided to use NoSQL in your organization?
First thing is to choose which one.
Few examples:
Single-node durability defined as: write, once completed, will survive permanently, even if the server is killed or crashes, or loses power.
Lets talk a little bit about schema.
Schema-less doesn’t mean you don’t need to think about schema!
Quite the contrary. You usually have to think much harder about your schema with many of NoSQL databases.
It just means that your schema is defined in your application and not in the database.
With RDBMS you define schema, then you look what kinds of queries you need and add indexes to accomodate.
It is the opposite in Cassandra for example. Some people call Cassandra “index construction kit”.
What does it mean?
You start with your queries first.
Then you figure out how to organize the data to make sure you can do your queries effectively.
For example, lets say you need to make an authentication system to allow users to login with user/password.
Naturally you will have your user model. Something like this:
Users: {
"vitaly": {
"email": "[email protected]",
"company": "astrails",
"password": "secret"
},
"michael": {
"email": "[email protected]",
"company": "astrails",
"password": "superduper"
},
...
}
Now someone wants to login with email [email protected], how do you authenticate?
In an RDBMS you would just have an index by email. In MongoDB too you can have such an index and all is quite simple.
In Cassandra though you will have to construct such an index by yourself.
UsersByEmail: {
"[email protected]": "vitaly",
"[email protected]": "michael",
...
}
or if you need to group people by company you would have something like:
UsersByCompany: {
"astrails": {
"vitaly": "[email protected]",
...
}
}
Bottom line, need to think about queries first and then build the schema to match.
But what if your requirements change?
You can migrate.
But unlike RDBMS you can do your schema changes gradually.
Some schema changes do not require any migration at all. Adding a column with some default value can be done in code w/o touching the db. you just treat missing column as having default value and add it on all new records. That’s all.
But for more complicated cases, you can use the following approach.
You can have schema_version in every record.
Then when reading a record from the database you can migrate it to the latest version if needed (and possibly write it back).
In addition you might have a background process that slowly migrates all the records too.
This way you immediately can continue to use your db as if it is all migrated already.
Unless the migration includes building new indexes, in which case you’ll have to wait for the background process to finish before you can take advantage of it.
But sill, the db is fully usable during the process.
Start slow, especially if its for an existing product. You can slowly move parts of functionality to NoSQL solution.
For example at Twitter they’ve built a completely parallel ‘shaddow’ implementation of their whole architecture with Cassandra and they are feeding it the same input data the production servers are getting, and they monitor it and fix bugs in it until it looks like it works as required. They slowly and incrementally transfer production traffic into it.
Bottom line, some NoSQL solution can most probably help you right now.
If you need help you can always call us ;)