Thursday, January 8, 2009

LDAP vs Relational Database

Typically, when I ask enterprise developers where they'll store application data the response is immediate, a relational database. The directory server isn't even an afterthought. There's a long laundry list of reasons why: no one gets fired for choosing Oracle, its already there, familiarity with SQL, plethora of RDBMS tools, LDAP writes are slow, JNDI interface - are you kidding?!... While all of these are legitimate retorts, LDAP, and in particular OpenDS, bring with it clear advantages for certain classes of applications.

A short list of OpenDS differentiators include:

It's a database! LDAP isn't just a dumping ground for user profile data, it's a full-fledged hierarchical database. If your data is modeled in a parent-child fashion, not unusual in a XML world, then LDAP might be a better fit. LDAP data modeling is very different than relational modeling though at a simplistic level you might map tables to entries and columns to attributes.
Change schema over protocol: Schema is modifiable over the standard LDAP protocol - no vendor specific tools or syntax required. Building an application that requires frequent and/or unpredictable schema changes is difficult with relational databases and, I've found, often handled by creating generic trees (parent child relationships). Highly, dynamic “Web 2.0'ish” applications come to mind; e.g., generic Atom server where arbitrary schema changes are expected.

Out-of-the-box synchronization: Synchronizing directory databases is a standard feature and requires minimal administration. It just works...at least that's the idea. A very cool feature for H/A and solving data adjacency issues.

Read/write performance parity: At the moment I'm unable to release specifics on OpenDS performance, but those of you with memories of slow write performance are in for a pleasant surprise

Embedded mode: OpenDS need not run in a separate process, rather it can be directly embedded in your application. This is especially useful for cases where one may need the benefits of a LDAP (the other bullets) database with zero external administration.

Scalability: OpenDS, stripped down, runs on PDA's and scales to telcos.

Extensibility: Aside from standard plugin extensions there are numerous hooks within OpenDS that are easily modified by a Java developer. OpenDS is 100% Java after-all. A deployer is free to customize protocol handlers, client connection handlers, configuration, monitors, loggers, and even the backend (we use Berkley DB).

Internationalization over protocol: Fetching language specific directory attributes is easily done over protocol using LDAP attribute options. For example, you may search for the English specific “description” attribute by searching “description;lang-en: software products” or the German version “description;lang-de: Softwareprodukte”

Matching rules: Directory servers provide a flexible facility for determining attribute value equality. Typical equality matching is supported out-of-the-box; e.g., case ignore, greater/less than, etc. The differentiating feature is the ability to write your own matching rules for specific attributes. An interesting application of this is fuzzy matching rules; instead of implementing approximate matches in your application logic this can be done within the data repository. For example, if I search for all Atom entries with the an attribute named “category” and the value “foo” I may have a matching rule that finds and return all entries that contain category values with “bar” and “baz” as well.

If you are setting up a large scale LDAP infrastructure i ‘d suggest you try and break the load across multiple servers instead of keeping everything in one large monolithic beast. That way you are able to easily add more power (by adding another server) if the load increases and you only lose a percentage of your total available power if you lose one server due to software or hardware malfunction. Moreover it’s always cheaper to buy a standard server than a huge multiprocessor multiGB memory mountain.

You should split your ldap servers into two main categories: Write Masters and Read-Only Replicas. Write masters should be the only ones serving write requests and complex reads while read-only replicas should service read requests from other application servers (mail, radius, web servers). The main difference between these two categories is in the type of requests that they will service. More specifically:

* Write Masters should service write requests and search requests that usually return multiple entries (substring queries and others). Clients for write masters should be applications like:
o User Administration Interface
o End users performing people searches through application like email readers (which usually support searching a directory for user emails).
o Applications that perform complex and time consuming queries. For instance dynamic mailing lists which use ldap searches to create their member list on every email sent.
* Read-only replicas should service queries that usually only return a single entry. This mostly includes user authentication for various services (web access, radius, etc.) and email routing. One very important characteristic of such services is that only a small percentage of the total user population actually use them each day. For example you might have 1,000,000 users with email access but you will soon find out that only a small percentage (lower than 50%) actually use it each day. Even smaller percentage use services like dialup access, web access and others.

Splitting your traffic like this allows you to create different requirements on your server hardware. Write Masters usually need to perform a lot of disk activity since they service write requests and complex searches (which requires heavy indexing) and they also require large enough cache memory sizes in order to be able to find most of the requested entries on memory. In most cases, almost all of the user entries available in your directory will have to be searched upon at least once per day. On the other hand, read-only replicas only have to service a small percentage of the total user population per day. Not only that, but the active user population tends to be steady and change very slowly so the users you serviced yesterday will most likely be the same today as well. As a result your cache memory need only be a small percentage of the total user population yet it will still achieve excellent cache hit ratios.

So to sum it up your Master and Read-Only replica hardware specification should adhere to something like the following:

* Masters should have fast disks (in order to handle high volumes of disk traffic), RAID security (the cost of losing the master is greater than losing a read-only replica) and large enough memory to keep at least 70-80% of the database on memory. For example if you ‘re expecting 1,000,000 users use at least 4-8GB of fast memory on your masters. Strong CPU (dual processor machine) is also important since you have to perform tasks like referential integrity, attribute uniqueness and schema checking on the masters whist you can omit them on the read-only replicas.
* Read-only replicas don’t need to be as large as the masters nor even use RAID disk security. Memory requirements are much lower as well. Depending on your application usage you might just need to hold 20-50% of your database on memory and still achieve >90% cache hit ratio. So if you are expecting 1,000,000 users 2-4GB of memory should be enough. As for CPU make sure you have enough power so that it never get’s to more than 30-40% usage (if you get a nice cache hit ratio CPU usage should stay at a minimum). This is necessary so that you can afford to lose one read-only replica without downgrading the whole ldap service performance. Lastly, there’s no need for heavy indexing like in the master servers. Since most queries will invlove user authentication and mail routing you can usually get away with just creating equality indexes on uid,mail and mailalternateaddress.

No comments: