Sep 16, 2004 by Philipp Janert
The expression “Embedded Database” requires an explanation. A “database” is an application that allows the targeted retrieval of stored data – a log-file is not a database. By “embedded” I mean a database that does not run in a separate process, but instead is directly linked (“embedded”) into the application requiring access to the stored data. This is in contrast to more conventional database management systems (such as Postgres or Oracle), which run as a separate process, and to which the application connects using some form of Inter Process Communication (such as TCP/IP sockets, for instance).
The prime advantage of embedded database systems lies in their availability and ease of administration. Since the data is kept in ordinary files in the user’s space, there is no need to obtain special permissions to connect to the database process or to obtain a database account. Furthermore, since embedded databases require nothing more than a normal library, they can be useful in constrained environments (think shared web hosting), where no “proper” database is available. They can even be linked to an application and shipped with it.
The simplest database is still a flat file of records, separated by a record separator. If the separator is chosen to be the newline character, as it usually is, then a record corresponds to a line.
Flat files allow only sequential access. This can be both a blessing and a curse: if the application primarily requires the lookup of individual records in arbitrary order, then flat files become impractical once the number of records becomes too large. On the other hand, if the typical access pattern is sequential for at least a substantial subset of records, then a flat file can be very practical.
Here is a typical (if a little contrived) example: Assume that we have several thousand very long strings (say, each string having more than one million characters) and we want to determine whether any two strings are equal. To do so, we have to compare each string to all remaining ones, using two nested loops. However, the total amount of data is too large to be kept in memory all at once. On the other hand, we access all the records in a predictable, sequential order.
The Perl module Tie::File allows us to do so in a particularly convenient fashion: it ties a Perl array to the file in such a way that each record corresponds to an element of the array.
Here then is a code snippet that demonstrates the application of Tie::File to the string comparison problem introduced above.
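The original listing did not survive; the following is a minimal sketch in its spirit, assuming the strings live one per line in a file whose name (strings.dat) is made up for the example:

```perl
use strict;
use warnings;
use Tie::File;

# Tie @strings to the data file: each line becomes one array element,
# but the file is never loaded into memory in its entirety.
tie my @strings, 'Tie::File', 'strings.dat'   # hypothetical file name
    or die "Cannot tie file: $!";

# Compare each record to all remaining ones, in sequential order.
for my $i ( 0 .. $#strings - 1 ) {
    for my $j ( $i + 1 .. $#strings ) {
        print "Records $i and $j are equal\n"
            if $strings[$i] eq $strings[$j];
    }
}

untie @strings;
```

Note that reads through the tied array go to disk (or to Tie::File's record cache), so the nested loops work even when the file is far larger than available memory.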
There is no question that this program requires heavy disk access, and if we had additional information about the data in question, we should try hard to use it and avoid the brute-force approach of comparing “everything to everything”. However, if we have no choice, then Tie::File makes code like the above as efficient as can be: It does not attempt to load the entire file into memory, so that arbitrarily large files can be processed. The first time we iterate over the file, Tie::File builds up an index containing the offset of each record from the beginning of the file, so that subsequent accesses can go directly to the proper position in the file using seek. Finally, accessed records are cached in memory, and the cache size can be adjusted by passing the memory option to the tie command.
However, any changes to the array (and its underlying file), in particular insertions and deletions in the middle, will always be slow, since all subsequent records have to be moved. If your application requires such capabilities, don’t use Tie::File; use a Berkeley DB!
Fast Lookup: Berkeley DB
A Berkeley DB is almost the exact opposite of a flat file in regards to the access patterns it supports. Scanning through a large number of successive records is difficult if not entirely impossible, but reading, inserting, updating, and deleting random records is very, very fast!
Conceptually, a Berkeley DB is a hash, saved to a file. Each record is found by its key, and the value of the record is treated as a string. It is up to the application to interpret this string - for instance, to break it up into parts, representing different fields.
The following code opens a Berkeley DB (creating a new file, if necessary) and writes a single record, containing information suitable for a passwd-style file. We then look up the same record, using the login name as key, and split the data into its constituent fields.
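Since the original listing was lost, here is a sketch of what such code might look like using the BerkeleyDB module; the file name and the record contents are invented for the example:

```perl
use strict;
use warnings;
use BerkeleyDB;

# Open (or create) a hash-style Berkeley DB file.
my $db = BerkeleyDB::Hash->new(
    -Filename => 'users.db',        # hypothetical file name
    -Flags    => DB_CREATE,
) or die "Cannot open database: $BerkeleyDB::Error";

# Store one passwd-style record: the fields are glued together
# with ':' to form a single string value.
my $login = 'jdoe';
$db->db_put( $login,
    join ':', 'x', 1001, 1001, 'John Doe', '/home/jdoe', '/bin/bash' );

# Look the record up again by login name ...
my $value;
$db->db_get( $login, $value ) == 0 or die "Record not found";

# ... and split the value into its constituent fields.
my ( $pw, $uid, $gid, $name, $home, $shell ) = split /:/, $value;
print "$name uses $shell\n";
```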
The code above uses the API-style Perl interface of the BerkeleyDB module, modelled after the Berkeley DB C API. The module also supports a tie interface, tying the database file to a Perl hash, so that ordinary hash accesses read and write records. The API-style interface, on the other hand, follows the native interface of the Berkeley DB quite closely, which means that most of the (quite extensive) original C API documentation also applies when programming in Perl - which is good, since the perldoc for BerkeleyDB is a bit sketchy in places. Try perldoc BerkeleyDB to get started on the native Berkeley DB interface, or visit the Berkeley DB homepage.
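The tie interface makes code like the following possible (a minimal sketch; the file name and record are invented):

```perl
use strict;
use warnings;
use BerkeleyDB;

# Tie a Perl hash to the database file; every hash access
# becomes a database operation behind the scenes.
tie my %user, 'BerkeleyDB::Hash',
    -Filename => 'users.db',        # hypothetical file name
    -Flags    => DB_CREATE
    or die "Cannot tie database: $BerkeleyDB::Error";

$user{jdoe} = 'John Doe:/bin/bash';    # a db_put under the hood
print $user{jdoe}, "\n";               # a db_get under the hood

untie %user;
```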
Complex data structures can be stored in a Berkeley DB, provided they have been serialized properly. If a simple solution using join and split as demonstrated above is not sufficient, we can use modules such as Data::Dumper or (better) Storable to obtain a flat string representation suitable for storage in the Berkeley DB. There is also the MLDBM (“Multi-Level DBM”) module, which provides a convenient wrapper interface, bundling serialization and storage.
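A sketch of the MLDBM approach, here configured (as one possible choice) to store via DB_File and serialize via Storable; the file name and data are made up:

```perl
use strict;
use warnings;
use MLDBM qw(DB_File Storable);   # storage backend, then serializer
use Fcntl;

tie my %db, 'MLDBM', 'complex.db', O_CREAT | O_RDWR, 0640   # hypothetical file
    or die "Cannot tie database: $!";

# An entire nested structure is serialized transparently on assignment ...
$db{jdoe} = { name => 'John Doe', shells => [ '/bin/sh', '/bin/bash' ] };

# ... and deserialized again on retrieval.
my $rec = $db{jdoe};
print $rec->{shells}[1], "\n";

untie %db;
```

One caveat worth knowing: assignments to nested elements ($db{jdoe}{name} = ...) do not write through to the file; one has to fetch the structure, modify it, and store it back.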
For suitable problems, solutions based on Berkeley DBs can work very well. However, there are a couple of problems. One is that the Berkeley DB library must be available - it has to be installed separately (usually as libdb). The bigger limitation is that although there are quite a few problems out there for which a single-key lookup is sufficient, for many applications such a simple access pattern will not work. This is when we require a relational database – such as SQLite.
Complex Queries: SQLite
There are situations when neither of the aforementioned straightforward access patterns applies. Berkeley DBs are great if we know the key for the record - but what happens if we only know part of the key, or we are interested in a set of records, all matching some pattern or logical expression? Alas, the Berkeley DB does not allow wildcard searches. Another frequently arising problem occurs when each record has several fields or attributes and we want to be able to look up records by any one of them (such as being able to search a list of books by author as well as title). This can be achieved in a Berkeley DB using multiple lookups, but one can’t help thinking that there has to be a better way! Finally, the simple key/value mapping of the Berkeley DB occasionally forces duplication of data, since all the relevant information has to be present in each record - this both wastes storage space and creates maintenance problems.
Relational databases address all these points. They permit wildcard and logical expression searches, multiple searchable columns, and joins to represent relationships. Unfortunately, search capabilities such as these used to be restricted to big, standalone relational database engines (such as Postgres or Oracle), often making them unavailable for the quick-and-dirty Perl project.
This is where the SQLite project comes in: Founded in 2000, SQLite tries to combine the convenience of the Berkeley DB architecture with the power of a relational (SQL) database. Its latest release is version 2.8.14, with the next major revision, version 3.0, currently in beta. SQLite implements most of standard SQL (with some exceptions, such as correlated subqueries).
However, with added power comes additional complexity: a relational database no longer maps neatly onto one of Perl’s built-in data structures. Instead, one has to use the DBI (DataBase Interface) module and code the queries in SQL. Furthermore, since the structure of the data is no longer predetermined, the programmer has to define and construct the required tables explicitly.
This is not the place for a full discussion of the Perl DBI, but the example below is enough to get started. First, SQLite has to be installed on the system. Conveniently, the DBI driver for SQLite, DBD::SQLite, already contains the database engine itself as part of the module - so installing this module is all that is required to be able to use SQLite from Perl.
We begin by connecting to the SQLite database contained in a file on disk. Note that the connect method loads the appropriate database driver - all we need to do is specify dbi:SQLite in the connection string. We create and populate two tables, and then run a query joining the two tables and using an SQL wildcard. The results are returned as a reference to an array. Each array element is itself an array, containing the column values in its elements.
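The original listing is missing; the sketch below follows the steps just described, with the file name, table layout, and sample data invented to match the book/author example:

```perl
use strict;
use warnings;
use DBI;

# Connect to (and create, if necessary) the SQLite database file.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=books.db', '', '',   # hypothetical file
                        { RaiseError => 1 } );

# Define and populate two tables.
$dbh->do('CREATE TABLE authors ( id INTEGER PRIMARY KEY, name TEXT )');
$dbh->do('CREATE TABLE books   ( title TEXT, author_id INTEGER )');

$dbh->do("INSERT INTO authors VALUES ( 1, 'Larry Wall' )");
$dbh->do("INSERT INTO books   VALUES ( 'Programming Perl', 1 )");

# Join the two tables, using an SQL wildcard on the title.
my $rows = $dbh->selectall_arrayref(
    q{ SELECT books.title, authors.name
       FROM   books
       JOIN   authors ON books.author_id = authors.id
       WHERE  books.title LIKE 'Programming%' } );

# Each element of @$rows is itself an array of column values.
for my $row (@$rows) {
    my ( $title, $name ) = @$row;
    print "$title by $name\n";
}

$dbh->disconnect;
```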
The books table references the authors table via a foreign key. This is a one-to-many relationship: each book has only a single author, but an author can have written multiple books. What would we have done if books could have been co-authored by multiple writers? We would have used a cross-reference table to represent the resulting many-to-many relationship. Try that with Berkeley DBs!
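Such a cross-reference table might be sketched as follows (all names and data invented; an in-memory database is used for brevity):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                        { RaiseError => 1 } );

# The book_author link table turns the one-to-many design
# into a many-to-many relationship.
$dbh->do('CREATE TABLE authors     ( id INTEGER PRIMARY KEY, name TEXT )');
$dbh->do('CREATE TABLE books       ( id INTEGER PRIMARY KEY, title TEXT )');
$dbh->do('CREATE TABLE book_author ( book_id INTEGER, author_id INTEGER )');

# One co-authored book, linked to both of its writers.
$dbh->do("INSERT INTO authors VALUES ( 1, 'Alice' )");
$dbh->do("INSERT INTO authors VALUES ( 2, 'Bob' )");
$dbh->do("INSERT INTO books   VALUES ( 1, 'Our Joint Book' )");
$dbh->do('INSERT INTO book_author VALUES ( 1, 1 )');
$dbh->do('INSERT INTO book_author VALUES ( 1, 2 )');

# All authors of a given book: two joins across the link table.
my $names = $dbh->selectcol_arrayref(
    q{ SELECT authors.name
       FROM   books
       JOIN   book_author ON books.id   = book_author.book_id
       JOIN   authors     ON authors.id = book_author.author_id
       WHERE  books.title = 'Our Joint Book' } );

print "$_\n" for @$names;
$dbh->disconnect;
```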
SQLite shares with the Berkeley DB the approach of being mostly typeless and treating data as simple strings. In our table definitions above, we did not need to specify the data types of the columns, as would be required by Oracle or Postgres. SQLite provides a bit more structure, in that it supports separate, named columns, so that distinct values do not have to be glued together to form a single string, as was required when using a Berkeley DB. Furthermore, SQLite will try to convert strings to floating-point numbers if numerical comparisons or operations (such as < or +) are performed on them.
Since SQLite is a relational database, the data architecture (the “schema”) can be arbitrarily complex, with data distributed across multiple tables and with foreign keys to represent relationships. This is a big topic – any book on the relational model will provide plenty of information, including issues such as database normalization and tools such as Entity/Relationship-Models.
In this paper, we considered three ways to add database capabilities to a Perl program, without requiring major external database installations or administrative support. Depending on the typical access patterns, flat files or Berkeley DBs may be suitable choices. For more complex queries, the SQLite project provides all the power of relational database architecture in a single Perl DBD module.
I participated in the beta evaluation of the BDB SQLite code and one of the things I tried to get a handle on was the performance difference. At this point, I cannot publish exactly what I found until I have at least one other person evaluate my code, run the tests, and confirm the numbers I got (which is being done). However, I can generalize here and say that there are cases where BDB offers significant performance improvements over SQLite, specifically in the area of handling heavy loads that involve write-concurrency.
There are, generally, two measures of "fast" -- (1) efficiency: how long does it take for a single process to do XYZ, vs. (2) concurrency: how many times can many processes do XYZ per unit time. The main problem BDB addresses is concurrency -- large-scale transaction processing. Thus you think of many concurrent connections writing to and/or modifying the contents of the database.
SQLite by design uses database-level locking, so there is a maximum of one writer who can be working in the database at a time. Thus, SQLite's transaction rate stays more or less constant with the number of concurrent connections, so its scalability in write-intensive applications is really measured by its efficiency (1).
BDB on the other hand uses page level locking, which allows multiple writers to be working in the database at a given time (provided that they are working on separate pages). Thus BDB's rate potentially increases with the number of connections and so its scalability is both a matter of efficiency (1) and concurrency (2), which can add up.
Mainly what it boils down to is (write) concurrency. BDB can push more TPS than SQLite for multiple writers. By transaction, I mean something that modifies the database (transactions are of little real help for read-only operations). That said, for read concurrency (apps that mainly do SELECTs), SQLite could very well go head to head with BDB, because locking is no longer a critical issue.
As for the size of the dataset, I am not sure. I've not looked into that. Ultimately, they both use B-trees for storage. There may be factors in their respective implementations to consider, but I've not investigated that. I know that SQLite can gracefully handle data sets into the hundreds of MBs and double digit GBs (and perhaps more now that the dirty page map implementation has been changed).
Therefore, if you have an application which employs many connections that modify a given database and page contention is relatively low, then BDB can offer significant performance improvements. But page contention is a critical variable. In the limit, if you had a BDB database whose data consisted of a single page, then its performance would match that of SQLite in all cases because page-level locking here effectively degenerates into the equivalent of database level locking -- everybody is fighting over one thing. However, as the number of pages increases in BDB (and page contention decreases), then the maximum TPS will start to grow with the number of concurrent connections. Then from that point, memory becomes the next limiting factor. But that's another story.
BTW, I am in the process of writing an article about using BDB for those coming from SQLite.