Introduction to Object Storage
Object storage is a convenient, affordable, and API-driven way to store and access data in the cloud. Object storage services are designed to facilitate management of all types of data in a programmable fashion. From photos and documents to code snippets and query results, object storage treats all data equally. Adopting object storage can help you:
- Streamline operations, since all applications interact with data using the same API;
- Save costs by reducing the infrastructure required to manage storage; and
- Improve performance by reducing requests to your primary server.
The ability to store data in an unstructured form (that is, outside a database or a filesystem hierarchy) is increasingly in demand. With object storage, uploaded data becomes an object that is placed in a “bucket”, which can be thought of as a traditional directory that contains objects instead of files. Buckets have default access privileges, rich metadata, and unique IDs, which together give fine-grained access to select datasets. Objects are saved on highly available storage infrastructure (able to detect failure and rebuild redundancy) for both availability and reliability.
Object storage services are usually designed with “eventual consistency” in mind. This means that, while the object storage service responds immediately when an object has been created, updated, or deleted, the change itself might not be fully applied until later. This kind of asynchronous design helps applications interact with object storage services faster, since they do not have to wait for the operation to complete synchronously.
The limitation of embracing the “eventually consistent” approach is the possibility of accessing stale data when an object or metadata has been recently overwritten. Files that change frequently (e.g. databases) can be a poor fit for this environment. But files that are more read-heavy, such as static web files, research data, backups, and other latency-tolerant data, are extremely well-suited to this environment.
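The stale-read risk described above can be illustrated with a toy, in-memory model. This is only a sketch under simplifying assumptions: the class name, the `sync()` method, and the object key are hypothetical, and real services replicate in the background rather than on an explicit call.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes are acknowledged at once but applied later."""

    def __init__(self):
        self._visible = {}       # what readers currently see
        self._pending = deque()  # acknowledged writes not yet applied

    def put(self, key, value):
        # Acknowledge immediately; the write only becomes visible on sync().
        self._pending.append((key, value))
        return "accepted"

    def get(self, key):
        # May return stale data (or None) while writes are still pending.
        return self._visible.get(key)

    def sync(self):
        # Stand-in for background replication catching up.
        while self._pending:
            key, value = self._pending.popleft()
            self._visible[key] = value


store = EventuallyConsistentStore()
store.put("report.csv", "v1")
store.sync()
store.put("report.csv", "v2")    # the overwrite is acknowledged...
stale = store.get("report.csv")  # ...but this read still sees "v1"
store.sync()
fresh = store.get("report.csv")  # after replication, the read sees "v2"
```

A read-heavy object that is rarely overwritten almost never hits the window between `put` and `sync`, which is why such data suits this model well.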
Public cloud providers are able to provide competitive transfer and storage rates. Alternatively, many open source and self-hosted options can be used to store large amounts of data. Projects like OpenStack Swift can be effectively used to repurpose older hardware into a stable and resilient object storage solution. Self-hosted object storage can save money, especially if you have legacy hardware that needs repurposing; however, it comes with the added burden of managing the hardware and software stack.
Block Storage and Object Storage
A key difference between object storage and traditional storage is how you access the data. Object storage abstracts away the underlying OS and services, letting an end user or a service access items through a REST-based API. In contrast, traditional storage, also known as “block storage”, uses a filesystem to access and work with blocks of data. Interacting with the filesystem directly provides a lower-latency connection, resulting in faster read and write times. However, it does not provide the accessibility, scalability, and redundancy of object storage.
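To make the REST-style access concrete, the sketch below builds (but does not send) the HTTP requests an application might issue for an object. The endpoint, bucket name, and key are hypothetical, and real providers also require authentication headers that are omitted here.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://objects.example.com"


def object_url(bucket, key):
    # Each object is addressed by a URL built from its bucket and key.
    return f"{ENDPOINT}/{bucket}/{quote(key)}"


def build_put(bucket, key, data, content_type="application/octet-stream"):
    # Uploading an object is typically an HTTP PUT with the bytes as the body.
    return Request(object_url(bucket, key), data=data, method="PUT",
                   headers={"Content-Type": content_type})


def build_get(bucket, key):
    # Downloading the same object is a GET on the same URL.
    return Request(object_url(bucket, key), method="GET")


put_req = build_put("photos", "cat.jpg", b"example bytes", "image/jpeg")
get_req = build_get("photos", "cat.jpg")
```

Because every object is reached through the same small set of HTTP verbs, any application that can make a web request can use the store, with no filesystem driver or mount involved.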
Block storage is also very sensitive to latency. Many cloud providers offer network-attached block storage, such as Amazon EBS (Elastic Block Store), but these volumes are not automatically replicated across cloud data centres the way objects are in object storage. Block storage will have noticeably degraded performance when accessed from outside its own cloud data centre. This degradation occurs because block storage uses strong consistency (as opposed to eventual consistency), with write operations only being acknowledged after all storage targets have successfully saved the data.
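The effect of waiting for every storage target can be shown with simple arithmetic. The sketch below assumes replicas are written in parallel, and the latency figures are made-up illustrative numbers, not measurements of any service.

```python
def strong_ack_ms(replica_latencies_ms):
    # Strong consistency: the write is acknowledged only after every
    # replica has saved the data, so the slowest replica sets the pace.
    return max(replica_latencies_ms)


def eventual_ack_ms(replica_latencies_ms):
    # Eventual consistency (simplified): acknowledge once the first
    # replica has accepted the write; the others catch up later.
    return min(replica_latencies_ms)


# Illustrative latencies in milliseconds (assumed values).
same_data_centre = [1, 2, 2]
cross_data_centre = [1, 2, 80]

strong_local = strong_ack_ms(same_data_centre)      # 2
strong_remote = strong_ack_ms(cross_data_centre)    # 80
eventual_remote = eventual_ack_ms(cross_data_centre)  # 1
```

Under these assumptions, a single distant replica dominates every strongly consistent write, while the eventually consistent acknowledgement is unaffected by it.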
To support legacy applications, a filesystem gateway can enable access to an object storage service. This gateway frees the application from the need to support the object storage service’s REST API, but usually at the cost of slower performance.
Object storage is best suited for certain storage scenarios and challenges, highlighted in the following examples.
Example Use Cases
Archival and Backup Data
Businesses large and small are using object storage to save costs on offsite backup and archiving solutions.
Built-in redundancy lowers the risk of data loss due to hardware failure or other outages.
The ability to download objects via a web link makes retrieving the data quick and painless. APIs also make it easy to automate the uploading and managing of data.
It can be difficult or expensive to recover data from tapes and long-term storage offerings like Amazon Glacier and Azure Archive. Unless you are confident that the archived data will remain untouched for a number of months or years, object storage is a reliable target. Some public cloud providers offer the ability to automatically archive idle and old objects. For example, you can set a rule that objects which have not been accessed in three months or more are archived to a cheaper storage service. This kind of automated storage tiering can provide a cost-effective way to manage frequently and infrequently accessed data.
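The three-month archiving rule mentioned above might be sketched as follows. Providers normally express such rules as declarative lifecycle policies; this Python function, its name, the 90-day threshold, and the sample keys are assumptions used only to show the selection logic.

```python
from datetime import datetime, timedelta, timezone


def objects_to_archive(last_access, now, max_idle=timedelta(days=90)):
    """Return the keys whose last access is at least max_idle before now."""
    return sorted(key for key, accessed in last_access.items()
                  if now - accessed >= max_idle)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)

# Hypothetical objects and their last-access times.
last_access = {
    "backups/2023-12.tar.gz": datetime(2024, 1, 5, tzinfo=timezone.utc),
    "backups/2024-05.tar.gz": datetime(2024, 5, 20, tzinfo=timezone.utc),
}

to_archive = objects_to_archive(last_access, now)
# Only the backup last read in January is idle for 90 days or more.
```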
Web Assets
Object storage is great for offloading large or otherwise awkward web assets from your primary web server. All objects are given a unique and direct web URL. When access permissions are set appropriately on the objects, they can be accessed by anyone on the internet as if they were still stored on your web server.
Assets stored in object storage will not put additional load on your website and can load parallel to other content. This often reduces load times and bandwidth costs.
Some items are better suited for object storage than others. Video files, large or popular PDFs, and downloadable applications are good candidates. Depending on your object storage provider, geo-redundancy may be available that can act as a rudimentary content delivery network (CDN), bringing files closer to your audience. By leveraging an object’s metadata, you can also create URLs that can only be accessed by authorized users, or for a limited amount of time. However, some object storage providers do not optimize for caching, and some do not provide all the features that a full-fledged CDN would. How useful object storage is for this purpose depends on the capabilities of your provider.
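One common way to build a time-limited URL is to sign the object key and an expiry time with a secret that only the storage service knows. The sketch below illustrates that general idea; it is not any provider's actual signing scheme, and the secret, endpoint, parameter names, and function names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key; never hard-code a real one


def _signature(key, expires_at, secret):
    # Sign the object key together with its expiry (Unix seconds).
    message = f"{key}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url, key, expires_at, secret=SECRET):
    query = urlencode({"expires": expires_at,
                       "sig": _signature(key, expires_at, secret)})
    return f"{base_url}/{key}?{query}"


def is_valid(key, expires_at, sig, now, secret=SECRET):
    # The link works only if the signature matches and it has not expired.
    expected = _signature(key, expires_at, secret)
    return hmac.compare_digest(expected, sig) and now <= expires_at


url = sign_url("https://objects.example.com/videos", "intro.mp4",
               expires_at=1_700_000_600)
params = parse_qs(urlparse(url).query)
expires_at = int(params["expires"][0])
sig = params["sig"][0]
```

Because the expiry is part of the signed message, a client cannot extend the link's lifetime by editing the `expires` parameter without invalidating the signature.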
Object storage is best suited for web assets that may otherwise stress the bandwidth, disk space, or connection limits of your existing hosting setup.
Metadata Use Cases
Another beneficial feature of object storage is metadata. Beyond file and folder names, classic filesystems and block storage do not provide additional information about the file itself. Files are identified by a name (that is not necessarily unique) within a folder hierarchy, which may or may not be descriptive. This is not ideal, especially when dealing with large numbers of unstructured files. Folder hierarchies that rely on dates, project names, or other subjective names quickly become complicated. Filesystem character limits also severely restrict how much context a file name can carry. This can result in a separate database or metadata solution being needed.
Objects are assigned a universally unique identifier (UUID). Metadata is then used by other services to process, manipulate, and otherwise work with these objects.
Having rich metadata facilitates automation and adds greater clarity to the object you are working with. Worker processes can then read and take actions on objects based on that metadata.
As an example of how metadata can be used, imagine a tool that scrapes images from Twitter for classification. The tool finds an image on Twitter, uploads it to object storage, and then adds a metadata “tag”, marking the image as “unclassified”. Another process then searches for all objects with a tag of “unclassified”, and processes them. This could be done through tools such as TensorFlow or another analytic process. Once an image has been processed, that process removes the “unclassified” tag from the object’s metadata and replaces it with the results. Your primary application is then free to explore the processed objects and look for the metadata relevant to the task at hand.
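The tagging workflow described above can be sketched with an in-memory stand-in for a bucket. Everything here is hypothetical: the dictionary layout, the `status` and `label` metadata fields, and the size-based `classify` placeholder, which stands in for a real model such as one built with TensorFlow.

```python
bucket = {}  # object key -> {"data": bytes, "metadata": dict}


def upload(key, data):
    # The scraper uploads the image and tags it as not yet classified.
    bucket[key] = {"data": data, "metadata": {"status": "unclassified"}}


def find_by_metadata(field, value):
    # Return a list so the caller can safely modify objects while looping.
    return [key for key, obj in bucket.items()
            if obj["metadata"].get(field) == value]


def classify(data):
    # Placeholder for a real classifier; here we only look at the size.
    return "large" if len(data) > 4 else "small"


def run_worker():
    # Process every unclassified object, then swap the tag for the result.
    for key in find_by_metadata("status", "unclassified"):
        metadata = bucket[key]["metadata"]
        del metadata["status"]
        metadata["label"] = classify(bucket[key]["data"])


upload("img-001", b"abc")
upload("img-002", b"abcdefgh")
run_worker()
```

After the worker runs, no object carries the “unclassified” tag, and the primary application can query on `label` instead.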
Data Lakes
A data lake is normally a single object bucket where organizations and, in some cases, individuals dump raw data from various sources. The key benefit of a data lake is that all files live together and can be cross-referenced. Structured data (such as exported database rows/columns), semi-structured data (CSV, logs, XML, JSON, YAML, etc.), and unstructured/binary data (documents, emails, images, audio/video, applications) can then use metadata for organization. Ideally, automated tasks will review the data in the lake and take action, with analytics and other operations keeping things organized and relevant.
The term “data lake” was coined in response to the “data mart” concept. A data mart is a smaller repository of curated information gathered from raw data. It does not contain raw data, but instead has curated or filtered results relevant to operations. An issue with this approach is that it tends to silo information into isolated stacks. This isolation can make it harder to cross evaluate with other siloed datasets. Another disadvantage of a data mart — when compared to a data lake — is that the raw data is not always preserved. Once processed, the raw data is removed, leaving only the interpretation of that data. This makes it difficult or impossible to go back to the raw data if an error occurs or new filters are added. Corrupted data can lead to gaps that would otherwise be avoided if the raw files were saved.
By preserving source data and improving the ability to cross-analyze, data lakes solve the problems inherent in data marts. But while compelling, data lakes are not without their risks. Without proper management and metadata, the lake risks becoming a data “swamp”. For example, data lakes can encourage organizations to store everything forever, in the hope that the information will be useful down the road. If that data is not managed, you may not be able to find it when you need it. Data lakes are commonly implemented without planning and assumed to be a magic bullet for data analysis. As with any tool, they are great when used properly. But when used improperly, data lakes can cost you time and money, destroy assets, and generally cause mayhem and chaos.
Object and Block Storage Security
Along with different use cases, object and block storage also have different approaches to security. General data security can be broken down into how to secure storage and how to secure access. Encryption and Access Control Lists (ACLs) secure both object and block storage, but their implementations are different.
In the cloud, block storage can be thought of as a physical disk that is attached to your instance. Security is applied to the entire disk by either permitting or denying the connection. Encryption can also be applied to the disk, either automatically by the cloud service or self-managed by the user. The filesystem then manages additional local OS restrictions on files and folders. Web or other non-filesystem access can be managed by system services such as Apache or Samba.
Object storage security, on the other hand, is handled very differently. Buckets and objects are either public or private. Public objects and buckets are available to anyone with the web link, while private objects and buckets require users to be authenticated and authorized before they can access the data. Furthermore, some object storage platforms allow object permissions to override bucket permissions through ACLs, bucket policies, or both. Uploading to buckets generally requires an authenticated API request. Many providers also allow you to grant access to other accounts on the same platform. Individual objects can be encrypted before they are uploaded; alternatively, many cloud providers offer automatic server-side encryption of objects and buckets.
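The idea of object permissions overriding bucket permissions can be sketched as a small resolution function. This is a simplified model under stated assumptions: an ACL is either the marker `PUBLIC` or a set of authorized user names, and an object with no ACL of its own inherits the bucket's. Real platforms have richer policy languages.

```python
PUBLIC = "public"  # marker meaning "anyone with the link"


def can_read(bucket_acl, object_acl=None, user=None):
    """Decide read access; an object ACL, when present, wins over the bucket's.

    ACLs are either PUBLIC or a set of authorized user names.
    An object_acl of None means the object inherits the bucket ACL.
    """
    acl = bucket_acl if object_acl is None else object_acl
    if acl == PUBLIC:
        return True
    # Private: the caller must be authenticated and listed in the ACL.
    return user is not None and user in acl


private_bucket = {"alice"}  # hypothetical bucket readable only by alice

alice_reads = can_read(private_bucket, user="alice")          # True
anonymous_reads = can_read(private_bucket)                    # False
published_object = can_read(private_bucket, object_acl=PUBLIC)  # True
```

The last call shows a single object being published from an otherwise private bucket, which is the override behaviour some platforms allow.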
With a simple yet flexible access and security model, object storage is an attractive choice for many applications. By default, only you, authenticated with your credentials, can access your objects. By adding ACLs, you can manage read and write access for other users at the bucket level.
Conclusion
Individuals and organizations are looking for ways to cut costs, streamline operations, and expand the use of their data. For certain workloads, object storage provides significant benefits compared to block storage. Having a first-class, cloud-enabled API makes moving data to the cloud a streamlined and ready-to-automate process. Metadata and unique identifiers provide additional context to that data. Data can be retrieved by a web URL for rapid and flexible access.
Publishing objects publicly can also reduce the load on existing infrastructure. In addition, object storage enables flexible security, allowing buckets to provide global permissions that can be overridden for individual objects.
Beyond direct cost savings, treating files as objects is a cloud-independent way to streamline data storage and access across multiple cloud providers. Object storage creates new opportunities to collect, manage, and analyze data.