Apache HBase

HBase is the Hadoop database. It is a distributed, scalable, big data store. It provides random, real-time read/write access to your Big Data.

When to Use

Hadoop Ecosystem: Deep integration with HDFS, Hive, Spark.
Petabyte Scale: Serving billions of rows with low latency.
Random Access: When you need random reads and writes on data stored in HDFS, which on its own is write-once, read-many (WORM).

You talk to HBase through the Java client API or the HBase shell. In the shell:

create 'users', 'info', 'data'
put 'users', 'row1', 'info:name', 'Alice'
get 'users', 'row1'

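Columns are grouped into column families. A column is addressed as family:qualifier, so info:name and info:email are two columns in the info family. All columns of a family are stored together physically.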
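The layout can be pictured with a small toy model. This is only a sketch of the logical structure, assuming one store per column family; the ToyTable class and its methods are hypothetical and not part of any HBase API.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: one separate store per column family (not real HBase code)."""

    def __init__(self, families):
        # family name -> {row key -> {qualifier -> value}}
        self.stores = {family: defaultdict(dict) for family in families}

    def put(self, row, column, value):
        # A column name has the form "family:qualifier".
        family, qualifier = column.split(":", 1)
        if family not in self.stores:
            # Families are fixed at table creation; qualifiers are free-form.
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row][qualifier] = value

    def get(self, row):
        # Reassemble the full row by visiting every family's store.
        result = {}
        for family, rows in self.stores.items():
            for qualifier, value in rows.get(row, {}).items():
                result[f"{family}:{qualifier}"] = value
        return result


table = ToyTable(["info", "data"])
table.put("row1", "info:name", "Alice")
table.put("row1", "info:email", "alice@example.com")
table.put("row1", "data:visits", "3")
row = table.get("row1")
```

Reading only info:name touches just the info store in this model, which is the reason to keep columns that are read together in the same family.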

HBase scales horizontally by splitting each table into Regions (contiguous ranges of row keys) and hosting them on RegionServers.
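As a rough illustration of how a row key maps to a region, the sketch below does a binary search over region start keys. The split points and the region-to-server assignment are made-up values, and the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed split points. They define 4 regions over the sorted key space:
# region 0: keys < "g", region 1: ["g", "n"), region 2: ["n", "t"), region 3: >= "t"
SPLIT_POINTS = ["g", "n", "t"]

# Assumed assignment of regions to servers (one server may host several regions).
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def region_for(row_key):
    # Start keys are inclusive, so a key equal to a split point
    # belongs to the region that starts there.
    return bisect_right(SPLIT_POINTS, row_key)


def server_for(row_key):
    return REGION_SERVERS[region_for(row_key)]
```

Because regions are ranges of sorted keys, rows with adjacent keys live in the same region, which is what makes range reads cheap and sequential writes concentrate on one server.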

Each write goes to the Write-Ahead Log (WAL, on disk) and to the MemStore (in RAM). When the MemStore fills, it is flushed to HDFS as a new HFile.
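The write path can be sketched as a toy model. It assumes a flush threshold counted in entries for simplicity (real HBase flushes by MemStore size in bytes), and the ToyRegionStore class is hypothetical.

```python
class ToyRegionStore:
    """Toy write path: WAL append, MemStore update, flush to a sorted file."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log, stands in for the on-disk WAL
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable, sorted files, stands in for HFiles
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # durability first
        self.memstore[row] = value      # then the in-memory copy
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out sorted by row key, then start a fresh buffer.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row):
        # Newest data wins: check the MemStore, then files from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None


store = ToyRegionStore(flush_threshold=2)
store.put("b", "1")
store.put("a", "2")   # second entry reaches the threshold and triggers a flush
store.put("a", "3")   # newer value stays in the MemStore for now
```

After these three puts the model holds one sorted file and one buffered entry, and a read of "a" returns the newer buffered value rather than the flushed one.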

Do:

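Design Row Keys carefully: Row keys determine sort order and which region a row lands in. "Hotspotting", where sequential keys such as timestamps send every write to the same region, is the enemy. Salt or hash the key prefix to spread the load.
Pre-split Regions: A new table starts with a single region unless you pre-split it. Pre-split based on your known key distribution.
Use Phoenix: Apache Phoenix provides a SQL layer over HBase, so it can be queried much like a relational database.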
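The first two points work together, and a minimal sketch may help. It assumes 4 salt buckets, a two-digit prefix format, and SHA-256 as the hash; all of these are illustrative choices, and the function names are hypothetical.

```python
import hashlib

NUM_BUCKETS = 4  # assumed bucket count; in practice sized to the cluster


def salted_key(row_key, buckets=NUM_BUCKETS):
    # Derive the salt from the key itself so a reader can recompute it.
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


def split_points(buckets=NUM_BUCKETS):
    # One region per bucket: a boundary at the start of every bucket but the first.
    return [f"{b:02d}-" for b in range(1, buckets)]


# Sequential keys that would otherwise all land in the same region.
keys = [f"event-{i:07d}" for i in range(1000)]
salted = [salted_key(k) for k in keys]
buckets_used = {s[:2] for s in salted}
```

The list returned by split_points could be passed as split keys when creating the table, so each bucket starts in its own region. The trade-off is that a range scan over the original key order has to be issued once per bucket.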

Don't:

Don't use for small data: The overhead of running HDFS, ZooKeeper, and HBase is huge. It only pays off at terabyte scale and beyond.
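Don't scan excessively: A full table scan reads every region. Bound scans with start and stop row keys, and treat an unavoidable full scan as a batch job (for example MapReduce), not an online query.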
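To show why a bounded scan is cheaper, here is a small sketch over a sorted list of row keys. It assumes a composite key of the form user#date, which is an illustrative convention, and the scan function is a hypothetical stand-in for a real client scan.

```python
from bisect import bisect_left

# Assumed sample keys, kept sorted the way HBase keeps rows sorted.
ROW_KEYS = sorted([
    "user1#2024-01-01",
    "user1#2024-01-02",
    "user2#2024-01-01",
    "user3#2024-01-05",
])


def scan(sorted_keys, start_row=None, stop_row=None):
    # The start row is inclusive and the stop row is exclusive.
    lo = 0 if start_row is None else bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


full = scan(ROW_KEYS)                        # touches every row
# "$" is the character right after "#", so this range covers the prefix "user1#".
bounded = scan(ROW_KEYS, "user1#", "user1$")
```

The bounded call jumps straight to the first matching key and stops at the boundary, so its cost depends on the size of the range and not on the size of the table.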