hmsv2api

Protocol Documentation

Table of Contents

metastore.proto

General notes

Design goals

Some notes on protobuf proto3

AddPartitionRequest

Add a single partition to a table.

A partition is described by a list of "values", one value per field of the partition schema. There is no validation that the values actually match the partition schema. Each partition belongs to a table and each table belongs to a database.

Field Type Label Description
sequence uint64   Request sequence (used for bulk requests)
catalog string    
db_id Id    
table_id Id    
partition Partition    
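
Because the values are not validated against the partition schema, a client may want to check them before sending the request. The following is a minimal sketch of such a check. It assumes that "one value per partition schema" means one value per entry in the table's partitionKeys; the FieldSchema dataclass is a hypothetical stand-in for the generated message class, and check_partition_values is not part of this API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldSchema:
    """Hypothetical stand-in for the generated FieldSchema message."""
    name: str
    type: str
    comment: str = ""


def check_partition_values(partition_keys: List[FieldSchema], values: List[str]) -> None:
    """Raise ValueError unless there is exactly one value per partition key."""
    if len(values) != len(partition_keys):
        raise ValueError(
            f"expected {len(partition_keys)} partition values, got {len(values)}"
        )


# Example table partitioned by two keys (names are illustrative only).
keys = [FieldSchema("year", "int"), FieldSchema("month", "int")]
check_partition_values(keys, ["2018", "07"])  # passes: two keys, two values
```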

AddPartitionResponse

Response to an AddPartitionRequest. The sequence field matches the response to its request.

Field Type Label Description
sequence uint64   Comes from request
status RequestStatus    
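
For bulk requests, the sequence field is what lets a caller pair each response with the request that produced it. The sketch below shows one way a client might find requests that failed or were never answered. The dataclasses are hypothetical stand-ins for the generated classes, and failed_sequences is an illustrative helper, not part of this API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

STATUS_OK = 0  # number of RequestStatus.Status.STATUS_OK in this document


@dataclass
class RequestStatus:
    """Hypothetical stand-in for the generated RequestStatus message."""
    status: int = STATUS_OK
    error: str = ""


@dataclass
class AddPartitionResponse:
    """Hypothetical stand-in for the generated AddPartitionResponse message."""
    sequence: int
    status: RequestStatus = field(default_factory=RequestStatus)


def failed_sequences(sent: Iterable[int],
                     responses: Iterable[AddPartitionResponse]) -> List[int]:
    """Return sequence numbers that failed or never received a response."""
    by_seq: Dict[int, AddPartitionResponse] = {r.sequence: r for r in responses}
    return [
        seq for seq in sent
        if seq not in by_seq or by_seq[seq].status.status != STATUS_OK
    ]


responses = [
    AddPartitionResponse(1),
    AddPartitionResponse(3, RequestStatus(status=1, error="general error")),
]
retry = failed_sequences([1, 2, 3], responses)  # request 2 unanswered, 3 failed
```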

AlterDatabaseRequest

Alter database

Field Type Label Description
catalog string   Catalog this database belongs to
id Id   Database ID. Database can be found by name or id
database Database   Database object
cookie string   Session cookie

CreateDatabaseRequest

Create a new database.

If database.id.id is empty, it will be assigned a unique ID. database.seq_id is assigned when the database is created.

Field Type Label Description
catalog string   Catalog this database belongs to
database Database   Database object
cookie string   Session cookie
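
The ID handling described above can be sketched as follows. This is an in-memory illustration only, not the service's implementation: the document does not say how unique IDs or sequence IDs are generated, so the use of a UUID and a per-catalog counter are both assumptions, and CatalogSketch is a hypothetical name.

```python
import itertools
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Id:
    """Hypothetical stand-in for the generated Id message."""
    name: str = ""
    id: str = ""


@dataclass
class Database:
    """Hypothetical stand-in; only the fields needed here are modelled."""
    id: Id = field(default_factory=Id)
    seq_id: int = 0


class CatalogSketch:
    """Illustrative in-memory catalog (assumption, not the real service)."""

    def __init__(self) -> None:
        self._next_seq = itertools.count(1)
        self.databases: Dict[str, Database] = {}

    def create_database(self, db: Database) -> Database:
        if not db.id.id:
            # Assumption: any unique string is acceptable as an ID.
            db.id.id = uuid.uuid4().hex
        # seq_id is assigned at creation time.
        db.seq_id = next(self._next_seq)
        self.databases[db.id.id] = db
        return db


catalog = CatalogSketch()
first = catalog.create_database(Database(Id(name="sales")))
second = catalog.create_database(Database(Id(name="hr", id="fixed-id")))
```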

CreateTableRequest

Create a new table.

Field Type Label Description
catalog string    
db_id Id    
table Table    
cookie string    

Database

Database is a container for tables.

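Database object has two sets of parameters: user parameters (parameters) and system parameters (system_parameters), which can't be set by the user.

Database has two IDs: a unique database ID (id) and a sequence ID (seq_id) that is unique within the catalog.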

The original Metastore Database object also had owner information. This can be represented using system parameters if needed, since the current metastore service does not interpret owner info.

Field Type Label Description
id Id   Unique database ID
seq_id uint64   Unique sequence ID within catalog
location string   Default location of database objects
parameters Database.ParametersEntry repeated Database user parameters
system_parameters Database.SystemParametersEntry repeated System parameters (can't be set by user)

Database.ParametersEntry

Field Type Label Description
key string    
value string    

Database.SystemParametersEntry

Field Type Label Description
key string    
value string    

DropDatabaseRequest

Request to drop a database. Dropping a database also drops all objects contained in the database. TODO: Add flag to prohibit dropping non-empty databases

Field Type Label Description
catalog string    
id Id    
cookie string    

DropPartitionsRequest

Delete partition.

A partition is described by a list of "values", one value per field of the partition schema. There is no validation that the values actually match the partition schema. TODO: have flag for specifying fields in the returned value

Field Type Label Description
catalog string    
db_id Id    
table_id Id    
values PartitionValues repeated  
cookie string    

DropTableRequest

Request to drop a table. Dropping a table also drops all objects contained in the table. TODO: Add flag to prohibit dropping of non-empty table

Field Type Label Description
catalog string    
db_id Id    
id Id    
cookie string    

FieldSchema

FieldSchema defines name and type for each column.

Field Type Label Description
name string   name of the field
type string   type of the field.
comment string   User description

GetDatabaseRequest

Request to get database by its ID.

Database can be located by either part of the ID. If id.id is specified, it will be used first, otherwise id.name is used. One of these must be specified.

Field Type Label Description
catalog string   Catalog this database belongs to
id Id   Database ID. Database can be found by name or id
cookie string   Session cookie
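
The lookup rule above can be sketched as a small function. This sketch reads "used first" as "used instead of the name when present", with no fallback to the name if the ID is not found; that reading is an assumption. The two dictionaries and find_database are hypothetical, and the Id dataclass stands in for the generated message.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Id:
    """Hypothetical stand-in for the generated Id message."""
    name: str = ""
    id: str = ""


def find_database(by_id: Dict[str, str],
                  by_name: Dict[str, str],
                  ident: Id) -> Optional[str]:
    """Look up a database by id.id if set, otherwise by id.name."""
    if ident.id:
        return by_id.get(ident.id)
    if ident.name:
        return by_name.get(ident.name)
    raise ValueError("either id.id or id.name must be specified")


# Illustrative indexes; the values stand in for Database objects.
by_id = {"d1": "sales database"}
by_name = {"sales": "sales database"}
found = find_database(by_id, by_name, Id(name="sales"))
```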

GetDatabaseResponse

Result of GetDatabase request

The result consists of the database information (which may be empty in case of failure) and request status. TODO: specify error cases

Field Type Label Description
database Database    
status RequestStatus    

GetPartitionRequest

Get partition information.

A partition is described by a list of "values", one value per field of the partition schema. There is no validation that the values actually match the partition schema.

Field Type Label Description
catalog string    
db_id Id    
table_id Id    
values string repeated  

GetPartitionResponse

Field Type Label Description
partition Partition    
status RequestStatus    

GetTableRequest

Request to get table by its ID.

Field Type Label Description
catalog string    
db_id Id   Database ID
id Id    
cookie string    

GetTableResponse

Field Type Label Description
table Table    
status RequestStatus    

Id

Objects have unique name and unique ID.

The name of an object can change but its ID never changes. This allows caching of objects by ID. Both name and ID are just sequences of bytes; there are no assumptions about encoding or length. Implementations may enforce specific assumptions.

Field Type Label Description
name string   Object name
id string   Permanent object ID
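
Since the ID is permanent while the name may change, a client-side cache can key on the ID and stay valid across renames. A minimal sketch, in which ObjectCache is a hypothetical helper and Id is a stand-in for the generated message:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Id:
    """Hypothetical stand-in for the generated Id message."""
    name: str
    id: str


class ObjectCache:
    """Illustrative client-side cache keyed by the permanent object ID."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Id] = {}

    def put(self, ident: Id) -> None:
        # A renamed object has the same ID, so it overwrites the same slot.
        self._by_id[ident.id] = ident

    def get(self, object_id: str) -> Optional[Id]:
        return self._by_id.get(object_id)


cache = ObjectCache()
cache.put(Id(name="events", id="t-42"))
cache.put(Id(name="events_v2", id="t-42"))  # same object after a rename
```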

ListDatabasesRequest

Request to get list of databases. If exclude_params is set, the result may omit parameters.

Field Type Label Description
catalog string    
cookie string    
name_pattern string    
exclude_params bool    
fields string repeated Field selectors. If specified, only certain fields are sent. The following fields are supported: location, id, id.name, parameters
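
Field selectors act as a projection over the returned object. The sketch below applies the four selectors listed for ListDatabasesRequest to a database represented as a plain dictionary. The dictionary layout, the select_fields helper, and the choice to reject unknown selectors are all assumptions made for illustration; the document does not say how the service treats unsupported selectors.

```python
from typing import Any, Dict, List

# Selectors listed for ListDatabasesRequest in this document.
SUPPORTED = {"location", "id", "id.name", "parameters"}


def select_fields(database: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Return only the selected fields; an empty selector list keeps everything."""
    if not fields:
        return database
    unknown = set(fields) - SUPPORTED
    if unknown:
        # Assumption: unsupported selectors are treated as an error.
        raise ValueError(f"unsupported field selectors: {sorted(unknown)}")
    result: Dict[str, Any] = {}
    for selector in fields:
        if selector == "id.name":
            result.setdefault("id", {})["name"] = database["id"]["name"]
        elif selector == "id":
            result["id"] = dict(database["id"])
        else:
            result[selector] = database[selector]
    return result


db = {
    "id": {"name": "sales", "id": "d1"},
    "location": "/warehouse/sales",
    "parameters": {"owner_team": "analytics"},
}
projected = select_fields(db, ["id.name", "location"])
```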

ListPartitionsRequest

Return all partitions in a table

Field selectors.

If specified, only certain fields are sent. The following fields are supported:

Field Type Label Description
catalog string    
db_id Id    
table_id Id    
cookie string    
fields string repeated  
values PartitionValues repeated  
exclude string repeated  

ListTablesRequest

Request to get list of tables.

Field Type Label Description
catalog string    
db_id Id    
cookie string    
fields string repeated Field selectors. If specified, only certain fields are sent. The following fields are supported: id (table Id), location (table location), parameters (table user parameters), partkeys (table partition keys)

Order

sort order of a column (column name along with asc/desc)

Field Type Label Description
col string   sort column name
ascending bool   asc(1) or desc(0)

Partition

Partition

Field Type Label Description
id Id    
seq_id uint64   Sequential ID within table
values string repeated Values for each partition
sd StorageDescriptor   Partition descriptor
parameters Partition.ParametersEntry repeated User parameters
location string   Partition location
table Table   Enclosing table

Partition.ParametersEntry

Field Type Label Description
key string    
value string    

PartitionValues

Field Type Label Description
value string repeated  

RequestStatus

General status for results.

All non-streaming requests should return RequestStatus.

Field Type Label Description
status RequestStatus.Status   request status
error string   detailed error message

SerDeInfo

Serialization/Deserialization information

Field Type Label Description
type SerdeType   Serde type. If CUSTOM, use the name
name string   name of the serde, table name by default
serializationLib string   usually the class that implements the extractor & loader
parameters SerDeInfo.ParametersEntry repeated initialization parameters. NOTE: Should we enum this as well?

SerDeInfo.ParametersEntry

Field Type Label Description
key string    
value string    

StorageDescriptor

StorageDescriptor holds all the information about physical storage of the data belonging to a table

Field Type Label Description
cols FieldSchema repeated required (refer to types defined above)
inputFormat InputFormat   Specification of input format. If custom, use inputFormatName.
inputFormatName string   Name of input format if custom
outputFormat OutputFormat   Specification of output format. If custom, use outputFormatName.
outputFormatName string   Name of output format if custom
numBuckets int32   this must be specified if there are any dimension columns
serdeInfo SerDeInfo   serialization and deserialization information
bucketCols string repeated reducer grouping columns and clustering columns and bucketing columns
sortCols Order repeated sort order of the data in each bucket
parameters StorageDescriptor.ParametersEntry repeated any user supplied key value hash
system_parameters StorageDescriptor.SystemParametersEntry repeated Internal parameters not settable by user
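
The enum-plus-name pattern used by inputFormat and inputFormatName (and likewise by outputFormat and SerDeInfo.type) can be resolved on the reading side as sketched below. The enum numbers are copied from the InputFormat table in this document; describe_input_format is a hypothetical helper, and treating an empty inputFormatName as an error is an assumption. Note that in proto3 an unset enum field reads as its zero value, which here is IF_CUSTOM.

```python
from enum import IntEnum


class InputFormat(IntEnum):
    """Numbers copied from the InputFormat table in this document."""
    IF_CUSTOM = 0
    IF_SEQUENCE = 1
    IF_TEXT = 2
    IF_HIVE = 3
    IF_PARQUET = 4


def describe_input_format(input_format: InputFormat, input_format_name: str = "") -> str:
    """Return the custom name for IF_CUSTOM, otherwise the enum member name."""
    if input_format is InputFormat.IF_CUSTOM:
        if not input_format_name:
            # Assumption: a custom format without a name is unusable.
            raise ValueError("inputFormatName is required when inputFormat is IF_CUSTOM")
        return input_format_name
    return input_format.name


known = describe_input_format(InputFormat.IF_PARQUET)
# "com.example.MyInputFormat" is a made-up class name for illustration.
custom = describe_input_format(InputFormat.IF_CUSTOM, "com.example.MyInputFormat")
```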

StorageDescriptor.ParametersEntry

Field Type Label Description
key string    
value string    

StorageDescriptor.SystemParametersEntry

Field Type Label Description
key string    
value string    

Table

Table information

Field Type Label Description
id Id    
seq_id uint64   Sequential ID within database
sd StorageDescriptor   storage descriptor of the table
partitionKeys FieldSchema repeated partition keys of the table. only primitive types are supported
tableType TableType   table type enum, e.g. EXTERNAL_TABLE
parameters Table.ParametersEntry repeated User-settable parameters
system_parameters Table.SystemParametersEntry repeated Internal parameters
location string   Table location

Table.ParametersEntry

Field Type Label Description
key string    
value string    

Table.SystemParametersEntry

Field Type Label Description
key string    
value string    

InputFormat

Known Input Formats. CUSTOM means that it should be specified as a string.

Name Number Description
IF_CUSTOM 0  
IF_SEQUENCE 1  
IF_TEXT 2  
IF_HIVE 3  
IF_PARQUET 4  

OutputFormat

Known Output Formats. CUSTOM means that it should be specified as a string.

Here is a list of known output formats:

Name Number Description
OF_CUSTOM 0  
OF_SEQUENCE 2  
OF_IGNORE_KEY 3  
OF_HIVE 4  
OF_PARQUET 5  

RequestStatus.Status

Name Number Description
STATUS_OK 0 successful request
STATUS_ERROR 1 General error
STATUS_NOTFOUND 2 Requested object not found
STATUS_CONFLICT 3 Object already exists
STATUS_BUSY 4 Object is busy/used and can't be accessed/destroyed
STATUS_INTERNAL_ERR 5 Internal server error
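
A client will often want to turn a non-OK status into an exception. The sketch below shows one possible approach. The Status numbers are copied from the table above; MetastoreError and raise_for_status are hypothetical names and not part of this API.

```python
from enum import IntEnum


class Status(IntEnum):
    """Numbers copied from the RequestStatus.Status table in this document."""
    STATUS_OK = 0
    STATUS_ERROR = 1
    STATUS_NOTFOUND = 2
    STATUS_CONFLICT = 3
    STATUS_BUSY = 4
    STATUS_INTERNAL_ERR = 5


class MetastoreError(Exception):
    """Hypothetical client-side exception carrying the status code."""

    def __init__(self, code: Status, message: str) -> None:
        super().__init__(f"{code.name}: {message}")
        self.code = code
        self.message = message


def raise_for_status(status: int, error: str = "") -> None:
    """Raise MetastoreError for any status other than STATUS_OK."""
    code = Status(status)
    if code is not Status.STATUS_OK:
        raise MetastoreError(code, error)


raise_for_status(0)  # STATUS_OK: returns without raising
```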

SerdeType

Known SerDes are represented using enum. Unknown ones are represented using strings.

Name Number Description
SERDE_CUSTOM 0  
SERDE_LAZY_SIMPLE 1  
SERDE_AVRO 2  
SERDE_JSON 3  
SERDE_ORC 4  
SERDE_REGEX 5  
SERDE_THRIFT 6  
SERDE_PARQUET 7 "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
SERDE_CSV 8  

SerializationLib

Name Number Description
SL_CUSTOM 0 Unknown lib, use string
SL_LAZY_SIMPLE 1 LazySimpleSerDe
SL_PARQUET 2 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

TableType

Name Number Description
TTYPE_MANAGED 0  
TTYPE_EXTERNAL 1  
TTYPE_INDEX 2  

Metastore

Metastore service

This API is different from traditional metastore API. It separates all metadata-only operations and does not include any filesystem operations. The assumption is that some other service or the client deals with file system operations.

This API also uses cookies to associate requests with a session. The value of the cookie is likely to be printed in logs, so it shouldn't contain any sensitive information. The Metastore service does not interpret the cookie but may print it in its logs. We could call it SessionId, but callers may decide to use it for other purposes, so a generic term is used here.

Method Name Request Type Response Type Description
CreateDabatase CreateDatabaseRequest GetDatabaseResponse Create a new database.
GetDatabase GetDatabaseRequest GetDatabaseResponse Get database information
ListDatabases ListDatabasesRequest Database Return all databases in a catalog
DropDatabase DropDatabaseRequest RequestStatus Destroy the database
AlterDatabase AlterDatabaseRequest GetDatabaseResponse Alter database
CreateTable CreateTableRequest GetTableResponse Create a new table
GetTable GetTableRequest GetTableResponse Get table information
ListTables ListTablesRequest Table Get all tables from a database
DropTable DropTableRequest RequestStatus Destroy a table
AddPartition AddPartitionRequest AddPartitionResponse Add partition to a table
AddManyPartitions AddPartitionRequest AddPartitionResponse Add multiple partitions. The first request contains DB and table info, followed by others, for which db and table info is not needed
GetPartition GetPartitionRequest GetPartitionResponse Get partition information
ListPartitions ListPartitionsRequest Partition List all partitions in a table
DropPartitions DropPartitionsRequest RequestStatus Drop partition
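
The AddManyPartitions convention, where only the first request carries database and table info, can be sketched as a request generator. The dataclasses are hypothetical stand-ins for the generated classes. Sending the catalog only with the first request and numbering sequences from 1 are assumptions; the document only states that DB and table info is not needed after the first request.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Id:
    """Hypothetical stand-in for the generated Id message."""
    name: str = ""
    id: str = ""


@dataclass
class Partition:
    """Hypothetical stand-in; only the values field is modelled."""
    values: List[str] = field(default_factory=list)


@dataclass
class AddPartitionRequest:
    """Hypothetical stand-in for the generated AddPartitionRequest message."""
    sequence: int
    partition: Partition
    catalog: str = ""
    db_id: Optional[Id] = None
    table_id: Optional[Id] = None


def add_many_requests(catalog: str, db_id: Id, table_id: Id,
                      partitions: Iterable[Partition]) -> Iterator[AddPartitionRequest]:
    """Yield one request per partition; only the first names the DB and table."""
    for sequence, partition in enumerate(partitions, start=1):
        if sequence == 1:
            yield AddPartitionRequest(sequence, partition, catalog, db_id, table_id)
        else:
            yield AddPartitionRequest(sequence, partition)


requests = list(add_many_requests(
    "default", Id(name="sales"), Id(name="orders"),
    [Partition(["2018", "07"]), Partition(["2018", "08"])],
))
```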

Scalar Value Types

.proto Type Notes C++ Type Java Type Python Type
double   double double float
float   float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool   bool boolean boolean
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str
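
The size notes in the table above follow from how variable-length encoding works. The sketch below computes encoded sizes for a few values to illustrate why negative int32 values are expensive, why sint32 is cheaper for them, and where fixed32 starts to win over uint32. It follows the published protobuf wire format (base-128 varints and ZigZag encoding), but it is a standalone illustration and does not use any protobuf library; all function names are hypothetical.

```python
def varint_size(value: int) -> int:
    """Bytes needed to encode a non-negative integer as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """Encoded size of an int32: negatives are sign-extended to 64 bits."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """ZigZag-map a signed 32-bit integer to an unsigned one (used by sint32)."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def sint32_size(value: int) -> int:
    """Encoded size of a sint32 value."""
    return varint_size(zigzag32(value))


FIXED32_SIZE = 4  # fixed32 always occupies four bytes

minus_one_as_int32 = int32_size(-1)    # ten bytes
minus_one_as_sint32 = sint32_size(-1)  # one byte
```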