Hive Metastore Utils¶
About Hive Metastore¶
The Hive Metastore is a database that stores metadata for Hive tables.
To configure SparklySession to work with an external Hive Metastore, you need to set the hive.metastore.uris option.
You can do this via a hive-site.xml file in the Spark config ($SPARK_HOME/conf/hive-site.xml):
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
or set it dynamically via SparklySession options:
from sparkly import SparklySession

class MySession(SparklySession):
    options = {
        'hive.metastore.uris': 'thrift://<n.n.n.n>:9083',
    }
Tables management¶
Why: you need to check if tables exist, rename them, drop them, or even overwrite existing aliases in your catalog.
from sparkly import SparklySession
spark = SparklySession()
assert spark.catalog_ext.has_table('my_table') in {True, False}
spark.catalog_ext.rename_table('my_table', 'my_new_table')
spark.catalog_ext.create_table('my_new_table', path='s3://my/parquet/data', source='parquet', mode='overwrite')
spark.catalog_ext.drop_table('my_new_table')
Table properties management¶
Why: sometimes you want to assign custom attributes to your table, e.g. creation time, last update, purpose, or data source. The only way to interact with table properties in Spark is through raw SQL queries. We implemented a more convenient interface to make your code cleaner.
from sparkly import SparklySession
spark = SparklySession()
spark.catalog_ext.set_table_property('my_table', 'foo', 'bar')
assert spark.catalog_ext.get_table_property('my_table', 'foo') == 'bar'
assert spark.catalog_ext.get_table_properties('my_table') == {'foo': 'bar'}
Note properties are stored as strings. If you need other types, consider using a serialisation format, e.g. JSON.
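A minimal sketch of round-tripping a structured value through JSON; the table name and property key are hypothetical, and the live-session calls are shown as comments:

```python
import json

# Properties are stored as strings, so serialize structured values first.
tags = ['daily', 'production']
encoded = json.dumps(tags)

# With a live session (assuming a table 'my_table' exists):
# spark.catalog_ext.set_table_property('my_table', 'tags', encoded)
# restored = spark.catalog_ext.get_table_property('my_table', 'tags', to_type=json.loads)

# Passing to_type=json.loads decodes the raw string back to the original value:
assert json.loads(encoded) == tags
```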
Using a non-default database¶
Why: to split your warehouse into logical groups (for example, by system components). All catalog_ext.* methods accept fully-qualified table names of the form <db-name>.<table-name> and operate on them properly.
from time import time
from sparkly import SparklySession
spark = SparklySession()
if spark.catalog_ext.has_database('my_database'):
    spark.catalog_ext.rename_table(
        'my_database.my_badly_named_table',
        'my_database.new_shiny_name',
    )
    spark.catalog_ext.set_table_property(
        'my_database.new_shiny_name',
        'last_update_at',
        time(),
    )
Note be careful with 'USE' statements like spark.sql('USE my_database'): they are stateful and may lead to weird errors if code assumes a particular current database.
API documentation¶
class sparkly.catalog.SparklyCatalog(spark)[source]¶
A set of tools to interact with the Hive Metastore.
create_table(table_name, path=None, source=None, schema=None, **options)[source]¶
Create a table in the metastore.
Extends SparkSession.Catalog.createExternalTable by accepting a mode='overwrite' option, which creates the table even if a table with the same name already exists. All other args are exactly the same.
Note
If the table exists, create two unique names, one for the new and one for the old instance, then try to swap names and drop the "old" instance. If any step fails, the metastore might be left in a broken state.
Parameters: mode (str) – if set to 'overwrite', drop any table of the same name from the metastore. Given as a kwarg. Default is to error out if the table already exists.
Returns: DataFrame associated with the created table.
Return type: pyspark.sql.DataFrame
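The swap described in the note above can be sketched with a plain dict standing in for the metastore; overwrite_table and the catalog dict are illustrative, not part of the sparkly API:

```python
import uuid

def overwrite_table(catalog, name, new_table):
    """Illustrative sketch of the overwrite swap; `catalog` is a plain dict
    standing in for the metastore, not a real SparklyCatalog."""
    # Register the new instance under a unique temporary name.
    tmp_new = '%s_new_%s' % (name, uuid.uuid4().hex)
    catalog[tmp_new] = new_table
    if name in catalog:
        # Move the old instance aside, promote the new one, then drop the old.
        tmp_old = '%s_old_%s' % (name, uuid.uuid4().hex)
        catalog[tmp_old] = catalog.pop(name)
        catalog[name] = catalog.pop(tmp_new)
        del catalog[tmp_old]
    else:
        catalog[name] = catalog.pop(tmp_new)

catalog = {'my_table': 'old data'}
overwrite_table(catalog, 'my_table', 'new data')
assert catalog == {'my_table': 'new data'}
```

If any step between the two renames fails, the table is reachable only under a temporary name, which is the "broken state" the note warns about.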
drop_table(table_name, checkfirst=True)[source]¶
Drop a table from the metastore.
Note
Follow the official documentation to understand DROP TABLE semantics: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropTable
Parameters: - table_name (str) – A table name.
- checkfirst (bool) – Only issue DROPs for tables that are present in the database.
get_table_properties(table_name)[source]¶
Get table properties from the metastore.
Parameters: table_name (str) – A table name.
Returns: Key/value pairs for the properties.
Return type: dict[str,str]
get_table_property(table_name, property_name, to_type=None)[source]¶
Get a table property value from the metastore.
Parameters: - table_name (str) – A table name. Might contain a db name, e.g. 'my_table' or 'default.my_table'.
- property_name (str) – A property name to read the value for.
- to_type (function) – Cast the value to the given type, e.g. int or float.
Returns: Any
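Since the metastore stores every property as a string, to_type presumably just converts the raw value on read; a sketch with a made-up stored value (the live-session call is shown as a comment):

```python
# The metastore hands back strings; to_type converts on read.
raw = '1618486740'  # a hypothetical stored value

assert int(raw) == 1618486740
assert float(raw) == 1618486740.0

# With a live session (assuming the table and property exist):
# spark.catalog_ext.get_table_property('my_table', 'last_update_at', to_type=int)
```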
has_database(db_name)[source]¶
Check if a database exists in the metastore.
Parameters: db_name (str) – A database name.
Returns: bool
has_table(table_name)[source]¶
Check if a table is available in the metastore.
Parameters: table_name (str) – A table name.
Returns: bool
rename_table(old_table_name, new_table_name)[source]¶
Rename a table in the metastore.
Note
Follow the official documentation to understand ALTER TABLE semantics: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable
Parameters: - old_table_name (str) – The current table name.
- new_table_name (str) – The expected new table name.