A closer look at the cost-based optimizer in Spark.

The Spark SQL optimizer uses two types of optimizations: rule-based and cost-based. The former relies on heuristic rules, while the latter can also use statistical properties of the data. In this article, we will explain how these statistics are used in Spark under the hood, see in which situations they are useful, and learn how to take advantage of them.

Most of the optimizations that Spark does are based on heuristic rules that do not take into account the properties of the data being processed. For example, the PredicatePushDown rule is based on the heuristic that it is better to first reduce the data by filtering and only then apply some computation on it. There are, however, situations in which Spark can also use statistical information about the data to come up with an even better plan, and this is often referred to as cost-based optimization, or CBO. In this article, we will explore it in more detail.

To see the statistics of a table, we first need to compute them by running a SQL statement (notice that all SQL statements can be executed in Spark using the sql() function: spark.sql(sql_statement_as_string)):

ANALYZE TABLE table_name COMPUTE STATISTICS

After this, the table-level statistics are computed and saved in the metastore, and we can see them by calling

DESCRIBE EXTENDED table_name

This will show us some properties of the table, including the table-level statistics. There are two metrics available, namely rowCount and sizeInBytes. The point is that these statistics are also visible in the query plan, so for each operator you can see what the estimates of the stats are after the various transformations (an example of displaying the plan with these estimates is shown below). The stats are first computed by the Relation operator, which is a so-called leaf node; each leaf node is responsible for computing the stats somehow, and they are then propagated through the plan according to certain rules. Next, we will see how the leaf node computes the stats and how the propagation works. There are three ways in which the leaf node can compute the statistics. The first (and best) way is that the stats are taken from the metastore.
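To make this concrete, here is a minimal PySpark sketch of computing the table-level statistics, storing them in the metastore, and reading them back with DESCRIBE EXTENDED. The table name table_name is a placeholder; the rest uses only standard Spark APIs.

```python
from pyspark.sql import SparkSession

# A SparkSession with Hive support, so the statistics can be persisted
# in the metastore:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compute the table-level statistics (rowCount and sizeInBytes) and save
# them in the metastore:
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS")

# DESCRIBE EXTENDED returns the table properties as a DataFrame; the
# Statistics row now shows the size in bytes and the row count:
spark.sql("DESCRIBE EXTENDED table_name").show(truncate=False)
```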
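To see how the estimates look for each operator in the plan, one possibility (in Spark 3.x) is the cost mode of explain(), which prints the optimized logical plan together with the estimated statistics of every node. This is only a sketch: the column name some_column is a placeholder, and enabling spark.sql.cbo.enabled is optional but lets the optimizer make use of the row counts computed above.

```python
# Enable the cost-based optimizer (it is disabled by default):
spark.conf.set("spark.sql.cbo.enabled", "true")

df = spark.table("table_name").filter("some_column > 100")

# The "cost" explain mode prints the optimized logical plan with the
# estimated sizeInBytes (and rowCount, where available) of each operator:
df.explain(mode="cost")
```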