The code block shown below should return a copy of DataFrame transactionsDf with an added column cos. This column should have the values in column value converted to degrees and having
the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))
Correct code block:
transactionsDf.withColumn('cos', round(cos(degrees(transactionsDf.value)),2))
This question is especially confusing because the options col('cos') and 'cos' look so similar. Similar-looking answer options can also appear in the exam and, just like in this question, you need to pay attention to the details to identify the correct answer option.
The first answer option to throw out is the one that starts with withColumnRenamed: the question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, however, so you cannot use it here.
Next, you will have to decide what should be in gap 2, the first argument of transactionsDf.withColumn(). Looking at the documentation (linked below), you can find out that the first argument of
withColumn actually needs to be a string with the name of the column to be added. So, any answer that includes col('cos') as the option for gap 2 can be disregarded.
This leaves you with two possible answers. The real difference between these two answers is where the cos and degrees methods go, either in gaps 3 and 4, or vice versa. From the question you can find out that the new column should have 'the values in column value converted to degrees and having the cosine of those converted values taken'. This prescribes a clear order of operations: first, you convert the values from column value to degrees, and then you take the cosine of those converted values. So, the inner parenthesis (gap 4) should contain the degrees method and then, logically, gap 3 holds the cos method. This leaves you with just one possible correct answer.
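If you want to verify this yourself, here is a minimal runnable sketch of the correct answer, assuming a local SparkSession and a made-up stand-in for transactionsDf with a numeric value column (the sample values are invented for illustration only):

from pyspark.sql import SparkSession
from pyspark.sql.functions import cos, degrees, round

spark = SparkSession.builder.getOrCreate()

# Made-up stand-in for transactionsDf; the values are arbitrary.
transactionsDf = spark.createDataFrame([(1, 0.5), (2, 1.0), (3, 3.14)],
                                       ["transactionId", "value"])

# degrees() converts the values (interpreted as radians) to degrees,
# cos() is applied to the converted values, and round(..., 2) trims to two decimals.
transactionsDf.withColumn("cos", round(cos(degrees(transactionsDf.value)), 2)).show()

Note that importing round from pyspark.sql.functions shadows Python's built-in round in this script; in larger code bases you may prefer importing the functions module under an alias.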
More info: pyspark.sql.DataFrame.withColumn --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 49 (Databricks import instructions)
Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?
transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark
documentation for the union command (link below).
transactionsDf.select('value', 'productId').distinct()
Wrong. This code block returns unique rows (distinct combinations of value and productId) in a two-column DataFrame, but not the unique values across both columns in a single column.
transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls).
transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
No. This command will count the number of rows, but will not return unique values.
transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read up on the difference between union and join; a link is posted below.
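If you want to try the union-and-distinct pattern yourself, here is a minimal sketch, assuming a local SparkSession and made-up data standing in for transactionsDf:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up rows; note the overlap between the two columns (3 appears in both).
transactionsDf = spark.createDataFrame([(1, 3), (2, 3), (3, 1)],
                                       ["value", "productId"])

# union() stacks the two single-column DataFrames on top of each other,
# distinct() then removes duplicates, leaving one column of unique values.
(transactionsDf.select("value")
 .union(transactionsDf.select("productId"))
 .distinct()
 .show())

The resulting column takes its name from the first DataFrame in the union, i.e. value.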
More info: pyspark.sql.DataFrame.union --- PySpark 3.1.2 documentation, sql - What is the difference between JOIN and UNION? - Stack Overflow
Static notebook | Dynamic notebook: See test 3, Question 21 (Databricks import instructions)
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
transactionsDf.select('storeId').dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count('storeId')).dropDuplicates()
No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame containing the number of non-null values in column storeId. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count('storeId'))
Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not consider only column storeId; it eliminates full-row duplicates instead. The subsequent count therefore does not reflect the number of unique values in column storeId.
transactionsDf.distinct().select('storeId').count()
Wrong. transactionsDf.distinct() identifies unique rows across all columns, not unique rows with respect to column storeId alone. This may leave duplicate values in column storeId, so the count does not represent the number of unique values in that column.
transactionsDf.select(distinct('storeId')).count()
False. There is no distinct method in pyspark.sql.functions.
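A minimal sketch of the correct answer, again assuming a local SparkSession and made-up data for transactionsDf:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up rows; storeId 25 appears twice on purpose.
transactionsDf = spark.createDataFrame([(1, 25), (2, 2), (3, 25)],
                                       ["transactionId", "storeId"])

# Reduce to the storeId column, drop duplicate values, count the remaining rows.
print(transactionsDf.select("storeId").dropDuplicates().count())  # prints 2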
The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which
dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+
Code block:
transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
You can clearly see that column transactionDate should be dropped only after transactionTimestamp has been written. This is because to generate column transactionTimestamp, Spark needs to
read the values from column transactionDate.
Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. So, to convert those correctly, you would have to pass yyyy-MM-dd HH:mm. In other words:
The string indicating the date format should be adjusted.
While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to
use here.
Also, there is no DataFrame.withColumnReplaced() operator. A similar operator that exists is DataFrame.withColumnRenamed().
Whether you use col() or not is irrelevant with unix_timestamp() - the command is fine with both.
Finally, you cannot assign a column like transactionsDf['columnName'] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported in Spark.
So, you need to use Spark's DataFrame.withColumn() syntax instead.
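Putting those fixes together, a corrected version might look like the sketch below, assuming a local SparkSession and a made-up two-column stand-in for transactionsDf:

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Made-up stand-in with only the columns needed for this example.
transactionsDf = spark.createDataFrame(
    [(1, "2020-04-26 15:35"), (2, "2020-04-13 22:01")],
    ["transactionId", "transactionDate"],
)

# Build transactionTimestamp first (using the full date format), then drop the source column.
transactionsDf = (transactionsDf
                  .withColumn("transactionTimestamp",
                              unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm"))
                  .drop("transactionDate"))
transactionsDf.show()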
More info: pyspark.sql.functions.unix_timestamp --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 28 (Databricks import instructions)
The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)
No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
Correct! Find out more about partitionBy() in the documentation (linked below).
The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
No. There is no information about whether files should be overwritten in the question.
The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
Incorrect. To write a DataFrame to disk, you need to work with a DataFrameWriter object, which you get access to through the DataFrame.write property (no parentheses involved).
Column storeId should be wrapped in a col() operator.
No, this is not necessary - the problem is in the partitionOn command (see above).
The partitionOn method should be called before the write method.
Wrong. First of all, partitionOn is not a valid method of DataFrame. However, even if partitionOn were replaced by partitionBy (which is a valid method), partitionBy is a method of DataFrameWriter and not of DataFrame. So, you would always have to first call DataFrame.write to get access to the DataFrameWriter object and only afterwards call partitionBy.
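A minimal sketch of the corrected write, assuming a local SparkSession, made-up data for transactionsDf and a placeholder output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up stand-in for transactionsDf.
transactionsDf = spark.createDataFrame([(1, 25), (2, 2), (3, 25)],
                                       ["transactionId", "storeId"])

filePath = "/tmp/transactionsParquet"  # placeholder location, not taken from the question

# DataFrame.write returns a DataFrameWriter; partitionBy() and parquet() are writer methods.
transactionsDf.write.partitionBy("storeId").parquet(filePath)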
More info: pyspark.sql.DataFrameWriter.partitionBy --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 33 (Databricks import instructions)