Free Access to Google.Professional-Data-Engineer.v2022-10-14.q166 with Valid Practice Test (Page 11)

Question 46

Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)

A.Unsupervised learning to determine which transactions are most likely to be fraudulent.
B.Reinforcement learning to predict the location of a transaction.
C.Unsupervised learning to predict the location of a transaction.
D.Supervised learning to determine which transactions are most likely to be fraudulent.
E.Clustering to divide the transactions into N categories based on feature similarity.
F.Supervised learning to predict the location of a transaction.
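
To make "divide the transactions into N categories based on feature similarity" concrete, here is a minimal k-means sketch in plain Python. It is an illustration only: it clusters on the transaction amount alone, the sample amounts are hypothetical, and the function name kmeans_1d is not from any library.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on a list of numbers; returns (centroids, labels)."""
    # Deterministic start: spread the initial centroids across the sorted values.
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical transaction amounts (not taken from the question).
amounts = [12.0, 15.5, 9.9, 480.0, 510.0, 495.0, 5200.0, 4900.0]
centroids, labels = kmeans_1d(amounts, k=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2]
```

No labels are needed to form these groups, which is what distinguishes clustering from the supervised options that require a known target column.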

Question 47

The Dataflow SDKs have been recently transitioned into which Apache service?

A.Apache Spark
B.Apache Hadoop
C.Apache Kafka
D.Apache Beam

Question 48

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the data type of DT to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

A.Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
B.Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
C.Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
D.Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
E.Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
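
As a rough illustration of what casting the DT strings into timestamps involves, the sketch below does the conversion in plain Python rather than in BigQuery. It assumes DT holds epoch seconds (the question does not state the unit), and the sample values and the helper name epoch_string_to_timestamp are hypothetical.

```python
from datetime import datetime, timezone


def epoch_string_to_timestamp(dt: str) -> datetime:
    """Convert an epoch-seconds string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(dt), tz=timezone.utc)


# Hypothetical DT values for one user's clicks, stored as strings.
clicks = ["1665705600", "1665705660", "1665706500"]
timestamps = [epoch_string_to_timestamp(s) for s in clicks]

# Session duration: time between the first and last click.
session_duration = max(timestamps) - min(timestamps)
print(session_duration)  # 0:15:00
```

Doing this conversion once and storing the result avoids repeating the string parsing on every later query, which is the cost the question asks you to weigh.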

Question 49

You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

A.Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
B.Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
C.Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.
D.Increase the size of your Parquet files so that each is at least 1 GB.

Question 50

Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

A.Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
B.Introduce data compression for each file to increase the rate of file transfer.
C.Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
D.Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
E.Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
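
The figures in this question can be checked with some quick arithmetic. The sketch below is a back-of-the-envelope estimate, not a model of SFTP: it assumes each file costs at least one 200 ms round trip and that files are sent one at a time, neither of which the question states. All constant names are illustrative.

```python
FILES_PER_HOUR = 20_000
MAX_FILE_BYTES = 4 * 1024       # each file is under 4 KB
LATENCY_S = 0.2                 # 200 ms to Google Cloud
BANDWIDTH_BPS = 50_000_000      # 50 Mbps link
SECONDS_PER_HOUR = 3600

# Upper bound on the share of the link the payload needs.
bits_per_hour = FILES_PER_HOUR * MAX_FILE_BYTES * 8
bandwidth_utilization = bits_per_hour / (BANDWIDTH_BPS * SECONDS_PER_HOUR)

# Assumption: one round trip per file, transfers strictly sequential.
serial_wait_s = FILES_PER_HOUR * LATENCY_S
doubled_wait_s = 2 * FILES_PER_HOUR * LATENCY_S

print(f"bandwidth used: {bandwidth_utilization:.2%}")
print(f"latency cost per hour of files: {serial_wait_s:.0f} s")
print(f"after volume doubles: {doubled_wait_s:.0f} s")
```

Under these assumptions the payload uses well under 1% of the link, while per-file latency alone adds up to roughly an hour of waiting for each hour of files. That is consistent with the question's remark that the design barely keeps up even though bandwidth utilization is low.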
