Amazon Exam MLS-C01 Topic 3 Question 104 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 104
Topic #: 3

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

Suggested Answer: A

Comprehensive Explanation: The best way to split the dataset is to pick a date so that 80% of the data points precede it, assign those points to the training dataset, and assign the remaining 20% to the validation dataset. This method preserves the temporal order of the data and ensures that the validation dataset reflects the most recent trends and patterns in the commodity price. That matters for forecasting models that rely on time series analysis and sequential data. The other methods would either introduce bias or lose information by ignoring the temporal structure of the data. A random split, for example, lets the model train on prices that come after some of the validation points, which tends to inflate its apparent accuracy.
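As a rough illustration of the date-based split described above, here is a minimal Python sketch. The function name `chronological_split`, the `(day, price)` tuple layout, and the sample prices are all assumptions made for this example; they are not part of the exam question or of any AWS API.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split (day, price) records into training and validation sets by time.

    Hypothetical helper for illustration: the earliest `train_fraction`
    of the records become training data, the rest become validation data.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    # Sort by date so the cutoff index corresponds to a cutoff date.
    ordered = sorted(records, key=lambda record: record[0])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Made-up daily prices for ten consecutive days.
start = date(2024, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, validation = chronological_split(prices)
print(len(train), len(validation))       # 8 2
print(train[-1][0], validation[0][0])    # 2024-01-08 2024-01-09
```

Every training date precedes every validation date, so the models are scored only on prices they could not have seen during training.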



Contribute your Thoughts:

Johnna
2 months ago
Wait, are we sure the answer isn't B? Because if it's not, I'm going to be kicking myself for the rest of the day. Option B all the way!
upvoted 0 times
...
Daniel
2 months ago
Option D might sound tempting, but that would just be a random mess. We need to split the data in a way that mimics the real-world scenario the model will be used in.
upvoted 0 times
Jesusa
1 month ago
C: Definitely. Option A ensures that the model is trained on past data and validated on future data, just like in real life.
upvoted 0 times
...
Denise
1 month ago
B: I agree. Option D would not provide a realistic representation of the data. We need to split it properly.
upvoted 0 times
...
Donte
1 month ago
A: Option A seems like the best choice. We need to maintain the chronological order of the data for accurate forecasting.
upvoted 0 times
...
...
Catarina
2 months ago
Haha, I'm just picturing the data scientist flipping a coin to decide which data points go where. But in all seriousness, Option B is the clear winner here.
upvoted 0 times
Cherrie
1 month ago
Definitely, random sampling wouldn't be as effective as choosing a date for the split.
upvoted 0 times
...
Jovita
1 month ago
Yeah, it makes sense to use a specific date to divide the data points.
upvoted 0 times
...
Erick
2 months ago
I agree, Option B is the most logical choice for splitting the dataset.
upvoted 0 times
...
...
Lyndia
3 months ago
I think randomly sampling data points for the training dataset is also a valid approach. As long as it's done without replacement, it should provide a good representation of the dataset.
upvoted 0 times
...
James
3 months ago
I agree with Kimberely. It makes sense to split the dataset based on a specific date to ensure a fair comparison of model performance.
upvoted 0 times
...
Kimberely
3 months ago
I think the data scientist should pick a date so that 80% of the data points precede the date and assign them as the training dataset.
upvoted 0 times
...
Destiny
3 months ago
I agree with Stefany. Option B is the way to go. Forecasting models need to be trained on historical data and then tested on future data to see how well they perform.
upvoted 0 times
...
Stefany
3 months ago
Option B makes the most sense. We want the training data to come first in time, so the model can learn from the past and then be validated on the future data.
upvoted 0 times
Carissa
1 month ago
Stratified sampling could introduce bias and not represent the dataset accurately.
upvoted 0 times
...
Fannie
1 month ago
Randomly sampling data points might not capture the time sequence needed for accurate forecasting.
upvoted 0 times
...
Muriel
2 months ago
It's important for the model to learn from past data first before being validated on future data.
upvoted 0 times
...
Nydia
2 months ago
I agree, option B is the best choice for splitting the dataset.
upvoted 0 times
...
...
