Reflection_blog_post

My view of what a data scientist is has not changed much. However, after this course I have a clearer picture of what data scientists do in their work: collecting, cleaning, modeling, and presenting data. I will continue to use R going forward. It is a powerful tool with numerous packages for each of those steps. I would say this course is the best one in my program for learning practical skills for future data science work. I will do several things differently in practice. I will present my work with a Shiny app instead of PowerPoint slides. I will discuss and share my work with others through GitHub instead of sending code by email. I will use R Markdown to generate reports instead of typing and pasting things into Word.


Fourth Post For 558_project2

2021-10-30 558_project2

Yan Liu 2021/10/30

Explanation

The Online News Popularity Data Set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal of this project is to predict the number of shares on social networks (popularity). We first show some summary statistics and plots of the data grouped by weekday. Then we create several models to predict the response, the number of shares, in different channels.

Two links: one to the GitHub page and one to the repo itself.

Thoughts

During this project, the most difficult part was coordinating progress with my group partner. I usually prefer to start early and do a little every day. However, other people may have their own circumstances and prefer to work at a different pace. Next time, I would probably try to communicate with my group partner better and set a clearer timeline for our progress (though it may not help…). I feel we did many things in a hurry during the last two days. In addition, I learned a lot from this process about prediction with different methods, and about how to plan and think when approaching a new data set. I also read a lot about model selection, which definitely gave me a clearer picture of prediction.


Data Scientists vs Statisticians

I think being a data scientist is about extracting useful ideas or making decisions from the complex, high-dimensional data sets of the modern digital world. The major duties of a data scientist involve data ingestion, data transformation, exploratory data analysis, model selection, model evaluation, and data storytelling. To carry out these processes, a data scientist should have a strong background in computer science and mathematics/statistics, plus domain knowledge in the relevant area. Statistics is a crucial component of data science. In my understanding, statisticians often deal with data that have a relatively simple structure and use a single model to fit the data or draw inferences from it, for example in clinical trials. In contrast, data scientists focus on comparing a number of different methods to create the best machine learning model for prediction. However, since both data science and statistics aim to extract knowledge from data, the reality is that each field is weak without the other. Statisticians need to understand the modeling and structure of data, while data scientists need to understand applied statistics. Ultimately, the boundary between these two disciplines will become less clear in the near future as real-world data becomes increasingly complicated. For my part, I will do my best to adapt to this trend by preparing myself with both solid statistical knowledge and proficient skills in the “data science process”.
