Photo by Jordan McDonald on Unsplash

What is an ensemble method? A technique that constructs a set of independent models and predicts class labels by combining their individual predictions. This strategic combination can reduce the total error, for example by decreasing variance (bagging) or bias (boosting), or improve on the performance of any single model (stacking).
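As a minimal illustration of the combining step (my own sketch, not code from the post), the class predictions of several models can be merged by majority vote. The three lambda "models" here are hypothetical stand-ins for real trained classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class labels from several models by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Three stand-in "models": each predicts a class label (0 or 1) for x.
models = [
    lambda x: 1 if x > 0.4 else 0,
    lambda x: 1 if x > 0.5 else 0,
    lambda x: 1 if x > 0.6 else 0,
]

def ensemble_predict(x):
    """Ask every model for a label, then return the most common answer."""
    return majority_vote([m(x) for m in models])
```

In a bagging setup each model would additionally be trained on its own bootstrap sample of the data, which is what makes the combined prediction lower-variance than any single model's.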


Photo by Robin Glauser on Unsplash

What is a Decision Tree? A supervised learning method that uses a tree-like model for classification and regression. It helps to find the relationship between a large number of candidate input variables and a target variable. However, it is a greedy algorithm, so it is not guaranteed to produce the optimal decision tree that minimizes the error.

A decision tree applies if-then-else decision rules to variables to divide a large collection of records into successively smaller sets of records. The goal is to explore the training data and build a model that cleanly splits a large heterogeneous population into smaller, more homogeneous groups…
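To make the splitting idea concrete, here is a small sketch (my own illustration, not from the post) of how a tree greedily scores candidate split points with Gini impurity and picks the threshold that leaves the purest child groups:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a group of class labels (0 = perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels):
    """Greedily choose the threshold with the lowest weighted impurity."""
    return min(set(values), key=lambda t: split_score(values, labels, t))
```

This greediness is exactly why the overall tree is not guaranteed to be optimal: each split is chosen locally, without looking ahead.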



We often ask multiple questions to find insightful inputs when measuring fuzzy concepts such as “service quality”, “consumer trust” or “customer loyalty”. However, there are too many variables, including unimportant or unrelated ones, which causes a dimensionality problem. Therefore, data reduction is necessary in this kind of marketing research. It can be divided into two parts: feature selection and feature extraction.

Feature selection and feature extraction reduce the number of variables by obtaining a set of principal variables. The algorithm behind them helps us choose the relevant and significant variables automatically. …


Photo by Capturing the human heart. on Unsplash

In the previous article “Feature Extraction Using Factor Analysis in R”, we mentioned that besides factor analysis, principal component analysis is also a common way to reduce dimensionality. So here I'm going to use the same data to introduce PCA and show its result on the perceptual map.

What is PCA?

An unsupervised mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated principal components.

Since PCA calculates a new projection of the data set, and the new axes are based on the standard deviation of…
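The post is truncated here, but the core computation can be sketched compactly (my own illustration): center the data, form the covariance matrix, and take its leading eigenvector as the first principal component. For two variables the eigen decomposition has a closed form, so the standard library suffices:

```python
import math

def first_component(xs, ys):
    """First principal component of 2-D data, via the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the 2x2 symmetric matrix, in closed form
    mean = (sxx + syy) / 2
    diff = (sxx - syy) / 2
    lam = mean + math.sqrt(diff ** 2 + sxy ** 2)
    # Corresponding eigenvector = direction of maximum variance, normalized
    vx, vy = lam - syy, sxy
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm
```

With more than two variables the same idea applies, but in practice one would use a linear-algebra library rather than closed forms.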


Photo by Bill Oxford on Unsplash

What is Feature Extraction? A process to reduce the number of features in a dataset by creating new features from the existing ones. The new reduced subset is able to summarize most of the information contained in the original set of features.

Two common methods of feature extraction are factor analysis and principal component analysis. I'll talk about factor analysis first in this post.

To eliminate the correlation between a large number of variables, we use factor analysis to find the underlying factors that the observed variables depend on. As we simplify the data, we also want to retain as much…


Photo by Edu Grande on Unsplash

What is Feature Selection? A process to filter irrelevant or redundant features from the dataset. By doing this, we can reduce the complexity of a model, make it easier to interpret, and also improve the accuracy if the right subset is chosen.

In this post, I will focus on the demonstration of feature selection using wrapper methods in R. Here, I use the “Discover Card Satisfaction Study” data as an example.

cardData = read.csv("Discover_step.csv", header=TRUE)
dim(cardData)
head(cardData)
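The R demonstration is cut off above, so as a language-neutral illustration of the wrapper idea, here is a minimal forward-selection loop in Python (my own sketch). The `score` function is a stand-in: a real wrapper would refit the model, e.g. a regression, on each candidate subset and score it by accuracy or error:

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature that most
    improves the score of the current subset; stop when nothing helps."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each candidate subset: current selection plus one feature
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_f = max(gains)
        if top_score <= best:
            break  # no remaining feature improves the model
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected
```

Backward elimination works the same way in reverse, starting from the full set and dropping the least useful feature each round.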


Photo by Luca Bravo on Unsplash

After data extraction, we usually need to merge the files together for further analysis. It can be several files, but it can also be hundreds. With just a few files, it's easy to copy and paste the data we need into one sheet. However, when dealing with a large number of files, there's no point in doing it manually. So here I'm going to share a three-step method I use to put the data together using Python.

1. Import Library

import pandas as pd  # read and combine the CSV files
import os            # list the files in the directory

To merge the files, we need to use two modules, pandas for reading the CSV…
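The post is cut off here, but the remaining two steps follow naturally from the imports: list the CSV files with os, then read and stack them with pandas. A minimal sketch of that idea (folder and file names here are placeholders, not from the original post):

```python
import os
import pandas as pd

def merge_csv_folder(folder, out_path):
    """Read every .csv file in `folder` and stack them into one file.

    Assumes all files share the same column headers.
    """
    files = sorted(f for f in os.listdir(folder) if f.endswith(".csv"))
    frames = [pd.read_csv(os.path.join(folder, f)) for f in files]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)
    return merged
```

`ignore_index=True` renumbers the rows so the merged frame gets a clean 0..n index instead of each file's own row numbers.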


  Special thanks to my thoughtful friend for sharing his study guide
Photo by Sarah Noltner on Unsplash

SCM 516: Applied Analytics

Time Series Decomposition

  • Chronologically ordered data are referred to as a time series
  • A time series may contain one or many elements — trend, seasonality, cyclical pattern, autocorrelation, and random variation
  • Identifying these elements and separating the time series data into these components is known as decomposition
  • Exponential smoothing — more recent records are given more weight
  • Holt’s model — includes a parameter that is an estimate of the change in the series from one period to the next
  • Winters’ model — includes a parameter that…
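To make the smoothing bullet concrete, here is a minimal sketch (my own illustration, not from the study guide) of simple exponential smoothing, where the weight alpha controls how strongly recent observations dominate:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the new observation and the previous smoothed value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

With alpha near 1 the forecast tracks the latest data closely; with alpha near 0 it changes slowly. Holt's model adds a second smoothed term for the trend, and Winters' adds a third for seasonality.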


It has been about a year since I first applied to graduate school. Looking back, this period has flown by and been very fulfilling. I still remember agonizing at this time last year over whether to go to Peking University or Arizona State University. Today I can say I'm glad I chose to come to the US to study business analytics. Although living across disciplines and cultures is sometimes frustrating, the growth is also visible. In this post I'll share an introduction to, and my thoughts on, studying for the MSBA at ASU.

The Arizona sky is beautiful, and it looks different every day

— Surroundings

Starting with the living environment: our campus is in Tempe, which I find a comfortable city. The weather is good, it rarely rains (fewer than ten rainy days so far), it's not too crowded, transportation is reasonably convenient (there's the light rail and the Orbit bus, and many places are within walking distance), and rent (mostly around $600 to $1,000) and the cost of living are moderate. It lacks the bustle of a big metropolis, but that makes it all the more relaxed and free.

— Program Overview

This is a nine-month program divided into 4 cohorts, each with about 40 students. By nationality, the largest groups in order are India, China, Taiwan, and the US; others, such as Korea, Thailand, France, Russia, and Iran, are in the single digits. Classmates mostly come from engineering and business backgrounds, though quite a few cross over from other fields such as math, humanities, languages, and communications (me). Most already have work experience.

Teams: The program runs on quarters (each equivalent to half a semester), and teams are reassigned every two quarters. A team usually has four people and typically includes one American and one Indian student, though of course there are exceptions.

Assignments: The ratio of group work to individual work is about 40/60. Group assignments are the more complex case studies and projects, while individual work is mainly quizzes and coding.

Exams: In-class quizzes are actually quite frequent, at least one every two weeks (none of them too hard). There are also take-home open-book exams, plus midterms or finals.

Classes: Professors often ask questions in class, and students speak up readily. Because classes are small, there is plenty of discussion and interaction, which I think is a great part for international students, since it trains thinking and verbal expression.

Software: You will need Python, R, SQL, Tableau, and Excel (PrecisionTree, Solver, @RISK); Windows is recommended as the working environment for the assignments.

Resources: The department provides mentors and advisors (academic & career), so you can email them to schedule consultations, and they occasionally send out information about life and jobs. The department has a study room for group discussion or individual study, plus a 24-hour study room.

— Curriculum

The first quarter covers relatively basic content, mainly to bridge into the later courses. The second quarter gets harder: formal coding begins, and the BA material goes deeper.

The third and fourth quarters carry the heaviest load. Besides coding skills and domain knowledge, the focus is on carrying out the capstone project (somewhat like a co-op internship). Since there are fewer classes, though, there is relatively more time to use as you like.

— Technical Content

35% Supervised Machine Learning (classification and regression: decision tree, random forest, neural network, support vector machine, k nearest neighbor, XGBoost)

20% Database Management (enterprise analytics, database, SQL)

20% Marketing Analytics (multi-linear regression, logistic regression, factor analysis, profit analysis, Bass model, CLV)

10% Data Visualization (Tableau dashboard, Excel graph)

10% Recommendation System (collaborative and content-based filtering)

5% Unsupervised Machine Learning (clustering: K-means, text mining)

— Career Paths

Business Analyst and Data Analyst: most graduates target these two roles after graduation. The work leans toward data cleaning and interpretation, and toward providing business recommendations to the company or its clients.

Program Manager: more and more people are considering this direction. It is a leadership role that requires both communication and technical skills (you need to understand the underlying architecture), integrating the needs of both sides and making the decisions.

Data Scientist: some people go this way too, but the required technical skills are stronger. If you only study BA without pushing yourself further (modifying algorithms, building models), you generally will not be up to it.

Of course, you are not limited to these categories, but if you want to stay and work in the US, you cannot avoid them at first, because working on OPT requires a job related to your field of study.

— Reflections

Learning really varies from person to person. For those with a coding background or prior exposure to this field, the school's material may feel too shallow and too slow, so the program will not help their technical growth much; what they learn instead are the other dimensions, soft skills like communication, visualization, and how to apply analytics in business. For beginners, or those with only some rough concepts, I think the program does help build real skills. Of course, you cannot rely entirely on the school; it only points you in a direction and tells you what the industry needs. Finding cases and topics to work on and articles to read by yourself is what helps most; otherwise, when you are asked technical questions in job interviews later, you will only be able to scratch the surface.

People often ask me whether nine months is too short. I hesitated over the same question when choosing schools, but now that I am nearly done it feels just right: the school does not skimp on what it should teach, and the key is how you use your time for extra practice. Besides, the BA world is inherently fast-paced, and I personally think entering the industry earlier is a more effective way to build experience. Of course, such a packed nine months has its downsides. You may lose some of the parts of life you would otherwise enjoy, and slower adapters may feel as if they have only just arrived in the US, still not used to the life, not yet fluent in the language, not yet skilled with the tools, and it is already time to graduate and start job hunting.

Business analytics is broad, and nearly every industry now needs this kind of talent, yet the US market is gradually becoming saturated. So my final advice is to find a field you like (marketing, healthcare, logistics, customer service, education, etc.) and go deep into it; combining it with what you studied in college is also a good approach. That way your job search will have more direction, and you may stand out and have an edge. And of course, keep practicing SQL and Python (especially SQL); once you start job hunting and interviewing, you will find they are the must-have basics.

Thank you for reading to the end. Later I will share course-related content, how I applied what I learned, my job-search process and methods, and my observations and thoughts. If you are interested, feel free to follow and share. If you have questions, you can also email me at kelly.szutu@gmail.com or leave me a message on Instagram: sct.k



As a data journalist, a storyteller, or even a data analyst, teaching yourself Python, R, Tableau or Power BI can be challenging in the beginning. However, there are open-source resources online that have already prepared the models, structures, and ideas for those who are new to this area.

In this post, I will cover 6 helpful methods for presenting data, in order of how cool and amazing I think they are:

1. Interactive Q & A

Kelly Szutu

Journalist x Data Visualization | Data Analyst x Machine Learning | Python, SQL, Tableau | LinkedIn: www.linkedin.com/in/szutuct/
