Monday, January 6, 2014

[Kaggle] Predicting a Biological Response


http://www.kaggle.com/c/bioresponse/leaderboard?submissionId=558913


Random forest with n_estimators=260.

LogLoss = 0.41105
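For the record, a minimal sketch of the pipeline behind this score; the CSV layout (an 'Activity' target column plus anonymized feature columns) is recalled from memory, so check it against the actual files:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Bioresponse training data: 'Activity' is the 0/1 target,
# the remaining columns are anonymized molecular descriptors.
# (Column names assumed; adjust to the actual train.csv.)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = train['Activity']
X = train.drop('Activity', axis=1)

clf = RandomForestClassifier(n_estimators=260)
clf.fit(X, y)

# LogLoss is scored on probabilities, so submit predict_proba,
# not the hard 0/1 labels from predict().
probs = clf.predict_proba(test)[:, 1]
```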

Meng-Gen Tsai | 0.41105 | Mon, 06 Jan 2014 13:34:28 | Post-Deadline

Post-Deadline Entry

If you would have submitted this entry during the competition, you would have been around here on the leaderboard.
Roughly between positions 301 and 303.



The Random Forest Benchmark was trained with an unknown n_estimators, so we can beat it simply by trying different parameter values.

351 (↓36) | idrach55 | 0.41539 | 9 entries | Sat, 12 May 2012 03:02:07 (-46.1h)
Random Forest Benchmark | 0.41540
353 (↓36) | ULJ FRI Team | 0.41540 | 1 entry | Tue, 20 Mar 2012 00:59:47

Sunday, January 5, 2014

Sharing Some Interview Questions





2.  Search a sorted array for the count of k.

Write a method that takes a sorted array A and a key k and returns the count of k.

[Hint:  Binary search commonly asks for the index of any element of a sorted array.  Find the index of the first and last occurrence of k in A.]
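A sketch of the hint's approach in Python, two binary searches for O(log n) overall:

```python
def count_k(A, k):
    """Count occurrences of k in sorted array A using two binary searches."""
    def first_index(strict):
        # Smallest i with A[i] > k (strict) or A[i] >= k (non-strict).
        lo, hi = 0, len(A)
        while lo < hi:
            mid = (lo + hi) // 2
            if A[mid] < k or (strict and A[mid] == k):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # Count = (index just past the last occurrence) - (index of the first).
    return first_index(strict=True) - first_index(strict=False)

print(count_k([1, 2, 2, 2, 3], 2))  # 3
```

Python's bisect module expresses the same idea as bisect_right(A, k) - bisect_left(A, k).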



3.  Sort a given Perl hash by key without using Perl's built-in sort() function.
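The question is Perl-specific, but the core task, ordering keys with a hand-rolled sort, is language-independent; here is a sketch in Python (the language used elsewhere on this blog), with insertion sort standing in for the forbidden built-in:

```python
def sort_hash_by_key(h):
    """Return the (key, value) pairs of dict h ordered by key,
    using insertion sort instead of the built-in sort."""
    keys = []
    for k in h:
        # Find where k belongs among the keys placed so far.
        i = 0
        while i < len(keys) and keys[i] < k:
            i += 1
        keys.insert(i, k)
    return [(k, h[k]) for k in keys]

print(sort_hash_by_key({'b': 2, 'a': 1, 'c': 3}))
# [('a', 1), ('b', 2), ('c', 3)]
```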



4.  Implement a queue by using stacks.

[Hint:  Implement two methods enqueue() and dequeue().]
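A sketch of the classic two-stack answer; each element is pushed and popped at most twice, so both operations are amortized O(1):

```python
class StackQueue:
    """FIFO queue built from two LIFO stacks, per the hint."""
    def __init__(self):
        self.inbox = []   # receives new elements
        self.outbox = []  # serves elements in reversed (FIFO) order

    def enqueue(self, x):
        self.inbox.append(x)

    def dequeue(self):
        if not self.outbox:
            # Flip the inbox over; this reverses the order once.
            while self.inbox:
                self.outbox.append(self.inbox.pop())
        return self.outbox.pop()  # raises IndexError if the queue is empty
```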



5.  Reverse the ordering of words in a string.  For example: "My name is X Y Z" to "Z Y X is name My".  Do not use tokenizers.
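One common trick that avoids tokenizers: reverse the whole character array, then reverse each word back in place. A sketch:

```python
def reverse_words(s):
    """Reverse word order without split()/tokenizers."""
    chars = list(s)
    chars.reverse()                    # "Z Y X si eman yM"
    i = 0
    while i < len(chars):
        if chars[i] == ' ':
            i += 1
            continue
        j = i
        while j < len(chars) and chars[j] != ' ':
            j += 1
        chars[i:j] = chars[i:j][::-1]  # un-reverse this word
        i = j
    return ''.join(chars)

print(reverse_words("My name is X Y Z"))  # "Z Y X is name My"
```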



6.  Fibonacci spiral.


We move a finger along the Fibonacci spiral, starting from the origin O = (0, 0).

It passes through (1, 1), (0, 2), (-2, 0), (1, -3) in sequence.  Label these points Pi, that is, P1 = (1, 1), P2 = (0, 2), P3 = (-2, 0), P4 = (1, -3), and so on.  Now write a method that returns the coordinates of Pn.
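From the four given points, the displacement from P(n-1) to Pn is F(n) times a diagonal unit step that rotates 90 degrees counterclockwise on every move: (1, 1), (-1, 1), (-1, -1), (1, -1), repeating. Assuming that pattern generalizes (it is extrapolated from the four samples, not stated in the problem), a sketch:

```python
def spiral_point(n):
    """Coordinate of P_n, assuming each arc advances F(k) steps along a
    diagonal direction that rotates 90 degrees counterclockwise per move."""
    directions = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
    x = y = 0
    a, b = 1, 1  # Fibonacci numbers F(1), F(2)
    for k in range(n):
        dx, dy = directions[k % 4]
        x, y = x + a * dx, y + a * dy
        a, b = b, a + b
    return (x, y)

# Matches the given points:
# spiral_point(1) == (1, 1),  spiral_point(2) == (0, 2),
# spiral_point(3) == (-2, 0), spiral_point(4) == (1, -3)
```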


[Kaggle] Digit Recognizer


Random forest.


Because of Python's memory limits, I only trained on 25,000 rows of the training data, and I did not feed in every pixel either; feeding everything in blows up the memory. On top of that, n_trees never reached 1000; around 200 was already the ceiling.

But there are 40,000 training rows in total, and the benchmark hands the full, untouched data to its random forest with n_trees set to 1000, so my score came out below the benchmark's.


Dammit.

246 (new) | Meng-Gen Tsai | 0.96229 | 3 entries | Sun, 05 Jan 2014 12:20:47
Next step: look into RandomizedPCA, since stuffing in every feature is not a real solution. I played with it today without getting the hang of it, and training the model eats a lot of time. To be continued.
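For later reference, a minimal sketch of that RandomizedPCA idea; the data and component count here are placeholders, and newer scikit-learn spells RandomizedPCA as PCA(svd_solver='randomized'):

```python
import numpy as np
from sklearn.decomposition import RandomizedPCA  # PCA(svd_solver='randomized') in newer releases
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the 28x28-pixel digit data; replace with the real train.csv.
rng = np.random.RandomState(0)
X = rng.rand(1000, 784)
y = rng.randint(0, 10, size=1000)

# Compress the 784 raw pixels to 50 components (an illustrative choice)
# so the forest fits in memory with far fewer features.
pca = RandomizedPCA(n_components=50, random_state=0)
X_reduced = pca.fit_transform(X)

clf = RandomForestClassifier(n_estimators=200)  # the ~200-tree ceiling noted above
clf.fit(X_reduced, y)
```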



Installed R.

One run with n_trees = 1000: 0.96743.
One run with n_trees = 800: 0.96843, finally beating the RF benchmark.

140 (new) | Meng-Gen Tsai | 0.96843 | 5 entries | Tue, 07 Jan 2014 02:22:13

Your Best Entry

You improved on your best score by 0.00100.
You just moved up 37 positions on the leaderboard.
152 (↓29) | Ravi Chandibhamar | 0.96829 | 6 entries | Mon, 09 Dec 2013 21:04:39 (-44.9h)
Random Forest | 0.96829
154 (↓29) | Thomas Hepner | 0.96829 | 1 entry | Sat, 14 Dec 2013 22:50:53




Friday, January 3, 2014

[Kaggle] Data Science London + Scikit-learn


106 (new) | Meng-Gen Tsai | 0.91282 | 3 entries | Sat, 04 Jan 2014 04:03:02 (-1.1h)
SVM.

While trying to optimize the parameters, I consulted this guide:

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

The score went down instead; my own take is overfitting.
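The linked page tunes the RBF kernel's C and gamma over a grid with cross-validation; a minimal sketch of that procedure, with placeholder data and illustrative grid values:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2014-era releases

# Stand-in for the competition's training matrix (40 anonymized features).
rng = np.random.RandomState(0)
X = rng.rand(1000, 40)
y = rng.randint(0, 2, size=1000)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```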



The legendary random forests, and the other sklearn ensemble learners, did not work well here: their cross-validation scores were lower than the SVM's, and the submitted scores were lower too. Dammit, there goes a wasted submission.

Kaggle only gives us 5 submissions per day.



PCA: time to start grinding through the PCA-related papers.

43 (new) | Meng-Gen Tsai | 0.94523 | 13 entries | Wed, 08 Jan 2014 06:54:36 (-12.2h)

[Kaggle] titanic-gettingStarted


https://www.kaggle.com/c/titanic-gettingStarted/leaderboard


1111 (new) | Meng-Gen Tsai | 0.72727 | 2 entries | Fri, 03 Jan 2014 09:51:42
Sigh, this is probably about the limit of random forest here.



605 (new) | Meng-Gen Tsai | 0.77990 | 6 entries | Sat, 04 Jan 2014 01:42:40 (-7.3h)
But yours truly is no pushover either: if random forest cannot cut it, try other algorithms.

See scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html), a very good Python machine learning library.

Of course I do not rule out using R either; whatever moves my rank up. Dammit.
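As a placeholder for "other algorithms", a sketch of swapping in gradient boosting from that same ensemble module; the features are stand-ins, since the real Titanic columns need encoding first:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for preprocessed Titanic features (sex, class, fare, ...);
# 891 is the size of the Titanic training set.
rng = np.random.RandomState(0)
X = rng.rand(891, 6)
y = rng.randint(0, 2, size=891)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X, y)
predictions = clf.predict(X)
```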



Overfitting really is an important topic in machine learning.

339 (new) | Meng-Gen Tsai | 0.78947 | 14 entries | Mon, 06 Jan 2014 07:41:36
A high cross_val_score only means the cross-validation score is high; it does not guarantee a high score on the test data.
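For context, the number being quoted comes from something like this (cross_val_score lived in sklearn.cross_validation in 2014-era releases):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation back then

# Placeholder features/labels standing in for the encoded Titanic data.
rng = np.random.RandomState(0)
X = rng.rand(891, 6)
y = rng.randint(0, 2, size=891)

scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(scores.mean())  # a high CV mean still says nothing about the hidden test set
```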

249 (new) | Meng-Gen Tsai | 0.79426 | 15 entries | Mon, 06 Jan 2014 07:54:36
Continuing to burn through the submission count :p




162 (new) | Meng-Gen Tsai | 0.79904 | 24 entries | Wed, 08 Jan 2014 09:23:11

Your Best Entry

You improved on your best score by 0.00478.
You just moved up 89 positions on the leaderboard.
Parameter tuning.

Wednesday, January 1, 2014

[FWD] Ensemble learning

Reference: http://www.jdl.ac.cn/user/hchang/course_staffs/08_ensemble%20learning.pdf


There is no single learning algorithm that in any domain always induces the most accurate learner.


Ensemble learning:
  • We construct a group of base learners which, when combined, has higher accuracy than the individual learners.
  • The base learners are usually not chosen for their accuracy, but for their simplicity.
  • The base learners should be accurate on different instances, specializing in different subdomains of the problem, so that they can complement each other.

This feels like the spirit of VirusTotal: we never know the final verdict for sure, but the more antivirus engines flag a file, the more likely it is a virus.
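That analogy is literally majority voting; a toy sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-learner outputs by taking the most common label,
    like counting how many antivirus engines flag a file."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['virus', 'clean', 'virus']))  # 'virus'
```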



Bagging (bootstrap aggregating)
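A sketch of the bagging procedure itself, assuming NumPy arrays, integer class labels, and scikit-learn-style base learners with fit/predict (scikit-learn's BaggingClassifier packages the same idea):

```python
import numpy as np

def bagging_predict(base_learners, X, y, X_new, rng=np.random):
    """Fit each base learner on a bootstrap resample of (X, y),
    then majority-vote their predictions on X_new."""
    votes = []
    n = len(X)
    for clf in base_learners:
        idx = rng.randint(0, n, size=n)  # sample n rows with replacement
        clf.fit(X[idx], y[idx])
        votes.append(clf.predict(X_new))
    votes = np.array(votes)
    # Majority vote across learners for each new sample.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```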



AdaBoost
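And the usual scikit-learn form of AdaBoost, boosting depth-1 decision stumps; data and parameter values here are placeholders:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)

# Each round reweights the training set toward previously
# misclassified samples before fitting the next stump.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
clf.fit(X, y)
```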