Monday, January 6, 2014

[Kaggle] Predicting a Biological Response


http://www.kaggle.com/c/bioresponse/leaderboard?submissionId=558913


Random forest with n_estimators=260.

LogLoss = 0.41105
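For the record, a minimal sketch of the pipeline behind this score; the CSV layout (an 'Activity' target column plus anonymized feature columns) is recalled from memory, so check it against the actual files:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Bioresponse training data: 'Activity' is the 0/1 target,
# the remaining columns are anonymized molecular descriptors.
# (Column names assumed; adjust to the actual train.csv.)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = train['Activity']
X = train.drop('Activity', axis=1)

clf = RandomForestClassifier(n_estimators=260)
clf.fit(X, y)

# LogLoss is scored on probabilities, so submit predict_proba,
# not the hard 0/1 labels from predict().
probs = clf.predict_proba(test)[:, 1]
```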

Meng-Gen Tsai | 0.41105 | Mon, 06 Jan 2014 13:34:28 | Post-Deadline

Post-Deadline Entry

If you would have submitted this entry during the competition, you would have been around here on the leaderboard.
Roughly between positions 301 and 303.



The Random Forest Benchmark was trained with an unknown n_estimators, so we can beat it simply by trying different parameter values.

351 (↓36) | idrach55 | 0.41539 | 9 entries | Sat, 12 May 2012 03:02:07 (-46.1h)
Random Forest Benchmark | 0.41540
353 (↓36) | ULJ FRI Team | 0.41540 | 1 entry | Tue, 20 Mar 2012 00:59:47

Sunday, January 5, 2014

Sharing Some Interview Questions





2.  Search a sorted array for the count of k.

Write a method that takes a sorted array A and a key k and returns the count of k.

[Hint:  Binary search commonly asks for the index of any element of a sorted array.  Find the index of the first and last occurrence of k in A.]
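A sketch of the hint's approach in Python, two binary searches for O(log n) overall:

```python
def count_k(A, k):
    """Count occurrences of k in sorted array A using two binary searches."""
    def first_index(strict):
        # Smallest i with A[i] > k (strict) or A[i] >= k (non-strict).
        lo, hi = 0, len(A)
        while lo < hi:
            mid = (lo + hi) // 2
            if A[mid] < k or (strict and A[mid] == k):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # Count = (index just past the last occurrence) - (index of the first).
    return first_index(strict=True) - first_index(strict=False)

print(count_k([1, 2, 2, 2, 3], 2))  # 3
```

Python's bisect module expresses the same idea as bisect_right(A, k) - bisect_left(A, k).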



3.  Sort a given Perl hash by key without using Perl's built-in sort() function.
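The question is Perl-specific, but the core task, ordering keys with a hand-rolled sort, is language-independent; here is a sketch in Python (the language used elsewhere on this blog), with insertion sort standing in for the forbidden built-in:

```python
def sort_hash_by_key(h):
    """Return the (key, value) pairs of dict h ordered by key,
    using insertion sort instead of the built-in sort."""
    keys = []
    for k in h:
        # Find where k belongs among the keys placed so far.
        i = 0
        while i < len(keys) and keys[i] < k:
            i += 1
        keys.insert(i, k)
    return [(k, h[k]) for k in keys]

print(sort_hash_by_key({'b': 2, 'a': 1, 'c': 3}))
# [('a', 1), ('b', 2), ('c', 3)]
```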



4.  Implement a queue by using stacks.

[Hint:  Implement two methods enqueue() and dequeue().]
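A sketch of the classic two-stack answer; each element is pushed and popped at most twice, so both operations are amortized O(1):

```python
class StackQueue:
    """FIFO queue built from two LIFO stacks, per the hint."""
    def __init__(self):
        self.inbox = []   # receives new elements
        self.outbox = []  # serves elements in reversed (FIFO) order

    def enqueue(self, x):
        self.inbox.append(x)

    def dequeue(self):
        if not self.outbox:
            # Flip the inbox over; this reverses the order once.
            while self.inbox:
                self.outbox.append(self.inbox.pop())
        return self.outbox.pop()  # raises IndexError if the queue is empty
```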



5.  Reverse the ordering of words in a string.  For example: "My name is X Y Z" to "Z Y X is name My".  Do not use tokenizers.
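One common trick that avoids tokenizers: reverse the whole character array, then reverse each word back in place. A sketch:

```python
def reverse_words(s):
    """Reverse word order without split()/tokenizers."""
    chars = list(s)
    chars.reverse()                    # "Z Y X si eman yM"
    i = 0
    while i < len(chars):
        if chars[i] == ' ':
            i += 1
            continue
        j = i
        while j < len(chars) and chars[j] != ' ':
            j += 1
        chars[i:j] = chars[i:j][::-1]  # un-reverse this word
        i = j
    return ''.join(chars)

print(reverse_words("My name is X Y Z"))  # "Z Y X is name My"
```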



6.  Fibonacci spiral.


We move a finger along the Fibonacci spiral, starting from the origin O = (0, 0).

It passes through (1, 1), (0, 2), (-2, 0), (1, -3) in sequence.  Label these points Pi, that is, P1 = (1, 1), P2 = (0, 2), P3 = (-2, 0), P4 = (1, -3), and so on.  Now write a method that returns the coordinates of Pn.
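From the four given points, the displacement from P(n-1) to Pn is F(n) times a diagonal unit step that rotates 90 degrees counterclockwise on every move: (1, 1), (-1, 1), (-1, -1), (1, -1), repeating. Assuming that pattern generalizes (it is extrapolated from the four samples, not stated in the problem), a sketch:

```python
def spiral_point(n):
    """Coordinate of P_n, assuming each arc advances F(k) steps along a
    diagonal direction that rotates 90 degrees counterclockwise per move."""
    directions = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
    x = y = 0
    a, b = 1, 1  # Fibonacci numbers F(1), F(2)
    for k in range(n):
        dx, dy = directions[k % 4]
        x, y = x + a * dx, y + a * dy
        a, b = b, a + b
    return (x, y)

# Matches the given points:
# spiral_point(1) == (1, 1),  spiral_point(2) == (0, 2),
# spiral_point(3) == (-2, 0), spiral_point(4) == (1, -3)
```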


[Kaggle] Digit Recognizer


Random forest.


Because of Python's memory limits, I only trained on 25,000 rows of the training data, and I did not feed in every pixel either; feeding everything in blows up the memory. On top of that, n_trees never reached 1000; around 200 was already the ceiling.

But there are 40,000 training rows in total, and the benchmark hands the full, untouched data to its random forest with n_trees set to 1000, so my score came out below the benchmark's.


Dammit.

246 (new) | Meng-Gen Tsai | 0.96229 | 3 entries | Sun, 05 Jan 2014 12:20:47
Next step: look into RandomizedPCA, since stuffing in every feature is not a real solution. I played with it today without getting the hang of it, and training the model eats a lot of time. To be continued.
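For later reference, a minimal sketch of that RandomizedPCA idea; the data and component count here are placeholders, and newer scikit-learn spells RandomizedPCA as PCA(svd_solver='randomized'):

```python
import numpy as np
from sklearn.decomposition import RandomizedPCA  # PCA(svd_solver='randomized') in newer releases
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the 28x28-pixel digit data; replace with the real train.csv.
rng = np.random.RandomState(0)
X = rng.rand(1000, 784)
y = rng.randint(0, 10, size=1000)

# Compress the 784 raw pixels to 50 components (an illustrative choice)
# so the forest fits in memory with far fewer features.
pca = RandomizedPCA(n_components=50, random_state=0)
X_reduced = pca.fit_transform(X)

clf = RandomForestClassifier(n_estimators=200)  # the ~200-tree ceiling noted above
clf.fit(X_reduced, y)
```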



Installed R.

One run with n_trees = 1000: 0.96743.
One run with n_trees = 800: 0.96843, finally beating the RF benchmark.

140 (new) | Meng-Gen Tsai | 0.96843 | 5 entries | Tue, 07 Jan 2014 02:22:13

Your Best Entry

You improved on your best score by 0.00100.
You just moved up 37 positions on the leaderboard.
152 (↓29) | Ravi Chandibhamar | 0.96829 | 6 entries | Mon, 09 Dec 2013 21:04:39 (-44.9h)
Random Forest | 0.96829
154 (↓29) | Thomas Hepner | 0.96829 | 1 entry | Sat, 14 Dec 2013 22:50:53




Friday, January 3, 2014

[Kaggle] Data Science London + Scikit-learn


106 (new) | Meng-Gen Tsai | 0.91282 | 3 entries | Sat, 04 Jan 2014 04:03:02 (-1.1h)
SVM.

While trying to optimize the parameters, I consulted this guide:

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

The score went down instead; my own take is overfitting.
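The linked page tunes the RBF kernel's C and gamma over a grid with cross-validation; a minimal sketch of that procedure, with placeholder data and illustrative grid values:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2014-era releases

# Stand-in for the competition's training matrix (40 anonymized features).
rng = np.random.RandomState(0)
X = rng.rand(1000, 40)
y = rng.randint(0, 2, size=1000)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```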



The legendary random forests, and the other sklearn ensemble learners, did not work well here: their cross-validation scores were lower than the SVM's, and the submitted scores were lower too. Dammit, there goes a wasted submission.

Kaggle only gives us 5 submissions per day.



PCA: time to start grinding through the PCA-related papers.

43 (new) | Meng-Gen Tsai | 0.94523 | 13 entries | Wed, 08 Jan 2014 06:54:36 (-12.2h)

[Kaggle] titanic-gettingStarted


https://www.kaggle.com/c/titanic-gettingStarted/leaderboard


1111 (new) | Meng-Gen Tsai | 0.72727 | 2 entries | Fri, 03 Jan 2014 09:51:42
Sigh, this is probably about the limit of random forest here.



605 (new) | Meng-Gen Tsai | 0.77990 | 6 entries | Sat, 04 Jan 2014 01:42:40 (-7.3h)
But yours truly is no pushover either: if random forest cannot cut it, try other algorithms.

See scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html), a very good Python machine learning library.

Of course I do not rule out using R either; whatever moves my rank up. Dammit.
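As a placeholder for "other algorithms", a sketch of swapping in gradient boosting from that same ensemble module; the features are stand-ins, since the real Titanic columns need encoding first:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for preprocessed Titanic features (sex, class, fare, ...);
# 891 is the size of the Titanic training set.
rng = np.random.RandomState(0)
X = rng.rand(891, 6)
y = rng.randint(0, 2, size=891)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X, y)
predictions = clf.predict(X)
```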



Overfitting really is an important topic in machine learning.

339 (new) | Meng-Gen Tsai | 0.78947 | 14 entries | Mon, 06 Jan 2014 07:41:36
A high cross_val_score only means the cross-validation score is high; it does not guarantee a high score on the test data.
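For context, the number being quoted comes from something like this (cross_val_score lived in sklearn.cross_validation in 2014-era releases):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation back then

# Placeholder features/labels standing in for the encoded Titanic data.
rng = np.random.RandomState(0)
X = rng.rand(891, 6)
y = rng.randint(0, 2, size=891)

scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(scores.mean())  # a high CV mean still says nothing about the hidden test set
```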

249 (new) | Meng-Gen Tsai | 0.79426 | 15 entries | Mon, 06 Jan 2014 07:54:36
Continuing to burn through the submission count :p




162 (new) | Meng-Gen Tsai | 0.79904 | 24 entries | Wed, 08 Jan 2014 09:23:11

Your Best Entry

You improved on your best score by 0.00478.
You just moved up 89 positions on the leaderboard.
Parameter tuning.

Wednesday, January 1, 2014

[FWD] Ensemble learning

Reference: http://www.jdl.ac.cn/user/hchang/course_staffs/08_ensemble%20learning.pdf


There is no single learning algorithm that in any domain always induces the most accurate learner.


Ensemble learning:
  • We construct a group of base learners which, when combined, has higher accuracy than the individual learners.
  • The base learners are usually not chosen for their accuracy, but for their simplicity.
  • The base learners should be accurate on different instances, specializing in different subdomains of the problem, so that they can complement each other.

This feels like the spirit of VirusTotal: we never know the final verdict for sure, but the more antivirus engines flag a file, the more likely it is a virus.
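That analogy is literally majority voting; a toy sketch:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-learner outputs by taking the most common label,
    like counting how many antivirus engines flag a file."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['virus', 'clean', 'virus']))  # 'virus'
```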



Bagging (bootstrap aggregating)
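A sketch of the bagging procedure itself, assuming NumPy arrays, integer class labels, and scikit-learn-style base learners with fit/predict (scikit-learn's BaggingClassifier packages the same idea):

```python
import numpy as np

def bagging_predict(base_learners, X, y, X_new, rng=np.random):
    """Fit each base learner on a bootstrap resample of (X, y),
    then majority-vote their predictions on X_new."""
    votes = []
    n = len(X)
    for clf in base_learners:
        idx = rng.randint(0, n, size=n)  # sample n rows with replacement
        clf.fit(X[idx], y[idx])
        votes.append(clf.predict(X_new))
    votes = np.array(votes)
    # Majority vote across learners for each new sample.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```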



AdaBoost
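And the usual scikit-learn form of AdaBoost, boosting depth-1 decision stumps; data and parameter values here are placeholders:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)

# Each round reweights the training set toward previously
# misclassified samples before fitting the next stump.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
clf.fit(X, y)
```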