The Origin Forum - k-means clustering


___@___

USA
Posts

Posted - 08/13/2012 : 02:32:55 AM

Origin Ver 8.6.0. (32bit) SR3. Operating System: Win7.

Dear all,

I need to do k-means clustering of 2000 vectors containing 6000 values each. Could you please let me know whether the following problem can be solved with the standard Origin tools?

I want to split these 2000 vectors into 5 clusters and generate a heat map of all 2000 vectors sorted according to these clusters. They have to be clustered according to the proximity of their center regions. Say my vectors are stored as rows of the worksheet and the worksheet has 6000 columns: I want to select the 1000 columns in the center so that only the values within this region are compared and the vectors are clustered accordingly. I am thinking of a plot like Figure 4b in this PDF: http://www.cs.duke.edu/courses/fall07/cps296.3/lee.2007.pdf

Any advice would be appreciated,
Thanks!

Echo_Chu

China
Posts

Posted - 08/14/2012 : 04:45:26 AM

Hi,

You can split the vectors into 5 clusters with K-Means Cluster Analysis and get the cluster membership and distance of each vector. Then you can quickly rearrange the data for the heat map.
I am afraid Origin does not support heat maps directly, but you can plot something similar with a color-fill contour plot.

That is:
1. Run K-Means Cluster Analysis
a. Select Statistics: Multivariate Analysis: K-Cluster Analysis from the Origin menu to open the dialog
b. Make sure to select Distance from Cluster in the dialog
Note: In this step I assume you already know the initial cluster centers for your data. If not, I suggest getting reasonable initial cluster centers by running Hierarchical Cluster Analysis first, as in the example

2. Rearrange Data
a. In the "Cluster Membership" sheet, select Worksheet: Worksheet Query
b. In the Worksheet Query dialog, move col(Membership) from the left panel to the right panel; its alias will be set to M automatically. Set the condition to "m==1" and click the Apply button
--> The data for each cluster can be extracted in this way
c. In the extracted worksheet, highlight col(Distance) and select Worksheet: Sort Worksheet: Ascending. This sorts the data by distance

3. Create the "heatmap"
a. Plot a color-fill contour with the data
b. Double-click the graph to open the Plot Details dialog. On the Colormap/Contours tab, click the Level header to open the Set Levels dialog. Set the Increment to 0.1, or to any increment small enough that each value falls in a separate level.

Please let me know whether this is what you need.
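
Step 2 above amounts to grouping the rows by cluster membership and sorting each group by its distance to the cluster center. A minimal Python sketch of that rearrangement, outside Origin, might look as follows. The function names `central_window` and `heatmap_row_order` and the sample numbers are hypothetical; the two lists stand in for col(Membership) and col(Distance) of the "Cluster Membership" sheet.

```python
def central_window(row, width):
    """Return the `width` values in the middle of `row`."""
    start = (len(row) - width) // 2
    return row[start:start + width]


def heatmap_row_order(membership, distance):
    """Row indices grouped by cluster, closest-to-center first in each cluster."""
    return sorted(range(len(membership)),
                  key=lambda i: (membership[i], distance[i]))


# Hypothetical stand-ins for col(Membership) and col(Distance).
membership = [2, 1, 2, 1, 3]
distance = [0.5, 0.9, 0.1, 0.2, 0.4]

order = heatmap_row_order(membership, distance)
print(order)  # [3, 1, 2, 0, 4]

# For 6000 columns and a 1000-column window this selects columns
# 2500..3499 (0-based), the central region described in the question.
window = central_window(list(range(6000)), 1000)
print(window[0], window[-1])  # 2500 3499
```

Reordering the original rows as `[vectors[i] for i in order]` would then give the matrix to plot, with one block of rows per cluster.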

Echo
OriginLab Corp.

___@___

USA
Posts

Posted - 08/14/2012 : 11:55:13 AM

quote:
Originally posted by Echo_Chu
I would suggest you to get reasonable initial cluster center by running Hierarchical Cluster Analysis first, like the example

Hi Echo,
Thank you for posting. I am going step by step as you suggested. At this moment I have the following questions:
1) The link that you have provided (see quoted text above) seems to be broken
2) Should I store my 2000 vectors as rows or as columns of the worksheet?
Thanks

___@___

USA
Posts

Posted - 08/14/2012 : 12:11:56 PM

quote:
Originally posted by Echo_Chu
b. Make sure to select Distance from Cluster in the dialog
Note: In this step I assume you already know the initial cluster centers for your data.

I have my 2000 vectors stored as 2000 rows of the datasheet. Each row consists of 6000 columns. All columns are marked as "Y" axes. The cluster center is the center of each row, so I am selecting one column in the middle. But this does not work since I get the following error:

"The initial centers should have same variables as observations"

Could you please tell me what I am doing wrong?

Echo_Chu

China
Posts

Posted - 08/14/2012 : 11:43:01 PM

Hi,

The example shows how to start with hierarchical cluster analysis on randomly selected data to estimate starting values for k-means clustering, and then run k-means cluster analysis on all of the original data.

http://www.originlab.com/www/helponline/Origin/en/UserGuide/Examples_(Cluster_Analysis).html#Example_2

However, you can also just specify the number of clusters in the k-means dialog. That works as well.

Please note that the cluster center is not the center of each row. For your data, the initial cluster centers should consist of 5 rows with the same number of columns as your original data.
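
The shape requirement described above can be illustrated with a small check in Python. This is only a sketch of the rule, not Origin's implementation: `validate_initial_centers` is a hypothetical helper, and the tiny 8 x 6 data set stands in for the 2000 x 6000 worksheet.

```python
def validate_initial_centers(observations, centers, k):
    """Check that `centers` has k rows, each as long as one observation."""
    n_vars = len(observations[0])
    if len(centers) != k:
        raise ValueError(f"expected {k} initial centers, got {len(centers)}")
    for center in centers:
        if len(center) != n_vars:
            raise ValueError(
                "initial centers must have the same variables as observations")
    return True


# 8 observations (rows) with 6 variables (columns) each.
observations = [[float(r + c) for c in range(6)] for r in range(8)]
k = 2

# Valid: k rows taken from the data, each with all 6 columns.
good_centers = [observations[0], observations[5]]
print(validate_initial_centers(observations, good_centers, k))  # True

# Invalid: one middle column gives 8 rows of 1 value, not k rows of 6.
bad_centers = [[row[3]] for row in observations]
try:
    validate_initial_centers(observations, bad_centers, k)
except ValueError as err:
    print("rejected:", err)
```

Selecting a single column in the middle of the worksheet corresponds to the second, rejected case.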

___@___

USA
Posts

Posted - 08/15/2012 : 01:39:53 AM

Dear Echo,

Thank you for posting the corrected link. I am trying to do this without hierarchical clustering, since that would be too complicated. So should I store the vectors as rows or as columns? What if I do not have an "X" variable, just the "Y" observables?

Best,
Dan

Echo_Chu

China
Posts

Posted - 08/15/2012 : 03:25:49 AM

If you want to cluster the vectors, then please store them in rows.

It is fine to set all columns as Y and to have no X column.

___@___

USA
Posts

Posted - 08/15/2012 : 04:12:33 AM

quote:
Originally posted by Echo_Chu

Please note that the cluster center is not the center of each row. For your data, the initial cluster centers should consist of 5 rows with the same number of columns as your original data.

Sorry, this is still not clear. If I store my 2000 vectors (each of 6000 elements) as 2000 rows (with 6000 columns), then I understand that the cluster center is an array of 5 columns and 2000 rows, not 5 rows and 6000 columns?

Echo_Chu

China
Posts

Posted - 08/15/2012 : 04:49:29 AM

Cluster analysis partitions the observations into the specified number of clusters, where each observation belongs to the cluster with the nearest mean. An initial cluster center is one of the observations, toward which the other observations converge. Each row is an observation.

So if you want to cluster your vectors, you should store the vectors in rows and specify one or several rows as your cluster centers.
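
To make the role of the initial centers concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Python, started from explicitly chosen rows. It is an illustration under stated assumptions, not Origin's implementation: the function name `kmeans_from_rows`, the Euclidean distance, and the toy data are all hypothetical choices.

```python
import math


def kmeans_from_rows(rows, center_rows, max_iter=100):
    """Run k-means using the observations at `center_rows` as initial centers.

    Returns (labels, distances, centers); one center per cluster.
    """
    centers = [list(rows[i]) for i in center_rows]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # memberships stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its member rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    distances = [math.dist(row, centers[lab]) for row, lab in zip(rows, labels)]
    return labels, distances, centers


# Five observations (rows) with two variables each; rows 0 and 2 are the
# two initial centers, so k = 2 and each center is a full row.
rows = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 4]]
labels, distances, centers = kmeans_from_rows(rows, center_rows=[0, 2])
print(labels)      # [0, 0, 1, 1, 0]
print(centers[1])  # [10.0, 10.5]
```

Each initial center is a whole row, with as many values as there are columns, and there is one such row per cluster.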

___@___

USA
Posts

Posted - 08/15/2012 : 05:23:19 AM

quote:
Originally posted by Echo_Chu

Cluster analysis partitions the observations into the specified number of clusters, where each observation belongs to the cluster with the nearest mean. An initial cluster center is one of the observations, toward which the other observations converge. Each row is an observation.

So if you want to cluster your vectors, you should store the vectors in rows and specify one or several rows as your cluster centers.

Sorry, I am not getting it. I want to make a figure like Figure 4b here: http://www.cs.duke.edu/courses/fall07/cps296.3/lee.2007.pdf
In this example, they have 5000 vectors, each vector is represented as a line in the figure. There are four clusters in this example, and each cluster has its own center. As far as I understand, what you propose is to indicate 5 neighboring rows (that is, five vectors) as a cluster center. But if the aim is to get four different clusters, then each of them has its own "center", right? Then how do I indicate this if only one cluster center is requested in the dialog for k-means clustering? Could you please look specifically at this example?

Edited by - ___@___ on 08/15/2012 05:26:47 AM

Echo_Chu

China
Posts

Posted - 08/15/2012 : 05:50:48 AM

It is not 5 neighboring rows as one cluster center. It should be 5 rows (not necessarily neighboring), each of which is a cluster center. Anyway, we can forget about this if we specify the number of clusters directly. Then we don't need to struggle to find specific initial cluster centers, since you said you are trying to do without hierarchical clustering.

So in step 1 of my first post, you can set Number of Clusters to 5 in the k-means dialog.

However, would you mind sending us your data so that we can show you how to work with it?

You could follow the instructions below to send your file.
http://www.originlab.com/index.aspx?go=Support&pid=752

___@___

USA
Posts

Posted - 11/24/2012 : 5:14:09 PM

Hi Echo,
I was busy with other things, but I have now returned to this problem. The text file with my data is quite large (70 MB). When I ran the cluster analysis, Origin stopped working. How would you suggest I proceed? Could you show me how to do this analysis if I send you the data? And isn't the file too large to upload?
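
As a rough size estimate for the data described in this thread, the arithmetic can be sketched in Python. It assumes each value is held as an 8-byte floating-point number, which is an assumption and not something stated in the thread, and it does not by itself explain why Origin stopped working. `matrix_bytes` is a hypothetical helper.

```python
def matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """In-memory size of a dense numeric matrix (assumes 8-byte values)."""
    return n_rows * n_cols * bytes_per_value


full = matrix_bytes(2000, 6000)     # all 6000 columns
central = matrix_bytes(2000, 1000)  # only the central 1000 columns

print(full / 1e6, "MB")     # 96.0 MB
print(central / 1e6, "MB")  # 16.0 MB
```

Under that assumption, clustering only the central 1000 columns would involve one sixth of the values.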

Echo_Chu

China
Posts

Posted - 11/26/2012 : 12:20:08 AM

Hi Dan,

Please send me your file so that I can look at it first.

You could follow the instructions below to send your file.
http://www.originlab.com/index.aspx?go=Support&pid=752

Thanks
Echo

___@___

USA
Posts

Posted - 11/26/2012 : 4:27:45 PM

Dear Echo, I have uploaded the file to the FTP server. I included the explanations in the web form. Please have a look.
