The Origin Forum - minimizing standard deviations in rows

The Origin Forum

Username:	Password:
Save Password
Forgot your Password? \| Admin Options

All Forums

Origin Forum for Programming

LabTalk Forum

minimizing standard deviations in rows

New Topic

Reply to Topic

Printer Friendly

Author

Topic

coll@inia.

Spain
125 Posts

Posted - 04/29/2017 : 08:21:48 AM

Origin Ver. Origin pro 2017 windows 8

I operate very often with means and standard deviations of 6 to 10 columns of data. I manually eliminate the outliers by trial and error to minimize the standard deviations. Then I choose the corresponding mean for graphic representation. I do that for each one of the rows of a worksheet with < 10 rows.

I wonder whether there is an script which could calculate all possibilities of columns combinations to calculate the corresponding mean and standard deviation for each row and select the combination with > 2 columns which results in the lowest standard deviation.

Could anyone help?

Thank you for your attention, julio

yuki_wu

896 Posts

Posted - 05/02/2017 : 02:45:03 AM

Hi Julio,

Did you want to calculate the Mean or SD for every row? I wrote a simple example that may help:


//Transpose worksheet, 
//and then output to a new worksheet
//so that you can calculate the Mean for every column instead of every row
wtranspose ow:=<new>;
//Get the number of column 
//which is the same with the row number of original data
int ncols = wks.ncols;
dataset ds;
loop(ii, 1, ncols)
{
	ds[ii] = mean(col($(ii)));
}
wks.addcol();
col($(ncols+1))[L]$ = Mean;
//Output the result
col($(ncols+1))=ds;

Yuki
OriginLab

Edited by - yuki_wu on 05/02/2017 02:46:38 AM

coll@inia.

Spain
125 Posts

Posted - 05/02/2017 : 08:41:21 AM

dear Yuki,
Thank you for your efforts!

However I do not see how do you chose the mean with the corresponding minimal sd value for each row!

I think I did not explained well enough what I am trying to do, or I do not understand what you sent me!.

I try to explain my objetive better:
lets take one single row of 5 columns.
If I calculate the mean1+2+3+4+5 & sd1+2+3+4+5 (n=5), I would take account of all the column values. That will probably result in the maximal sd value in most rows.

I will calculate then the mean again but by masking now the value in the first colum to obtain the mean2+3+4+5 & sd2+3+4+5 (n=4).

next I will mask the column value 2 and calculate the mean1+3+4+5 & sd1+3+4+5 (n=4).

and mask all the column combinations masking one at the time for every column (n=4).

Then I will mask 2 columns: for instance mean3+4+5 & sd3+4+5 (n=3), mean 1+4+5 & sd 1+4+5 (n=3), etc......... again all the combinations of columns masking 2 every time and therfore n=3.

Then I will mask 3 columns (.... leaving at least 2 columns to make a mean & sd (n=2)

After having calculated all the mena+sd of all possible combinations of column values except at least two, I will chose the combination with the lower sd and write their mean+sd (n=x) in an empty column.

I need to make those calculations again and again for every single row to make the corresponding graphic with the X axis of rows and the Y axes with the corresponding mean + minimal sd (n=x) values.

In this example, it will be a way to minimize the sd of 5 values for each row masking 1, 2 or 3 values.

of course it gets more complicated when having 10-12 columns!

I am not sure if I explained better but I think I mtself got clearer what I want to do thanks to you !

Please let me know if that is what you did and if so please explain it a little bit more

thanks again, julio

Hideo Fujii

USA
1582 Posts

Posted - 05/02/2017 : 1:52:33 PM

Hi Julio,

As far as I understood from what you wrote, your procedure doesn't make sense for me. More the number of
samples (in a row), larger their standard deviation. So, the procedure would tend to select just two
neighbor values which are closest in the row. If the standard error, instead of SD is used, that might be.
Or, I may do the following procedure:

1) At each cell in each row, calculate the absolute Z value (=abs((x-mean)/SD)) in the row
2) Mask the value (or set to a missing) when the Z value is larger than a certain preset threshold.

The following code is a sample of the implementation:

////////////////////////////////////////////////////
//Function to Remove Outliers
//////////////////////////////
function dataset remoutliers(dataset ds1, double zthreash) {
  dataset ds2;
  double ii;
  sum(ds1);
  double summean=sum.mean;
  double sumsd=sum.SD;
  double nr=ds1.getSize();
  for(ii=1; ii<=nr; ii++) {
    ds2[ii]=abs(ds1[ii]-summean)/sumsd<=zthreash?ds1[ii]:0/0; //reset if it's an outlier
  }
  return ds2;
}

/////////////////// Main //////////////////////////////
zthreshold=1.0;  //Set the threshold Z
wtranspose ow:=<new>;  //transpose
nc=wks.ncols;  //number of columns
for(jj=1; jj<=nc; jj++)
  wcol(jj)=remoutliers(wcol(jj),zthreshold);  //Replace outliers by NIL
wtranspose ow:=<new>;  //retranspose
/////////////////////////////////////////////////////////

I hope I didn't misunderstood your idea, and this gives some hint.

--Hideo Fujii
OriginLab

Edited by - Hideo Fujii on 05/03/2017 10:12:13 AM

coll@inia.

Spain
125 Posts

Posted - 05/07/2017 : 03:22:41 AM

dear Hideo,
THANK YOU very much for the code!!.

I applied to some data and it is working or at least it eliminated the outliers very well.

However, I cannot follow your code, specially the removeoutliers function. My labtalk is very limited.
Could you explain it a little bit adding comentaries?
I would like, with my limited knowledge on labtalk, to understand it..............

Thank you again, julio

coll@inia.

Spain
125 Posts

Posted - 05/07/2017 : 12:43:43 PM

dear Hideo,

Is it possible to mask the value to be discarted instead of making it NIL 0/0? it would help for a final inspection of which values were removed.
perhaps locating the final ds3 on the initial ds1?

I try wcellmask and mark functions but none of them works inside the "<=zthreash?ds1[ii]:0/0;}" final part of the function.

I cannot interpret correctly this final part

Otherwise, the function you kindly sent works perfectly well.

Thanks julio

Hideo Fujii

USA
1582 Posts

Posted - 05/08/2017 : 11:10:13 AM

Hi Julio,

<Condition>?<Return-value-on-success>:<Return-value-on-failure>

is the notation of conditional operator:
http://www.originlab.com/doc/LabTalk/guide/Operators#Conditional_Operator_.28.3F:.29

As a missing value(=0/0) is a value, but mask is a state of a cell, you cannot simply replace 0/0 by some value to mask.
Instead, you can add a code after setting missings to mask a missing cell's state to masked (as by
range<index>=1; ). Here is a modified sample:

////////////////////////////////////////////////////
//Function to Remove Outliers
//////////////////////////////
function dataset remoutliers(dataset ds1, double zthresh) {
  dataset ds2;
  double ii;
  sum(ds1);  //collect stats of ds1
  double summean=sum.mean;  //store mean
  double sumsd=sum.SD;      //store SD
  double nr=ds1.getSize();  //number of rows
  for(ii=1; ii<=nr; ii++) {
    ds2[ii]=abs(ds1[ii]-summean)/sumsd<=zthresh?ds1[ii]:0/0; //set missing if outlier
  }
  return ds2;
}

/////////////////// Main //////////////////////////////;
zthreshold=1.0;  //Set a proper Z threshold to determine outliers
pa=page.active;  //Active sheet#
wtranspose ow:=<new>;  //transpose
nc=wks.ncols;  //number of columns
for(jj=1; jj<=nc; jj++)
  wcol(jj)=remoutliers(wcol(jj),zthreshold);  //Replace outliers by NIL
wtranspose ow:=<new>;  //retranspose

//////////////////// Mask Missings ///////////////////////
nr=wks.maxrows;  //number of rows
nc=wks.ncols;    //number of columns
for(ii=1; ii<=nc; ii++) {
  range rs=$(pa)!wcol(ii);   //column in original sheet
  range rc=wcol(ii);
  for(jj=1; jj<=nr; jj++) {
    if(rc[jj]==0/0) {
      rc[jj]=rs[jj]; //recover the data from original
      rc<jj>=1;      //if cell is missing, mask it
    }
  }
}
////////////////////////////////////////////////////////////

Could you please to try?

--Hideo Fujii
OriginLab

Edited by - Hideo Fujii on 05/08/2017 5:51:22 PM

coll@inia.

Spain
125 Posts

Posted - 05/08/2017 : 4:25:20 PM

YES!!
BEAUTIFUL!!!
THANK YOU VERY MUCH!!

Just two more questions to fully understand your script.

first) I do not understand why the remoutliers function 2nd parameter is called zthreash and not threshold? is it a simplication? if it is, why not thresh? what the z is for?

2nd) related to the first
What zthreash?ds[ii]:0/0 means? I understand the ds[ii]:0/0 part but the use of zthreast and the ? are meaningless to me!!

Thank you Hideo

Hideo Fujii

USA
1582 Posts

Posted - 05/08/2017 : 5:45:56 PM

Hi Julio,

1) Sorry for my misspelling of the parameter. I have fixed my post above.

2) As an expression with a "Conditional Operator (? :), the line:

abs(ds1[ii]-summean)/sumsd<=zthresh?ds1[ii]:0/0

can be read as a pseudo code:

if(((abs(ds1[ii]-summean)/sumsd)<=zthresh)
  then return ds1[ii];
else return MISSING;

Hope this makes sense.

--Hideo Fujii, OriginLab

Edited by - Hideo Fujii on 05/08/2017 5:49:39 PM

coll@inia.

Spain
125 Posts

Posted - 05/08/2017 : 5:54:04 PM

Yes it does!
The funny thing is that despite the misspelling still worked out!

THANK YOU SO MUCH !

Hasta la vista Hideo, julio

Hideo Fujii

USA
1582 Posts

Posted - 05/09/2017 : 09:34:26 AM

Hi Julio,

mucho gusto.

By the way, I am wondering just curiously, how you present the data to be filtered out the outliers in each row.
Like a scatter plot? Or something else? If you don't mind, could you please share a sample chart image here?

--Hideo Fujii
OriginLab

coll@inia.

Spain
125 Posts

Posted - 05/09/2017 : 1:44:33 PM

dear Hideo,
the type of representation depends on the data .
In this case we present the data as bar graphs but it can be anything else....

Just the representation will be like the figure uploaded (I hope).

In the figure legend we state that the means and sd were calculated from by eliminating outliers from n=5, briefly indicating how we did removed the outliers. Up to now we were using as outliers the values > mean+sd. The mean was then recalculated.

In the present case, you develop, we will probably have to state that the outliers were eliminated by minimizing sd using n=5 values. Will be that all right? We may have to mention the threshold and the formula used for a more accurate description in methodology though. In addition, we will experiment more with the formula and the effects of the threshold value.

Hope you can access the data, final script and figure with which I experiment your scripts that I uploaded at:

http://www.originlab.com/ftp/forum_and_kbase/Images/Minimizing.sd.example.opj

Thank you again for your valuable help!!

sincerely julio

Hideo Fujii

USA
1582 Posts

Posted - 05/09/2017 : 5:23:59 PM

Thank you, Julio.

I have looked at your sample graph, and I have realized that the graph type itself is equivalent to the Bar Chart
style "Box chart". ("Plot: Box: Bar Chart" menu)
So, if you prepare the data in column-wise rather than row-wise, you can first apply the above remoutliers function
to each data column, then simply plot this type of box chart. The advantages of this approach are - no need of
transposing:

1) You can select various error bar types like SD, SE, percentile range,confidence interval, etc.
2) You can overlap data points over the bars of the bar chart.

You can consider this approach next time.

--Hideo Fujii
OriginLab

coll@inia.

Spain
125 Posts

Posted - 05/11/2017 : 12:29:08 PM

Yes, good point!

I will consider that possibility for future experimental designs....

THANKS AGAIN,

sincerly, Julio

Hideo Fujii

USA
1582 Posts

Posted - 05/24/2017 : 2:54:04 PM

Hi Julio,

If you organize your data column-wise, you can mask the outlier point using colmask x-function
(http://www.originlab.com/doc/X-Function/ref/colmask), or "Column: Mask Cells by Condition" menu from
user interface. (Thank you, Yuki for this good suggestion.) Then, make the Box chart in Bar-chart style.

--Hideo Fujii
OriginLab

Topic

New Topic

Reply to Topic

Printer Friendly

Jump To:

The Origin Forum

Snitz Forums 2000