--GREP--
list file names only that do not contain 'newstimes' (-L lists files without a match)
grep -L 'newstimes' *
list files with '6082159.php' searching recursively
grep -rl '6082159.php' /data/hnp/articles/
list files that contain x but not y
grep -l 'x' * | xargs grep -L 'y'
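For example, with hypothetical patterns and file names (placeholders only): list the .log files that mention 'error' but never 'timeout':
grep -l 'error' *.log | xargs grep -L 'timeout'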
--To sort on the fourth column of a tab-delimited file ($'\t' passes a literal tab as the -t delimiter)
sort -t$'\t' -k 4,4 <filename>
--You might also want -V (version sort), which orders numbers more naturally: 1 2 10 rather than the lexicographic 1 10 2. (For purely numeric keys, -n works as well.)
sort -t$'\t' -k 4,4 -V <filename>
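A quick way to see the difference, piping three lines straight through sort:
printf '1\n10\n2\n' | sort        # lexicographic: 1 10 2
printf '1\n10\n2\n' | sort -V     # natural: 1 2 10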
--'''less''' is more [http://helpdeskgeek.com/linux-tips/more-less-command-linux-unix/]
[Arrows]/[Page Up]/[Page Down]/[Home]/[End]: Navigation.
[Space bar]: Next page.
b: Previous page.
ng: Jump to line number n. Default is the start of the file.
nG: Jump to line number n. Default is the end of the file.
/pattern: Search for pattern. Regular expressions can be used.
n: Go to next match (after a successful search).
N: Go to previous match.
mletter: Mark the current position with letter.
'letter: Return to position letter. [' = single quote]
'^ or g: Go to start of file.
'$ or G: Go to end of file.
s: Save the current content to a file (works when the input came from a pipe, e.g. from grep).
=: File information.
F: Continually read the file and follow its end, useful for watching logs (see the example after this list). Use Ctrl+C to exit this mode.
h: Help.
q: Quit.
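For example, to open a log file already in follow mode (similar to tail -f; the path is just an example):
less +F /var/log/syslog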
--remove duplicate rows of a file without sorting
awk '!x[$0]++' abill.tsv > abill_sm.tsv
In the de-duplicating script, awk evaluates the expression !x[$0]++ for every input line.
Breaking this down:
$0 is the entire current line.
x[$0] is an associative array element keyed by the current line; referencing it creates it with the value 0 the first time that line is seen.
x[$0]++ post-increments that element, so its value goes up by one every time the same line appears again.
!x[$0]++ is therefore true when x[$0] is 0 (the line has not been seen yet) and false otherwise; the post-increment happens after the test.
When a pattern is true and no action is given, awk's default action is to print the line, so each line is printed only the first time it appears.
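A tiny worked example (inline input with duplicates; only the first occurrence of each line survives):
printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
The output is a, b, c (one line each).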
--wget entire directory
wget -r --no-parent --reject "index.html*" http://mysite.com/configs/.vim/
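(-r recurses, --no-parent keeps wget from climbing above the given directory, and --reject "index.html*" skips the auto-generated directory-listing pages.)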
-- output first 7 rows of a file
head -7 file > small_file
head -n 7 file > small_file
--for a gzipped file, decompress on the fly instead of cat
zcat file.gz | head -7 > small_file
-- output random selection of 7 rows of a file
shuf -n 7 file > small_file
--delete files below a certain size
find -name "*.csv" -size -160k -delete
--run without the delete to see what it selects
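For example, the dry run (same expression, no -delete):
find . -name "*.csv" -size -160k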
--Find the maximum value in a column (-f2 here selects the 2nd space-delimited field, for example)
cat filename | cut -f2 -d " " | sort -nr | head -1
--For minimum
cat filename | cut -f2 -d " " | sort -nr | tail -1
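A quick check with inline data (hypothetical two-column, space-delimited input):
printf 'a 3\nb 10\nc 7\n' | cut -f2 -d " " | sort -nr | head -1    # maximum: 10
printf 'a 3\nb 10\nc 7\n' | cut -f2 -d " " | sort -nr | tail -1    # minimum: 3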
--Show nohup processes: lsof | grep nohup.out
--Kill nohup processes: kill -9 pid
--Check how much space you are taking up: du --max-depth=1 /home/rickmc | sort -n -r
--Check how much space is left: df -k
--Find the big files (awk prints ls -lh's 9th field, the file name, and 5th field, the size): find {location} -type f -size +{file size threshold in kb}k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
example: find /data -type f -size +10000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
--What processes do you have running: ps -ef | grep rickmc
---Find a file
$ find / -name 'program.c' 2>/dev/null
$ find / -name 'program.c' 2>errors.txt
where:
/            Start searching from the root directory (i.e. the / directory).
-name        The search text is a file name rather than any other attribute of the file.
'program.c'  The text to search for. Always enclose the file name in single quotes; this keeps the shell from expanding any wildcards in it before find sees them.
Note: 2>/dev/null is not related to the find tool as such; it is shell redirection.
2 refers to the error stream (stderr) in Linux, and /dev/null is the device where anything you send simply disappears.
So 2>/dev/null here means that any error messages produced while searching (e.g. "Permission denied" for unreadable directories) are sent to /dev/null, i.e. simply discarded.
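For example, a pattern with a wildcard (example directory and pattern; the single quotes stop the shell from expanding the * before find runs):
find /home -name '*.c' 2>/dev/null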