Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files; see External Compatibility for an example.
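As a quick illustration of the pandas side of such a transfer, here is a minimal sketch (the file name and key are hypothetical, and PyTables must be installed for HDF5 support):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5), 'b': np.random.randn(5)})

# Write the frame to an HDF5 file; an R package such as rhdf5 can then
# read 'transfer.h5' on the R side.
df.to_hdf('transfer.h5', 'df', mode='w')
```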
Quick Reference
We'll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
Querying, Filtering, Sampling
R | pandas |
---|---|
dim(df) | df.shape |
head(df) | df.head() |
slice(df, 1:10) | df.iloc[:10] |
filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
select(df, col1, col2) | df[['col1', 'col2']] |
select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
sample_n(df, 10) | df.sample(n=10) |
sample_frac(df, 0.01) | df.sample(frac=0.01) |

[1] R's shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
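If you do need to drop a contiguous range of columns by name, one workable pattern — a sketch of ours rather than an established idiom — is to translate the boundary names into positions first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5),
                  columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# Translate the name range 'col1':'col3' into integer positions,
# then drop that slice of the columns Index (endpoint inclusive).
start, stop = df.columns.get_loc('col1'), df.columns.get_loc('col3')
df.drop(df.columns[start:stop + 1], axis=1)
```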
Sorting
R | pandas |
---|---|
arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |
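dplyr can also mix sort directions, e.g. arrange(df, col1, desc(col2)). sort_values mirrors this by accepting a list for ascending (a small extra beyond the table above):

```python
# Ascending on col1, descending on col2
df.sort_values(['col1', 'col2'], ascending=[True, False])
```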
Transforming
R | pandas |
---|---|
select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
mutate(df, c=a-b) | df.assign(c=df.a-df.b) |
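Because assign evaluates its keyword arguments against the frame it is called on, a callable can be passed instead of a precomputed Series; this keeps the expression self-contained, much as mutate refers to df's own columns:

```python
# Callable form: 'd' is the DataFrame that assign is called on
df.assign(c=lambda d: d.a - d.b)
```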
Grouping and Summarizing
R | pandas |
---|---|
summary(df) | df.describe() |
gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) |
summarise(gdf, total=sum(col1)) | df.groupby('col1').sum() |
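Several summaries can be computed in one pass by handing agg a list of functions, loosely matching a summarise call with multiple outputs:

```python
# Mean and sum of col1 within each group, in a single call
df.groupby('col1')['col1'].agg(['mean', 'sum'])
```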
Base R
Slicing with R's c
R makes it easy to access data.frame columns by name:

```r
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
```
or by integer location:

```r
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
```
Selecting multiple columns by name in pandas is straightforward:

```python
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]
Out[2]:
          a         c
0 -1.039575 -0.424972
1  0.567020 -1.087401
2 -0.673690 -1.478427
3  0.524988  0.577046
4 -1.715002 -0.370647
5 -1.157892  0.844885
6  1.075770  1.643563
7 -1.469388 -0.674600
8 -1.776904 -1.294524
9  0.413738 -0.472035

In [3]: df.loc[:, ['a', 'c']]
Out[3]:
          a         c
0 -1.039575 -0.424972
1  0.567020 -1.087401
2 -0.673690 -1.478427
3  0.524988  0.577046
4 -1.715002 -0.370647
5 -1.157892  0.844885
6  1.075770  1.643563
7 -1.469388 -0.674600
8 -1.776904 -1.294524
9  0.413738 -0.472035
```
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_:

```python
In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
           a         b         c         d         e         f         g  \
0  -0.013960 -0.362543 -0.006154 -0.923061  0.895717  0.805244 -1.206412
1   0.545952 -1.219217 -1.226825  0.769804 -1.281247 -0.727707 -0.121306
2   2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602
3  -0.988387  0.094055  1.262731  1.289997  0.082423 -0.055758  0.536580
4  -1.340896  1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440
5   0.464000  0.227371 -0.496922  0.306389 -2.290613 -1.134623 -1.561819
6  -0.507516 -0.230096  0.394500 -1.934370 -1.652499  1.488753 -0.896484
..       ...       ...       ...       ...       ...       ...       ...
23 -0.083272 -0.273955 -0.772369 -1.242807 -0.386336 -0.182486  0.164816
24  2.071413 -1.364763  1.122066  0.066847  1.751987  0.419071 -1.118283
25  0.036609  0.359986  1.211905  0.850427  1.554957 -0.888463 -1.508808
26 -1.179240  0.238923  1.756671 -0.747571  0.543625 -0.159609 -0.051458
27  0.025645  0.932436 -1.694531 -0.182236 -1.072710  0.466764 -0.072673
28  0.439086  0.812684 -0.128932 -0.142506 -1.137207  0.462001 -0.159466
29 -0.909806 -0.312006  0.383630 -0.631606  1.321415 -0.004799 -2.008210

           7         8         9        24        25        26        27  \
0   2.565646  1.431256  1.340309  0.875906 -2.211372  0.974466 -2.006747
1  -0.097883  0.695775  0.341734 -1.743161 -0.826591 -0.345352  1.314232
2  -0.082240 -2.182937  0.380396  1.266143  0.299368 -0.863838  0.408204
3  -0.489682  0.369374 -0.034571  0.221471 -0.744471  0.758527  1.729689
4   0.901805  1.171216  0.520260  0.650776 -1.461665 -1.137707 -0.891060
5  -0.260838  0.281957  1.523962 -0.008434  1.952541 -1.056652  0.533946
6   0.576897  1.146000  1.487349  2.015523 -1.833722  1.771740 -0.670027
..       ...       ...       ...       ...       ...       ...       ...
23  0.065624  0.307665 -1.898358  1.389045 -0.873585 -0.699862  0.812477
24  1.010694  0.877138 -0.611561 -1.040389 -0.796211  0.241596  0.385922
25 -0.617855  0.536164  2.175585  1.872601 -2.513465 -0.139184  0.810491
26  0.937882  0.617547  0.287918 -1.584814  0.307941  1.809049  0.296237
27 -0.026233 -0.051744  0.001402  0.150664 -3.060395  0.040268  0.066091
28 -1.788308  0.753604  0.918071  0.922729  0.869610  0.364726 -0.226101
29 -0.481634 -2.056211 -2.106095  0.039227  0.211283  1.440190 -0.989193

          28        29
0  -0.410001 -0.078638
1   0.690579  0.995761
2  -1.048089 -0.025747
3  -0.964980 -0.845696
4  -0.693921  1.613616
5  -1.226970  0.040403
6   0.049307 -0.521493
..       ...       ...
23 -0.469503  1.142702
24 -0.486078  0.433042
25  0.571599 -0.000676
26 -0.143550  0.289401
27 -0.192862  1.979055
28 -0.657647 -0.952699
29  0.313335 -0.399709

[30 rows x 16 columns]
```
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

```r
df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))

aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
```
The groupby() method is similar to the base R aggregate function.

```python
In [9]: df = pd.DataFrame({
   ...:     'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
   ...:     'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
   ...:     'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:     'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
   ...:             np.nan]
   ...: })

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()
Out[11]:
            v1    v2
by1  by2
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0
```
For more details and examples see the groupby documentation.
match / %in%
A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:
```r
s <- 0:4
s %in% c(2,4)
```
The isin() method is similar to the R %in% operator:

```python
In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]:
0    False
1    False
2     True
3    False
4     True
dtype: bool
```
The match function returns a vector of the positions of matches of its first argument in its second:

```r
s <- 0:4
match(s, c(2,4))
```
The pd.match function can be used to replicate this:

```python
In [14]: s = pd.Series(np.arange(5), dtype=np.float32)

In [15]: pd.Series(pd.match(s, [2, 4], np.nan))
Out[15]:
0    NaN
1    NaN
2    0.0
3    NaN
4    1.0
dtype: float64
```
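In pandas versions where pd.match is not available, a close substitute — our suggestion, not part of the original comparison — is Index.get_indexer, which returns -1 rather than NaN for non-matches:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Position of each element of s in [2, 4]; -1 marks "no match"
# (where R's match() would return NA).
pd.Index([2, 4]).get_indexer(s)  # array([-1, -1,  0, -1,  1])
```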
For more details and examples see the reshaping documentation.
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:

```r
baseball <-
  data.frame(team = gl(5, 5,
                       labels = paste("Team", LETTERS[1:5])),
             player = sample(letters, 25),
             batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)
```
In pandas we may use the pivot_table() method to handle this:

```python
In [16]: import random

In [17]: import string

In [18]: baseball = pd.DataFrame({
   ....:     'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:     'player': random.sample(list(string.ascii_lowercase), 25),
   ....:     'batting avg': np.random.uniform(.200, .400, 25)
   ....: })

In [19]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[19]:
team
team 1    0.394457
team 2    0.395730
team 3    0.343015
team 4    0.388863
team 5    0.377379
Name: batting avg, dtype: float64
```
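The same per-team maximum can also be computed with a plain groupby, which is often the more direct translation of tapply:

```python
# Equivalent to tapply(baseball$batting.average, baseball$team, max)
baseball.groupby('team')['batting avg'].max()
```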
For more details and examples see the reshaping documentation.
subset
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column's values are less than another column's values:

```r
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma
```
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:

```python
In [20]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [21]: df.query('a <= b')
Out[21]:
          a         b
0 -1.003455 -0.990738
1  0.083515  0.548796
3 -0.524392  0.904400
4 -0.837804  0.746374
8 -0.507219  0.245479

In [22]: df[df.a <= df.b]
Out[22]:
          a         b
0 -1.003455 -0.990738
1  0.083515  0.548796
3 -0.524392  0.904400
4 -0.837804  0.746374
8 -0.507219  0.245479

In [23]: df.loc[df.a <= df.b]
Out[23]:
          a         b
0 -1.003455 -0.990738
1  0.083515  0.548796
3 -0.524392  0.904400
4 -0.837804  0.746374
8 -0.507219  0.245479
```
For more details and examples see the query documentation.
with
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

```r
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression
```
In pandas the equivalent expression, using the eval() method, would be:

```python
In [24]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [25]: df.eval('a + b')
Out[25]:
0   -0.920205
1   -0.860236
2    1.154370
3    0.188140
4   -1.163718
5    0.001397
6   -0.825694
7   -1.138198
8   -1.708034
9    1.148616
dtype: float64

In [26]: df.a + df.b  # same as the previous expression
Out[26]:
0   -0.920205
1   -0.860236
2    1.154370
3    0.188140
4   -1.163718
5    0.001397
6   -0.825694
7   -1.138198
8   -1.708034
9    1.148616
dtype: float64
```
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
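As a rough way to see this for yourself — a sketch whose outcome depends on frame size and on whether the optional numexpr engine is installed — compare the two spellings on a larger frame:

```python
import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame({'a': np.random.randn(1000000),
                    'b': np.random.randn(1000000)})

# Plain NumPy-backed arithmetic vs. the eval() expression engine
print(timeit.timeit(lambda: big.a + big.b, number=20))
print(timeit.timeit(lambda: big.eval('a + b'), number=20))
```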
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
R | Python |
---|---|
array | list |
lists | dictionary or list of objects |
data.frame | dataframe |
ddply
An expression using a data.frame called df in R where you want to summarize x by month and week:

```r
require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5,6,7,8),30),
  week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))
```
In pandas the equivalent expression, using the groupby() method, would be:

```python
In [27]: df = pd.DataFrame({
   ....:     'x': np.random.uniform(1., 168., 120),
   ....:     'y': np.random.uniform(7., 334., 120),
   ....:     'z': np.random.uniform(1.7, 20.7, 120),
   ....:     'month': [5, 6, 7, 8] * 30,
   ....:     'week': np.random.randint(1, 4, 120)
   ....: })

In [28]: grouped = df.groupby(['month', 'week'])

In [29]: grouped['x'].agg([np.mean, np.std])
Out[29]:
                  mean        std
month week
5     1      71.840596  52.886392
      2      71.904794  55.786805
      3      89.845632  49.892367
6     1      97.730877  52.442172
      2      93.369836  47.178389
      3      96.592088  58.773744
7     1      59.255715  43.442336
      2      69.634012  28.607369
      3      84.510992  59.761096
8     1     104.787666  31.745437
      2      69.717872  53.747188
      3      79.892221  52.950459
```
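To also mirror ddply's round(..., 2) calls, simply round the aggregated frame afterwards:

```python
# Round the summary statistics to two decimals, as in the R snippet
grouped['x'].agg([np.mean, np.std]).round(2)
```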
For more details and examples see the groupby documentation.
reshape / reshape2
melt.array
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

```r
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
```
In Python, since a here is a NumPy array rather than a list, you can use a list comprehension over np.ndenumerate to build the rows:

```python
In [30]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

In [31]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
Out[31]:
    0  1  2     3
0   0  0  0   1.0
1   0  0  1   2.0
2   0  0  2   3.0
3   0  0  3   4.0
4   0  1  0   5.0
5   0  1  1   6.0
6   0  1  2   7.0
.. .. .. ..   ...
17  1  1  1  18.0
18  1  1  2  19.0
19  1  1  3  20.0
20  1  2  0  21.0
21  1  2  1  22.0
22  1  2  2  23.0
23  1  2  3   NaN

[24 rows x 4 columns]
```
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:

```r
a <- as.list(c(1:4, NA))
data.frame(melt(a))
```
In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required:

```python
In [32]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

In [33]: pd.DataFrame(a)
Out[33]:
   0    1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  NaN
```
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:

```r
cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)

melt(cheese, id=c("first", "last"))
```
In Python, the melt() method is the equivalent of R's melt:

```python
In [34]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
   ....:                        'last': ['Doe', 'Bo'],
   ....:                        'height': [5.5, 6.0],
   ....:                        'weight': [130, 150]})

In [35]: pd.melt(cheese, id_vars=['first', 'last'])
Out[35]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [36]: cheese.set_index(['first', 'last']).stack()  # alternative way
Out[36]:
first  last
John   Doe   height      5.5
             weight    130.0
Mary   Bo    height      6.0
             weight    150.0
dtype: float64
```
For more details and examples see the reshaping documentation.
cast
In R, acast is used with a data.frame called df to cast it into a higher dimensional array:

```r
df <- data.frame(
  x = runif(12, 1, 168),
  y = runif(12, 7, 334),
  z = runif(12, 1.7, 20.7),
  month = rep(c(5,6,7),4),
  week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
```
In Python the best way is to make use of pivot_table():

```python
In [37]: df = pd.DataFrame({
   ....:     'x': np.random.uniform(1., 168., 12),
   ....:     'y': np.random.uniform(7., 334., 12),
   ....:     'z': np.random.uniform(1.7, 20.7, 12),
   ....:     'month': [5, 6, 7] * 4,
   ....:     'week': [1, 2] * 6
   ....: })

In [38]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [39]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
   ....:                columns=['month'], aggfunc=np.mean)
Out[39]:
month                  5           6           7
variable week
x        1    114.001700  132.227290   65.808204
         2    124.669553  147.495706   82.882820
y        1    225.636630  301.864228   91.706834
         2     57.692665  215.851669  218.004383
z        1     17.793871    7.124644   17.679823
         2     15.068355   13.873974    9.394966
```
Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

```r
df <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
             'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2)
)

dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
```
Python can approach this in two different ways. Firstly, similar to above, using pivot_table():

```python
In [40]: df = pd.DataFrame({
   ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
   ....:                'Animal2', 'Animal3'],
   ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
   ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
   ....: })

In [41]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
   ....:                aggfunc='sum')
Out[41]:
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN
```
The second approach is to use the groupby() method:

```python
In [42]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
Out[42]:
Animal   FeedType
Animal1  A           10
         B            5
Animal2  A            2
         B           13
Animal3  A            6
Name: Amount, dtype: int64
```
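To recover dcast's wide layout from the groupby result, unstack the FeedType level (a small follow-on to the example above):

```python
# Pivot the FeedType level into columns; missing combinations become NaN
df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
```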
For more details and examples see the reshaping documentation or the groupby documentation.
factor
New in version 0.15.
pandas has a data type for categorical data.
```r
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
```
In pandas this is accomplished with pd.cut and astype("category"):

```python
In [43]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[43]:
0    (0.995, 2.667]
1    (0.995, 2.667]
2    (2.667, 4.333]
3    (2.667, 4.333]
4        (4.333, 6]
5        (4.333, 6]
dtype: category
Categories (3, object): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6]]

In [44]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
Out[44]:
0    1
1    2
2    3
3    2
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]
```
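Like R's cut(..., labels=...), pd.cut also accepts explicit labels for the bins (the label names here are just illustrative):

```python
# Name the three bins instead of showing their interval edges
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3, labels=['low', 'mid', 'high'])
```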
For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences relative to R's factor.