Find consecutive zeros in a DataFrame and conditionally replace them

10

I have a dataset like this:

Sample DataFrame

import pandas as pd

df = pd.DataFrame({
    'names': ['A','B','C','D','E','F','G','H','I','J','K','L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]})

I want to replace some of the 0s in col1 and col2 with 1, but leave the 0s alone wherever three or more of them occur consecutively in the same column. How can this be done with pandas?

Original dataset:

names  col1  col2
A         0     0
B         1     0
C         0     0
D         1     0
E         1     1
F         1     0
G         0     1
H         0     0
I         0     1
J         1     0
K         0     0
L         0     0

Desired dataset:

names  col1  col2
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0

python pandas dataframe

— Kevin
Source

What about col2?

— oW_

df.loc[(df['col1']+df['col1'].shift(1)+df['col1'].shift(2)>0)&(df['col1']+df['col1'].shift(1)+df['col1'].shift(-1)>0)&(df['col1']+df['col1'].shift(-1)+df['col1'].shift(-2)>0), 'col1']=1

However, this leaves the first two and the last two rows untouched, because shift() introduces NaN at the edges and the comparisons there evaluate to False.

— oW_
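The edge problem mentioned in the comment above could be worked around by passing `fill_value=1` to `shift()`, so that positions beyond either end count as non-zero. This is only a sketch: the function name `fill_short_zero_runs_shift` is hypothetical, and it assumes the column holds non-negative integers (here 0 and 1), so that a window sum of 0 means all three values are 0.

```python
import pandas as pd

def fill_short_zero_runs_shift(col):
    # Hypothetical helper. Neighbours beyond either end are treated as 1
    # (non-zero), so the edge rows are evaluated like any other row.
    prev1 = col.shift(1, fill_value=1)
    prev2 = col.shift(2, fill_value=1)
    next1 = col.shift(-1, fill_value=1)
    next2 = col.shift(-2, fill_value=1)

    # A row belongs to a run of three or more zeros if any of the three
    # length-3 windows containing it sums to 0 (assumes values >= 0).
    in_long_run = (
        (col + prev1 + prev2).eq(0)
        | (col + prev1 + next1).eq(0)
        | (col + next1 + next2).eq(0)
    )
    # Replace only zeros that are not part of such a run.
    return col.mask(col.eq(0) & ~in_long_run, 1)

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

result = df.apply(fill_short_zero_runs_shift)
print(result)
```

Like the one-liner, this hard-codes a threshold of three; a different threshold would need a different set of windows.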

9

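Consider the following approach:

def f(col, threshold=3):
    # Label each run of equal consecutive values, then get each run's length
    run_id = (col != col.shift()).cumsum()
    mask = col.groupby(run_id).transform('count').lt(threshold)
    # Only zeros that sit in runs shorter than the threshold are replaced
    mask &= col.eq(0)
    col.update(col.loc[mask].replace(0, 1))
    return col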

In [79]: df.apply(f, threshold=3)
Out[79]:
       col1  col2
names
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0

Step by step:

In [84]: col = df['col2']

In [85]: col
Out[85]:
names
A    0
B    0
C    0
D    0
E    1
F    0
G    1
H    0
I    1
J    0
K    0
L    0
Name: col2, dtype: int64

In [86]: (col != col.shift()).cumsum()
Out[86]:
names
A    1
B    1
C    1
D    1
E    2
F    3
G    4
H    5
I    6
J    7
K    7
L    7
Name: col2, dtype: int32

In [87]: col.groupby((col != col.shift()).cumsum()).transform('count')
Out[87]:
names
A    4
B    4
C    4
D    4
E    1
F    1
G    1
H    1
I    1
J    3
K    3
L    3
Name: col2, dtype: int64

In [88]: col.groupby((col != col.shift()).cumsum()).transform('count').lt(3)
Out[88]:
names
A    False
B    False
C    False
D    False
E     True
F     True
G     True
H     True
I     True
J    False
K    False
L    False
Name: col2, dtype: bool

In [89]: col.groupby((col != col.shift()).cumsum()).transform('count').lt(3) & col.eq(0)
Out[89]:
names
A    False
B    False
C    False
D    False
E    False
F     True
G    False
H     True
I    False
J    False
K    False
L    False
Name: col2, dtype: bool

— MaxU
Source
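The same run-length idea can also be written without modifying the input column. The following is a minimal sketch rather than the answer author's code: the name `replace_short_zero_runs` is hypothetical, and it uses `Series.mask` to return a new Series instead of calling `update` on the original.

```python
import pandas as pd

def replace_short_zero_runs(col, threshold=3):
    # Hypothetical non-mutating variant of the run-length approach.
    # Consecutive equal values share one run id.
    run_id = col.ne(col.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = col.groupby(run_id).transform('count')
    # Set zeros in runs shorter than the threshold to 1; keep the rest.
    return col.mask(col.eq(0) & run_len.lt(threshold), 1)

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

result = df.apply(replace_short_zero_runs, threshold=3)
print(result)
```

Because the function returns a new Series, `df` itself keeps its original values, which can be convenient when the unmodified data is still needed for comparison.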

To add to the explanation of col.groupby((col != col.shift()).cumsum()): note that in groupby(by, ...), by can be a dict or a Series. When a dict or Series is passed, its values are used to determine the groups.

— Mithil
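A small illustration of the point in the comment above, using a made-up four-element Series (the data and variable names are assumptions for demonstration only). Passing a Series or a dict as `by` groups the rows by the values that the Series or dict assigns to each index label.

```python
import pandas as pd

col = pd.Series([0, 0, 1, 0], index=list('ABCD'))

# Series as `by`: aligned on the index, its values are the group keys.
run_id = col.ne(col.shift()).cumsum()      # A -> 1, B -> 1, C -> 2, D -> 3
sizes_by_series = col.groupby(run_id).size()

# dict as `by`: maps each index label to a group key.
mapping = {'A': 1, 'B': 1, 'C': 2, 'D': 3}
sizes_by_dict = col.groupby(mapping).size()

print(sizes_by_series)
print(sizes_by_dict)
```

Both calls produce the same grouping: one group of two rows (A and B) followed by two single-row groups.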

5

You can use pandas.DataFrame.shift() to find the pattern you need.

Code:

def fill_zero_not_3(series):
    zeros = (True, True, True)
    runs = [tuple(x == 0 for x in r)
            for r in zip(*(series.shift(i)
                           for i in (-2, -1, 0, 1, 2)))]
    need_fill = [(r[0:3] != zeros and r[1:4] != zeros and r[2:5] != zeros)
                 for r in runs]
    retval = series.copy()
    retval[need_fill] = 1
    return retval

Test code:

import pandas as pd

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

df['col1'] = fill_zero_not_3(df['col1'])
df['col2'] = fill_zero_not_3(df['col2'])
print(df)

Results:

       col1  col2
names            
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0

— Stephen Rauch
Source

I think I've found a faster way than yours.

— Kevin

2

@Stephen Rauch's answer is very smart, but it is slow when I apply it to a large dataset. Inspired by this post, I think I've found a more efficient way to achieve the same goal.

Code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

for i in range(df.shape[1]):
    iszero = np.concatenate(([0], np.equal(df.iloc[:, i].values, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    zerorange = np.where(absdiff == 1)[0].reshape(-1, 2)
    for j in range(len(zerorange)):
        if zerorange[j][1] - zerorange[j][0] < 3:
            df.iloc[zerorange[j][0]:zerorange[j][1], i] = 1
print(df)

Results:

       col1  col2
names            
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0

— Kevin
Source
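The inner Python loop over zero runs in the code above could probably be replaced by array operations as well. The sketch below is one possible variant under that assumption, not a benchmarked improvement: the function name `fill_short_zero_runs_np` is hypothetical, and it marks the short runs with a difference array followed by a cumulative sum.

```python
import numpy as np
import pandas as pd

def fill_short_zero_runs_np(values, threshold=3):
    # Hypothetical helper operating on a 1-D NumPy array.
    values = np.asarray(values)
    # Pad with 0 on both sides so every zero run has a start and a stop edge.
    iszero = np.concatenate(([0], (values == 0).astype(np.int8), [0]))
    # Each row of `edges` is the half-open range [start, stop) of one zero run.
    edges = np.flatnonzero(np.diff(iszero) != 0).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    short = edges[lengths < threshold]

    # Difference array: +1 at each short run's start, -1 at its stop.
    delta = np.zeros(len(values) + 1, dtype=np.int64)
    delta[short[:, 0]] += 1
    delta[short[:, 1]] -= 1
    in_short_run = np.cumsum(delta[:-1]) > 0

    out = values.copy()
    out[in_short_run] = 1
    return out

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

for name in df.columns:
    df[name] = fill_short_zero_runs_np(df[name].to_numpy())
print(df)
```

Whether this is actually faster depends on the data; it mainly helps when a column contains many short runs, since the per-run work is then done in NumPy rather than in a Python loop.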