How to change a DataFrame column from String type to Double type in PySpark

Question 1

I have a DataFrame with a column stored as String. I want to change the column type to Double in PySpark.

This is what I did:

toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

I just wanted to know whether this is the right way to do it. I am getting an error while running logistic regression, so I wonder whether this conversion is the cause of the problem.

Question 2

There is no need for a UDF here. Column already provides a cast method that takes a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variants may be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType', 
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType', 
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")

BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()

'array<int>'

types.MapType(types.StringType(), types.IntegerType()).simpleString()

'map<string,int>'

Question 3

Preserve the column name and avoid adding an extra column by using the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))
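
The same in-place pattern extends to several columns. A minimal sketch, assuming a hypothetical helper named cast_columns (not part of PySpark) that relies only on DataFrame.withColumn and Column.cast with a type-name string:

```python
def cast_columns(df, type_by_column):
    """Return a DataFrame with each listed column cast in place.

    type_by_column maps a column name to a type-name string such as
    "double" or "int". cast_columns is a hypothetical helper, not a
    PySpark API.
    """
    for name, type_name in type_by_column.items():
        # Reusing the input column's name replaces it instead of adding a new one.
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

For instance, cast_columns(joindf, {"show": "double"}) should behave like the single withColumn call above.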

Question 4

The given answers are enough to deal with the problem, but I want to share another way that may have been introduced in a newer version of Spark (I am not sure about that), which would explain why the given answers did not cover it.

We can refer to a column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Question 5

PySpark version:

  df = <source data>
  df.printSchema()

  from pyspark.sql.types import IntegerType

  # Change column type
  df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
  df_new.printSchema()
  df_new.select("myColumn").show()
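
Besides printSchema(), the result can also be checked programmatically. A minimal sketch using DataFrame.dtypes, which lists (column name, type string) pairs; column_type is a hypothetical helper name, not a PySpark API:

```python
def column_type(df, name):
    """Return the type-name string of a column, e.g. 'int' or 'double'."""
    # DataFrame.dtypes is a list of (column name, type string) tuples.
    types_by_name = dict(df.dtypes)
    if name not in types_by_name:
        raise KeyError(f"no column named {name!r}")
    return types_by_name[name]
```

After the cast above, column_type(df_new, "myColumn") would be expected to return "int".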

Question 6

The solution was simple:

toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))
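
One caveat with this UDF: float(x) raises an exception in Python when x is None or is not a valid number string, so a null or malformed value in the show column would fail the job. A minimal sketch of a more defensive converter, where to_double_or_none is a hypothetical name and not part of PySpark:

```python
def to_double_or_none(value):
    """Convert a string to float, returning None for missing or malformed input."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        # Malformed input becomes None, which Spark stores as null.
        return None

# Assumed usage, mirroring the UDF above:
# toDoublefunc = UserDefinedFunction(to_double_or_none, DoubleType())
```

This keeps the UDF approach working on dirty data, although the built-in cast shown in the other answers is usually the simpler choice.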