Channel: Correctly reading the types from file in PySpark - Stack Overflow

Correctly reading the types from file in PySpark


I have a tab-separated file containing lines as

id1 name1   ['a', 'b']  3.0 2.0 0.0 1.0

that is: an id, a name, a list of strings, and four float attributes. I am reading this file as

    rdd = sc.textFile('myfile.tsv') \
        .map(lambda row: row.split('\t'))
    df = sqlc.createDataFrame(rdd, schema)

where I give the schema as

    schema = StructType([
        StructField('id', StringType(), True),
        StructField('name', StringType(), True),
        StructField('list', ArrayType(StringType()), True),
        StructField('att1', FloatType(), True),
        StructField('att2', FloatType(), True),
        StructField('att3', FloatType(), True),
        StructField('att4', FloatType(), True)
    ])

The problem is that neither the list nor the float attributes are read correctly, judging from a collect() on the DataFrame. In fact, I get None for all of them:

Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)

What am I doing wrong?
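For context, a likely cause is that splitting the text lines yields only strings, and createDataFrame does not coerce strings into the declared FloatType or ArrayType, so those fields come back as None. A minimal sketch of converting each split row to the expected types first (parse_row is a hypothetical helper name, not from the question) might look like:

```python
import ast

def parse_row(fields):
    """Convert a list of 7 strings from one TSV row into typed values.

    Assumes the column order from the question: id, name, a Python-style
    list literal, then four float attributes.
    """
    id_, name, list_str = fields[0], fields[1], fields[2]
    parsed_list = ast.literal_eval(list_str)   # "['a', 'b']" -> ['a', 'b']
    atts = [float(x) for x in fields[3:7]]     # '3.0' -> 3.0
    return (id_, name, parsed_list, *atts)

# This would be applied before building the DataFrame, e.g.:
#   rdd = sc.textFile('myfile.tsv').map(lambda row: row.split('\t')).map(parse_row)
#   df = sqlc.createDataFrame(rdd, schema)
print(parse_row(['id1', 'name1', "['a', 'b']", '3.0', '2.0', '0.0', '1.0']))
# -> ('id1', 'name1', ['a', 'b'], 3.0, 2.0, 0.0, 1.0)
```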

