I have a tab-separated file containing lines such as
id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0
that is, an id, a name, a list of strings, and a series of 4 float attributes.
I am reading this file as
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, FloatType)

schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])
The problem is that neither the list nor the attributes are read properly, judging from a collect() on the DataFrame. In fact, I get None for all of them:
Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
What am I doing wrong?
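For reference, here is a minimal sketch in plain Python (no Spark involved) of what the split step produces for the sample line. It only shows that every field, including the list and the four numbers, comes back from split('\t') as a string. The parse_fields helper is hypothetical, not part of my actual code, and I am only assuming that this kind of conversion is what the schema expects:

```python
import ast

# Sample line from the file, with the fields separated by tabs.
line = "id1\tname1\t['a', 'b']\t3.0\t2.0\t0.0\t1.0"

# Same split as in the map() call: every field is a plain string.
fields = line.split('\t')
print([type(f).__name__ for f in fields])


def parse_fields(fields):
    """Hypothetical helper: convert the split strings to the schema's types."""
    return (
        [fields[0], fields[1]]
        + [ast.literal_eval(fields[2])]      # "['a', 'b']" -> ['a', 'b']
        + [float(x) for x in fields[3:7]]    # '3.0' -> 3.0, and so on
    )


parsed = parse_fields(fields)
print(parsed)
```

If a conversion like this is needed, it would presumably go in a second map() on the RDD before calling createDataFrame, but I have not verified that against Spark.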