Can't read from S3 bucket with s3 protocol, s3a only

I've been through all the threads on the dependencies needed to connect Spark running on AWS EMR to an S3 bucket, but my issue seems slightly different. In every other discussion I've seen, the s3 and s3a protocols share the same dependencies, so I'm not sure why one works for me while the other doesn't. Running Spark in local mode, s3a does the job just fine, but my understanding is that s3 is the scheme supported on EMR (due to its reliance on HDFS block storage). What am I missing for the s3 protocol to work?

spark.read.format("csv").load("s3a://mybucket/testfile.csv").show()
//this works, displays the df

versus

spark.read.format("csv").load("s3://mybucket/testfile.csv").show()
/*
java.io.IOException: No FileSystem for scheme: s3
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 51 elided
*/

Apache Hadoop provides the following filesystem clients for reading from and writing to Amazon S3:

  1. S3 (URI scheme: s3) - an Apache Hadoop implementation of a block-based filesystem backed by S3.

  2. S3A (URI scheme: s3a) - S3A uses Amazon's libraries to interact with S3. It supports files larger than 5 GB (up to 5 TB) and provides performance enhancements and other improvements.

  3. S3N (URI scheme: s3n) - a native filesystem for reading and writing regular files on S3. S3N supports objects up to 5 GB in size.
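
Note that on EMR the s3 scheme is handled by Amazon's EMRFS, which exists only on EMR clusters, and the legacy block-based s3 client (item 1 above) was removed from Apache Hadoop in the 3.x line. In your local build, nothing is registered for the s3 scheme at all, which is exactly what "No FileSystem for scheme: s3" is saying. A common workaround for local runs is to map the s3 scheme onto the S3A implementation. Below is a minimal sketch for spark-shell, assuming hadoop-aws is already on the classpath (it must be, since the s3a:// read succeeds) and reusing the example bucket from the question:

// Register the S3A filesystem for the bare "s3" scheme as well.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// The s3 scheme now resolves to the same S3A client that s3a uses.
spark.read.format("csv").load("s3://mybucket/testfile.csv").show()

The same mapping can be set at session build time via .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"). Don't set this on an actual EMR cluster, though, where EMRFS already owns the s3 scheme.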

References:

Technically what is the difference between s3n, s3a and s3? (Stack Overflow)

https://web.archive.org/web/20170718025436/https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
