Hadoop FileSystem support
Hadoop FS Support is an implementation of the Apache Hadoop FileSystem interface. It lets you work with the platform and other industry-standard big data tools without writing much custom code to do so. A few examples include:
- Read from and write to the Object store layer using Hadoop FS support in Spark (see the tutorial)
- Bring your data into HERE platform using an object store layer (see the tutorial)
- Connect an object store layer to open-source processing and analytics tools outside the platform, such as Apache Spark, Apache Drill, Presto, AWS EMR, and others.
Layer support
Hadoop FS Support is only available for the objectstore layer type.
Hadoop FileSystem interface support
The following Hadoop Filesystem methods are supported:
- `String getScheme()` returns the scheme, which is `blobfs`.
- `URI getUri()` returns the URI, for example `blobfs://hrn:here:data::olp-here:blobfs-test:test-data`.
- `void initialize(URI name, Configuration conf)` initializes the `BlobFs` FileSystem.
- `FSDataInputStream open(Path f, int bufferSize)` provides an `InputStream` to read data from an object stored in the Object Store layer.
- `FileStatus getFileStatus(Path f)` provides information about a file.
- `FileStatus[] listStatus(Path f)` provides a list of files along with their respective information.
- `FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress)` provides an `OutputStream` to write data to. The parameters `progress`, `permission` and `replication` are not implemented.
- `boolean rename(Path src, Path dst)` renames the path.
- `boolean delete(Path f, boolean recursive)` deletes a path. If the `recursive` flag is set to `true`, all sub-directories and files are also deleted.
- `boolean mkdirs(Path f, FsPermission permission)` creates directories for the given path. The parameter `permission` is not implemented.
- `void close()` closes the file system.
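The scheme and URI layout described above can be illustrated with only the JDK: the catalog HRN plus layer ID form the authority component of the URI that `getUri()` returns and that `initialize()` receives. A minimal sketch (the path `/some/object` is illustrative, not part of the interface):

```scala
import java.net.URI

// The HRN and layer ID together form the authority of a BlobFs URI.
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId    = "test-data"
val uri        = new URI(s"blobfs://$catalogHrn:$layerId/some/object")

println(uri.getScheme)    // blobfs
println(uri.getAuthority) // hrn:here:data::olp-here:blobfs-test:test-data
println(uri.getPath)      // /some/object
```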
Usage
For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID
test-data, the URL to be used in Hadoop/Spark/Drill is
blobfs://hrn:here:data::olp-here:blobfs-test:test-data.
Spark
Your Spark application needs a dependency on the hadoop-fs-support
package.
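As a sketch, the sbt dependency might look as follows. The organization shown is an assumption inferred from the class names on this page (`com.here.platform.data.client.hdfs`), and the artifact name matches the assembly jar referenced later on this page; check the coordinates for your SDK version:

```scala
// build.sbt sketch -- organization and version placeholder are assumptions,
// not confirmed coordinates; verify against your platform SDK documentation.
libraryDependencies += "com.here.platform.data.client" %% "hadoop-fs-support" % "<VERSION>"
```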
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId = "test-data"
val sourcePath = s"blobfs://$catalogHrn:$layerId/source"
val destinationPath = s"blobfs://$catalogHrn:$layerId/destination"
val sourceRdd = sparkContext.textFile(sourcePath)
sourceRdd.saveAsTextFile(destinationPath)

Hadoop fs shell
You can use the Hadoop File System Shell to explore the contents of the Object Store layer. The Hadoop File System Shell supports the following operations:
- cat
- copyFromLocal
- copyToLocal
- count
- cp
- ls
- mkdir
- moveFromLocal
- moveToLocal
- mv
- put
- rm
- rmdir
- rmr
- test
- text
- touchz
- truncate
- usage
export HADOOP_CLASSPATH="hadoop-fs-support_2.13-${VERSION}-assembly.jar"
hadoop fs -mkdir blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -cp file.txt blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -ls blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1

Hadoop configurations
BlobFs supports the following Hadoop configurations:
- `fs.blobfs.multipart.part-upload-parallelism`: BlobFs uploads an object by splitting it into parts and using the multi-part upload functionality of Object Store. This configuration defines how many parts of a single object can be uploaded simultaneously. The minimum allowed parallelism is 1. The default value is 2. Upload speed can increase with higher parallelism, but this is more costly because each uploaded part is buffered in memory.
- `fs.blobfs.multipart.part-size`: the size, in bytes, of each uploaded part of the object. The minimum part size allowed is 5242880. The maximum part size allowed is 100663296, which is also the default value.
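For example, these settings can be tuned in core-site.xml; the values shown below are simply the documented defaults:

```xml
<!-- core-site.xml sketch: BlobFs multi-part upload tuning.
     The values below are the documented defaults. -->
<property>
  <name>fs.blobfs.multipart.part-upload-parallelism</name>
  <value>2</value>
</property>
<property>
  <name>fs.blobfs.multipart.part-size</name>
  <value>100663296</value>
</property>
```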
Authentication
For instructions on how to set up HERE credentials, see Get Your Credentials.
HERE credentials can also be passed as Hadoop configuration.
- `fs.blobfs.accessKeyId`: HERE access key ID. This configuration is the same as `here.access.key.id` in `credentials.properties`.
- `fs.blobfs.accessClientId`: HERE client ID. This configuration is the same as `here.client.id` in `credentials.properties`.
- `fs.blobfs.accessKeySecret`: HERE access key secret. This configuration is the same as `here.access.key.secret` in `credentials.properties`.
- `fs.blobfs.accessEndpointUrl`: HERE token endpoint URL. This configuration is the same as `here.token.endpoint.url` in `credentials.properties`.
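As a sketch, these credentials could be supplied in core-site.xml as follows; the values are placeholders, so copy the real ones from your `credentials.properties`:

```xml
<!-- core-site.xml sketch: HERE credentials as Hadoop configuration.
     All values are placeholders -- take the real ones from
     credentials.properties. -->
<property>
  <name>fs.blobfs.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.blobfs.accessClientId</name>
  <value>YOUR_CLIENT_ID</value>
</property>
<property>
  <name>fs.blobfs.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.blobfs.accessEndpointUrl</name>
  <value>YOUR_TOKEN_ENDPOINT_URL</value>
</property>
```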
Configuring Hadoop installations
EMR
To run on EMR, you need to create the EMR cluster with an
additional --configurations parameter, for example:
aws emr create-cluster \
--name "$cluster_name" \
--release-label "emr-5.17.0" \
--applications Name="Spark" \
--region ${region} \
--log-uri ${log_location} \
--instance-type "m4.large" \
--instance-count 4 \
--service-role "EMR_DefaultRole" \
--ec2-attributes KeyName="some-key",InstanceProfile="EMR_EC2_DefaultRole",AdditionalMasterSecurityGroups="sg-xxxx",AdditionalSlaveSecurityGroups="sg-xxxxx",SubnetId="subnet-xxxx" \
--configurations file://emr-config.json

The content of the emr-config.json file is the following:
[
{
"Classification": "core-site",
"Properties": {
"fs.blobfs.impl": "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem"
}
}
]

Standalone Hadoop installations
- The BlobFS fat jar needs to be included on the Hadoop classpath.
- In some cases, the following property needs to be added to the `core-site.xml` file:
<property>
<name>fs.blobfs.impl</name>
<value>com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem</value>
</property>

Drill
The BlobFS fat jar needs to be included on the Hadoop classpath.
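Once the jar is on the classpath, one way to query a layer from Drill is through a file-system storage plugin whose connection points at the layer URI. The sketch below assumes `fs.blobfs.impl` is registered as shown in the previous section; the workspace and format entries are illustrative, not prescribed by this product:

```json
{
  "type": "file",
  "connection": "blobfs://hrn:here:data::olp-here:blobfs-test:test-data",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," }
  }
}
```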
Notes

- Object Store is not a true file system. It is a distributed key-value store and, as such, does not behave exactly like a file system. A file system expects operations such as `delete` or `rename` to be atomic; for Object Store, these operations finish eventually. A file system expects that, while a file is being read from or written to, its content is not changed and the file is not deleted; Object Store does not assure this behavior. You need to take precautions against these behavioral differences on your own.
- Hadoop FS Support implements Hadoop FileSystem version 2.7.3. It is highly likely that BlobFs will work with Hadoop versions up to 2.9.x, but this is not guaranteed. Hadoop version 3.x.x is not yet supported.
- Compatibility issues with HRN: BlobFs requires the HRN as the authority of the URI. Some tools create incorrect URIs for HRNs. To work around such cases, you can also pass the hex value of the HRN as the authority of the URI. For the catalog HRN `hrn:here:data::olp-here:blobfs-test` and the layer ID `test-data`, this looks as follows:

val hrnHex = Hex.encodeHexString("hrn:here:data::olp-here:blobfs-test:test-data".getBytes)
val blobFsUri = s"blobfs://$hrnHex"

- Hadoop FS Support does not support Apache Flink.