Google Cloud BigQuery
Java API
If you plan to use a service account with client library code, you need to set an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
Now we can run the application without specifying credentials explicitly:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleApp \
google-analytics 112233445 ga_sessions_intraday_20180103
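For reference, a minimal sketch of what SimpleApp could look like with the google-cloud-bigquery client library; it relies on Application Default Credentials, and the query and selected fields are illustrative, not the actual class:

```java
package bigdata.googlecloud.bigquery;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Hypothetical sketch: authentication comes from GOOGLE_APPLICATION_CREDENTIALS.
public class SimpleApp {
    public static void main(String[] args) throws InterruptedException {
        String projectId = args[0];   // e.g. google-analytics
        String datasetId = args[1];   // e.g. 112233445
        String tableId   = args[2];   // e.g. ga_sessions_intraday_20180103

        // The default service picks up the service account pointed to by
        // GOOGLE_APPLICATION_CREDENTIALS, so no credentials appear in the code.
        BigQuery bigquery = BigQueryOptions.newBuilder()
                .setProjectId(projectId)
                .build()
                .getService();

        QueryJobConfiguration query = QueryJobConfiguration
                .newBuilder("SELECT fullVisitorId, visitStartTime FROM `"
                        + projectId + "." + datasetId + "." + tableId + "` LIMIT 10")
                .setUseLegacySql(false)
                .build();

        TableResult result = bigquery.query(query);
        result.iterateAll().forEach(row -> System.out.println(row));
    }
}
```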
Alternatively, this option does not require exporting the environment variable, but the credentials must be created inside the code, so the JSON credentials file is passed as an input argument:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleAppWithCred \
~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json google-analytics 112233445 ga_sessions_intraday_20180103
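A matching sketch of SimpleAppWithCred, assuming the google-auth-library and google-cloud-bigquery dependencies; the argument order follows the command above and the query is illustrative:

```java
package bigdata.googlecloud.bigquery;

import java.io.FileInputStream;
import java.io.IOException;

import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

// Hypothetical sketch: credentials are built explicitly from the JSON key file
// passed as the first argument, so GOOGLE_APPLICATION_CREDENTIALS is not needed.
public class SimpleAppWithCred {
    public static void main(String[] args) throws IOException, InterruptedException {
        String keyFile   = args[0];   // path to the service account JSON key
        String projectId = args[1];
        String datasetId = args[2];
        String tableId   = args[3];

        ServiceAccountCredentials credentials;
        try (FileInputStream keyStream = new FileInputStream(keyFile)) {
            credentials = ServiceAccountCredentials.fromStream(keyStream);
        }

        BigQuery bigquery = BigQueryOptions.newBuilder()
                .setProjectId(projectId)
                .setCredentials(credentials)
                .build()
                .getService();

        QueryJobConfiguration query = QueryJobConfiguration
                .newBuilder("SELECT COUNT(*) FROM `"
                        + projectId + "." + datasetId + "." + tableId + "`")
                .setUseLegacySql(false)
                .build();

        bigquery.query(query).iterateAll().forEach(System.out::println);
    }
}
```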
REST API
CLI installation and running
Download gcloud
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-182.0.0-linux-x86_64.tar.gz
Initialize gcloud
Add the SDK's bin directory to your PATH in .bashrc, then run:
gcloud init
Authorization
Export env variable
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
Create Access Token
ACCESS_TOKEN="$(gcloud auth application-default print-access-token)"
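If you prefer to obtain the bearer token from Java instead of gcloud, a minimal sketch using the google-auth-library could look like this (the PrintAccessToken class name is hypothetical):

```java
import java.io.IOException;
import java.util.Collections;

import com.google.auth.oauth2.AccessToken;
import com.google.auth.oauth2.GoogleCredentials;

// Hypothetical sketch: builds the same kind of bearer token from
// GOOGLE_APPLICATION_CREDENTIALS instead of calling gcloud.
public class PrintAccessToken {
    public static void main(String[] args) throws IOException {
        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
                .createScoped(Collections.singletonList(
                        "https://www.googleapis.com/auth/bigquery"));
        AccessToken token = credentials.refreshAccessToken();
        System.out.println(token.getTokenValue());
    }
}
```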
Metadata
| ProjectId | DatasetId | TableId |
|---|---|---|
| google-analytics | 112233445 | ga_sessions_intraday_yyyymmdd |
General Endpoints
List of main methods
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId/data
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs
POST https://www.googleapis.com/upload/bigquery/v2/projects/projectId/jobs
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId
POST https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId/cancel
POST https://www.googleapis.com/bigquery/v2/projects/projectId/queries
GET https://www.googleapis.com/bigquery/v2/projects/projectId/queries/jobId
Examples
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20171212
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data
POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_gRaHSTeIJCmIZzz4bctj0zsuffxc
Google reference
- Datasets: get
- Datasets: list
- Tables: get
- Tables: list
- Tabledata: list
- Jobs: query
- Jobs: getQueryResults
- Jobs: get
- Jobs: list
- Jobs: insert
Testing
Tables list
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables \
-H "Authorization: Bearer $ACCESS_TOKEN"
Tables get
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103 \
-H "Authorization: Bearer $ACCESS_TOKEN"
Tabledata list
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data \
-H "Authorization: Bearer $ACCESS_TOKEN"
Jobs query
We must run jobs. Jobs are asynchronous tasks such as running queries, loading data, and exporting data. The query method runs a job and returns a job ID.
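For comparison with the REST calls below, a minimal sketch of submitting the same kind of asynchronous query job with the Java client library; the class name and query are illustrative:

```java
import java.util.UUID;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;

// Hypothetical sketch: submits the query as an asynchronous job and prints the jobId,
// mirroring what the REST "Jobs query" call returns.
public class AsyncQueryJob {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        QueryJobConfiguration config = QueryJobConfiguration
                .newBuilder("SELECT * FROM `google-analytics.112233445.ga_sessions_intraday_20180103` LIMIT 2")
                .setUseLegacySql(false)
                .build();

        JobId jobId = JobId.of(UUID.randomUUID().toString());
        Job job = bigquery.create(JobInfo.newBuilder(config).setJobId(jobId).build());

        // The job runs asynchronously; its id can later be passed to jobs.get / getQueryResults.
        System.out.println("Submitted job: " + job.getJobId().getJob());
    }
}
```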
Request: with a hardcoded JSON body
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d '{"query":"select * from ga_sessions_intraday_20180103","kind":"bigquery#queryRequest","maxResults":2,"defaultDataset":{"projectId":"google-analytics","datasetId":"112233445"},"dryRun":false}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
Request: with the file request-body-query.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-query.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-query.json looks like:
{
"query": "SELECT visitorId,visitNumber,visitId,visitStartTime,date,hits.time,hits.hour,hits.minute,hits.referer,hits.page.pagePath FROM ga_sessions_intraday_20180104 OMIT RECORD IF COUNT(hits.hour) < 2 LIMIT 1",
"kind": "bigquery#queryRequest",
"maxResults": 2,
"defaultDataset": {
"projectId": "google-analytics",
"datasetId": "112233445"
},
"dryRun": false,
"useLegacySql": true
}
Response: if we don't set the timeoutMs parameter, the query probably does not return data yet, but it returns a jobId to use with the getQueryResults method.
{
"kind": "bigquery#queryResponse",
"jobReference": {
"projectId": "google-analytics",
"jobId": "job_RvsOhR4xxWHk43eKShWcqm5a0srh"
},
"jobComplete": false
}
Jobs get (job status)
Get the status of a specific job:
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs/job_35axds8rFh5v6x_ztzwCNIxvIR9- \
-H "Authorization: Bearer $ACCESS_TOKEN"
One of the fields in the response body is status.state, which can be PENDING, RUNNING, or DONE. The response also contains the destination table in BigQuery where the result data has been written:
...
"destinationTable": {
"projectId": "google-analytics",
"datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
"tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
}
...
Jobs getQueryResults
curl -X GET "https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_RvsOhR4xxWHk43eKShWcqm5a0srh?timeoutMs=60000" \
-H "Authorization: Bearer $ACCESS_TOKEN"
The response body follows the getQueryResults JSON format.
Jobs insert (extract data)
To extract the data produced by a specific jobId, we must use Google Cloud Storage.
Request: with the file request-body-insert.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-insert.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-insert.json looks like:
{
"jobReference": {
"projectId": "google-analytics",
"jobId": "custom-jobId-csv-20170104_1200"
},
"configuration": {
"extract": {
"sourceTable": {
"projectId": "google-analytics",
"datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
"tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
},
"destinationUris": ["gs://google-analytics/extraction-0001-ga_sessions_intraday_20180104.csv"],
"destinationFormat": "CSV",
"compression": "NONE"
}
}
}
Jobs list
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-H "Authorization: Bearer $ACCESS_TOKEN"
Example of workflow
- Export the GOOGLE_APPLICATION_CREDENTIALS variable containing the path of the JSON credentials file
- Obtain the ACCESS_TOKEN
- Call query and collect the jobId -> job_35axds8rFh5v6x_ztzwCNIxvIR9-
- Call get for the job status and collect from the response body the datasetId and tableId fields of the destinationTable object
- Call insert to do the extract, passing the custom jobId plus the previous datasetId and tableId. This writes a file in the given format, CSV in this case, to a GCS (Google Cloud Storage) bucket. The CSV file name must be parameterized so it is different for each call to BigQuery
- Call the Google Cloud Storage REST API
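A minimal Java sketch of this workflow using the client library instead of the raw REST calls; the bucket, query, and class name are illustrative and error handling is omitted:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Hypothetical sketch of the workflow: query -> wait -> read destinationTable -> extract to GCS.
public class QueryAndExtract {
    public static void main(String[] args) throws InterruptedException {
        // 1. GOOGLE_APPLICATION_CREDENTIALS provides the credentials / access token.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // 2. Run the query as a job (equivalent to the REST "query" call) and wait for it.
        QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder("SELECT * FROM `google-analytics.112233445.ga_sessions_intraday_20180104` LIMIT 1000")
                .setUseLegacySql(false)
                .build();
        Job queryJob = bigquery.create(JobInfo.of(queryConfig)).waitFor();

        // 3. Like the REST "get" call, the completed job carries the anonymous destination table.
        TableId destination = ((QueryJobConfiguration) queryJob.getConfiguration())
                .getDestinationTable();

        // 4. Extract that table to a per-run CSV file in a GCS bucket (equivalent to "insert").
        String uri = "gs://google-analytics/extraction-" + System.currentTimeMillis() + ".csv";
        ExtractJobConfiguration extractConfig = ExtractJobConfiguration
                .newBuilder(destination, uri)
                .setFormat("CSV")
                .setCompression("NONE")
                .build();
        bigquery.create(JobInfo.of(extractConfig)).waitFor();

        // 5. The file can now be fetched through the Google Cloud Storage API.
        System.out.println("Extracted to " + uri);
    }
}
```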